# **Text Mining - Lab 2 - TF-IDF**

**TF(t)** = (Number of times term t appears in a document) / (Total number of terms in the document).

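**IDF(t)** = log_e(Total number of documents / Number of documents containing term t). Note that the implementation in part C uses a base-10 logarithm instead; changing the base only rescales every IDF value by the same constant factor, so the relative ranking of terms is unaffected.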
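
As a quick worked instance of the two definitions, the sketch below computes TF and IDF for the single term "sun" in Doc 4 of the corpus used in this lab. The helper names `tf` and `idf` are illustrative only, and the natural logarithm is used as in the definition.

```python
import math

docs = [
    "the sky is blue",
    "the sun is bright today",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]
tokens = [d.split() for d in docs]

def tf(term, doc_tokens):
    # Occurrences of the term divided by the number of terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_tokens):
    # Natural log of (number of documents / documents containing the term)
    n_containing = sum(1 for t in all_tokens if term in t)
    return math.log(len(all_tokens) / n_containing)

tf_sun = tf("sun", tokens[3])    # 2 occurrences out of 9 terms
idf_sun = idf("sun", tokens)     # ln(4 / 3): "sun" is in 3 of 4 documents
print(tf_sun, idf_sun, tf_sun * idf_sun)
```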



A) Produce a dictionary of words and their occurrence counts for each document.
* Doc 1: The sky is blue
* Doc 2: The sun is bright today
* Doc 3: The sun in the sky is bright
* Doc 4: We can see the shining sun the bright sun

In [1]:
# Create documents
document_1 = 'The sky is blue'
document_2 = 'The sun is bright today'
document_3 = 'The sun in the sky is bright'
document_4 = 'We can see the shining sun the bright sun'

# Convert to lower case and aggregate in a list
documents = [document_1.lower(),document_2.lower(),document_3.lower(),document_4.lower()]

# Get bag of words for every document
bagOfWords = [doc.split(' ') for doc in documents]

# Remove duplicates
uniqueWords = [set(i) for i in bagOfWords]

# Get union of all bag of words
uniqueWords = set().union(*uniqueWords)

# Dictionary of words and their occurrence counts for each document
dicOfWords = []
for bag in bagOfWords:
    numOfWords = dict.fromkeys(uniqueWords, 0)
    for word in bag:
        numOfWords[word] += 1
    dicOfWords.append(numOfWords)
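
For comparison, the same per-document counts can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the approach used in the cell above; the names `bags`, `counts`, and `dense_counts` are illustrative.

```python
from collections import Counter

documents = [
    "the sky is blue",
    "the sun is bright today",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]
bags = [doc.split() for doc in documents]

# Shared vocabulary across all documents, sorted for a stable column order
vocabulary = sorted(set().union(*bags))

# A Counter returns 0 for words it has never seen, so no pre-filling is needed
counts = [Counter(bag) for bag in bags]

# Expand each Counter to a full dictionary over the shared vocabulary
dense_counts = [{word: c[word] for word in vocabulary} for c in counts]
print(dense_counts[3])
```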

B) Produce the term frequency matrix for all documents

In [2]:
# Term frequency: the number of times a word appears in a document divided by the
# total number of words in that document. Every document has its own term frequencies.
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict
tfs = [computeTF(dicOfWords[i], bagOfWords[i]) for i in range(0,len(documents))]

C) Produce the inverse document frequency for all documents.

In [3]:
# IDF: the log (base 10 here) of the number of documents divided by the number of documents that contain the word.
# Inverse document frequency gives more weight to words that are rare across the corpus.
def computeIDF(documents):
    import math
    N = len(documents)
    
    idfDict = dict.fromkeys(uniqueWords, 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val),10)
    return idfDict
idfs = computeIDF(dicOfWords)

D) Produce the **TF-IDF** by multiplying the **term frequency matrix** by the **inverse document frequency**

In [4]:
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    print(type(tfidf))
    return tfidf
tfidfs = [computeTFIDF(tf, idfs) for tf in tfs]
tfidfs

<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>


[{'blue': 0.15051499783199057,
  'bright': 0.0,
  'can': 0.0,
  'in': 0.0,
  'is': 0.031234684152074976,
  'see': 0.0,
  'shining': 0.0,
  'sky': 0.07525749891599529,
  'sun': 0.0,
  'the': 0.0,
  'today': 0.0,
  'we': 0.0},
 {'blue': 0.0,
  'bright': 0.02498774732165998,
  'can': 0.0,
  'in': 0.0,
  'is': 0.02498774732165998,
  'see': 0.0,
  'shining': 0.0,
  'sky': 0.0,
  'sun': 0.02498774732165998,
  'the': 0.0,
  'today': 0.12041199826559246,
  'we': 0.0},
 {'blue': 0.0,
  'bright': 0.017848390944042843,
  'can': 0.0,
  'in': 0.0860085701897089,
  'is': 0.017848390944042843,
  'see': 0.0,
  'shining': 0.0,
  'sky': 0.04300428509485445,
  'sun': 0.017848390944042843,
  'the': 0.0,
  'today': 0.0,
  'we': 0.0},
 {'blue': 0.0,
  'bright': 0.013882081845366656,
  'can': 0.06689555459199581,
  'in': 0.0,
  'is': 0.0,
  'see': 0.06689555459199581,
  'shining': 0.06689555459199581,
  'sky': 0.0,
  'sun': 0.027764163690733312,
  'the': 0.0,
  'today': 0.0,
  'we': 0.06689555459199581}]
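
A few of these scores can be checked by hand. The sketch below recomputes three entries directly from the counts, assuming the base-10 logarithm used in `computeIDF`; the variable names are illustrative. It also shows why "the" scores 0 everywhere: it occurs in all four documents, so its IDF is log10(4/4) = 0.

```python
import math

n_docs = 4

# 'blue': once in Doc 1 (4 terms), present in 1 document
blue_doc1 = (1 / 4) * math.log10(n_docs / 1)

# 'sun': twice in Doc 4 (9 terms), present in 3 documents
sun_doc4 = (2 / 9) * math.log10(n_docs / 3)

# 'the': twice in Doc 3 (7 terms), present in all 4 documents, so IDF is 0
the_doc3 = (2 / 7) * math.log10(n_docs / 4)

print(round(blue_doc1, 4), round(sun_doc4, 4), the_doc3)
```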

---
**Exercise 3-2:** TF-IDF Construction from sklearn

A) Build the term frequency document matrix for the previous documents using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a> from sklearn

1. Print the **sorted vocabulary** of the fitted documents and the id corresponding to the word _"sky"_
2. Print the **term frequency matrix** of the transformed documents
3. Transform the sentence "The moon is bright today" using the same vectorizer
4. Check if the word "moon" appears in the original vocabulary

* Doc 1: The sky is blue
* Doc 2: The sun is bright today
* Doc 3: The sun in the sky is bright
* Doc 4: We can see the shining sun the bright sun

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectors = vectorizer.fit(documents)  # learn the vocabulary; fit() returns the vectorizer itself
print(vectors.vocabulary_.items())

dict_items([('the', 9), ('sky', 7), ('is', 4), ('blue', 0), ('sun', 8), ('bright', 1), ('today', 10), ('in', 3), ('we', 11), ('can', 2), ('see', 5), ('shining', 6)])


In [6]:
print(sorted(vectors.vocabulary_.items(), key=lambda x: x[1])) 

[('blue', 0), ('bright', 1), ('can', 2), ('in', 3), ('is', 4), ('see', 5), ('shining', 6), ('sky', 7), ('sun', 8), ('the', 9), ('today', 10), ('we', 11)]


In [7]:
transformed_documents = vectors.transform(documents)  # document-term count matrix
print(transformed_documents.toarray())

[[1 0 0 0 1 0 0 1 0 1 0 0]
 [0 1 0 0 1 0 0 0 1 1 1 0]
 [0 1 0 1 1 0 0 1 1 2 0 0]
 [0 1 1 0 0 1 1 0 2 2 0 1]]


In [8]:
documents

['the sky is blue',
 'the sun is bright today',
 'the sun in the sky is bright',
 'we can see the shining sun the bright sun']

In [9]:
sentence = ['The moon is bright today']
print(vectors.transform(sentence).toarray())

[[0 1 0 0 1 0 0 0 0 1 1 0]]


In [10]:
print('moon' in vectors.vocabulary_)

False
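
The row for "The moon is bright today" has no column for "moon" because the vocabulary is fixed at fit time and unseen words are dropped during transform. The sketch below imitates that behaviour in plain Python under stated assumptions: it is not sklearn's implementation, the helpers `tokenize` and `transform` are hypothetical, and the token pattern (lower-cased words of two or more characters) is meant to mirror the vectorizer's default.

```python
import re

documents = [
    "the sky is blue",
    "the sun is bright today",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]

# Assumed tokenisation: lower-case, words of at least two characters
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Vocabulary learned once from the training documents: word -> column id
all_words = sorted({tok for doc in documents for tok in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def transform(text):
    # Count only words already in the vocabulary; unseen words are ignored
    row = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:
            row[idx] += 1
    return row

moon_row = transform("The moon is bright today")
print(moon_row)
print("moon" in vocabulary)
```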


B) Build the tf-idf document matrix for the previous documents using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TfidfVectorizer</a> from sklearn
1. Print the **sorted vocabulary** of the fitted documents
2. Print the tf-idf matrix

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vv = TfidfVectorizer(norm = None)
tfidf = vv.fit_transform(documents)
print(sorted(vv.vocabulary_.items(), key=lambda x : x[1]))

[('blue', 0), ('bright', 1), ('can', 2), ('in', 3), ('is', 4), ('see', 5), ('shining', 6), ('sky', 7), ('sun', 8), ('the', 9), ('today', 10), ('we', 11)]


In [12]:
tfidf.toarray()

array([[1.91629073, 0.        , 0.        , 0.        , 1.22314355,
        0.        , 0.        , 1.51082562, 0.        , 1.        ,
        0.        , 0.        ],
       [0.        , 1.22314355, 0.        , 0.        , 1.22314355,
        0.        , 0.        , 0.        , 1.22314355, 1.        ,
        1.91629073, 0.        ],
       [0.        , 1.22314355, 0.        , 1.91629073, 1.22314355,
        0.        , 0.        , 1.51082562, 1.22314355, 2.        ,
        0.        , 0.        ],
       [0.        , 1.22314355, 1.91629073, 0.        , 0.        ,
        1.91629073, 1.91629073, 0.        , 2.4462871 , 2.        ,
        0.        , 1.91629073]])

In [13]:
for row in tfidf.toarray():
    print(["%.4f"% val for val in row])

['1.9163', '0.0000', '0.0000', '0.0000', '1.2231', '0.0000', '0.0000', '1.5108', '0.0000', '1.0000', '0.0000', '0.0000']
['0.0000', '1.2231', '0.0000', '0.0000', '1.2231', '0.0000', '0.0000', '0.0000', '1.2231', '1.0000', '1.9163', '0.0000']
['0.0000', '1.2231', '0.0000', '1.9163', '1.2231', '0.0000', '0.0000', '1.5108', '1.2231', '2.0000', '0.0000', '0.0000']
['0.0000', '1.2231', '1.9163', '0.0000', '0.0000', '1.9163', '1.9163', '0.0000', '2.4463', '2.0000', '0.0000', '1.9163']
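
These values differ from the manual results in the first exercise because `TfidfVectorizer` does not use the plain definition. With its default `smooth_idf=True` it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, and with `norm=None` it multiplies that by the raw term count rather than by a length-normalised frequency. The sketch below reproduces the matrix in plain Python under those assumptions; whitespace splitting is enough here because every word in the corpus has at least two characters, and the variable names are illustrative.

```python
import math

documents = [
    "the sky is blue",
    "the sun is bright today",
    "the sun in the sky is bright",
    "we can see the shining sun the bright sun",
]
bags = [doc.split() for doc in documents]
vocabulary = sorted(set().union(*bags))
n_docs = len(bags)

# Document frequency: number of documents containing each term
df = {term: sum(term in bag for bag in bags) for term in vocabulary}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
smooth_idf = {
    term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary
}

# Raw count times smoothed IDF, with no row normalisation (norm=None)
rows = [
    [bag.count(term) * smooth_idf[term] for term in vocabulary] for bag in bags
]

for row in rows:
    print(["%.4f" % val for val in row])
```

A term that appears in every document, such as "the", therefore keeps a weight of 1 per occurrence instead of dropping to 0.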
