# NLP: Using TFIDF to generate useful features from text

## Basics of NLP

### What is NLP?
1) Science of how machines analyse, understand and generate written & spoken languages in order to interact with humans.
2) The science of enabling a computer program to understand human language as it is spoken.
3) Programming computers to fruitfully process natural language data.

### How do computers do NLP?
By converting words to numbers, because computers are good at crunching numbers.

### What is the scope of this tutorial?
We will explore one of the many ways of converting text to numbers.

### What is a corpus?
A corpus is the collection of all the documents being analysed.

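### What is the "Bag of Words" model?
A model that represents a document as a bag (multiset) of its words: it records which words occur and how often, and ignores the order in which they appear.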
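
As a minimal sketch of the idea, the snippet below builds a bag of words with `collections.Counter` from the standard library. The example sentence is hypothetical and chosen only for illustration; it is not part of the corpus used later.

```python
from collections import Counter

# Hypothetical example sentence, chosen only for illustration.
sentence = "the cat saw the other cat"

# A bag of words keeps each word and how often it occurs, but not word order.
bag = Counter(sentence.split(" "))

# Word order does not matter: a shuffled sentence gives the same bag.
shuffled_bag = Counter("cat the other the cat saw".split(" "))

print(bag["the"], bag["cat"], bag["saw"])  # 2 2 1
print(bag == shuffled_bag)                 # True
```

The rest of this tutorial builds the same kind of counts by hand with plain dictionaries, which makes each step easier to inspect.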


In [1]:
###################################################################################
# Purpose: Calculate TFIDF scores for words in a given corpus                     #
# input:   A corpus of two sentences                                              #
# output:  TFIDF scores                                                           #
#                                                                                 #
###################################################################################

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

In [3]:
# Tokenize

In [4]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [5]:
bowA

['the', 'cat', 'sat', 'on', 'my', 'face']

Splitting a document into words is called tokenizing. Now we have to convert these tokenized bags of words to numbers.
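
Splitting on a single space works for the two clean sentences used here, but it leaves punctuation attached to words and produces empty strings when there are double spaces. A slightly more forgiving tokenizer might look like the sketch below; the `tokenize` helper and its regular expression are assumptions made for illustration, not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Same tokens as docA.split(" ") for the clean example sentence.
print(tokenize("The cat sat on my face"))

# Punctuation and the double space are dropped (hypothetical messy input).
print(tokenize("The dog, sadly,  sat on my bed."))
```

The remainder of the tutorial keeps using `split(" ")`, which is enough for the two example documents.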

In [6]:
wordSet = set(bowA).union(set(bowB))

In [7]:
print(wordSet)

{'cat', 'dog', 'the', 'sat', 'face', 'my', 'bed', 'on'}


In [39]:
# All words in all bags/documents.
wordSet

{'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the'}

In [8]:
# Let us create dictionaries to keep word counts.
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [14]:
wordDictA  # displayed after the counting cell In [13] below had already run, so the counts are filled in

{'cat': 1, 'dog': 0, 'the': 1, 'sat': 1, 'face': 1, 'my': 1, 'bed': 0, 'on': 1}

In [10]:
wordDictB

{'cat': 0, 'dog': 0, 'the': 0, 'sat': 0, 'face': 0, 'my': 0, 'bed': 0, 'on': 0}

In [13]:
# Let us count the words in each of the bags
for word in bowA:
    wordDictA[word] += 1

In [15]:
for word in bowB:
    wordDictB[word] += 1

In [16]:
wordDictA
wordDictB

{'cat': 0, 'dog': 1, 'the': 1, 'sat': 1, 'face': 0, 'my': 1, 'bed': 1, 'on': 1}

In [17]:
# Put both the dictionaries into a matrix.

import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

   bed  cat  dog  face  my  on  sat  the
0    0    1    0     1   1   1    1    1
1    1    0    1     0   1   1    1    1


In [48]:
# Now we have converted the text into a linear algebra problem.
# Computers can do linear algebra far better than humans.
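
To make the "linear algebra" remark concrete, here is one possible illustration: once each document is a row of counts, two documents can be compared with ordinary vector arithmetic. Cosine similarity is used here only as an example of such an operation; the original notebook does not compute it, and the names `vecA`, `vecB` and `cosine_similarity` are assumptions.

```python
import math

# Count vectors copied from the matrix above
# (columns: bed, cat, dog, face, my, on, sat, the).
vecA = [0, 1, 0, 1, 1, 1, 1, 1]
vecB = [1, 0, 1, 0, 1, 1, 1, 1]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The documents share 4 of their 6 words, so the similarity is 4/6.
print(round(cosine_similarity(vecA, vecB), 4))  # 0.6667
```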


### TF-IDF is a better strategy
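There is a problem with the raw counts above. The words we use most often, such as "the" or "on", carry little meaning on their own (look up Zipf's law), yet raw counts give them as much weight as any other word. Hence we need a better strategy, called TF-IDF (term frequency times inverse document frequency). The TF-IDF score of a word is useful for ranking how important that word is to a document.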

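$$\text{tfidf}(w) = \text{tf}(w) \times \text{idf}(w)$$

$$\text{tf}(w) = \frac{\text{number of occurrences of } w \text{ in the document}}{\text{total number of words in the document}}$$

$$\text{idf}(w) = \log\left(\frac{\text{number of documents}}{\text{number of documents that contain } w}\right)$$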
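
A short worked example may help here. The sketch below plugs the numbers for two words of `docA` into the formulas by hand, assuming the natural logarithm (which is what `math.log` computes). The variable names are illustrative only.

```python
import math

# "cat" occurs once among the 6 words of docA, and in 1 of the 2 documents.
tf_cat = 1 / 6
idf_cat = math.log(2 / 1)
print(round(tf_cat * idf_cat, 6))   # 0.115525

# "the" also occurs once in docA, but it is in both documents,
# so its idf is log(2 / 2) = 0 and its TF-IDF score vanishes.
tf_the = 1 / 6
idf_the = math.log(2 / 2)
print(tf_the * idf_the)             # 0.0
```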

In [18]:
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict       

In [19]:
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

In [20]:
tfBowA

{'cat': 0.16666666666666666,
 'dog': 0.0,
 'the': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'face': 0.16666666666666666,
 'my': 0.16666666666666666,
 'bed': 0.0,
 'on': 0.16666666666666666}

In [21]:
import math

def computeIDF(docList):
    N = len(docList)

    # Count the number of documents that contain each word
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Divide N by the document count above and take the natural log of it.
    # This assumes every word occurs in at least one document (true for wordSet),
    # so val is never 0 here.
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))

    return idfDict

In [23]:
idfs = computeIDF([wordDictA, wordDictB])

In [24]:
idfs

{'cat': 0.6931471805599453,
 'dog': 0.6931471805599453,
 'the': 0.0,
 'sat': 0.0,
 'face': 0.6931471805599453,
 'my': 0.0,
 'bed': 0.6931471805599453,
 'on': 0.0}

In [25]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf   

In [26]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [27]:
# Finally put both of these in a matrix.

import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

        bed       cat       dog      face   my   on  sat  the
0  0.000000  0.115525  0.000000  0.115525  0.0  0.0  0.0  0.0
1  0.115525  0.000000  0.115525  0.000000  0.0  0.0  0.0  0.0


In [None]:
# Notice that the words "the", "sat", "on" and "my", which occur in both documents,
# have a TF-IDF score of 0 above, because their idf is log(2/2) = 0.
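
For reference, the individual steps above can be folded into one function. The version below is a compact sketch under the same definitions (tf as count divided by document length, idf as the natural log of N over document frequency); the function name `tfidf_corpus` is an assumption and does not come from the notebook.

```python
import math
from collections import Counter

def tfidf_corpus(docs):
    """Return one {word: tf-idf} dict per document, with tf = count/len and idf = ln(N/df)."""
    bags = [doc.split(" ") for doc in docs]
    n_docs = len(bags)

    # Document frequency: in how many documents each word appears.
    df = Counter(word for bag in bags for word in set(bag))
    vocabulary = sorted(df)

    scores = []
    for bag in bags:
        counts = Counter(bag)  # missing words count as 0
        scores.append({
            word: (counts[word] / len(bag)) * math.log(n_docs / df[word])
            for word in vocabulary
        })
    return scores

scores = tfidf_corpus(["the cat sat on my face", "the dog sat on my bed"])
print({word: round(value, 6) for word, value in scores[0].items()})
```

Passing `scores` to `pd.DataFrame` should give the same matrix as the step-by-step version. Library implementations often use smoothed or normalized variants of these formulas, so their numbers can differ from the ones computed here.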