# TF-IDF Lab

### Introduction

In this lesson, we'll build both the bag of words and the term frequency-inverse document frequency (TF-IDF) representations from scratch in Python.  Let's get started.

### Loading the Data

In [5]:
import pandas as pd

In [6]:
url = "https://raw.githubusercontent.com/jigsawlabs-student/nlp-text-representation/master/coconut_water.csv"
coconut_df = pd.read_csv(url, index_col = 0)

In [7]:
coconut_df[:2]

| Unnamed: 0 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 47836 | 47837 | B004SRH2B6 | AKACGHPVILE9R | Sophronia "Euphemia" | 1 | 1 | 1 | 1314144000 | Switched to O.N.E. | Must admit the taste of O.N.E. coconut water i... |
| 47837 | 47838 | B004SRH2B6 | A2GO0AIHB846UX | vinny | 1 | 1 | 5 | 1313884800 | WOW!! | I love this stuff! Perfect blend of dark choc... |


Looking at the first couple of reviews, notice that we collected both text and a score for each review.  The score is the Amazon rating.

### Bag of Words to Term Frequency

Let's start by writing a function that performs bag of words, and then we'll move to term frequency.  To get started, let's assign the `Text` column to a variable called `documents`.

In [8]:
documents = coconut_df.Text

Now bag of words is essentially a histogram.  In the `bag_of_words` function below, create a dictionary that maps each word in a document to the number of times that word appears.

> Take a look at the solution below to see the correct output.

In [9]:
def bag_of_words(document):
    terms = [term.lower() for term in document.split()]
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1
    return dictionary

In [10]:
first_document = documents.iloc[0]
bag_of_words(first_document)

{'must': 1,
 'admit': 1,
 'the': 2,
 'taste': 1,
 'of': 2,
 'o.n.e.': 1,
 'coconut': 2,
 'water': 1,
 'is': 1,
 'better.': 1,
 'took': 1,
 'a': 1,
 'long': 1,
 'time': 1,
 'to': 1,
 'get': 1,
 'through': 1,
 'supply': 1,
 'water.': 1}
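
The same histogram can be sketched with `collections.Counter` from the standard library. This is an alternative to the loop above, not part of the lab's solution; the function name `bag_of_words_counter` and the sample sentence are made up for illustration.

```python
from collections import Counter

def bag_of_words_counter(document):
    # Lowercase and split on whitespace, exactly as bag_of_words does;
    # punctuation stays attached to words (e.g. 'water.').
    terms = [term.lower() for term in document.split()]
    # Counter tallies occurrences; dict() converts it to a plain dictionary.
    return dict(Counter(terms))

sample = "Coconut water is water. Coconut is good"
print(bag_of_words_counter(sample))
```

Note that `'water'` and `'water.'` are counted as separate words here, which is one reason to move to a real tokenizer.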

Next, let's incorporate spaCy.  Remember that we can use spaCy to tokenize and lemmatize our document.

> First let's import and load up spacy.

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")

> We can see the document below.

In [15]:
doc = nlp(first_document)

doc

Must admit the taste of O.N.E. coconut water is better.  Took a long time to get through the supply of coconut water.

Update the bag of words function to return a histogram of the lemmas of the words. Only include a word if it is not a stop word, according to spaCy, and it *is* an alphabetical term.

In [16]:
def bag_of_words_spacy(document):
    terms = [term.lemma_ for term in nlp(document) if term.is_alpha and not term.is_stop]
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1
    return dictionary

In [17]:
bag_of_words_spacy(first_document)

{'admit': 1,
 'taste': 1,
 'coconut': 2,
 'water': 2,
 'well': 1,
 'take': 1,
 'long': 1,
 'time': 1,
 'supply': 1}

Now let's move from bag of words to term frequency.  Remember that the term frequency is the proportion of the document's terms that each word accounts for: the word's count divided by the total number of terms in the document.
> Below, update the bag of words function to return the term frequency of each word in the document.

In [18]:
def term_frequency(document):
    terms = [term.lemma_ for term in nlp(document) if term.is_alpha and not term.is_stop]
    doc_length = len(terms)
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1/doc_length
    return dictionary

In [20]:
term_frequency(first_document)

{'admit': 0.09090909090909091,
 'taste': 0.09090909090909091,
 'coconut': 0.18181818181818182,
 'water': 0.18181818181818182,
 'well': 0.09090909090909091,
 'take': 0.09090909090909091,
 'long': 0.09090909090909091,
 'time': 0.09090909090909091,
 'supply': 0.09090909090909091}
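
Because each count is divided by the same document length, the term frequencies of a document should sum to 1. Here is a small self-contained sketch of that check, assuming the terms were already lemmatized and filtered. The list below mirrors the counts in the output above, and `term_frequency_from_terms` is a hypothetical helper, not part of the lab.

```python
from collections import Counter
import math

def term_frequency_from_terms(terms):
    # Count each term, then divide every count by the total number of terms.
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}

terms = ['admit', 'taste', 'coconut', 'water', 'well', 'take',
         'long', 'time', 'supply', 'coconut', 'water']
tf = term_frequency_from_terms(terms)
print(tf['coconut'])                      # 2 of the 11 terms
print(math.isclose(sum(tf.values()), 1))  # the frequencies sum to 1
```

Dividing each final count once, instead of adding `1/doc_length` repeatedly, also avoids accumulating small floating point errors in long documents.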

### Inverse Document Frequency

Now let's use inverse document frequency.

Remember that: 
    
$\text{idf}(\text{term}) = \log\left(\frac{\text{number of documents}}{\text{number of documents with term}}\right)$

> So the *larger* the number of documents with the term, the smaller the idf of the term.  The idf is the inverse of the proportion of documents with the term, then logged.

We'll provide a dictionary of the document frequency for you:

> Writing a function that calculates the actual document frequency is tricky, as we'd need to convert each word in each document to its lemmatized form and then count the number of documents that contain each lemma.  So we just provide a dictionary with some made-up frequencies.

In [44]:
document_frequency_dict = {'salty': 5, 'water': 300, 'coconut': 250, 'taste': 280, 'sweet': 150}

> So the above says that 300 of the documents contain the word water, and five of the documents contain the word salty.
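
For reference, counting real document frequencies could look like the sketch below. It assumes every document has already been reduced to a list of lemmatized terms. The three toy documents and the `document_frequency` helper are invented for illustration and are not the lab's data.

```python
from collections import Counter

def document_frequency(tokenized_documents):
    # Count, for each term, how many documents contain it at least once.
    frequencies = Counter()
    for terms in tokenized_documents:
        # set() ensures a term is counted only once per document.
        frequencies.update(set(terms))
    return dict(frequencies)

toy_documents = [
    ['coconut', 'water', 'taste', 'water'],
    ['salty', 'water'],
    ['coconut', 'sweet'],
]
print(document_frequency(toy_documents))
```

Here `water` has a document frequency of 2 even though it occurs three times in total, because repeated occurrences within one document are ignored.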

In [45]:
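import numpy as np

def inverse_document_frequency(term, doc_length = 456):
    # doc_length is the number of documents in the corpus, not the length of one document.
    # np.log is the natural logarithm.
    return np.log(doc_length/document_frequency_dict[term])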

In [46]:
inverse_document_frequency('water', doc_length = 456)

0.41871033485818504

In [48]:
inverse_document_frequency('salty', doc_length = 456)

4.513054897080286

We can see that the more documents a word appears in, the lower its inverse document frequency.
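
Note that `np.log` is the natural logarithm, so the two values above can be reproduced by hand:

$\text{idf}(\text{water}) = \ln\left(\frac{456}{300}\right) = \ln(1.52) \approx 0.419$

$\text{idf}(\text{salty}) = \ln\left(\frac{456}{5}\right) = \ln(91.2) \approx 4.513$

Using a different base for the logarithm would rescale every idf by the same constant, so the ranking of the terms would not change.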

### Building a Dictionary of Inverse Document Frequencies

Now, the inverse document frequency of any given term is constant throughout the corpus, regardless of the particular document.  So, we can create a dictionary that maps each word to its inverse document frequency.

In [55]:
idf_dict = {term: inverse_document_frequency(term) for term in document_frequency_dict}
idf_dict

{'salty': 4.513054897080286,
 'water': 0.41871033485818504,
 'coconut': 0.6010318916521397,
 'taste': 0.4877032063451365,
 'sweet': 1.1118575154181303}

### Putting it Together

So we saw that inverse document frequency is constant throughout an entire corpus, as it depends only on the proportion of documents in the corpus that contain the term.

Term frequency, by contrast, is calculated per document.  Putting the two components together, to calculate a word's importance in a document we multiply how frequently the word appears in that document by the idf of the word, which shrinks as the word appears in more documents across the corpus.

Let's write a function called `tf_idf` that takes in a term and a document and returns the `tf_idf` of that term for the document.

In [56]:
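def tf_idf(term, document):
    # Term frequency is 0 when the term does not appear in the document.
    tf_term = term_frequency(document).get(term, 0)
    # idf_dict only covers the terms in document_frequency_dict.
    return tf_term*idf_dict[term]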

In [57]:
tf_idf('water', first_document)

0.07612915179239728

In [60]:
tf_idf('coconut', first_document)

0.10927852575493449

In [62]:
term_frequency(first_document)

{'admit': 0.09090909090909091,
 'taste': 0.09090909090909091,
 'coconut': 0.18181818181818182,
 'water': 0.18181818181818182,
 'well': 0.09090909090909091,
 'take': 0.09090909090909091,
 'long': 0.09090909090909091,
 'time': 0.09090909090909091,
 'supply': 0.09090909090909091}

So we can see that even though `coconut` and `water` appear with the same frequency in the document, `coconut` is weighted more because it appears in fewer documents across the corpus (250 versus 300 in our made-up dictionary).
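
To compare all terms at once, the per-term calculation can be extended to a whole document. The sketch below is self-contained: it hard-codes a subset of the term frequencies and the idf values printed earlier in this lab (rounded), and `tf_idf_scores` is a hypothetical helper. Terms without an entry in the idf dictionary are skipped, which is an assumption made here because the made-up dictionary covers only five words.

```python
def tf_idf_scores(term_frequencies, idf_values):
    # Multiply each term's frequency by its idf; skip terms with no idf entry.
    return {
        term: tf * idf_values[term]
        for term, tf in term_frequencies.items()
        if term in idf_values
    }

# A subset of the term frequencies shown above (2/11 and 1/11).
term_frequencies = {'coconut': 2 / 11, 'water': 2 / 11, 'taste': 1 / 11}
# Rounded idf values copied from the idf dictionary output above.
idf_values = {'salty': 4.5131, 'water': 0.4187, 'coconut': 0.6010,
              'taste': 0.4877, 'sweet': 1.1119}

scores = tf_idf_scores(term_frequencies, idf_values)
# Sort from the most to the least important term in this document.
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(term, round(score, 4))
```

With these values, `coconut` ranks above `water`, and `taste` comes last because it appears only once in the document.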

### Summary

In this lesson, we explored text representation of a document through TF-IDF.  As the name implies, this has two components.  The term frequency is calculated per document, and is the proportion of the document's terms that a given word accounts for.  The inverse document frequency, by contrast, is a score that stays constant throughout the corpus and is calculated by:

$\text{idf}(\text{term}) = \log\left(\frac{\text{number of documents}}{\text{number of documents with term}}\right)$

So it's the inverse of the proportion of documents with the term, then logged.