# TF-IDF Lab

### Introduction

### Loading the Data

In [29]:
import pandas as pd

df = pd.read_csv('./Reviews.csv.zip')

In [39]:
coconut_df = pd.read_csv('./coconut_water.csv', index_col = 0)

In [36]:
coconut_water.to_csv('./coconut_water.csv')

In [38]:
# [(top_product, df[df.ProductId == top_product].Score.var()) for top_product in top_products]

Let's use the `Score` as our target and the `Text` as our input feature.

### Bag of Words to Term Frequency

In [40]:
coconut_df.Text[:3]

47836    Must admit the taste of O.N.E. coconut water i...
47837    I love this stuff!  Perfect blend of dark choc...
47838    I am from the Philippines, a country where the...
Name: Text, dtype: object

Let's start with bag of words, and then we'll move to term frequency.

In [213]:
documents = coconut_df.Text
document = documents.iloc[0]  # first review; used as the running example below


In [58]:
def bag_of_words(document):
    terms = [term.lower() for term in document.split()]
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1
    return dictionary

In [None]:
bag_of_words(document)

# {'must': 1,
#  'admit': 1,
#  'the': 2,
#  'taste': 1,
#  'of': 2,
#  'o.n.e.': 1,
#  'coconut': 2,
#  'water': 1,
#  'is': 1,
#  'better.': 1,
#  'took': 1,
#  'a': 1,
#  'long': 1,
#  'time': 1,
#  'to': 1,
#  'get': 1,
#  'through': 1,
#  'supply': 1,
#  'water.': 1}
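
As a side note, the counting loop in `bag_of_words` can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration: the function name `bag_of_words_counter` and the sample sentence are hypothetical and not part of the lab. It also shows why `water` and `water.` are counted separately in the output above: splitting on whitespace leaves punctuation attached to the word.

```python
from collections import Counter

def bag_of_words_counter(document):
    """Count lowercase, whitespace-separated terms in a document."""
    return dict(Counter(term.lower() for term in document.split()))

# Hypothetical sample sentence, not taken from the dataset.
sample = "Coconut water is better. Love coconut water."
counts = bag_of_words_counter(sample)
print(counts)
```

Here `water` and `water.` end up as two different keys, which is one motivation for the tokenization step that follows.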

Next, let's bring in spaCy to tokenize and lemmatize the text.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(document)
[token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

First, let's return a bag of the lemmas of the words instead of the raw words, and only include a word if it is not a stop word (according to spaCy) and it *is* an alphabetic token.

In [99]:
def bag_of_words(document):
    # Keep lemmas of alphabetic tokens that spaCy does not flag as stop words.
    terms = [term.lemma_ for term in nlp(document) if term.is_alpha and not term.is_stop]
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1
    return dictionary

In [100]:
bag_of_words(document)

# {'admit': 1,
#  'taste': 1,
#  'coconut': 2,
#  'water': 2,
#  'well': 1,
#  'take': 1,
#  'long': 1,
#  'time': 1,
#  'supply': 1}

{'admit': 1,
 'taste': 1,
 'coconut': 2,
 'water': 2,
 'well': 1,
 'take': 1,
 'long': 1,
 'time': 1,
 'supply': 1}

Now let's move from bag of words to term frequency.  
> Divide each term's count by the total number of terms counted in the bag of words.

In [112]:
def term_frequency(document):
    terms = [term.lemma_ for term in nlp(document) if term.is_alpha and not term.is_stop]
    doc_length = len(terms)
    dictionary = dict.fromkeys(terms, 0)
    for term in terms:
        dictionary[term] += 1/doc_length
    return dictionary

In [113]:
term_frequency(document)

{'admit': 0.09090909090909091,
 'taste': 0.09090909090909091,
 'coconut': 0.18181818181818182,
 'water': 0.18181818181818182,
 'well': 0.09090909090909091,
 'take': 0.09090909090909091,
 'long': 0.09090909090909091,
 'time': 0.09090909090909091,
 'supply': 0.09090909090909091}
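
To make the arithmetic explicit, here is a small sketch that starts from the lemmatized counts shown earlier and divides each count by the total (11 terms). The helper name `to_term_frequency` is hypothetical; it is just one way to express the normalization without calling spaCy again.

```python
# Counts copied from the lemmatized bag of words shown earlier in this lab.
bag = {'admit': 1, 'taste': 1, 'coconut': 2, 'water': 2, 'well': 1,
       'take': 1, 'long': 1, 'time': 1, 'supply': 1}

def to_term_frequency(counts):
    """Divide each count by the total number of counted terms."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

tf = to_term_frequency(bag)
print(tf['coconut'])     # 2 / 11
print(sum(tf.values()))  # the term frequencies of one document sum to about 1
```

Because every count is divided by the same total, term frequencies are comparable across documents of different lengths.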

### Inverse Document Frequency

Now let's use inverse document frequency.

Remember that:

$\text{idf}(\text{term}) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing the term}}\right)$
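
As a quick worked illustration with made-up numbers (a hypothetical corpus of 10 documents, using the natural logarithm), compare a term that appears in every document with a term that appears in only one:

$$\text{idf}(\text{common}) = \log\left(\frac{10}{10}\right) = 0, \qquad \text{idf}(\text{rare}) = \log\left(\frac{10}{1}\right) \approx 2.303$$

A term found in every document scores 0, and the score grows as the term becomes rarer.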

Let's start by writing a function called `document_frequency`.  It should return the number of documents that a term appears in.

In [None]:
def document_frequency(term):
    # `doc_words` is a global list holding each document's terms (defined in the next section).
    return sum(term in words for words in doc_words)

In [201]:
document_frequency('salty')
# 5

5

In [200]:
document_frequency('water')
# 286

286

Next, let's write a function called `inverse_document_frequency`.  It should take the arguments `term` and `doc_length` (here, the total number of documents in the corpus, not the length of a single document) and return the inverse document frequency score for the term.

In [156]:
import numpy as np
def inverse_document_frequency(term, doc_length = 456):
    return np.log(doc_length/document_frequency(term))

In [202]:
inverse_document_frequency('water', doc_length = 456)
# 0.4665009986945335

0.4665009986945335

In [203]:
inverse_document_frequency('salty', doc_length = 456)
# 4.513054897080286

4.513054897080286
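
As a sanity check, the two scores can be reproduced by hand from the numbers reported in this lab: 456 documents in total, 286 containing `water` and 5 containing `salty`. The sketch below uses `math.log` from the standard library (a natural log, like `np.log`); the variable names are illustrative only.

```python
import math

n_documents = 456                        # corpus size used in this lab
doc_freq = {'water': 286, 'salty': 5}    # document frequencies reported in this lab

# idf(term) = log(number of documents / number of documents containing the term)
idf_check = {term: math.log(n_documents / df) for term, df in doc_freq.items()}

for term, score in idf_check.items():
    print(term, score)
```

The common word `water` gets a score below 0.5, while the rare word `salty` scores above 4.5.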

### Building a Dictionary of Inverse Document Frequencies

The inverse document frequency of a term is the same regardless of which document we are scoring.  So we can compute it once per term and store the results in a dictionary that maps each word to its inverse document frequency.

In [210]:
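tfs = [term_frequency(doc) for doc in documents]  # one term-frequency dict per document
doc_words = [list(tf.keys()) for tf in tfs]
unique_words = list({word for words in doc_words for word in words})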

len(unique_words)
# 2469

2469

In [204]:
idfs = dict([(word, inverse_document_frequency(word, doc_length = 456)) for word in unique_words])

> If we sort the scores from lowest to highest, the words at the top are the ones that appear in the most documents.

In [209]:
idf_series = pd.Series(idfs)

idf_series.sort_values(ascending = True)[:15]

coconut    0.449170
water      0.466501
taste      0.480586
like       0.870219
drink      0.902137
Zico       1.022626
good       1.066247
flavor     1.132060
try        1.145759
love       1.239691
bottle     1.270463
product    1.386294
buy        1.507372
great      1.558145
well       1.691676
dtype: float64

### Putting it Together

Finally, let's write a function called `tf_idf` that takes in a document and returns a dictionary mapping each term in that document to its TF-IDF score (its term frequency multiplied by its inverse document frequency).

In [182]:
def tf_idf(document):
    tf_dict = term_frequency(document)
    return {term: tf_dict[term] * idfs[term] for term in tf_dict}

In [215]:
first_document =  documents.iloc[0]

In [216]:
pd.Series(tf_idf(first_document)).sort_values(ascending = False)

# supply     0.493577
# admit      0.456716
# long       0.267676
# take       0.238726
# time       0.200952
# well       0.153789
# water      0.084818
# coconut    0.081667
# taste      0.043690
# dtype: float64

supply     0.493577
admit      0.456716
long       0.267676
take       0.238726
time       0.200952
well       0.153789
water      0.084818
coconut    0.081667
taste      0.043690
dtype: float64

In [217]:
pd.Series(tf_idf(documents.iloc[1])).sort_values(ascending = False)

choc          0.374438
stuff         0.239569
cow           0.211120
intolerant    0.211120
lactose       0.173237
forward       0.163317
milk          0.156713
case          0.156713
dark          0.139416
smooth        0.131721
blend         0.128434
perfect       0.120118
refresh       0.111452
close         0.103000
look          0.087551
come          0.074871
think         0.071705
well          0.058334
buy           0.051978
love          0.042748
drink         0.031108
like          0.030008
water         0.016086
coconut       0.015489
dtype: float64

We can begin to see that TF-IDF highlights what makes each document distinctive.  In our second document, words like `water`, `coconut` and `drink`, which appear throughout the corpus, are towards the bottom, while rarer words like `lactose` and `intolerant` are closer to the top.
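
The same effect can be reproduced on a tiny corpus without spaCy. The sketch below is a minimal, self-contained illustration under stated assumptions: the documents are hypothetical and already lemmatized, and the helper names `toy_idf` and `toy_tf_idf` are not part of the lab. Document frequencies are counted in a single pass by adding each document's set of terms to a `Counter`.

```python
import math
from collections import Counter

# Hypothetical, already-lemmatized documents (not taken from the dataset).
toy_docs = [
    ['coconut', 'water', 'taste', 'coconut'],
    ['coconut', 'water', 'lactose', 'intolerant'],
    ['coconut', 'water', 'salty'],
]

def toy_idf(tokenized_docs):
    """Map each term to log(number of documents / documents containing it)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized_docs)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def toy_tf_idf(tokens, idf_by_term):
    """Multiply each term's share of the document by its IDF."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf_by_term[term]
            for term, count in counts.items()}

idf_by_term = toy_idf(toy_docs)
scores = toy_tf_idf(toy_docs[1], idf_by_term)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 3))
```

In this toy corpus `coconut` and `water` appear in every document, so their IDF (and therefore their TF-IDF) is zero, while `lactose` and `intolerant` rise to the top of the second document.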