# Module 3 - Comparing Corpora and TF-IDF

### Sample Dataset 1: A Tale of Two Cities

In [1]:
sentences = ["It was the best of times",
"it was the worst of times",
"it was the age of wisdom",
"it was the age of foolishness"]

### Building A Vectorizer

In [48]:
import pandas as pd
import numpy  as np

In [8]:
# Split each sentence on whitespace, collect the unique words into a set,
# then list each word alongside an index in a DataFrame.
tokenized_sentences = [sentence.split() for sentence in sentences]
vocabulary = set(w for s in tokenized_sentences for w in s)
vector = pd.DataFrame([[w, i] for i, w in enumerate(vocabulary)])

In [7]:
vector

Unnamed: 0,0,1
0,foolishness,0
1,age,1
2,It,2
3,times,3
4,best,4
5,was,5
6,of,6
7,the,7
8,wisdom,8
9,worst,9
10,it,10


**To build a vectorizer, we create one column per vocabulary word and record whether a document contains that word: 1 if it does, 0 if it does not.**

In [27]:
def onehot_encode(tokenized_sentence):
    return [1 if w in tokenized_sentence else 0 for w in vocabulary]

In [14]:
onehot = [onehot_encode(tokenized_sentence) for tokenized_sentence in tokenized_sentences] 

# zip pairs up the elements of two sequences: a = (i1, i2, i3) and b = (j1, j2, j3) yield (i1, j1), (i2, j2), (i3, j3)
for (sentence, oh) in zip(sentences, onehot):
    print("%s: %s" % (oh, sentence))

[0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]: It was the best of times
[0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]: it was the worst of times
[0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1]: it was the age of wisdom
[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]: it was the age of foolishness
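
Because `vocabulary` is a Python set, the column order of these vectors is arbitrary and can change between interpreter sessions. The following is a minimal sketch, assuming we are free to sort the vocabulary first. The names `sorted_vocabulary` and `onehot_encode_sorted` are illustrative and not part of the original notebook.

```python
sentences = ["It was the best of times",
             "it was the worst of times",
             "it was the age of wisdom",
             "it was the age of foolishness"]

# Sorting the set gives a fixed, reproducible column order
# (uppercase letters sort before lowercase, so "It" comes first).
sorted_vocabulary = sorted({w for s in sentences for w in s.split()})

def onehot_encode_sorted(tokens):
    """Return 1 for each vocabulary word present in tokens, else 0."""
    present = set(tokens)  # set membership keeps each lookup cheap
    return [1 if w in present else 0 for w in sorted_vocabulary]

print(sorted_vocabulary)
for s in sentences:
    print(onehot_encode_sorted(s.split()), s)
```

With a sorted vocabulary, the same sentence always maps to the same vector, which makes outputs easier to compare across runs.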


### Building a Vectorizer - DataFrame Version

**We can also be a little more professional and put this all in a dataframe.**

In [40]:
#an alternative means of doing this 
df = pd.DataFrame(data=sentences,columns=['sentence'])
df.head()

Unnamed: 0,sentence
0,It was the best of times
1,it was the worst of times
2,it was the age of wisdom
3,it was the age of foolishness


In [41]:
# Build the vocabulary set by mapping a function over the token column of our df.
vocab=set()
def create_vocab(tokens):
    for token in tokens:
        vocab.add(token)

In [42]:
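# tokenize
df['tokens'] = df['sentence'].map(str.split)
# create vocabulary (called only for its side effect of filling vocab)
df['tokens'].map(create_vocab)
# vectorize (onehot_encode reads the global vocabulary built earlier, which holds the same words as vocab)
df['vector'] = df['tokens'].map(onehot_encode)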
df.head()

Unnamed: 0,sentence,tokens,vector
0,It was the best of times,"[It, was, the, best, of, times]","[0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]"
1,it was the worst of times,"[it, was, the, worst, of, times]","[0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]"
2,it was the age of wisdom,"[it, was, the, age, of, wisdom]","[0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1]"
3,it was the age of foolishness,"[it, was, the, age, of, foolishness]","[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]"


### Dealing With Out Of Vocabulary Documents

Sometimes we will encounter sentences that share no words with our vocabulary, so every entry of their vector is 0. If this happens too often, the vocabulary may need to be updated.

In [43]:
onehot_encode("John likes to watch movies. Mary likes movies too.".split())

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
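
One way to judge whether the vocabulary needs updating is to track the share of tokens it fails to cover. This is a minimal sketch under that assumption. The helper `oov_rate` and the variable `known_words` are hypothetical names introduced here for illustration.

```python
def oov_rate(tokens, known_words):
    """Fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0  # an empty document has nothing out of vocabulary
    missing = [t for t in tokens if t not in known_words]
    return len(missing) / len(tokens)

# The 11 words from the Tale of Two Cities sentences.
known_words = {"It", "it", "was", "the", "best", "worst", "of",
               "times", "age", "wisdom", "foolishness"}

unseen = "John likes to watch movies. Mary likes movies too.".split()
mixed = "it was the age of movies".split()

print(oov_rate(unseen, known_words))  # every token is unknown
print(oov_rate(mixed, known_words))   # one unknown token out of six
```

A rate that stays near 1.0 across many incoming documents suggests the vocabulary no longer matches the text being encoded.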

### The Term Document Matrix
This is the natural conclusion of our one-hot encoding work: a matrix in which each row is a document and each column represents a single vocabulary word. <div style='color:red'>**Warning:** A dense term-document matrix works best with a small vocabulary. At scale, we might switch to scikit-learn for a sparse matrix representation.</div>

In [44]:
# Pass the column labels as a list: newer pandas versions reject a set here.
tdm = pd.DataFrame(onehot, columns=list(vocabulary))
tdm

Unnamed: 0,foolishness,age,It,times,best,was,of,the,wisdom,worst,it
0,0,0,1,1,1,1,1,1,0,0,0
1,0,0,0,1,0,1,1,1,0,1,1
2,0,1,0,0,0,1,1,1,1,0,1
3,1,1,0,0,0,1,1,1,0,0,1


In [45]:
#put it all together
df = pd.merge(df, tdm, left_index=True, right_index=True)
df.head()

Unnamed: 0,sentence,tokens,vector,foolishness,age,It,times,best,was,of,the,wisdom,worst,it
0,It was the best of times,"[It, was, the, best, of, times]","[0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]",0,0,1,1,1,1,1,1,0,0,0
1,it was the worst of times,"[it, was, the, worst, of, times]","[0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]",0,0,0,1,0,1,1,1,0,1,1
2,it was the age of wisdom,"[it, was, the, age, of, wisdom]","[0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1]",0,1,0,0,0,1,1,1,1,0,1
3,it was the age of foolishness,"[it, was, the, age, of, foolishness]","[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1]",1,1,0,0,0,1,1,1,0,0,1


### Calculating Similarities
We can measure similarity by counting how many 1s two documents have in common. On a binary term-document matrix this reduces to a bitwise AND followed by a sum, which is very fast:

**Method 1: Binary Comparison**

In [47]:
#check that both values are 1 for each word in the vocabulary set.
sim = [onehot[0][i] & onehot[1][i] for i in range(0, len(vocabulary))]
sum(sim)

4

**Method 2: Dot Product of Two Vectors**

In [49]:
np.dot(onehot[0],onehot[1])

4

**Method 3: The Similarity Matrix**
If we define the similarity of any two documents $i$ and $j$ as the dot product of their vectors, $S_{i,j} = d_i \cdot d_j$,
we can compute it for every pair of documents at once and collect the results in a matrix:

 $$S_{i,j} = \sum_k D_{ik} D_{jk}$$
 
*That is, for every vocabulary word k, multiply the term-document matrix entries of documents i and j for that word, then sum the products.*
<br>For instance, best(i) × best(j) + of(i) × of(j) + ...

In [50]:
np.dot(onehot, np.transpose(onehot))

array([[6, 4, 3, 3],
       [4, 6, 4, 4],
       [3, 4, 6, 5],
       [3, 4, 5, 6]])

For row i, the i-th column holds the highest value because every document is most similar to itself. With binary vectors, that diagonal entry equals the number of distinct words in the document.
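
The properties of this matrix can be checked directly. This is a minimal sketch, assuming the four one-hot rows printed earlier are copied in by hand. The names `onehot_rows` and `S` are illustrative.

```python
import numpy as np

# One-hot rows copied from the earlier output (11 vocabulary columns).
onehot_rows = np.array([
    [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1],
])

# Matrix product of the rows with their transpose: S[i, j] = sum_k D[i, k] * D[j, k].
S = onehot_rows @ onehot_rows.T

# The matrix is symmetric: S[i, j] == S[j, i].
print(np.array_equal(S, S.T))

# Each diagonal entry equals the number of 1s in that row,
# so no off-diagonal entry in the same row can exceed it.
print(np.diag(S), onehot_rows.sum(axis=1))
```

Because the matrix is symmetric, only the upper (or lower) triangle needs to be inspected when looking for the most similar pair.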

### OneHot with Scikit Learn

In [57]:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
lb.fit([vocabulary])
lb.transform(df['tokens'])

array([[1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0]])

### The Bag of Words Model

*One-hot encoding tells us which words are present, but not how often they appear in a document or in the corpus. A bag-of-words (BOW) model keeps those counts, which makes it a common starting point for text classification and sentiment detection.*
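
The difference between presence and frequency can be seen with the standard library alone. This is a minimal sketch, assuming a simple cleanup step (lowercasing and stripping trailing periods) that is not part of the original notebook.

```python
from collections import Counter

text = "John likes to watch movies. Mary likes movies too."

# Lowercase and strip the periods so "movies." and "movies" count as one word.
tokens = [t.strip(".").lower() for t in text.split()]

presence = set(tokens)    # one-hot view: which words occur at all
counts = Counter(tokens)  # bag-of-words view: how often each word occurs

print(sorted(presence))
print(counts["likes"], counts["movies"], counts["john"])
```

The set records that "likes" occurs, while the counter also records that it occurs twice.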

**Using scikit-learn’s CountVectorizer**

In [58]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [60]:
more_sentences = sentences + \
["John likes to watch movies. Mary likes movies too.",
"Mary also likes to watch football games."]

### Step 1. The CountVectorizer needs to be trained on vocabulary words. We can pass a list of sentences in directly.

In [61]:
cv.fit(more_sentences)

We can also spell out every parameter explicitly. The values below are the defaults, so this declaration is equivalent to `CountVectorizer()`.

In [69]:
CountVectorizer(
    analyzer='word', 
    binary=False, 
    decode_error='strict',
    dtype=np.dtype('int64'), 
    encoding='utf-8', 
    input='content',
    lowercase=True, 
    max_df=1.0, 
    max_features=None, 
    min_df=1,
    ngram_range=(1, 1), 
    preprocessor=None, 
    stop_words=None,
    strip_accents=None, 
    token_pattern='(?u)\\b\\w\\w+\\b',
    tokenizer=None, 
    vocabulary=None
)

In [71]:
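# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(cv.get_feature_names_out().tolist())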

['age', 'also', 'best', 'foolishness', 'football', 'games', 'it', 'john', 'likes', 'mary', 'movies', 'of', 'the', 'times', 'to', 'too', 'was', 'watch', 'wisdom', 'worst']
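
The list has 20 entries, all lowercase and free of punctuation, because the default analyzer lowercases the text and keeps only tokens that match the `token_pattern` shown in the declaration above. The following is a rough approximation of that behaviour using only the `re` module. The names `TOKEN_PATTERN` and `simple_tokenize` are illustrative, and the sketch ignores options such as stop words and accent stripping.

```python
import re

# Same pattern as the default token_pattern: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def simple_tokenize(text):
    """Lowercase the text, then extract tokens with the pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

print(simple_tokenize("It was the best of times"))
# Single-character words such as "I" and "a" do not match the pattern.
print(simple_tokenize("I saw a movie. Mary likes movies too."))
```

This explains why "It" and "it" share one column here, while the hand-built vocabulary earlier kept them apart.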


### Transforming Documents into Vectors

CountVectorizer returns a **sparse** matrix, which stores only the nonzero entries. This way, instead of storing 6 x 20 = 120 values, it stores only 38.

In [74]:
dt = cv.transform(more_sentences)
dt

<6x20 sparse matrix of type '<class 'numpy.int64'>'
	with 38 stored elements in Compressed Sparse Row format>

We can recover the full (dense) matrix by converting it to an array and wrapping it in a DataFrame.
**Note that the columns are sorted alphabetically, and that row 4 contains 2s rather than only 0s and 1s, because "likes" and "movies" each appear twice in that sentence.**

In [75]:
pd.DataFrame(dt.toarray(), columns=cv.get_feature_names_out())



Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,0,1
2,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,0
3,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,1,2,1,2,0,0,0,1,1,0,1,0,0
5,0,1,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,1,0,0
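
The table above also lets us check the sparse-storage figure reported for `dt`. This is a minimal sketch, assuming the counts are copied in by hand. The variable names are illustrative.

```python
import numpy as np

# Count matrix copied from the table above (6 documents x 20 terms).
dense_counts = np.array([
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 0, 0, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
])

total_cells = dense_counts.size                 # every cell of the dense layout
stored_cells = np.count_nonzero(dense_counts)   # cells a sparse format has to keep

print(total_cells, stored_cells)
print(f"{stored_cells / total_cells:.1%} of the cells are nonzero")
```

The saving is modest for six short sentences, but it grows quickly as the vocabulary expands and each document uses only a small part of it.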


## Calculating Similarities Between Documents

Euclidean distance isn't very useful in high-dimensional spaces, dot products are sensitive to document length, and counting the number of words in common is naive, so we can use **cosine similarity** instead:

In [76]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(dt[0], dt[1])

array([[0.83333333]])
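
The same value can be reproduced by hand, which also shows why the measure ignores document length. This is a minimal sketch with NumPy, assuming the count vectors of the first two sentences are copied from the table above. The helper `cosine` is an illustrative name, and it does not guard against all-zero vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Count vectors for sentences 0 and 1 (20 vocabulary columns each).
d0 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]
d1 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]

# Five shared words, six words per sentence: 5 / (sqrt(6) * sqrt(6)) = 5/6.
print(cosine(d0, d1))

# Repeating a document three times triples its dot products,
# but leaves its cosine similarity with the original unchanged.
d0_tripled = [3 * x for x in d0]
print(np.dot(d0, d0_tripled), cosine(d0, d0_tripled))
```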

### Calculating similarity for all documents:

In [78]:
pd.DataFrame(cosine_similarity(dt, dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.833333,0.666667,0.666667,0.0,0.0
1,0.833333,1.0,0.666667,0.666667,0.0,0.0
2,0.666667,0.666667,1.0,0.833333,0.0,0.0
3,0.666667,0.666667,0.833333,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.524142
5,0.0,0.0,0.0,0.0,0.524142,1.0


### The TF-IDF Model
<p>
The TF-IDF model down-weights words that show up in many documents of a corpus. Such words are treated as so common (like stop words) that they add little unique meaning to any one document. In our `sentences` variable, every sentence begins with "it was the ... of", which conveys nothing distinctive. TF-IDF presumes that if a word is uncommon, the author is using it to convey something specific.
</p>

In [80]:
#get the weights of each word
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf_dt = tfidf.fit_transform(dt)
pd.DataFrame(tfidf_dt.toarray(), columns=cv.get_feature_names_out())



Unnamed: 0,age,also,best,foolishness,football,games,it,john,likes,mary,movies,of,the,times,to,too,was,watch,wisdom,worst
0,0.0,0.0,0.56978,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.467228,0.0,0.0,0.338027,0.0,0.0,0.56978
2,0.467228,0.0,0.0,0.0,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.56978,0.0
3,0.467228,0.0,0.0,0.56978,0.0,0.0,0.338027,0.0,0.0,0.0,0.0,0.338027,0.338027,0.0,0.0,0.0,0.338027,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.305609,0.501208,0.250604,0.611219,0.0,0.0,0.0,0.250604,0.305609,0.0,0.250604,0.0,0.0
5,0.0,0.419233,0.0,0.0,0.419233,0.419233,0.0,0.0,0.343777,0.343777,0.0,0.0,0.0,0.0,0.343777,0.0,0.0,0.343777,0.0,0.0
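
The weights in row 0 can be reproduced by hand. This is a minimal sketch, assuming scikit-learn's default settings for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The document frequencies are read off the count table shown earlier, and the variable names are illustrative.

```python
import numpy as np

n_docs = 6

# Terms of "It was the best of times" and the number of documents containing each.
terms = ["best", "it", "of", "the", "times", "was"]
doc_freq = np.array([1, 4, 4, 4, 2, 4])
term_freq = np.ones(len(terms))  # each term occurs once in this sentence

# Smoothed inverse document frequency: rarer terms get a larger value.
idf_values = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply by the term frequency, then scale the row to unit L2 length.
weights = term_freq * idf_values
weights = weights / np.linalg.norm(weights)

for term, w in zip(terms, weights):
    print(f"{term}: {w:.6f}")
```

"best" appears in one document and receives the largest weight, while "it", "of", "the" and "was" appear in four documents and receive the smallest.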


In [81]:
#get the similarity (cosine)
pd.DataFrame(cosine_similarity(tfidf_dt, tfidf_dt))

Unnamed: 0,0,1,2,3,4,5
0,1.0,0.675351,0.457049,0.457049,0.0,0.0
1,0.675351,1.0,0.457049,0.457049,0.0,0.0
2,0.457049,0.457049,1.0,0.675351,0.0,0.0
3,0.457049,0.457049,0.675351,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.43076
5,0.0,0.0,0.0,0.0,0.43076,1.0
