### Pair Problem #1

You are given the following three documents:

```python
text = ["wookie stormtrooper",
        "wookie wookie wookie stormtrooper stormtrooper stormtrooper",
        "harry potter"]
```

* Transform this data into a bag of words representation, with simple counts. How informative is this format? How much information do you have about individual words?

In [1]:
text = ["wookie stormtrooper",
        "wookie wookie wookie stormtrooper stormtrooper stormtrooper",
        "harry potter"]

In [3]:
bag = ' '.join(text)

In [4]:
word_counts = {}
for word in bag.split():
    word_counts[word] = word_counts.get(word, 0) + 1

In [5]:
word_counts

{'harry': 1, 'potter': 1, 'stormtrooper': 4, 'wookie': 4}
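
The pooled count above merges all three documents into a single bag, so it can no longer say which document a word came from. A minimal sketch of a per-document bag of words, using only the standard library, might look like the following (the names `doc_counts`, `vocab`, and `matrix` are illustrative and not part of the original notebook):

```python
from collections import Counter

text = ["wookie stormtrooper",
        "wookie wookie wookie stormtrooper stormtrooper stormtrooper",
        "harry potter"]

# One Counter per document keeps the counts separate for each document.
doc_counts = [Counter(doc.split()) for doc in text]

# A shared, sorted vocabulary fixes the column order of the count matrix.
vocab = sorted(set(word for doc in text for word in doc.split()))

# Rows are documents, columns are vocabulary words; a Counter returns 0
# for words that a document does not contain.
matrix = [[counts[word] for word in vocab] for counts in doc_counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row records how often a word occurs in one document, but nothing about word order or meaning, which is the main limitation of this representation.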

* Calculate Euclidean and cosine distances between each pair of documents. How do these distances relate to your intuition for the documents' similarities?

In [6]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [99]:
tfidf_vectorizer = TfidfVectorizer(norm=None)

In [100]:
tfidf_vectorizer.fit(text)

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=None, preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [101]:
x = tfidf_vectorizer.transform(text).todense()

In [93]:
(0.5)*np.log(1.5)

0.20273255405408219

In [102]:
x

matrix([[ 0.        ,  0.        ,  1.28768207,  1.28768207],
        [ 0.        ,  0.        ,  3.86304622,  3.86304622],
        [ 1.69314718,  1.69314718,  0.        ,  0.        ]])
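
The entries of this matrix can be reproduced by hand. The sketch below assumes scikit-learn's documented smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count because `norm=None` disables normalization. The helper name `smooth_idf` is hypothetical and used only for illustration:

```python
import math

n_docs = 3  # number of documents in the corpus


def smooth_idf(doc_freq, n=n_docs):
    """Smoothed inverse document frequency, as documented for smooth_idf=True."""
    return math.log((1.0 + n) / (1.0 + doc_freq)) + 1.0


# 'wookie' and 'stormtrooper' each appear in 2 of the 3 documents.
idf_star_wars = smooth_idf(2)
# 'harry' and 'potter' each appear in 1 of the 3 documents.
idf_potter = smooth_idf(1)

# TF-IDF is the raw count in the document times the IDF of the term.
tfidf_doc0 = 1 * idf_star_wars
tfidf_doc1 = 3 * idf_star_wars
tfidf_doc2 = 1 * idf_potter

print(round(tfidf_doc0, 8), round(tfidf_doc1, 8), round(tfidf_doc2, 8))
```

Rounded to eight decimals, these three values should agree with the nonzero entries in rows 0, 1, and 2 of the matrix.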

In [11]:
import pandas as pd

In [32]:
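# x above is a dense, unnormalized matrix with no .toarray() method;
# refit with the default L2 norm to get the normalized array shown below
tfidf_df = TfidfVectorizer().fit_transform(text).toarray()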

In [33]:
pd.DataFrame(tfidf_df, columns=tfidf_vectorizer.get_feature_names())

| | harry | potter | stormtrooper | wookie |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.707107 | 0.707107 |
| 1 | 0.0 | 0.0 | 0.707107 | 0.707107 |
| 2 | 0.707107 | 0.707107 | 0.0 | 0.0 |


In [25]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(text)
y = count_vectorizer.transform(text)
count_df = y.toarray()
pd.DataFrame(count_df, columns=count_vectorizer.get_feature_names())

| | harry | potter | stormtrooper | wookie |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 3 | 3 |
| 2 | 1 | 1 | 0 | 0 |


In [26]:
text

['wookie stormtrooper',
 'wookie wookie wookie stormtrooper stormtrooper stormtrooper',
 'harry potter']

In [28]:
len(bag.split())

10

In [45]:
import numpy as np
from scipy.spatial.distance import euclidean, cosine

In [66]:
n = len(tfidf_df)
for i in range(n):
    j = (i - 1) % n  # wrap around so document 0 is paired with document 2
    print 'distance between', j, 'and', i
    print euclidean(tfidf_df[j], tfidf_df[i])

distance between 2 and 0
1.41421356237
distance between 0 and 1
1.57009245868e-16
distance between 1 and 2
1.41421356237


In [67]:
n = len(count_df)
for i in range(n):
    j = (i - 1) % n  # wrap around so document 0 is paired with document 2
    print 'distance between', j, 'and', i
    print cosine(count_df[j], count_df[i])

distance between 2 and 0
1.0
distance between 0 and 1
0.0
distance between 1 and 2
1.0


In [69]:
from sklearn.metrics.pairwise import pairwise_distances

In [70]:
pairwise_distances(count_df, metric='euclidean').round(3)

array([[ 0.   ,  2.828,  2.   ],
       [ 2.828,  0.   ,  4.472],
       [ 2.   ,  4.472,  0.   ]])

In [71]:
pairwise_distances(count_df, metric='cosine')

array([[ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 1.,  1.,  0.]])
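
The two matrices above can be checked by hand from the definitions. The following sketch uses only the standard library; the function names `euclidean_distance` and `cosine_distance` are illustrative stand-ins, not the scikit-learn or SciPy implementations:

```python
import math

# Count vectors over the vocabulary (harry, potter, stormtrooper, wookie).
counts = [[0, 0, 1, 1],
          [0, 0, 3, 3],
          [1, 1, 0, 0]]


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


for i in range(len(counts)):
    for j in range(i + 1, len(counts)):
        print(i, j,
              euclidean_distance(counts[i], counts[j]),
              cosine_distance(counts[i], counts[j]))
```

Documents 0 and 1 point in the same direction, so their cosine distance is zero up to rounding even though their Euclidean distance is large. Euclidean distance is sensitive to document length, which is why it places document 0 closer to the unrelated document 2 than to document 1.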

* Calculate one minus the cosine distance between each pair of de-meaned documents, and the Pearson correlation coefficient between each pair of documents. How are they related? Is this a coincidence? Find a counterexample or prove that there isn't one.

In [49]:
mu = tfidf_df.mean(axis=1, keepdims=True)  # one mean per document (row), not per word

In [51]:
tfidf_demeaned = tfidf_df - mu

In [53]:
pd.DataFrame(tfidf_demeaned, columns=tfidf_vectorizer.get_feature_names())

| | harry | potter | stormtrooper | wookie |
|---|---|---|---|---|
| 0 | -0.353553 | -0.353553 | 0.353553 | 0.353553 |
| 1 | -0.353553 | -0.353553 | 0.353553 | 0.353553 |
| 2 | 0.353553 | 0.353553 | -0.353553 | -0.353553 |


In [55]:
from scipy.stats import pearsonr

In [56]:
n = len(tfidf_demeaned)
for i in range(n):
    j = (i - 1) % n  # wrap around so document 0 is paired with document 2
    print 'similarity between', j, 'and', i
    print 1 - cosine(tfidf_demeaned[j], tfidf_demeaned[i])
    print pearsonr(tfidf_demeaned[j], tfidf_demeaned[i])

similarity between 2 and 0
-1.0
(-1.0, 0.0)
similarity between 0 and 1
1.0
(1.0, 0.0)
similarity between 1 and 2
-1.0
(-1.0, 0.0)


In [73]:
np.corrcoef(count_df)

array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])

In [74]:
np.corrcoef(tfidf_df)

array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])
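
The agreement is not a coincidence. The Pearson coefficient of two vectors x and y is the sum of (x_i - mean(x)) * (y_i - mean(y)) divided by the product of the norms of the two centered vectors, which is exactly the cosine similarity of x and y after each has had its own mean subtracted. No counterexample exists as long as neither vector is constant. Here, de-meaning means subtracting each document's own mean across the vocabulary. The sketch below checks the identity numerically on random vectors; the helper names and the raw-sum form of `pearson` are chosen for illustration:

```python
import math
import random


def center(v):
    """Subtract the vector's own mean from every entry."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def pearson(x, y):
    """Pearson r from the raw-sum formula, with no explicit centering."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) *
                                           (n * syy - sy ** 2))


random.seed(0)
max_gap = 0.0
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(6)]
    y = [random.uniform(-5, 5) for _ in range(6)]
    gap = abs(pearson(x, y) - cosine_similarity(center(x), center(y)))
    max_gap = max(max_gap, gap)

# The largest disagreement should be at the level of floating-point rounding.
print(max_gap)
```

A numerical check like this is not a proof, but it is consistent with the algebraic argument: the two quantities differ only by rounding error.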