In [1]:
%matplotlib inline
import pandas as pd

# Cosine similarity

I'll draw a picture of cosine similarity in class, but it basically means **hey, do these things look the same?** It compares the mix of values (the direction each thing points in), not the totals.
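
If you'd like to see the arithmetic before the picture, here's a minimal sketch. The helper `cosine_sim` and the three made-up vectors are just for illustration (they aren't part of the class data), and it assumes numpy is installed.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

small = [1, 2, 1]       # made-up counts
doubled = [2, 4, 2]     # same proportions, twice as many of everything
different = [3, 0, 0]   # a completely different mix

same_shape = cosine_sim(small, doubled)
diff_shape = cosine_sim(small, different)

print("same proportions:", round(same_shape, 3))
print("different mix:   ", round(diff_shape, 3))
```

The doubled vector scores as identical to the small one because only the proportions matter, not the size.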

Here are some people. They own animals. This example is terrible.

In [2]:
pet_owners = [
    { 'name': 'Matty', 'cats': 7, 'dogs': 22, 'mice': 0 },
    { 'name': 'Margo', 'cats': 12, 'dogs': 9, 'mice': 5 },
    { 'name': 'Marby', 'cats': 9, 'dogs': 5, 'mice': 2 },
    { 'name': 'Maaaa', 'cats': 1, 'dogs': 2, 'mice': 1 },
    { 'name': 'Mappi', 'cats': 5, 'dogs': 9, 'mice': 5 },
    { 'name': 'Maesa', 'cats': 10, 'dogs': 1, 'mice': 0 },
    { 'name': 'Mazda', 'cats': 2, 'dogs': 3, 'mice': 0 }
]

df = pd.DataFrame(pet_owners)
df

Unnamed: 0,cats,dogs,mice,name
0,7,22,0,Matty
1,12,9,5,Margo
2,9,5,2,Marby
3,1,2,1,Maaaa
4,5,9,5,Mappi
5,10,1,0,Maesa
6,2,3,0,Mazda


Let's compare them based on the number of animals they have. So those animal counts will be our **features.**

In [None]:
features_df = df[['cats','dogs','mice']]
features_df

And we'll save everyone's name into a pandas Series called **names**.

In [None]:
names = df.name
names

Scikit-learn has a lot of [pairwise metrics](http://scikit-learn.org/stable/modules/metrics.html#metrics) you can use to see how related things are. Y'know how we spend time clustering things? Measuring how similar each pair is happens along the way; it's the in-between step you don't usually get to see.

You can use other metrics, but cosine similarity is a pretty good/safe/reasonably understandable one.
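
To get a feel for how much the choice of metric matters, here's a small sketch that compares cosine similarity with Euclidean distance for two rows from the table above (Matty and Mazda). Picking those two, and picking Euclidean distance as the comparison, are my choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# [cats, dogs, mice] for two of the pet owners above
pair = np.array([
    [7, 22, 0],   # Matty
    [2, 3, 0],    # Mazda
])

cos = cosine_similarity(pair)[0, 1]
dist = euclidean_distances(pair)[0, 1]

print(f"cosine similarity:  {cos:.3f}")
print(f"euclidean distance: {dist:.3f}")
```

Matty owns far more animals than Mazda, so the Euclidean distance is large, but the mix is similar (mostly dogs, no mice), so the cosine similarity is high.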

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Use cosine_similarity to create a matrix of... similarities
# call it similarity

Looks fun, right? In that big matrix, **every column is a person and every row is a person**, and the intersection is how similar they are. If we want to see it a little more nicely...

In [None]:
# Turn it into a dataframe with an index and columns that make sense
# call it similarity_df
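
If you get stuck, here's one possible way to fill in the two cells above. It's a sketch rather than the only answer, and it rebuilds the pet data so it can run on its own.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

pet_owners = [
    {'name': 'Matty', 'cats': 7, 'dogs': 22, 'mice': 0},
    {'name': 'Margo', 'cats': 12, 'dogs': 9, 'mice': 5},
    {'name': 'Marby', 'cats': 9, 'dogs': 5, 'mice': 2},
    {'name': 'Maaaa', 'cats': 1, 'dogs': 2, 'mice': 1},
    {'name': 'Mappi', 'cats': 5, 'dogs': 9, 'mice': 5},
    {'name': 'Maesa', 'cats': 10, 'dogs': 1, 'mice': 0},
    {'name': 'Mazda', 'cats': 2, 'dogs': 3, 'mice': 0},
]
df = pd.DataFrame(pet_owners)
features_df = df[['cats', 'dogs', 'mice']]
names = df['name']

# One row and one column per person
similarity = cosine_similarity(features_df)

# Label both axes with names so the numbers are readable
similarity_df = pd.DataFrame(similarity, index=names, columns=names)
print(similarity_df.round(2))
```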

Everyone is 100% similar to themselves (`1.0`), and for everyone else the score gets closer and closer to `0` the less related they are.

So, now that we've made a nice dataframe with nice rows and columns, who is Matty most similar to?
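
One way to answer that without squinting at the table is to pull out Matty's column, drop Matty, and sort. This is a sketch: the `counts` dictionary is just the same pet data written compactly so the cell runs on its own.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same numbers as pet_owners: name -> [cats, dogs, mice]
counts = {
    'Matty': [7, 22, 0],
    'Margo': [12, 9, 5],
    'Marby': [9, 5, 2],
    'Maaaa': [1, 2, 1],
    'Mappi': [5, 9, 5],
    'Maesa': [10, 1, 0],
    'Mazda': [2, 3, 0],
}
features = pd.DataFrame.from_dict(counts, orient='index', columns=['cats', 'dogs', 'mice'])
similarity_df = pd.DataFrame(
    cosine_similarity(features),
    index=features.index,
    columns=features.index,
)

# Matty's scores against everyone else, most similar first
matty = similarity_df['Matty'].drop('Matty').sort_values(ascending=False)
closest = matty.index[0]
print(matty.round(3))
```

Swap `'Matty'` for any other name to ask the same question about someone else.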

Let's visualize this while we're at it! We're going to use seaborn because it's so easy.

In [None]:
%matplotlib inline
import seaborn as sns

# cmap makes the colors better. The default one is pretty ugly.
ax = sns.heatmap(similarity_df, cmap="YlGnBu")

# Well, sure, okay. Let's try it with real data.

In [None]:
import glob
from pathlib import Path

filenames = glob.glob("books/*/*.txt")
# read_text() opens and closes each file for us
contents = [Path(filename).read_text() for filename in filenames]
df = pd.DataFrame({
    'filename': filenames,
    'content': contents
})
df.head()

In [None]:
# Raw strings keep the backslash intact; \. matches a literal dot
df['author'] = df.filename.str.extract(r"books/(.*)/", expand=False)
df['title'] = df.filename.str.extract(r"books/.*/(.*)\.txt", expand=False)
df.head()
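
If you don't have the `books` folder handy, here's a small sketch of what those two patterns pull out. The two paths below are hypothetical, made up to match the `books/<author>/<title>.txt` layout the code assumes.

```python
import pandas as pd

# Hypothetical paths, only here to show what the patterns capture
demo = pd.DataFrame({'filename': [
    'books/austen/pride-and-prejudice.txt',
    'books/tolkien/the-hobbit.txt',
]})

# The folder name becomes the author, the file name (minus .txt) the title
demo['author'] = demo['filename'].str.extract(r"books/(.*)/", expand=False)
demo['title'] = demo['filename'].str.extract(r"books/.*/(.*)\.txt", expand=False)
print(demo[['author', 'title']])
```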

### What words do those books use?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words='english')
matrix = vec.fit_transform(df.content)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out(), index=df.title)
features_df

### How similar are they? We could cluster, but instead...

In [None]:
# Use cosine similarity to make a similarity matrix, 
# then use that to make a dataframe with sensible indexes and columns

In [None]:
sns.heatmap(similarity_df, cmap="YlGnBu")

## Let's try it without removing stopwords to compare

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
matrix = vec.fit_transform(df.content)
similarity = cosine_similarity(matrix)
similarity_df = pd.DataFrame(similarity, index=df.title, columns=df.title)
sns.heatmap(similarity_df, cmap="YlGnBu")

Hmm, interesting: **why does it change so much?**
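
One thing worth checking is how much of the similarity comes from words like "the" and "is". Here's a toy sketch: the two sentences are made up and share nothing except stopwords, and `toy_similarity` is a helper named just for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up sentences that only have stopwords in common
toy = [
    "the cat is on the mat",
    "the dog is in the park",
]

def toy_similarity(**vectorizer_kwargs):
    """Vectorize the two toy sentences and return how similar they are."""
    matrix = TfidfVectorizer(**vectorizer_kwargs).fit_transform(toy)
    return cosine_similarity(matrix)[0, 1]

keeping_stopwords = toy_similarity()
removing_stopwords = toy_similarity(stop_words='english')

print(f"keeping stopwords:  {keeping_stopwords:.3f}")
print(f"removing stopwords: {removing_stopwords:.3f}")
```

Real books share plenty of ordinary words too, so the effect won't be this extreme, but it points the same way.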

## Is Pride and Prejudice more like State of the Union addresses or J.R.R. Tolkien?