In [None]:
import pandas as pd
pd.set_option("display.max_columns", 100)
%matplotlib inline

# Even more text analysis with scikit-learn

We've spent the past week counting words, *and we're just going to keep right on doing it.*

The technical term for this is **bag of words** analysis, because it doesn't care about what order the words are in. It's like you just took all of the words in a speech or a book or whatever and just dumped them into a bag. A bag of words.

It seems like it would be terrible but it really gets the job done.
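
Here's a minimal sketch of the idea in plain Python. It assumes a naive tokenizer (lowercase, strip periods, split on spaces), which is not what scikit-learn does. Two sentences with the same words in a different order end up as the same bag.

```python
from collections import Counter

def bag_of_words(text):
    # Naive tokenizer for illustration only: lowercase, drop periods, split on spaces
    words = text.lower().replace(".", "").split()
    return Counter(words)

bag_a = bag_of_words("Penny ate a fish.")
bag_b = bag_of_words("A fish ate Penny.")

print(bag_a)
print(bag_a == bag_b)  # word order is gone, so the two bags match
```

Who ate whom is lost entirely, which is exactly the trade-off a bag of words makes.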

# Even more dumb sentences

We can't let go of fish, bugs, and Penny. But this time we also have some cats.

In [None]:
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

# Exercise A : Put these sentences into TWO sensible groups

Not with programming, just with your brain.

In [None]:
texts = [
    "The cat ate a fish at the store.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
]

texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "Penny is a fish"
]

# Exercise B: Put these sentences into THREE groups based on their content

Again, not with programming, just with your brain.

In [None]:
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

# Now, on to the computer

We already know how to **vectorize**, how to convert sentences into numeric representations. We use a **vectorizer!** There are two options we've learned about, the `CountVectorizer` and the `TfidfVectorizer`.

* `CountVectorizer`: count the words
* `TfidfVectorizer`: percentage of the words in a sentence (kind of)

### CountVectorizer

Just normal counting

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
matrix = vec.fit_transform(texts)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())

### TfidfVectorizer

So far we've used `TfidfVectorizer` to compare sentences of different length (your name in a tweet vs. your name in a book).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(use_idf=False, norm='l1')
matrix = vec.fit_transform(texts)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
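
What `use_idf=False, norm='l1'` boils down to can be sketched by hand: count each word, then divide by the total number of words in the sentence. This is only an approximation with a naive tokenizer, and `term_frequency` is a made-up helper for illustration. scikit-learn's default tokenizer also drops one-letter words like `a`, so its numbers can differ.

```python
from collections import Counter

def term_frequency(text):
    # Naive tokenizer for illustration only
    words = text.lower().replace(".", "").split()
    counts = Counter(words)
    total = sum(counts.values())
    # Each word's share of the sentence; the shares add up to 1
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("Penny bought bright blue fishes.")
print(tf)
```

Five words, each used once, so every word gets one fifth of the sentence.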

## Stemming

That all seems fine, but we need to combine `meowed` and `meowing` and whatever else, yeah? We'll use TextBlob for that, and give our vectorizer a custom tokenizer.

In [None]:
from textblob import TextBlob

def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

vec = CountVectorizer(tokenizer=textblob_tokenizer)
matrix = vec.fit_transform(texts)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
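
To get a feel for what stemming does, here is a deliberately crude suffix stripper. It is *not* the Porter stemmer that TextBlob's `.stem()` uses. That one has many more rules, which is why `penny` shows up as `penni` in the column names. `toy_stem` is a hypothetical helper for illustration only.

```python
def toy_stem(word):
    # Strip a few common suffixes, keeping at least three letters of the word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["meowed", "meowing", "fishes", "fish", "cats", "is"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Different forms of the same word collapse into one column, so `meowed` and `meowing` get counted together.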

## ...oh, and stopwords

And let's get rid of stopwords, too

In [None]:
vec = CountVectorizer(tokenizer=textblob_tokenizer, stop_words='english')
matrix = vec.fit_transform(texts)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())

# Section One: Term Frequency (TF)

We've talked about **term frequency** before: it's just the percentage of times each word is used in a sentence. Let's refresh what our sentences are, then use a `TfidfVectorizer`.

In [None]:
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

In [None]:
# We have to use these other parameters because I SAID SO
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      norm='l1', # ELL - ONE
                      use_idf=False)
matrix = vec.fit_transform(texts)
df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
df

## Which sentence is the most about fish?

In [None]:
df.sort_values(by='fish')

## What about fish AND meowing?

In [None]:
df[['fish', 'meow']]

In [None]:
df.meow + df.fish

In [None]:
pd.DataFrame({
    'fish': df.fish,
    'meow': df.meow,
    'meow + fish': df.meow + df.fish
})

Looks like index `4` and `6` are tied, but `meow` doesn't even show up in sentence `6`! That's no good, or at least it *seems* silly.

It seems like since `fish` shows up again and again it should be weighted a little less - not like it’s a stopword, but just... it’s kind of cliche to have it show up in the text, so we want to make it less important.

So maybe, you know, **popular words should be less important.**

# Section Two: Inverse Document Frequency (IDF)

The concept that words that are more popular across all of the documents should be less important is **inverse document frequency!** We're going to try it again, this time changing `use_idf=False` to `use_idf=True`. The vectorizer actually uses inverse document frequency by default, but this will help us remember what is going on.

In [None]:
# We have to use these other parameters because I SAID SO
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      norm='l1',
                      use_idf=True)
matrix = vec.fit_transform(texts)
idf_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
idf_df
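
For the curious, the weight behind `use_idf=True` can be computed by hand. With scikit-learn's default smoothing (`smooth_idf=True`), the weight for a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is how many of them contain the term. The sketch below assumes that, after stemming, `fish` appears in all 7 of our sentences and `meow` in 2 of them. `smoothed_idf` is a made-up helper name.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    # Smoothed inverse document frequency: rarer terms get bigger weights
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

idf_fish = smoothed_idf(7, 7)  # assumed: fish is in every sentence
idf_meow = smoothed_idf(7, 2)  # assumed: meow is in two sentences

print(idf_fish)
print(idf_meow)
```

A word that shows up everywhere gets a weight of 1, while the rarer word gets almost twice that. Each term's count is multiplied by its weight before the row is normalized.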

In [None]:
# OLD dataframe
pd.DataFrame({
    'fish': df.fish,
    'meow': df.meow,
    'meow + fish': df.meow + df.fish
})

In [None]:
# NEW dataframe
pd.DataFrame({
    'fish': idf_df.fish,
    'meow': idf_df.meow,
    'meow + fish': idf_df.meow + idf_df.fish
})

Okay, so things changed a little, but **I'm honestly not that impressed.**

You know how we've been setting `norm='l1'` all of the time? By default it actually uses an `l2` (Euclidean) norm, which works a lot better, pulling apart the differences between sentences. Why? I don't know. What does it mean? I don't know. How does it work? I don't know. But let's get rid of that "ELL ONE" in order to work with the defaults.

In [None]:
# We have to *get rid of* norm='l1' because I SAID SO
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      use_idf=True)
matrix = vec.fit_transform(texts)
idf_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
idf_df
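
As for what the two norms actually do: `l1` divides each row by the sum of its values, so a row adds up to 1 (hence the "percentage" feel). `l2` divides by the row's Euclidean length, so every row ends up with length 1. A handy side effect of `l2` is that multiplying two rows entry by entry and summing gives their cosine similarity. A small sketch with made-up numbers and made-up helper names:

```python
import math

def l1_normalize(row):
    # Divide by the sum of absolute values: the entries add up to 1
    total = sum(abs(value) for value in row)
    return [value / total for value in row]

def l2_normalize(row):
    # Divide by the Euclidean length: the row ends up with length 1
    length = math.sqrt(sum(value ** 2 for value in row))
    return [value / length for value in row]

row_a = [3.0, 4.0]  # made-up weights for two words
row_b = [4.0, 3.0]

l1_a = l1_normalize(row_a)
l2_a = l2_normalize(row_a)
print(l1_a)
print(l2_a)

# Dot product of two l2-normalized rows is their cosine similarity
cosine = sum(a * b for a, b in zip(l2_a, l2_normalize(row_b)))
print(cosine)
```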

Now let's compare again.

In [None]:
# OLD dataframe
pd.DataFrame({
    'fish': df.fish,
    'meow': df.meow,
    'meow + fish': df.meow + df.fish
})

In [None]:
# NEW dataframe
pd.DataFrame({
    'fish': idf_df.fish,
    'meow': idf_df.meow,
    'meow + fish': idf_df.meow + idf_df.fish
})

Now *that's* a lot better. Look at index 4! It's amazing! Sure, we have a **fish** but that **meow** is just powering beyond anything known to humankind!

# Section Three: Document Similarity

## Who cares? Why do we need to know this?

When someone dumps 100,000 documents on your desk in response to FOIA, you’ll start to care! One of the reasons understanding TF-IDF is important is because of **document similarity**. By knowing what documents are similar you’re able to find related documents and automatically group documents into clusters.

For example! Let’s cluster these documents using **K-Means clustering** (check out this [gif](http://practicalcryptography.com/media/miscellaneous/files/k_mean_send.gif)). K-means basically plots all of the numbers on a graph and grabs the ones that group together. It doesn't make sense right now, but we'll do a simpler example in a second.

In [None]:
# We have to use these other parameters because I SAID SO
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      use_idf=True)
matrix = vec.fit_transform(texts)
idf_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
idf_df

In [None]:
# KMeans clustering is a kind of clustering.
from sklearn.cluster import KMeans

number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)

In [None]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()
for i in range(number_of_clusters):
    # Take the five highest-weighted terms for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

In [None]:
km.labels_

In [None]:
texts

In [None]:
results = pd.DataFrame()
results['text'] = texts
results['category'] = km.labels_
results
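
If you're wondering what K-means is doing under the hood, here is a rough sketch of the loop in plain Python: assign every point to its nearest center, move each center to the average of its points, repeat. The points and starting centers are made up, and `kmeans` is a hypothetical helper, not scikit-learn's implementation. scikit-learn picks its starting centers with some randomness, which is why the cluster numbers can change from run to run.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    labels = []
    for _ in range(iterations):
        # Step 1: label each point with the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 2: move each centroid to the average of its members
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        centroids = new_centroids
    return labels, centroids

# Made-up 2-D points: two near the top left, two near the bottom right
points = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.9, 0.1)]
labels, centers = kmeans(points, centroids=[points[0], points[2]])
print(labels)
print(centers)
```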

## How about 3 categories of documents?

## That was confusing. Can we visualize it?

This time we're going to say, **only find two important words to measure**. We're going to use `max_features=` to have it auto-select, but we could also use `vocabulary=` if we wanted to.

In [None]:
texts = [
    'Penny bought bright blue fishes.',
    'Penny bought bright blue and orange bowl.',
    'The cat ate a fish at the store.',
    'Penny went to the store. Penny ate a bug. Penny saw a fish.',
    'It meowed once at the bug, it is still meowing at the bug and the fish',
    'The cat is at the fish store. The cat is orange. The cat is meowing at the fish.',
    'Penny is a fish.',
    'Penny Penny she loves fishes Penny Penny is no cat.',
    'The store is closed now.',
    'How old is that tree?',
    'I do not eat fish I do not eat cats I only eat bugs.'
]

vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      use_idf=True,
                      max_features=2)
matrix = vec.fit_transform(texts)
df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
df
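
If we wanted to pick the words ourselves instead of letting `max_features=` choose, `vocabulary=` takes a list of terms. Here's a small sketch using a plain `CountVectorizer` with the default tokenizer, so there's no stemming. The words `fish` and `penny`, the two sample sentences, and the name `picky_vec` are chosen just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_texts = [
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "Penny is a fish",
]

# Only count the words we ask for; columns follow the order of the list
picky_vec = CountVectorizer(vocabulary=["fish", "penny"])
counts = picky_vec.fit_transform(sample_texts).toarray().tolist()
print(counts)
```

Every other word is simply ignored, so each sentence is described by exactly two numbers.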

Notice how we now have two numbers for every sentence? Well, let's plot them!

In [None]:
ax = df.plot(kind='scatter', x='fish', y='penni', alpha=0.2, s=200)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")

You can see a few groups. 3 or 4, maybe? Let's see if we can do the same with K-means.

In [None]:
number_of_clusters = 3
km = KMeans(n_clusters=number_of_clusters)
km.fit(matrix)

In [None]:
# Move the labels into a column of our dataframe
# the first label matches the first row, second label is second row, etc
df['category'] = km.labels_
df

In [None]:
# Category 0 is red
# Category 1 is green
# Category 2 is blue
colormap = {
    0: 'red',
    1: 'green',
    2: 'blue'
}

# Create a list of colors from every single row
colors = df.apply(lambda row: colormap[row.category], axis=1)

# And plot it!
ax = df.plot(kind='scatter', x='fish', y='penni', alpha=0.1, s=300, c=colors)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")

Ooh, that's fun, right? Let's try it again, this time with **four categories** instead of three.

In [None]:
km = KMeans(n_clusters=4)
km.fit(matrix)
df['category'] = km.labels_

colormap = { 0: 'red', 1: 'green', 2: 'blue', 3: 'purple'}
colors = df.apply(lambda row: colormap[row.category], axis=1)
ax = df.plot(kind='scatter', x='fish', y='penni', alpha=0.1, s=300, c=colors)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")

Now just imagine instead of 2 dimensions (2 words), you have 100 dimensions (100 words). It's more complicated and you sure can't visualize it, but it's the same thing!
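
The distance between two sentences is measured the same way no matter how many words there are. A tiny sketch with made-up points, assuming Python 3.8+ for `math.dist`:

```python
import math

# Two words: a point on a flat graph
distance_2d = math.dist((0.0, 0.0), (3.0, 4.0))

# Four words: impossible to draw, but the same calculation
distance_4d = math.dist((0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0))

print(distance_2d)
print(distance_4d)
```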

## Using more information

Right now we're only vectorizing **Penny** and **fish** - remember how we did `max_features`? Right now it's only doing term frequency across those two elements - it doesn't matter if there are 10000 words in a book, if "Penny" shows up once and "fish" shows up twice, the vectorizer is like "OH BOY THIS IS ALL ABOUT FISH."

If we wanted it to be a little more aware of the rest of the words, we could do our vectorization across *all* features (all words), then select only the `fish` and `penni` columns when *doing the K-means fit*.

In [None]:
# Vectorize and save into a new dataframe
vec = TfidfVectorizer(tokenizer=textblob_tokenizer, stop_words='english', use_idf=True)
matrix = vec.fit_transform(texts)
df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
df.head(2)

So now that we have a score for ALL of the words, let's ask K-Means to only pay attention to `fish` and `penni`.

In [None]:
# Cluster with 3 categories
# only using the 'fish' and 'penni' categories
km = KMeans(n_clusters=3)
km.fit(df[['fish', 'penni']])

# Assign the category to the dataframe
df['category'] = km.labels_

# Build our color map
colormap = { 0: 'red', 1: 'green', 2: 'blue' }
colors = df.apply(lambda row: colormap[row.category], axis=1)

# Plot our scatter
ax = df.plot(kind='scatter', x='fish', y='penni', alpha=0.1, s=300, c=colors)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")

Notice how we normally do `km.fit(matrix)` but this time we did `km.fit(df[['fish', 'penni']])`? It turns out that **you can use `matrix` and `df` interchangeably**. The `df` is just the matrix with column names.

## Time to get crazy

What if we're talking about **3 features?** 3 different words? It doesn't seem that nuts, but... can we graph that?

In [None]:
# Vectorize and save into a new dataframe
vec = TfidfVectorizer(tokenizer=textblob_tokenizer, max_features=3, stop_words='english', use_idf=True)
matrix = vec.fit_transform(texts)
df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
df.head(2)

In [None]:
# Cluster
km = KMeans(n_clusters=4)
km.fit(df)

# Assign the category to the dataframe
df['category'] = km.labels_

# Build our color map
colormap = {0: 'red', 1: 'green', 2: 'blue', 3: 'orange'}
colors = df.apply(lambda row: colormap[row.category], axis=1)

In [None]:
# Plot our scatter
ax = df.plot(kind='scatter', x='fish', y='penni', alpha=0.2, s=300, c=colors)
ax.set_xlabel("Fish")
ax.set_ylabel("Penny")

In [None]:
# Plot our scatter
ax = df.plot(kind='scatter', x='penni', y='cat', alpha=0.2, s=300, c=colors)
ax.set_xlabel("Penny")
ax.set_ylabel("Cat")

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def draw(ax, df):
    colormap = { 0: 'red', 1: 'green', 2: 'blue', 3: 'orange' }
    colors = df.apply(lambda row: colormap[row.category], axis=1)

    ax.scatter(df['fish'], df['penni'], df['cat'], c=colors, s=100, alpha=0.5)
    ax.set_xlabel('Fish')
    ax.set_ylabel('Penni')
    ax.set_zlabel('Cat')

chart_count_vert = 5
chart_count_horiz = 5
number_of_graphs = chart_count_vert * chart_count_horiz

fig = plt.figure(figsize=(3 * chart_count_horiz, 3 * chart_count_vert))

for i in range(number_of_graphs):
    # add_subplot takes (rows, columns, index), so the vertical count comes first
    ax = fig.add_subplot(chart_count_vert, chart_count_horiz, i + 1, projection='3d', azim=(-360 / number_of_graphs) * i)
    draw(ax, df)