In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Document clustering

### Document clustering: concepts
1. Clean the data before processing
2. Determine the importance of the terms in each document (as a TF-IDF matrix)
3. Cluster the TF-IDF matrix
4. Find the top terms and documents in each cluster

### Clean and tokenize data
- Split the text into smaller parts called tokens, then clean the tokens for processing

In [2]:
from nltk.tokenize import word_tokenize
import re  # regular expressions

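def remove_noise(text, stop_words=()):
    """Tokenize text, strip non-alphanumeric characters, and lowercase."""
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        # Drop every character that is not a letter or digit
        token = re.sub('[^A-Za-z0-9]+', '', token).lower()
        # Keep tokens longer than one character that are not stop words
        if len(token) > 1 and token not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens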

In [3]:
word_tokenize('It is lovely weather we are having. I hope the weather continues.')

['It',
 'is',
 'lovely',
 'weather',
 'we',
 'are',
 'having',
 '.',
 'I',
 'hope',
 'the',
 'weather',
 'continues',
 '.']

In [4]:
remove_noise("It is lovely weather we are having. I hope the weather continues.")

['it',
 'is',
 'lovely',
 'weather',
 'we',
 'are',
 'having',
 'hope',
 'the',
 'weather',
 'continues']
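
The call above leaves `stop_words` empty, which is why words such as 'it' and 'the' survive the cleaning. Below is a minimal sketch of the same step with a stop-word list passed in. The whitespace-based `simple_tokenize` helper and the short stop-word set are assumptions made here so that the example runs without NLTK data; they are not part of the original notebook.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: split on whitespace
    return text.split()

def remove_noise(text, stop_words=()):
    cleaned_tokens = []
    for token in simple_tokenize(text):
        # Strip punctuation and lowercase before checking stop words
        token = re.sub('[^A-Za-z0-9]+', '', token).lower()
        if len(token) > 1 and token not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens

# Illustrative stop-word set (an assumption, not a standard list)
stop_words = {'it', 'is', 'we', 'are', 'the'}

sentence = 'It is lovely weather we are having. I hope the weather continues.'
cleaned = remove_noise(sentence, stop_words)
print(cleaned)
```

Because `TfidfVectorizer` accepts any callable as its tokenizer, one way to pass stop words through would be `functools.partial(remove_noise, stop_words=stop_words)`.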

### TF-IDF (Term Frequency - Inverse Document Frequency)
- A weighted measure that evaluates how important a word is to a document in a collection
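
To make the weighting concrete, here is a small sketch that computes a textbook TF-IDF score, tf(t, d) × ln(N / df(t)), by hand. The three toy documents are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration)
docs = [
    ['weather', 'is', 'lovely'],
    ['weather', 'is', 'cold'],
    ['movies', 'are', 'lovely'],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)            # share of the document
    idf = math.log(n_docs / doc_freq[term])    # rarer terms score higher
    return tf * idf

scores = {term: round(tf_idf(term, docs[1]), 3) for term in docs[1]}
print(scores)
```

In the second document, 'cold' receives the highest score because it appears in no other document, while 'weather' and 'is' are shared with the first document and are weighted down.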

In [5]:
plots = pd.read_csv('movies_plot.csv')['Plot'].loc[:249].values

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, 
                                   max_features=50, 
                                   min_df=0.2, 
                                   tokenizer=remove_noise)
tfidf_matrix = tfidf_vectorizer.fit_transform(plots)
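
The `max_df` and `min_df` arguments decide which terms reach the matrix: given as fractions, they drop terms that occur in too many or too few documents, and `max_features` then keeps only the most frequent remaining terms. The sketch below shows the effect on a made-up five-document corpus, using the default tokenizer rather than `remove_noise`; the thresholds are chosen only to make the filtering visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = [
    'the cat sat',
    'the dog sat',
    'the bird flew',
    'the cat flew',
    'the fish swam',
]

# max_df=0.8 drops terms found in more than 80% of documents ('the')
# min_df=0.3 drops terms found in fewer than 30% of documents (1 of 5)
vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.3)
matrix = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(matrix.shape)
```

Only 'cat', 'flew' and 'sat' appear in two of the five documents, so they are the only terms that pass both thresholds.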

### Clustering with sparse matrix
- kmeans() in SciPy does not support sparse matrices
- Use .todense() to convert the sparse matrix to a dense matrix
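
A quick sketch of what the conversion produces, on a tiny hand-made sparse matrix. `.todense()` returns a `numpy.matrix`, a legacy class that NumPy no longer recommends, while the closely related `.toarray()` returns a plain `numpy.ndarray`; which one a given function prefers is worth checking against the SciPy version in use.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny sparse matrix standing in for a TF-IDF matrix
sparse = csr_matrix(np.array([[0.0, 0.5, 0.0],
                              [0.7, 0.0, 0.0]]))

dense_matrix = sparse.todense()   # numpy.matrix
dense_array = sparse.toarray()    # numpy.ndarray

print(type(dense_matrix).__name__, type(dense_array).__name__)
# Every cell is stored once the matrix is dense, zeros included
print(dense_array.nbytes, 'bytes once dense')
```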

In [7]:
from scipy.cluster.vq import kmeans

num_clusters=3
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)

In [8]:
cluster_centers

array([[0.03199333, 0.03800253, 0.03905884, 0.04780701, 0.15927658,
        0.08286124, 0.14366909, 0.10332187, 0.01572662, 0.06112195,
        0.04432267, 0.01627148, 0.0612348 , 0.16220105, 0.03320014,
        0.08915877, 0.1312196 , 0.12051358, 0.02581344, 0.02069035,
        0.06471153, 0.04109723, 0.06502353, 0.03929313, 0.04106087,
        0.05931895, 0.0856839 , 0.04183711, 0.03581026, 0.02019313,
        0.09768303, 0.05684122, 0.05990943, 0.04036268, 0.0065288 ,
        0.06716669, 0.09616371, 0.07169883, 0.02882143, 0.08292705,
        0.05855747, 0.0482609 , 0.0807424 , 0.06684079, 0.06749194,
        0.03200596, 0.0364663 , 0.05917189, 0.13646209, 0.02485002],
       [0.06669829, 0.10953079, 0.04895786, 0.04969485, 0.11803865,
        0.10019863, 0.15929194, 0.12475225, 0.07653305, 0.08340628,
        0.05981812, 0.05802609, 0.14886519, 0.18307346, 0.04583243,
        0.05107095, 0.17268784, 0.11298278, 0.05027139, 0.06399587,
        0.1086991 , 0.06914847, 0.10724338, 0.2

In [9]:
cluster_centers.shape

(3, 50)
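
`kmeans()` returns only the cluster centers and the distortion, not the cluster of each document. One way to recover the assignments is `vq()` from the same SciPy module, sketched here on made-up data: the six two-column rows stand in for a dense TF-IDF matrix and are not taken from the movie plots.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Toy stand-in for a dense TF-IDF matrix: 6 "documents", 2 "terms"
observations = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
])

cluster_centers, distortion = kmeans(observations, 2)

# vq() returns the index of the nearest center for every observation
labels, distances = vq(observations, cluster_centers)
print(labels)
```

The cluster numbering depends on the random initialization, so the labels may be swapped between runs, but the first three rows should share one label and the last three the other.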

### Top terms per cluster
- Each cluster center is a list of values, one per term
- Each value is the importance of the corresponding term in that cluster
- Create a dictionary that maps terms to values, then print the top terms
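
As an alternative to the dictionary approach, the top terms can also be read off with `numpy.argsort`. This is a sketch on made-up values: the five terms and the two cluster centers below are illustrative and do not come from the movie data.

```python
import numpy as np

terms = ['father', 'her', 'him', 'new', 'she']
# Made-up cluster centers: 2 clusters x 5 terms
cluster_centers = np.array([
    [0.05, 0.30, 0.10, 0.02, 0.25],
    [0.20, 0.04, 0.35, 0.15, 0.03],
])

top_terms = []
for center in cluster_centers:
    # argsort is ascending, so reverse it and keep the first three indices
    top_idx = np.argsort(center)[::-1][:3]
    top_terms.append([terms[i] for i in top_idx])

print(top_terms)
# [['her', 'she', 'him'], ['him', 'father', 'new']]
```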

In [10]:
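# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = tfidf_vectorizer.get_feature_names_out().tolist()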
print(terms)

['about', 'after', 'all', 'also', 'an', 'are', 'as', 'at', 'back', 'be', 'been', 'before', 'but', 'by', 'can', 'father', 'for', 'from', 'goes', 'had', 'has', 'have', 'her', 'him', 'himself', 'into', 'it', 'new', 'not', 'off', 'on', 'one', 'out', 'she', 'tells', 'that', 'their', 'them', 'then', 'they', 'this', 'two', 'up', 'was', 'when', 'where', 'which', 'while', 'who', 'will']


In [11]:
for i in range(num_clusters):
    center_terms = dict(zip(terms, list(cluster_centers[i])))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:3])

['by', 'an', 'as']
['that', 'him', 'by']
['her', 'she', 'that']


### More considerations
- Handle hyperlinks, emoticons, etc.
- Normalize words (run, ran, running --> run)
- .todense() may exhaust memory on large datasets
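
For the memory concern, one option is to swap in a different implementation: scikit-learn's `KMeans` accepts sparse input directly, so the TF-IDF matrix never has to be made dense. This is a sketch of that alternative, not the SciPy `kmeans()` used in these notes; the four toy documents and the `n_init` and `random_state` values are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = [
    'space ship crew lands on alien planet',
    'alien planet attacks space ship',
    'young couple fall in love in paris',
    'love story of a couple in paris',
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # stays sparse

# Swapped-in technique: scikit-learn's KMeans instead of scipy's kmeans
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(tfidf_matrix)

print(labels)
print(model.cluster_centers_.shape)
```

The fitted model exposes the centers as `model.cluster_centers_`, so the top-terms loop works the same way on them.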

### Exercise: TF-IDF of movie plots
Let us use the plots of randomly selected movies for document clustering. Before clustering, the documents need to be cleaned of unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF.

Use the TfidfVectorizer class to compute the TF-IDF of the movie plots stored in the list plots. The remove_noise() function is available to use as a tokenizer in the TfidfVectorizer class. The .fit_transform() method fits the vectorizer to the data and then generates the TF-IDF sparse matrix.

Note: It takes a few seconds to run the .fit_transform() method.

In [12]:
# Import TfidfVectorizer class from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, 
                                   max_df=0.75, 
                                   max_features=50, 
                                   tokenizer=remove_noise)

# Use the .fit_transform() method on the list plots
tfidf_matrix = tfidf_vectorizer.fit_transform(plots)


### Exercise: Top terms in movie clusters
Now that you have created a sparse matrix, generate cluster centers and print the top three terms in each cluster. Use the .todense() method to convert the sparse matrix, tfidf_matrix, to a dense matrix for the kmeans() function to process. Then, use the .get_feature_names() method (replaced by .get_feature_names_out() in scikit-learn 1.0 and later) to get the list of terms from the tfidf_vectorizer object. The zip() function in Python pairs up the elements of two lists.
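
As a quick illustration of the zip() step, here is a sketch with four made-up terms and importance values; the names mirror the ones used in the exercise code.

```python
terms = ['her', 'him', 'she', 'they']
center = [0.31, 0.12, 0.27, 0.05]  # made-up importance values

# zip() pairs each term with the value at the same position
center_terms = dict(zip(terms, center))

# Sort the keys by their values, largest first
sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
print(sorted_terms[:3])
# ['her', 'she', 'him']
```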

The tfidf_vectorizer object and the sparse matrix, tfidf_matrix, from the previous exercise have been retained in this exercise. kmeans has been imported from SciPy.

With more data points, the clusters would be more clearly defined. However, that requires more computation than is practical in an exercise here.

In [13]:
num_clusters = 2

# Generate cluster centers through the kmeans function
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)

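# Generate terms from the tfidf_vectorizer object
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
terms = tfidf_vectorizer.get_feature_names_out().tolist()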

for i in range(num_clusters):
    # Sort the terms and print top 3 terms
    center_terms = dict(zip(terms, list(cluster_centers[i])))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:3])

['her', 'she', 'him']
['him', 'they', 'an']

