**CS3320 Lab 5. Document Clustering**

In this lab, we will learn how to cluster a set of documents using Python. Our motivating example is to identify the latent structure within the synopses of the top 100 films of all time. We will perform the following steps:
*   tokenizing and stemming each synopsis
*   transforming the corpus into vector space using tf-idf
*   calculating the cosine distance between each pair of documents as a measure of dissimilarity
*   clustering the documents using the k-means algorithm
*   using multidimensional scaling to reduce dimensionality within the corpus
*   plotting the clustering output using matplotlib


In [None]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import os
import codecs
from sklearn import feature_extraction
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
import joblib

import matplotlib.pyplot as plt
import matplotlib as mpl
import requests

nltk.download('stopwords')
nltk.download('punkt')
# newer NLTK releases load the tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab')

Let's download the dataset:

In [None]:
!wget https://github.com/gvogiatzis/CS3320/raw/main/data/Lab5.zip
!unzip Lab5.zip -d docs

We have three primary lists:
*   'titles': the titles of the films in rank order
*   'wiki synopses': the synopses of the films from Wikipedia, matched to the 'titles' order
*   'imdb synopses': the synopses of the films from IMDb, matched to the 'titles' order

The code for reading the files is simple and similar to previous workshops:

In [None]:
#import three lists: titles, wikipedia synopses and imdb synopses
#by reading the data from files
#ensure that we are reading only the first 100 records.
titles = open('docs/Lab5/title_list.txt').read().split('\n')
#ensures that only the first 100 are read in
titles = titles[:100]

synopses_wiki = open('docs/Lab5/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)

synopses_wiki = synopses_clean_wiki

synopses_imdb = open('docs/Lab5/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]

synopses_clean_imdb = []

for text in synopses_imdb:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_imdb.append(text)

synopses_imdb = synopses_clean_imdb

print(str(len(titles)) + ' titles')
print(str(len(synopses_wiki)) + ' wiki synopses')
print(str(len(synopses_imdb)) + ' imdb synopses')

After reading the data from files, we are cleaning the synopses (wiki and imdb both) using BeautifulSoup. BeautifulSoup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
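
To make the cleaning step concrete, here is a minimal sketch that parses a small hand-written HTML string and extracts its text. The `toy_html` fragment and the `toy_` variable names are illustrative assumptions, not part of the lab data. `get_text()` is the same method as the `getText()` alias used in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration
toy_html = "<div><h1>The Godfather</h1><p>A <b>crime</b> family saga.</p></div>"

# Build the parse tree with Python's built-in HTML parser
toy_soup = BeautifulSoup(toy_html, "html.parser")

# The tree can be navigated by tag name
toy_heading = toy_soup.find("h1").get_text()

# get_text() joins all text nodes and drops the tags
toy_text = toy_soup.get_text(separator=" ", strip=True)

print(toy_heading)
print(toy_text)
```

The same two calls (`BeautifulSoup(...)` followed by `get_text()`) apply to any HTML string, including one obtained from a web request.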

Complete the following task: fetch the Aston University wiki page (https://en.wikipedia.org/wiki/Aston_University), extract the page content using BeautifulSoup, and print the text. You can get the page using the following command:
`page = requests.get(YOURLINK)`. Store the page text in the `astonText` variable.
  

Now let's combine the wiki and imdb synopses of each film and generate a rank for each film.

In [None]:
synopses = []

for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)

# generates index for each item in the corpora (in this case it's just rank) and I'll use this for scoring later
ranks = []
for i in range(0,len(titles)):
    ranks.append(i)

We need to remove stopwords from the synopses and stem the remaining words, so we will reuse the code from the last workshop.

In [None]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
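
The filtering step inside both functions can be seen in isolation in the following sketch. The `toy_tokens` list is a hypothetical example of what a tokenizer might return; only tokens containing at least one ASCII letter survive the regular-expression check.

```python
import re

# Hypothetical token list, as a tokenizer might produce it
toy_tokens = ["The", "Godfather", ",", "1972", "'s", "...", "Part", "II"]

# Keep only tokens that contain at least one ASCII letter;
# pure punctuation and pure numbers are dropped
toy_filtered = [t for t in toy_tokens if re.search("[a-zA-Z]", t)]

print(toy_filtered)
```

Note that a token such as `'s` is kept because it contains a letter, even though it starts with punctuation.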

Let's call the functions to process the synopses and store the processed (filtered) tokens in a data frame.

In [None]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)

Complete the following task: remove the stopwords from astonText (the variable that contains the Aston wiki page text) and stem the remaining words.

**Tf-idf and document similarity** 
We will define term frequency-inverse document frequency (tf-idf) vectorizer parameters and then convert the synopses list into a tf-idf matrix. Please refer to Lab 02 if you are unfamiliar with tf-idf.

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)

print(tfidf_matrix.shape)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()

dist = 1 - cosine_similarity(tfidf_matrix)
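
To see what the vectorizer and the cosine distance compute, the following sketch works through three tiny made-up documents in plain Python. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are the scikit-learn defaults; the `toy_` names and the documents themselves are hypothetical, and the sketch ignores stemming, stop words and n-grams.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already tokenized by whitespace
toy_docs = [
    "mafia family war",
    "family home war soldier",
    "dance love song",
]
toy_tokenized = [d.split() for d in toy_docs]
toy_n = len(toy_tokenized)
toy_vocab = sorted({t for doc in toy_tokenized for t in doc})

# Document frequency: in how many documents does each term occur?
toy_df = Counter(t for doc in toy_tokenized for t in set(doc))

def toy_tfidf(doc):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    tf = Counter(doc)
    vec = [tf[t] * (math.log((1 + toy_n) / (1 + toy_df[t])) + 1) for t in toy_vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

toy_vectors = [toy_tfidf(doc) for doc in toy_tokenized]

def toy_cosine_distance(a, b):
    # The vectors have unit length, so the dot product is the cosine similarity
    return 1 - sum(x * y for x, y in zip(a, b))

toy_dist = [[toy_cosine_distance(a, b) for b in toy_vectors] for a in toy_vectors]

for row in toy_dist:
    print([round(d, 3) for d in row])
```

Documents that share terms (the first two) end up with a distance between 0 and 1, while documents with no terms in common (the first and third) have a distance of exactly 1.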

**K-means clustering**
Using the tf-idf matrix, we can run a range of clustering algorithms to better understand the hidden structure within the synopses. We will use the k-means algorithm with 5 clusters. K-means starts from a set of initial centroids. Each observation is assigned to the cluster whose centroid is nearest, which minimizes the within-cluster sum of squares. Next, the mean of the observations in each cluster is calculated and used as the new cluster centroid. Observations are then reassigned to clusters and centroids recalculated iteratively until the assignments stop changing (convergence).
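
The assign-and-update loop can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the 2-D `toy_points`, the fixed initial centroids and the `toy_` function names are hypothetical, whereas scikit-learn's `KMeans` chooses its initial centroids with k-means++ by default.

```python
def toy_sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def toy_kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: toy_sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two obvious groups
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
toy_labels, toy_centroids = toy_kmeans(toy_points, [toy_points[0], toy_points[3]])

print(toy_labels)
print(toy_centroids)
```

With real data the result depends on the initial centroids, which is why different runs of k-means can produce different clusters.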

In [None]:
num_clusters = 5
km = KMeans(n_clusters=num_clusters)

km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

joblib.dump(km,'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters }

frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster'])

Let's display the number of films per cluster.

In [None]:
frame['cluster'].value_counts()

Let's group by cluster for aggregation purposes and display the average rank per cluster (ranks run from 0 to 99 because they were generated with `range`).

In [None]:
grouped = frame['rank'].groupby(frame['cluster'])
grouped.mean()

Let's identify the top n (here we are using 6) words per cluster that are nearest to the cluster centroid and display them. These words give a good sense of the main topic of the cluster.

In [None]:
print("Top terms per cluster:")
print()
# sort the centroid coordinates of each cluster in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]:
        # map the stemmed term back to a full word from vocab_frame
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0], end=',')
    print()
    print()
    print("Cluster %d titles:" % i, end='')
    for title in frame.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print()
    print()
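
The `argsort()[:, ::-1]` expression in the cell above ranks the terms of each centroid from highest to lowest weight. The following sketch shows the idea on a tiny made-up centroid matrix; the `toy_centers` values and `toy_terms` list are hypothetical.

```python
import numpy as np

# Hypothetical centroids: 2 clusters x 3 terms (one tf-idf weight per term)
toy_terms = ["famili", "polic", "war"]
toy_centers = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.0, 0.3],
])

# argsort gives ascending order per row; reversing the columns makes it descending
toy_order = toy_centers.argsort()[:, ::-1]

# Take the two highest-weighted terms for each cluster
toy_top = [[toy_terms[ind] for ind in row[:2]] for row in toy_order]

print(toy_order)
print(toy_top)
```

Each row of `toy_order` holds column indices, so indexing the term list with them recovers the terms that best characterize each cluster.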


Multidimensional scaling (MDS) is a technique that creates a map displaying the relative positions of a number of objects, given only a table of the distances between them. We have already calculated the pairwise distances between the films (`dist`), so let's use them to display the films in a two-dimensional plot.

In [None]:
#set up colors per clusters using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e'}

#set up cluster names using a dict
cluster_names = {0: 'Family, home, war', 
                 1: 'Police, killed, murders', 
                 2: 'Father, New York, brothers', 
                 3: 'Dance, singing, love', 
                 4: 'Killed, soldiers, captain'}

# two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

xs, ys = pos[:, 0], pos[:, 1]

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 

#group by cluster
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling

#iterate through groups to layer the plot
#note that we use the cluster_names and cluster_colors dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=cluster_names[name], color=cluster_colors[name], mec='none')

#hide ticks and tick labels on both axes (tick_params expects booleans here, not 'off')
ax.set_aspect('auto')
ax.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # labels along the bottom edge are off
ax.tick_params(
    axis='y',          # changes apply to the y-axis
    which='both',      # both major and minor ticks are affected
    left=False,        # ticks along the left edge are off
    right=False,       # ticks along the right edge are off
    labelleft=False)   # labels along the left edge are off
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the film title
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8)
    
plt.show() #show the plot

Complete the following task: write the code to save the graph to a file.