# 1. Setting Up


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk import ConditionalFreqDist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# 2. Loading the Dataset

We will analyze NLTK's `inaugural` corpus, which contains the inaugural speeches of US presidents. Let's load it and explore it.

In [None]:
import nltk
nltk.download("inaugural")

from nltk.corpus import inaugural

In [None]:
raw_data = []
for fileid in inaugural.fileids():
    raw_data.append([fileid, " ".join(inaugural.words(fileid))])
data = pd.DataFrame(raw_data, columns=["File ID", "Text"])
data

# 2. Vectorize the Text

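As we learnt in lecture, one way to vectorize text is with the [Term Frequency-Inverse Document Frequency](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) (TF-IDF) featurizer. In the cell below, `max_df=0.95` drops terms that appear in more than 95% of the speeches, `min_df=3` drops terms that appear in fewer than 3 speeches, and `stop_words='english'` removes scikit-learn's built-in list of English stop words.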

In [None]:
vectorizer = TfidfVectorizer(max_df=0.95, min_df=3, stop_words='english')
X = vectorizer.fit_transform(data['Text'])

Print the shape of your dataset. **Question**: What does each dimension stand for?

In [None]:
print(X.shape)
words = vectorizer.get_feature_names_out()

Print the stop words. **Question**: Do you think this is a reasonable list of stopwords?

In [None]:
vectorizer.get_stop_words()

# 3. Running K-Means

We will now run k-means to cluster the dataset, using sklearn's [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Set `random_state=416`.

In [None]:
k = 5
kmeans = KMeans(n_clusters=k, random_state=416)
kmeans.fit(X)

For visualization purposes, let's add the cluster labels to the pandas dataframe.

In [None]:
data["Clusters k=%d" % k] = kmeans.labels_
data

**Questions**: What trends do you observe? What underlying patterns might the clustering algorithm have picked up on?

To further analyze the clusters, let's print, for each cluster, the vocabulary words that appear in the largest number of its speeches.

In [None]:
# For each cluster, count how many of its speeches contain each vocabulary word.
label_col = "Clusters k=%d" % k
cluster_to_words_to_num_occurrences = {}
for i in range(k):
    # Tokenize each speech in the cluster once, rather than once per word.
    token_sets = [set(text.lower().split(" ")) for text in data.loc[data[label_col] == i, "Text"]]
    cluster_to_words_to_num_occurrences[i] = {
        word: sum(word.lower() in tokens for tokens in token_sets) for word in words
    }

num_words = 10
for i in range(k):
    # Sort (count, word) pairs so the most widespread words come first.
    top_words = sorted(((count, word) for word, count in cluster_to_words_to_num_occurrences[i].items()), reverse=True)
    print("Cluster %d: " % i, top_words[:num_words])

**Question**: What words are common across all clusters? What words are more unique to particular clusters?

**Question**: Why do the clusters not correspond to meaningful topics of words?

# 4. Selecting K

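scikit-learn's `KMeans` estimator exposes an `inertia_` attribute: the value of the k-means objective, i.e. the sum of squared distances from each sample to its closest cluster center. Lower inertia means tighter clusters.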
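
To make the objective concrete, the sketch below recomputes the inertia by hand and compares it with the value reported by `KMeans`. It uses made-up 2-D points rather than the speech vectors; `points`, `toy_kmeans`, `assigned_centers`, and `manual_inertia` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups, for illustration only.
rng = np.random.default_rng(416)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=416).fit(points)

# Inertia by hand: squared distance from each point to its assigned centroid, summed.
assigned_centers = toy_kmeans.cluster_centers_[toy_kmeans.labels_]
manual_inertia = np.sum((points - assigned_centers) ** 2)

# The two numbers should agree up to floating-point error.
print(manual_inertia, toy_kmeans.inertia_)
```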

**Question**: What would we expect the inertia to be when k=59?

In [None]:
ks = []
inertias = []
for k in range(1, 60, 2):
    # Train k-means with this k and record its inertia (clustering quality).
    ks.append(k)
    kmeans = KMeans(n_clusters=k, random_state=416).fit(X)
    inertias.append(kmeans.inertia_)

Now plot the inertia against k.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ks, inertias, marker='o')
ax.set_ylim(0, 50)
ax.set_xlabel("K")
ax.set_ylabel("Objective Function")
ax.set_xticks(range(0, ks[-1], 2), minor=True)
ax.grid(which='both')

**Question**: What appears to be the best value of k?


# 5. (Bonus) Exploring the Data!

The function below takes a list of words and plots their occurrence in presidents' speeches over the years. Note that it matches prefixes, so "war" also counts words such as "warfare". Use it to identify trends in the data!

In [None]:
def words_over_time(words):
    cfd = ConditionalFreqDist(
        (target, int(fileid[:4]))
        for fileid in inaugural.fileids()
        for w in inaugural.words(fileid)
        for target in words
        if w.lower().startswith(target))
    cfd.plot()

In [None]:
plt.figure(figsize=(12, 5))
words_over_time(["war", "peace"])
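
To see what the `ConditionalFreqDist` inside `words_over_time` is counting without downloading the corpus or drawing a plot, here is a small sketch on invented data. The `toy_tokens` list and the names `targets` and `toy_cfd` are hypothetical; the (year, word) pairs stand in for the words of each speech.

```python
from nltk import ConditionalFreqDist

# Hypothetical (year, word) observations standing in for the corpus, for illustration only.
toy_tokens = [
    (1801, "war"), (1801, "peaceful"), (1801, "peace"),
    (1805, "warfare"), (1805, "citizens"),
]
targets = ["war", "peace"]

# Same pairing as in words_over_time: condition = target prefix, sample = year.
toy_cfd = ConditionalFreqDist(
    (target, year)
    for year, w in toy_tokens
    for target in targets
    if w.lower().startswith(target)
)

# Counts per target and year; years with no match count as 0.
print(toy_cfd["war"][1801], toy_cfd["war"][1805])
print(toy_cfd["peace"][1801], toy_cfd["peace"][1805])
```

Because matching is by prefix, "peaceful" is counted under "peace" and "warfare" under "war", which is worth keeping in mind when reading the plot.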