# Clustering Text Data

BitTiger DS501

This assignment uses the `articles.pkl` file, which has 1405 articles from 'Arts', 'Books', 'Business Day', 'Magazine', 'Opinion', 'Real Estate', 'Sports', 'Travel', 'U.S.', and 'World'. This is a [pickled](https://docs.python.org/2/library/pickle.html) data frame and can be loaded back into a [data frame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html#pandas.read_pickle). You will probably want to get the data out of the pandas DataFrame eventually when you perform your analysis.

| Section | count|
| :---| :--|
|Arts| 91|
|Automobiles| 5|
|Books| 37|
|Booming| 7|
|Business Day| 100|
|Corrections| 10|
|Crosswords & Games| 2|
|Dining & Wine| 19|
|Education| 4|
|Fashion & Style| 46|
|Great Homes and Destinations| 5|
|Health| 10|
|Home & Garden| 10|
|Magazine| 11|
|Movies| 28|
|N.Y. / Region| 92|
|Opinion| 84|
|Paid Death Notices| 11|
|Real Estate| 13|
|Science| 18|
|Sports| 134|
|Technology| 13|
|Theater| 16|
|Travel| 9|
|U.S.| 88|
|World | 131|
|Your Money | 6 |
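
As a rough illustration of the loading step, the sketch below round-trips a tiny made-up DataFrame through a pickle file and counts articles per section. The toy rows and the temporary file name are assumptions for illustration only; the column names `headline`, `section_name`, and `content` are the ones used later in this notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for articles.pkl; the real file has many more rows and columns.
toy_df = pd.DataFrame({
    "headline": ["Game recap", "Gallery opening", "Market update", "Playoff preview"],
    "section_name": ["Sports", "Arts", "Business Day", "Sports"],
    "content": ["the team won the game", "the museum opened a new exhibit",
                "stocks fell as the market closed", "the team enters the playoffs"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_articles.pkl")  # hypothetical file name
    toy_df.to_pickle(path)            # write the pickled DataFrame
    loaded_df = pd.read_pickle(path)  # read it back

# Per-section article counts, sorted by section name like the table above.
section_counts = loaded_df["section_name"].value_counts().sort_index()
print(section_counts)

# Pull the raw text out of pandas, ready for a vectorizer.
documents = loaded_df["content"].tolist()
```

Running the same `value_counts()` call on the real `articles_df['section_name']` should give a per-section table like the one above.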

1. Use pandas' `pd.read_pickle()` to load the data into a DataFrame. Apply k-means clustering to the articles in `articles.pkl`.

2. To find out what "topics" Kmeans has discovered we must inspect the centroids. Print out the centroids of the Kmeans clustering.

   These centroids are simply a bunch of vectors.  To make any sense of them we need to map these vectors back into our 'word space'.  Think of each centroid vector as the "average" article of its cluster: each feature/dimension holds the average weight (occurrences) of one word across the articles in that cluster.

3. But for topics we are only really interested in the most present words, i.e. features/dimensions with the greatest representation in the centroid.  Print out the top ten words for each centroid.

    * Sort each centroid vector to find the top 10 features
    * Go back to your vectorizer object to find out what words each of these features corresponds to.

4. Look at the docs for `TfidfVectorizer` and see if you can limit the number of features (words) included in the feature matrix.  This can help reduce some noise and make the centroids slightly more sensible.  Limit the `max_features` and see if the words of the topics change at all.

5. An alternative to finding out what each cluster represents is to look at the articles that are assigned to it.  Print out the titles of a random sample of the articles assigned to each cluster to get a sense of the topic.

6. What 'topics' has kmeans discovered? Can you try to assign a name to each?  Do the topics change as you change k (just try this for a few different values of k)?

7. If you set k equal to the number of NYT sections in the dataset, does it return topics that map to a section?  Why or why not?

8. Try your clustering only with a subset of the original sections.  Do the topics change or get more specific if you only use 3 sections (e.g. Sports, Arts, and Business Day)?  Are there any cross-section topics (e.g. a Sports article that talks about the economics of a baseball team) you can find?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

## K-Means

#### Load data from articles.pkl to DataFrame

In [None]:
import numpy as np
import pandas as pd

In [None]:
articles_df = pd.read_pickle("data/articles.pkl")

In [None]:
articles_df

#### Vectorize the article content as tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(articles_df['content'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (>= 1.0)
features = vectorizer.get_feature_names_out()


#### Apply k-means clustering to the vectors

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans()

kmeans.fit(X)

#### Inspect the centroids

In [None]:
print("cluster centers:")
print(kmeans.cluster_centers_)

#### Find the top 10 features for each cluster.

In [None]:
top_centroids = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
print("top features for each cluster:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(features[i] for i in centroid)))

#### Limit the number of features and see if the words of the topics change.

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(articles_df['content'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (>= 1.0)
features = vectorizer.get_feature_names_out()
kmeans = KMeans()
kmeans.fit(X)
top_centroids = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]
print("top features for each cluster with 1000 max features:")
for num, centroid in enumerate(top_centroids):
    print("%d: %s" % (num, ", ".join(features[i] for i in centroid)))

#### Print out the titles of a random sample of the articles assigned to each cluster to get a sense of the topic.

In [None]:
assigned_cluster = kmeans.transform(X).argmin(axis=1)

# assigned_cluster = kmeans.predict(X)
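
The two ways of getting cluster assignments in the cell above should agree, because `transform` returns distances to each centroid and `predict` returns the nearest centroid. A small check on made-up 2-D points (not the article vectors) illustrates this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two loose blobs of toy points (made-up data, not the article vectors).
points = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(5.0, 0.5, size=(20, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# transform() gives the distance from every point to every centroid ...
distances = model.transform(points)
nearest_by_distance = distances.argmin(axis=1)

# ... and predict() returns the index of the nearest centroid directly.
nearest_by_predict = model.predict(points)

print(distances.shape)
print(np.array_equal(nearest_by_distance, nearest_by_predict))
```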

In [None]:
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, X.shape[0])[assigned_cluster==i]
    sample_articles = np.random.choice(cluster, 3, replace=False)
    print("cluster %d:" % i)
    for article in sample_articles:
        # sample_articles holds row positions, so select by position with iloc
        print("    %s" % articles_df.iloc[article]['headline'])

#### If you set k equal to the number of NYT sections in the dataset, does it return topics that map to a section?

In [None]:
from collections import Counter

In [None]:
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
assigned_cluster = kmeans.transform(X).argmin(axis=1)
print("top 2 sections for each cluster")
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, X.shape[0])[assigned_cluster==i]
    # cluster holds row positions, so index with iloc; drop only missing section names
    sections = articles_df.iloc[cluster]['section_name'].dropna()
    most_common = Counter(sections).most_common(2)
    print("Cluster %d: %s" % (i, ", ".join(name for name, count in most_common)))
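
Another way to compare clusters with sections is a contingency table. The sketch below uses `pd.crosstab` on made-up labels; `toy_sections` and `toy_clusters` are hypothetical stand-ins for `articles_df['section_name']` and `assigned_cluster`.

```python
import pandas as pd

# Made-up labels standing in for articles_df['section_name'] and assigned_cluster.
toy_sections = pd.Series(["Sports", "Sports", "Arts", "Arts", "World", "Sports"],
                         name="section_name")
toy_clusters = pd.Series([0, 0, 1, 1, 1, 0], name="cluster")

# Rows are clusters, columns are sections, cells are article counts.
table = pd.crosstab(toy_clusters, toy_sections)
print(table)

# Most common section in each cluster.
dominant = table.idxmax(axis=1)
print(dominant)
```

If the clusters mapped cleanly onto sections, each row of the table would have a single large entry; rows spread over several sections suggest topics that cut across sections.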


#### Try clustering with a subset of the sections.


In [None]:
# Create masks
cond_sports = articles_df['section_name']=='Sports'
cond_arts = articles_df['section_name']=='Arts'
cond_business_day = articles_df['section_name']=='Business Day'

In [None]:
three_articles_df = articles_df[cond_sports | cond_arts | cond_business_day]

In [None]:
three_articles_df

In [None]:
kmeans = KMeans(n_clusters=3)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(three_articles_df['content'])
kmeans.fit(X)
assigned_cluster = kmeans.transform(X).argmin(axis=1)
print("top 2 sections for each cluster")
for i in range(kmeans.n_clusters):
    cluster = np.arange(0, X.shape[0])[assigned_cluster==i]
    # three_articles_df keeps the original index labels, so select rows by position
    sections = three_articles_df.iloc[cluster]['section_name'].dropna()
    most_common = Counter(sections).most_common()
    print("Cluster %d: %s" % (i, most_common[0][0]))
    if len(most_common) > 1:
        print(" %s" % (most_common[1][0]))


## Hierarchical Clustering

We have been introduced to distance metrics and the idea of similarity, but we will take a deeper dive here. For many machine learning algorithms, the idea of 'distance' between two points is a crucial abstraction to perform analysis. With k-means we are usually limited to Euclidean distance, even though our domain might have a more appropriate distance function (e.g. cosine similarity for text).  With hierarchical clustering we will not be limited in this way.
We already have our bags of words and have played around with k-means clustering.  Now we are going to leverage [Scipy](http://www.scipy.org/) to perform [hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering).
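
To make the Euclidean-versus-cosine point concrete: `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors squared Euclidean distance equals twice the cosine distance. The check below uses three made-up sentences; it is only a sketch of that identity, not a claim that k-means on tf-idf is identical to cosine-based clustering (the centroids themselves are generally not unit length).

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
toy_docs = [
    "the team won the baseball game",
    "the pitcher threw a perfect game",
    "stocks fell as the market closed",
]

# TfidfVectorizer's default norm='l2' scales every row to unit length.
X = TfidfVectorizer(stop_words="english").fit_transform(toy_docs).toarray()
row_norms = np.linalg.norm(X, axis=1)
print(row_norms)

squared_euclidean = pdist(X, metric="sqeuclidean")
cosine_distance = pdist(X, metric="cosine")  # 1 - cosine similarity

# For unit vectors: ||a - b||^2 = 2 - 2 a.b = 2 * (1 - cos(a, b))
print(np.allclose(squared_euclidean, 2 * cosine_distance))
```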

1. Hierarchical clustering is more computationally intensive than Kmeans.  Also it is hard to visualize the results of a hierarchical clustering if you have too much data (since it represents its clusters as a tree). Create a subset of the original articles by filtering the data set to contain at least one article from each section and at most around 100 total articles.

    One issue with text (especially when visualizing/clustering) is high dimensionality.  Any method that uses distance metrics is susceptible to the [curse of dimensionality](http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/). `scikit-learn` has some utilities to do feature selection for us on our bags of words.

2. The first step to using scipy's hierarchical clustering is to find out how similar our vectors are to one another.  To do this we use the `pdist` [function](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) to compute the pairwise distances between our data points.  First we will just use Euclidean distance.  Examine the shape of what is returned.

3. A quirk of `pdist` is that it returns one looong vector.  Use scipy's [squareform](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html) function to get our long vector of distances back into a square matrix.  Look at the shape of this new matrix.

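4. Now that we have our pairwise distances we can start to cluster!  Pass them into scipy's `linkage` [function](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) to compute our hierarchical clusters.  Note that `linkage` expects the condensed vector returned by `pdist`; a square matrix is interpreted as raw observations, so apply `squareform` to a square distance matrix again to condense it first.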

5. In theory we now have all the information about our clusters, but it is basically impossible to interpret in a sensible manner.  Thankfully scipy also has a function to visualize this madness.  Using scipy's `dendrogram` [function](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html), plot the linkages as a hierarchical tree.
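
The steps in this list can be sketched on a small random matrix to see the shapes involved. `toy_X` is made-up data, and `no_plot=True` is used so the sketch runs without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
toy_X = rng.rand(6, 4)  # 6 made-up observations, 4 features

# Step 2: pairwise distances come back as one condensed vector, n * (n - 1) / 2 long.
condensed = pdist(toy_X, metric="euclidean")
print(condensed.shape)

# Step 3: squareform turns it into an n x n symmetric matrix with a zero diagonal.
square = squareform(condensed)
print(square.shape)

# Step 4: linkage() takes the condensed vector; a 2-D input is treated as observations.
link = linkage(condensed, method="complete")
print(link.shape)  # one row per merge, 4 columns each

# Step 5: no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(link, no_plot=True)
print(tree["leaves"])
```

Calling `squareform` on the square matrix converts it back to the condensed vector, which is handy when only the square form is at hand.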

#### Create a subset of the original articles by filtering the data set to contain at least one article from each section and at most 100 total articles.

In [None]:
small_mask = np.zeros(len(articles_df)).astype(bool)
indices = np.arange(len(articles_df))
for category in articles_df['section_name'].unique():
    category_mask = (articles_df['section_name'] == category).values
    new_index = np.random.choice(indices[category_mask])
    small_mask[new_index] = True
additional_indices = np.random.choice(indices[np.logical_not(small_mask)],
                                      100 - sum(small_mask),
                                      replace=False)
small_mask[additional_indices] = True
small_df = articles_df.loc[small_mask]

In [None]:
# Verify that this is good:
assert len(small_df) == 100
assert len(small_df['section_name'].unique()) == len(articles_df['section_name'].unique())

#### First vectorize our articles

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
small_X = vectorizer.fit_transform(small_df['content'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (>= 1.0)
small_features = vectorizer.get_feature_names_out()

#### Before using scipy's Hierarchical clustering, we need to first find out how similar our vectors are to one another.

In [None]:
from scipy.spatial.distance import pdist, squareform

In [None]:
# now get pairwise distances
distxy = squareform(pdist(small_X.todense(), metric='cosine'))


#### Pass the pairwise distances into scipy's linkage function to compute our hierarchical clusters.

In [None]:
from scipy.cluster.hierarchy import linkage

In [None]:
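# linkage expects the condensed distance vector, so convert the square matrix back;
# passing distxy directly would treat its rows as observations
link = linkage(squareform(distxy), method='complete')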

#### Using scipy's dendrogram function, plot the linkages as a hierarchical tree.

In [None]:
from scipy.cluster.hierarchy import dendrogram

In [None]:
dendro = dendrogram(link, color_threshold=1.5, leaf_font_size=9)
plt.show()

## Hierarchical Topics

#### To make your clusters more interpretable, change the labels on the data to be the titles of the articles.


In [None]:
dendro = dendrogram(link, color_threshold=1.5, leaf_font_size=9,
                    labels=small_df['headline'].values)

#### Label each point with the title and the section.


In [None]:
fig, ax = plt.subplots(1, figsize=(12, 6))

labels = (small_df['headline'] + ' :: ' + small_df['section_name']).values
dendro = dendrogram(link, color_threshold=1.5, leaf_font_size=9,
                    labels=labels)

In [None]:
fig, ax = plt.subplots(1, figsize=(12, 36))

labels = (small_df['headline'] + ' :: ' + small_df['section_name']).values
dendro = dendrogram(link, color_threshold=1.5, leaf_font_size=9,
                    labels=labels, orientation='right')

#### Form flat clusters from the linkage matrix by setting a threshold

In [None]:
# form clusters from linkage matrix by setting threshold
# (criterion defaults to 'inconsistent', so t is an inconsistency-coefficient threshold)
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(link, t=1.14)

df_res = pd.DataFrame({'section_name':small_df['section_name'],'clusters':clusters})
# print(df_res)
df_res['count'] = 1
print(df_res[['section_name','count']].groupby(['section_name']).sum())
print(df_res[['clusters','count']].groupby(['clusters']).sum())
print(df_res[['clusters','section_name','count']].groupby(['clusters','section_name']).sum())
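
`fcluster` supports several cutting criteria besides its default `'inconsistent'`. The sketch below, on two made-up blobs of points, shows two alternatives that are often easier to reason about: cutting at a merge distance and asking for a maximum number of clusters. The threshold values here are chosen for the toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
# Two tight, well-separated blobs of made-up points.
toy_X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
                   rng.normal(10.0, 0.1, size=(5, 2))])
link = linkage(pdist(toy_X), method="complete")

# Cut the tree wherever a merge joins groups farther apart than t.
by_distance = fcluster(link, t=5.0, criterion="distance")

# Or ask for at most t flat clusters.
by_count = fcluster(link, t=2, criterion="maxclust")

print(by_distance)
print(by_count)
```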

#### Explore different clusters on a per section basis.


In [None]:
def plot_dendrogram_by_category(articles_df, category, n_articles=20):
    # sample n_articles from one section (the section must have at least that many)
    mask = articles_df['section_name'] == category
    cat_df = articles_df[mask].sample(n=n_articles)
    
    vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
    cat_X = vectorizer.fit_transform(cat_df['content'])
    # pdist returns the condensed distance vector that linkage expects
    dist_condensed = pdist(cat_X.toarray(), metric='cosine')
    fig, ax = plt.subplots(1, figsize=(6, 6))
    
    labels = cat_df['headline'].values
    # labels = (cat_df['headline'] + ' :: ' + cat_df['subsection_name']).values
    
    dendro = dendrogram(linkage(dist_condensed, method='complete'),
                        color_threshold=4,
                        leaf_font_size=8,
                        labels=labels,
                        orientation='right')
    ax.set_title(category)


In [None]:
for category in ['Arts', 'Sports', 'World']:
    plot_dendrogram_by_category(articles_df, category)

#### Perform the same analysis as above and inspect the dendrogram with the words from the articles.

In [None]:
plt.figure(figsize=(12, 120))
# cluster the words (columns) instead of the articles;
# keep the condensed distance vector, which is what linkage expects
dist_words = pdist(small_X.T.toarray(), metric='cosine')
dendro = dendrogram(linkage(dist_words, method='complete'),
                    color_threshold=2, leaf_font_size=8,
                    labels=small_features, orientation='right')