# Week 5, Lesson 2, Activity 3: Topic clustering algorithm

&copy;2021, Ekaterina Kochmar \
(edited: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a clustering algorithm and apply it to the set of posts from the `20 Newsgroups` dataset as specified in this notebook.

## Step 1: Data loading

First, let's import the libraries that we are going to use in this notebook. Then, let's define a function to load the *training* and *test* subsets using a predefined list of categories. Note that the following options are also available:
- you can use `load_dataset('all', categories)` to load the whole dataset (training and test subsets together) for the selected categories
- you can use `load_dataset('train', None)` to load the training subset for all 20 categories

Note that you are working with the same dataset as last week.

In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

def load_dataset(a_set, cats):
    dataset = fetch_20newsgroups(subset=a_set, categories=cats,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True)
    return dataset

categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball"]
categories += ["rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_train = load_dataset('train', categories)
newsgroups_test = load_dataset('test', categories)

## Step 2: Data preprocessing

Now let's prepare the data for unsupervised approaches:

In [2]:
import random
random.seed(42)

all_news = list(zip(newsgroups_train.data, newsgroups_train.target))
all_news += list(zip(newsgroups_test.data, newsgroups_test.target))
random.shuffle(all_news)

all_news_data = [text for (text, label) in all_news]
all_news_labels = [label for (text, label) in all_news]

print("Data:")
print(str(len(all_news_data)) + " posts in "
      + str(np.unique(all_news_labels).shape[0]) + " categories\n")

print("Labels: ")
print(all_news_labels[:10])
num_clusters = np.unique(all_news_labels).shape[0]
print("Assumed number of clusters: " + str(num_clusters))

Data:
9850 posts in 10 categories

Labels: 
[2, 6, 1, 9, 0, 5, 1, 2, 9, 0]
Assumed number of clusters: 10


Since the original dimensionality of the data is too high to allow for efficient clustering, let's reduce it using truncated [`Singular Value Decomposition`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD) (SVD):

In [3]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, max_df=0.5,
                             stop_words='english',
                             use_idf=True)

def transform(data, vectorizer, dimensions):
    trans_data = vectorizer.fit_transform(data)
    print("Transformed data contains: " + str(trans_data.shape[0]) +
          " with " + str(trans_data.shape[1]) + " features =>")

    #See more examples here:
    #https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py
    svd = TruncatedSVD(dimensions)
    pipe = make_pipeline(svd, Normalizer(copy=False))
    reduced_data = pipe.fit_transform(trans_data)

    return reduced_data, svd

reduced_data, svd = transform(all_news_data, vectorizer, 300)
print("Reduced data contains: " + str(reduced_data.shape[0]) +
        " with " + str(reduced_data.shape[1]) + " features")

Transformed data contains: 9850 with 33976 features =>
Reduced data contains: 9850 with 300 features
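
To see what the pipeline above does to individual vectors, here is a minimal sketch on a tiny corpus. The posts and the `toy_*` names are hypothetical and exist only for this illustration; the sketch assumes the same scikit-learn components as the cell above. It checks that, after `Normalizer`, every reduced vector has unit length. For unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, so the Euclidean distances used by k-means rank neighbours in the same way as cosine similarity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus, used only for illustration.
toy_posts = [
    "the car engine needs new oil",
    "my bike engine makes a strange noise",
    "the goalie saved the puck in the hockey game",
    "the hockey team won the game last night",
    "the new car has a powerful engine",
    "the baseball team lost the game",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_posts)

# Keep 2 latent dimensions, then rescale every row to unit L2 norm.
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_pipe = make_pipeline(toy_svd, Normalizer(copy=False))
toy_reduced = toy_pipe.fit_transform(toy_tfidf)

row_norms = np.linalg.norm(toy_reduced, axis=1)
print("TF-IDF shape: ", toy_tfidf.shape)
print("Reduced shape:", toy_reduced.shape)
print("Row norms:    ", np.round(row_norms, 3))
print("Variance kept:", toy_svd.explained_variance_ratio_.sum())
```

The last line reports how much of the variance in the TF-IDF matrix the retained components explain, which is one rough way to judge whether the chosen number of dimensions discards too much information.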


## Step 3: Apply k-means clustering

Now, let's cluster the data using the [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm:

In [4]:
from sklearn.cluster import KMeans

def cluster(data, num_clusters):
    km = KMeans(n_clusters=num_clusters, init='k-means++', 
                max_iter=100, random_state=0)
    km.fit(data)
    return km

km = cluster(reduced_data, num_clusters)
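
The fitted `KMeans` object exposes the cluster assignments and centroids that the evaluation step relies on. The sketch below illustrates these attributes on synthetic data rather than on the newsgroup posts: the three 2-D blobs and the `toy_*` names are assumptions made for this example. `n_init` is set explicitly because its default value differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_points = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in blob_centers]
)

toy_km = KMeans(n_clusters=3, init="k-means++", n_init=10,
                max_iter=100, random_state=0)
toy_km.fit(toy_points)

# labels_ holds one cluster index per point; cluster_centers_ holds one row per cluster.
print("Cluster sizes:", np.bincount(toy_km.labels_))
print("Centroids:\n", np.round(toy_km.cluster_centers_, 1))
# inertia_ is the sum of squared distances from each point to its closest centroid.
print("Inertia:", toy_km.inertia_)
```

Cluster indices are arbitrary: nothing ties cluster 0 to the first blob here, or to the first newsgroup category in the real data, which is why the evaluation needs measures that ignore label names.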

## Step 4: Evaluate the results

Finally, let's evaluate the results. See the material from Lesson 3 to get more insights about how to interpret the results. What do the informative words suggest about each cluster?
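
The evaluation reports homogeneity, completeness, and V-measure. As a rough guide to reading these scores, the sketch below applies the same `sklearn.metrics` functions to hand-made labels; the `report` helper and the label lists are hypothetical and serve only as an illustration.

```python
from sklearn import metrics

def report(name, gold, predicted):
    """Print and return homogeneity, completeness and V-measure."""
    h = metrics.homogeneity_score(gold, predicted)
    c = metrics.completeness_score(gold, predicted)
    v = metrics.v_measure_score(gold, predicted)
    print(f"{name}: homogeneity={h:.2f}, completeness={c:.2f}, V-measure={v:.2f}")
    return h, c, v

# Hypothetical gold labels: two categories with two posts each.
gold = [0, 0, 1, 1]

# Same grouping as the gold labels, but with the cluster names swapped.
relabelled = report("relabelled", gold, [1, 1, 0, 0])
# Every post in its own cluster: each cluster is pure, but categories are split up.
over_split = report("over-split", gold, [0, 1, 2, 3])
# Everything in one cluster: categories stay together, but the cluster is mixed.
all_merged = report("all merged", gold, [0, 0, 0, 0])
```

The relabelled clustering scores 1.0 on all three measures, since only the grouping matters and not the cluster names. Over-splitting keeps homogeneity at 1.0 but halves completeness, while merging everything keeps completeness at 1.0 and drives homogeneity to 0. V-measure is the harmonic mean of the two, so it penalises both failure modes.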

In [6]:
from sklearn import metrics

def evaluate(km, labels, svd):
    print("Clustering report:\n")
    # Compare the gold category labels with the cluster assignments.
    print(f"* Homogeneity: {metrics.homogeneity_score(labels, km.labels_)}")
    print(f"* Completeness: {metrics.completeness_score(labels, km.labels_)}")
    print(f"* V-measure: {metrics.v_measure_score(labels, km.labels_)}")

    print("\nMost discriminative words per cluster:")
    # Map the centroids from the reduced SVD space back to the TF-IDF term space,
    # then sort the term indices of each centroid by descending weight.
    original_space_centroids = svd.inverse_transform(km.cluster_centers_)
    order_centroids = original_space_centroids.argsort()[:, ::-1]

    # Note: `vectorizer` and `num_clusters` are the globals defined in earlier cells.
    terms = vectorizer.get_feature_names_out()
    for i in range(num_clusters):
        print("Cluster " + str(i) + ": ")
        cl_terms = ""
        # Print the 50 highest-weighted terms of this centroid.
        for ind in order_centroids[i, :50]:
            cl_terms += terms[ind] + " "
        print(cl_terms + "\n")
        
evaluate(km, all_news_labels, svd)

print("\nCategories:")
for category in newsgroups_train.target_names:
    print("*", category)

Clustering report:

* Homogeneity: 0.44704612364219637
* Completeness: 0.48937733354001
* V-measure: 0.46725493318104117

Most discriminative words per cluster:
Cluster 0: 
don people know think just like time good right ve use doctor does way say make things years medical long really problem want did disease said cause thing going msg probably work read better sure food pain day doesn ll patients didn person cancer lot help believe case little new 

Cluster 1: 
car bike engine cars just like new miles ride good don rear ve oil know ford road speed think really drive time right riding dealer used bikes driving got make honda does gear problem power tires way buy little wheel manual clutch want auto turn thing need left year brake 

Cluster 2: 
israel jews israeli armenian arab jewish people armenians turkish arabs war muslims muslim killed said state genocide palestinian peace palestinians government did world just armenia turks rights turkey israelis population soldiers like human lan