# Week 5, Lesson 2, Activity 3: Topic clustering algorithm

&copy;2021, Ekaterina Kochmar \
(edited: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a clustering algorithm and apply it to the set of posts from the `20 Newsgroups` dataset as specified in this notebook.

## Step 1: Data loading

First, let's import the libraries that we are going to use in this notebook. Then, let's define a function to load the *training* and *test* subsets using a predefined list of categories. Note that the following options are also available:
- `load_dataset('all', categories)` loads both subsets (training and test) for the selected categories
- `load_dataset('train', None)` loads the training subset for all 20 topics

Note that you are working with the same dataset as last week.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

def load_dataset(a_set, cats):
    dataset = fetch_20newsgroups(subset=a_set, categories=cats,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True)
    return dataset

categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball"]
categories += ["rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_train = load_dataset(# load the training dataset 'train' with the selected categories, as before
                                )
newsgroups_test = load_dataset(# load the test dataset 'test' with the selected categories, as before
                                )

## Step 2: Data preprocessing

Now let's prepare the data for unsupervised approaches:

In [None]:
import random
random.seed(42)

all_news = list(zip(newsgroups_train.data, newsgroups_train.target))
all_news += list(zip(# similarly, add the data and target labels from the test set
                     ))
random.shuffle(all_news)

all_news_data = [text for (text, label) in all_news]
all_news_labels = [# similar to above, add labels here
                   ]

print("Data:")
print(str(len(all_news_data)) + " posts in "
      + str(np.unique(all_news_labels).shape[0]) + " categories\n")

print("Labels: ")
print(# print the first 10 labels
      )
num_clusters = np.unique(all_news_labels).shape[0]
print("Assumed number of clusters: " + str(num_clusters))
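
The pairing step above matters because shuffling texts and labels separately would break their alignment. A minimal sketch of the zip, shuffle, unzip pattern on made-up data (the names `toy_texts` and `toy_labels` and their values are hypothetical, and a local `random.Random` instance is used here instead of the global seed):

```python
import random

# Hypothetical toy data standing in for posts and their category ids.
toy_texts = ["post about cars", "post about hockey", "post about space", "post about medicine"]
toy_labels = [2, 5, 8, 7]

rng = random.Random(42)                    # local generator, so the global seed is untouched
paired = list(zip(toy_texts, toy_labels))  # keep each text attached to its label
rng.shuffle(paired)                        # shuffle the pairs in place

# Unzip back into two parallel lists that are still aligned.
shuffled_texts, shuffled_labels = (list(column) for column in zip(*paired))

for text, label in zip(shuffled_texts, shuffled_labels):
    print(label, text)
```

Because the pairs are shuffled as units, every text keeps its original label regardless of the final order.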

Since the original dimensionality of the data is too large for efficient clustering, let's reduce it using truncated [`Singular Value Decomposition`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD):

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2, max_df=0.5,
                             stop_words='english',
                             use_idf=True)

def transform(data, vectorizer, dimensions):
    trans_data = vectorizer.fit_transform(data)
    print("Transformed data contains: " + str(trans_data.shape[0]) +
          " posts with " + str(# return the number of columns
                               ) + " features =>")

    # See more examples here:
    # https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py
    svd = TruncatedSVD(n_components=dimensions)
    pipe = make_pipeline(svd, Normalizer(copy=False))
    reduced_data = # apply .fit_transform method to pipe, passing in trans_data as an argument

    return reduced_data, svd

reduced_data, svd = transform(all_news_data, vectorizer, 300)
print("Reduced data contains: " + str(reduced_data.shape[0]) +
      " posts with " + str(reduced_data.shape[1]) + " features") # this should tell you that reduced_data contains 300 "features"
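
To make the TF-IDF, SVD, and normalisation chain easier to picture, here is a small sketch on a made-up corpus. The four sentences and all `toy_*` names are assumptions for illustration only; the real notebook uses 300 dimensions, while this sketch uses 2 because the toy vocabulary is tiny.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus (not taken from 20 Newsgroups): two rough topics.
toy_posts = [
    "the car engine needs new oil",
    "the car has a fast engine",
    "the rocket reached orbit in space",
    "space rocket launch into orbit",
]

# Sparse TF-IDF matrix: one row per post, one column per vocabulary term.
toy_tfidf = TfidfVectorizer().fit_transform(toy_posts)

# Project onto 2 latent dimensions, then rescale every row to unit length.
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_pipe = make_pipeline(toy_svd, Normalizer(copy=False))
toy_reduced = toy_pipe.fit_transform(toy_tfidf)

print(toy_tfidf.shape, "->", toy_reduced.shape)
print("Row norms:", np.linalg.norm(toy_reduced, axis=1))
print("Explained variance kept:", toy_svd.explained_variance_ratio_.sum())
```

After `Normalizer`, every row has unit length, so Euclidean distances between rows depend only on the angle between them. That is why k-means on the normalised vectors behaves much like clustering by cosine similarity.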

## Step 3: Apply k-means clustering

Now, let's cluster the data using the [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm:

In [None]:
from sklearn.cluster import KMeans

def cluster(data, num_clusters):
    km = KMeans(n_clusters=num_clusters, init='k-means++', 
                max_iter=100, random_state=0)
    km.fit(data)
    return km

km = cluster(# apply to the relevant data structures
             )
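
For intuition about what k-means does, here is a minimal pure-Python sketch of the standard assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation: the function name `toy_kmeans`, the toy points, and the naive "first k points" initialisation (used instead of `k-means++`) are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def toy_kmeans(points, k, max_iter=100):
    """Illustrative k-means: returns (assignments, centroids)."""
    # Naive initialisation (assumption): use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Two well-separated hypothetical blobs in 2D.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
toy_assignments, toy_centroids = toy_kmeans(toy_points, k=2)
print(toy_assignments)
print(toy_centroids)
```

Note that both initial centroids start inside the same blob here, yet the loop still separates the two groups after a couple of iterations. On less cleanly separated data a poor start can lead to a worse solution, which is one motivation for the `k-means++` initialisation used in the cell above.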

## Step 4: Evaluate the results

Finally, let's evaluate the results. See the material from Lesson 3 for more insight into how to interpret them. What do the most informative words suggest about each cluster?

In [None]:
from sklearn import metrics

def evaluate(km, labels, svd):
    print("Clustering report:\n")
    print(f"* Homogeneity: {str(metrics.homogeneity_score(labels, km.labels_))}")
    print(f"* Completeness: {str(# print out completeness_score
                                 )}")
    print(f"* V-measure: {str(# print out v_measure_score
                              )}")

    print("\nMost discriminative words per cluster:")
    original_space_centroids = svd.inverse_transform(km.cluster_centers_)
    order_centroids = original_space_centroids.argsort()[:, ::-1]

    terms = vectorizer.get_feature_names_out()  # relies on the fitted global `vectorizer`
    for i in range(km.n_clusters):
        print("Cluster " + str(i) + ": ")
        cl_terms = ""
        for ind in order_centroids[i, :50]:
            cl_terms += terms[ind] + " "
        print(cl_terms + "\n")
        
evaluate(# apply to the relevant data structures
         )

print("\nCategories:")
for category in newsgroups_train.target_names:
    print("*", category)
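
To help interpret the three scores, here is a pure-Python sketch that computes them from entropies, assuming the usual definitions: homogeneity is 1 - H(C|K)/H(C), completeness is 1 - H(K|C)/H(K), and V-measure is their harmonic mean, where C are the true classes and K the predicted clusters. The function names (`entropy`, `conditional_entropy`, `toy_v_measure`) and the toy labellings are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter
from math import log


def entropy(items):
    """Shannon entropy H(X) of a list of labels."""
    n = len(items)
    return -sum((c / n) * log(c / n) for c in Counter(items).values())


def conditional_entropy(targets, given):
    """Conditional entropy H(targets | given) for two parallel label lists."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def toy_v_measure(classes, clusters):
    """Return (homogeneity, completeness, v_measure) for two labellings."""
    h_c, h_k = entropy(classes), entropy(clusters)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


toy_classes = [0, 0, 1, 1]
print(toy_v_measure(toy_classes, [1, 1, 0, 0]))  # same grouping, cluster ids renamed
print(toy_v_measure(toy_classes, [0, 0, 0, 0]))  # everything in one cluster
print(toy_v_measure(toy_classes, [0, 1, 2, 3]))  # every post in its own cluster
```

The three toy cases show the trade-off: lumping everything together is perfectly complete but not homogeneous, while giving each post its own cluster is perfectly homogeneous but only partly complete. The scores do not depend on which id a cluster receives, only on how the posts are grouped.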