<div class="alert alert-info">
    
➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**.  You normally shouldn't need to modify any of the other cells.

</div>

# L4: Clustering and Topic Modelling

Text clustering groups documents in such a way that documents within a group are more &lsquo;similar&rsquo; to other documents in the cluster than to documents not in the cluster. The exact definition of what &lsquo;similar&rsquo; means in this context varies across applications and clustering algorithms.

In this lab you will experiment with both hard and soft clustering techniques. More specifically, in the first part you will use the $k$-means algorithm, and in the second part you will use a topic model based on Latent Dirichlet Allocation (LDA).

In [None]:
# Define some helper functions that are used in this notebook

%matplotlib inline
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))

## Dataset 1: Hard clustering

The raw data for the hard clustering part of this lab is a collection of product reviews. We have preprocessed the data by tokenization and lowercasing.

In [None]:
import pandas as pd
import bz2

with bz2.open('reviews.json.bz2') as source:
    df = pd.read_json(source)

When you inspect the data frame, you can see that there are three labelled columns: `category` (the product category), `sentiment` (whether the product review was classified as &lsquo;positive&rsquo; or &lsquo;negative&rsquo; towards the product), and `text` (the space-separated text of the review).

In [None]:
pd.set_option('display.max_colwidth', None)
df.head()

## Problem 1: K-means clustering

Your first task is to cluster the product review data using a tf–idf vectorizer and $k$-means clustering.

### Task 1.1

Start by **performing tf–idf vectorization**. In connection with vectorization, you should also **filter out standard English stop words**. While you could use [spaCy](https://spacy.io/) for this task, here it suffices to use the word list implemented in [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

After running the following cell:
- `vectorizer` should contain the vectorizer fitted on `df['text']`
- `reviews` should contain the vectorized `df['text']`

In [None]:
vectorizer, reviews = ..., ...

# YOUR CODE HERE
raise NotImplementedError()

### Task 1.2

Next, **write a function to cluster the vectorized data.**  For this, you can use scikit-learn’s [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class, which has several parameters that you can tweak, the most important one being the _number of clusters_.  Your function should therefore take the number of clusters as an argument; you can leave all other parameters at their defaults. 
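
If you have not used `KMeans` before, the following sketch shows its basic interface on four made-up 2D points rather than on the review data. The toy data and variable names are assumptions for illustration. `random_state` and `n_init` are set here only to make the sketch reproducible, not because the task requires them.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four toy points forming two well-separated groups (invented for illustration)
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# random_state and n_init are fixed only to make this sketch reproducible
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

print(toy_kmeans.labels_)               # cluster index assigned to each point
print(toy_kmeans.cluster_centers_)      # one centroid per cluster
print(np.bincount(toy_kmeans.labels_))  # number of points per cluster
```

The attributes `labels_` and `cluster_centers_` are available on any fitted `KMeans` object, whether the input is a dense array or a sparse tf–idf matrix.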

In [None]:
from sklearn.cluster import KMeans

def fit_kmeans(data, n_clusters):
    """Fit a k-means model to some data.

    Arguments:
        data: The vectorized data to fit the model on.
        n_clusters (int): The number of clusters.

    Returns:
        The fitted k-means model.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

To sanity-check your clustering, **create a bar plot** with the number of documents per cluster:

In [None]:
import matplotlib.pyplot as plt

def plot_cluster_size(kmeans):
    """Produce & display a bar plot with the number of documents per cluster.

    Arguments:
        kmeans: The fitted k-means model.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell shows how your code should run.  The output of the cell should be the bar plot of the cluster sizes.  Note that sizes may vary considerably between clusters and among different random seeds, so there is no single “correct” output here!  Re-run the cell a couple of times to observe how the plot changes.

In [None]:
kmeans = fit_kmeans(reviews, 3)
plot_cluster_size(kmeans)

## Problem 2: Summarising clusters

Once you have a clustering, you can try to see whether it is meaningful. One useful technique in that context is to **generate a “summary”** for each cluster by extracting the $n$ highest-weighted terms from the centroid of each cluster. Your next task is to implement this approach.
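
The idea of picking the highest-weighted terms of a centroid can be sketched on a hand-made example. The weights and terms below are invented and do not come from a fitted model. In the lab, the centroids and the vocabulary would instead come from your fitted k-means model and vectorizer.

```python
import numpy as np

# Invented centroid weights: 2 clusters (rows) x 4 vocabulary terms (columns)
toy_centroids = np.array([
    [0.1, 0.7, 0.2, 0.0],
    [0.5, 0.0, 0.1, 0.4],
])
toy_terms = np.array(["battery", "camera", "lens", "plot"])

top_n = 2
toy_summaries = []
for centroid in toy_centroids:
    # argsort is ascending, so reverse it and keep the first top_n indices
    top_indices = np.argsort(centroid)[::-1][:top_n]
    toy_summaries.append([str(term) for term in toy_terms[top_indices]])

print(toy_summaries)  # [['camera', 'lens'], ['battery', 'plot']]
```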

In [None]:
import numpy as np

def compute_cluster_summaries(kmeans, vectorizer, top_n):
    """Compute the top_n highest-weighted terms from the centroid of each cluster.

    Arguments:
        kmeans: The fitted k-means model.
        vectorizer: The fitted vectorizer; needed to obtain the actual terms
                    belonging to the items in the cluster.
        top_n: The number of terms to return for each cluster.

    Returns:
        A list of length k, where k is the number of clusters. Each item in the list
        should be a list of length `top_n` with the highest-weighted terms from that
        cluster.  Example:
          [["first", "foo", ...], ["second", "bar", ...], ["third", "baz", ...]]
    """
    # YOUR CODE HERE
    raise NotImplementedError()

### 🤞 Test your code

The following cell runs your code with `top_n=10` and prints the summaries:

In [None]:
summaries = compute_cluster_summaries(kmeans, vectorizer, 10)

for idx, terms in enumerate(summaries):
    print(f"Cluster {idx}: {', '.join(terms)}")

Once you have computed the cluster summaries, take a minute to reflect on their quality. Is it clear what the reviews in a given cluster are about? Do the cluster summaries contain any unexpected terms?

## Problem 3: Evaluate clustering performance

In some scenarios, you may have gold-standard class labels available for at least a subset of your documents.  In our case, we could use the gold-standard categories (from the `category` column) as class labels.  This means we’re making the assumption that a “good” clustering should put texts into the same cluster _if and only if_ they belong to the same category.

If we have such class labels, we can compute a variety of performance measures to see how well our $k$-means clustering resembles the given class labels.  Here, we will consider three of these measures: the **Rand index (RI)**; the **adjusted Rand index (ARI)**, which has been corrected for chance; and the **V-measure**.  For all of them (and more), we can make use of [implementations by scikit-learn](https://scikit-learn.org/1.5/modules/clustering.html#clustering-performance-evaluation).
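
To get a feeling for how the three measures behave, here is a small sketch on hand-made label lists. The lists are invented for illustration and are unrelated to the review data.

```python
from sklearn.metrics import rand_score, adjusted_rand_score, v_measure_score

# Invented gold labels and two invented clusterings of four items
gold = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]         # same grouping, different cluster ids
single_cluster = [0, 0, 0, 0]  # everything placed in one cluster

toy_scores = {}
for name, predicted in [("renamed", renamed), ("single_cluster", single_cluster)]:
    toy_scores[name] = (
        rand_score(gold, predicted),
        adjusted_rand_score(gold, predicted),
        v_measure_score(gold, predicted),
    )
    print(name, toy_scores[name])
```

The renamed clustering scores 1.0 on all three measures, because the measures compare groupings rather than cluster ids. The single-cluster solution still gets a Rand index of 1/3, while its ARI and V-measure are (approximately) 0. This is one reason to look at more than one measure.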

Your task is to **compare the performance** of different $k$-means clusterings with $k = 1, \ldots, 10$ clusters.  As your evaluation data, use the _first 1000 documents_ from the original data set along with their gold-standard categories (from the `category` column).

**Visualise your results as a line plot**, where

- the $x$-axis corresponds to $k$
- the $y$-axis corresponds to the score of the evaluation measure
- each evaluation measure (RI, ARI, V) is shown by a differently-colored and/or -styled line in the plot

In [None]:
from sklearn.metrics import rand_score, adjusted_rand_score, v_measure_score
import seaborn as sns
sns.set()

# YOUR CODE HERE
raise NotImplementedError()

Remember that you may get different clusters each time you run the $k$-means algorithm, so re-run your solution above a few times to see how the results change.  Take a moment to think how you would interpret these results; you will need this for the reflection.

## Dataset 2: Topic modelling

The data set for the topic modelling part of this lab is the collection of all [State of the Union](https://en.wikipedia.org/wiki/State_of_the_Union) addresses from the years 1975–2000. These speeches come as a single text file with one sentence per line. The following code cell prints the first 5 lines from the data file:

In [None]:
from itertools import islice

with open('sotu_1975_2000.txt') as source:
    # Print the first 5 lines only
    for line in islice(source, 5):
        print(line.rstrip())

Take a few minutes to think about what topics you would expect in this data set.

## Problem 4: Train a topic model

In this problem, we will train an LDA model on the State of the Union&nbsp;(SOTU) dataset. For this, we will be using [spaCy](https://spacy.io/) and the [gensim](https://radimrehurek.com/gensim/) topic modelling library.


### Task 4.1: Preparing the data

Start by **preprocessing the data** using spaCy as follows:

- Filter out stop words, non-alphabetic tokens, and tokens less than 3 characters in length.
- Store the documents as a nested list where the first level of nesting corresponds to the sentences and the second level corresponds to the tokens in each sentence.
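
The token attributes relevant for this kind of filtering can be tried out on a single sentence. The sketch below uses a made-up sentence and, as an assumption to keep it lightweight, a blank English pipeline (tokenizer only) instead of the `en_core_web_sm` model used in the lab. The names `toy_nlp`, `toy_sentence`, and `kept` are hypothetical.

```python
import spacy

# A blank English pipeline (tokenizer only); an assumption made to keep this
# sketch lightweight. The lab itself loads "en_core_web_sm".
toy_nlp = spacy.blank("en")

# Invented example sentence
toy_sentence = "We will reduce oil imports by 1 million barrels per day."

# Keep tokens that are not stop words, are purely alphabetic,
# and are at least 3 characters long
kept = [
    token.text
    for token in toy_nlp(toy_sentence)
    if not token.is_stop and token.is_alpha and len(token) >= 3
]
print(kept)
```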

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
def load_and_preprocess_documents(filename="sotu_1975_2000.txt"):
    """Load and preprocess all documents in the given file.

    The preprocessing must filter out stop words, non-alphabetic tokens,
    and tokens less than 3 characters in length.

    Returns:
        A list of length n, where n is the number of documents.
        Each item in the list should be a list of tokens in the given
        document, after preprocessing.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

Test your preprocessing by running the following cell. It will output the tokens (after preprocessing) for an example document and compare them against the expected output.

In [None]:
documents = load_and_preprocess_documents()

print(f"Document 42 after preprocessing: {' '.join(documents[42])}")
assert " ".join(documents[42]) == "reduce oil imports million barrels day end year million barrels day end"
success()

### Task 4.2: Training LDA

Now that we have the list of documents, we can use gensim to train an LDA model on them.  Gensim works a bit differently from scikit-learn and has its own interfaces, so you should skim the section [“Pre-process and vectorize the documents”](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents) of the documentation to learn how to create the dictionary and the vectorized corpus representation required by gensim.

Based on this, **write code to train an [LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html)** with $k=10$ topics, using default values for all other parameters.

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

def train_lda_model(documents, num_topics, passes=1):
    """Create and train an LDA model.

    Arguments:
        documents: The preprocessed documents, as produced in Task 4.1.
        num_topics: The number of topics to generate.
        passes: The number of training passes. Defaults to 1; you will need
                this later for Problem 5.

    Returns:
        The trained LDA model.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

Run the following cell to test your code and print the topics:

In [None]:
model = train_lda_model(documents, 10)
model.print_topics()

Inspect the topics. Can you &lsquo;label&rsquo; each topic with a short description of what it is about? Do the topics match your expectations?

## Problem 5: Monitor a topic model for convergence

When learning an LDA model, it is important to make sure that the training algorithm has converged to a stable posterior distribution. One way to do so is to plot, after each training epoch (or &lsquo;pass&rsquo;, in gensim parlance), the log likelihood of the training data under the posterior. Your last task in this lab is to create such a plot and, based on it, to suggest an appropriate number of epochs.

To collect information about the posterior likelihood after each pass, we need to enable the logging facilities of gensim. Once this is done, gensim will add various diagnostics to a log file `gensim.log`.

In [None]:
import logging

logging.basicConfig(filename='gensim.log', format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)

def clear_logfile():
    # To empty the log file
    with open("gensim.log", "w"):
        pass

The following function will parse the generated logfile and return the list of log likelihoods.

In [None]:
import re

def parse_logfile():
    """Parse gensim.log to extract the log-likelihood scores.

    Returns:
        A list of log-likelihood scores.
    """
    matcher = re.compile(r'(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity')
    likelihoods = []
    with open('gensim.log') as source:
        for line in source:
            match = matcher.search(line)
            if match:
                likelihoods.append(float(match.group(1)))
    return likelihoods

Here is an example of how to run it. Note that we call `clear_logfile()` to empty the log file before training the model. If your code from Problem&nbsp;4 is correct, the result should be a list with a single log-likelihood score, since we are doing a single training pass:

In [None]:
clear_logfile()
model = train_lda_model(documents, 10, passes=1)
likelihoods = parse_logfile()
print(likelihoods)
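
If you want to see what the regular expression in `parse_logfile` picks up, the sketch below applies the same pattern to a single log line. The line is made up to resemble the style of gensim's INFO messages. Its exact wording and its values are assumptions, not real output.

```python
import re

# Same pattern as in parse_logfile
matcher = re.compile(r'(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity')

# Made-up log line in the style of gensim's INFO output (values are invented)
sample_line = (
    "2024-01-01 12:00:00,000:INFO:-8.512 per-word bound, "
    "365.1 perplexity estimate based on a held-out corpus of "
    "100 documents with 1000 words"
)

match = matcher.search(sample_line)
per_word_bound = float(match.group(1))  # the value parse_logfile collects
perplexity = float(match.group(2))      # captured by the pattern but not used
print(per_word_bound, perplexity)
```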

### Task 5.1: Plotting log-likelihoods

Your task now is to **re-train your LDA model for 50&nbsp;passes**, retrieve the list of log likelihoods, and **create a plot** from this data.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Task 5.2: Interpreting log-likelihoods

How do you interpret the plot you produced in Task 5.1? Based on the plot, what would be a reasonable choice for the number of passes? **Retrain your LDA model with that number** and re-inspect the topics it finds.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Individual reflection

<div class="alert alert-info">
    <strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below.  Remember:
    <ul>
        <li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
        <li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
    </ul>
</div>

1. In Problem&nbsp;3, you performed an evaluation of $k$-means clustering with different values for $k$.  How do you interpret the results?  What would you expect to be a “good” number of clusters for this dataset?  What do the evaluation measures suggest would be a “good” number of clusters?
2. How did you choose the number of LDA passes in Task&nbsp;5.2?  Do you consider the topic clusters you got in Task&nbsp;5.2 to be “better” than the ones from Task&nbsp;4.2?  Base your reasoning on one or more concrete examples from the LDA output.

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors.  For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>