In [1]:
import numpy as np
import random
# Set seed for reproducibility
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

## Introduction

In this week's tutorial we will work with __topic modeling__, an __unsupervised__ method for text analysis. Because manual annotation of data is very time-consuming, there is naturally much more raw data than annotated data.
But we already know one method to analyze unlabeled data: __clustering__. Topic modeling can be seen as a special case of clustering. It is theoretically grounded in the distributional hypothesis of linguistics, which states that words occurring in similar contexts tend to have similar meanings. These co-occurrence patterns can be interpreted as topics and used to cluster documents.

We are still working with the same dataset as before (IMDB) and since this dataset contains annotations (sentiment scores), we will finally use the created topic model as new features for predicting the sentiment.

## Data

The dataset we will use contains movie reviews from IMDB. Initially the data is stored as a dataframe with three columns (id, sentiment_human, text).

*Run the code below.*

In [2]:
import pandas as pd
#Loading the data from a csv file
reviews = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/imdb_sample.csv")

### Prepare data for classifier

For our classifier we only need the ground-truth sentiment (the `sentiment_human` column, which we again recode from 'positive'/'negative' to 1/0) and the topic columns, which we create later in this tutorial.

*Run the code below.*

In [3]:
reviews['sentiment_positive'] = np.where(reviews['sentiment_human'] == 'positive', 1, 0)

### Split Dataset into Train- and Testset

We create a train and a test set so that we can later use the topics as input for a machine learning model.

*Run the code below.*

In [4]:
from sklearn.model_selection import train_test_split

#define X and y
X = reviews.drop(columns=['id','sentiment_human','sentiment_positive'])
y = reviews['sentiment_positive']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

## Preprocessing

Since unstructured data doesn't have an inherent and consistent structure we have to perform some preprocessing steps in order to make the data usable for the computer.
One thing to keep in mind is that the more preprocessing we perform the more information we lose, but the basic methods we are using here require it.

### Tokenize documents
First, we tokenize the texts. This means we transform the texts from one long string to a list of tokens.

### Stem all words
After tokenizing the texts we perform stemming (alternatively lemmatization could be performed). Stemming reduces every word to its stem.
The stemmer we use here is the Porter Stemmer.

### Remove stopwords
Finally, we remove stopwords: very common words that carry little meaning on their own (e.g. 'this', 'the', 'a').

Additionally, we strip punctuation characters from the tokens.

*Run the code below.*

In [None]:
# Preprocessing
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the tokenizer models and the stopword list
nltk.download('punkt_tab')
nltk.download('stopwords')

# Create the stemmer and the stopword set once instead of on every call
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

# Define a function with all necessary preprocessing steps for our IMDB reviews. As in week 9 we use stemming again.
def preprocess(text):
    # tokenize the text
    tokens = nltk.word_tokenize(text)

    # stem each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]

    # remove stopwords (note: stemming may alter some stopwords, which then no longer match the list)
    filtered_tokens = [token for token in stemmed_tokens if token.lower() not in stop_words]

    # remove punctuation. Unlike last week we use a regex, which also removes
    # punctuation attached to words (e.g., "hello," or "test.")
    cleaned_tokens = [re.sub(r'[^\w\s]', '', token) for token in filtered_tokens]

    # drop tokens that are empty after punctuation removal (e.g., "," or "...")
    return [token for token in cleaned_tokens if token]


### Apply preprocessing

After defining the preprocessing steps, we now apply them to the train and test sets of our IMDB reviews. The code below applies the preprocess function to the "text" column of each set and saves the preprocessed reviews in a new column named "tokens".

In [None]:
# Apply text preprocessing to the train and test set
X_train['tokens'] = X_train['text'].apply(preprocess)
X_test['tokens'] = X_test['text'].apply(preprocess)

X_train['tokens'].iloc[0]  # Display first processed review

### Remove irrelevant words
In this case, we manually remove specific words that are irrelevant to the analysis. The words we want to remove are "movie" and "film". As we already stemmed the original tokens during preprocessing, we need to use the stemmed versions of these two words ('movi' and 'film').

As the preprocess function returns a list of tokens instead of an entire string, we can't use the replace function from Week 10. String replacement works on substrings of a single string, not on the individual elements of a list, so it won't work as expected here. Instead, we need to process each list individually. The code below uses the apply function to iterate through each list in the column and filters out the unwanted tokens.
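
To make the difference concrete, here is a minimal sketch with made-up stemmed tokens (the token list is an assumption for illustration, not taken from the dataset). It contrasts substring replacement on a joined string with filtering whole tokens:

```python
# Toy stemmed tokens (made up); 'filmmak' should survive the cleanup.
tokens = ['great', 'filmmak', 'film', 'movi', 'plot']
unwanted = {'movi', 'film'}

# Naive approach: join to one string and replace substrings.
# This also cuts 'film' out of 'filmmak'.
naive = ' '.join(tokens).replace('film', '').replace('movi', '')

# Token-level approach: compare whole tokens only.
filtered = [token for token in tokens if token not in unwanted]

print(naive.split())
print(filtered)
```

Substring replacement damages longer tokens that merely contain an unwanted word, whereas the list comprehension only drops exact matches.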

*Run the code below.*

In [None]:
# Remove additional irrelevant words (movie, film). 
X_train['tokens'] = X_train['tokens'].apply(lambda tokens: [token for token in tokens if token not in {'movi', 'film'}])
X_test['tokens'] = X_test['tokens'].apply(lambda tokens: [token for token in tokens if token not in {'movi', 'film'}])

X_train['tokens'].iloc[0]

## Topic Modeling

In this section you will learn how to use topic modeling as an unsupervised method to analyse text data.

### Prepare the tokens

In order to use the preprocessed tokens for our topic modeling, we first need to prepare the tokens and create a dictionary and a corpus.
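
As a rough intuition for what the dictionary and the corpus contain, the following pure-Python sketch mimics the idea on toy documents. The helpers `build_vocabulary` and `toy_doc2bow` are hypothetical and are not Gensim's implementation; they only illustrate the token-to-id mapping, the document-frequency filter, and the bag-of-words format.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration)
docs = [
    ['great', 'act', 'great', 'plot'],
    ['bad', 'act', 'bore', 'plot'],
    ['great', 'music'],
]

def build_vocabulary(documents, no_below=1):
    """Map tokens to integer ids, keeping tokens found in at least `no_below` documents."""
    doc_freq = Counter(token for doc in documents for token in set(doc))
    kept = sorted(token for token, df in doc_freq.items() if df >= no_below)
    return {token: idx for idx, token in enumerate(kept)}

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

token2id = build_vocabulary(docs, no_below=2)
corpus = [toy_doc2bow(doc, token2id) for doc in docs]

print(token2id)
print(corpus)
```

Each document becomes a list of (token id, count) pairs, and rare tokens such as 'music' are dropped because they appear in too few documents.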

*Run the code below.*

In [12]:
from gensim import corpora

# Create dictionary and corpus for Gensim
dictionary = corpora.Dictionary(X_train['tokens'])
dictionary.filter_extremes(no_below=5)

# the corpus stores, for each review, how often each token appears; this is the input for the topic model
corpus_train = [dictionary.doc2bow(text) for text in X_train['tokens']]
corpus_test = [dictionary.doc2bow(text) for text in X_test['tokens']]

### Generate topic model (10 topics)

After preparing our dataset, we can now fit the topic model. We aim to find 10 topics in our corpus. A suitable number can be found through an iterative process: start with some number of topics, evaluate the resulting topic model (Are topics too close to each other or even overlapping? Are multiple topics contained within one big topic?), and adjust the number accordingly.

*Run the code below.*

In [14]:
from gensim.models.ldamodel import LdaModel

# Train LDA model
k=10
model_10 = LdaModel(corpus=corpus_train, num_topics=k, id2word = dictionary, iterations=100, random_state=42)


### Explore model
We can also explore the topic model by looking at the top words per topic.

*Run the code below.*

In [None]:
# Explore the topic model by printing the topics
for topic_id, topic in model_10.print_topics(num_topics=10, num_words=10):
    print(f"Topic {topic_id + 1}: {topic}")


#### Topic Prevalence

We can also have a closer look at the *overall topic prevalence*, which helps prioritize the most dominant topics in the corpus. This is particularly useful for:

1. Understanding the overall themes in your dataset.
2. Identifying which topics have the most impact. 

First we get an overview of how the topics are distributed over all reviews.

*Run the code below.*

In [None]:
# Get topic distributions for all reviews
topic_distribution = [model_10.get_document_topics(bow) for bow in corpus_train]

#print topic distributions for all reviews to get an overview
for i, doc_distribution in enumerate(topic_distribution):
    print(f"Review {i+1}:")
    for topic_id, prob in doc_distribution:
        print(f"  Topic {topic_id+1}: {prob:.4f}")
    print("\n")

We can also take a closer look at which reviews are most strongly associated with a specific topic.

*Run the code below.*

In [None]:
topic_id = 8  # For example, Topic 9 (adjust for zero-based indexing)
n = 2  # Number of top documents to retrieve

# Store document relevance scores
doc_scores = []

# Extract the probability for the target topic
for i, doc_distribution in enumerate(topic_distribution):
    topic_prob = dict(doc_distribution).get(topic_id, 0)
    doc_scores.append((i, topic_prob))

# Sort documents by their relevance to the target topic
sorted_docs = sorted(doc_scores, key=lambda x: x[1], reverse=True)

# Get the top `n` documents
top_docs = sorted_docs[:n]

# Print the results
print(f"Top {n} documents for Topic {topic_id+1}:")
for doc_index, prob in top_docs:
    print(f"Document {doc_index+1}: Score = {prob:.4f}")
    print(f"Text: {X_train['text'].iloc[doc_index]}")
    print("\n")

We now calculate the overall topic prevalence. To do so, we first define a function that sums each topic's probability over all documents (defining it works just like the preprocess function from the beginning). Afterwards we apply this function to our model to get the topic prevalence.
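
The idea can be sketched on toy data first. The distributions below are made up and use the sparse (topic_id, probability) format. Dividing the sums by the number of documents is an optional extra step, not part of the tutorial's function, that turns them into average shares.

```python
# Toy per-document topic distributions as (topic_id, probability) pairs;
# the numbers are made up for illustration.
toy_distributions = [
    [(0, 0.7), (2, 0.3)],
    [(1, 0.9), (2, 0.1)],
    [(0, 0.5), (1, 0.5)],
]
num_topics = 3

# Sum the probabilities per topic across all documents
prevalence = [0.0] * num_topics
for doc_distribution in toy_distributions:
    for topic_id, prob in doc_distribution:
        prevalence[topic_id] += prob

# Dividing by the number of documents gives the average share per topic
shares = [total / len(toy_distributions) for total in prevalence]

print(prevalence)
print(shares)
```

The raw sums grow with the corpus size, so the normalized shares are easier to compare across corpora of different sizes.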

*Run the code below.*

In [19]:
# define a function to calculate topic prevalence across the corpus
def calculate_topic_prevalence(lda_model, corpus):

    # Initialize an array to store the prevalence of each topic
    topic_prevalence = np.zeros(lda_model.num_topics)
    
    # Get topic distribution for each document and sum the probabilities for each topic
    for doc in corpus:
        topic_distribution = lda_model.get_document_topics(doc)
        for topic_id, prob in topic_distribution:
            topic_prevalence[topic_id] += prob  # Sum the probabilities for each topic
    
    return topic_prevalence

In [20]:
#Apply defined topic prevalence function and calculate topic prevalence for your topic model (model_10)
topic_prevalence = calculate_topic_prevalence(model_10, corpus_train)

We can now get an overview of the prevalence of each topic. For a better understanding and comparison we sort the topics according to their topic prevalence.

*Run the code below.*

In [None]:
# Sort the topics by prevalence in descending order
sorted_topic_prevalence = sorted(enumerate(topic_prevalence), key=lambda x: x[1], reverse=True)

# Print the sorted topic prevalence
print("Topic Prevalence (Sorted):")
for topic_id, prevalence in sorted_topic_prevalence:
    print(f"Topic {topic_id+1}: {prevalence:.2f}")

#### Plot word cloud for a single topic

Additionally we can plot word clouds for single topics.

*Run the code below.*

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Function to generate word cloud for a specific topic
def generate_wordcloud_for_topic(lda_model, topic_id, num_words):
    # Get the top words for the specified topic
    topic_words = lda_model.show_topic(topic_id, topn=num_words)
    
    # Prepare the words and their probabilities for the word cloud
    word_freq = {word: prob for word, prob in topic_words}
    
    # Generate the word cloud with the specified word frequencies
    wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
    
    # Plot the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')  # Hide the axes
    plt.title(f"Word Cloud for Topic {topic_id+1}")
    plt.show()

# Example: Generate word cloud for Topic 1
generate_wordcloud_for_topic(model_10, topic_id=0, num_words=50)


#### Print reviews most associated with a single topic

Finally, we can print the reviews that are most associated with a single topic. To do so, we first define a function to get the reviews per topic.

*Run the code below.*

In [55]:
#define a function to get the reviews per topic
def get_reviews_by_topic(model, corpus, reviews, topic_id, threshold=0.5):
    selected_reviews = []

    # Iterate over the corpus and their corresponding document-topic distributions
    for doc_idx, doc_topics in enumerate(model.get_document_topics(corpus)):
        # Check the contribution of the specified topic
        for topic, proportion in doc_topics:
            if topic == topic_id and proportion >= threshold:
                selected_reviews.append(reviews['text'].iloc[doc_idx])
                break  # Stop checking other topics for this document

    return selected_reviews

We can then apply the defined function to get the reviews for one specific topic. In this case, we will have a closer look at Topic 1 (the topic_id for Topic 1 is '0').

*Run the code below.*

In [None]:
#filtering on one specific topic ID and apply the function
topic_id = 0
reviews_for_topic = get_reviews_by_topic(model_10, corpus_train, X_train, topic_id, threshold=0.5)

# Limit the number of reviews displayed to 10
reviews_to_display = reviews_for_topic[:10]

#print the reviews so that we can read them
print(f"Topic:{topic_id+1}\n")
for review in reviews_to_display:
    print(f"{review}\n")
    print("-" * 80)  # Separator line for better readability

## Supervised Sentiment Analysis using Topic Modeling Features

In the previous section we learned how we can use topic modeling as an unsupervised method to analyse textual data. Now we will use the created topic model as an additional feature in our supervised Sentiment Analysis with the aim to classify reviews as either positive or negative.

### Store the per-document topic distributions in a dataframe for subsequent analysis.
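
Topic distributions come as sparse (topic_id, probability) pairs, and topics below the minimum probability threshold are left out. A dataframe needs one value per topic in every row, so the pairs have to become fixed-length rows. The sketch below shows one way to do this by filling values by topic id. The helper `to_dense` and the toy data are assumptions for illustration.

```python
def to_dense(doc_topics, num_topics):
    """Turn sparse (topic_id, probability) pairs into a fixed-length list."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = prob
    return row

# Toy sparse distributions over 4 topics (made up for illustration)
sparse_docs = [
    [(0, 0.8), (3, 0.2)],
    [(1, 0.25), (2, 0.75)],
]

columns = [f'Topic{i+1}' for i in range(4)]
rows = [to_dense(doc, num_topics=4) for doc in sparse_docs]

print(columns)
for row in rows:
    print(row)
```

Filling by topic id keeps every row aligned with the column names even if a topic is missing from a document's list.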

*Run the code below.*

In [27]:
# Function to extract per-document topic distributions
def get_document_topic_distribution(model, corpus):
    """Get topic distributions for each document in a corpus."""
    return pd.DataFrame(
        [
            [prob for _, prob in model.get_document_topics(doc, minimum_probability=0)]
            for doc in corpus
        ],
        columns=[f'Topic{i+1}' for i in range(model.num_topics)]
    )

train_topic_distributions = get_document_topic_distribution(model_10, corpus_train)
test_topic_distributions = get_document_topic_distribution(model_10, corpus_test)

In [None]:
train_topic_distributions

### Random Forest

#### Train random forest classifier

We train a random forest classifier on the training set (without hyperparameter tuning) to classify the sentiment based on the per-document topic distributions.

*Run the code below.*

In [29]:
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier
rf_topicmodel = RandomForestClassifier(random_state=42).fit(train_topic_distributions, y_train)

#### Make predictions and calculate evaluation metrics on test set

Similarly to last week, we can make predictions on the test set and calculate different evaluation metrics.
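
As a reminder of what these metrics measure, here is a small hand-computed sketch on made-up labels and predicted probabilities. It thresholds the probabilities at 0.5 and counts the four cells of the confusion matrix.

```python
# Made-up ground truth and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.35, 0.8, 0.6, 0.1]

# Turn probabilities into binary predictions with a 0.5 threshold
preds = [1 if p > 0.5 else 0 for p in probs]

# Count the four cells of the confusion matrix
tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)

# Accuracy is the share of correct predictions
accuracy = (tp + tn) / len(y_true)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}, accuracy={accuracy:.2f}")
```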

*Run the code below.*

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score

predictions_testset_rf_topicmodel = rf_topicmodel.predict_proba(test_topic_distributions)[:, 1]
predictions_testset_rf_topicmodel_binary = np.where(predictions_testset_rf_topicmodel > 0.5, 1, 0)

# Calculate Accuracy

accuracy_rf = accuracy_score(y_test, predictions_testset_rf_topicmodel_binary)
print("Accuracy (Random Forests):", accuracy_rf)

# Create the confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, predictions_testset_rf_topicmodel_binary)

#### ROC and AUC

Plot ROC curve and calculate AUC on test set.
The __binary__ predictions correspond to a single threshold, so the ROC curve consists of just a few straight line segments. The classification __probability__ provides many thresholds and therefore traces the curve in more detail. That is why we use the classification probability (e.g., predictions_testset_rf_topicmodel) to calculate the AUC.
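
The effect can be illustrated on toy data. The sketch below computes the AUC as the share of (positive, negative) pairs that are ranked correctly, with ties counted as half. The function `auc` and the numbers are assumptions for illustration, not the scikit-learn implementation.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
probs = [0.2, 0.6, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking between 0.6, 0.7 and 0.9
binary = [1 if p > 0.5 else 0 for p in probs]

auc_probs = auc(y_true, probs)
auc_binary = auc(y_true, binary)

print("AUC with probabilities:", auc_probs)
print("AUC with binary predictions:", auc_binary)
```

The probabilities rank every positive review above every negative one, but after thresholding some of them are tied, so the AUC based on binary predictions is lower.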

*Run the code below.*

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay

# Calculate and Print the AUC score
auc_score = roc_auc_score(y_test, predictions_testset_rf_topicmodel)
print("AUC Score:", auc_score)

#plot ROC curve
RocCurveDisplay.from_predictions(y_test, predictions_testset_rf_topicmodel, plot_chance_level=True)

### CART

Instead of training a random forest we can also try out a basic decision tree and compare their performances. To do so, we simply grow a decision tree on the same training data as before and evaluate it on the same test data.

#### Train a simple classification tree and visualize it

*Run the code below.*

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Train Decision Tree classifier
cart_topicmodel = DecisionTreeClassifier(min_samples_split=60, min_samples_leaf=20, max_depth=5, random_state=42).fit(train_topic_distributions, y_train)

# Visualize Decision Tree
plt.figure(figsize=(20,10))
plot_tree(cart_topicmodel, feature_names=train_topic_distributions.columns, impurity=False, filled=True, rounded=True, proportion=True, class_names=True, fontsize=10)
plt.show()

#### Make predictions and calculate evaluation metrics on test set

After calculating the decision tree's accuracy on the test set, we can see that there isn't a big difference in predictive performance. However, the decision tree is a lot easier for a human to interpret than the random forest, which is (without further analysis) a black box for us.

*Run the code below.*

In [None]:
predictions_testset_cart_topicmodel = cart_topicmodel.predict_proba(test_topic_distributions)[:, 1]
predictions_testset_cart_topicmodel_binary = np.where(predictions_testset_cart_topicmodel > 0.5, 1, 0)

# Calculate Accuracy
accuracy_cart = accuracy_score(y_test, predictions_testset_cart_topicmodel_binary)
print("Accuracy (Decision Tree):", accuracy_cart)

# Create the confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, predictions_testset_cart_topicmodel_binary)

#### ROC and AUC

Plot ROC curve and calculate AUC on test set.
As with the random forest, we use the classification __probability__ (e.g., predictions_testset_cart_topicmodel) rather than the __binary__ predictions to calculate the AUC, because the probabilities provide more thresholds and therefore trace the curve in more detail.

*Run the code below.*

In [None]:
# Calculate and Print the AUC score
auc_score = roc_auc_score(y_test, predictions_testset_cart_topicmodel)
print("AUC Score:", auc_score)

#plot ROC curve
RocCurveDisplay.from_predictions(y_test, predictions_testset_cart_topicmodel, plot_chance_level=True)

## Summary

To sum up, let us have a look at what we did in this week's tutorial:

+ First, we learned how to use topic modeling as an unsupervised method to analyse textual data.
+ Second, we used the created topic model as an additional feature in our supervised sentiment analysis with the aim of classifying reviews as either positive or negative.

You can use the cell below to perform and evaluate different sentiment analyses.

In [None]:
# Enter your code here!