## **#04. Topic Modeling**
- Instructor: [Jaeung Sim](https://jaeungs.github.io/) (University of Connecticut)
- Course: OPIM 5671 Data Mining and Time Series Forecasting
- Last updated: February 11, 2025

**Objectives**
1. Understand topic modeling and its applications.
1. Exercise topic modeling using Latent Dirichlet Allocation (LDA) in Python.

**References**
* [Topic model (Wikipedia)](https://en.wikipedia.org/wiki/Topic_model)
* [A Deeper Meaning: Topic Modeling in Python](https://www.toptal.com/python/topic-modeling-python)
* [Hands-On Topic Modeling with Python](https://towardsdatascience.com/hands-on-topic-modeling-with-python-1e3466d406d7)
* [Disneyland Reviews at Kaggle (Data Source)](https://www.kaggle.com/datasets/arushchillar/disneyland-reviews)
* [A friendly guide to NLP: Bag-of-Words with Python example](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

#### **Part 1. Conceptual Background**

**A. Theoretical background**

A **topic model** is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

**Latent Dirichlet Allocation (LDA)**

LDA is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. It is a popular unsupervised machine learning model for topic modeling. It assumes that each topic is a probability distribution over words and that each document (in our case, each review) is a mixture of topics. LDA therefore tries to find the words that best describe each topic and the topics that best describe each review.

LDA uses the Dirichlet distribution, a generalization of the Beta distribution to $K \geq 2$ outcomes. A symmetric Dirichlet distribution $Dir(\alpha)$ with $\alpha < 1$ favors sparse outcomes, which is exactly how we want to represent topics and words in topic modeling: each document should be dominated by a few topics, and each topic by a few words. As you can see below, with $\alpha < 1$ the circles sit on the sides and corners, well separated from each other (in other words, sparse), whereas with $\alpha > 1$ the circles cluster in the center, very close to each other and difficult to distinguish. You can imagine these circles as topics.

![image](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*DTtEsh9WDcC8sSPA6dZSIg.jpeg)
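The effect of $\alpha$ on sparsity can be checked numerically. The sketch below is a minimal illustration using only the standard library; the helper names `sample_dirichlet` and `mean_largest_share` are hypothetical, and the values $\alpha = 0.1$ and $\alpha = 10$ are arbitrary choices for the demonstration. It draws many samples from a symmetric Dirichlet over three outcomes and reports how large the biggest component is on average.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=3, n_samples=2000, seed=42):
    """Average size of the largest component across many samples."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)]
    return sum(largest) / n_samples

sparse = mean_largest_share(alpha=0.1)   # alpha < 1: mass piles onto one outcome
dense = mean_largest_share(alpha=10.0)   # alpha > 1: mass spreads almost evenly

print(f"alpha = 0.1  -> average largest share: {sparse:.2f}")
print(f"alpha = 10.0 -> average largest share: {dense:.2f}")
```

With a small $\alpha$ the largest share should be close to 1 (samples near a corner), while with a large $\alpha$ it should be close to $1/3$ (samples near the center).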

LDA uses two Dirichlet distributions where
* $K$ is the number of topics.
* $M$ denotes the number of documents.
* $N$ denotes the number of words in a given document.
* $Dir(\alpha)$ is the Dirichlet prior on the per-document topic distributions.
* $Dir(\beta)$ is the Dirichlet prior on the per-topic word distributions.
* $\theta_{i}$ is the topic distribution for document $i$.
* $\varphi_{k}$ is the word distribution for topic $k$.
* $z_{ij}$ is the topic for the $j$-th word in document $i$.
* $w_{ij}$ is the $j$-th word in document $i$.




If we bring all the pieces together, we get the formula below, which describes the joint probability of the words, topic assignments, and latent distributions: two Dirichlet distributions followed by multinomial distributions.

$P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{\varphi}; \alpha, \beta) = \prod_{k=1}^K P(\varphi_k; \beta) \prod_{i=1}^M P(\theta_i; \alpha) \prod_{j=1}^N P(z_{ij} \mid \theta_i) P(w_{ij} \mid \varphi_{z_{ij}})$

This suggests that you want to find the distributions and topic assignments that make the product of three probability measures as large as possible:
* $\prod_{k=1}^K P(\varphi_k; \beta)$: How plausible the estimated word distributions for topics are under the prior $Dir(\beta)$
* $\prod_{i=1}^M P(\theta_i; \alpha)$: How plausible the estimated topic distributions for documents are under the prior $Dir(\alpha)$
* $\prod_{j=1}^N P(z_{ij} \mid \theta_i) P(w_{ij} \mid \varphi_{z_{ij}})$: For the given distributions, how probable the topic assignments and the observed words in document $i$ are
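Reading the formula from left to right gives a recipe for generating documents. The sketch below simulates that recipe on a toy scale, assuming a hypothetical six-word vocabulary and arbitrary values for $K$, $M$, $N$, $\alpha$, and $\beta$; the function names are illustrative and not part of any library. Fitting LDA runs this story in reverse: it starts from the observed words and infers the hidden $\theta$, $\varphi$, and $z$.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet sample via normalized Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, K, M, N, alpha, beta, seed=0):
    """Simulate the LDA generative process for M documents of N words each."""
    rng = random.Random(seed)
    # phi_k ~ Dir(beta): one word distribution per topic
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(K)]
    docs, thetas = [], []
    for _doc in range(M):
        # theta_i ~ Dir(alpha): the topic mixture of this document
        theta = sample_dirichlet(alpha, K, rng)
        words = []
        for _pos in range(N):
            # z_ij ~ Multinomial(theta_i): pick a topic for this word slot
            z = rng.choices(range(K), weights=theta)[0]
            # w_ij ~ Multinomial(phi_z): pick a word from that topic
            words.append(rng.choices(vocab, weights=phi[z])[0])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

# Hypothetical toy vocabulary and settings, chosen only for illustration
toy_vocab = ["ride", "queue", "food", "price", "hotel", "staff"]
docs, thetas, phi = generate_corpus(toy_vocab, K=2, M=3, N=8, alpha=0.5, beta=0.5)

for theta, words in zip(thetas, docs):
    print([round(t, 2) for t in theta], " ".join(words))
```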

**B. Examples of Real-world Applications**

Please refer to the following papers:
* Jing Gong, Vibhanshu Abhishek, Beibei Li (2018) "Examining the Impact of Keyword Ambiguity on Search Advertising Performance: A Topic Modeling Approach," ***MIS Quarterly*** 42(3), pp. 805-829.
* Jorge Mejia, Shawn Mankad, Anandasivam Gopal (2021) "Service Quality Using Text Mining: Measurement and Consequences," ***Manufacturing & Service Operations Management*** 23(6), pp. 1354-1372.
* David Ardia, Keven Bluteau, Kris Boudt, Koen Inghelbrecht (2022) "Climate Change Concerns and the Performance of Green vs. Brown Stocks," ***Management Science***, forthcoming.



**C. Overview of Implementation**

1. Loading relevant libraries
1. Exploring data structures
1. Text parsing and filtering
1. Bag-of-Words
1. Determining the number of topics


#### **Part 2. Understanding the Data**

**Introduction to the Dataset**
* **Source:** Disneyland Reviews dataset at Kaggle (<https://www.kaggle.com/datasets/arushchillar/disneyland-reviews>)
* **About this file**
  * The dataset includes 42,000 reviews of 3 Disneyland branches (Paris, California, and Hong Kong) posted by visitors on Tripadvisor. Refer to the Kaggle page above for more details.
  * Column Description
    1. `Review_ID`: unique id given to each review
    1. `Rating`: ranging from 1 (unsatisfied) to 5 (satisfied)
    1. `Year_Month`: when the reviewer visited the theme park
    1. `Reviewer_Location`: country of origin of visitor
    1. `Review_Text`: comments made by visitor
    1. `Disneyland_Branch`: location of Disneyland Park

**Download data with Python code**

In [None]:
# Libraries for data downloading and processing
import numpy as np
import pandas as pd
import kagglehub
import os

In [None]:
# Libraries for visualization
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns
%matplotlib inline

plt.style.use("fivethirtyeight")
pd.set_option('display.max_colwidth', 80)

In [None]:
# Download latest version
path = kagglehub.dataset_download("arushchillar/disneyland-reviews")

print("Path to dataset files:", path)

**Deal with DataFrame**

In [None]:
# List files in the downloaded dataset directory
files = os.listdir(path)
print("Files in dataset:", files)

# Load the CSV file (assuming there's only one CSV file)
csv_file = [f for f in files if f.endswith('.csv')][0]  # Get the first CSV file
csv_path = os.path.join(path, csv_file)

In [None]:
# Read into DataFrame with the default encoding (UTF-8)
df = pd.read_csv(csv_path)  # Raises a decoding error: the file is not UTF-8 encoded

In [None]:
# Attempt with ISO-8859-1 encoding
df = pd.read_csv(csv_path, encoding="ISO-8859-1") # Or type: encoding="latin-1" / encoding="Windows-1252"
df.head()
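
The reason one encoding fails while the other works can be seen on a single word. The sketch below is a standalone illustration with a hypothetical example string, not taken from the dataset: in ISO-8859-1 the character "é" is stored as the single byte `0xE9`, which is not a valid sequence on its own in UTF-8.

```python
# "café" with é written as an escape; encoded as ISO-8859-1, é becomes the single byte 0xE9
raw = "caf\u00e9".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 failed:", err.reason)

# Decoding with the encoding that was used to write the bytes succeeds
decoded = raw.decode("iso-8859-1")
print("ISO-8859-1 succeeded:", decoded)
```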

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Check data types
df.info()

In [None]:
# Create a bar plot with value counts
sns.countplot(x='Rating', data=df)

#### **Part 3. Implementation**

**Objectives**
* Process text data with `nltk` library.
* Extract topics from text and determine how many topics to use.
* Explore topics with `pyLDAvis`.

**References**
* [A Deeper Meaning: Topic Modeling in Python](https://www.toptal.com/python/topic-modeling-python)
* [Hands-On Topic Modeling with Python](https://towardsdatascience.com/hands-on-topic-modeling-with-python-1e3466d406d7)
* [Disneyland Reviews at Kaggle (Data Source)](https://www.kaggle.com/datasets/arushchillar/disneyland-reviews)
* [Python | Lemmatization with NLTK](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)
* [A friendly guide to NLP: Bag-of-Words with Python example](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

##### **3.1. Loading NLP Libraries**

**A. Natural Language Processing Tools**

In [None]:
# NLP libraries
import nltk
import gensim
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

##### **3.2. Text Processing**

Here are a few additions to the text processing in earlier notebooks:
* Dealing with contractions
* Extending the stop word set by adding contextual terms

**Considering contractions in English**

In [None]:
# A dictionary of main contractions in English
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are",
"you've": "you have"
}

**Extending the stop word set by adding contextual terms**

In [None]:
# Define a basic stop word set
stop_words = set(stopwords.words('english'))

In [None]:
# Extend the stop word set
stop_words.update(['park', 'disney', 'disneyland']) # Context-specific stopwords

**Define a text processing function**

In [None]:
# Bring lemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
# Define a text pre-processing function
def process_text(text):
    # Lowercasing
    text = text.lower()

    # Remove URLs first; stripping punctuation beforehand would delete "://"
    # and the URL pattern would never match
    text = re.sub(r'https?://\S+', '', text)

    # Expand contractions (matches whitespace-delimited words only)
    text = " ".join(contractions.get(word, word) for word in text.split())

    # Remove punctuation and special characters (this also removes apostrophes)
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords & perform lemmatization
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)
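
Looking up contractions word by word after a whitespace split only works when the contraction stands alone: a token such as `(it's` or `don't,` is not a key in the dictionary and is left unexpanded. One possible alternative, sketched below with a hypothetical two-entry subset of the dictionary and a hypothetical helper name `expand_contractions`, is a single regular expression that matches contractions even when punctuation is attached.

```python
import re

# Hypothetical two-entry subset of the `contractions` dictionary
mini_contractions = {"don't": "do not", "it's": "it is"}

# One alternation of all keys (longest first), not preceded or followed by a word character
keys = sorted(mini_contractions, key=len, reverse=True)
pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)")

def expand_contractions(text):
    """Replace each contraction with its expansion, even next to punctuation."""
    return pattern.sub(lambda m: mini_contractions[m.group(1)], text.lower())

example = "Worth it? We don't think so (it's overpriced)."
print(expand_contractions(example))
# -> worth it? we do not think so (it is overpriced).
```

Note that this sketch only handles the straight apostrophe `'`; reviews typed with a curly apostrophe would need an extra normalization step.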

In [None]:
# Apply the pre-processing function
df['Review_Clean'] = df['Review_Text'].apply(process_text)
df['Review_Clean']

In [None]:
from collections import Counter

# Join all cleaned reviews into one string
# (space-separated, so the last word of one review is not glued to the first word of the next)
review_words = ' '.join(df['Review_Clean'].values)

# Count each word (stored as word_counts so the Counter class is not overwritten)
word_counts = Counter(review_words.split())
most_frequent = word_counts.most_common(30)

# Bar plot of the 30 most frequent words
fig = plt.figure(1, figsize=(20, 10))
freq_df = pd.DataFrame(most_frequent, columns=("words", "count"))
sns.barplot(x='words', y='count', data=freq_df, palette='winter')
plt.xticks(rotation=45);

In [None]:
# Generate the word cloud
wordcloud = WordCloud(background_color="white",
                      max_words= 200,
                      contour_width = 8,
                      contour_color = "steelblue",
                      collocations=False).generate(review_words)

# Visualize the word cloud
fig = plt.figure(1, figsize = (10, 10))
plt.axis('off')
plt.imshow(wordcloud)
plt.show()

##### **3.3. Bag-of-Words**

In order to use text as an input to machine learning algorithms, we need to present it in a numerical format. **Bag-of-words** is a vector space model and represents the occurrence of words in the document. In other words, bag-of-words converts each review into a collection of word counts without giving importance to the order or meaning.

Here are a few example sentences about Game of Thrones:
* Review 1: Game of Thrones is an amazing tv series!
* Review 2: Game of Thrones is the best tv series!
* Review 3: Game of Thrones is so great.

Each row corresponds to a different review, while the columns are the unique words contained in the three documents.

![Image](https://cdn-images-1.medium.com/max/1000/1*cHKkqYIhaYuYwuuhBiSlHw.png)
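
The table for the three example reviews can be rebuilt in a few lines. This is a minimal sketch using only the standard library; the simple tokenization (lowercasing and stripping `!` and `.`) is an assumption made for the example and is not the `process_text` pipeline used in this notebook.

```python
from collections import Counter

reviews = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great.",
]

# Lowercase, strip trailing punctuation, and split on whitespace
tokenized = [[w.strip("!.").lower() for w in r.split()] for r in reviews]

# Vocabulary: the sorted set of unique words across all reviews (the columns)
vocabulary = sorted({w for doc in tokenized for w in doc})

# One count vector per review (the rows)
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[w] for w in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Word order is lost in this representation: any reordering of the words in a review produces the same row.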


In [None]:
# Ensure 'Review_Clean_List' contains tokenized text (lists of words)
df['Review_Clean_List'] = df['Review_Clean'].apply(lambda x: x.split() if isinstance(x, str) else x)


In [None]:
df['Review_Clean_List']

In [None]:
# Create Dictionary
id2word = gensim.corpora.Dictionary(df['Review_Clean_List'])

# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in df['Review_Clean_List']]

##### **3.4. Determining the Number of Topics**

Deciding on the number of topics can be difficult. Since we have initial knowledge of the context, choosing a reasonable range to search is not too daunting. However, if the number is too large, the model might split a topic that is actually broader into fragments, and if the number is too small, the topics might share many overlapping words. For these reasons, we will compare candidate values using the topic coherence score, which measures how strongly the top words of each topic are related to one another; higher values are better.
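
To build intuition for what a coherence score rewards, the sketch below computes a simplified document co-occurrence score in the style of UMass coherence. This is a swapped-in illustration, not the `c_v` measure used in this notebook, and the toy corpus, the word lists, and the helper names are all hypothetical. The idea is the same, though: a topic scores higher when its top words tend to appear in the same documents.

```python
import math
from itertools import combinations

# Hypothetical toy corpus: each document reduced to its set of unique words
toy_docs = [
    {"ride", "queue", "wait"},
    {"ride", "queue", "fast"},
    {"food", "price", "expensive"},
    {"food", "price", "queue"},
]

def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(all(w in doc for w in words) for doc in toy_docs)

def umass_style_coherence(top_words):
    """Sum of log((D(later, earlier) + 1) / D(earlier)) over ordered word pairs."""
    return sum(
        math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
        for earlier, later in combinations(top_words, 2)
    )

related = ["ride", "queue", "wait"]   # words that tend to share documents
mixed = ["ride", "food", "price"]     # "ride" never co-occurs with the others

print("related:", round(umass_style_coherence(related), 3))
print("mixed:  ", round(umass_style_coherence(mixed), 3))
```

The related word list receives the higher score, which is the behavior we rely on when comparing models with different numbers of topics.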

In [None]:
# Compute coherence score
from gensim.models import CoherenceModel

# Fit one model per candidate number of topics, from 1 to 7 (roughly 1 minute each)
texts = df['Review_Clean_List'].tolist()  # plain list of token lists
number_of_topics = []
coherence_score = []
for i in range(1, 8):
    # random_state fixes the seed so repeated runs give the same topics
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, iterations=50,
                                                num_topics=i, random_state=42)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
    number_of_topics.append(i)
    coherence_score.append(coherence_model_lda.get_coherence())

In [None]:
# Create a dataframe of coherence score by number of topics
topic_coherence = pd.DataFrame({'number_of_topics':number_of_topics,
                                'coherence_score':coherence_score})

In [None]:
# Print a line plot
sns.lineplot(data=topic_coherence, x='number_of_topics', y='coherence_score')

In [None]:
# Explore the words occurring in each topic with their relative weights
# Note: lda_model here is the last model fitted in the loop above (7 topics)
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

##### **3.5. Getting Topic Weights with the Optimal Number**

In [None]:
# Define the LDA model (5 topics here; adjust to the number suggested by the coherence plot)
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, iterations=50, num_topics=5)

# Compute topic distributions for each document in the corpus
lda_topics = [lda_model.get_document_topics(doc, minimum_probability=0) for doc in corpus]

# Convert topic distributions into a DataFrame
topic_weights = pd.DataFrame([[topic_prob for _, topic_prob in doc] for doc in lda_topics],
                             columns=[f"Topic_{i}" for i in range(lda_model.num_topics)])

# Merge topic weights with original DataFrame
df_with_topics = pd.concat([df['Review_Clean_List'], topic_weights], axis=1)

In [None]:
# Check the results
df_with_topics

**Let's create some variables with topics!**

In [None]:
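# Names of the five topic-weight columns
topic_columns = ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3', 'Topic_4']

# Compute the mean and variance of the five topic weights for each review
# Note: the weights in each row sum to about 1, so Topic_Mean is about 0.2 for every review;
# Topic_Variance is the informative one (higher = review concentrated on fewer topics)
df_with_topics['Topic_Mean'] = df_with_topics[topic_columns].mean(axis=1)
df_with_topics['Topic_Variance'] = df_with_topics[topic_columns].var(axis=1)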

# Display the updated DataFrame
df_with_topics
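
When interpreting these two variables, keep in mind that the topic weights of a review sum to about 1, so their mean across $K$ topics is always about $1/K$ and carries little information, whereas the variance separates focused reviews from mixed ones. The sketch below illustrates this with two hypothetical weight vectors (not taken from the fitted model), using the sample variance, which is also the default in pandas.

```python
from statistics import mean, variance

# Hypothetical topic weights for two reviews, each summing to 1 across K = 5 topics
focused = [0.92, 0.02, 0.02, 0.02, 0.02]   # dominated by a single topic
mixed = [0.20, 0.20, 0.20, 0.20, 0.20]     # spread evenly over all topics

for name, weights in [("focused", focused), ("mixed", mixed)]:
    # statistics.variance is the sample variance (divides by n - 1)
    print(f"{name}: mean = {mean(weights):.2f}, variance = {variance(weights):.4f}")
```

Both reviews have the same mean, while the focused review has a much larger variance.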