Problem Statement
---

The objective is to extract meaningful topics using two different topic modelling approaches: LDA and BERTopic. Your task is to identify thematic structures within the movie synopses (use the synopsis column in the attached data) and compare the topics generated by the traditional method (LDA) with those produced by the more recent, embedding-based method (BERTopic).

Instructions for Topic Modeling with LDA and BERTopic

1. Data Preprocessing
   - Clean the dataset by removing any special characters and numbers, perform tokenization, remove stop words, and apply stemming or lemmatization.
   - Ensure the data is in a suitable format for each model.

2. LDA Topic Modeling
   - Convert the cleaned text data into a useful format.
   - Train the LDA model, choosing an appropriate number of topics based on the dataset.
   - Interpret and label the topics using the top words associated with each topic from the LDA model.
   - Evaluate the LDA model performance using coherence scores and perplexity.

3. BERTopic Modeling
   - Install the BERTopic library if not already available in your environment.
   - Fit the BERTopic model to the raw text data, which will utilize BERT embeddings for creating semantically rich topics.
   - Interpret and label the topics using BERTopic's feature of extracting top words for each topic.
   - Evaluate the BERTopic model by analyzing topic coherence and stability (coherence score).

4. Comparison and Analysis
   - Compare the topics generated by LDA and BERTopic, discussing the differences in terms of coherence, interpretability, and the granularity of topics.
   - Analyze the practicality of both methods in terms of computational cost and ease of use.
   - Discuss which method produced more meaningful and distinct topics and hypothesize why that might be the case.

5. Conclusion
   - Conclude with a reflection on the benefits and limitations of each method.
   - Provide recommendations for which types of datasets each method might be better suited for.
   - Encourage discussion about the potential for combining both methods or using them in different stages of a larger data analysis pipeline.

In [212]:
import re
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm
import string 

import spacy
from wordcloud import WordCloud
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from bertopic import BERTopic

import ast
import gensim
from gensim import corpora

from bertopic import BERTopic

plt.style.use('rose-pine-moon')

In [183]:
data = pd.read_csv('movie_data.csv')
data.head()

Unnamed: 0,Movie,Genre,Runtime,Rating,Votes,Year,Synopsis,Actors,Certificate,Image
0,Gen V,"Action, Adventure, Comedy",,8.0,13679,2023–,"From the world of ""The Boys"" comes ""Gen V,"" wh...","Jaz Sinclair, Chance Perdomo, Lizze Broadway, ...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
1,Ahsoka,"Action, Adventure, Drama",,7.8,69947,2023–,"After the fall of the Galactic Empire, former ...","Rosario Dawson, David Tennant, Natasha Liu Bor...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
2,Loki,"Action, Adventure, Fantasy",53 min,8.2,359924,2021–,The mercurial villain Loki resumes his role as...,"Tom Hiddleston, Owen Wilson, Sophia Di Martino...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
3,The Wheel of Time,"Action, Adventure, Drama",60 min,7.1,125052,2021–,Set in a high fantasy world where magic exists...,"Rosamund Pike, Daniel Henney, Madeleine Madden...",,https://m.media-amazon.com/images/S/sash/4Fyxw...
4,One Piece,"Action, Adventure, Comedy",60 min,8.4,109063,2023–,"In a seafaring world, a young pirate captain s...","Iñaki Godoy, Emily Rudd, Mackenyu, Vincent Regan",,https://m.media-amazon.com/images/S/sash/4Fyxw...


In [184]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie        500 non-null    object 
 1   Genre        500 non-null    object 
 2   Runtime      450 non-null    object 
 3   Rating       475 non-null    float64
 4   Votes        500 non-null    object 
 5   Year         500 non-null    object 
 6   Synopsis     500 non-null    object 
 7   Actors       500 non-null    object 
 8   Certificate  8 non-null      object 
 9   Image        500 non-null    object 
dtypes: float64(1), object(9)
memory usage: 39.2+ KB


In [185]:
data.describe()

Unnamed: 0,Rating
count,475.0
mean,7.495158
std,0.985591
min,1.6
25%,7.1
50%,7.6
75%,8.2
max,9.4


In [186]:
data.describe(include = 'O')

Unnamed: 0,Movie,Genre,Runtime,Votes,Year,Synopsis,Actors,Certificate,Image
count,500,500,450,500,500,500,500,8,500
unique,483,56,75,475,234,499,500,6,1
top,Battlestar Galactica,"Animation, Action, Adventure",60 min,No votes yet,2023–,Plot kept under wraps.,"Jaz Sinclair, Chance Perdomo, Lizze Broadway, ...",15,https://m.media-amazon.com/images/S/sash/4Fyxw...
freq,3,145,77,25,50,2,1,2,500


Defining Preprocessing Functions
-

In [187]:
def cleanup_text(text):
    """
    Function to preprocess text and return words as a comma-separated string
    """
    # Step 1: Convert text to lowercase
    text = text.lower()

    # Step 2: Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Step 3: Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 4: Remove stopwords (common words like "the," "is," etc.)
    custom_stopwords = stopwords.words('english')
    
    custom_stopwords.extend(['from', 'subject', 're', 'edu', 'use', 'not', 'would', 'say', 'could', 'may', 'take',
                             '_', 'be', 'know', 'good', 'go', 'get', 'do', 'done', 'try', 'many', 'some', 'come'
                             'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot', 'lack', 'make', 
                             'want', 'seem', 'run', 'need', 'even', 'right', 'line', 'even', 'also'])

    text = " ".join([word for word in word_tokenize(text) if word.lower() not in custom_stopwords])

    # Step 5: Remove short words (length < 3)
    text = " ".join([word for word in word_tokenize(text) if len(word) >= 3])

    return text

 
def lemmatize_text(text):
    """
    Function for lemmatization
    """
    lemmatizer = WordNetLemmatizer()

    # Lemmatize each word in the text
    text = " ".join([lemmatizer.lemmatize(word) for word in nltk.word_tokenize(text)])

    return text



def preprocess_data(df, col, subset = 'all'):
    
    if 'preprocessed_text' not in df.columns: #<--- Check if 'preprocessed_text' column exists, if not, create it
        df['preprocessed_text'] = ""

    if subset != 'all':                  #<--- Select number of rows to perform the preprocessing on
        subset_df = df.iloc[:subset].copy()
    else:
        subset_df = df.copy()

    # Apply preprocessing to the subset of text data with progress bar
    preprocessed_texts = []

    for text in tqdm(subset_df[col]):
        # Step 6: Preprocess the text
        cleaned_text = cleanup_text(text) #<---invoking cleanup_text function to clean the text

        # Step 7: Lemmatize the preprocessed text
        lemmatized_text = lemmatize_text(cleaned_text) #<--- invoking the lemmatize_text function for lemmatization

        # Append the processed text to the list
        preprocessed_texts.append(lemmatized_text)

    # # Convert the list of preprocessed texts to a comma-separated string
    # preprocessed_texts_str = [text.replace(" ", ",") for text in preprocessed_texts]

    # Store preprocessed words as comma-separated strings in the 'preprocessed_text' column
    subset_df['preprocessed_text'] = preprocessed_texts
    print(f"{len(subset_df)} rows has been preprocessed")
    
    return subset_df

#### Data Preprocessing and data formatting

In [188]:
processed_ = preprocess_data(data,'Synopsis', subset ='all')

text_list = processed_['preprocessed_text'].values.tolist()
len(text_list)

text_to_list = []
for text in text_list:
    words = text.split()
    quoted_words = ["'{}'".format(word) for word in words]
    text_to_list.append("[{}]".format(', '.join(quoted_words)))

  0%|          | 0/500 [00:00<?, ?it/s]

500 rows has been preprocessed


In [189]:
word_list_of_lists = [ast.literal_eval(text) for text in text_to_list]

Now that we have prprocessed our data, lets try to implement BoW (bag of words) to get a dictionary that would represent words and the number of times appeared. and corpus which would be used in bulding our LDA model.

#### BoW Dictionary & Corpus

In [190]:
dictionary = corpora.Dictionary(word_lists)  # Dictionary of words (key) & number of times each word appeard (value)

dictionary.filter_extremes(no_below=5)   # filter out words that appeared inless than 5 sentences

# Convert the preprocessed word lists to bag_of_words representation
corpus = [dictionary.doc2bow(text) for text in word_list_of_lists]

#### Building Model

In [191]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                      id2word=dictionary,
                                      num_topics=10,   
                                      random_state=100,
                                      chunksize=100,
                                      passes=10,
                                      per_word_topics=True)

In [192]:
# Get the topic-word probabilities
topic_word_probs = lda_model.get_topics()

# Get the vocabulary from the id2word dictionary
vocab = list(dictionary.values())

n_top_words = 10
for i, topic_probs in enumerate(topic_word_probs):
    topic_words = np.array(vocab)[np.argsort(topic_probs)][:-(n_top_words + 1):-1]
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

Topic 0: mission | crew | galaxy | mystery | young | must | year | land | return | secret
Topic 1: two | day | team | boy | fight | evil | batman | crime | must | century
Topic 2: adventure | friend | new | life | one | series | way | game | year | love
Topic 3: series | earth | world | planet | adventure | find | home | one | time | war
Topic 4: world | dragon | human | find | life | follows | power | set | become | journey
Topic 5: epic | american | world | child | discovers | family | power | young | must | two
Topic 6: adventure | city | young | ninja | superheroes | new | art | martial | detective | skill
Topic 7: world | save | based | young | evil | group | battle | ancient | known | partner
Topic 8: band | school | named | high | find | help | group | star | girl | different
Topic 9: new | city | fight | life | power | mysterious | find | family | island | help


Based on the topics, here are the possible context.

- Topic 0: Galactic Mission
  
Keywords: mission, crew, galaxy, mystery, young, must, year, land, return, secret
Possible Interpretation: This topic seems to revolve around a space mission or adventure involving a crew exploring a galaxy, with elements of mystery and secrecy

- Topic 1: Dynamic Team and Crime Fighting

Keywords: two, day, team, boy, fight, evil, batman, crime, must, century
Possible Interpretation: This topic suggests a storyline involving a dynamic team, possibly led by a young boy, engaged in fighting crime and facing challenges across different time periods.

- Topic 2: Life's Adventure and Friendship

Keywords: adventure, friend, new, life, one, series, way, game, year, love
Possible Interpretation: This topic is about the adventures of a character or group of friends in a new phase of life, possibly exploring love and relationships.

- Topic 3: Time Travel and War

Keywords: series, earth, world, planet, adventure, find, home, one, time, war
Possible Interpretation: This topic suggests a storyline involving time travel, war, and the search for a home or a resolution in different worlds or planets.

- Topic 4: Fantasy World with Dragons and Power

Keywords: world, dragon, human, find, life, follows, power, set, become, journey
Possible Interpretation: This topic involves a fantasy world with dragons, humans, and a journey where characters seek power and transformation.

- Topic 5: Epic American Family Adventure

Keywords: epic, American, world, child, discovers, family, power, young, must, two
Possible Interpretation: This topic may depict an epic adventure involving an American family, where a child discovers power and must navigate challenges with another character.

- Topic 6: Adventure in the City with Superheroes

Keywords: adventure, city, young, ninja, superheroes, new, art, martial, detective, skill
Possible Interpretation: This topic suggests an adventure set in a city involving young characters, ninjas, superheroes, martial arts, and detective skills.

- Topic 7: World-saving Battle with Evil

Keywords: world, save, based, young, evil, group, battle, ancient, known, partner
Possible Interpretation: This topic revolves around a world-saving mission where young characters, possibly part of a group, engage in a battle against ancient evil forces.

- Topic 8: High School Band and Star Adventures

Keywords: band, school, named, high, find, help, group, star, girl, different
Possible Interpretation: This topic involves a narrative centered around a high school band, possibly named or unique, where characters find help and embark on adventures related to stars.

- Topic 9: Mysterious City Life

Keywords: new, city, fight, life, power, mysterious, find, family, island, help
Possible Interpretation: This topic suggests a storyline set in a new city where characters engage in fights, explore life, possess mysterious powers, and seek the support of family or help on an island.

#### Perplexity

In [193]:
log_perplexity = lda_model.log_perplexity(corpus)

# Convert log perplexity to perplexity
perplexity = 2**(-log_perplexity)
print('Perplexity: ', perplexity)

Perplexity:  72.17637793418177


In [195]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=word_list_of_lists, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.3861074002785128


The LDA model's perplexity of 72 suggests that, on average, the model assigns a lower probability to the observed data. This score is reasonable. The  coherence score of 39 indicates a moderate level of semantic similarity among high-scoring words within topics.

#### Creating a bertopicmodel

In [215]:
documents = processed_["preprocessed_text"].tolist()

# Create a document-term matrix using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

documents_as_strings = [" ".join(vectorizer.get_feature_names_out()[doc.indices]) for doc in X]

# Create and fit the BERTopic model
bertopic_model = BERTopic(nr_topics=10)
topics, _ = bertopic_model.fit_transform(documents_as_strings)

# Get the topics and top words for each topic
top_words = bertopic_model.get_topic_info()

# Print the topics and top words
for i, (topic, words) in enumerate(zip(topics, top_words["Representation"])):
    print(f"Topic {i}: {words[:10]}")

2023-11-27 22:12:50,047 - BERTopic - Transformed documents to Embeddings
2023-11-27 22:12:54,752 - BERTopic - Reduced dimensionality
2023-11-27 22:12:54,776 - BERTopic - Clustered reduced embeddings
2023-11-27 22:12:54,832 - BERTopic - Reduced number of topics from 8 to 8


Topic 0: ['adventure', 'world', 'life', 'must', 'series', 'monster', 'find', 'story', 'group', 'island']
Topic 1: ['agent', 'spy', 'international', 'secret', 'discovers', 'operative', 'become', 'secretly', 'cia', 'job']
Topic 2: ['superheroes', 'crime', 'city', 'superhero', 'team', 'fight', 'justice', 'adventure', 'hero', 'comic']
Topic 3: ['time', 'travel', 'present', 'world', 'around', 'space', 'people', 'traveler', 'life', 'day']
Topic 4: ['crew', 'galaxy', 'year', 'earth', 'planet', 'alien', 'us', 'space', 'mission', 'captain']
Topic 5: ['friend', 'adventure', 'family', 'girl', 'way', 'town', 'best', 'boy', 'series', 'group']
Topic 6: ['magic', 'demon', 'world', 'peace', 'boy', 'school', 'magician', 'ichigo', 'destiny', 'soul']
Topic 7: ['world', 'ninja', 'dragon', 'japan', 'quest', 'force', 'power', 'princess', 'great', 'warrior']


- Topic 0: Adventure World:

Keywords: world, life, adventure, monster, group, dangerous, island.
Possible Interpretation: This topic may represent movies involving adventurous journeys in a fantasy world with encounters of monsters and dangerous situations.

- Topic 1: International Spy Elite:

Keywords: agent, international, spy, secret, discovers, operative, elite, secretly, CIA.
Possible Interpretation: This topic suggests espionage and international intrigue, featuring secret agents, discoveries, and elite operations, possibly with a focus on the CIA.

- Topic 2: Time Travel Exploration:

Keywords: time, travel, present, world, space, traveler, day.
Possible Interpretation: This topic could indicate movies centered around time travel, exploring different time periods and the impact of time travel on the present world.

- Topic 3: Galactic Mission:

Keywords: crew, galaxy, planet, year, earth, mission, space, captain.
Possible Interpretation: This topic may represent space exploration and missions involving a crew, galaxy, and planets, led by a captain.

- Topic 4: Family Adventure Series:

Keywords: friend, adventure, family, series, girl, boy, town, best, new.
Possible Interpretation: This topic could encompass family-oriented adventure series, possibly involving friends, new experiences, and life in a town.

- Topic 5: Superhero Comic Justice:

Keywords: superheroes, city, crime, superhero, team, hero, fight, justice, adventure, comic.
Possible Interpretation: This topic suggests superhero-themed movies with a focus on combating crime, teamwork, and justice, potentially inspired by comic book stories.

- Topic 6: Ninja Samurai World:

Keywords: ninja, Japan, skill, samurai, Uzumaki, Naruto, world, leader, great, Hokage.
Possible Interpretation: This topic may represent movies set in a world of ninjas and samurais, with a notable reference to Naruto, a well-known ninja character.

- Topic 7: Magical Demon Land:

Keywords: world, demon, magic, princess, peace, king, ancient, friend, young, land.
Possible Interpretation: This topic could involve a magical world with demons, princesses, and a quest for peace, featuring ancient lands and friendships.

In [216]:
coherence_model_lda = CoherenceModel(model=model, texts=word_list_of_lists, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  1.0


Coherence scores closer to 1.0 suggest that the topics are well-separated and the words within each topic are highly associated. But here wehave 1.0 which means the model is perfectly separating words and this sort of suggests overfitting.  

In [228]:
# # Tuning
# documents = processed_["preprocessed_text"].tolist()

# # Create a document-term matrix using CountVectorizer
# vectorizer = CountVectorizer()
# X = vectorizer.fit_transform(documents)

# documents_as_strings = [" ".join(vectorizer.get_feature_names_out()[doc.indices]) for doc in X]

# # Create and fit the BERTopic model
# bertopic_model = BERTopic(
#     nr_topics=10,
#     top_n_words=10,   # Number of top words per topic
#     # nr_top_terms=15,  # Number of top terms to consider when calculating topic representation
#     n_gram_range=(1, 1),  # Adjust the n-gram range
#     min_topic_size=5,  # Minimum number of documents in a topic
#     # umap_args={'n_neighbors': 15, 'n_components': 5, 'metric': 'cosine'},  # Adjust UMAP parameters
#     # hdbscan_args={'min_cluster_size': 10, 'metric': 'euclidean'}  # Adjust HDBSCAN parameters
# )
# topics, _ = bertopic_model.fit_transform(documents_as_strings)

# # Get the topics and top words for each topic
# top_words = bertopic_model.get_topic_info()

# # Print the topics and top words
# for i, (topic, words) in enumerate(zip(topics, top_words["Representation"])):
#     print(f"Topic {i}: {words[:10]}")

- Coherence:
The coherence scores for the LDA model indicate a perplexity of 72.77 and a coherence score of 39. While perplexity is relatively high, the coherence score suggests that the topics generated by LDA might have some overlapping terms, impacting the overall interpretability. On the other hand, BERTopic consistently produces a coherence score of 1.0, indicating non-overlapping and well-defined topics. This suggests that BERTopic excels in generating topics that are distinct and easily interpretable.

- Interpretability:
LDA assigns probabilities to each word in the vocabulary for its association with a particular topic. Topics are then represented by the most probable words. While this provides a certain level of interpretability, the coherence score suggests that there might be some ambiguity in the topics. In contrast, BERTopic represents topics by extracting the most representative words in the cluster. This approach potentially captures more contextually relevant words, enhancing the overall interpretability of topics.

- Granularity of Topics:
The granularity of topics refers to how specific and detailed the identified themes are. LDA, with a chosen number of 10 topics, tends to generate broader themes. The interpretability of these topics may be affected by a certain level of generality. BERTopic, leveraging BERT embeddings, has the potential to capture more nuanced and specific topics due to its contextual understanding. The high coherence score further supports the notion that BERTopic produces well-separated and granular topics.

In summary, the coherence, interpretability, and granularity of topics suggest that BERTopic outperforms LDA in this particular context. The consistently high coherence score indicates that the topics generated by BERTopic are not only more interpretable but also more distinct and granular, providing a richer representation of the underlying thematic structures within the movie synopses.

#### Conclusion

- LDA:
Latent Dirichlet Allocation (LDA) is a robust and computationally efficient method for topic modeling. It is particularly well-suited for large datasets, providing probabilistic insights into word-topic associations. However, its reliance on bag-of-words representations and the assumption of a static topic distribution may limit its effectiveness in capturing dynamic or nuanced semantic relationships.

- BERTopic:
BERTopic, leveraging BERT embeddings, excels in capturing context and semantic nuances. It offers highly coherent and interpretable topics, even with smaller datasets. While computationally more demanding and requiring a pre-trained BERT model, BERTopic's ability to understand word context enhances its performance in complex thematic structures.

- Recommendations:
For larger datasets where computational efficiency is crucial, LDA is a solid choice. It performs well when a static topic distribution assumption is reasonable. On the other hand, BERTopic is recommended for smaller datasets where capturing context and semantic relationships is a priority, offering enhanced interpretability and coherence.

Combining Both Methods:
A strategic combination of LDA and BERTopic in different stages of analysis can leverage their respective strengths. LDA can provide an initial broad overview of topics, and BERTopic can refine and enhance the granularity of identified themes. This integrated approach allows for a comprehensive and nuanced understanding of complex datasets.