#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Classify reviews according to emergent topics (Topic Modeling)`

#### Group:
- Dinis Fernandes #20221848
- Dinis Gaspar #20221869
- Inês Santos #20221916
- Luis Davila #20221949
- Sara Ferrer #20221947

#### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [0. Literature Review](#0-literature-review)
- [1. Imports](#1-imports)
- [2. Data Understanding](#2-data-understanding)
- [3. Explanatory Data Analysis (EDA)](#3-explanatory-data-analysis-eda)
- [4. Reviews Preprocessing](#4-reviews-preprocessing)
- [5. Perform Latent Semantic Analysis (LSA)](#5-perform-latent-semantic-analysis-lsa)
    - [5.1 Using sklearn](#51-using-sklearn)
    - [5.2 Using gensim](#52-using-gensim)
- [6. Perform Latent Dirichlet Allocation (LDA)](#6-perform-latent-dirichlet-allocation-lda)
    - [6.1 Using sklearn](#61-using-sklearn)
    - [6.2 Using gensim](#62-using-gensim)
- [7. Topic Model using BERTopic](#7-topic-model-using-bertopic)
- [8. Conclusion](#8-conclusion)

## 0. Literature Review

[1] Baby, Anusuya. (2023). Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis. 10.48550/arXiv.2308.11520. 

## 1. Imports
  
[Back to TOC](#toc)

In [1]:
%load_ext autoreload
%autoreload 2

# General-purpose
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

#Preprocessing
from utils import pipeline_v2
from sklearn.decomposition import TruncatedSVD
from gensim import corpora

#Vectorization
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#Topic Modelling
from gensim.models import LsiModel, LdaModel
from bertopic import BERTopic
from sklearn.decomposition import LatentDirichletAllocation
from gensim.corpora import Dictionary


#Assessment
from gensim.models.coherencemodel import CoherenceModel

In [2]:
import warnings
warnings.filterwarnings('ignore')

## 2. Data Understanding

  
[Back to TOC](#toc)

Loading the dataset:

In [3]:
reviews = pd.read_csv("data/10k_reviews.csv").rename({'Review':'raw_review'}, axis=1)
reviews.drop(columns = ["Restaurant","Reviewer", "Metadata", "Time", "Pictures","Rating"], axis = 1, inplace=True)

reviews

Unnamed: 0,raw_review
0,"The ambience was good, food was quite good . h..."
1,Ambience is too good for a pleasant evening. S...
2,A must try.. great food great ambience. Thnx f...
3,Soumen das and Arun was a great guy. Only beca...
4,Food is good.we ordered Kodi drumsticks and ba...
...,...
9995,Madhumathi Mahajan Well to start with nice cou...
9996,This place has never disappointed us.. The foo...
9997,"Bad rating is mainly because of ""Chicken Bone ..."
9998,I personally love and prefer Chinese Food. Had...


In [None]:
reviews_copy = reviews.copy()

## 3. Explanatory Data Analysis (EDA)
  
[Back to TOC](#toc)

In [4]:
reviews_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   raw_review  9955 non-null   object
dtypes: object(1)
memory usage: 78.3+ KB


In [5]:
reviews_copy.dropna(inplace=True)

Earlier analysis showed that two reviews, one made of gibberish and one made of a single repeated word, were distorting the topic models, so we drop them:

In [6]:
review_at_1564 = reviews_copy.loc[1564, 'raw_review']
review_at_2337 = reviews_copy.loc[2337, 'raw_review']

# Print the reviews
print(f"Review at index 1564:\n{review_at_1564}\n")
print(f"Review at index 2337:\n{review_at_2337}")

Review at index 1564:
Hhsjoibohoogogigivigigu8gihohohohphpjpjpjjohohohohohohohohohohojojojpjpjpjohpjpjohohohohhjohojpjojohohohohohhohohohojojojojohohohigufufyfyfufufugkbkhkhkgigkghighihhohohih

Review at index 2337:
good good goodgoodgoodgood goodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgoodgood


In [7]:
# Drop the two noisy reviews (index labels 1564 and 2337)
reviews_copy = reviews_copy.drop(index=[1564, 2337])

# Reset the index so that labels are consecutive again
reviews_copy = reviews_copy.reset_index(drop=True)

## 4. Reviews Preprocessing

[Back to TOC](#toc)

Create a function to remove some noise:

In [8]:
def clean_review(text):
    # Delete every character that is not a letter or whitespace (digits, punctuation, emoji)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove words of one or two characters, which are mostly noise
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Collapse runs of three or more identical characters to two
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # Remove words made of a single letter repeated three or more times
    text = re.sub(r'\b([a-zA-Z])\1{2,}\b', '', text)
    return text.strip()
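
To see what each substitution does, the sketch below applies the same four patterns to a made-up review. It also illustrates a side effect: because non-letters are deleted rather than replaced by a space, words separated only by punctuation are fused (as with `goodwe` in the cleaned reviews). The function `clean_review_spaced` and the sample sentence are hypothetical and are not part of the notebook's pipeline.

```python
import re

def clean_review(text):
    """Same four substitutions as the notebook's function."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\b\w{1,2}\b', '', text)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    text = re.sub(r'\b([a-zA-Z])\1{2,}\b', '', text)
    return text.strip()

def clean_review_spaced(text):
    """Hypothetical variant: non-letters become spaces so words stay separate."""
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)    # punctuation and digits -> space
    text = re.sub(r'\b\w{1,2}\b', '', text)     # drop words of 1-2 characters
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # collapse runs of 3+ identical characters
    return re.sub(r'\s+', ' ', text).strip()    # collapse whitespace

sample = "Food is good.we loved it!!!"  # made-up review
fused = clean_review(sample)
spaced = clean_review_spaced(sample)
print(repr(fused))
print(repr(spaced))
```

Whether the spaced variant is preferable depends on the downstream stop-word list, so it is shown only as a point of comparison.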

In [9]:
reviews_copy["raw_review"] = reviews_copy["raw_review"].map(clean_review)
reviews_copy

Unnamed: 0,raw_review
0,The ambience was good food was quite good had...
1,Ambience too good for pleasant evening Servi...
2,must try great food great ambience Thnx for th...
3,Soumen das and Arun was great guy Only becaus...
4,Food goodwe ordered Kodi drumsticks and baske...
...,...
9948,Madhumathi Mahajan Well start with nice court...
9949,This place has never disappointed The food th...
9950,Bad rating mainly because Chicken Bone found...
9951,personally love and prefer Chinese Food Had be...


Preprocessing and Vectorization:

In [10]:
# Create full preproc column with pipeline_v2
preprocessor = pipeline_v2.MainPipeline(lemmatized=True, custom_stopwords=["good", "super","excelent","awesome","delicious","amazing"]).main_pipeline
reviews_copy["preproc_review"] = reviews_copy["raw_review"].map(lambda content: preprocessor(content))

# Tokenized preprocessor
tokenized_preprocessor = pipeline_v2.MainPipeline(lemmatized=True,
                                                  tokenized_output=True,
                                                  custom_stopwords=["good", "super","excelent","awesome","delicious","amazing"]).main_pipeline
reviews_copy["tokenized_review"] = reviews_copy["raw_review"].map(lambda content: tokenized_preprocessor(content))

# Doc2Vec preprocessor
doc2vec_preprocessor = pipeline_v2.MainPipeline(lemmatized=False, no_stopwords=False, lowercase=False).main_pipeline
reviews_copy["doc2vec_review"] = reviews_copy["raw_review"].map(lambda content: doc2vec_preprocessor(content))

# Vectorize using BOW
bow_vectorizer = CountVectorizer(ngram_range=(1, 1), token_pattern=r"(?u)\b\w+\b") 
reviews_copy_bow_td_matrix = bow_vectorizer.fit_transform(reviews_copy["preproc_review"]).toarray()
reviews_copy["bow_vector"] = reviews_copy_bow_td_matrix.tolist()

# Vectorize using TF-IDF
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r"(?u)\b\w+\b")  
reviews_copy_tfidf_td_matrix = tfidf_vectorizer.fit_transform(reviews_copy["preproc_review"]).toarray()
reviews_copy["tfidf_vector"] = reviews_copy_tfidf_td_matrix.tolist()

# Vectorize using Doc2Vec
d2v = Doc2Vec
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(reviews_copy["doc2vec_review"])]
d2v_model = d2v(documents, vector_size=100, window=6, min_count=1, workers=4, epochs=20)
reviews_copy["doc2vec_vector"] = [d2v_model.dv[idx].tolist() for idx in tqdm(range(len(reviews_copy)))]


100%|██████████| 9953/9953 [00:00<00:00, 243182.41it/s]


In [11]:
reviews_copy

Unnamed: 0,raw_review,preproc_review,tokenized_review,doc2vec_review,bow_vector,tfidf_vector,doc2vec_vector
0,The ambience was good food was quite good had...,ambience food quite saturday lunch cost effect...,"[ambience, food, quite, saturday, lunch, cost,...",The ambience was good food was quite good had ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.20979316532611847, 0.0457620695233345, -0...."
1,Ambience too good for pleasant evening Servi...,ambience pleasant evening service prompt food ...,"[ambience, pleasant, evening, service, prompt,...",Ambience too good for pleasant evening Service...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.10242153704166412, 0.16454319655895233, -0..."
2,must try great food great ambience Thnx for th...,must try great food great ambience thnx servic...,"[must, try, great, food, great, ambience, thnx...",must try great food great ambience Thnx for th...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.19489170610904694, 0.1490982174873352, -0...."
3,Soumen das and Arun was great guy Only becaus...,soumen da arun great guy behavior sincerety fo...,"[soumen, da, arun, great, guy, behavior, since...",Soumen das and Arun was great guy Only because...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.08144622296094894, 0.1950145810842514, -0.2..."
4,Food goodwe ordered Kodi drumsticks and baske...,food goodwe ordered kodi drumstick basket mutt...,"[food, goodwe, ordered, kodi, drumstick, baske...",Food goodwe ordered Kodi drumsticks and basket...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.001121580135077238, 0.03058222122490406, -..."
...,...,...,...,...,...,...,...
9948,Madhumathi Mahajan Well start with nice court...,madhumathi mahajan well start nice courteous s...,"[madhumathi, mahajan, well, start, nice, court...",Madhumathi Mahajan Well start with nice courte...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.21116457879543304, 0.32110172510147095, 0.3..."
9949,This place has never disappointed The food th...,place never disappointed food courteous staff ...,"[place, never, disappointed, food, courteous, ...",This place has never disappointed The food the...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.01284628827124834, 0.27450934052467346, -0..."
9950,Bad rating mainly because Chicken Bone found...,bad rating mainly chicken bone found veg food ...,"[bad, rating, mainly, chicken, bone, found, ve...",Bad rating mainly because Chicken Bone found V...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.18376021087169647, 0.4185904860496521, -0.1..."
9951,personally love and prefer Chinese Food Had be...,personally love prefer chinese food couple tim...,"[personally, love, prefer, chinese, food, coup...",personally love and prefer Chinese Food Had be...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.24170441925525665, -0.057226669043302536, ..."


BoW and TF-IDF document-term matrices and a Doc2Vec document-embedding matrix, used as input for topic modelling:

In [12]:
# Rebuild dense (documents x features) arrays from the list-valued columns
reviews_copy_bow_td_matrix = np.array(reviews_copy["bow_vector"].tolist())
reviews_copy_tfidf_td_matrix = np.array(reviews_copy["tfidf_vector"].tolist())
reviews_copy_doc2vec_td_matrix = np.array(reviews_copy["doc2vec_vector"].tolist())

Test document to illustrate the topic modelling results:

In [13]:
reviews_copy["raw_review"].iloc[10]

'The service was great and the food was awesome The service staff Manab and Papiya were very courteous and attentive  would like  come frequently  this place'

## 5. Perform Latent Semantic Analysis (LSA)

[Back to TOC](#toc)

### 5.1 Using sklearn

We choose the number of components that maximises the `c_v` coherence score, since our goal is to extract understandable and meaningful topics:

In [53]:
tokenized_docs = reviews_copy["tokenized_review"].tolist()  # Already-tokenized documents, needed for the coherence calculation
reviews_copy_dict = corpora.Dictionary(reviews_copy["tokenized_review"])  # Assigns a unique ID to each word in the corpus

In [54]:
# Vocabulary aligned with the columns of the BoW matrix
bow_vocab = bow_vectorizer.get_feature_names_out()

# Step 1: Initialize an empty list to store coherence scores for different n_components
coherence_scores = []

# Step 2: Loop through different values of n_components
for n_components in range(5, 101, 5):  # n_components from 5 to 100 in steps of 5
    # Perform LSA
    lsa = TruncatedSVD(n_components=n_components)
    lsa_result = lsa.fit_transform(reviews_copy_bow_td_matrix)
    
    # Step 3: Get topics (the 10 highest-loading words of each component)
    topics_words = []
    for component in lsa.components_:
        topic_words = [bow_vocab[index] for index in np.argsort(component)[-10:]]
        topics_words.append(topic_words)
    
    # Step 4: Calculate coherence score for the current n_components
    coherence_model = CoherenceModel(topics=topics_words, texts=tokenized_docs, dictionary=reviews_copy_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    
    # Store the coherence score
    coherence_scores.append((n_components, coherence_score))

# Step 5: Find the optimal number of components (n_components with highest coherence score)
optimal_n_components = max(coherence_scores, key=lambda x: x[1])
print(f"Optimal number of components: {optimal_n_components[0]} with coherence score: {optimal_n_components[1]}")


Optimal number of components: 10 with coherence score: 0.5101721407952204


Using Sklearn implementation of TruncatedSVD to perform LSA:

In [15]:
lsa = TruncatedSVD(n_components=10) 
lsa_result = lsa.fit_transform(reviews_copy_bow_td_matrix)

In [16]:
lsa_result.shape

(9953, 10)

Topic that contributes the most to the test document:

In [18]:
lsa_result[10]

array([ 1.66698894, -1.23066459, -0.21339529,  0.2619197 ,  1.0443001 ,
        0.7002964 ,  0.20882635, -0.49386726,  0.12955836,  0.1354248 ])

In [19]:
test_topic = np.argmax(lsa_result[10])  # index of the component with the largest score
test_topic

0

Extracting and Mapping Word Contributions to Topics:

In [20]:
lsa.components_.shape

(10, 16935)

In [77]:
bow_vocab = bow_vectorizer.get_feature_names_out()

In [78]:
word_topic_dict = dict(zip(bow_vocab,[lsa.components_[:,i] for i in range(len(bow_vocab))]))

Topic loadings associated with the word "chicken":

In [69]:
word_topic_dict["chicken"]

array([ 0.29840079,  0.6390906 , -0.03460895,  0.48515163,  0.03007336,
        0.13089254, -0.25775635, -0.15720506,  0.10963389,  0.03609643])

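- We can see that "chicken", for example, has its largest loading (0.639) on the topic at index 1. LSA loadings are not probabilities: they can be negative, and only their relative magnitude is meaningful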
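
A minimal sketch of how the strongest topic for a word can be read off its loadings, using the values printed above for "chicken". The helper `strongest_topic` is a hypothetical name introduced here; it compares absolute values because the sign of an LSA component is arbitrary.

```python
# Loadings of "chicken" on the 10 LSA components, copied from the output above
chicken_loadings = [0.29840079, 0.6390906, -0.03460895, 0.48515163, 0.03007336,
                    0.13089254, -0.25775635, -0.15720506, 0.10963389, 0.03609643]

def strongest_topic(loadings):
    """Return (index, loading) of the component with the largest absolute loading."""
    index = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return index, loadings[index]

topic_index, loading = strongest_topic(chicken_loadings)
print(topic_index, loading)
```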

How much each token contributes to each topic:

In [75]:
topic_word_dict = [{word : value for word, value in zip(bow_vocab,component)} for component in lsa.components_]

In [26]:
topic_df = pd.DataFrame(topic_word_dict)
topic_df

Unnamed: 0,aachar,aachari,aalishaan,aalishaanthis,aalloo,aalo,aaloo,aalu,aam,aamirs,...,zomos,zomoto,zomtato,zomto,zone,zoneincrease,zonequality,zoomato,zucchini,zyada
0,5.5e-05,0.000207,0.000315,0.000315,2.9e-05,1.501548e-05,0.000215,9.5e-05,5e-05,4.6e-05,...,8.2e-05,0.000121,3.2e-05,0.000113,0.000752,7.951168e-06,4.6e-05,8.2e-05,0.000619,3.160273e-07
1,-5.7e-05,0.000724,0.000675,0.000675,-6e-06,2.015317e-05,0.00021,-2e-06,3e-05,2.1e-05,...,-0.000106,0.000292,-1.8e-05,-5.9e-05,-0.000155,9.355188e-07,5.9e-05,-0.000183,0.000915,-3.273058e-07
2,-0.000229,1.6e-05,0.000196,0.000196,-0.000111,-9.442795e-07,-0.000181,0.000197,7.5e-05,-7.5e-05,...,2.8e-05,-0.000159,-0.000129,6.4e-05,-5e-06,-2.682442e-05,-6.2e-05,-0.000204,0.000243,2.379159e-07
3,-7e-06,0.000439,-3.5e-05,-3.5e-05,-0.000109,-9.624364e-05,-0.000437,-0.000462,-0.000216,-0.000173,...,-6.9e-05,-0.000736,-0.000281,9.7e-05,-0.001346,3.119079e-06,-0.000346,-9.5e-05,-0.001244,-2.76938e-06
4,0.000292,0.0002,0.000589,0.000589,-0.000265,-6.302178e-05,2.4e-05,-4.3e-05,5e-05,-8e-06,...,0.000302,-0.000906,0.000102,0.000145,3e-06,8.787188e-05,6.4e-05,-0.000248,0.000153,-6.492189e-06
5,-6.7e-05,-0.000403,-0.001073,-0.001073,5e-06,-5.646006e-05,-0.000539,-0.000206,-0.000151,-0.000183,...,0.000231,0.000953,0.000579,-0.000269,-0.001948,0.0002477089,0.000252,0.000158,-0.001801,1.229322e-06
6,0.000318,-0.000268,-0.000565,-0.000565,7.2e-05,-3.796446e-05,0.000123,0.000229,-9e-05,-5.2e-05,...,-0.000125,0.000633,-0.00022,-8e-06,0.000347,0.0001667046,6.6e-05,-0.000368,0.000233,3.883008e-06
7,0.000122,0.000635,-0.000843,-0.000843,-0.000174,-1.491508e-05,0.000295,-0.000373,0.000115,-4.8e-05,...,0.000641,0.000186,-0.000123,-0.0002,0.000487,7.754325e-05,0.000658,0.000191,3e-05,-6.0288e-06
8,-0.000144,-0.000997,-0.000259,-0.000259,-0.000265,0.0001053204,-0.000189,-0.000257,-0.000153,3e-06,...,-0.000484,-1.2e-05,-0.000115,-4e-06,0.002199,0.0001229924,-4.9e-05,0.0001,-0.00058,-2.532925e-06
9,0.000135,0.000158,-0.001413,-0.001413,-0.000114,3.947409e-05,-5.7e-05,0.000232,-1.4e-05,-0.000112,...,6.9e-05,-0.000548,-6e-05,-3.8e-05,0.001641,-7.737922e-05,0.000173,0.000205,-0.000995,-3.840217e-06


Identify Key Words in test topic:

In [27]:
topic_tgt = topic_df.loc[test_topic]
topic_tgt = topic_tgt.sort_values(ascending=False)
topic_tgt

place                       4.362017e-01
food                        4.213584e-01
chicken                     2.984008e-01
service                     1.855055e-01
taste                       1.629927e-01
                                ...     
gg                         -6.437468e-26
yfjgz                      -7.855594e-26
foodjayantakushalhasebul   -2.165040e-24
ggyggyggyygghh             -2.165040e-24
servicehygenic             -2.165040e-24
Name: 0, Length: 16935, dtype: float64

- For example, "place", "food", "chicken", "service" and "taste" have the largest loadings and are therefore the most important words in the test topic

Explained variance ratio, to find out how much of the variance the topics explain:

In [28]:
np.sum(lsa.explained_variance_ratio_)

0.17760940780196632

Sklearn LSA singular values vector to understand the importance of each latent topic:


In [29]:
lsa.singular_values_

array([196.83623115, 100.80266338,  85.78483937,  68.65113284,
        61.18488604,  59.37372305,  56.55326012,  54.86806108,
        54.09057673,  51.58944232])

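From sklearn's implementation of LSA:
- The 10 components together explain only about 17.8% of the variance, so most of the variance in the BoW matrix is not captured by these topics
- Among the retained components the first one dominates: there is a large drop-off in the singular values after it (196.8 versus 100.8 for the second)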
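
To make the drop-off concrete, the sketch below computes the share of each squared singular value among the 10 retained components, using the values printed above. This is only a rough illustration: the shares are relative to the retained components, not to the total variance, and they are not the same quantity as `explained_variance_ratio_`, since `TruncatedSVD` does not center the data.

```python
# Singular values reported by lsa.singular_values_ above
singular_values = [196.83623115, 100.80266338, 85.78483937, 68.65113284,
                   61.18488604, 59.37372305, 56.55326012, 54.86806108,
                   54.09057673, 51.58944232]

squared = [s ** 2 for s in singular_values]
total = sum(squared)
shares = [sq / total for sq in squared]  # share among the 10 retained components only

for index, share in enumerate(shares):
    print(f"component {index}: {share:.3f}")
```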

### 5.2 Using gensim

We choose the number of topics that maximises the `c_v` coherence score, since our goal is to extract understandable and meaningful topics:

In [55]:
#Prepare the corpus
reviews_copy_corpus = [reviews_copy_dict.doc2bow(doc) for doc in reviews_copy["tokenized_review"]]

In [None]:
#List to store coherence scores for different numbers of topics
coherence_scores = []

#Loop through different numbers of topics
for num_topics in range(5, 51): 
    # Train the LSI model with the current number of topics
    lsi_model = LsiModel(reviews_copy_corpus, id2word=reviews_copy_dict, num_topics=num_topics)
    
    # Compute the coherence score using Gensim's CoherenceModel
    coherence_model = CoherenceModel(model=lsi_model, texts=reviews_copy["tokenized_review"], dictionary=reviews_copy_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    
    # Append the number of topics and its coherence score to the list
    coherence_scores.append((num_topics, coherence_score))

#Find the optimal number of topics based on coherence score
optimal_topics = max(coherence_scores, key=lambda x: x[1])

# Output the optimal number of topics and the corresponding coherence score
print(f"Optimal number of topics: {optimal_topics[0]} with coherence: {optimal_topics[1]}")


Optimal number of topics: 10 with coherence: 0.4974223025827819


LSI Model:

In [31]:
reviews_copy_lsi = LsiModel(reviews_copy_corpus, id2word=reviews_copy_dict, num_topics=10)

LSI results, showing how much each token contributes to each topic:

In [32]:
reviews_copy_lsi.get_topics().shape

(10, 16935)

In [33]:
reviews_copy_lsi.show_topics()

[(0,
  '0.436*"place" + 0.421*"food" + 0.298*"chicken" + 0.185*"service" + 0.163*"taste" + 0.160*"one" + 0.129*"ordered" + 0.126*"ambience" + 0.124*"time" + 0.121*"really"'),
 (1,
  '-0.639*"chicken" + 0.408*"food" + 0.329*"place" + 0.183*"service" + -0.182*"biryani" + -0.133*"taste" + -0.108*"dish" + 0.098*"ambience" + -0.095*"fried" + -0.090*"fish"'),
 (2,
  '0.712*"place" + -0.609*"food" + -0.161*"service" + -0.135*"restaurant" + 0.080*"one" + 0.065*"best" + -0.062*"taste" + 0.060*"visit" + -0.055*"ordered" + -0.049*"quality"'),
 (3,
  '-0.485*"chicken" + 0.334*"one" + -0.303*"food" + -0.247*"place" + 0.194*"taste" + 0.192*"restaurant" + 0.185*"time" + 0.156*"like" + 0.127*"order" + 0.123*"even"'),
 (4,
  '-0.556*"service" + -0.286*"great" + 0.264*"food" + 0.261*"ordered" + 0.250*"biryani" + 0.194*"taste" + -0.191*"ambience" + 0.173*"place" + -0.165*"starter" + 0.153*"order"'),
 (5,
  '0.484*"service" + 0.423*"biryani" + 0.254*"time" + -0.250*"food" + 0.203*"order" + 0.178*"ordered"

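- We can see that Topic 1 contrasts dish-related words ("chicken", "biryani", "taste", "dish", "fried", "fish"), which have negative loadings, with "food", "place", "service" and "ambience", which have positive loadings. The sign of an LSA loading is arbitrary, so negative values do not indicate negative sentiment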
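
The sign of an SVD component carries no meaning on its own, which is why "chicken" loads on component 1 with 0.639 in sklearn and with -0.639 in Gensim. The sketch below, on a made-up rank-1 example, shows that flipping the sign of both the document scores and the word loadings reconstructs exactly the same matrix. The numbers and the helper `reconstruct` are hypothetical.

```python
# Made-up rank-1 example: document scores u, singular value s, word loadings v
u = [0.6, -0.8]           # one score per document
s = 5.0
v = [0.36, 0.48, -0.8]    # one loading per word

def reconstruct(u, s, v):
    """Rebuild the document-term matrix u * s * v^T."""
    return [[ui * s * vj for vj in v] for ui in u]

original = reconstruct(u, s, v)
flipped = reconstruct([-ui for ui in u], s, [-vj for vj in v])
print(original == flipped)
```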

Gensim LSI singular values vector to understand the importance of each latent topic:

In [35]:
reviews_copy_lsi.projection.s

array([196.84128664, 100.80278324,  85.78486628,  68.65017698,
        61.18429337,  59.39230239,  56.56837415,  54.91073643,
        54.11733828,  51.77932814])

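From Gensim's implementation of LSA:
- We can identify some of the aspects customers write about (dishes versus place and service), although the sign of a loading should not be read as sentiment
- The first component carries by far the largest singular value (196.8 versus 100.8 for the second)
- There is a large drop-off in the singular values after the first few topics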

## 6. Perform Latent Dirichlet Allocation (LDA)

[Back to TOC](#toc)

### 6.1 Using sklearn

We evaluate the number of topics with both perplexity and `c_v` coherence, giving priority to coherence since our goal is to extract understandable and meaningful topics:

In [57]:
# List to store perplexity and coherence scores
perplexity_scores = []
coherence_scores = []

# Loop through different values for `n_components` (number of topics)
for n_topics in range(5, 21):  # Example range from 5 to 20
    # Fit the LDA model
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10, random_state=42)
    lda_result = lda.fit_transform(reviews_copy_bow_td_matrix)

    # Calculate perplexity for the current model
    perplexity = lda.perplexity(reviews_copy_bow_td_matrix)
    perplexity_scores.append((n_topics, perplexity))
    
    # Collect the 10 highest-weighted words of each topic for the coherence calculation
    topic_words = []
    for topic in lda.components_:
        top_words = [bow_vocab[i] for i in topic.argsort()[:-11:-1]]
        topic_words.append(top_words)

    # Calculate coherence using Gensim's CoherenceModel
    coherence_model = CoherenceModel(topics=topic_words, texts=tokenized_docs, dictionary=reviews_copy_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((n_topics, coherence_score))

# Find the optimal number of topics based on perplexity and coherence
optimal_perplexity = min(perplexity_scores, key=lambda x: x[1])  # Minimize perplexity
optimal_coherence = max(coherence_scores, key=lambda x: x[1])  # Maximize coherence

# Output the results
print(f"Optimal number of topics based on perplexity: {optimal_perplexity[0]} with perplexity: {optimal_perplexity[1]}")
print(f"Optimal number of topics based on coherence: {optimal_coherence[0]} with coherence: {optimal_coherence[1]}")

Optimal number of topics based on perplexity: 5 with perplexity: 1864.1612802977609
Optimal number of topics based on coherence: 10 with coherence: 0.4893404615864171


Sklearn implementation of LDA:

In [37]:
lda = LatentDirichletAllocation(n_components=10,doc_topic_prior=None,topic_word_prior=None,max_iter=10,verbose=True)
lda_result = lda.fit_transform(reviews_copy_bow_td_matrix)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [None]:
lda_result.shape 

(9953, 10)

Topic that contributes the most to the test document:

In [39]:
test_topic = np.where(lda_result[10] == lda_result[10].max())[0][0]
test_topic

6

In [40]:
lda_result[10]

array([0.00666724, 0.00666737, 0.00666742, 0.00666736, 0.00666696,
       0.00666731, 0.93999337, 0.00666775, 0.00666783, 0.00666737])
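
A quick check on the row printed above, as a sketch: unlike LSA scores, an LDA document-topic row is a probability distribution, so it sums to (approximately) 1 and the dominant topic is simply the largest entry.

```python
# Document-topic distribution of the test document, copied from the output above
doc_topic = [0.00666724, 0.00666737, 0.00666742, 0.00666736, 0.00666696,
             0.00666731, 0.93999337, 0.00666775, 0.00666783, 0.00666737]

dominant_topic = max(range(len(doc_topic)), key=doc_topic.__getitem__)
total_probability = sum(doc_topic)
print(dominant_topic, round(total_probability, 6))
```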

LDA results to discover how much each token contributes to each topic:


In [41]:
topic_word_dict = [{word : value for word, value in zip(bow_vocab,component)} for component in lda.components_]
lda_sklearn_df = pd.DataFrame(topic_word_dict)
lda_sklearn_df

Unnamed: 0,aachar,aachari,aalishaan,aalishaanthis,aalloo,aalo,aaloo,aalu,aam,aamirs,...,zomos,zomoto,zomtato,zomto,zone,zoneincrease,zonequality,zoomato,zucchini,zyada
0,0.1,0.1,0.100003,0.100003,0.1,0.1,0.1,0.1,0.100007,0.1,...,0.1,0.1,0.1,0.1,0.100005,0.100004,0.1,0.1,0.100003,0.100006
1,0.1,0.1,0.1,0.1,0.1,0.100066,0.100069,1.100002,0.1,1.099994,...,0.1,0.1,0.1,0.1,0.1,0.100018,0.1,0.1,0.1,0.1
2,1.099996,1.1,0.10006,0.10006,0.1,0.100048,3.609872,1.12205,0.100005,0.100006,...,0.100002,0.1,0.1,0.100005,1.270876,0.1,0.1,0.1,4.099984,0.1
3,0.1,0.1,0.1,0.1,1.1,0.1,0.1,0.100015,1.09997,0.1,...,0.1,6.099998,0.1,0.1,0.1,0.1,0.100018,0.1,0.1,0.100001
4,0.1,0.1,0.1,0.1,0.1,0.100024,1.590096,1.077932,0.1,0.1,...,1.099977,0.1,0.1,0.1,2.186959,0.1,0.1,0.1,0.1,0.1
5,0.1,0.1,0.1,0.1,0.1,0.1,0.100041,0.1,0.1,0.1,...,0.1,0.100002,0.1,0.1,0.100002,0.1,0.1,1.099996,0.1,0.1
6,0.100004,0.1,0.1,0.1,0.1,0.1,0.100005,0.100001,0.100005,0.1,...,0.100003,0.1,0.1,0.1,0.60929,0.1,0.1,0.1,0.100002,1.099994
7,0.1,0.1,1.099937,1.099937,0.1,0.1,1.099918,0.1,0.100013,0.1,...,0.100016,0.1,1.1,0.1,0.100006,0.1,1.099982,0.100015,0.1,0.1
8,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0.1,0.1,0.1,0.1,0.1,0.1,0.1,1.099988,0.1,0.1
9,0.1,0.1,0.1,0.1,0.1,1.099861,0.1,0.1,0.1,0.1,...,0.100001,0.1,0.1,1.099995,3.332862,1.099979,0.1,0.1,0.100011,0.1
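
The values in this table are pseudo-counts rather than probabilities (the recurring 0.1 appears to be the default topic-word prior of `1 / n_components`). To read a row as a word distribution, it can be divided by its sum. The sketch below does this on a made-up 2-topic, 3-word matrix; the numbers, the vocabulary and the helper `normalise_rows` are illustrative assumptions. On the fitted model the equivalent would be `lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]`.

```python
# Hypothetical 2-topic x 3-word matrix in the style of lda.components_
components = [
    [0.1, 1.1, 3.6],
    [6.1, 0.1, 0.1],
]
vocab = ["aalu", "zone", "zucchini"]  # illustrative words only

def normalise_rows(matrix):
    """Divide each row by its sum so that it becomes a probability distribution."""
    return [[value / sum(row) for value in row] for row in matrix]

topic_word_probs = normalise_rows(components)
for topic_index, row in enumerate(topic_word_probs):
    ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
    print(topic_index, [(word, round(prob, 3)) for word, prob in ranked])
```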


In [42]:
topic_tgt = lda_sklearn_df.loc[test_topic]  # LDA topic-word weights (topic_df holds the LSA loadings)
topic_tgt = topic_tgt.sort_values(ascending=False)
topic_tgt

- Sorting the row of `lda_sklearn_df` for the test topic ranks the words by their weight in that topic. Unlike LSA loadings, these weights are non-negative pseudo-counts, so the words at the top of the list are the ones the topic is centered around

Assess LDA model using perplexity:

In [43]:
lda.perplexity(reviews_copy_bow_td_matrix)

1919.7510213451624

### 6.2 Using gensim

We evaluate the number of topics with both Gensim's log perplexity and `c_v` coherence, giving priority to coherence since our goal is to extract understandable and meaningful topics:

In [59]:
# List to store perplexity and coherence scores
perplexity_scores = []
coherence_scores = []

# Loop through different values for num_topics 
for n_topics in range(5, 21):  
    # Fit the LDA model using Gensim
    lda_gensim = LdaModel(corpus=reviews_copy_corpus, id2word=reviews_copy_dict, num_topics=n_topics, iterations=50, random_state=42)
    
    # Gensim's log_perplexity returns the per-word likelihood bound (higher is better);
    # the perplexity itself is 2 ** (-bound)
    perplexity = lda_gensim.log_perplexity(reviews_copy_corpus)
    perplexity_scores.append((n_topics, perplexity))
    
    # Calculate coherence score for the current model
    coherence_model = CoherenceModel(model=lda_gensim, texts=tokenized_docs, dictionary=reviews_copy_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append((n_topics, coherence_score))

# log_perplexity is a likelihood bound, so the minimum below is the lowest bound,
# i.e. the highest perplexity (2 ** (-bound)); a higher bound indicates a better fit
lowest_bound = min(perplexity_scores, key=lambda x: x[1])
optimal_coherence = max(coherence_scores, key=lambda x: x[1])  # higher is better

# Output the results
print(f"Number of topics with the lowest per-word bound (highest perplexity): {lowest_bound[0]} with bound: {lowest_bound[1]}")
print(f"Optimal number of topics based on coherence: {optimal_coherence[0]} with coherence: {optimal_coherence[1]}")

Number of topics with the lowest per-word bound (highest perplexity): 20 with bound: -9.303851594156953
Optimal number of topics based on coherence: 5 with coherence: 0.4271311372394617


LDA Model with Gensim:

In [44]:
lda_gensim = LdaModel(reviews_copy_corpus, id2word=reviews_copy_dict, num_topics=10,iterations=50)

In [45]:
lda_gensim.get_topics().shape

(10, 16935)

In [46]:
lda_gensim.show_topics()

[(0,
  '0.048*"nice" + 0.022*"cafe" + 0.015*"food" + 0.012*"place" + 0.010*"lip" + 0.010*"smacking" + 0.007*"oily" + 0.006*"service" + 0.006*"sumptuous" + 0.005*"paying"'),
 (1,
  '0.020*"order" + 0.015*"bad" + 0.015*"money" + 0.014*"food" + 0.011*"waste" + 0.010*"even" + 0.010*"ordered" + 0.009*"biryani" + 0.009*"place" + 0.009*"worst"'),
 (2,
  '0.021*"food" + 0.019*"ordered" + 0.017*"chicken" + 0.014*"place" + 0.013*"service" + 0.011*"taste" + 0.011*"pizza" + 0.009*"dont" + 0.009*"order" + 0.008*"noodle"'),
 (3,
  '0.041*"place" + 0.032*"food" + 0.029*"time" + 0.013*"great" + 0.013*"service" + 0.013*"nice" + 0.012*"visit" + 0.011*"ambience" + 0.009*"really" + 0.009*"friend"'),
 (4,
  '0.023*"place" + 0.022*"taste" + 0.022*"chicken" + 0.016*"biryani" + 0.014*"food" + 0.011*"one" + 0.009*"dish" + 0.009*"ordered" + 0.008*"momos" + 0.008*"spicy"'),
 (5,
  '0.046*"food" + 0.038*"place" + 0.024*"service" + 0.015*"ambience" + 0.013*"best" + 0.012*"great" + 0.010*"staff" + 0.009*"must" + 0.

- We can see that for example Topic 0 focused on positive sentiments with words like "nice," "cafe," "food," "service," 
- While Topic 1 focused on negative sentiments associated with ordering issues, such as "bad," "waste," "money," and "worst."

Document-Topic Distribution Matrix from LDA Model:

In [47]:
lda_doc_topic_matrix = np.array([[prob for _, prob in lda_gensim.get_document_topics(doc, minimum_probability=0.0)] for doc in tqdm(reviews_copy_corpus)])
lda_doc_topic_matrix.shape

100%|██████████| 9953/9953 [00:02<00:00, 4739.31it/s]


(9953, 10)

Topic that contributes the most to the test document:

In [48]:
test_topic = np.argmax(lda_doc_topic_matrix[10])  # use the Gensim LDA document-topic matrix, not lsa_result
test_topic

In [49]:
lda_gensim.show_topic(test_topic,topn=5)

Assess the LDA model using Gensim's log perplexity (a per-word likelihood bound, where perplexity = 2^(-bound)):

In [50]:
lda_gensim.log_perplexity(reviews_copy_corpus) 

-8.228859316177264
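
The value above is the per-word likelihood bound that `log_perplexity` returns, not the perplexity itself. A minimal sketch of the conversion, following Gensim's documented relation perplexity = 2^(-bound):

```python
import math

bound = -8.228859316177264   # value returned by log_perplexity above
perplexity = 2 ** (-bound)   # Gensim reports perplexity as 2^(-bound)

print(f"per-word bound: {bound:.3f}, perplexity: {perplexity:.1f}")
```

A higher (less negative) bound therefore means a lower perplexity. The result should be compared with sklearn's `perplexity()` only with care, since the two libraries use different estimators and conventions.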

Assess the LDA model using the Gensim implementation of coherence:

In [51]:
cm = CoherenceModel(model=lda_gensim, texts=reviews_copy["tokenized_review"], coherence='c_v')
cm.get_coherence()  

0.3969112932050881

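- A `c_v` coherence of approximately 0.40 indicates moderately coherent topics, which we consider acceptable for noisy real-world reviews. It is below the best scores found in the model-selection runs above (about 0.43 for Gensim LDA with 5 topics and about 0.50 for LSA)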

Coherence per topic:

In [52]:
cm.get_coherence_per_topic()

[0.387239546981948,
 0.3913635315894034,
 0.3042149870297127,
 0.4292396632703153,
 0.35616674775875545,
 0.38419009272334936,
 0.44683738633303793,
 0.41807732763739197,
 0.3896882855370777,
 0.46209536318988914]
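
As a sanity check, the sketch below averages the per-topic values printed above and locates the most and least coherent topics. The arithmetic mean reproduces the overall score of about 0.397 reported by `get_coherence()`.

```python
# Per-topic c_v coherence, copied from the output above
per_topic = [0.387239546981948, 0.3913635315894034, 0.3042149870297127,
             0.4292396632703153, 0.35616674775875545, 0.38419009272334936,
             0.44683738633303793, 0.41807732763739197, 0.3896882855370777,
             0.46209536318988914]

mean_coherence = sum(per_topic) / len(per_topic)
best_topic = max(range(len(per_topic)), key=per_topic.__getitem__)
worst_topic = min(range(len(per_topic)), key=per_topic.__getitem__)

print(f"mean: {mean_coherence:.4f}, best: topic {best_topic}, worst: topic {worst_topic}")
```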

From Gensim's implementation of LDA:
- We can identify some positive and negative customer sentiment
- We can conclude that the clearest and most specific topic was related to the vegetarian and non-vegetarian options of biryani (a famous dish of Hyderabadi cuisine)

----

From the LSA and LDA results we can conclude that both methods arrive at broad topics and do not surface more specific ones. We found Gensim's implementations to be more robust and interpretable than sklearn's.

---

## 7. Topic Model using BERTopic

[Back to TOC](#toc)

Find the optimal number of Topics:

In [61]:
docs = reviews_copy["preproc_review"].reset_index(drop=True) #properly indexed reviews for BERTopic

In [None]:
# Initialize a list to store the coherence scores for different numbers of topics
coherence_scores = []

# Loop through different values for `nr_topics`
for n_topics in range(20, 101): 
    # Initialize the BERTopic model
    topic_model = BERTopic(nr_topics=n_topics)
    
    # Fit the model
    topics, probs = topic_model.fit_transform(docs)
    
    # Get the top words for each topic
    topics_words = topic_model.get_topics()
    
    # Prepare the topics for coherence calculation (list of top word IDs for each topic)
    topic_words = []
    for i in range(n_topics):
        if i in topics_words:  
            # Convert the top words into their corresponding token IDs
            topic_words.append([reviews_copy_dict.token2id[word] for word, _ in topics_words[i] if word in reviews_copy_dict.token2id and word != ''])

    # Compute the coherence score using Gensim's CoherenceModel
    coherence_model = CoherenceModel(topics=topic_words, texts=tokenized_docs, dictionary=reviews_copy_dict, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    
    # Append the number of topics and the corresponding coherence score
    coherence_scores.append((n_topics, coherence_score))

# Find the optimal number of topics with the highest coherence score
optimal_topics = max(coherence_scores, key=lambda x: x[1])

# Output the optimal number of topics and the corresponding coherence score
print(f"Optimal number of topics: {optimal_topics[0]} with coherence: {optimal_topics[1]}")


Optimal number of topics: 85 with coherence: 0.4630936967180281
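
BERTopic's default pipeline includes stochastic steps, so the single highest coherence score may partly reflect run-to-run noise. One possible refinement, which is not what the notebook does, is to prefer the smallest number of topics whose coherence is close to the best. A minimal sketch, where the helper `smallest_near_best`, the tolerance and the toy scores are all hypothetical:

```python
def smallest_near_best(scores, tolerance=0.01):
    """Return the smallest n_topics whose coherence is within `tolerance` of the best score."""
    best_score = max(score for _, score in scores)
    candidates = [n for n, score in scores if best_score - score <= tolerance]
    return min(candidates)

# Toy (n_topics, coherence) pairs in the same format as `coherence_scores`
toy_scores = [(20, 0.41), (40, 0.455), (60, 0.44), (85, 0.463)]
chosen = smallest_near_best(toy_scores, tolerance=0.01)
print(chosen)
```

With a tolerance of 0 the helper reduces to the plain `max` selection used above.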


Topic Modeling with BERTopic:

In [62]:
topic_model = BERTopic(nr_topics=85)
topics, probs = topic_model.fit_transform(docs)
reviews_copy_topics_df = pd.DataFrame({'topic': topics, 'document': docs})

In [297]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,4071,-1_food_place_ambience_service,"[food, place, ambience, service, great, chicke...",[visited place friday evening saw rating zomat...
1,0,537,0_manager_table_worst_asked,"[manager, table, worst, asked, even, customer,...",[visited place team lunch experience pathetic ...
2,1,461,1_biryani_mutton_chicken_biriyani,"[biryani, mutton, chicken, biriyani, ordered, ...",[paradise biryani really service also staff fr...
3,2,446,2_goo_yup_nil_hmm,"[goo, yup, nil, hmm, verry, upto, mark, one, , ]","[hmm, nil, goo]"
4,3,238,3_noodle_rice_manchurian_fried,"[noodle, rice, manchurian, fried, ordered, chi...",[ordered egg mushroom fried rice food rather s...
...,...,...,...,...,...
80,79,12,79_awsome_thank_thanks_udipi,"[awsome, thank, thanks, udipi, keep, guy, alwa...","[awsome, awsome, awsome]"
81,80,12,80_zomato_gold_near_ambience,"[zomato, gold, near, ambience, drinkfirst, aga...",[ambience dance floor primary attraction resta...
82,81,12,81_salt_salty_badtaste_saltyother,"[salt, salty, badtaste, saltyother, couldnot, ...","[much salt, food salty couldnot eat money wast..."
83,82,12,82_chutney_spilled_cooking_instruction,"[chutney, spilled, cooking, instruction, mynoe...",[chutney supplied item cooking instruction fol...
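
Topic `-1` collects the documents that BERTopic treats as outliers, and in the table above it is by far the largest group. A small sketch for quantifying that share from a list of topic assignments (the helper `outlier_share` and the toy list are hypothetical; the final figure uses the counts shown in the table):

```python
from collections import Counter

def outlier_share(topic_assignments):
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_assignments)
    return counts[-1] / len(topic_assignments)

# Toy assignments in the same format as the `topics` list returned by fit_transform
toy_topics = [-1, 0, 0, 1, -1, 2, -1, 1]
print(outlier_share(toy_topics))

# With the counts from the table above: 4071 outliers out of 9953 reviews
reported_share = 4071 / 9953
print(round(reported_share, 3))
```

A share of this size means a large part of the corpus is not described by any specific topic, which is worth keeping in mind when interpreting the topics that follow.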


In [298]:
topic_model.visualize_topics()

- Overlapping circles indicate similar topics; the concentration of most circles on the left side of the plot suggests that the dataset is dominated by more general topics

Visualise the documents and their assigned topics with BERTopic's default stack:

In [299]:
topic_model.visualize_documents(docs)

- The substantial overlap between clusters indicates many similar topics that differ only in some nuances

Topic Assignment and Document Mapping:

In [300]:
reviews_copy_topics_df

Unnamed: 0,topic,document
0,-1,ambience food quite saturday lunch cost effect...
1,9,ambience pleasant evening service prompt food ...
2,-1,must try great food great ambience thnx servic...
3,-1,soumen da arun great guy behavior sincerety fo...
4,-1,food goodwe ordered kodi drumstick basket mutt...
...,...,...
9948,-1,madhumathi mahajan well start nice courteous s...
9949,-1,place never disappointed food courteous staff ...
9950,-1,bad rating mainly chicken bone found veg food ...
9951,-1,personally love prefer chinese food couple tim...


Topic Distribution Approximation for Documents:

In [301]:
topic_distr, _ = topic_model.approximate_distribution(docs)
topic_distr.shape

(9953, 84)

Access Topic Distribution for test document:

In [302]:
topic_distr[10]

array([0.0465915 , 0.        , 0.        , 0.        , 0.04700868,
       0.        , 0.        , 0.04578106, 0.0991406 , 0.08475331,
       0.        , 0.04678516, 0.        , 0.        , 0.04738701,
       0.03596829, 0.        , 0.05979963, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.04707929, 0.        , 0.14489688,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.03794124, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.0468243 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.01773862, 0.04157069, 0.05450196,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.09623178, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     
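
Most entries of the distribution are zero, so it is easier to read as a short list of the strongest topics. A minimal sketch on a toy vector (the helper `top_k_topics` and its values are hypothetical; in the notebook the input would be a row such as `topic_distr[10]`):

```python
import numpy as np

def top_k_topics(distribution, k=3):
    """Return (topic_index, weight) pairs for the k largest weights, highest first."""
    order = np.argsort(distribution)[::-1][:k]
    return [(int(idx), float(distribution[idx])) for idx in order]

# Toy distribution over 6 topics; values are illustrative only
toy_distribution = np.array([0.05, 0.0, 0.30, 0.0, 0.15, 0.10])
top3 = top_k_topics(toy_distribution, k=3)
print(top3)
```

The returned indices can then be passed to `topic_model.get_topic` to inspect the words behind each of the strongest topics.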

In [303]:
reviews_copy["raw_review"].iloc[10]

'The service was great and the food was awesome The service staff Manab and Papiya were very courteous and attentive  would like  come frequently  this place'

Visualize topic distribution for Test Document:

In [304]:
topic_model.visualize_distribution(topic_distr[10])

- We can conclude that the main topics of the test document relate to excellent service

## 8. Conclusion

  
[Back to TOC](#toc)

From this notebook we can draw the following insights:
- LSA and LDA arrive at broader topics
- With LSA we saw that most of the variance (singular_values_) is explained by a few topics
- LDA produced broad but meaningful topics related to dishes and customer sentiment
- Gensim's implementation is the most robust and interpretable
- BERTopic is the algorithm that arrives at the most interesting topics, such as:
  - Specific Cuisines
  - Ordering Problems
  - Specific Complaints
  - Restaurant Ambience
  - Food Trends
- Topics are generally similar to each other, differing only in some nuances