Topic modeling is an NLP technique for discovering the abstract topics that occur in a collection of documents.

It helps in identifying patterns and structures within the text data which can be used to organize, summarise or explore large volumes of information.

A topic is a cluster of words that frequently occur together. Each topic is characterised by a distribution over words, and each document in a corpus can be described by a distribution over topics.
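As a rough illustration of these two distributions, the sketch below uses made-up numbers (the vocabulary, `topic_word`, and `doc_topic` are hypothetical, not fitted values) to show how a document's word probabilities can be read as a topic-weighted mixture of the topic-word distributions.

```python
# Hypothetical toy values, chosen only to illustrate the idea.
vocab = ["room", "bed", "beach", "pool"]

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topic_word = [
    [0.6, 0.4, 0.0, 0.0],  # topic 0: mostly about rooms
    [0.0, 0.0, 0.5, 0.5],  # topic 1: mostly about the beach and pool
]

# Each document is a probability distribution over topics.
doc_topic = [0.75, 0.25]

# Word probabilities for the document: sum over topics of
# P(topic | document) * P(word | topic).
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(topic_word)))
    for w in range(len(vocab))
]

for word, prob in zip(vocab, doc_word):
    print(f"{word}: {prob:.3f}")
```

Because both inputs are probability distributions, the resulting word probabilities also sum to 1.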

Some methods for topic modeling are:
* Latent Dirichlet Allocation (LDA)
* Non-Negative Matrix Factorization (NMF)
* Latent Semantic Analysis (LSA) / Latent Semantic Indexing (LSI)
* Hierarchical Dirichlet Process (HDP)
* Correlated Topic Model (CTM)
* Biterm Topic Model (BTM)
* Neural Topic Models
    * ProdLDA (Product of Experts LDA)
    * Embedded Topic Model (ETM)
    * Top2Vec
* Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM)
* BERTopic

After preprocessing and modeling, the results need to be interpreted. Each resulting topic is analyzed through its highest-weighted words to understand what it means.

# Prerequisites

In [43]:
import pandas as pd
from tqdm.notebook import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis.lda_model

import pickle
 

In [2]:
nltk.download("stopwords")
nltk.download("punkt_tab") #For Word Tokenize
nltk.download('wordnet')   #For WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Snowwolf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Snowwolf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Snowwolf\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
df = pd.read_csv("tripadvisor_hotel_reviews.csv")

In [3]:
df.head(2)

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2


In [4]:
lemmetizer = WordNetLemmatizer()
stop_words_set = set(stopwords.words("english"))

In [6]:
for i in tqdm(range(len(df))):
    review = df.iloc[i]["Review"]
    tokens = word_tokenize(review)
    kept = []
    for token in tokens:
        # Lowercase and lemmatize, then drop stop words and non-alphabetic tokens
        token = lemmetizer.lemmatize(token.lower())
        if token in stop_words_set or not token.isalpha():
            continue
        kept.append(token)
    df.loc[i, "Review"] = " ".join(kept)
            


In [8]:
df.head(3)

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice room experience hotel monaco seattle good...,3


# LDA 

LDA needs a document-term matrix: each row is a document, each column is a word from the corpus vocabulary, and each cell holds the number of times that word occurs in that document.
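To make the shape of that matrix concrete, here is a minimal dependency-free sketch that builds one by hand for two made-up documents. It only mimics the counting idea; `CountVectorizer` additionally handles tokenization and returns a sparse matrix. The names `docs` and `dtm_toy` are hypothetical.

```python
from collections import Counter

# Two hypothetical, already-cleaned documents.
docs = ["nice hotel nice room", "room service slow"]

# The vocabulary is the sorted set of all words in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cells hold counts.
dtm_toy = []
for doc in docs:
    counts = Counter(doc.split())
    dtm_toy.append([counts[word] for word in vocab])

print(vocab)
for row in dtm_toy:
    print(row)
```

Most cells are zero even in this tiny case, which is why real document-term matrices are stored in sparse form.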

In [9]:
vectorizer = CountVectorizer()

dtm = vectorizer.fit_transform(df["Review"])

feature_names = vectorizer.get_feature_names_out()

In [65]:
lda_model = LatentDirichletAllocation(n_components=13, random_state=20, 
                                      n_jobs=-1, verbose=1)
lda_model.fit(dtm)

#n_components defines how many topics we want

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [66]:
# Use a context manager so the file is closed after writing
with open("lda_model.pickle", "wb") as f:
    pickle.dump(lda_model, f)

In [67]:
# Use a context manager so the file is closed after reading
with open("lda_model.pickle", "rb") as f:
    lda_model = pickle.load(f)

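The number of feature names equals the number of features (columns) that the CountVectorizer creates for every document (row). It also matches the second dimension of `lda_model.components_`, so each topic weight can be mapped back to a word.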

In [68]:
#Display the top words of each topic
top_words = 5

for topic_index, topic in enumerate(lda_model.components_):
    temp = ""
    # argsort() is ascending, so the highest-weighted word is printed last
    for i in topic.argsort()[-top_words:]:
        word = feature_names[i]
        temp = temp + word + " "
    print(f"Topic number {topic_index} with topics : {temp}")

Topic number 0 with topics : hospital doctor palace riu waikiki 
Topic number 1 with topics : location good great room hotel 
Topic number 2 with topics : nice great pool hotel beach 
Topic number 3 with topics : nice hotel bed night room 
Topic number 4 with topics : day room royal service resort 
Topic number 5 with topics : good stay great hotel room 
Topic number 6 with topics : great stay staff room hotel 
Topic number 7 with topics : restaurant pool service villa bali 
Topic number 8 with topics : night san inn parking car 
Topic number 9 with topics : great good food beach resort 
Topic number 10 with topics : day told desk hotel room 
Topic number 11 with topics : water pool service hotel room 
Topic number 12 with topics : stay stayed york time room 
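The listing above shows each topic's words in ascending order of weight, because `argsort()` sorts ascending. If a strongest-first listing is preferred, one option is to reverse the sorted indices. The sketch below uses hypothetical arrays (`toy_features`, `toy_topic`) standing in for `feature_names` and a single row of `lda_model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, standing in for
# feature_names and one row of lda_model.components_.
toy_features = np.array(["bed", "hotel", "night", "room", "staff"])
toy_topic = np.array([2.0, 9.5, 1.0, 12.0, 4.0])

top_words = 3

# argsort() is ascending; reverse it and take the first top_words indices.
top_idx = toy_topic.argsort()[::-1][:top_words]

for i in top_idx:
    print(f"{toy_features[i]}: {toy_topic[i]}")
```

Printing the weight next to each word also shows how sharply a topic is concentrated on its leading terms.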


In [69]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.lda_model.prepare(lda_model, dtm, vectorizer)
vis

In [70]:
#Assigning most dominant topic to each review

topic_distribution = lda_model.transform(dtm)

df["topic"] = topic_distribution.argmax(axis=1) #Assigning most probable topic to each review 

df.head()

Unnamed: 0,Review,Rating,topic
0,nice hotel expensive parking got good deal sta...,4,5
1,ok nothing special charge diamond member hilto...,2,10
2,nice room experience hotel monaco seattle good...,3,3
3,unique great stay wonderful time hotel monaco ...,5,5
4,great stay great stay went seahawk game awesom...,5,10
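Taking the `argmax` keeps only the single strongest topic and hides how mixed a review is. A minimal sketch, assuming a hypothetical document-topic matrix `toy_distribution` in place of the output of `lda_model.transform(dtm)`, shows how the dominant topic's weight could be kept alongside its index:

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 3 topics, rows sum to 1.
toy_distribution = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
])

dominant_topic = toy_distribution.argmax(axis=1)  # index of the largest value per row
dominant_weight = toy_distribution.max(axis=1)    # how strongly that topic dominates

for doc_id, (topic, weight) in enumerate(zip(dominant_topic, dominant_weight)):
    print(f"document {doc_id}: topic {topic} (weight {weight:.2f})")
```

In this toy case the third document is assigned topic 0 even though that topic accounts for well under half of it, so a low dominant weight is a hint that the single label should be read with care.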


## Evaluate Model using Gensim 

In [57]:
import gensim
from gensim.models.coherencemodel import CoherenceModel

In [49]:
#Creating a list of all reviews, each split into a list of words

review_list = [review.split() for review in tqdm(df["Review"])]


In [50]:
#Create a gensim dictionary from all these reviews 

dictionary = gensim.corpora.Dictionary(review_list)

In [55]:
#Convert the tokenized reviews to a bag of words corpus 

corpus = [dictionary.doc2bow(review) for review in review_list]
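For intuition about the bag-of-words format, the following dependency-free sketch reproduces the idea by hand: each review becomes a list of `(token_id, count)` pairs. The data and the `token2id` mapping are hypothetical, and the ids here are assigned in order of first appearance, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

# Hypothetical tokenized reviews.
toy_reviews = [["nice", "hotel", "nice", "room"], ["room", "service"]]

# Assign an integer id to each distinct token (order of first appearance here;
# gensim's Dictionary may number tokens differently).
token2id = {}
for tokens in toy_reviews:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Bag-of-words: for each review, a sorted list of (token_id, count) pairs.
toy_corpus = [
    sorted(Counter(token2id[t] for t in tokens).items())
    for tokens in toy_reviews
]

print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only which words occur and how often is kept.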

In [59]:
lda_gensim = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)
coherence_model_lda = CoherenceModel(model=lda_gensim, texts=review_list,
                                     dictionary=dictionary, coherence="c_v")
coherence_score = coherence_model_lda.get_coherence()

In [60]:
print("Coherence Score is ",coherence_score)

Coherence Score is  0.3551422832494412


In [61]:
def compute_coherence_values(corpus, dictionary, k, texts):
    lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
    return coherence_model_lda.get_coherence()

In [63]:
coherence_values = []
for num_topics in tqdm(range(2, 15)):
    coherence_values.append(compute_coherence_values(corpus, dictionary, num_topics, review_list))


In [64]:
#Index into the same range that was used to compute coherence_values
optimal_topics = range(2, 15)[coherence_values.index(max(coherence_values))]
print(f'Optimal number of topics: {optimal_topics}')

Optimal number of topics: 13
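One way to make this selection less error-prone is to keep each candidate topic count paired with its score, so the best value is never looked up by position in a separate range. A minimal sketch with hypothetical scores (`topic_range` and `toy_coherence` are made up and are not the values computed above):

```python
# Hypothetical coherence scores, one per candidate number of topics.
topic_range = range(2, 7)
toy_coherence = [0.31, 0.35, 0.38, 0.36, 0.33]

# Pair each candidate k with its score and keep the pair with the highest score.
best_k, best_score = max(zip(topic_range, toy_coherence), key=lambda pair: pair[1])

print(f"Best k: {best_k} (coherence {best_score:.2f})")
```

The highest score may fall on the largest candidate tried, in which case widening the search range is worth considering before settling on a value.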
