# Topic Model Analysis

# Motivation for using these training datasets

I use three training datasets taken from Kaggle: restaurant reviews, book reviews, and movie reviews. The test data contains texts whose topic of interest is a review of a restaurant, a book, or a movie. I chose these training datasets because they cover relevant examples that can be used to train the model. In addition, the number of examples in the training datasets is relatively large, which should make them a good representation of the topics and diverse enough to show different perspectives on each topic. A study by Mikolov et al. (2013) showed that increasing the number of features in a dataset can improve model performance, and this is what we aim for: a good model for NLP tasks. Moreover, Jurgens et al. (2017) showed that a diverse training dataset can also improve the performance of an analysis model.

Sources:
1. Movie review:  https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format
2. Book review: https://www.kaggle.com/datasets/meetnagadia/amazon-kindle-book-review-for-sentiment-analysis
3. Restaurant Review: https://www.kaggle.com/code/apekshakom/sentiment-analysis-of-restaurant-reviews

## Motivation for Unsupervised Model for Topic Analysis


While searching for suitable training datasets that match the topic of the given dataset, I observed that none of the entries in the training datasets provide any explicit labels for the topics. As per the lecture, when labeled data is not available, or the labels are missing, an unsupervised model is the appropriate solution. Furthermore, the application of supervised techniques may not be appropriate in this context as their objective is different from that of unsupervised techniques, which aim to identify latent topics within a collection of documents. Hence, unsupervised topic modeling techniques would be more appropriate for this task.

## 1. LDA 

***Motivation*** : The datasets provided, namely the restaurant review dataset, the book review dataset, and the movie review dataset, contain a limited number of reviews. However, as these reviews may be related to a certain topic, it is important to identify how likely a document is related to a hidden or latent topic. Latent Dirichlet Allocation (LDA) is an appropriate topic modelling technique for this task, as it helps to identify the most frequently mentioned topics within the documents. Additionally, LDA is well-suited for small datasets, such as the ones provided. Therefore, utilizing LDA for these datasets can provide insights into the latent topics discussed within the reviews.



## 1.1 Training Evaluation

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import string
from itertools import combinations

from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis


In [2]:
# Load Data
restaurant_reviews = pd.read_csv('Data/Restaurant_Reviews.tsv', delimiter='\t').rename(columns={'Review': 'text'})
movie_reviews = pd.read_csv('Data/movie_review.csv')
book_reviews = pd.read_csv('Data/book_reviews.csv')
test_set = pd.read_csv("data/sentiment-topic-final-test.tsv", delimiter='\t')


## Preprocessing

In [3]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words and token.isalnum()]
    return tokens

def word_match(topic1, topic2):
    word1 = set(topic1)
    word2 = set(topic2)
    return len(word1 & word2) / len(word1 | word2)

def topic_diversity(lda_model):
    topics = lda_model.show_topics(num_topics=-1, formatted=False)
    top_words = [[word[0] for word in topic[1]] for topic in topics]
    similarities = [word_match(pair[0], pair[1]) for pair in combinations(top_words, 2)]
    return sum(similarities) / len(similarities)


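# Map each dataset name to its DataFrame instead of looking it up via globals()
frames = {
    'restaurant_reviews': restaurant_reviews,
    'book_reviews': book_reviews,
    'movie_reviews': movie_reviews,
}
datasets = list(frames)

corpora_dict = {}
corpora_corpus = {}

for dataset in datasets:
    # Tokenize every review, then build the dictionary and bag-of-words corpus
    corpus = [preprocess_text(doc) for doc in frames[dataset]['text']]
    corpora_dict[dataset] = corpora.Dictionary(corpus)
    corpora_corpus[dataset] = [corpora_dict[dataset].doc2bow(doc) for doc in corpus]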
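
The `word_match` helper above is the Jaccard index of two word sets, and `topic_diversity` averages it over all topic pairs. The result is therefore an overlap measure: a lower value means the topics share fewer top words. The sketch below illustrates this with hypothetical word lists (not taken from the datasets); the function names `jaccard` and `mean_pairwise_overlap` are made up for the example.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Jaccard index: shared words divided by all distinct words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

def mean_pairwise_overlap(topics):
    """Average Jaccard index over every pair of topics."""
    scores = [jaccard(t1, t2) for t1, t2 in combinations(topics, 2)]
    return sum(scores) / len(scores)

# Hypothetical top-word lists, for illustration only.
overlapping = [["food", "good", "place"], ["food", "good", "service"]]
distinct = [["food", "good", "place"], ["book", "read", "story"]]

overlap_high = mean_pairwise_overlap(overlapping)  # 2 shared words out of 4 distinct
overlap_low = mean_pairwise_overlap(distinct)      # no shared words
print(overlap_high, overlap_low)
```

Topics that repeat the same words score closer to 1, and topics with no words in common score 0.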

## 1.2 Test Evaluation

In [5]:
num_topics = 10
perplexity_scores = []

for dataset in datasets:
    dictionary = corpora_dict[dataset]
    lda_model = models.LdaModel(corpora_corpus[dataset], num_topics=num_topics, id2word=dictionary, passes=50, alpha='auto', eta='auto')
    corpus = [dictionary.doc2bow(preprocess_text(doc)) for doc in test_set['text']]
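    # log_perplexity returns the per-word likelihood bound (higher is better),
    # not the perplexity itself; perplexity = 2 ** (-bound)
    perplexity = lda_model.log_perplexity(corpus)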
    print(f"{dataset.title()} perplexity on test set: {perplexity}")
    perplexity_scores.append(perplexity)


    # Print the top words for each topic
    print(f"\n{dataset.title()} LDA Topics:")
    for i, topic in lda_model.show_topics(num_topics=num_topics, formatted=False):
        print(f"Topic {i}: {' '.join([word[0] for word in topic])}")
        
    vis_data = gensimvis.prepare(lda_model, corpora_corpus[dataset], dictionary)
    pyLDAvis.display(vis_data)
    
    print(f"\n{dataset.title()} LDA Topic Diversity: {topic_diversity(lda_model)}")

Restaurant_Reviews perplexity on test set: -94.72159922122955

Restaurant_Reviews LDA Topics:
Topic 0: back go place wo would come think recommend food probabl
Topic 1: good food great delici servic time disappoint select price one
Topic 2: great good bland experi time flavor bad dinner wonder salad
Topic 3: place like go eat nice love spot lunch know spici
Topic 4: time wait servic never back great go say server came
Topic 5: servic food place slow friendli terribl waitress server mediocr return
Topic 6: place food amaz best love awesom want tri minut pizza
Topic 7: good food realli restaur servic order place excel qualiti expect
Topic 8: friendli tast great staff pretti servic worst buffet food ever
Topic 9: like definit fri realli restaur one atmospher better food ever





Restaurant_Reviews LDA Topic Diversity: 0.08086228643504187
Book_Reviews perplexity on test set: -475.3260098501693

Book_Reviews LDA Topics:
Topic 0: printer paper print page color jam hp puzzl divorc document
Topic 1: movi watch one film like make good time bad stori
Topic 2: bed air mattress night inflat sleep flea airb 3d pump
Topic 3: toy song album play cd son music great love one
Topic 4: work player get buy softwar time play one tri problem
Topic 5: use one work product would good get great buy ear
Topic 6: music great cd track show video danc listen like ipod
Topic 7: book read stori one great would like good time interest
Topic 8: card test concert toefl bowl holder bike well hand mac
Topic 9: scanner puzzl stewart la max patrick de produc memori vista





Book_Reviews LDA Topic Diversity: 0.044402400336352864
Movie_Reviews perplexity on test set: -630.8721218354841

Movie_Reviews LDA Topics:
Topic 0: murphi eddi elvi pari french kelli funni chaplin mari gene
Topic 1: snake woodi allen sarn scarlett vidal peebl wodehous astair mississippi
Topic 2: episod seri freddi holli season harri batman ranger melodi stoog
Topic 3: sam trek betti che rick pacino episod laura steve kirk
Topic 4: br movi film one like time good charact make watch
Topic 5: game music holm band play wagner elizabeth judi kramer player
Topic 6: bug wood ant charl ray foxx giallo gere anna lane
Topic 7: rat edi match detect eugen ricci wrestlemania wwe vanc charley
Topic 8: barney stanwyck loy franki becki mormon ford grandson donald barbara
Topic 9: scroog jane mclaglen fritton dicken din christma gustav sim bundl





Movie_Reviews LDA Topic Diversity: 0.0011695906432748538


## 1.3 Error Analysis

The `topic_diversity` function returns the mean pairwise Jaccard overlap between the top words of the topics, so a lower value means more distinct topics. On this measure the movie reviews have the most distinct topics (0.001), followed by the book reviews (0.044), while the restaurant reviews show the most overlap (0.081): words such as "food", "place", and "servic" recur across most restaurant topics. The value printed as perplexity is gensim's per-word likelihood bound from `log_perplexity`, where a higher (less negative) value indicates a better fit. The restaurant model therefore fits the test set best (-94.72), the book model is second (-475.33), and the movie model struggles the most (-630.87).

Overall, the LDA model performed best on the restaurant reviews domain in terms of fit to the test set, although its topics overlap the most. It is possible that the restaurant reviews dataset covers more of the words that appear in the test set, leading to its better fit compared to the other domains. The movie topics, although distinct, are dominated by proper names (for example "murphi", "eddi", "elvi"), which makes them hard to read as review themes.
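
gensim's `log_perplexity` returns a per-word likelihood bound rather than a perplexity, and gensim itself reports the perplexity estimate as 2 raised to the negative bound. The sketch below applies that conversion to the three bounds printed above, rounded to two decimals. The helper name `bound_to_perplexity` is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity."""
    return 2.0 ** (-per_word_bound)

# Per-word bounds copied from the LDA output above (rounded).
bounds = {"restaurant": -94.72, "book": -475.33, "movie": -630.87}

perplexities = {name: bound_to_perplexity(b) for name, b in bounds.items()}

# Lower perplexity means a better fit, so sort in ascending order.
ranking = sorted(perplexities, key=perplexities.get)
for name in ranking:
    print(f"{name}: bound={bounds[name]:.2f}, perplexity={perplexities[name]:.3e}")
```

A less negative bound gives a lower perplexity, so the ordering by fit is restaurant, book, movie. The perplexities themselves are extremely large, which may indicate that much of the test vocabulary is missing from each training dictionary.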

## 2. NMF 

***Motivation*** : Non-negative Matrix Factorization (NMF) is similar to LDA in that both are unsupervised topic modelling techniques. Unlike LDA, NMF is not a probabilistic model: it factorizes a non-negative document-term matrix into a document-topic matrix and a topic-term matrix. The purpose of NMF here is to uncover the underlying topics in a text corpus. In this study, there are three main reasons why NMF was chosen as the method for topic modeling.

Firstly, the primary objective was to identify the underlying topics in the restaurant, book, and movie review datasets. NMF is a suitable method for this purpose since it aims to discover the latent topics that are present in the text data.

Secondly, the datasets used in this study are of moderate size. NMF is an appropriate technique for topic modeling in moderate-sized datasets. Therefore, it is a practical and efficient method for this study.

Lastly, the training datasets were taken from 'chosen' topics, and it is assumed that related topics overlap within the datasets. Because NMF represents each document as a non-negative combination of topics, a review can load on several topics at once, which suits this assumption.
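
To make the factorization concrete, here is a minimal sketch of NMF using the classic multiplicative update rules for the Frobenius objective. It is not the solver used in this notebook (scikit-learn's `NMF` is used below); the matrix `V`, the function name `nmf_multiplicative`, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (documents x terms) as W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))  # document-topic weights
    H = rng.random((n_topics, V.shape[1]))  # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 4-document, 4-term count matrix with two obvious word groups.
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

W, H = nmf_multiplicative(V, n_topics=2)
error = np.linalg.norm(V - W @ H)  # Frobenius reconstruction error
print(np.round(W, 2))
print(np.round(H, 2))
print(f"Frobenius reconstruction error: {error:.3f}")
```

Each row of `H` can be read as a topic (weights over terms) and each row of `W` as the topic weights of one document.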

## 2.1 Training Evaluation

In [9]:
def clean_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    return text

restaurant_reviews["clean_text"] = restaurant_reviews["text"].apply(clean_text)
book_reviews["clean_text"] = book_reviews["text"].apply(clean_text)
movie_reviews["clean_text"] = movie_reviews["text"].apply(clean_text)


In [None]:
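num_topics = 5

# NOTE: the cell that builds `vectorizer` and the *_vectors matrices is missing
# from this export. The lines below are an assumed reconstruction: one TF-IDF
# vectorizer fitted on all three corpora so that they share a vocabulary.
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(pd.concat([restaurant_reviews["clean_text"], book_reviews["clean_text"], movie_reviews["clean_text"]]))
restaurant_vectors = vectorizer.transform(restaurant_reviews["clean_text"])
book_vectors = vectorizer.transform(book_reviews["clean_text"])
movie_vectors = vectorizer.transform(movie_reviews["clean_text"])

# Train one NMF model per review type
# (`alpha` was removed in scikit-learn 1.2; newer versions use alpha_W/alpha_H,
# whose scaling differs, so the value may need retuning there)
nmf_models = []
for vectors in [restaurant_vectors, book_vectors, movie_vectors]:
    nmf_model = NMF(n_components=num_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(vectors)
    nmf_models.append(nmf_model)

nmf_restaurant, nmf_book, nmf_movie = nmf_models

# Get the feature names from the vectorizer
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()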

In [None]:
# Get the feature names from the vectorizer
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()
review_types = ["Restaurant", "Book", "Movie"]
num_top_words = 10  # assumed value; not defined in the original cell

# Print the top words of every topic for each review type
for i, review_type in enumerate(review_types):
    print(f"Popular words in {review_type} Reviews:")
    for topic_idx, topic in enumerate(nmf_models[i].components_):
        # argsort is ascending, so the reversed slice picks the largest weights
        top_words = [feature_names[j] for j in topic.argsort()[:-num_top_words - 1:-1]]
        print(f"Topic #{topic_idx}: {' '.join(top_words)}")
    print()

## 2.2 Test Evaluation


In [11]:

test_set["clean_text"] = test_set["text"].apply(clean_text)
test_vectors = vectorizer.transform(test_set["clean_text"])

num_topics = 5
nmf_restaurant = NMF(n_components=num_topics, random_state=1, l1_ratio=.5).fit(restaurant_vectors)
nmf_book = NMF(n_components=num_topics, random_state=1, l1_ratio=.5).fit(book_vectors)
nmf_movie = NMF(n_components=num_topics, random_state=1, l1_ratio=.5).fit(movie_vectors)

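# NOTE: reconstruction_err_ is the Frobenius reconstruction error on the
# training matrix, so these values are not a true perplexity and do not
# measure fit to the test set; they are only normalized by the test-set size.
perplexity_restaurant = nmf_restaurant.reconstruction_err_ / test_vectors.shape[0]
perplexity_book = nmf_book.reconstruction_err_ / test_vectors.shape[0]
perplexity_movie = nmf_movie.reconstruction_err_ / test_vectors.shape[0]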

print("Perplexity scores:")
print("Restaurant reviews:", perplexity_restaurant)
print("Book reviews:", perplexity_book)
print("Movie reviews:", perplexity_movie)

Perplexity scores:
Restaurant reviews: 3.051355555507736
Book reviews: 5.192539598545418
Movie reviews: 6.886882970500702
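
The scores above are computed from each model's training reconstruction error. One way to measure fit on unseen text instead is to project the test documents onto the learned topics with `transform` and measure how well they are rebuilt. The sketch below shows the idea on hypothetical toy documents; the document lists and parameter values are assumptions, not the notebook's data or settings.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, for illustration only.
train_docs = [
    "the food was great and the service was friendly",
    "slow service but tasty food",
    "the plot of the film was dull",
    "great film with a strong plot",
]
test_docs = ["friendly service and great food", "a dull film"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)  # reuse the training vocabulary

model = NMF(n_components=2, init="nndsvda", random_state=1, max_iter=500)
model.fit(X_train)

# Project the unseen documents onto the learned topics, then rebuild them.
W_test = model.transform(X_test)
residual = X_test.toarray() - W_test @ model.components_
test_error = np.linalg.norm(residual)  # Frobenius norm of the residual

print(f"Held-out reconstruction error: {test_error:.3f}")
print(f"Per test document: {test_error / X_test.shape[0]:.3f}")
```

Because every model is evaluated on the same test matrix, errors computed this way would be comparable across the three domains.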


## 2.3 Results Analysis


In [14]:
from scipy.spatial.distance import jensenshannon

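def calculate_topic_diversity(nmf_model, vectors):
    """Mean pairwise Jensen-Shannon distance between topic-term vectors.

    `vectors` is unused and kept only so existing calls still work.
    """
    num_topics = nmf_model.n_components
    js_distances = []
    for i in range(num_topics):
        for j in range(i+1, num_topics):
            topic1 = nmf_model.components_[i]
            topic2 = nmf_model.components_[j]
            # scipy normalizes both vectors and returns the JS distance
            # (square root of the divergence); higher means more distinct
            js_distances.append(jensenshannon(topic1, topic2))
    return sum(js_distances) / len(js_distances)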

In [15]:
review_types = ["Restaurant", "Book", "Movie"]
review_vectors = [restaurant_vectors, book_vectors, movie_vectors]

# Train NMF models and compute topic diversity for each
for i, review_type in enumerate(review_types):
    nmf_model = NMF(n_components=num_topics, random_state=1, l1_ratio=.5).fit(review_vectors[i])
    # Use a new name so the LDA `topic_diversity` function is not overwritten
    diversity = calculate_topic_diversity(nmf_model, review_vectors[i])
    print(f"Topic diversity for {review_type} reviews: {diversity}")


Topic diversity for Restaurant reviews: 0.7379099680768332
Topic diversity for Book reviews: 0.6689850721887485
Topic diversity for Movie reviews: 0.5964365433673029



Based on the NMF results, the scores reported as perplexity (the reconstruction error of each model divided by the number of test documents) suggest that the NMF model performed best on the restaurant reviews domain with the lowest score of 3.05. The book reviews domain had the second-lowest score of 5.19, while the movie reviews domain had the highest score of 6.89, indicating that the NMF model struggled the most to reconstruct this domain. Because the reconstruction error grows with the size of the training matrix, part of this ordering may simply reflect the different dataset sizes.

The topic diversity scores suggest that the restaurant reviews domain had the highest diversity of topics discussed with a score of 0.74, followed by the book reviews domain with a score of 0.67, and the movie reviews domain with a score of 0.60. This suggests that the restaurant topics are the most distinct from one another compared to the other two domains.
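
For reference when reading these scores: `scipy.spatial.distance.jensenshannon` normalizes its inputs, uses the natural logarithm by default, and returns the square root of the Jensen-Shannon divergence. The largest possible value is therefore sqrt(ln 2), roughly 0.83, not 1. The pure-Python sketch below reproduces that definition on hypothetical vectors to show both extremes; `js_distance` is a made-up helper name.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (natural log) between two non-negative vectors."""
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]  # normalize to probability vectors
    q = [x / q_sum for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))

same = js_distance([1, 2, 3], [2, 4, 6])            # same direction
disjoint = js_distance([1, 1, 0, 0], [0, 0, 1, 1])  # no shared terms
upper_bound = math.sqrt(math.log(2))
print(same, disjoint, upper_bound)
```

Two topics that put weight on the same terms in the same proportions have distance 0, and two topics with no terms in common reach the upper bound.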


One possible explanation for this difference in performance is that the language used in restaurant reviews may be more straightforward and less ambiguous than the language used in book and movie reviews. Additionally, the topics that people write about in restaurant reviews may be more consistent and predictable than the topics that people write about in book and movie reviews, which can be more varied and subjective.

Overall, the NMF model performed best on the restaurant reviews domain for topic modelling, with the lowest perplexity score and highest topic diversity score. It's possible that the restaurant reviews dataset had higher quality data and covered more relevant words to indicate certain topics, leading to the better performance of the NMF model compared to the other domains.

# Model Comparison

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# LDA data
lda_perplexity_scores = [-94.72159922122955, -475.3260098501693, -630.8721218354841]
lda_topic_diversity = [0.08086228643504187, 0.044402400336352864, 0.0011695906432748538]

# NMF data
nmf_perplexity_scores = [3.051355555507736, 5.192539598545418, 6.886882970500702]
nmf_topic_diversity = [ 0.7379099680768332, 0.6689850721887485,  0.5964365433673029]

fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# Plot perplexity scores 
axs[0, 0].bar(range(len(lda_perplexity_scores)), lda_perplexity_scores)
axs[0, 0].set_xticks(range(len(lda_perplexity_scores)))
axs[0, 0].set_xticklabels(['Restaurant', 'Book', 'Movie'])
axs[0, 0].set_ylabel('Perplexity Score')
axs[0, 0].set_title('LDA Perplexity Scores')

# Plot topic diversity 
axs[0, 1].bar(range(len(lda_topic_diversity)), lda_topic_diversity)
axs[0, 1].set_xticks(range(len(lda_topic_diversity)))
axs[0, 1].set_xticklabels(['Restaurant', 'Book', 'Movie'])
axs[0, 1].set_ylabel('Topic Diversity')
axs[0, 1].set_title('LDA Topic Diversity')

# Plot perplexity scores 
axs[1, 0].bar(range(len(nmf_perplexity_scores)), nmf_perplexity_scores)
axs[1, 0].set_xticks(range(len(nmf_perplexity_scores)))
axs[1, 0].set_xticklabels(['Restaurant', 'Book', 'Movie'])
axs[1, 0].set_ylabel('Perplexity Score')
axs[1, 0].set_title('NMF Perplexity Scores')

# Plot topic diversity 
axs[1, 1].bar(range(len(nmf_topic_diversity)), nmf_topic_diversity)
axs[1, 1].set_xticks(range(len(nmf_topic_diversity)))
axs[1, 1].set_xticklabels(['Restaurant', 'Book', 'Movie'])
axs[1, 1].set_ylabel('Topic Diversity')
axs[1, 1].set_title('NMF Topic Diversity')

plt.subplots_adjust(wspace=0.3, hspace=0.5)

# Save the figure as a PNG image
plt.savefig('topic_model_results.png', dpi=300)


Comparing the LDA and NMF results, the two models behave differently across the domains. The two diversity measures are not on the same scale: the LDA score is a mean Jaccard overlap between top words (lower means more distinct topics), whereas the NMF score is a mean Jensen-Shannon distance between topic vectors (higher means more distinct topics). For LDA, the restaurant reviews domain had the highest overlap score of 0.081, followed by the book reviews domain with 0.044 and the movie reviews domain with the lowest score of 0.001. The NMF model had the highest topic diversity score for the restaurant reviews domain with 0.738, followed by the book reviews domain with 0.669 and the movie reviews domain with 0.596. The perplexity columns are not directly comparable either, since LDA reports a per-word likelihood bound and NMF a normalized reconstruction error. Within each model, however, the ranking is the same: restaurant reviews fit best, then book reviews, then movie reviews.

Overall, the performance of the two models depends on the domain being analyzed. Both models performed best on the restaurant reviews domain and worst on the movie reviews domain. The NMF topic vectors are well separated in all three domains, suggesting that NMF captures a wide range of topics in the data. On that basis, NMF can be said to outperform LDA here, as it generates results that better match our goal, which is identifying underlying topics.
