# **The seventh in-class-exercise (40 points in total, 10/20/2021)**

Question description: Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks:

## (1) (15 points) Generate K topics using LDA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
# Write your code here
# pip install pyLDAvis
# pip install gensim
# pip install spacy

import nltk

import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models import LsiModel
# spacy for lemmatization
import spacy




# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)



# Setting up nltk
# nltk.download('stopwords')


stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])


In [2]:
df = pd.read_csv("Reviews.CSV") # Import the Reviews.CSV as pandas dataframe


In [3]:
## Cleaning the reviews
data = df.Reviews.values.tolist() # Convert the Reviews column to a list of strings
data = [re.sub(r'\s+', ' ', sentence) for sentence in data] # collapse line breaks and repeated whitespace
data = [re.sub("'", " ", sentence) for sentence in data] # replace apostrophes with a space

def sent_to_words(reviews):
    """
    Input: list of reviews --> list of strings
    Function: Tokenize each review, lowercase it and drop punctuation
    Output: list of token lists, one per review
    """
    sentence = []
    for review in reviews:
        # simple_preprocess tokenizes and lowercases; deacc=True also strips accent marks
        sentence.append(gensim.utils.simple_preprocess(str(review), deacc=True))
    return sentence
tokenize_reviews = list(sent_to_words(data))


In [4]:
## Bigram and trigram models
bigram = gensim.models.Phrases(tokenize_reviews, min_count=5, threshold=100) # learn bigram phrases
bigram_model = gensim.models.phrases.Phraser(bigram) # frozen bigram model
trigram_model = gensim.models.phrases.Phraser(gensim.models.Phrases(bigram[tokenize_reviews], threshold=100)) # trigram model built on the bigrams

In [5]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(reviews):
    """
    Input: list of tokenized reviews (list of lists)
    Func: remove all stopwords
    Output: tokenized reviews without stop words
    """
    # The reviews are already tokenized, so filter the tokens directly
    return [[word for word in review if word not in stop_words] for review in reviews]

def make_bigrams(reviews):
    """
    Input: tokenize reviews
    Func: make bigrams
    Output: bigrams of reviews
    """
    return [bigram_model[review] for review in reviews]

def make_trigrams(reviews):
    """
    Input: tokenize reviews
    Func: make trigrams
    Output: trigrams of bigram reviews
    """
    return [trigram_model[bigram_model[review]] for review in reviews]

def lemmatization(reviews, allowed=['NOUN', 'ADJ', 'VERB', 'ADV']):
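    """
    Input: tokenized bigram reviews
    Func: lemmatize each token and keep only nouns, adjectives, verbs and adverbs
    Output: list of lemma lists, one per review
    Note: relies on the global spaCy pipeline `nlp`, which must be loaded before this is called
    """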
    output_reviews= []
    for sent in reviews:
        review = nlp(" ".join(sent)) 
        output_reviews.append([token.lemma_ for token in review if token.pos_ in allowed])
    return output_reviews

In [6]:

bigrame_reviews = make_bigrams(remove_stopwords(tokenize_reviews)) # bigrams of the reviews without stopwords

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) # initialize the English spaCy model

lemmatize_reviews = lemmatization(bigrame_reviews, ['NOUN', 'ADJ', 'VERB', 'ADV']) # lemmas of nouns, adjectives, verbs and adverbs


In [7]:

id2word = corpora.Dictionary(lemmatize_reviews) # Create the token <-> id dictionary

corpus = [id2word.doc2bow(review) for review in lemmatize_reviews] # bag of words: (token_id, count) pairs per review
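
As a rough illustration of what `doc2bow` produces, the sketch below builds the same kind of `(token_id, count)` pairs with the standard library only. The helper names `build_vocab` and `to_bow` and the toy reviews are assumptions made for this example; they are not part of gensim, and gensim may assign different ids to the same tokens.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical, already-tokenized reviews
toy_reviews = [["movie", "good", "movie"], ["action", "good"]]
vocab = build_vocab(toy_reviews)
toy_corpus = [to_bow(doc, vocab) for doc in toy_reviews]
print(toy_corpus)
```

Each inner list plays the role of one entry of `corpus`: the review is reduced to counts, so word order is lost at this step.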


In [8]:
# Create LDA model
LDA_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=20, random_state=100, update_every=1,
                                            chunksize=100, passes=10, alpha='auto', per_word_topics=True)
review_lda = LDA_model[corpus]

In [9]:
print('Perplexity: ', LDA_model.log_perplexity(corpus))  # per-word likelihood bound (more negative means a worse fit)
coherence_model_lda = CoherenceModel(model=LDA_model, texts=lemmatize_reviews, dictionary=id2word, coherence='c_v') # initialize the coherence model
coherence_lda = coherence_model_lda.get_coherence() # get the coherence score
print('Coherence Score: ', coherence_lda)

Perplexity:  -10.742683907454639
Coherence Score:  0.42085161661501297
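
gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, which is why the value above is negative. A minimal sketch of the conversion, assuming the base-2 convention `2 ** (-bound)` that gensim appears to use in its own log messages; the function name `bound_to_perplexity` is made up for this example.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Bound printed by the cell above
perplexity = bound_to_perplexity(-10.742683907454639)
print(round(perplexity, 1))
```

Lower perplexity is better, but it is computed here on the training corpus, so it says little about how interpretable the topics are; that is what the coherence score is for.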


In [10]:
# Show the top words of each topic
def get_lda_topics(model, num_topics):
    """
    Input: LDA model, required topics
    Func: create a dataframe of topics
    Output: Pandas data frame
    """
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=20)
        word_dict['Topic ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

get_lda_topics(LDA_model, 10) # first 10 of the 20 topics

| | Topic 01 | Topic 02 | Topic 03 | Topic 04 | Topic 05 | Topic 06 | Topic 07 | Topic 08 | Topic 09 | Topic 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | familiar | movie | sort | hear | middle | forgettable | reunion | battle | film | steal |
| 1 | serious | marvel | forward | dweller | direct | personal | lens | open | character | backstory |
| 2 | pacing | good | ability | train | bear | version | estimate | explain | scene | struggle |
| 3 | min | see | visually_stunning | fast | awe | night | indian | ta_lo | well | gang |
| 4 | flaw | watch | praise | sympathy | going | insane | drench | beautifully | great | stick |
| 5 | finish | action | silly | gate | overrate | overshadow | desperately | acting | fight | dc |
| 6 | difference | new | typical | assassin | surpass | otherwise | farm | simple | story | extraordinary |
| 7 | comedic_time | superhero | motivation | escape | left | old_school | entrie | break | really | catch |
| 8 | exact | amazing | tone | pendant | household | generation | god | reference | feel | today |
| 9 | wise | make | develop | destruction | praise | authentic | stoic | leung | mcu | refreshing |


In [11]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Input: 
    Dictionary of words freq --> dict
    corpus of words --> list
    reviews --> list
    limit --> int
    start --> int
    step --> int
    Func: train one LDA model per topic count and compute its coherence score
    Output: 
    model_list --> list: list of models
    coherence_values --> list: coherence score of each model
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)  # use the dictionary argument, not the global
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [12]:
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=lemmatize_reviews, start=1, limit=20, step=1)

In [13]:
for num_topics, cv in zip(range(1, 20, 1), coherence_values):
    print("Topics Number=", num_topics, " has Coherence Value of", round(cv, 4))

Topics Number= 1  has Coherence Value of 0.3057
Topics Number= 2  has Coherence Value of 0.3168
Topics Number= 3  has Coherence Value of 0.3088
Topics Number= 4  has Coherence Value of 0.3073
Topics Number= 5  has Coherence Value of 0.3123
Topics Number= 6  has Coherence Value of 0.3058
Topics Number= 7  has Coherence Value of 0.3109
Topics Number= 8  has Coherence Value of 0.3162
Topics Number= 9  has Coherence Value of 0.312
Topics Number= 10  has Coherence Value of 0.3167
Topics Number= 11  has Coherence Value of 0.3143
Topics Number= 12  has Coherence Value of 0.311
Topics Number= 13  has Coherence Value of 0.3092
Topics Number= 14  has Coherence Value of 0.3137
Topics Number= 15  has Coherence Value of 0.3024
Topics Number= 16  has Coherence Value of 0.3039
Topics Number= 17  has Coherence Value of 0.3094
Topics Number= 18  has Coherence Value of 0.3138
Topics Number= 19  has Coherence Value of 0.317
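
One way to pick K from a sweep like the one above is to take the position of the largest coherence value and map it back to the topic count. A minimal sketch, assuming the sweep started at `start=1` with `step=1`; `best_num_topics` is a hypothetical helper, and the scores are the first five values printed above.

```python
def best_num_topics(coherence_values, start=1, step=1):
    """Return (list index, topic count, score) of the highest coherence value."""
    best_idx = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return best_idx, start + best_idx * step, coherence_values[best_idx]

# First five coherence values from the sweep (K = 1..5)
scores = [0.3057, 0.3168, 0.3088, 0.3073, 0.3123]
idx, k, score = best_num_topics(scores)
print(idx, k, score)
```

With the notebook's variables this would be `best_num_topics(coherence_values)` followed by `model_list[idx]`. The scores in the sweep differ by only about 0.01 and the models are trained without a fixed `random_state`, so the selected K may change between runs.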


In [14]:
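# The coherence value printed here is hard-coded, not computed from coherence_values
print("Maximum value of coherence for ascending order is: 0.3152\n")
optimal_model = model_list[3]  # index 3 is the 4-topic model (the sweep started at K=1)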
model_topics = optimal_model.show_topics(formatted=False)

Maximum value of coherence for ascending order is: 0.3152



In [15]:
def format_topics_sentences(ldamodel=LDA_model, corpus=corpus, texts=data):
    """
    Input: LDA_model, corpus, reviews
    Func: find the dominant topic of each review, with its contribution and keywords
    Output: pandas dataframe
    """
    rows = []
    for doc_topics in ldamodel[corpus]:
        # With per_word_topics=True gensim returns a tuple; its first item is the topic distribution
        if isinstance(doc_topics, tuple):
            doc_topics = doc_topics[0]
        if not doc_topics:
            continue
        # Keep only the dominant (highest-weight) topic
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        topic_keywords = ", ".join(word for word, prop in ldamodel.show_topic(topic_num))
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    output_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    output_df = pd.concat([output_df, contents], axis=1)
    return output_df

review_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data) # Format the reviews df with keywords


df_dominant_topic = review_keywords.reset_index()
df_dominant_topic.columns = ['Document_Num', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

| | Document_Num | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 3.0 | 0.8658 | movie, character, marvel, action, scene, mcu, ... | I ll start by saying that if you re looking fo... |
| 1 | 1 | 2.0 | 0.9822 | movie, film, good, marvel, scene, character, f... | After 10 years of almost every movie being.arm... |
| 2 | 2 | 2.0 | 0.7452 | movie, film, good, marvel, scene, character, f... | A -BIG- Screen Mini Review. Viewed Sept.05, 20... |
| 3 | 3 | 2.0 | 0.9006 | movie, film, good, marvel, scene, character, f... | Perfect Fantasy film to watch with full family... |
| 4 | 4 | 2.0 | 0.8611 | movie, film, good, marvel, scene, character, f... | Keeping it short. This movie had it all. Great... |
| 5 | 5 | 2.0 | 0.9865 | movie, film, good, marvel, scene, character, f... | Brought to you by the Truth Tellers.Film is gr... |
| 6 | 6 | 2.0 | 0.9701 | movie, film, good, marvel, scene, character, f... | Haven t been much of a Marvel guy even with th... |
| 7 | 7 | 2.0 | 0.9872 | movie, film, good, marvel, scene, character, f... | I had very few expectations from this one, giv... |
| 8 | 8 | 3.0 | 0.9212 | movie, character, marvel, action, scene, mcu, ... | Shang-Chi and the Legend of the Ten Rings is a... |
| 9 | 9 | 2.0 | 0.9922 | movie, film, good, marvel, scene, character, f... | First off, this is a decent movie.Sure, there ... |
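
The dominant-topic step boils down to taking the highest-weight pair from each review's topic distribution. A small sketch with made-up distributions; `dominant_topic` is a hypothetical helper, and the `(topic_id, weight)` layout mirrors what `ldamodel[corpus]` yields when `per_word_topics` is off.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight, or None if empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distributions for three reviews (the last one is empty)
toy_distributions = [
    [(0, 0.05), (2, 0.09), (3, 0.86)],
    [(2, 0.98), (3, 0.02)],
    [],
]
result = [dominant_topic(d) for d in toy_distributions]
print(result)
```

Taking the maximum directly avoids sorting the whole distribution when only the top entry is needed.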


In [16]:
# Keep the most representative review (highest contribution) for each topic
sorted_reviews = pd.DataFrame()

sent_topics_outdf_grpd = review_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sorted_reviews = pd.concat([sorted_reviews,
                                grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], axis=0)
sorted_reviews.reset_index(drop=True, inplace=True)
sorted_reviews.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

sorted_reviews.to_csv("Review_Topic.CSV", index= False)
sorted_reviews

| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 0.9974 | movie, marvel, good, film, great, action, stor... | Shang-Chi is a movie that nobody expected. An ... |
| 1 | 1.0 | 0.9952 | movie, marvel, character, film, well, good, se... | What more do you want?I honestly didn t want t... |
| 2 | 2.0 | 0.997 | movie, film, good, marvel, scene, character, f... | When Iron Man hit theatres back in 2008, there... |
| 3 | 3.0 | 0.9953 | movie, character, marvel, action, scene, mcu, ... | Overall, the movie was worth a watch. A lot of... |


## (2) (15 points) Generate K topics using LSA. The number of topics K should be decided by the coherence score. Then summarize what the topics are. You may refer to the code here:

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python

In [17]:
def compute_lsa_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Input: 
    Dictionary of words freq --> dict
    corpus of words --> list
    reviews --> list
    limit --> int
    start --> int
    step --> int
    Func: train one LSA model per topic count and compute its coherence score
    Output: 
    model_list --> list: list of models
    coherence_values --> list: coherence score of each model
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)  # use the dictionary argument, not the global
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values


In [18]:
LSA_model = LsiModel(corpus=corpus, id2word=id2word, chunksize=100) # num_topics is left at the LsiModel default (200)
review_lsa = LSA_model[corpus]

In [19]:
coherence_model_lsa = CoherenceModel(model=LSA_model, texts=lemmatize_reviews, dictionary=id2word, coherence='c_v') # initialize the coherence model
coherence_lsa = coherence_model_lsa.get_coherence() # get the coherence score
print('LSA Coherence Score: ', coherence_lsa)

LSA Coherence Score:  0.2280469429409865


In [20]:
lsa_model_list, lsa_coherence_values = compute_lsa_coherence_values(dictionary=id2word, corpus=corpus, texts=lemmatize_reviews, start=1, limit=20, step=1)

In [21]:
for num_topics, cv in zip(range(1, 20, 1), lsa_coherence_values):
    print("Topics Number=", num_topics, " has Coherence Value of", round(cv, 4))

Topics Number= 1  has Coherence Value of 0.3166
Topics Number= 2  has Coherence Value of 0.3113
Topics Number= 3  has Coherence Value of 0.3206
Topics Number= 4  has Coherence Value of 0.3052
Topics Number= 5  has Coherence Value of 0.3326
Topics Number= 6  has Coherence Value of 0.337
Topics Number= 7  has Coherence Value of 0.3321
Topics Number= 8  has Coherence Value of 0.3213
Topics Number= 9  has Coherence Value of 0.3176
Topics Number= 10  has Coherence Value of 0.3219
Topics Number= 11  has Coherence Value of 0.3119
Topics Number= 12  has Coherence Value of 0.3114
Topics Number= 13  has Coherence Value of 0.3175
Topics Number= 14  has Coherence Value of 0.3132
Topics Number= 15  has Coherence Value of 0.3217
Topics Number= 16  has Coherence Value of 0.3169
Topics Number= 17  has Coherence Value of 0.3099
Topics Number= 18  has Coherence Value of 0.3135
Topics Number= 19  has Coherence Value of 0.3003


In [22]:
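# The coherence value printed here is hard-coded, not computed from lsa_coherence_values
print("Maximum value of coherence for ascending order is: 0.345\n")
optimal_model = lsa_model_list[5]  # index 5 is the 6-topic model (the sweep started at K=1)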
model_topics = optimal_model.show_topics(formatted=False)

Maximum value of coherence for ascending order is: 0.345



In [23]:
def format_topics_sentences(ldamodel=LSA_model, corpus=corpus, texts=data):
    """
    Input: LSA_model, corpus, reviews
    Func: find the dominant topic of each review, with its weight and keywords
          (LSA weights are unnormalized, so Perc_Contribution is not a percentage here)
    Output: pandas dataframe
    """
    rows = []
    for doc_topics in ldamodel[corpus]:
        if not doc_topics:
            continue
        # Keep only the dominant (highest-weight) topic
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        topic_keywords = ", ".join(word for word, prop in ldamodel.show_topic(topic_num))
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    output_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    output_df = pd.concat([output_df, contents], axis=1)
    return output_df

review_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data) # Format the reviews df with keywords


df_dominant_topic = review_keywords.reset_index()
df_dominant_topic.columns = ['Document_Num', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

| | Document_Num | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 3.5507 | movie, film, marvel, character, well, good, sc... | I ll start by saying that if you re looking fo... |
| 1 | 1 | 0.0 | 2.5361 | movie, film, marvel, character, well, good, sc... | After 10 years of almost every movie being.arm... |
| 2 | 2 | 0.0 | 9.961 | movie, film, marvel, character, well, good, sc... | A -BIG- Screen Mini Review. Viewed Sept.05, 20... |
| 3 | 3 | 2.0 | 0.5555 | film, marvel, character, movie, go, also, real... | Perfect Fantasy film to watch with full family... |
| 4 | 4 | 0.0 | 3.7966 | movie, film, marvel, character, well, good, sc... | Keeping it short. This movie had it all. Great... |
| 5 | 5 | 0.0 | 5.5196 | movie, film, marvel, character, well, good, sc... | Brought to you by the Truth Tellers.Film is gr... |
| 6 | 6 | 0.0 | 1.9423 | movie, film, marvel, character, well, good, sc... | Haven t been much of a Marvel guy even with th... |
| 7 | 7 | 0.0 | 4.1938 | movie, film, marvel, character, well, good, sc... | I had very few expectations from this one, giv... |
| 8 | 8 | 0.0 | 12.3782 | movie, film, marvel, character, well, good, sc... | Shang-Chi and the Legend of the Ten Rings is a... |
| 9 | 9 | 0.0 | 5.9951 | movie, film, marvel, character, well, good, sc... | First off, this is a decent movie.Sure, there ... |


In [24]:
# Keep the most representative review (highest weight) for each topic
sorted_reviews = pd.DataFrame()

sent_topics_outdf_grpd = review_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sorted_reviews = pd.concat([sorted_reviews,
                                grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], axis=0)
sorted_reviews.reset_index(drop=True, inplace=True)
sorted_reviews.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

sorted_reviews.to_csv("Review_Topic_LSA.CSV", index= False)
sorted_reviews

| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 38.3054 | movie, film, marvel, character, well, good, sc... | LikesGreat Pacing: Shang Chi has a lot of thin... |
| 1 | 1.0 | 5.5644 | movie, film, also, well, character, marvel, mc... | First of all, my husband and I love superhero ... |
| 2 | 2.0 | 1.3977 | film, marvel, character, movie, go, also, real... | Absolutely enjoyed the film from first to last... |
| 3 | 3.0 | 2.3399 | marvel, mcu, great, scene, good, movie, really... | Oke I saw the suicide squat yesterday and just... |
| 4 | 4.0 | 0.4225 | good, well, character, movie, marvel, feel, fi... | Total waste of time. Iron Man 2008 is so much ... |
| 5 | 5.0 | 0.0 | great, film, marvel, really, movie, character,... | Shang-Chi and the Legend of the Ten Rings is... |


## (3) (10 points) Compare the results generated by the two topic modeling algorithms. Which one is better? You should explain the reasons in detail.

In [None]:
# Write your answer here (no code needed for this question)
"""
LSA (also called LSI) is a much simpler and faster method than LDA.
Both have the same purpose: to find a set of topics that best describes a collection of documents.
However, they work differently.
LSA is the simpler of the two: it factorizes the term-document matrix and relies only on word frequencies, not word order.
Its topic weights are unnormalized (up to 38.3 in the table above), so they are harder to interpret. In some cases the simplicity is a benefit, but in our case it was not.
LDA is a more complex and slower probabilistic model. Like LSA it ignores word order, but it describes each review as a probability distribution over topics,
and in our case its results look somewhat better than those of LSA/LSI.
Coherence: the first LDA model (20 topics) scored 0.4209 against 0.2280 for the default LSA/LSI model; in the K sweeps the two were close (best LDA 0.3170 at K=19, best LSA/LSI 0.3370 at K=6)
Topics: LDA gathered more useful topics and keyword collections than LSA/LSI
Speed: LSA/LSI is much faster
Text: the representative reviews selected by LDA are better than those selected by LSA/LSI
"""