<span style="color: red; font-family: Calibri Light;">
  <h1><b>Topic Modelling with LDA: Evaluation</b></h1>
    <p style = "color: black"> This notebook evaluates LDA models trained on pre-processed data. Stop word removal during pre-processing is limited to common English stop words, the 100 most common words, and words occurring fewer than 10 times.<br>Any further stop word removal is done when creating the bag of words with the gensim library.</p>
</span>
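The frequency-based part of that stop word removal can be illustrated with a minimal sketch. This is not the pre-processing code used for this dataset; the function name `filter_by_frequency` and the toy corpus are assumptions made for illustration, and the thresholds are shrunk so the effect is visible on three documents.

```python
from collections import Counter

def filter_by_frequency(docs, top_n=100, min_count=10):
    """Drop the top_n most frequent tokens and tokens seen fewer than min_count times."""
    counts = Counter(token for doc in docs for token in doc)
    too_common = {token for token, _ in counts.most_common(top_n)}
    too_rare = {token for token, c in counts.items() if c < min_count}
    drop = too_common | too_rare
    return [[t for t in doc if t not in drop] for doc in docs]

# Toy corpus (hypothetical) with small thresholds.
toy_docs = [["metro", "car", "car"], ["car", "rent", "metro"], ["car", "visa"]]
filtered = filter_by_frequency(toy_docs, top_n=1, min_count=2)
print(filtered)
```

With the thresholds described above, the call would be `filter_by_frequency(docs, top_n=100, min_count=10)`.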

<span style="color: red; font-family: Calibri Light;">
  <h2><b>I. Setting Up Environment</b></h2>
</span>

In [1]:
#data transformation libraries
import pandas as pd
from pandas import option_context
import numpy as np
import ast
import csv

#NLP specific libraries
from gensim.models import Word2Vec

#topic modelling libraries
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel, LdaMulticore


#for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from prettytable import PrettyTable
import pyLDAvis
import pyLDAvis.gensim_models

from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

#others
import time
import os
import random
from glob import glob
import pickle


In [2]:
#set seed so that code output is deterministic
random.seed(200)  # Set the seed for Python's random module
np.random.seed(200)  # Set the seed for NumPy's random module

<span style="color: red; font-family: Calibri Light;">
  <h2><b>II. Import Data</b></h2>
</span>

<span style="color: red; font-family: Calibri Light;">
  <h2><b>a. Import preprocessed data</b></h2>
</span>

In [3]:
#import cleaned and pre-processed data

def list_converter(text):
    #to revert list->str conversion from pd.read_csv
    return ast.literal_eval(text)


data = pd.read_csv('../../../Data/lda_train.csv', converters ={'tokens':list_converter})
data = data.drop(columns = ['index'])
data.sort_values(by='date_created', inplace = True, ignore_index = True)
data.head()

Unnamed: 0,text_type,ID,date_created,year,long_text,clean_text,tokens,word_count
0,comment,c6d18gk,2012-09-25 07:57:13,2012,Yet i stared at the picture for a good 45 seco...,stare picture second miss,"[stare, picture, second, miss]",4
1,comment,c6d2fss,2012-09-25 09:13:23,2012,"[FYSR] = from your sister subreddit.\n\nIMO, i...",sister subreddit mildly interesting chance eve...,"[sister, subreddit, mildly, interesting, chanc...",18
2,comment,c6d46es,2012-09-25 12:32:08,2012,common give prince william harry a break he ju...,common prince harry break,"[common, prince, harry, break]",4
3,submission,1sur9h,2013-12-14 11:02:08,2013,Took this image of the Burj Khalifa from Souk ...,image burj khalifa souk bahar yesterday build ...,"[image, burj, khalifa, souk, bahar, yesterday,...",11
4,comment,ce1gf68,2013-12-14 12:07:24,2013,"Sorry pal, but you took an artistically not-so...",sorry impressive photo landmark resident karma...,"[sorry, impressive, photo, landmark, resident,...",8


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61376 entries, 0 to 61375
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   text_type     61376 non-null  object
 1   ID            61376 non-null  object
 2   date_created  61376 non-null  object
 3   year          61376 non-null  int64 
 4   long_text     61376 non-null  object
 5   clean_text    61376 non-null  object
 6   tokens        61376 non-null  object
 7   word_count    61376 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 3.7+ MB


<span style="color: red; font-family: Calibri Light;">
  <h2><b>b. Import Bag-of-Words</b></h2>
</span>

In [5]:
docs = data['tokens'].tolist()
# Create bigrams - code from gensim documentation page
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
            
dictionary = corpora.Dictionary(docs)

In [6]:
with open("../models/bow_corpus.pkl", "rb") as f:
    bow = pickle.load(f)

In [7]:
bow[0]

[(0, 1), (1, 1), (2, 1), (3, 1)]
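Each entry of the bag-of-words corpus is a list of `(token_id, count)` pairs. The sketch below reproduces that structure in plain Python to show how to read it. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, not gensim functions, and the id assignment (first-seen order) is an assumption that need not match the ids in the pickled corpus.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens that are in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# First document mirrors the first row of the data; the second is made up.
toy_docs = [["stare", "picture", "second", "miss"], ["picture", "picture", "photo"]]
vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, vocab) for doc in toy_docs]
print(toy_bow[0])
print(toy_bow[1])
```

Because the pairs only carry ids, a loaded corpus is meaningful only together with the dictionary that produced those ids.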

<span style="color: red; font-family: Calibri Light;">
  <h2><b>c. Import Word2Vec Model</b></h2>
</span>

In [8]:
w2v_model = Word2Vec.load('../models/w2v_model_lda_bigrams.model')

<span style="color: red; font-family: Calibri Light;">
  <h2><b>III. LDA Model Visual Evaluation</b></h2>
</span>
<p>A review of the top 10 words in each topic to determine the following:</p>
    <ul>
    <li>Do they make sense?</li>
    <li>Can the topic be given a label?</li>
    <li>Using a sample of the submissions in the training data, are the assigned topics accurate?</li>
    </ul>

In [11]:
#functions for evaluating the models
#function to compute coherence, diversity and perplexity metrics
def eval_metrics (lda_model, docs, dictionary, num_topics, corpus, top_n = 10):
    #Compute c_v score
    c_v = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_v')
    cv_lda = c_v.get_coherence()
    
    # Compute u_mass score
    u_mass = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='u_mass')
    umass_lda = u_mass.get_coherence()
    
    # Compute c_npmi score
    c_npmi = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_npmi')
    cnpmi_lda = c_npmi.get_coherence()
    
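    # gensim's log_perplexity returns the per-word likelihood bound (a negative log value),
    # not perplexity itself; the corresponding perplexity estimate is 2 ** (-bound)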
    perplexity = lda_model.log_perplexity(corpus)
    
    # Compute topic diversity
    top_words = [word for topic_id in range(num_topics) for word, _ in lda_model.show_topic(topic_id, topn=top_n)]
    diversity = len(set(top_words)) / (num_topics * top_n)
    
    print(f"For {num_topics} topics:\nCoherence(c_v) = {cv_lda},\nCoherence(c_npmi) = {cnpmi_lda},\nCoherence(u_mass) = {umass_lda},\nPerplexity = {perplexity},\nTopic Diversity = {diversity}\n")
    
    #return cv_lda, umass_lda, cnpmi_lda, perplexity, diversity

#function to check average word similarities for the topics 
def average_similarity(lda_model, num_topics, w2v_model, top_n=10):
    # 1. Extract the top_n words for each topic
    top_words_per_topic =[] 
    for topic_id in range(num_topics):
        top_words = lda_model.show_topic(topic_id, topn=top_n)
        top_words = [word for word, _ in top_words]
        top_words_per_topic.append(top_words)
        
    # 2. Compute pairwise similarities for each topic
    average_similarities = []
    for top_words in top_words_per_topic:
        total_similarity = 0
        count = 0
        for i in range(len(top_words)):
            for j in range(i+1, len(top_words)):  # Compare each word with the words after it
                if top_words[i] in w2v_model.wv and top_words[j] in w2v_model.wv:
                    similarity = w2v_model.wv.similarity(top_words[i], top_words[j])
                    total_similarity += similarity
                    count += 1
        # use a distinct name so the enclosing function's name is not shadowed
        topic_average = total_similarity / count if count != 0 else 0
        average_similarities.append(topic_average)
    #print average similarities for each topic
    for idx, avg_sim in enumerate(average_similarities):
        print(f"Average similarity for topic {idx}: {avg_sim:.4f}")
        
    #return average_similarities
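The topic diversity score computed in `eval_metrics` is the share of unique words among the top words of all topics. A minimal standalone sketch, using a made-up pair of topics (the function name `topic_diversity` and the word lists are assumptions for illustration):

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

# Two hypothetical topics that share one word ("car").
toy_topics = [["car", "speed", "traffic"], ["bank", "rent", "car"]]
diversity = topic_diversity(toy_topics)
print(round(diversity, 3))
```

A value of 1.0 means no word is shared between the top-word lists of any two topics; lower values indicate overlap. When every topic contributes exactly `top_n` words, the denominator equals `num_topics * top_n`.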

<span style="color: red; font-family: Calibri Light;">
  <h2><b>a. model 1: 5 topics</b></h2>
</span>

In [9]:
model_1 = LdaModel.load('../models/model_1_5tpcs/lda_model_1_5tpcs')

In [12]:
#quantitative evaluation
eval_metrics (lda_model = model_1,docs = docs,
              dictionary = dictionary, num_topics = model_1.num_topics, 
               corpus = bow, top_n = 10)

average_similarity(lda_model = model_1, num_topics = model_1.num_topics, 
                   w2v_model = w2v_model, top_n=10)




For 5 topics:
Coherence(c_v) = 0.6049680135364921,
Coherence(c_npmi) = 0.01809764921147028,
Coherence(u_mass) = -3.6348576844708056,
Perplexity = -8.133242504850717,
Topic Diversity = 1.0

Average similarity for topic 0: 0.4982
Average similarity for topic 1: 0.4522
Average similarity for topic 2: 0.2859
Average similarity for topic 3: 0.4426
Average similarity for topic 4: 0.5138
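The value printed as "Perplexity" is the number returned by gensim's `log_perplexity`, which is a per-word likelihood bound rather than perplexity itself. gensim's own log output reports the perplexity estimate as 2 raised to the negative of that bound. A minimal sketch of the conversion, applied to the figure printed above (the helper name `bound_to_perplexity` is an assumption):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

# Per-word bound printed above for the 5-topic model.
bound_5_topics = -8.133242504850717
perplexity_5_topics = bound_to_perplexity(bound_5_topics)
print(f"perplexity estimate: {perplexity_5_topics:.1f}")
```

A bound closer to zero corresponds to a lower perplexity. Note that the bound here is computed on the training corpus, so it says little about how the model generalises to unseen documents.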


In [13]:
model_1.print_topics(num_topics = -1)

[(0,
  '0.016*"report" + 0.012*"area" + 0.011*"car" + 0.011*"close" + 0.010*"fast" + 0.009*"speed" + 0.008*"metro" + 0.008*"traffic" + 0.008*"park" + 0.008*"turn"'),
 (1,
  '0.010*"bank" + 0.009*"rent" + 0.007*"sell" + 0.007*"send" + 0.007*"week" + 0.007*"property" + 0.007*"cheap" + 0.006*"order" + 0.006*"option" + 0.006*"charge"'),
 (2,
  '0.010*"mall" + 0.010*"walk" + 0.008*"reddit" + 0.007*"video" + 0.007*"stuff" + 0.006*"outside" + 0.006*"night" + 0.006*"tip" + 0.006*"visit" + 0.006*"social"'),
 (3,
  '0.007*"kid" + 0.006*"care" + 0.006*"hard" + 0.006*"speak" + 0.006*"sorry" + 0.006*"culture" + 0.006*"learn" + 0.006*"situation" + 0.006*"school" + 0.005*"believe"'),
 (4,
  '0.007*"passport" + 0.006*"allow" + 0.006*"government" + 0.006*"apply" + 0.006*"travel" + 0.005*"base" + 0.005*"rule" + 0.005*"employee" + 0.005*"india" + 0.005*"job"')]

In [14]:
#dataset of topics and their top words
num_topics = model_1.num_topics

topics_words = []

for topic in range(num_topics):
    topic_words = model_1.show_topic(topic, topn = 10)
    words = [word[0] for word in topic_words]
    topics_words.append({"topic": topic, "words": words})
    

#create a dataframe
model_1_topics_df = pd.DataFrame(topics_words)

with option_context('display.max_colwidth', None):
    display(model_1_topics_df)
    


Unnamed: 0,topic,words
0,0,"[report, area, car, close, fast, speed, metro, traffic, park, turn]"
1,1,"[bank, rent, sell, send, week, property, cheap, order, option, charge]"
2,2,"[mall, walk, reddit, video, stuff, outside, night, tip, visit, social]"
3,3,"[kid, care, hard, speak, sorry, culture, learn, situation, school, believe]"
4,4,"[passport, allow, government, apply, travel, base, rule, employee, india, job]"


<ul>
    <li>topic 0: <em>'car', 'fast', 'speed', 'park', 'turn',</em> --> <em><b>urban_mobility</b></em>.
    That means of the top 10 words, 5 relate to the same topic.</li><br>
    <li>topic 1: <em> 'rent', 'property', 'cheap'</em> --> <em><b>accommodation</b></em>. However, the terms <em> 'sell', 'send', 'cheap', 'order', 'option', 'charge'</em> --> <em><b>shopping and purchases</b></em>. This topic contains two sub-topics.</li><br>
    <li>topic 2: <em>'mall', 'walk', 'outside', 'night', 'visit'</em> might be said to indicate <em><b>recreation</b></em>, but it's a bit of a stretch.</li><br>
    <li>topic 3: contains a jumble of words that don't fit together to specify one coherent topic</li><br>
    <li>topic 4: <em>'passport', 'travel' </em>--> <em><b>immigration & travel</b></em>, but the topic also contains <em>'apply', 'base', 'employee', 'job'</em> --> <em><b>employment_opportunities</b></em></li>
</ul>
<p> Of the five topics extracted, only two have as many as 5 coherent words that indicate a topic of interest, i.e. topic 0 for <em><b>urban_mobility</b></em> and topic 1 for <em><b>shopping and purchases</b></em>. Topics 1 and 4 each contain sub-topics that are not related to each other. This model is likely not the best fit for our data.</p>

In [15]:
#attach topic label to topic terms

topic_label ={
    0: "urban_mobility",
    1: "accommodation/shopping_pur",
    2: "undefined/recreation",
    3: "undefined",
    4: "employment_opportunities/immigration_travel",
}

model_1_topics_df['label'] = model_1_topics_df['topic'].map(topic_label)

with option_context('display.max_colwidth', None):
    display(model_1_topics_df)

Unnamed: 0,topic,words,label
0,0,"[report, area, car, close, fast, speed, metro, traffic, park, turn]",urban_mobility
1,1,"[bank, rent, sell, send, week, property, cheap, order, option, charge]",accommodation/shopping_pur
2,2,"[mall, walk, reddit, video, stuff, outside, night, tip, visit, social]",undefined/recreation
3,3,"[kid, care, hard, speak, sorry, culture, learn, situation, school, believe]",undefined
4,4,"[passport, allow, government, apply, travel, base, rule, employee, india, job]",employment_opportunities/immigration_travel


In [16]:
#include column for most probable topic for each entry

top_topic_per_document = []

for doc in bow:
    topics = model_1.get_document_topics(doc, minimum_probability = 0)
    top_topic = sorted(topics, key=lambda x: x[1], reverse = True)[0][0]
    top_topic_per_document.append(top_topic)
    
#add column to data dataframe for the selected topic
data['top_topic'] = top_topic_per_document    

In [19]:
# Build a topic id -> label mapping from model_1_topics_df
topic_to_label = model_1_topics_df.set_index('topic')['label'].to_dict()

# Map each document's top topic to its label
data['label'] = data['top_topic'].map(topic_to_label)
data.head()

Unnamed: 0,text_type,ID,date_created,year,long_text,clean_text,tokens,word_count,top_topic,label
0,comment,c6d18gk,2012-09-25 07:57:13,2012,Yet i stared at the picture for a good 45 seco...,stare picture second miss,"[stare, picture, second, miss]",4,3,undefined
1,comment,c6d2fss,2012-09-25 09:13:23,2012,"[FYSR] = from your sister subreddit.\n\nIMO, i...",sister subreddit mildly interesting chance eve...,"[sister, subreddit, mildly, interesting, chanc...",18,1,accommodation/shopping_pur
2,comment,c6d46es,2012-09-25 12:32:08,2012,common give prince william harry a break he ju...,common prince harry break,"[common, prince, harry, break]",4,3,undefined
3,submission,1sur9h,2013-12-14 11:02:08,2013,Took this image of the Burj Khalifa from Souk ...,image burj khalifa souk bahar yesterday build ...,"[image, burj, khalifa, souk, bahar, yesterday,...",11,2,undefined/recreation
4,comment,ce1gf68,2013-12-14 12:07:24,2013,"Sorry pal, but you took an artistically not-so...",sorry impressive photo landmark resident karma...,"[sorry, impressive, photo, landmark, resident,...",8,2,undefined/recreation


In [21]:
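#include column for second most probable topic for each entry

second_topic_per_document = []

for doc in bow:
    # minimum_probability = 0 keeps low-probability topics in the result,
    # so a second entry exists even for documents dominated by one topic
    topics = model_1.get_document_topics(doc, minimum_probability = 0)
    second_topic = sorted(topics, key=lambda x: x[1], reverse = True)[1][0]
    second_topic_per_document.append(second_topic)
    
#add column to data dataframe for the second most probable topic
data['topic_2'] = second_topic_per_document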
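gensim's `get_document_topics` omits topics below its probability threshold unless `minimum_probability` is lowered, so the list for a document can contain fewer than two entries, and indexing position 1 then fails. A minimal sketch of a defensive ranking helper, operating on plain lists of `(topic_id, probability)` pairs (the name `top_k_topics` and the sample distributions are assumptions for illustration):

```python
def top_k_topics(doc_topics, k=2):
    """Return the k most probable topic ids, padded with None if fewer are available."""
    ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
    ids = [topic_id for topic_id, _ in ranked[:k]]
    return ids + [None] * (k - len(ids))

# Hypothetical topic distributions for two documents.
full = [(0, 0.05), (1, 0.60), (2, 0.30), (3, 0.05)]
sparse = [(1, 0.97)]  # only one topic passed the threshold

first, second = top_k_topics(full)
print(first, second)
print(top_k_topics(sparse))
```

Padding with `None` leaves a missing value in the resulting column instead of raising an error, which makes documents with a single dominant topic easy to spot.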

In [26]:
data['label_2'] = data['topic_2'].map(topic_to_label)

#visually evaluate a small subset of submissions
sample = data[data.text_type == 'submission'].sample(n = 10, random_state = 42)

with option_context('display.max_colwidth', None):
    display(sample[['date_created', 'long_text', "top_topic",'label', "topic_2", 'label_2']])

Unnamed: 0,date_created,long_text,top_topic,label,topic_2,label_2
38303,2022-07-17 08:23:29,"Anybody else been forced to start a business after losing a job, and wake up every morning wishing you never had to open your eyes? Every day is a minor heart attack asking yourself, ""can I do this ethically?"" All you see is people conning each other, and all you hear is, ""this is how you have to play the game"" I've been conned by people who I thought were friends. And then been in business with another ""friend"" with questionable business practices. I'm not suicidal, but I do keep wishing this would all end.",2,undefined/recreation,3,undefined
59518,2023-06-18 20:08:38,"require some suggestions - banking / accounts / loans Hi All,\n\nNeed some suggestions or opinions and better personal experiences.\n\nI have been banking and have account and credit cards with a bank here for almost a decade. It's a conventional bank, but sooner or later I am going to go for a home loan - for which I would prefer an Islamic bank. Should I open some account with an Islamic bank to have some relationship over period of time OR it doesn't matter - when the need to loan arises I can go for looking around for loan and based on my AECB reports bank would provide rates - so effectively meaning that longer or shorter relationship wouldn't mean anything it just that transaction and the score at the point of time which would matter?",1,accommodation/shopping_pur,4,employment_opportunities/immigration_travel
3608,2020-02-10 10:02:39,Current Global Economic Situation,4,employment_opportunities/immigration_travel,3,undefined
60089,2023-06-19 20:13:26,I need to get a broken hard disk repaired. I’ve got a ton of data on it with no backup. Any idea if there are any reliable places in Dubai or Sharjah for data recovery ? Tried searching online and didn’t find anything convincing.,1,accommodation/shopping_pur,3,undefined
61240,2023-06-21 17:53:14,"What to do when the neighbour parks like this? Hello Dubai community!\n\nGuys it is getting out of hands, I never met this guy, but he keeps on parking like this recently. \n\nAny advice from professional residents how to deal with this?",4,employment_opportunities/immigration_travel,0,urban_mobility
6118,2020-05-11 11:11:08,Any news on gyms reopening? Does anybody have an insider scoop on when they're reopening?\n\nNewspaper people if you are reading you can chime in. In the form of a news article in 2 days or something.,2,undefined/recreation,4,employment_opportunities/immigration_travel
36450,2022-06-06 21:33:34,Experience with Techem ACs I’m planning to move in a building which has Techem as their AC providers. Can someone share their experience with the bills? I am hearing they are quite high.,1,accommodation/shopping_pur,3,undefined
9116,2020-06-16 13:35:29,"Entry permit validity Hey fellas, just came to know that the entry permit validity for residents stuck abroad is 21 days. It’s there on their website and the call centre also confirmed it. But given that many airports are still not opened up to international travel and fewer flights operating, what happens if one is not able to make it into UAE before the 21 day limit? Any input is appreciated. Thanks.",4,employment_opportunities/immigration_travel,1,accommodation/shopping_pur
52698,2023-04-19 08:54:12,Where can I find reasonably-priced jewelry and gold in Dubai? Looking to get my mam a gift Where can I find reasonably-priced jewelry and gold in Dubai? Looking to get my mam a gift,1,accommodation/shopping_pur,2,undefined/recreation
42845,2022-09-22 17:29:32,"Any CUD current or former students here? If so, any advice, stories or highlights for a soon to be student at the Uni? General Uni advice appreciated as well. \nI’m hopefully gonna be attending the spring semester next year and studying Computer Engineering but been getting cold feet since i don’t really know anyone there and don’t know what to expect.\n\n\nAnd while i’ve heard plenty of praise for this Uni i’ve also heard rumors of there being a whole social status hierarchy deal going on which has me nervous-ish.\n\nEdit: Just so yk i already heard the song and dance of Unis here not being great but i don’t have a choice but to study in Dubai/Sharjah.",4,employment_opportunities/immigration_travel,2,undefined/recreation


---

<span style="color: red; font-family: Calibri Light;">
  <h2><b>b. model 2: 10 topics</b></h2>
</span>

In [27]:
model_2 = LdaModel.load('../models/model_2_10tpcs/lda_model_2_10tpcs')

In [28]:
#quantitative evaluation
eval_metrics (lda_model = model_2,docs = docs,
              dictionary = dictionary, num_topics = model_2.num_topics, 
               corpus = bow, top_n = 10)

average_similarity(lda_model = model_2, num_topics = model_2.num_topics, 
                   w2v_model = w2v_model, top_n=10)




For 10 topics:
Coherence(c_v) = 0.4895138013398749,
Coherence(c_npmi) = -0.02281372631638639,
Coherence(u_mass) = -4.798116025994018,
Perplexity = -8.206242643921952,
Topic Diversity = 1.0

Average similarity for topic 0: 0.7092
Average similarity for topic 1: 0.5578
Average similarity for topic 2: 0.4880
Average similarity for topic 3: 0.5324
Average similarity for topic 4: 0.5290
Average similarity for topic 5: 0.6461
Average similarity for topic 6: 0.5922
Average similarity for topic 7: 0.3237
Average similarity for topic 8: 0.4213
Average similarity for topic 9: 0.4569


In [29]:
display(model_2.print_topics(num_topics = -1))

[(0,
  '0.023*"fast" + 0.021*"car" + 0.021*"speed" + 0.020*"traffic" + 0.017*"limit" + 0.015*"license" + 0.012*"vehicle" + 0.011*"slow" + 0.011*"ticket" + 0.011*"plate"'),
 (1,
  '0.019*"bank" + 0.017*"rent" + 0.013*"property" + 0.011*"send" + 0.011*"plan" + 0.011*"credit" + 0.011*"account" + 0.010*"sell" + 0.010*"market" + 0.010*"charge"'),
 (2,
  '0.010*"area" + 0.010*"hour" + 0.009*"week" + 0.009*"close" + 0.008*"open" + 0.008*"walk" + 0.008*"visit" + 0.007*"mall" + 0.007*"away" + 0.006*"usually"'),
 (3,
  '0.030*"speak" + 0.029*"learn" + 0.028*"school" + 0.022*"word" + 0.019*"arabic" + 0.018*"english" + 0.017*"play" + 0.016*"language" + 0.015*"rich" + 0.013*"fuck"'),
 (4,
  '0.025*"test" + 0.020*"tip" + 0.017*"covid" + 0.015*"daily" + 0.014*"health" + 0.012*"thread" + 0.010*"positive" + 0.009*"medical" + 0.009*"pregnant" + 0.009*"june"'),
 (5,
  '0.018*"report" + 0.014*"reddit" + 0.012*"message" + 0.012*"website" + 0.011*"video" + 0.010*"detail" + 0.010*"google" + 0.010*"write" + 0

In [30]:
#dataset of topics and their top words
num_topics = model_2.num_topics

topics_words = []

for topic in range(num_topics):
    topic_words = model_2.show_topic(topic, topn = 10)
    words = [word[0] for word in topic_words]
    topics_words.append({"topic": topic, "words": words})
    

#create a dataframe (named for model 2 so the model 1 dataframe is not overwritten)
model_2_topics_df = pd.DataFrame(topics_words)

with option_context('display.max_colwidth', None):
    display(model_2_topics_df)
    


Unnamed: 0,topic,words
0,0,"[fast, car, speed, traffic, limit, license, vehicle, slow, ticket, plate]"
1,1,"[bank, rent, property, send, plan, credit, account, sell, market, charge]"
2,2,"[area, hour, week, close, open, walk, visit, mall, away, usually]"
3,3,"[speak, learn, school, word, arabic, english, play, language, rich, fuck]"
4,4,"[test, tip, covid, daily, health, thread, positive, medical, pregnant, june]"
5,5,"[report, reddit, message, website, video, detail, google, write, link, news]"
6,6,"[cheap, order, restaurant, delivery, expensive, quality, contract, bill, store, extra]"
7,7,"[middle, agent, class, crime, east, low, saudi, project, mask, power]"
8,8,"[kid, movie, parent, child, wife, muslim, wear, hate, religion, accident]"
9,9,"[care, consider, situation, passport, culture, hard, mention, matter, arab, believe]"


<ul>
    <li>topic 0: The top 10 words are all representative terms --> <em><b>urban_mobility</b></em>.</li><br>
    <li>topic 1: <em>'bank', 'credit', 'account', 'charge'</em> --> <em><b>finance/financial_services</b></em>. However, the terms <em>'rent', 'property', 'sell', 'market'</em> --> <em><b>accommodation</b></em>. This topic contains two sub-topics.</li><br>
    <li>topic 2: <em>'area', 'walk', 'visit', 'mall'</em> might be said to indicate <em><b>recreation</b></em>, but the remaining words (<em>'hour', 'week', 'close', 'open', 'away', 'usually'</em>) are too general to support a label.</li><br>
    <li>topic 3: <em>'speak', 'learn', 'school', 'word', 'arabic', 'english', 'language'</em> --> <em><b>language/education</b></em>; <em>'play', 'rich'</em> and the expletive do not fit.</li><br>
    <li>topic 4: <em>'test', 'covid', 'health', 'positive', 'medical', 'pregnant'</em> --> <em><b>health/covid</b></em>; <em>'tip', 'daily', 'thread', 'june'</em> do not fit.</li><br>
    <li>topic 5: <em>'reddit', 'message', 'website', 'video', 'google', 'link', 'news'</em> --> <em><b>online_media</b></em>, although this is not obviously a topic of interest.</li><br>
    <li>topic 6: <em>'cheap', 'order', 'expensive', 'quality', 'store'</em> --> <em><b>shopping and purchases</b></em>; <em>'restaurant', 'delivery'</em> --> <em><b>food/dining_experience</b></em>. This topic contains two sub-topics.</li><br>
    <li>topic 7: contains a jumble of words that don't fit together to specify one coherent topic, or a topic of interest</li><br>
    <li>topic 8: <em>'kid', 'parent', 'child', 'wife'</em> --> <em><b>family</b></em>; <em>'muslim', 'religion'</em> --> <em><b>religion</b></em>. The remaining words do not fit either.</li><br>
    <li>topic 9: contains a jumble of words that don't fit together to specify one coherent topic, or a topic of interest</li>
</ul>
<p> Of the ten topics extracted, topic 0 is the clearest: all 10 of its top words point to <em><b>urban_mobility</b></em>. Several of the other topics either mix unrelated sub-topics or have no clear theme, and the c_v, c_npmi and u_mass coherence scores are all lower than for the 5-topic model.<br>This model is likely not the best fit for our data.</p>
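The coherence figures printed earlier for the two models can be placed side by side. The sketch below copies the reported values into a dictionary and picks the model with the higher score for each measure. The helper `best_by` and the dictionary layout are assumptions for illustration; for all three measures a higher value (for u_mass, closer to zero) is read as more coherent.

```python
# Coherence scores copied from the printed evaluation output of each model.
reported = {
    "model_1 (5 topics)": {
        "c_v": 0.6049680135364921,
        "c_npmi": 0.01809764921147028,
        "u_mass": -3.6348576844708056,
    },
    "model_2 (10 topics)": {
        "c_v": 0.4895138013398749,
        "c_npmi": -0.02281372631638639,
        "u_mass": -4.798116025994018,
    },
}

def best_by(metric, scores):
    """Name of the model with the highest value for the given metric."""
    return max(scores, key=lambda name: scores[name][metric])

winners = {metric: best_by(metric, reported) for metric in ("c_v", "c_npmi", "u_mass")}
for metric, name in winners.items():
    print(f"{metric}: {name}")
```

Coherence scores are only one input; the visual review of the top words and sampled submissions remains the deciding check in this notebook.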

In [88]:
data.columns

Index(['text_type', 'ID', 'date_created', 'year', 'long_text', 'clean_text',
       'tokens', 'word_count', 'top_topic', 'label', 'topic_2', 'label_2'],
      dtype='object')