# LDA for COVID-19 Tweet Topic Identification

This notebook identifies the primary topics in COVID-19 vaccine tweets using latent Dirichlet allocation (LDA). It is based on several guides written by others:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/;
https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/;
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24




First, we load the packages we'll need; Gensim does most of the work for our LDA. We then load our pre-processed data.

In [1]:
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import os



In [2]:
tweets = os.listdir('data/pre-processed')



In [4]:
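# Read every pre-processed CSV and stack them into a single DataFrame
tweets_dfs = []
for tweet in tweets:
    tw_file = os.path.join('data/pre-processed', tweet)
    df = pd.read_csv(tw_file)
    tweets_dfs.append(df)
tweets_clean = pd.concat(tweets_dfs)
# Drop tweets whose token column is empty
tweets_clean = tweets_clean[tweets_clean['text_cln_tok'].notna()]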



In [5]:
print("""
    Size of combined df:\t{}
    First five rows:

    {}
""".format(
    tweets_clean.shape,
    tweets_clean.head()
)
)



    Size of combined df:	(155092, 10)
    First five rows:

                           created_at  \
0  Fri Apr 23 04:00:26 +0000 2021   
1  Fri Apr 23 04:00:35 +0000 2021   
2  Fri Apr 23 04:00:15 +0000 2021   
3  Fri Apr 23 04:00:36 +0000 2021   
4  Fri Apr 23 04:00:10 +0000 2021   

                                                text in_reply_to_screen_name  \
0  @kimsmann @makaumutua @WHO @ahmednasirlaw @Don...                kimsmann   
1  The world can now properly thank US, UK, Brazi...                     NaN   
2  Truck drivers from Manitoba are receiving thei...                     NaN   
3  Here we go...free passes if you vaccinate? htt...                     NaN   
4  Here are some opportunities in the coming days...                     NaN   

   retweet_count favorite_count              source               id_str  \
0            0.0              0  Twitter for iPhone  1385443422522183680   
1            0.0              0  Twitter for iPhone  1385443459432017921   
2  

Open question: should we also create bigram and trigram lists, and train additional models on them? If so, Gensim's gensim.models.phrases module (Phrases and Phraser) could be used to build them.

Next, we use Gensim to build a dictionary that maps each unique word to an integer id. We filter the dictionary to drop words that appear in fewer than 50 tweets or in more than 80% of tweets. We then create a corpus of the tweets in bag-of-words form: for each tweet, a list of (word id, count) pairs giving the number of times each word appears in that tweet.
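
To make the dictionary and corpus concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is an illustration only, not Gensim's implementation: the helpers build_dictionary and doc_to_bow, the toy tweets, and the thresholds are all made up for this example, and Gensim's own id assignment and filtering details may differ.

```python
from collections import Counter

# Toy tokenized "tweets" (hypothetical data, not from the real dataset).
toy_tweets = [
    ["vaccine", "appointment", "available"],
    ["vaccine", "patent", "waiver"],
    ["vaccine", "appointment", "appointment", "today"],
]


def build_dictionary(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id.

    A token is kept if it appears in at least `no_below` documents and in
    no more than a fraction `no_above` of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / len(docs) <= no_above
    )
    return {tok: idx for idx, tok in enumerate(kept)}


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Keep tokens seen in at least 2 tweets but in no more than 80% of tweets.
toy_dict = build_dictionary(toy_tweets, no_below=2, no_above=0.8)
toy_corpus = [doc_to_bow(doc, toy_dict) for doc in toy_tweets]

print(toy_dict)
print(toy_corpus)
```

In this toy run "vaccine" is dropped because it appears in every tweet, and the words seen in only one tweet are dropped as too rare, so a tweet can end up with an empty bag of words.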

In [6]:
import ast
tweets_clean_lst = []
for tweet in tweets_clean['text_cln_tok']:
    tweets_clean_lst.append(ast.literal_eval(tweet))
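
The conversion above is needed because a CSV file stores each token list as text, so pandas reads it back as a string rather than a list. A small sketch of that round trip, assuming the column was written as the printed form of a Python list (the example value is made up):

```python
import ast

# What one cell of the 'text_cln_tok' column might look like after being
# read back from CSV (hypothetical value): a string, not a list.
stored = "['vaccine', 'appointment', 'available']"

# literal_eval parses Python literals only; unlike eval, it does not
# execute arbitrary code.
tokens = ast.literal_eval(stored)

print(type(stored).__name__, type(tokens).__name__)
print(tokens)
```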



In [7]:
single_dict = corpora.Dictionary(tweets_clean_lst)
# Keep words that appear in at least 50 tweets and in no more than 80% of tweets
single_dict.filter_extremes(no_below=50, no_above=0.80, keep_n=1000000)

single_corpus = [single_dict.doc2bow(tweet) for tweet in tweets_clean_lst]



In [14]:
print(single_dict.token2id)

 3421, 'lied': 3422, 'profits': 3423, 'laminate': 3424, 'voters': 3425, 'committed': 3426, 'hoax': 3427, 'bars': 3428, 'feet': 3429, 'waived': 3430, 'petition': 3431, 'bought': 3432, 'ppe': 3433, 'activity': 3434, 'shut': 3435, 'appeals': 3436, 'shift': 3437, 'thousands': 3438, 'exports': 3439, 'confident': 3440, 'participants': 3441, 'tokyo': 3442, 'institutions': 3443, 'kudos': 3444, 'reporter': 3445, 'contracts': 3446, 'departments': 3447, 'invest': 3448, 'handle': 3449, 'educated': 3450, 'her': 3451, 'river': 3452, 'consequences': 3453, 'account': 3454, 'word': 3455, 'wto': 3456, 'bigpharma': 3457, 'closely': 3458, 'watching': 3459, 'handling': 3460, 'scenes': 3461, 'expanded': 3462, 'professionals': 3463, 'show': 3464, 'mainstream': 3465, 'some': 3466, 'posting': 3467, 'omg': 3468, 'maine': 3469, 'docs': 3470, 'blaming': 3471, 'female': 3472, 'mp': 3473, 'signal': 3474, 'travellers': 3475, 'many': 3476, 'estimate': 3477, 'slowing': 3478, 'its': 3479, 'circulating': 3480, 'itself':

In [15]:
print(single_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]


Now, we can try training an initial model.

In [20]:
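#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
single_model = gensim.models.ldamodel.LdaModel(
    corpus=single_corpus,
    id2word=single_dict,
    num_topics=6,          # number of topics to fit
    random_state=100,      # fixed seed for reproducibility
    update_every=1,        # update the model after every chunk
    chunksize=100,         # number of tweets per training chunk
    passes=10,             # number of passes over the whole corpus
    alpha='auto',          # learn the document-topic prior from the data
    per_word_topics=True,  # also compute the most likely topics for each word
)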



In [22]:
single_model.print_topics()



[(0,
  '0.067*"waiver" + 0.066*"patent" + 0.056*"protections" + 0.046*"india" + 0.035*"supports" + 0.025*"canada" + 0.023*"ip" + 0.023*"global" + 0.019*"effective" + 0.016*"south"'),
 (1,
  '0.111*"covid" + 0.098*"19" + 0.093*"vaccine" + 0.027*"s" + 0.024*"vaccinated" + 0.019*"coronavirus" + 0.018*"people" + 0.018*"biden" + 0.010*"administration" + 0.010*"support"'),
 (2,
  '0.114*"vaccination" + 0.052*"available" + 0.040*"appointments" + 0.039*"05" + 0.038*"cvs" + 0.037*"sign" + 0.035*"near" + 0.032*"00" + 0.029*"06" + 0.023*"help"'),
 (3,
  '0.085*"pfizer" + 0.056*"waiving" + 0.055*"says" + 0.045*"moderna" + 0.029*"protection" + 0.024*"know" + 0.022*"astrazeneca" + 0.021*"died" + 0.017*"biontech" + 0.016*"variant"'),
 (4,
  '0.117*"vaccines" + 0.091*"covid19" + 0.022*"t" + 0.021*"property" + 0.020*"intellectual" + 0.019*"world" + 0.018*"pandemic" + 0.014*"patents" + 0.014*"backs" + 0.011*"like"'),
 (5,
  '0.020*"it" + 0.018*"variants" + 0.014*"said" + 0.014*"time" + 0.013*"need" + 0.
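
Each topic above is returned as a formatted string, which is awkward to work with programmatically. As a sketch, the weights and words can be pulled out with a regular expression. Here parse_topic is a hypothetical helper written for this example, not part of Gensim, and it assumes the weight*"word" format shown in the output above.

```python
import re

# Matches one term such as 0.067*"waiver" and captures the weight and word.
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.067*"waiver" + 0.066*"patent"' into [(word, weight), ...]."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]


# First three terms of topic 0, copied from the output above.
topic_0 = '0.067*"waiver" + 0.066*"patent" + 0.056*"protections"'
print(parse_topic(topic_0))
```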

Perplexity and coherence are two ways to assess our model's quality:

In [23]:
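# Perplexity
# log_perplexity returns a per-word likelihood bound, not the perplexity itself.
# A bound closer to zero is better; perplexity is 2**(-bound), where lower is better.
perp = single_model.log_perplexity(single_corpus)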
print(perp)

-7.115721600186615
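
The number printed above is negative because it is a per-word likelihood bound rather than the perplexity itself. Assuming the conversion that Gensim uses in its own log messages (perplexity is 2 raised to the negative of the bound), a quick sketch of turning the printed value into a perplexity estimate:

```python
# Per-word likelihood bound printed by log_perplexity above.
per_word_bound = -7.115721600186615

# Assumed conversion: perplexity = 2 ** (-bound). Lower perplexity is
# better, which corresponds to a bound closer to zero.
perplexity = 2 ** (-per_word_bound)

print(round(perplexity, 1))
```

Because this is computed on the training corpus rather than on held-out tweets, it is best read as a rough comparison figure between models rather than as a measure of generalization.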


In [None]:
# Compute Coherence Score
single_coherence_model_lda = CoherenceModel(model=single_model, texts=tweets_clean_lst, dictionary=single_dict, coherence='c_v')
single_coherence_lda = single_coherence_model_lda.get_coherence()

In [24]:
print(single_coherence_lda)

0.30482965200722806


We can also visualize the topics and their overlap:

In [25]:
pyLDAvis.enable_notebook()
single_plot = pyLDAvis.gensim_models.prepare(single_model, single_corpus, single_dict)
single_plot




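We can also try building a Mallet LDA model. Mallet fits LDA by Gibbs sampling rather than the online variational Bayes used by Gensim's LdaModel, and it often yields more coherent topics. Note that the model below uses 8 topics rather than 6, so its coherence score is not a like-for-like comparison with the Gensim model above.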

In [10]:
# Download MalletLDA with: wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
# Note: gensim.models.wrappers (including LdaMallet) was removed in Gensim 4.0, so this cell needs Gensim 3.x.
mallet_path = '/usr/lib/mallet-2.0.8/bin/mallet'
mallet_lda = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=single_corpus, num_topics=8, id2word=single_dict)



In [14]:
print(mallet_lda.show_topics())

[(0, '0.198*"vaccine" + 0.075*"coronavirus" + 0.029*"news" + 0.012*"live" + 0.011*"good" + 0.011*"read" + 0.008*"uk" + 0.008*"great" + 0.007*"supply" + 0.007*"rollout"'), (1, '0.138*"covid" + 0.123*"19" + 0.088*"vaccine" + 0.035*"johnson" + 0.021*"health" + 0.018*"biden" + 0.017*"cdc" + 0.017*"astrazeneca" + 0.011*"support" + 0.011*"blood"'), (2, '0.028*"cases" + 0.025*"dose" + 0.024*"health" + 0.023*"today" + 0.021*"vaccinations" + 0.015*"appointment" + 0.014*"deaths" + 0.014*"state" + 0.013*"received" + 0.012*"april"'), (3, '0.106*"vaccines" + 0.078*"covid19" + 0.028*"world" + 0.026*"shot" + 0.025*"pandemic" + 0.023*"countries" + 0.016*"million" + 0.015*"doses" + 0.014*"share" + 0.014*"life"'), (4, '0.100*"covid19" + 0.044*"india" + 0.016*"free" + 0.015*"government" + 0.014*"covid19vaccine" + 0.014*"covidvaccine" + 0.012*"vaccines" + 0.012*"information" + 0.011*"oxygen" + 0.010*"hospital"'), (5, '0.081*"vaccine" + 0.057*"covid" + 0.038*"19" + 0.030*"pfizer" + 0.030*"vaccines" + 0.013

In [16]:
coherence_model_malletlda = CoherenceModel(model=mallet_lda,texts=tweets_clean_lst, dictionary=single_dict, coherence='c_v')
coherence_malletlda = coherence_model_malletlda.get_coherence()



In [17]:
print(coherence_malletlda)

0.42764912960929946


Other resources used: https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/; https://stackoverflow.com/questions/66759852/no-module-named-pyldavis; http://mallet.cs.umass.edu/download.php; https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/; https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/; https://medium.com/swlh/topic-modeling-lda-mallet-implementation-in-python-part-2-602ffb38d396