# LDA for COVID-19 Tweet Topic Identification

This notebook identifies the primary topics in COVID-19 vaccine tweets. It is based on several guides written by others:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/; 
https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/; https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24




First, we load the packages we'll need. We'll primarily use Gensim and the Gensim wrapper for Mallet for our LDA. We'll also load our pre-processed, labeled data.

In [1]:
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import os



In [2]:
# Read every labeled JSON file in data/labeled and combine them into one dataframe
tweets = os.listdir('data/labeled')
tweets_dfs = []
for tweet in tweets:
    tw_file = os.path.join('data/labeled', tweet)
    df = pd.read_json(tw_file)
    tweets_dfs.append(df)
tweets_clean = pd.concat(tweets_dfs)

In [26]:
df_test = pd.read_json('data/labeled/2021-04-19_cln_labeled.json')

In [28]:
df_test.shape

(40040, 8)

Next, we keep only the tweets that are negative or neutral (score of 0 or below), in order to better identify topics related to vaccine hesitancy.

In [4]:
tweets_negneut = tweets_clean[tweets_clean['score']<=0]

In [5]:
print("""
    Size of combined df:\t{}
    First five rows:

    {}
""".format(
    tweets_negneut.shape,
    tweets_negneut.head()
)
)



    Size of combined df:	(71810, 8)
    First five rows:

                     created_at  \
1 2021-04-27 04:03:50+00:00   
2 2021-04-27 04:04:06+00:00   
3 2021-04-27 04:03:50+00:00   
5 2021-04-27 04:04:00+00:00   
6 2021-04-27 04:04:03+00:00   

                                            text_cln  \
1  don t know liberals s given doses capita count...   
2       pharmeasy initiates covid  vaccination drive   
3  india receive batch russia s covid  vaccine rd...   
5  feel freely large moms family covid seriously ...   
6    europe news coronavirus eu sues astrazeneca ...   

                                        text_cln_tok  positive  neutral  \
1  [don know liberals given doses capita countrie...     0.000    0.722   
2                        [pharmeasy initiates drive]     0.000    1.000   
3  [india receive batch russia rdif india news la...     0.000    1.000   
5  [feel freely large moms family seriously fuck ...     0.282    0.397   
6  [europe news sues astrazeneca deliv

First (if desired), we can run a grid search over candidate parameters for both the Gensim and Mallet LDA models to identify the most promising ones. To do this, call the function choose_lda_models() in pipeline.py with the text_cln_tok column of the full tweets dataframe. To run the pipeline successfully (for the Mallet LDA model), you'll need to download Mallet (with: wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip), unzip it, and set the variable MALLET_PATH in pipeline.py to the path of the mallet-2.0.8/bin/mallet executable (e.g., MALLET_PATH = '/usr/lib/mallet-2.0.8/bin/mallet'). You also need to ensure that all packages used in pipeline.py are installed.

Note on grid search: the parameters to search over are currently hard-coded in pipeline.py. To search over different values, adjust them in choose_lda_models. The initial step of creating the dictionary and corpus used by the LDA models also takes parameters, which are currently hard-coded to exclude words that appear fewer than 50 times, to exclude words that appear in more than 80% of the documents, and to keep only the top 1,000,000 words. These can also be changed in pipeline.py, where build_corpus_dict is called by choose_lda_models.
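
The code of choose_lda_models is not shown in this notebook, so the following is only a minimal sketch of the general idea behind a parameter grid search. The names `expand_grid`, `grid_search`, `example_grid`, and `toy_score` are hypothetical and are not taken from pipeline.py. In practice the scoring function would train an LDA model with the given parameters and return its coherence.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the candidate values in param_grid."""
    keys = list(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*(param_grid[k] for k in keys))]


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return (params, score) pairs, best first."""
    scored = [(params, score_fn(params)) for params in expand_grid(param_grid)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate values, for illustration only
example_grid = {"num_topics": [5, 10], "alpha": [0.1, 0.5]}


def toy_score(params):
    # Stand-in for "train a model with these params and compute its coherence"
    return -abs(params["num_topics"] - 10) - params["alpha"]


ranked = grid_search(example_grid, toy_score)
print(ranked[0])  # best-scoring (params, score) pair
```

Keeping the grid as a plain dict of lists makes it easy to change the search space without touching the search loop.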

In [72]:
import importlib
import pipeline
# Reload so that edits to pipeline.py are picked up without restarting the kernel
importlib.reload(pipeline)

<module 'pipeline' from '/mnt/c/Users/natra/Documents/Education/UChicago/MLforPP/ml-for-pp_vaccine-hesitancy/pipeline.py'>

In [68]:
test_tweets = df_test
# Drop rows with an empty token list: the string form of an empty list is '[]' (length 2)
test_tweets = test_tweets[test_tweets['text_cln_tok'].astype(str).map(len) != 2]

In [69]:
test_tweets.head()

Unnamed: 0,created_at,text_cln,text_cln_tok,positive,neutral,negative,compound,score
0,2021-04-19 04:02:07+00:00,prelim data israel suggest people vaccinated w...,"['prelim', 'data', 'israel', 'suggest', 'peopl...",0.0,0.919,0.081,-0.296,-1
1,2021-04-19 04:02:11+00:00,sc hear pleas seeking covid vaccination years...,"['hear', 'pleas', 'seeking', 'years', 'age', '...",0.0,1.0,0.0,0.0,0
2,2021-04-19 04:02:34+00:00,covid vaccination officers staff cgda prevent...,"['officers', 'staff', 'cgda', 'preventive', 'm...",0.0,1.0,0.0,0.0,0
3,2021-04-19 04:02:34+00:00,pm chill now kiddo picked gfs mum shes sweet ...,"['chill', 'now', 'kiddo', 'picked', 'gfs', 'mu...",0.24,0.76,0.0,0.7184,1
4,2021-04-19 04:02:34+00:00,mass vaccination weeks help country fighting c...,"['mass', 'weeks', 'help', 'country', 'fighting...",0.178,0.658,0.164,0.0516,1


In [73]:
#results = pipeline.choose_lda_models(tweets_negneut['text_cln_tok'])
results = pipeline.choose_lda_models(test_tweets['text_cln_tok'])

Created doc_lst [['prelim', 'data', 'israel', 'suggest', 'people', 'pfizer', 'develop', 'fold', 'lower', 'viral', 'load', 'people', 'this', 'indicate', 'reduced', 'transmissibility', 'viral', 'load', 'identified', 'key', 'driver', 'transmission'], ['hear', 'pleas', 'seeking', 'years', 'age', 'april'], ['officers', 'staff', 'cgda', 'preventive', 'measures', 'contain', 'spread', 'officers', 'staff', 'cgda', 'circular', 'dated', 'post'], ['chill', 'now', 'kiddo', 'picked', 'gfs', 'mum', 'shes', 'sweet', 'hes', 'night', 'feed', 'animals', 'got', 'motivated', 'filled', 'form', 'week', 'early'], ['mass', 'weeks', 'help', 'country', 'fighting', 'speed', 'economic', 'recovery', 'process'], ['mass', 'campaign'], ['india', 'wait', 'long', 'time', 'foreign', 'available', 'country', 'companies', 'pfizer', 'moderna', 'apply', 'emergency', 'use', 'licences', 'allowed', 'recently', 'changed', 'rules'], ['ask', 'number', 'means', 'virus', 'ask', 'number', 'means', 'age', 'person', 'thing', 'rollout', 

In [74]:
results.sort_values('Coherence',ascending=False)

Unnamed: 0,LDA Model,Params,Time Elapsed,Coherence,Perplexity,Topics
0,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:51.352003,0.313928,-6.321619,"[(0, 0.055*""available"" + 0.052*""appointments"" ..."
1,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:51.501660,0.293632,-6.207316,"[(0, 0.057*""appointments"" + 0.056*""available"" ..."
6,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:40.763946,0.291672,-6.171491,"[(0, 0.047*""available"" + 0.038*""news"" + 0.037*..."
8,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:42.839278,0.288162,-6.176527,"[(0, 0.047*""available"" + 0.038*""news"" + 0.037*..."
2,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:49.581861,0.287996,-6.294063,"[(0, 0.056*""available"" + 0.052*""appointments"" ..."
7,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:42.629080,0.285318,-6.198679,"[(0, 0.047*""available"" + 0.038*""news"" + 0.037*..."
3,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:43.655137,0.279818,-6.193472,"[(0, 0.050*""available"" + 0.041*""news"" + 0.040*..."
4,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:46.638070,0.278139,-6.200984,"[(0, 0.050*""available"" + 0.041*""news"" + 0.040*..."
5,GensimLDA,"{'chunksize': 2000, 'num_topics': 5, 'alpha': ...",0 days 00:00:42.075033,0.274915,-6.194202,"[(0, 0.050*""available"" + 0.041*""news"" + 0.040*..."


In [120]:
results.to_csv('lda_models_gs_3day.csv')

  


Once the ideal parameters are selected, we can create the models manually to examine specific aspects in more depth (and to create the dynamic visualizations below). First, we use Gensim to create a dictionary that maps each unique word to an integer id. (We still filter out of the dictionary words that appear in too few or too many tweets.) Second, we create a corpus of the tweets, which records how many times each word (identified by id) appears in each tweet.

In [56]:
import ast
tweets_nn_lst = []
for tweet in tweets_negneut['text_cln_tok']:
    tweets_nn_lst.append(ast.literal_eval(tweet))

SyntaxError: invalid syntax (<unknown>, line 1)

In [57]:
import ast
# text_cln_tok holds each token list as a string, so parse it back into a Python list
tweets_nn_lst = []
for tweet in test_tweets['text_cln_tok']:
    tweets_nn_lst.append(ast.literal_eval(tweet))

In [61]:
single_dict = corpora.Dictionary(tweets_nn_lst)
single_dict.filter_extremes(no_below=100, no_above=0.80, keep_n=1000000)

single_corpus = [single_dict.doc2bow(tweet) for tweet in tweets_nn_lst]

To see the contents of the dictionary and the corpus, run the two cells below:

In [62]:
print(single_dict.token2id)

{'people': 0, 'week': 1, 'world': 2, 'cases': 3, 'india': 4, 'age': 5, 'years': 6, 'doses': 7, 'getting': 8, 'need': 9, 'pfizer': 10, 'shot': 11, 'new': 12, 'appointments': 13, 'available': 14, 'near': 15, 'sign': 16, 'know': 17, 'today': 18, 'government': 19, 'adults': 20, 'april': 21, 'eligible': 22, 'dose': 23, 'says': 24, 'health': 25, 'news': 26, 'astrazeneca': 27, 'received': 28, 'second': 29, 'pandemic': 30, 'states': 31}


In [14]:
print(single_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
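
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea. It is a simplified stand-in, not Gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are made up for illustration, the filter counts documents rather than total occurrences, and ids are assigned alphabetically, which will not match the ids Gensim assigns.

```python
from collections import Counter


def build_vocab(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id.

    A token is kept if it appears in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens that are in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy documents, for illustration only
bow_docs = [
    ["viral", "load", "people", "viral"],
    ["people", "pfizer"],
    ["people", "viral", "dose"],
]
bow_vocab = build_vocab(bow_docs, no_below=2, no_above=0.8)
bow_corpus = [to_bow(doc, bow_vocab) for doc in bow_docs]
print(bow_vocab)
print(bow_corpus)
```

In this toy example "people" is dropped because it appears in every document, and the words that appear in only one document are dropped as too rare, so only "viral" survives the filter.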
  


Next, we can train the model with the parameters we identified above (or, with any other parameters).

In [129]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
single_model = gensim.models.ldamodel.LdaModel(corpus=single_corpus,
                                           id2word=single_dict,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=2000,
                                           passes=10,
                                           alpha=0.1,
                                           eta=1,
                                           per_word_topics=True)

  


In [127]:
# View the topics identified in the above model
single_model.print_topics()

  


[(0,
  '0.072*"available" + 0.071*"covid" + 0.071*"19" + 0.069*"vaccination" + 0.068*"appointments" + 0.058*"sign" + 0.057*"near" + 0.056*"04" + 0.054*"00" + 0.052*"cvs"'),
 (1,
  '0.040*"dose" + 0.038*"cases" + 0.036*"doses" + 0.025*"covid19" + 0.022*"people" + 0.021*"vaccinated" + 0.020*"million" + 0.020*"new" + 0.020*"received" + 0.018*"deaths"'),
 (2,
  '0.020*"covid19" + 0.020*"vaccines" + 0.020*"people" + 0.013*"t" + 0.013*"vaccinated" + 0.012*"vaccine" + 0.009*"india" + 0.008*"need" + 0.007*"getting" + 0.007*"know"'),
 (3,
  '0.108*"covid" + 0.102*"vaccine" + 0.097*"19" + 0.038*"johnson" + 0.029*"s" + 0.020*"j" + 0.020*"vaccines" + 0.019*"coronavirus" + 0.018*"covid19" + 0.016*"vaccination"'),
 (4,
  '0.019*"eligible" + 0.017*"april" + 0.014*"today" + 0.014*"state" + 0.014*"appointment" + 0.013*"age" + 0.012*"health" + 0.012*"county" + 0.011*"receive" + 0.010*"students"')]

We can use the coherence score (here, the c_v measure) as one way to assess the quality of the model's topics:

In [130]:
# Compute Coherence Score
single_coherence_model_lda = CoherenceModel(model=single_model, texts=tweets_nn_lst, dictionary=single_dict, coherence='c_v')
single_coherence_lda = single_coherence_model_lda.get_coherence()
print(single_coherence_lda)

  
0.3935333507272311
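
As a rough illustration of what a coherence score captures, the sketch below computes the simpler UMass coherence for a single topic. This is a different measure from the c_v score used above, and the function name `umass_coherence` and the toy documents are made up for illustration, so its values are not comparable to the number printed above. The idea is that a topic scores higher when its top words tend to appear in the same documents.

```python
import math


def umass_coherence(top_words, docs):
    """UMass coherence of one topic: sum of log((D(w_m, w_l) + 1) / D(w_l))
    over ordered pairs of top words, where D counts documents containing the words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the documents
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Toy documents, for illustration only
coh_docs = [["vaccine", "dose"], ["vaccine", "dose"], ["appointments", "cvs"]]
together = umass_coherence(["vaccine", "dose"], coh_docs)   # words co-occur
apart = umass_coherence(["vaccine", "cvs"], coh_docs)       # words never co-occur
print(together > apart)
```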


We can also visualize the topics and their overlap:

In [21]:
pyLDAvis.enable_notebook()
single_plot = pyLDAvis.gensim_models.prepare(single_model, single_corpus, single_dict)
single_plot


  


We can also try building a Mallet LDA model using either parameters identified above or any other parameters.

In [97]:
# Download MalletLDA with: wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
# Note: gensim.models.wrappers is available in Gensim 3.x; it was removed in Gensim 4.0
mallet_path = '/usr/lib/mallet-2.0.8/bin/mallet'
mallet_lda = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=single_corpus, num_topics=4, alpha='0.6', id2word=single_dict)

  


In [99]:
print(mallet_lda.show_topics())

[(0, '0.046*"people" + 0.016*"virus" + 0.015*"anti" + 0.014*"dose" + 0.014*"fully" + 0.013*"pfizer" + 0.013*"cdc" + 0.013*"risk" + 0.012*"variants" + 0.010*"doctor"'), (1, '0.052*"india" + 0.035*"cases" + 0.034*"doses" + 0.015*"total" + 0.015*"health" + 0.014*"deaths" + 0.013*"million" + 0.011*"sputnik" + 0.011*"world" + 0.010*"news"'), (2, '0.038*"emergency" + 0.032*"moderna" + 0.029*"health" + 0.028*"johnson" + 0.027*"immunity" + 0.027*"usa" + 0.026*"wait" + 0.026*"fake" + 0.025*"die" + 0.024*"company"'), (3, '0.080*"appointments" + 0.072*"sign" + 0.070*"cvs" + 0.028*"india" + 0.027*"age" + 0.026*"people" + 0.022*"today" + 0.020*"modi" + 0.017*"group" + 0.016*"hospital"')]
  


In [98]:
coherence_model_malletlda = CoherenceModel(model=mallet_lda,texts=tweets_nn_lst, dictionary=single_dict, coherence='c_v')
coherence_malletlda = coherence_model_malletlda.get_coherence()
print(coherence_malletlda)

  
0.3216132084984953


Additional things to try: creating bigram and trigram lists as well (for example with the Phrases and Phraser classes in gensim.models.phrases), and trying additional models.
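
As a sketch of what bigram creation would do to the token lists, the pure-Python example below merges adjacent word pairs that occur at least `min_count` times. This is a simplified stand-in: Gensim's Phrases class scores candidate pairs against a threshold rather than using a raw count, and the function name `merge_frequent_bigrams` and the toy documents are made up for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2, delimiter="_"):
    """Join adjacent token pairs that occur at least min_count times across docs."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + delimiter + doc[i + 1])
                i += 2  # both tokens were consumed by the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Toy documents, for illustration only
phrase_docs = [["viral", "load", "people"], ["viral", "load", "key"], ["people", "key"]]
print(merge_frequent_bigrams(phrase_docs))
```

Running the same function again on its own output would produce trigrams from frequent bigram-plus-word pairs.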

Other resources used: https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/; https://stackoverflow.com/questions/66759852/no-module-named-pyldavis; http://mallet.cs.umass.edu/download.php; https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/; https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/; https://medium.com/swlh/topic-modeling-lda-mallet-implementation-in-python-part-2-602ffb38d396; https://www.linkedin.com/pulse/nlp-a-complete-guide-topic-modeling-latent-dirichlet-sahil-m/