# LDA for COVID-19 Tweet Topic Identification

This notebook identifies the primary topics in COVID-19 vaccine tweets using LDA. It draws on several guides written by others:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/; 
https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/; https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24




First, we load the packages we'll need. We'll primarily use Gensim and the Gensim wrapper for Mallet for our LDA models. We'll also load our pre-processed, labeled data.

In [1]:
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import os



In [7]:
# Read every labeled tweet file and combine them into a single dataframe
data_dir = 'data/labeled'
tweets_dfs = []
for tweet_file in os.listdir(data_dir):
    df = pd.read_json(os.path.join(data_dir, tweet_file))
    tweets_dfs.append(df)
tweets_clean = pd.concat(tweets_dfs)

  


Next, we keep only the tweets scored as negative or neutral (score <= 0), so that the topics we find are more likely to relate to vaccine hesitancy.

In [9]:
tweets_negneut = tweets_clean[tweets_clean['score']<=0]

  


In [10]:
print("""
    Size of combined df:\t{}
    First five rows:

    {}
""".format(
    tweets_negneut.shape,
    tweets_negneut.head()
)
)



    Size of combined df:	(17075, 8)
    First five rows:

                     created_at  \
2 2021-05-01 04:01:04+00:00   
3 2021-05-01 04:01:40+00:00   
4 2021-05-01 04:01:06+00:00   
5 2021-05-01 04:01:56+00:00   
7 2021-05-01 04:01:08+00:00   

                                            text_cln  \
2  appointments available roxbury covid  vaccinat...   
3  im afraid childs future federally run schools ...   
4  indonesia approves sinopharm covid  vaccine em...   
5  thread viral people know culprits anti vaccine...   
7  bmc carry token covid  vaccination people age ...   

                                        text_cln_tok  positive  neutral  \
2  ['appointments', 'available', 'roxbury', 'site...     0.000    1.000   
3  ['afraid', 'childs', 'future', 'federally', 'r...     0.000    0.895   
4  ['indonesia', 'approves', 'sinopharm', 'emerge...     0.239    0.531   
5  ['thread', 'viral', 'people', 'know', 'culprit...     0.000    0.583   
7  ['bmc', 'carry', 'token', 'people',

First (if desired), we can run a grid search over possible parameters for both the Gensim and Mallet LDA models to identify the most promising ones. To do this, call the function choose_lda_models() in pipeline.py with the text_cln_tok column of the full tweets dataframe. To run the pipeline successfully (for the Mallet LDA model), you'll need to download Mallet (download with: wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip), unzip it, and set the variable MALLET_PATH in pipeline.py to the file path where the mallet-2.0.8/bin/mallet files are located (e.g., MALLET_PATH = '/usr/lib/mallet-2.0.8/bin/mallet'). You also need to ensure that all packages used in pipeline.py are installed.

Note on grid search: currently, the parameters to search through are hard-coded in pipeline.py. To search through different parameters, you'll need to adjust the parameter values in choose_lda_models. The initial step of creating a dictionary and corpus for the LDA models also takes parameters. These are currently hard-coded to drop words that appear in fewer than 50 tweets, to drop words that appear in more than 80% of the tweets, and then to keep only the 1,000,000 most frequent remaining words. These values can also be changed in pipeline.py, where build_corpus_dict is called by choose_lda_models.
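
As a rough illustration of how such a hard-coded grid expands, the sketch below enumerates parameter combinations with `itertools.product`. The value lists are assumptions loosely modeled on the training log shown further down, not a copy of what pipeline.py actually contains, and `expand_grid` is a hypothetical helper name.

```python
from itertools import product

# Hypothetical search space; the real values live in pipeline.py.
PARAM_GRID = {
    "chunksize": [1000, 2000],
    "num_topics": [3, 5],
    "alpha": [0.3, 0.6],
    "eta": [0.1, 0.9, "symmetric"],
    "random_state": [100],
}


def expand_grid(grid):
    """Return one dict per combination of the values in `grid`."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(PARAM_GRID)
print(len(combos))  # 2 * 2 * 2 * 3 * 1 = 24 combinations
print(combos[0])
```

Each dict could then be unpacked into the model constructor as keyword arguments. Since every Gensim model in the log takes roughly 19 seconds to train, the number of combinations directly drives the total run time.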

In [85]:
import importlib
import pipeline
# Reload so that edits to pipeline.py are picked up without restarting the kernel
importlib.reload(pipeline)

  


<module 'pipeline' from '/mnt/c/Users/natra/Documents/Education/UChicago/MLforPP/ml-for-pp_vaccine-hesitancy/pipeline.py'>

In [None]:
results_exp2 = pipeline.choose_lda_models(tweets_negneut['text_cln_tok'])

In [65]:
results_exp = pipeline.choose_lda_models(tweets_negneut['text_cln_tok'])

: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.3, 'eta': 0.9, 'random_state': 100}
Time Elapsed: 0:00:19.120891
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.3, 'eta': 'symmetric', 'random_state': 100}
Time Elapsed: 0:00:19.284752
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.6, 'eta': 0.1, 'random_state': 100}
Time Elapsed: 0:00:19.164481
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.6, 'eta': 0.3, 'random_state': 100}
Time Elapsed: 0:00:19.254130
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.6, 'eta': 0.6, 'random_state': 100}
Time Elapsed: 0:00:19.190248
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.6, 'eta': 0.9, 'random_state': 100}
Time Elapsed: 0:00:19.181296
Training model: GensimLDA | {'chunksize': 2000, 'num_topics': 3, 'alpha': 0.6, 'eta': 'symmetric', 'random_state': 100}
Time Elapsed: 0:00:19.263923
Training mo

In [67]:
results_exp.sort_values('Coherence',ascending=False)

  


Unnamed: 0,LDA Model,Params,Time Elapsed,Coherence,Perplexity,Topics
151,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'alpha': ...",0 days 00:00:18.660695,0.481878,-5.901042,"[(0, 0.083*""people"" + 0.024*""fully"" + 0.023*""n..."
163,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'alpha': ...",0 days 00:00:18.176696,0.481878,-5.901042,"[(0, 0.083*""people"" + 0.024*""fully"" + 0.023*""n..."
173,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'alpha': ...",0 days 00:00:18.843764,0.481878,-5.901042,"[(0, 0.083*""people"" + 0.024*""fully"" + 0.023*""n..."
172,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'alpha': ...",0 days 00:00:19.690805,0.481878,-5.901042,"[(0, 0.083*""people"" + 0.024*""fully"" + 0.023*""n..."
171,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'alpha': ...",0 days 00:00:20.158660,0.481878,-5.901042,"[(0, 0.083*""people"" + 0.024*""fully"" + 0.023*""n..."
...,...,...,...,...,...,...
147,GensimLDA,"{'chunksize': 1000, 'num_topics': 3, 'alpha': ...",0 days 00:00:18.093882,0.331503,-5.955754,"[(0, 0.041*""people"" + 0.017*""india"" + 0.017*""d..."
148,GensimLDA,"{'chunksize': 1000, 'num_topics': 3, 'alpha': ...",0 days 00:00:18.248764,0.331503,-5.955754,"[(0, 0.041*""people"" + 0.017*""india"" + 0.017*""d..."
149,GensimLDA,"{'chunksize': 1000, 'num_topics': 3, 'alpha': ...",0 days 00:00:18.344837,0.331503,-5.955754,"[(0, 0.041*""people"" + 0.017*""india"" + 0.017*""d..."
126,GensimLDA,"{'chunksize': 1000, 'num_topics': 3, 'alpha': ...",0 days 00:00:18.057247,0.331503,-5.955754,"[(0, 0.041*""people"" + 0.017*""india"" + 0.017*""d..."


In [68]:
results_exp.to_csv('lda_models_grid_search.csv')

  


In [55]:
results_exp.sort_values('Coherence', ascending=False)

  


Unnamed: 0,LDA Model,Params,Time Elapsed,Coherence,Perplexity
11,GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'random_s...",0 days 00:00:18.326424,0.481878,-5.901042
15,MalletLDA,{'random_seed': 100},0 days 00:00:40.614905,0.470407,
12,GensimLDA,"{'chunksize': 1000, 'num_topics': 10, 'random_...",0 days 00:00:19.540925,0.446513,-5.850479
13,GensimLDA,"{'chunksize': 1000, 'num_topics': 15, 'random_...",0 days 00:00:19.223515,0.420163,-5.922036
9,GensimLDA,"{'chunksize': 500, 'num_topics': 20, 'random_s...",0 days 00:00:18.387839,0.407586,-8.15214
14,GensimLDA,"{'chunksize': 1000, 'num_topics': 20, 'random_...",0 days 00:00:19.082389,0.402379,-6.381634
8,GensimLDA,"{'chunksize': 500, 'num_topics': 15, 'random_s...",0 days 00:00:17.467022,0.399271,-6.284955
4,GensimLDA,"{'chunksize': 100, 'num_topics': 20, 'random_s...",0 days 00:00:17.142501,0.393271,-15.103237
0,GensimLDA,"{'chunksize': 100, 'num_topics': 3, 'random_st...",0 days 00:00:16.765448,0.38319,-6.104693
2,GensimLDA,"{'chunksize': 100, 'num_topics': 10, 'random_s...",0 days 00:00:15.813604,0.380098,-6.110908
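
If the results are saved with to_csv and read back later, the Params column comes back as text rather than as dicts. The sketch below shows one way to pick the best-scoring row and recover its parameters using only the standard library. The two rows are made-up stand-ins (the real Params values are truncated in the output above), and `best_params` is a hypothetical helper.

```python
import ast
import csv
import io

# Made-up rows in the same shape as the grid-search output (values are illustrative).
CSV_TEXT = '''LDA Model,Params,Coherence
GensimLDA,"{'chunksize': 1000, 'num_topics': 5, 'random_state': 100}",0.48
GensimLDA,"{'chunksize': 1000, 'num_topics': 10, 'random_state': 100}",0.45
'''


def best_params(csv_text):
    """Return (model name, params dict) for the row with the highest coherence."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda row: float(row["Coherence"]))
    # The dict was written out as text, so parse it back safely.
    return best["LDA Model"], ast.literal_eval(best["Params"])


model_name, params = best_params(CSV_TEXT)
print(model_name, params)
```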


Once the ideal parameters are selected, we can build the models manually to examine specific aspects in more depth (and to create the dynamic visualizations below). First, we use Gensim to create a dictionary that maps each unique word to an integer id. (We still filter out of the dictionary any words that appear in too few or too many tweets.) Second, we create a corpus in which each tweet is represented by the number of times each word (identified by id) appears in it.

In [56]:
import ast
# text_cln_tok stores each token list as a string, so parse it back into a Python list
tweets_nn_lst = [ast.literal_eval(tweet) for tweet in tweets_negneut['text_cln_tok']]

  


In [57]:
single_dict = corpora.Dictionary(tweets_nn_lst)
single_dict.filter_extremes(no_below=50, no_above=0.80, keep_n=1000000)

single_corpus = [single_dict.doc2bow(tweet) for tweet in tweets_nn_lst]
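
To make the filtering step concrete, here is a pure-Python sketch of the document-frequency rule that filter_extremes applies: a token is kept only if it appears in at least `no_below` documents and in no more than a `no_above` fraction of them. This is an approximation for illustration only (it ignores `keep_n` and Gensim's id handling), and `filter_by_doc_freq` is a hypothetical name.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in >= no_below docs and in <= no_above (fraction) of docs."""
    # Count each token once per document (document frequency, not total count).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["vaccine", "appointments", "available"],
    ["vaccine", "cases", "india"],
    ["vaccine", "cases", "deaths"],
    ["vaccine", "appointments", "cvs"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.8)
print(sorted(kept))  # ['appointments', 'cases']
```

In this toy corpus, "vaccine" is dropped because it appears in every document, which mirrors why very common words carry little information for separating topics.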

  


To see the dictionary and corpus contents, run the two cells below:

In [13]:
print(single_dict.token2id)

  


In [14]:
print(single_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
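
Each pair in the output above is (token id, count) for one tweet. The sketch below reproduces that representation with the standard library so the structure is easy to inspect. `build_vocab` and `to_bow` are hypothetical helpers, and real Gensim ids will generally differ from the ones assigned here.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["people", "dose", "people"], ["dose", "cases"]]
vocab = build_vocab(docs)
print(to_bow(docs[0], vocab))  # [(0, 2), (1, 1)]
print(to_bow(["cases", "unknown"], vocab))  # [(2, 1)]
```

Tokens that were filtered out of the dictionary simply do not appear in the pairs, which is why a tweet's bag-of-words can be shorter than its token list.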
  


Next, we can train the model with the parameters we identified above (or, with any other parameters).

In [83]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
single_model = gensim.models.ldamodel.LdaModel(corpus=single_corpus,
                                           id2word=single_dict,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=1000,
                                           passes=10,
                                           alpha=0.1,
                                           eta=1,
                                           per_word_topics=True)

  


In [75]:
# View the topics identified in the above model
single_model.print_topics()

  


[(0,
  '0.063*"people" + 0.023*"fully" + 0.018*"virus" + 0.017*"dose" + 0.017*"nation" + 0.016*"situation" + 0.015*"getting" + 0.013*"time" + 0.012*"shot" + 0.012*"times"'),
 (1,
  '0.068*"cases" + 0.048*"total" + 0.034*"new" + 0.032*"india" + 0.025*"doses" + 0.025*"deaths" + 0.022*"health" + 0.022*"death" + 0.018*"reports" + 0.017*"recoveries"'),
 (2,
  '0.043*"india" + 0.020*"fight" + 0.018*"research" + 0.017*"center" + 0.015*"appointment" + 0.015*"oxygen" + 0.015*"australia" + 0.015*"wrong" + 0.014*"supply" + 0.014*"global"'),
 (3,
  '0.132*"available" + 0.111*"appointments" + 0.106*"near" + 0.105*"sign" + 0.103*"cvs" + 0.041*"pfizer" + 0.024*"johnson" + 0.015*"variants" + 0.012*"canada" + 0.010*"health"'),
 (4,
  '0.059*"immunity" + 0.049*"you" + 0.038*"fake" + 0.033*"die" + 0.032*"company" + 0.032*"medicine" + 0.031*"wait" + 0.029*"usa" + 0.026*"mask" + 0.025*"cure"')]

We can use coherence as one measure of how interpretable the model's topics are:

In [84]:
# Compute Coherence Score
single_coherence_model_lda = CoherenceModel(model=single_model, texts=tweets_nn_lst, dictionary=single_dict, coherence='c_v')
single_coherence_lda = single_coherence_model_lda.get_coherence()
print(single_coherence_lda)

  
0.49433827902219873
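
The c_v measure used above combines several steps and is not reproduced here. To give a feel for what a coherence score rewards, the sketch below implements the simpler UMass-style measure instead, which Gensim exposes as coherence='u_mass'. It is a simplified illustration rather than Gensim's implementation, and `umass_coherence` is a hypothetical name.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs, eps=1.0):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs of top words.

    top_words must be ordered from most to least probable in the topic;
    D counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations() yields (earlier, later) pairs, so w_j is the higher-ranked word.
    for w_j, w_i in combinations(top_words, 2):
        denom = doc_count(w_j)
        if denom:  # skip words that never occur to avoid dividing by zero
            scores.append(math.log((doc_count(w_i, w_j) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


docs = [
    ["cases", "deaths", "india"],
    ["cases", "deaths", "total"],
    ["appointments", "cvs", "available"],
    ["appointments", "available", "pfizer"],
]
coherent = umass_coherence(["cases", "deaths"], docs)
mixed = umass_coherence(["cases", "cvs"], docs)
print(coherent > mixed)  # True: words that co-occur score higher
```

The scales of u_mass and c_v differ, so scores from this sketch are not comparable with the c_v value printed above; only the ranking idea carries over.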


We can also visualize the topics and their overlap:

In [21]:
pyLDAvis.enable_notebook()
single_plot = pyLDAvis.gensim_models.prepare(single_model, single_corpus, single_dict)
single_plot


  


We can also try building a Mallet LDA model using either parameters identified above or any other parameters.

In [22]:
# Download MalletLDA with: wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
# Note: gensim.models.wrappers was removed in Gensim 4.0, so this cell needs Gensim 3.8.x
mallet_path = '/usr/lib/mallet-2.0.8/bin/mallet'
mallet_lda = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=single_corpus, num_topics=8, alpha='auto', id2word=single_dict)

  


In [23]:
print(mallet_lda.show_topics())

[(0, '0.145*"people" + 0.041*"virus" + 0.031*"cdc" + 0.031*"risk" + 0.027*"variants" + 0.026*"don" + 0.022*"public" + 0.021*"spread" + 0.020*"trump" + 0.020*"sites"'), (1, '0.094*"india" + 0.088*"cases" + 0.044*"deaths" + 0.041*"health" + 0.036*"total" + 0.035*"world" + 0.032*"death" + 0.028*"state" + 0.026*"reports" + 0.022*"americans"'), (2, '0.079*"news" + 0.038*"immunity" + 0.036*"die" + 0.034*"fake" + 0.033*"usa" + 0.033*"wait" + 0.030*"company" + 0.030*"medicine" + 0.029*"anti" + 0.027*"warned"'), (3, '0.080*"india" + 0.045*"drive" + 0.034*"time" + 0.034*"hospital" + 0.034*"country" + 0.027*"appointment" + 0.027*"start" + 0.025*"adults" + 0.025*"people" + 0.024*"times"'), (4, '0.043*"pandemic" + 0.038*"canada" + 0.036*"modi" + 0.035*"government" + 0.034*"johnson" + 0.027*"days" + 0.025*"year" + 0.022*"oxygen" + 0.021*"fight" + 0.020*"global"'), (5, '0.100*"doses" + 0.049*"fully" + 0.039*"million" + 0.037*"shot" + 0.036*"day" + 0.034*"administered" + 0.028*"population" + 0.028*"ra

In [26]:
coherence_model_malletlda = CoherenceModel(model=mallet_lda,texts=tweets_nn_lst, dictionary=single_dict, coherence='c_v')
coherence_malletlda = coherence_model_malletlda.get_coherence()
print(coherence_malletlda)

  


Additional things to try: creating bigram and trigram lists (possibly with gensim.models.phrases.Phrases and gensim.models.phrases.Phraser) and testing additional models.
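
As a starting point for the bigram idea, the sketch below joins adjacent token pairs that occur at least `min_count` times across the corpus. This is a deliberately simplified stand-in: Gensim's phrase detection also applies a scoring threshold rather than relying on raw counts alone, and `merge_frequent_bigrams` is a hypothetical helper.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2, sep="_"):
    """Join adjacent token pairs seen at least min_count times into one token."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + sep + doc[i + 1])
                i += 2  # consume both tokens of the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["herd", "immunity", "fake"],
    ["herd", "immunity", "wait"],
    ["fake", "news"],
]
print(merge_frequent_bigrams(docs))
# [['herd_immunity', 'fake'], ['herd_immunity', 'wait'], ['fake', 'news']]
```

The merged token lists could then be passed to the dictionary and corpus steps in place of tweets_nn_lst.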

Other resources used: https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/; https://stackoverflow.com/questions/66759852/no-module-named-pyldavis; http://mallet.cs.umass.edu/download.php; https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/; https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/; https://medium.com/swlh/topic-modeling-lda-mallet-implementation-in-python-part-2-602ffb38d396; https://www.linkedin.com/pulse/nlp-a-complete-guide-topic-modeling-latent-dirichlet-sahil-m/