# LDA for COVID-19 Tweet Topic Identification

This notebook identifies the primary topics in COVID-19 vaccine tweets using latent Dirichlet allocation (LDA). It is based on several guides written by others:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/;
https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/;
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24




First, we load the packages we'll need; we'll primarily be using Gensim for our LDA. We'll also load our pre-processed data.

In [1]:
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt

In [3]:
tweets_clean = pd.read_csv('data/pre-processed/2021-04-27_cln.csv')
#tweets_clean.drop('Unnamed: 0', axis=1, inplace=True)
tweets_clean.head()



Unnamed: 0,created_at,text,in_reply_to_screen_name,retweet_count,favorite_count,source,id_str,is_retweet,text_cln
0,Tue Apr 27 04:03:47 +0000 2021,"We could have just, you know, not hoarded mill...",,1.0,1,Twitter Web App,1386893816607559684,False,just know hoarded millions doses vaccine allow...
1,Tue Apr 27 04:03:50 +0000 2021,@kmcinnes2 @Rythmol81 Don't know about Liberal...,kmcinnes2,0.0,0,Twitter Web App,1386893829249318912,False,don t know liberals s given doses capita count...
2,Tue Apr 27 04:04:06 +0000 2021,PharmEasy initiates COVID-19 vaccination drive...,,0.0,1,IFTTT,1386893894336532480,False,pharmeasy initiates covid 19 vaccination drive
3,Tue Apr 27 04:03:50 +0000 2021,India to receive first batch of Russia's Covid...,,0.0,0,IFTTT,1386893829106774019,False,india receive batch russia s covid 19 vaccine ...
4,Tue Apr 27 04:04:00 +0000 2021,"@axios ""Several state university systems and p...",axios,0.0,0,TweetDeck,1386893868961091587,False,several state university systems public unive...


In [34]:
tweets_clean.head()



Unnamed: 0,created_at,text,in_reply_to_screen_name,retweet_count,favorite_count,source,id_str,is_retweet,text_cln,text_list
0,Tue Apr 27 04:03:47 +0000 2021,"We could have just, you know, not hoarded mill...",,1.0,1,Twitter Web App,1386893816607559684,False,just know hoarded millions doses vaccine allow...,"['just', 'know', 'hoarded', 'millions', 'doses..."
1,Tue Apr 27 04:03:50 +0000 2021,@kmcinnes2 @Rythmol81 Don't know about Liberal...,kmcinnes2,0.0,0,Twitter Web App,1386893829249318912,False,don t know liberals s given doses capita count...,"['don', 't', 'know', 'liberals', 's', 'given',..."
2,Tue Apr 27 04:04:06 +0000 2021,PharmEasy initiates COVID-19 vaccination drive...,,0.0,1,IFTTT,1386893894336532480,False,pharmeasy initiates covid 19 vaccination drive,"['pharmeasy', 'initiates', 'covid', '19', 'vac..."
3,Tue Apr 27 04:03:50 +0000 2021,India to receive first batch of Russia's Covid...,,0.0,0,IFTTT,1386893829106774019,False,india receive batch russia s covid 19 vaccine ...,"['india', 'receive', 'batch', 'russia', 's', '..."
4,Tue Apr 27 04:04:00 +0000 2021,"@axios ""Several state university systems and p...",axios,0.0,0,TweetDeck,1386893868961091587,False,several state university systems public unive...,"['several', 'state', 'university', 'systems', ..."


In [33]:
# Split each cleaned tweet into a list of tokens, dropping empty strings.
# The list is stored as its string representation and parsed back later
# with ast.literal_eval. Assigning the whole column at once avoids the
# chained-indexing SettingWithCopyWarning raised by per-row assignment.
tweets_clean['text_list'] = [
    str([token for token in str(text).split(" ") if token])
    for text in tweets_clean['text_cln']
]


Should we also create bigram and trigram lists, as additional models to try? If so, we could use `gensim.models.phrases.Phrases` and `gensim.models.phrases.Phraser`.
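
To make the bigram idea concrete, here is a minimal pure-Python sketch. It is a simplified stand-in, not Gensim's phrase detection: it joins adjacent word pairs that occur at least `min_count` times, whereas Gensim's phrase models score pairs before merging them. The function name `merge_frequent_bigrams` and the toy tweets are hypothetical.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times across docs."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))  # count adjacent pairs
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])  # e.g. "fully_vaccinated"
                i += 2  # skip the second word of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Toy data for illustration only (not taken from the tweet file).
toy_tweets = [
    ["fully", "vaccinated", "today"],
    ["get", "fully", "vaccinated"],
    ["masks", "today"],
]
toy_bigrams = merge_frequent_bigrams(toy_tweets, min_count=2)
```

The merged token lists could then be passed through the same dictionary and corpus steps as the single-word lists.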

Next, we use Gensim to build a dictionary that maps each unique word to an integer id. (We may also want to filter out of the dictionary words that appear in too few or too many tweets.) We then build a corpus of the tweets: for each tweet, a list of (word id, count) pairs recording how many times each word appeared in it.
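
As a rough illustration of what the dictionary and corpus contain, the sketch below rebuilds both in plain Python on toy data. It is a simplified approximation, not Gensim's implementation: the helper names `build_token2id` and `doc_to_bow` are hypothetical, ids are assigned alphabetically here (Gensim's assignment may differ), and the frequency filter only mimics the spirit of `no_below` and `no_above`.

```python
from collections import Counter


def build_token2id(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id, filtering by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens present in the mapping."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Toy data for illustration only (not taken from the tweet file).
toy_docs = [
    ["covid", "vaccine", "doses"],
    ["covid", "vaccine", "vaccine", "appointment"],
    ["covid", "mask"],
]
# Keep tokens found in at least 2 documents and in at most 80% of documents.
toy_token2id = build_token2id(toy_docs, no_below=2, no_above=0.8)
toy_corpus = [doc_to_bow(doc, toy_token2id) for doc in toy_docs]
```

Here "covid" is dropped because it appears in every document, and the rare words are dropped because they appear in only one, which leaves "vaccine" as the single dictionary entry.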

In [36]:
import ast
tweets_clean_lst = []
for tweet in tweets_clean['text_list']:
    tweets_clean_lst.append(ast.literal_eval(tweet))



In [41]:
single_dict = corpora.Dictionary(tweets_clean_lst)
#for actual tweet data, commented params might be a better place to start
single_dict.filter_extremes(no_below=50, no_above=0.80, keep_n=1000000)
#single_dict.filter_extremes(no_below=10, no_above=0.80, keep_n=1000000)

single_corpus = [single_dict.doc2bow(tweet) for tweet in tweets_clean_lst]



In [42]:
print(single_dict.token2id)



In [43]:
print(single_corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]


Now, we can try training an initial model.

In [44]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
single_model = gensim.models.ldamodel.LdaModel(corpus=single_corpus,
                                           id2word=single_dict,
                                           num_topics=8, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)



In [45]:
single_model.print_topics()



[(0,
  '0.159*"covid" + 0.142*"19" + 0.132*"vaccine" + 0.042*"people" + 0.032*"coronavirus" + 0.017*"health" + 0.014*"28" + 0.010*"virus" + 0.009*"cdc" + 0.009*"news"'),
 (1,
  '0.140*"vaccinated" + 0.031*"fully" + 0.029*"today" + 0.027*"registration" + 0.023*"vaccinations" + 0.022*"state" + 0.021*"young" + 0.019*"private" + 0.017*"register" + 0.016*"dr"'),
 (2,
  '0.036*"study" + 0.030*"good" + 0.026*"working" + 0.024*"citizens" + 0.023*"employ" + 0.020*"online" + 0.017*"immune" + 0.016*"body" + 0.016*"out" + 0.015*"may"'),
 (3,
  '0.131*"vaccination" + 0.046*"available" + 0.044*"appointments" + 0.038*"near" + 0.036*"sign" + 0.035*"04" + 0.033*"00" + 0.032*"cvs" + 0.021*"school" + 0.020*"pa"'),
 (4,
  '0.066*"t" + 0.036*"need" + 0.028*"like" + 0.024*"countries" + 0.023*"don" + 0.020*"said" + 0.019*"know" + 0.018*"age" + 0.018*"covaxin" + 0.017*"life"'),
 (5,
  '0.070*"new" + 0.051*"mask" + 0.040*"cases" + 0.028*"masks" + 0.024*"you" + 0.024*"wear" + 0.020*"end" + 0.018*"death" + 0.018

We can use perplexity and coherence as two measures of our model's quality:
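
Gensim's `log_perplexity` returns a per-word likelihood bound rather than perplexity itself; its documentation describes the perplexity estimate as 2^(-bound). The small sketch below shows that conversion. The helper name `perplexity_from_bound` and the example value are hypothetical and are not results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2.0 ** (-per_word_bound)


# Illustrative value only, not an output of the model in this notebook.
example_bound = -7.0
example_perplexity = perplexity_from_bound(example_bound)  # 128.0
```

A bound closer to zero gives a lower perplexity, so when comparing models on the same corpus, the higher bound indicates the better fit.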

In [46]:
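# Perplexity: log_perplexity returns the per-word likelihood bound, not perplexity itself.
# A higher bound (closer to zero) corresponds to lower perplexity, i.e. a better fit.
single_log_perplexity = single_model.log_perplexity(single_corpus)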

# Compute Coherence Score
single_coherence_model_lda = CoherenceModel(model=single_model, texts=tweets_clean_lst, dictionary=single_dict, coherence='c_v')
single_coherence_lda = single_coherence_model_lda.get_coherence()




In [47]:
print(single_coherence_lda)

0.2755605520579796


We can also visualize the topics and their overlap:

In [48]:
pyLDAvis.enable_notebook()
single_plot = pyLDAvis.gensim_models.prepare(single_model, single_corpus, single_dict)
single_plot




Other resources used: https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/; https://stackoverflow.com/questions/66759852/no-module-named-pyldavis