# LDA for COVID-19 Tweet Topic Identification

This notebook identifies the primary topics in COVID-19 vaccine tweets. It is based on several guides written by others:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/;
https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/;
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24




First, we load the packages we'll need; we'll primarily be using Gensim for our LDA. We'll also load our pre-processed data (here, a demo file of speeches standing in for the tweet data).

In [79]:
import pandas as pd
import numpy as np
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt



In [8]:
tweets_clean = pd.read_csv('speeches_demo.csv')
tweets_clean.drop('Unnamed: 0', axis=1, inplace=True)
tweets_clean.head()

Unnamed: 0,normalized_text
0,"['chief', 'justice', 'roberts', 'vice', 'presi..."
1,"['fellow', 'american', 'year', 'ago', 'launch'..."
2,"['fellow', 'american', 'want', 'speak', 'tonig..."
3,"['like', 'begin', 'address', 'heinous', 'attac..."
4,"['know', 'pain', 'know', 'hurt', 'election', '..."


Open question: should we also create bigram and trigram lists and try additional models on them? If so, Gensim's `gensim.models.phrases` module (its `Phrases` and `Phraser` classes) is the tool to use.
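
To make the bigram idea concrete, here is a minimal sketch that joins adjacent word pairs appearing at least `min_count` times. It is a simplified stand-in written with the standard library only, not Gensim's method: Gensim's `Phrases` scores candidate pairs against a threshold rather than using a raw count. The toy documents and the `merge_bigrams` helper are hypothetical.

```python
from collections import Counter

# Toy tokenized documents (hypothetical; stands in for the real token lists).
docs = [
    ["stem", "cell", "research", "funding"],
    ["support", "stem", "cell", "research"],
    ["cell", "phone", "research"],
]

# Count every adjacent pair of tokens across all documents.
pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

# Assumption: a pair is treated as a phrase if it occurs at least min_count times.
min_count = 2
frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

def merge_bigrams(doc):
    """Greedily join frequent adjacent pairs, scanning left to right."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2  # skip the second half of the joined pair
        else:
            merged.append(doc[i])
            i += 1
    return merged

docs_bigrams = [merge_bigrams(doc) for doc in docs]
print(docs_bigrams)
```

The merged token lists could then be passed to the dictionary and corpus steps in place of the single-word lists.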

Next, we use Gensim to build a dictionary that maps each unique word to an integer id. (We may also want to filter out of the dictionary words that appear in too few or too many tweets.) Then we create a corpus in bag-of-words form: for each tweet, a list of (word id, count) pairs recording how many times each word appeared in that tweet.
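
As a rough illustration of these two steps, the sketch below builds a word-to-id mapping, filters it by document frequency, and converts each document to (id, count) pairs using only the standard library. It mirrors the idea behind `Dictionary`, `filter_extremes`, and `doc2bow`, but it is not Gensim's implementation (for one, Gensim does not assign ids alphabetically). The toy documents and the `to_bow` helper are hypothetical.

```python
from collections import Counter

# Toy tokenized documents (hypothetical; stands in for tweets_clean_lst).
docs = [
    ["vaccine", "dose", "dose", "today"],
    ["vaccine", "appointment", "today"],
    ["dose", "side", "effect", "vaccine"],
]

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep words found in at least `no_below` documents and in at most a
# fraction `no_above` of all documents.
no_below, no_above = 2, 0.8
kept = sorted(
    word for word, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)

# Map each surviving word to an integer id.
token2id = {word: idx for idx, word in enumerate(kept)}

def to_bow(doc):
    """Return sorted (word id, count) pairs for the words kept in the dictionary."""
    counts = Counter(word for word in doc if word in token2id)
    return sorted((token2id[word], n) for word, n in counts.items())

corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

In this toy example "vaccine" is dropped because it appears in every document, and the rare words are dropped because they appear in only one.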

In [50]:
import ast
# Each cell holds the string form of a token list; parse it back into a real list.
tweets_clean_lst = [ast.literal_eval(tweet) for tweet in tweets_clean['normalized_text']]
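
The parsing step is needed because a list saved to CSV is read back as a single string. A small sketch of the difference, using a shortened version of the first row shown earlier (the variable names are illustrative):

```python
import ast

# What pandas returns for one cell: a string that merely looks like a list.
cell = "['chief', 'justice', 'roberts']"

# Iterating over the raw string yields characters, not tokens.
first_chars = list(cell)[:3]

# ast.literal_eval safely evaluates the literal and returns a real list.
tokens = ast.literal_eval(cell)

print(first_chars)
print(tokens)
```

An alternative, assuming the same column name, would be to parse while loading with the `converters` argument of `pd.read_csv`, for example `converters={'normalized_text': ast.literal_eval}`.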

In [56]:
single_dict = corpora.Dictionary(tweets_clean_lst)
# Drop words that appear in fewer than 10 documents or in more than 80% of them.
# For the actual tweet data, the commented params might be a better place to start:
# single_dict.filter_extremes(no_below=50, no_above=0.80, keep_n=1000000)
single_dict.filter_extremes(no_below=10, no_above=0.80, keep_n=1000000)

single_corpus = [single_dict.doc2bow(tweet) for tweet in tweets_clean_lst]

In [67]:
print(single_dict.token2id)

ernal': 2734, 'japanese': 2735, 'johnson': 2736, 'kuwait': 2737, 'laugh': 2738, 'loophole': 2739, 'meantime': 2740, 'millions': 2741, 'mountain': 2742, 'odd': 2743, 'pakistan': 2744, 'panel': 2745, 'poland': 2746, 'pressure': 2747, 'prime': 2748, 'resign': 2749, 'resolution': 2750, 'rival': 2751, 'robert': 2752, 'russian': 2753, 'sale': 2754, 'singapore': 2755, 'suspect': 2756, 'ted': 2757, 'telephone': 2758, 'wise': 2759, 'abundance': 2760, 'accept': 2761, 'acquire': 2762, 'affirm': 2763, 'afghan': 2764, 'alarm': 2765, 'argue': 2766, 'assert': 2767, 'await': 2768, 'aware': 2769, 'beneficial': 2770, 'chip': 2771, 'conquest': 2772, 'cope': 2773, 'crusade': 2774, 'currency': 2775, 'deficit': 2776, 'delegate': 2777, 'dependent': 2778, 'desire': 2779, 'domination': 2780, 'dump': 2781, 'electronic': 2782, 'embolden': 2783, 'empower': 2784, 'empowerment': 2785, 'enlist': 2786, 'escalate': 2787, 'expense': 2788, 'exploit': 2789, 'failure': 2790, 'genuinely': 2791, 'governmental': 2792, 'grand

In [69]:
print(single_corpus[0])

[(0, 1), (1, 1), (2, 3), (3, 1), (4, 1), (5, 2), (6, 3), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 3), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 2), (36, 1), (37, 1), (38, 2), (39, 2), (40, 2), (41, 1), (42, 4), (43, 1), (44, 5), (45, 1), (46, 6), (47, 1), (48, 1), (49, 1), (50, 1), (51, 5), (52, 1), (53, 1), (54, 2), (55, 1), (56, 1), (57, 1), (58, 1), (59, 2), (60, 3), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 3), (67, 1), (68, 1), (69, 3), (70, 3), (71, 1), (72, 1), (73, 1), (74, 1), (75, 2), (76, 1), (77, 1), (78, 1), (79, 2), (80, 4), (81, 1), (82, 2), (83, 1), (84, 11), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 2), (93, 3), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 2), (105, 1), (106, 2), (107, 1), (108, 1), (109, 2), (110, 1)

Now, we can try training an initial model.

In [72]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
single_model = gensim.models.ldamodel.LdaModel(corpus=single_corpus,
                                           id2word=single_dict,
                                           num_topics=8, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

 + 0.016*"think" + 0.015*"like" + 0.014*"lot" + 0.013*"be" + 0.011*"happen"
2021-04-30 19:47:40,016 : INFO : topic #6 (0.170): 0.028*"think" + 0.023*"mr." + 0.011*"question" + 0.011*"go" + 0.008*"believe" + 0.007*"say" + 0.007*"q." + 0.006*"house" + 0.006*"ask" + 0.005*"congress"
2021-04-30 19:47:40,018 : INFO : topic #3 (0.181): 0.019*"peace" + 0.015*"soviet" + 0.013*"war" + 0.010*"force" + 0.010*"vietnam" + 0.008*"military" + 0.007*"south" + 0.007*"union" + 0.006*"agreement" + 0.006*"nuclear"
2021-04-30 19:47:40,020 : INFO : topic #5 (0.334): 0.011*"freedom" + 0.008*"peace" + 0.007*"let" + 0.007*"life" + 0.006*"history" + 0.006*"free" + 0.006*"human" + 0.005*"future" + 0.005*"believe" + 0.004*"child"
2021-04-30 19:47:40,020 : INFO : topic diff=0.120056, rho=0.219159
2021-04-30 19:47:40,257 : INFO : -7.285 per-word bound, 156.0 perplexity estimate based on a held-out corpus of 82 documents with 85452 words
2021-04-30 19:47:40,258 : INFO : PROGRESS: pass 16, at document #382/382
2021-0

In [73]:
single_model.print_topics()

[(0,
  '0.039*"go" + 0.031*"have" + 0.025*"say" + 0.017*"get" + 0.016*"thing" + 0.016*"think" + 0.015*"like" + 0.014*"lot" + 0.013*"be" + 0.011*"happen"'),
 (1,
  '0.045*"space" + 0.042*"border" + 0.029*"immigration" + 0.024*"research" + 0.018*"cell" + 0.017*"human" + 0.016*"stem" + 0.015*"immigrant" + 0.014*"illegal" + 0.012*"law"'),
 (2,
  '0.012*"iraq" + 0.010*"security" + 0.009*"terrorist" + 0.008*"war" + 0.007*"force" + 0.006*"iraqi" + 0.006*"military" + 0.006*"fight" + 0.006*"israel" + 0.005*"troop"'),
 (3,
  '0.019*"peace" + 0.014*"war" + 0.013*"vietnam" + 0.012*"soviet" + 0.010*"force" + 0.010*"south" + 0.007*"military" + 0.006*"north" + 0.006*"agreement" + 0.006*"union"'),
 (4,
  '0.016*"congress" + 0.012*"program" + 0.011*"$" + 0.011*"tax" + 0.009*"federal" + 0.009*"increase" + 0.009*"budget" + 0.008*"energy" + 0.006*"spend" + 0.006*"economy"'),
 (5,
  '0.010*"freedom" + 0.008*"peace" + 0.007*"let" + 0.007*"life" + 0.006*"history" + 0.006*"free" + 0.005*"human" + 0.005*"futur
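
Each topic is reported as a string of weighted words, where the weight is the word's probability within that topic. If the weights are needed as numbers, one option is to parse the string; the sketch below does this with a regular expression on a shortened copy of topic 0. The pattern and variable names are illustrative.

```python
import re

# Shortened copy of the topic 0 string printed above.
topic = '0.039*"go" + 0.031*"have" + 0.025*"say"'

# Each term has the form weight*"word"; capture both parts.
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic)
]
print(pairs)
```

Gensim can also return such pairs directly through `single_model.show_topic(topic_id, topn=10)`, which avoids parsing strings.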

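We can use perplexity and coherence as two measures of model quality. (LDA is unsupervised, so neither is an accuracy in the classification sense.) Note that the perplexity here is computed on the same corpus the model was trained on, not on held-out data: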

In [74]:
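# Perplexity
# log_perplexity returns the per-word likelihood bound: higher (closer to zero) is better.
# Perplexity itself is 2 ** (-bound), and for perplexity lower is better.
single_log_perplexity = single_model.log_perplexity(single_corpus)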

# Compute Coherence Score
single_coherence_model_lda = CoherenceModel(model=single_model, texts=tweets_clean_lst, dictionary=single_dict, coherence='c_v')
single_coherence_lda = single_coherence_model_lda.get_coherence()


2021-04-30 19:54:22,132 : INFO : -7.280 per-word bound, 155.4 perplexity estimate based on a held-out corpus of 382 documents with 418860 words

In [75]:
print(single_coherence_lda)

0.5159888425093138
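
The perplexity log line above reports both a per-word bound (-7.280) and a perplexity estimate (155.4). The two are linked by perplexity = 2 ** (-bound), which this short check reproduces from the logged value:

```python
# Per-word likelihood bound as reported in the log line above.
per_word_bound = -7.280

# Gensim reports perplexity as 2 raised to the negative bound.
perplexity = 2 ** (-per_word_bound)

print(round(perplexity, 1))
```

A bound closer to zero therefore means a lower perplexity, which is why the two numbers move in opposite directions when comparing models.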


We can also visualize the topics and their overlap:

In [80]:
pyLDAvis.enable_notebook()
single_plot = pyLDAvis.gensim_models.prepare(single_model, single_corpus, single_dict)
single_plot




Other resources used: https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/; https://stackoverflow.com/questions/66759852/no-module-named-pyldavis