# Twitter Topic Analysis of News Media

Political topics in news media seem to follow a fairly short cycle, yet they can feel alarming while you are living through them.

## Goals
- Get tweets from news media accounts in the US on a regular (daily?) basis
- Process tweets into daily bags of words to be stored in a database
- Train a topic model (LDA?) on a weekly(?) basis
- Understand topic duration and predict whether a topic has staying power
- Visualize in a website/dashboard
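
The second goal, daily bags of words, can be sketched with the standard library alone. This is a minimal illustration rather than the project's pipeline: `daily_bags` and the `(created_at, tokens)` input shape are assumptions, and the tokens would come from the pre-processing steps later in this notebook.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_bags(tweets):
    """Group tokenized tweets by calendar day and count tokens per day.

    `tweets` is an iterable of (created_at, tokens) pairs, where created_at
    is a datetime and tokens is a list of strings (hypothetical input shape).
    """
    bags = defaultdict(Counter)
    for created_at, tokens in tweets:
        bags[created_at.date()].update(tokens)
    return dict(bags)


# Toy input; the tokens are illustrative, not real pipeline output
sample = [
    (datetime(2021, 8, 22, 19, 41), ["taliban", "kabul_airport"]),
    (datetime(2021, 8, 22, 19, 20), ["kabul_airport", "evacuation"]),
    (datetime(2021, 8, 21, 1, 40), ["hurricane", "henri"]),
]
bags = daily_bags(sample)
```

Each day's `Counter` is a plain mapping of token to count, so it could be serialized as JSON and stored in one database row per date.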

## Setup

In [1]:
import numpy as np
import pandas as pd
import tweepy
import json

In [2]:
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src import utils

In [3]:
# Credentials
with open("../credentials/config.json", 'r') as f:
    cfg = json.load(f)

print(cfg.keys())


dict_keys(['consumer_key', 'consumer_secret', 'bearer_token'])


## Connect to Twitter API and search tweets

In [4]:
auth = tweepy.AppAuthHandler(cfg["consumer_key"],cfg["consumer_secret"])
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
SINCE_ID = 1428817650549985024

news_accounts = ['foxnews', 'nytimes', 'CNN', 'NBCNews', 'voxdotcom', 'washingtonpost', 'WSJ', 'AP', 'Reuters', 'newsmax', 'OANN']
query = " OR ".join([f"from:{account}" for account in news_accounts])
cursor = tweepy.Cursor(api.search, q=query, since_id=SINCE_ID, count=100, result_type='recent', tweet_mode='extended').items()

attrs = ['id', 'user.screen_name', 'user.followers_count', 'created_at', 'retweet_count', 'favorite_count', 'lang', 'retweeted_status.user.screen_name', 'full_text', 'retweeted_status.full_text']
tweets = [utils.attrgetter(*attrs)(i) for i in cursor]
print(f"Retrieved {len(tweets)} tweets from {len(news_accounts)} accounts.")

Retrieved 1100 tweets from 11 accounts.
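
To run the search on a regular basis, `SINCE_ID` would need to advance after each run instead of being hard-coded. Below is a minimal sketch, assuming the last seen id is kept in a small JSON file. The file name `since_id.json` and both helper names are hypothetical. Tweet ids increase over time, so the largest id retrieved serves as the next lower bound.

```python
import json
import os
import tempfile


def load_since_id(path, default=None):
    """Return the stored since_id, or `default` if no state file exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "r") as f:
        return json.load(f)["since_id"]


def save_since_id(path, tweet_ids, previous=None):
    """Persist the largest tweet id seen so far and return it."""
    candidates = list(tweet_ids)
    if previous is not None:
        candidates.append(previous)
    if not candidates:
        return None  # nothing retrieved and no prior state: leave the file untouched
    newest = max(candidates)
    with open(path, "w") as f:
        json.dump({"since_id": newest}, f)
    return newest


# Example run against a temporary directory (the path is for illustration only)
state_path = os.path.join(tempfile.mkdtemp(), "since_id.json")
previous = load_since_id(state_path, default=1428817650549985024)
newest = save_since_id(state_path, [1429529301452476425, 1429528820177068032], previous)
```

On the next run, `load_since_id(state_path)` would supply the value passed as `since_id` to the search.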


In [5]:
news_accounts = ['foxnews', 'nytimes', 'CNN', 'NBCNews', 'voxdotcom', 'washingtonpost', 'WSJ', 'AP', 'Reuters', 'newsmax', 'OANN']
# Only @nytimes is queried in this cell, so the message reports that account rather than len(news_accounts)
query = "from:nytimes"
cursor = tweepy.Cursor(api.search, q=query, count=100, result_type='recent', tweet_mode='extended').items()

attrs = ['id', 'user.screen_name', 'user.followers_count', 'created_at', 'retweet_count', 'favorite_count', 'lang', 'retweeted_status.user.screen_name', 'full_text', 'retweeted_status.full_text']
tweets = [utils.attrgetter(*attrs)(i) for i in cursor]
print(f"Retrieved {len(tweets)} tweets from @nytimes.")

Retrieved 942 tweets from @nytimes.
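
`utils.attrgetter` comes from the project's `src` package, which is not shown in this notebook. Because fields such as `retweeted_status.full_text` only exist on retweets, the helper presumably returns a placeholder instead of raising when an attribute is missing. The following is a sketch of that assumed behavior, not the actual implementation; `safe_attrgetter` is a hypothetical name.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel that cannot collide with a real attribute value


def safe_attrgetter(*paths, default=None):
    """Build a callable that reads dotted attribute paths from an object.

    Unlike operator.attrgetter, a missing attribute anywhere along a path
    yields `default` instead of raising AttributeError.
    """
    def resolve(obj, path):
        for name in path.split("."):
            obj = getattr(obj, name, _MISSING)
            if obj is _MISSING:
                return default  # stop descending once the path is broken
        return obj

    def getter(obj):
        return tuple(resolve(obj, path) for path in paths)

    return getter


# Stand-in for a tweepy Status object that is not a retweet
status = SimpleNamespace(id=1, user=SimpleNamespace(screen_name="nytimes"), full_text="hello")
row = safe_attrgetter("id", "user.screen_name", "retweeted_status.full_text")(status)
```

Returning a tuple for every status keeps the rows aligned with the `attrs` list when they are passed to `pd.DataFrame(tweets, columns=attrs)`.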


In [40]:
data = pd.DataFrame(tweets, columns=attrs)
data.head()

Unnamed: 0,id,user.screen_name,user.followers_count,created_at,retweet_count,favorite_count,lang,retweeted_status.user.screen_name,full_text,retweeted_status.full_text
0,1429529301452476425,nytimes,50335345,2021-08-22 19:41:59,64,337,en,,Two prominent Republicans — Representative Ad...,
1,1429528820177068032,nytimes,50335345,2021-08-22 19:40:04,27,84,en,,Fears are growing over the safety of roughly 3...,
2,1429528355246952451,nytimes,50335345,2021-08-22 19:38:13,13,63,en,,Correction: We deleted an earlier tweet that i...,
3,1429528260132671499,nytimes,50335345,2021-08-22 19:37:51,15,79,en,,“Business and feelings and emotions don’t work...,
4,1429523791647023111,nytimes,50335345,2021-08-22 19:20:05,196,894,en,,"Josephine Baker, an American-born Black dancer...",


In [41]:
data = utils.merge_retweet_full_text(data)
data = data.dropna(axis=1)
data = data[data['lang'] == "en"]
data = data.sample(frac=1).reset_index(drop=True)
data.head()

Unnamed: 0,id,user.screen_name,user.followers_count,created_at,retweet_count,favorite_count,lang,full_text
0,1428225239868137477,nytimes,50335345,2021-08-19 05:20:06,109,0,en,Haiti was hit by an earthquake on Sunday that ...
1,1428542318689361924,nytimes,50335347,2021-08-20 02:20:04,89,0,en,“People are not going to be able to make ends ...
2,1428704639894106117,nytimes,50335347,2021-08-20 13:05:04,61,329,en,United Airlines recently told flight attendant...
3,1427989942894989315,nytimes,50335347,2021-08-18 13:45:07,69,184,en,"The spread of the Caldor fire in California, w..."
4,1428894636383289344,nytimes,50335347,2021-08-21 01:40:03,40,174,en,Technology billionaires have typically divorce...


## LDA Pre-processing

In [42]:
#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#spacy
import spacy
from nltk.corpus import stopwords

#vis
import pyLDAvis
import pyLDAvis.gensim_models

# import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning)

In [43]:
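# NOTE: rebinds `stopwords` from the nltk corpus reader to a plain list; re-import before re-running this cell
stopwords = stopwords.words("english")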

In [217]:
texts = list(data['full_text'])
texts = [" ".join([t for t in text.split(" ") if "https:" not in t]) for text in texts]
texts[:4]

['Haiti was hit by an earthquake on Sunday that killed more than 1,900 people and left thousands injured and displaced. Days later, a strong storm lashed the area, bringing the risk of floods and mudslides. Here’s how the disasters devastated the country.',
 '“People are not going to be able to make ends meet. People are going to loose the roofs over their heads.” \n\nThank you to @themeredith and the other OnlyFans creators/sex workers who spoke to me for this story',
 'United Airlines recently told flight attendants not to tape its passengers to seats. Flight attendants say it was a rude PR stunt, and that airlines have failed to offer ways to deal with unruly passengers.',
 'The spread of the Caldor fire in California, which started Saturday, prompted a mandatory evacuation order for Pollock Pines, a community of 7,000 people close to the state capital.\n\nOfficials described the blaze as “dynamic and rapidly']
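
Splitting on single spaces only removes a URL cleanly when the link is its own space-delimited token. A link that directly follows a newline is glued to the preceding word, so the whole token, word included, gets dropped (possibly what cut off the fourth example above). A regex-based alternative is sketched below; `strip_urls` is a hypothetical helper and the URL in the example is made up.

```python
import re

# Matches http(s) URLs up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs, then collapse the spaces left behind (newlines are kept)."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"[ \t]+", " ", without_urls).strip()


example = "Officials described the blaze as dynamic.\nhttps://t.co/abc123 More soon"

# Space-splitting approach: the word attached to the URL is lost with it
naive = " ".join(t for t in example.split(" ") if "https:" not in t)

# Regex approach: only the URL itself is removed
cleaned = strip_urls(example)
```

The pattern also covers plain `http://` links, which the `"https:" not in t` check would let through.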

In [219]:
def lemmatization(texts, allowed_postags=["PROPN", "NOUN", "ADJ", "VERB", "ADP", "NUM"]):
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        new_text = []
        for token in doc:
            if token.pos_ in allowed_postags and token.is_alpha:
                new_text.append(token.lemma_)
        final = " ".join(new_text)
        texts_out.append(final)
    return (texts_out)


lemmatized_texts = lemmatization(texts)
print (lemmatized_texts[0][0:90])

Haiti be hit by earthquake on Sunday kill more people leave thousand injure displace day s


In [88]:
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)

print (data_words[2])

['united', 'airlines', 'recently', 'tell', 'flight', 'attendant', 'tape', 'passenger', 'seat', 'flight', 'attendant', 'say', 'be', 'rude', 'pr', 'stunt', 'airline', 'have', 'fail', 'offer', 'way', 'deal', 'unruly', 'passenger']


In [89]:
#BIGRAMS AND TRIGRAMS
bigram_phrases = gensim.models.Phrases(data_words, min_count=2, threshold=5)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=5)

bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

def make_bigrams(texts):
    return([bigram[doc] for doc in texts])

def make_trigrams(texts):
    return ([trigram[bigram[doc]] for doc in texts])

data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

print (data_bigrams_trigrams[:100])

[['haiti', 'be', 'hit_earthquake', 'sunday', 'kill_more', 'people', 'leave_thousand', 'injure_displace', 'day_later', 'strong_storm', 'lash_area', 'bring_risk', 'flood_mudslide', 'here_how', 'disaster_devastate', 'country'], ['people', 'be', 'go', 'be_able', 'make', 'end', 'meet', 'people', 'be', 'go', 'loose', 'roof', 'head', 'thank', 'themeredith', 'other', 'onlyfans', 'creator', 'sex', 'worker', 'speak', 'story'], ['united', 'airlines', 'recently', 'tell', 'flight_attendant', 'tape', 'passenger', 'seat', 'flight_attendant', 'say', 'be', 'rude', 'pr', 'stunt', 'airline_have', 'fail', 'offer', 'way', 'deal', 'unruly', 'passenger'], ['spread_caldor', 'fire_california', 'start', 'saturday', 'prompt', 'mandatory_evacuation', 'order_pollock', 'pines_community', 'people', 'close', 'state', 'capital', 'official', 'describe', 'blaze', 'dynamic', 'rapidly'], ['technology', 'billionaire', 'have', 'typically', 'divorce', 'closed', 'door', 'trial', 'expect', 'start', 'monday', 'determine', 'how'
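
`Phrases` joins two adjacent tokens when their co-occurrence score exceeds `threshold`. With gensim's default scorer the score is `(bigram_count - min_count) * vocab_size / (count_a * count_b)`. The sketch below re-implements that formula in plain Python to show why a frequent pair such as `flight_attendant` gets merged. The toy corpus and the function name are illustrative, and using the number of distinct unigrams as `vocab_size` is a simplification of what gensim tracks internally.

```python
from collections import Counter


def bigram_scores(docs, min_count=2):
    """Score adjacent token pairs with the default Phrases-style formula."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


docs = [
    ["flight", "attendant", "tape", "passenger"],
    ["flight", "attendant", "say", "rude"],
    ["flight", "attendant", "union", "say"],
]
scores = bigram_scores(docs, min_count=2)

# The toy corpus is tiny, so a much lower threshold than the notebook's 5 is used here
merged = {pair for pair, score in scores.items() if score > 0.5}
```

Pairs seen no more than `min_count` times get a score of zero or below, so they never pass a positive threshold.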

In [163]:
badwords = ['https_co', 'https', 'co', 'be', 'say', 'breaking_news', 'new_york_times'] + [i.lower() for i in news_accounts] + stopwords
data_bigrams_trigrams_clean = [[ele for ele in sub if ele not in badwords] for sub in data_bigrams_trigrams]

In [164]:
from collections import Counter

count_grams = [i for li in data_bigrams_trigrams_clean for i in li]
count  = Counter(count_grams)
countdf = pd.DataFrame(count.most_common())

In [165]:
countdf[countdf[0].str.contains("_")].head(50)

Unnamed: 0,0,1
11,here_be,43
17,president_biden,37
38,tropical_storm,29
51,new_york,25
68,biden_administration,22
87,delta_variant,19
103,official_say,18
107,new_york_city,17
108,at_least,17
118,kabul_airport,16


In [174]:
countdf.head(50)

Unnamed: 0,0,1
0,afghanistan,138
1,taliban,108
2,people,94
3,new,67
4,state,54
5,country,53
6,get,49
7,covid,49
8,write,49
9,woman,47


In [166]:
#TF-IDF REMOVAL
from gensim.models import TfidfModel

id2word = corpora.Dictionary(data_bigrams_trigrams_clean)

texts = data_bigrams_trigrams

corpus = [id2word.doc2bow(text) for text in texts]
# print (corpus[0][0:20])

tfidf = TfidfModel(corpus, id2word=id2word)

low_value = 0.03
words = []
for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_ids = [term_id for term_id, value in tfidf[bow]]
    bow_ids = [term_id for term_id, value in bow]
    low_value_words = [term_id for term_id, value in tfidf[bow] if value < low_value]
    # Terms with a TF-IDF score of 0 are missing from tfidf[bow]
    words_missing_in_tfidf = [term_id for term_id in bow_ids if term_id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(id2word[item])

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    corpus[i] = new_bow

# NOTE: the dictionary and corpus are rebuilt from the cleaned tokens below,
# so the TF-IDF filtering above does not carry over to the LDA model.
id2word = corpora.Dictionary(data_bigrams_trigrams_clean)

corpus = []
for text in data_bigrams_trigrams_clean:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])

word = id2word[0]
print (word)
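
The filtering idea in the cell above, dropping terms whose TF-IDF weight within a document falls below `low_value`, can be shown without gensim. This sketch assumes the weighting that `TfidfModel` applies by default, to my understanding: raw term frequency times `log2(N / df)`, L2-normalized per document. `filter_low_tfidf` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter


def filter_low_tfidf(docs, low_value=0.03):
    """Drop tokens whose normalized TF-IDF weight in a document is below low_value."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    filtered = []
    for doc in docs:
        tf = Counter(doc)
        weights = {tok: count * math.log2(n_docs / doc_freq[tok]) for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            # every token appears in every document, so nothing is informative
            filtered.append([])
            continue
        filtered.append([tok for tok in doc if weights[tok] / norm >= low_value])
    return filtered


# Toy documents; "people" occurs in all of them, so its weight is 0
docs = [
    ["people", "afghanistan", "taliban"],
    ["people", "hurricane", "henri"],
    ["people", "vaccine", "covid"],
]
kept = filter_low_tfidf(docs)
```

A token that occurs in every document has an inverse document frequency of zero, which is why very common words such as `people` are the first to go.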

## LDA Model

In [214]:
NUM_TOPICS = 14

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=NUM_TOPICS,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=100,
                                           decay=0.5,
                                           per_word_topics=True,
                                           minimum_phi_value=0.03,
                                           alpha="auto")

In [215]:
coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=data_bigrams_trigrams_clean, dictionary=id2word)
print(coherence_model.get_coherence())

0.4917201882599735


In [213]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis


In [209]:
topics = lda_model.show_topics(NUM_TOPICS, 20)
topics

[(0,
  '0.054*"vaccine" + 0.037*"hurricane" + 0.035*"california" + 0.035*"henri" + 0.030*"make_landfall" + 0.028*"coast" + 0.027*"community" + 0.025*"election" + 0.022*"recall" + 0.022*"democrats" + 0.021*"global" + 0.020*"island" + 0.018*"new_england" + 0.017*"vote" + 0.016*"voter" + 0.013*"come" + 0.012*"newsom" + 0.012*"million" + 0.011*"first_hurricane" + 0.010*"employee"'),
 (1,
  '0.182*"new" + 0.064*"fall" + 0.051*"part" + 0.040*"next" + 0.029*"be_still" + 0.029*"watch" + 0.029*"think" + 0.025*"rise" + 0.024*"fight" + 0.014*"date" + 0.014*"cancel" + 0.012*"virus_case" + 0.008*"stadium" + 0.004*"brooks" + 0.004*"tour" + 0.004*"garth" + 0.000*"medal" + 0.000*"ruby" + 0.000*"freshly" + 0.000*"xargay"'),
 (2,
  '0.103*"continue" + 0.053*"follow" + 0.028*"surprise" + 0.016*"reporter" + 0.000*"ample" + 0.000*"weather" + 0.000*"hair" + 0.000*"unvaccinated" + 0.000*"stack" + 0.000*"quiz" + 0.000*"tragedy" + 0.000*"laundry" + 0.000*"desperately_try" + 0.000*"knee" + 0.000*"balance" + 0.0

In [210]:
TEST_NO = 384

test = lda_model.get_document_topics(corpus)
print(data.full_text[TEST_NO])
# Look up the topic by its id, not by its position in the (possibly filtered) list
best_topic_id = max(test[TEST_NO], key=lambda pair: pair[1])[0]
print(topics[best_topic_id])
pd.DataFrame(test[TEST_NO])[[1]].T.style.background_gradient(axis=1)

Once a fierce critic of Donald Trump, Senator Lindsey Graham remains his staunch defender. 

Inside one of the unlikeliest relationships in politics: https://t.co/XI3AMhmCFK
(3, '0.072*"tropical_storm" + 0.065*"airport" + 0.050*"want" + 0.035*"official_say" + 0.034*"much" + 0.031*"saturday" + 0.029*"interview" + 0.029*"remain" + 0.027*"taliban_fighter" + 0.021*"ground" + 0.021*"appear" + 0.020*"daily" + 0.017*"reality" + 0.016*"video" + 0.015*"landfall" + 0.013*"international" + 0.012*"official" + 0.011*"prevent" + 0.011*"female" + 0.011*"future"')


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27
1,0.022753,0.027004,0.039191,0.143137,0.036484,0.070506,0.032623,0.034621,0.017384,0.018057,0.01302,0.015904,0.016934,0.046814,0.040533,0.011432,0.024259,0.011707,0.049533,0.01054,0.046195,0.049449,0.036854,0.033115,0.020589,0.016194,0.052179,0.046971


In [160]:
data[data['full_text'].str.lower().str.contains('new york times')]

Unnamed: 0,id,user.screen_name,user.followers_count,created_at,retweet_count,favorite_count,lang,full_text
49,1428527233556168704,nytimes,50335347,2021-08-20 01:20:07,296,0,en,NEW: The New York Times worked with the Qatari...
117,1426790795353415681,nytimes,50335345,2021-08-15 06:20:08,25,142,en,“New York feels horny again. It feels sexy aga...
248,1428718485694291974,nytimes,50335347,2021-08-20 14:00:05,57,166,en,"Here’s how The New York Times, The Wall Street..."
288,1427913179082919936,nytimes,50335347,2021-08-18 08:40:05,37,0,en,“I never knew a man who had better motives for...
303,1428514658563104779,nytimes,50335347,2021-08-20 00:30:09,74,339,en,"The publishers of The New York Times, The Wall..."
324,1427471515822481409,nytimes,50335345,2021-08-17 03:25:05,632,2850,en,The front page of The New York Times for Aug. ...
355,1428177424664117248,nytimes,50335345,2021-08-19 02:10:06,87,588,en,11 vegetarian recipes that New York Times Cook...
413,1427373370803425290,nytimes,50335347,2021-08-16 20:55:05,533,0,en,"Dear President Biden, a joint Statement on Beh..."
419,1428157279916986369,nytimes,50335345,2021-08-19 00:50:04,172,0,en,Biden Administration to Use Federal Civil Righ...
429,1428469357353517066,nytimes,50335345,2021-08-19 21:30:09,41,164,en,"New York’s digital vaccine app, the Excelsior ..."


In [161]:
data.loc[248, 'full_text']

'Here’s how The New York Times, The Wall Street Journal and The Washington Post scrambled to get their Afghan colleagues out of Kabul. https://t.co/b0WJcYkD0a'

## Scratchwork

In [2]:
import pandas as pd
data_large = pd.read_excel('../data/nytimes_foxnews_tweets.xlsx')

In [3]:
[col for col in data_large.columns if "follow" in col]

['user.followers_count',
 'user.following',
 'user.follow_request_sent',
 'retweeted_status.user.followers_count',
 'retweeted_status.user.following',
 'retweeted_status.user.follow_request_sent',
 'quoted_status.user.followers_count',
 'quoted_status.user.following',
 'quoted_status.user.follow_request_sent',
 'retweeted_status.quoted_status.user.followers_count',
 'retweeted_status.quoted_status.user.following',
 'retweeted_status.quoted_status.user.follow_request_sent']