# Topic analysis of tweets referencing Forbes and CNBC

## Contents: 

1. [Explanation of preprocessing steps + parameters](#first-section)
2. [Motivation of media outlets](#second-section)
3. [Code](#third-section)
4. [Interpretation and discussion](#fourth-section)

In [1]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import spacy
from gensim.corpora import Dictionary
# gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim 3.x
from gensim.models.wrappers import LdaMallet
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import gensim
import re

In [2]:
nlp = spacy.load("en_core_web_sm")

## 1. Explanation of preprocessing steps + parameters <a class="anchor" id="first-section"></a>

### 1.1 Processing

I first processed the tweets through spaCy's `nlp.pipe` method. Processing the text with spaCy has the advantage of capturing a wide variety of token-level metadata, such as the part of speech and the lemma.

### 1.2 Tokenizing

I have removed stop words and punctuation as these would not be very informative in distinguishing between topics.

I have lemmatized the words as lemmatization reduces the size of the dictionary and is generally found to improve model performance in many applications. 

Although some argue that including only nouns or adjectives improves coherence in topic modelling, I have not restricted the permitted words based on their POS tag in my analysis. My reason for this is the appearance of highly informative words from various POS categories among the topics. Examples include the adjectives {bipartisan, supreme, black, public} and verbs {shoot, kill, vote}.

### 1.3 Creating dictionary

I have filtered out extremely uncommon words that appear in fewer than 5 documents and extremely common words that appear in more than 85% of documents.

### 1.4 Selecting number of topics

While data-driven metrics such as coherence allow direct comparison of different topic models, I have not used them in my analysis. My criterion for choosing the number of topics was human interpretability, which is crucial for examining the qualitative differences between discussions; I have based my decision on it despite the subjective bias it may introduce. I found that increasing the number of topics from 15 **(3.3.1)** to 20 **(3.3.2)** resulted in less distinct topics and a higher number of uninterpretable topics, which is why I have used the model with 15 topics for my analysis.
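For readers unfamiliar with coherence, the idea can be illustrated with a UMass-style score, which rewards topics whose top words tend to occur in the same documents. This is only a minimal plain-Python sketch and is not part of the analysis here: the function name `umass_coherence` and the toy documents are made up for illustration, and a library implementation (for example gensim's `CoherenceModel`) may normalise the score differently.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs.

    top_words: a topic's top words, most probable first.
    docs: list of token lists.
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # j indexes the higher-ranked word of each pair, i the lower-ranked one
    for j, i in combinations(range(len(top_words)), 2):
        earlier, later = top_words[j], top_words[i]
        denom = doc_count(earlier)
        if denom:  # skip pairs whose higher-ranked word never occurs
            score += math.log((doc_count(later, earlier) + 1) / denom)
    return score


# Hypothetical toy corpus: two voting documents, two vaccine documents
toy_docs = [
    ["vote", "ballot", "mail"],
    ["vote", "ballot"],
    ["vaccine", "covid"],
    ["vaccine", "covid", "trial"],
]

coherent_score = umass_coherence(["vote", "ballot"], toy_docs)
mixed_score = umass_coherence(["vote", "vaccine"], toy_docs)
print(coherent_score, mixed_score)
```

Words that co-occur ("vote", "ballot") score higher than words that never share a document ("vote", "vaccine"), which is the behaviour such a metric would use to compare models with different numbers of topics.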

## 2. Motivation of media outlets <a class="anchor" id="second-section"></a>

Forbes and CNBC are both news outlets focusing on business, financial markets and the economy. Because both networks cover the same themes and target similar audiences on the surface, it would be interesting to see whether the tweets which reference them have a large overlap in topics or whether there is some distinction to be had.

## 3. Code <a class="anchor" id="third-section"></a>

### 3.1 Data Exploration

Loading the data and examining the first 5 records of the dataframe

In [3]:
framing = pd.read_pickle('data/framing.p')
framing.head()

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,
1,1294021087118987264,2020-08-13 21:20:43,SenShelby,R,Alabama,Senator,Great news! Today @mazda_toyota announced an a...,,https://pressroom.toyota.com/mazda-and-toyota-...,pressroom.toyota.com,Mazda and Toyota Further Commitment to U.S. Ma...,"HUNTSVILLE, Ala., (Aug. 13, 2020) – Today, Maz...",
2,1323340848130609156,2020-11-02 19:06:59,DougJones,D,Alabama,Senator,He’s already quitting on the folks of Alabama ...,,https://apnews.com/article/c73f0dfe8008ebaf85e...,apnews.com,"Tuberville, Jones fight for Senate seat in Ala...","GARDENDALE, Ala. (AP) — U.S. Sen. Doug Jones, ...",
3,1323004075831709698,2020-11-01 20:48:46,DougJones,D,Alabama,Senator,I know you guys are getting bombarded with fun...,,https://secure.actblue.com/donate/djfs-close?r...,secure.actblue.com,I just gave!,Join us! Contribute today.,negiotated
4,1322567531320717314,2020-10-31 15:54:06,DougJones,D,Alabama,Senator,"Well looky here folks, his own players don’t t...",,https://slate.com/culture/2020/10/tommy-tuberv...,slate.com,What Tommy Tuberville’s Former Auburn Players ...,"""All I could think is, why?""",


Viewing the shape of the dataframe

In [8]:
framing.shape

(23448, 13)

Getting the dates of the earliest and latest tweets

In [23]:
framing['date'].min()

Timestamp('2020-08-13 00:00:00')

In [24]:
framing['date'].max()

Timestamp('2020-11-14 13:46:00')

We can see that the data consists of 23,448 tweets by members of the US Congress, spanning 13 Aug 2020 to 14 Nov 2020, with values for 13 attributes. The tweets cover the roughly three months leading up to, and the days immediately after, the 2020 US presidential election on 3 November. Moreover, all tweets were posted during the ongoing COVID-19 pandemic. Thus, we may already have some expectations about the topics that occur in these tweets.

Examining the networks with the highest number of referencing tweets (the slice `[1:40]` skips the top-ranked entry and shows ranks 2 to 40)

In [26]:
framing.groupby('netloc').agg({'tweet_id': 'count'}).sort_values('tweet_id', ascending=False)[1:40]

netloc,tweet_id
,1092
www.politico.com,623
www.cnn.com,602
thehill.com,550
www.foxnews.com,549
www.nbcnews.com,485
nyti.ms,422
www.cnbc.com,342
cnn.it,341
secure.actblue.com,292


### 3.2 Data Preparation / Preprocessing

Processing the text through spaCy's `nlp.pipe` method (the code below passes the `description` column).

In [12]:
processed_texts = [text for text in tqdm(nlp.pipe(framing.description, 
                                              disable=["ner",
                                                       "parser"]))]

Tokenizing the processed texts

In [36]:
# tokenizing + removing stop-words and punctuation + lemmatization
tokenized_texts = [[word.lemma_.lower() for word in processed_text if not word.is_stop and not word.is_punct] 
                   for processed_text in processed_texts]

# replacing non-word characters with an empty string
tokenized_texts = [[re.sub(r'\W+', '', word) for word in text] for text in tokenized_texts] 
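One side effect of the substitution above is worth noting: a token made up entirely of non-word characters becomes an empty string and stays in the token list. The following is a minimal sketch of dropping such tokens, assuming that is the desired behaviour; the helper name `strip_non_word` and the example tokens are hypothetical and not part of the original pipeline.

```python
import re


def strip_non_word(tokens):
    """Remove non-word characters from each token and drop tokens left empty."""
    cleaned = (re.sub(r"\W+", "", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]


# Made-up tokens: two contain punctuation, two consist only of non-word characters
example_tokens = ["covid-19", "u.s.", "--", "vote", "’"]
cleaned_tokens = strip_non_word(example_tokens)
print(cleaned_tokens)
```

The same regular expression also merges hyphenated tokens, so a token such as `covid-19` would become `covid19`.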

Creating a dictionary

In [79]:
dictionary = Dictionary(tokenized_texts) 
dictionary.filter_extremes(no_below=5, # removing extremely rare words appearing in fewer than 5 documents
                           no_above=0.85) # removing extremely common words appearing in more than 85% of documents

corpus = [dictionary.doc2bow(text) for text in tokenized_texts] # creating corpus
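The effect of the two thresholds passed to `filter_extremes` can be illustrated on a toy corpus. This is a plain-Python sketch of document-frequency filtering under stated assumptions, not gensim's implementation: the function name `filter_by_document_frequency` and the toy documents are hypothetical, and gensim's additional `keep_n` cap is ignored here.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=5, no_above=0.85):
    """Return the tokens whose document frequency lies within the given bounds.

    docs: list of token lists. A token is kept if it occurs in at least
    `no_below` documents and in at most `no_above * len(docs)` documents.
    """
    # Count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Made-up documents: "covid" is in every document, "vaccine" in two, the rest in one
toy_docs = [
    ["covid", "vaccine", "trial"],
    ["covid", "relief", "bill"],
    ["covid", "vaccine", "rollout"],
    ["covid", "election", "ballot"],
]

kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.85)
print(kept_tokens)
```

With these toy thresholds only "vaccine" survives: "covid" is too common (4 of 4 documents exceeds 85%) and every other word is too rare.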

### 3.3 Learning the topic model

#### 3.3.1 Model 1 (15 topics)

Learning the topic model with 15 topics

In [235]:
lda15 = LdaMallet(r'C:/mallet/bin/mallet.bat',
                corpus=corpus,
                id2word=dictionary,
                num_topics=15,
                optimize_interval=10,
                iterations=1000)

lda15.save(r'models/lda15.model')

Creating a list of the top 10 words for each topic and outputting it to the screen

In [40]:
topics15 = []

for topic in range(15):
    words = lda15.show_topic(topic, 10)
    topic_n_words = ' '.join([word[0] for word in words])
    topics15.append('Topic {}: {}'.format(str(topic+1), topic_n_words))
topics15

['Topic 1: state national million department join federal today announce grant community',
 'Topic 2: trump president administration official department united white report change china',
 'Topic 3: year woman day american black life honor national world country',
 'Topic 4: health covid19 care coronavirus vaccine test pandemic public find program',
 'Topic 5: school student education child family survey teacher year jones america',
 'Topic 6: coronavirus case covid19 pandemic state report number people million week',
 'Topic 7: act bill support bipartisan legislation veteran congress year crisis provide',
 'Topic 8: county news fire state city california wildfire local community home',
 'Topic 9: election vote ballot voter state mail voting 2020 general official',
 'Topic 10: president trump biden joe donald election presidential campaign senate democratic',
 'Topic 11: court supreme senate justice barrett amy coney president judge nominee',
 'Topic 12: police officer city department 

##### labels for 15 topics

We can attempt to label these topics as:
- 1: N/a
- 2: trump-related
- 3: holidays
- 4: covid19-policy
- 5: education
- 6: covid19-status
- 7: legislation
- 8: news
- 9: voting/ballots
- 10: 2020 election
- 11: supreme court
- 12: police shootings
- 13: postal voting
- 14: politics
- 15: covid19-economy

In [175]:
pyLDAvis.enable_notebook()

lda_conv = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda15)

gensimvis.prepare(lda_conv, corpus, dictionary)


#### 3.3.2 Model 2 (20 topics)

In [229]:
lda20 = LdaMallet(r'C:/mallet/bin/mallet.bat',
                corpus=corpus,
                id2word=dictionary,
                num_topics=20,
                optimize_interval=10,
                iterations=1000)

lda20.save(r'models/lda20.model')

Examining top 10 words in each topic:

In [262]:
topics20 = []

for topic in range(20):
    words = lda20.show_topic(topic, 10)
    topic_n_words = ' '.join([word[0] for word in words])
    topics20.append('Topic {}: {}'.format(str(topic+1), topic_n_words))
topics20

['Topic 1: coronavirus pandemic million business relief covid19 federal economic week benefit',
 'Topic 2: information official trumpâs obtain tax times write decade hundred president',
 'Topic 3: biden president trump joe presidential election campaign democratic vice donald',
 'Topic 4: department china report security force release homeland georgia border accord',
 'Topic 5: house rep bill committee member act legislation introduce congress policy',
 'Topic 6: county state announce city year million department grant school federal',
 'Topic 7: year people american country government change public black political america',
 'Topic 8: news 2020 live talk event story watch show late join',
 'Topic 9: election mail service vote postal ballot general voter voting state',
 'Topic 10: court supreme senate justice barrett amy coney judge nominee majority',
 'Topic 11: health care today join face percent american contribute insurance return',
 'Topic 12: woman pelosi speaker nancy house day 

##### labels for 20 topics

We can attempt to label these topics as:
- 1: covid19-economy
- 2: trump-scandals (tax)
- 3: election-biden
- 4: foreign affairs
- 5: legislation
- 6: N/a
- 7: N/a
- 8: news & live events
- 9: postal voting
- 10: supreme court
- 11: healthcare
- 12: nancy pelosi
- 13: fires
- 14: military affairs
- 15: covid19-vaccines
- 16: trump administration
- 17: police shootings
- 18: legislation
- 19: covid19-health
- 20: covid19-education

The model with 15 topics appears more consistent and interpretable. The model with 20 topics has two topics to which I could not assign a label and shows greater overlap between topics.

Moreover, I believe that decreasing the number of topics below 15 is not necessary, as the topics are already sufficiently distinct and lowering the number any further would result in a loss of detail.

Thus, the rest of this analysis will make use of the lda15 model consisting of 15 topics.

### 3.4 Exploring differences between Forbes and CNBC

Creating a dataframe of the topic distributions

In [28]:
# Transforming the documents to their topic distributions
transformed_docs = lda15.load_document_topics()

# Creating a dataframe of the topic distributions
topic_distributions = pd.DataFrame([[x[1] for x in doc] for doc in transformed_docs], 
             columns=['topic_{}'.format(i) for i in range(1,16)])

# Joining the topic distributions with the main dataframe
joined_topic_dist = framing.reset_index().join(topic_distributions)
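The list comprehension above takes the second element of each pair in document order, so it relies on every document listing all 15 topics in ascending order. As a hedged alternative, the sketch below places each weight by its topic id instead. It assumes each document is a list of `(topic_id, weight)` pairs; the helper name `to_dense_rows` and the toy values are hypothetical.

```python
def to_dense_rows(doc_topics, num_topics):
    """Convert per-document (topic_id, weight) pairs into fixed-length rows."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics  # topics missing from a document stay at 0.0
        for topic_id, weight in pairs:
            row[topic_id] = weight
        rows.append(row)
    return rows


# Made-up example: the second document lists its topics out of order
# and omits topic 1 entirely
toy_doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(2, 0.6), (0, 0.4)],
]

dense = to_dense_rows(toy_doc_topics, num_topics=3)
print(dense)
```

The resulting rows can be passed to `pd.DataFrame` with the same column names as above.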

Creating a subset of the data that only includes tweets referencing Forbes and CNBC

In [32]:
forbes_cnbc = joined_topic_dist[joined_topic_dist['netloc'].isin(['www.forbes.com', 'www.cnbc.com'])]
len(forbes_cnbc)

449

In [33]:
forbes_cnbc.head()

Unnamed: 0,index,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,...,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15
75,91,1298656867393167360,2020-08-26 16:21:39,RepByrne,R,Alabama 1st District,Representative,Please join me in prayer for our Gulf Coast ne...,,https://www.cnbc.com/2020/08/26/hurricane-laur...,...,0.002627,0.000528,0.896789,0.002372,0.003868,0.001568,0.002028,0.001568,0.003402,0.003541
149,174,1325789473813172224,2020-11-09 13:16:57,RepMoBrooks,R,Alabama 5th District,U.S. Representative,#Pfizer #COVID19 #vaccine is 90% effective.\n\...,,https://www.cnbc.com/2020/11/09/covid-vaccine-...,...,0.003045,0.000612,0.003637,0.00275,0.004484,0.001818,0.002351,0.001818,0.003944,0.004105
155,181,1321803173510680578,2020-10-29 13:16:49,RepMoBrooks,R,Alabama 5th District,U.S. Representative,The American economy grew at an astounding 33....,,https://www.cnbc.com/2020/10/29/us-gdp-report-...,...,0.003623,0.000728,0.004326,0.003272,0.005335,0.002163,0.002797,0.002162,0.004692,0.004883
207,235,1326211786958401537,2020-11-10 17:15:04,USRepGaryPalmer,R,Alabama 6th District,U.S. Representative,This is encouraging news about the development...,,https://www.cnbc.com/2020/11/09/covid-vaccine-...,...,0.003045,0.000612,0.003637,0.00275,0.004484,0.001818,0.002351,0.001818,0.003944,0.004105
208,236,1325888131116228612,2020-11-09 19:48:58,USRepGaryPalmer,R,Alabama 6th District,U.S. Representative,Good news continues on the jobs front! The #Jo...,,https://www.cnbc.com/2020/11/06/jobs-report-oc...,...,0.951755,0.000728,0.004326,0.003272,0.005335,0.002163,0.002797,0.002162,0.004692,0.004883


Getting the mean topic distribution per media outlet

In [36]:
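# Grouping by netloc and getting the mean topic vector
# (numeric_only=True skips text columns, which pandas 2.0 and later no longer drops silently)
mean_topic = forbes_cnbc.groupby('netloc').mean(numeric_only=True)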

# Dropping 'index' and 'tweet_id' columns
mean_topic.drop(columns= ['index', 'tweet_id'], inplace=True)

mean_topic

netloc,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15
www.cnbc.com,0.024364,0.030233,0.084037,0.056749,0.12192,0.235052,0.003238,0.012953,0.01231,0.044315,0.023373,0.01441,0.025795,0.037239,0.274012
www.forbes.com,0.104951,0.079964,0.09618,0.067463,0.018575,0.074756,0.001618,0.037131,0.020674,0.093666,0.026612,0.015997,0.10666,0.065098,0.190654


Transposing the dataframe so that the topics are the rows and the networks are the columns. This will make the rest of my analysis easier.

In [37]:
mean_topic_transposed = mean_topic.transpose() # transposing the dataframe
mean_topic_transposed.index.rename('topic', inplace=True) # renaming the index
mean_topic_transposed

topic,www.cnbc.com,www.forbes.com
topic_1,0.024364,0.104951
topic_2,0.030233,0.079964
topic_3,0.084037,0.09618
topic_4,0.056749,0.067463
topic_5,0.12192,0.018575
topic_6,0.235052,0.074756
topic_7,0.003238,0.001618
topic_8,0.012953,0.037131
topic_9,0.01231,0.020674
topic_10,0.044315,0.093666


Showing the top 5 most prominent topics for Forbes

In [38]:
forbes_top5 = mean_topic_transposed.loc[:,'www.forbes.com'].sort_values(ascending=False)[0:5]
forbes_top5

topic
topic_15    0.190654
topic_13    0.106660
topic_1     0.104951
topic_3     0.096180
topic_10    0.093666
Name: www.forbes.com, dtype: float64

In [42]:
print(f"""
Top 5 topics in Forbes:

COVID19-ECONOMY, {topics15[14]}
POSTAL VOTING, {topics15[12]}
N/A (UNINTERPRETABLE), {topics15[0]}
HOLIDAYS, {topics15[2]}
2020 ELECTION, {topics15[9]}
""")


Top 5 topics in Forbes:

COVID19-ECONOMY, Topic 15: coronavirus pandemic relief federal million business program covid19 house democrats
POSTAL VOTING, Topic 13: service postal mail general tax postmaster change dejoy trump president
N/A (UNINTERPRETABLE), Topic 1: state national million department join federal today announce grant community
HOLIDAYS, Topic 3: year woman day american black life honor national world country
2020 ELECTION, Topic 10: president trump biden joe donald election presidential campaign senate democratic



Showing the top 5 most prominent topics for CNBC

In [43]:
cnbc_top5 = mean_topic_transposed.loc[:,'www.cnbc.com'].sort_values(ascending=False)[0:5]
cnbc_top5

topic
topic_15    0.274012
topic_6     0.235052
topic_5     0.121920
topic_3     0.084037
topic_4     0.056749
Name: www.cnbc.com, dtype: float64

In [44]:
print(f"""
Top 5 topics in CNBC:

COVID19-ECONOMY, {topics15[14]}
COVID19-STATUS, {topics15[5]}
EDUCATION, {topics15[4]}
HOLIDAYS, {topics15[2]}
COVID19-POLICY, {topics15[3]}
""")


Top 5 topics in CNBC:

COVID19-ECONOMY, Topic 15: coronavirus pandemic relief federal million business program covid19 house democrats
COVID19-STATUS, Topic 6: coronavirus case covid19 pandemic state report number people million week
EDUCATION, Topic 5: school student education child family survey teacher year jones america
HOLIDAYS, Topic 3: year woman day american black life honor national world country
COVID19-POLICY, Topic 4: health covid19 care coronavirus vaccine test pandemic public find program



Getting the most distinctive topics of each media outlet. Here I measure distinctiveness as the difference between a topic's rank in the topic distribution for CNBC and its rank in the topic distribution for Forbes. Thus, if a given topic ranks high for CNBC but low for Forbes, that topic is distinctive for CNBC, and vice versa.

In [47]:
# Creating the topic ranking for CNBC
cnbc = mean_topic_transposed.copy().sort_values('www.cnbc.com', ascending=False)
cnbc['rank'] = range(1,16)
cnbc.drop(columns=['www.forbes.com'], inplace=True)

# Creating the topic ranking for Forbes
forbes = mean_topic_transposed.copy().sort_values('www.forbes.com', ascending=False)
forbes['rank'] = range(1,16)
forbes.drop(columns=['www.cnbc.com'], inplace=True)

# Subtracting the two rankings. Topics with positive values after subtraction are distinctive of Forbes, while topics with negative values are distinctive of CNBC.
most_distinctive = cnbc['rank'] - forbes['rank']

# printing the distinctive topics
print(f"""
forbes_distinctive: 
{most_distinctive.sort_values(ascending=False)[0:3]}

cnbc_distinctive: 
{most_distinctive.sort_values(ascending=True)[0:3]}
""")


forbes_distinctive: 
topic
topic_1     7
topic_13    7
topic_8     3
Name: rank, dtype: int64

cnbc_distinctive: 
topic
topic_5   -10
topic_6    -5
topic_4    -3
Name: rank, dtype: int64
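The rank differences can be cross-checked without pandas. The sketch below recomputes them from the mean topic weights copied from the `mean_topic` table above; the helper name `rank_topics` and the variable names are hypothetical, and ties in the rank difference (such as topic_1 and topic_13) may come out in a different order than in the pandas output.

```python
# Mean topic weights per outlet, copied from the mean_topic table above
# (topic_1 ... topic_15)
cnbc_means = [0.024364, 0.030233, 0.084037, 0.056749, 0.12192, 0.235052,
              0.003238, 0.012953, 0.01231, 0.044315, 0.023373, 0.01441,
              0.025795, 0.037239, 0.274012]
forbes_means = [0.104951, 0.079964, 0.09618, 0.067463, 0.018575, 0.074756,
                0.001618, 0.037131, 0.020674, 0.093666, 0.026612, 0.015997,
                0.10666, 0.065098, 0.190654]


def rank_topics(means):
    """Return {topic_name: rank}, where rank 1 is the largest mean weight."""
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    return {f"topic_{i + 1}": rank for rank, i in enumerate(order, start=1)}


cnbc_rank = rank_topics(cnbc_means)
forbes_rank = rank_topics(forbes_means)

# Positive values: ranked higher by Forbes; negative values: ranked higher by CNBC
rank_diff = {topic: cnbc_rank[topic] - forbes_rank[topic] for topic in cnbc_rank}

forbes_distinctive = sorted(rank_diff, key=rank_diff.get, reverse=True)[:3]
cnbc_distinctive = sorted(rank_diff, key=rank_diff.get)[:3]

print("Forbes:", {t: rank_diff[t] for t in forbes_distinctive})
print("CNBC:", {t: rank_diff[t] for t in cnbc_distinctive})
```

The values agree with the output above: topic_1 and topic_13 (both +7) and topic_8 (+3) for Forbes, and topic_5 (-10), topic_6 (-5) and topic_4 (-3) for CNBC.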



## 4. Interpretation and discussion <a class="anchor" id="fourth-section"></a>

It is not surprising that the most prominent topic in the tweets referencing both news media is COVID19-ECONOMY as both news outlets cover themes around business, financial markets and the economy. Moreover, the topic HOLIDAYS also appears in the top 5 topics of both news media. This may be due to the increased business activity surrounding holidays and the high importance of these days for companies and the markets.

What we can see from comparing the top 5 topics of each outlet is that tweets referencing Forbes appear to be more politically oriented, due to the presence of the topics POSTAL VOTING and 2020 ELECTION (the third-ranked Forbes topic is the uninterpretable Topic 1). It may seem strange that all three COVID19-related topics appear in the top 5 for CNBC; however, it is important to remember that the COVID-19 pandemic plays a major role in business activity and the economy. Thus, we can say that while tweets referencing Forbes are more politically oriented and thematically diverse, tweets referencing CNBC are much more centered around the network’s core themes of business, financial markets and the economy.

If we shift our focus from the most prominent topics to the most distinctive topics (defined by the difference in ranks), we see similar patterns. The top 3 most distinctive topics for Forbes are POSTAL VOTING, NEWS and the uninterpretable first topic, while CNBC’s most distinctive topics are EDUCATION, COVID19-STATUS and COVID19-POLICY. This further reinforces the finding that tweets referencing Forbes are more thematically diverse.
It is unclear whether these contrasts reflect differences in the topics covered by the two networks or differences in the framing of tweets referencing the two networks. Further research is thus warranted.
