# Question 1 
#### Author: Michal Kubina

#### 0. Imports

This block collects all my imports in one place to keep the code tidy.

In [96]:
import pandas as pd
import spacy
import re
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet
from gensim.models import CoherenceModel
import plotly.graph_objects as go
import plotly.io as pio
import gensim
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import datetime 
from plotly.offline import init_notebook_mode, iplot, plot
import pickle
pio.renderers.default = "notebook_connected"
init_notebook_mode(connected=True)

I am also providing additional visualisations alongside this notebook. Please download them too.

#### 1. Exploration and motivation

In this part I explore the data set and describe my motivation. I add a comment to each line of code to describe my steps. I also like to keep my code clean, so I use a lot of functions. My reasoning might sometimes be long, but I very much enjoyed this task, which turned into a question of politics.

In [2]:
def get_percentage_user_site(party, netloc, data):
    """
    Returns a string describing what percentage of the tweets linking to a given netloc (site) come from a given party.

    Parameters
    ----------
    party : string
        party code as used in the party column, e.g. "R" or "D"
    netloc : string
        network location of the linked URL, i.e. the site
    data : pandas dataframe
        whole framing dataframe
    
    Returns
    -------
    string
        a string with percentage info
    """
    
    percentage = len(data.netloc[(data.party == party) & (data.netloc == netloc)])/len(data.netloc[data.netloc == netloc])*100
    return netloc + " is used in tweets by "+ str(round(percentage)) + " percents of users which should support " + party + " party."

data = pd.read_pickle('framing.p') #function to read pickle file by pandas
display(data.head(2)) #displaying part of pandas dataframe to see the structure of the data set --> more visually pleasant than printing it
print(data.netloc.value_counts()[data.netloc.value_counts().values > 200]) #show just sites with more than two hundred entries

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,
1,1294021087118987264,2020-08-13 21:20:43,SenShelby,R,Alabama,Senator,Great news! Today @mazda_toyota announced an a...,,https://pressroom.toyota.com/mazda-and-toyota-...,pressroom.toyota.com,Mazda and Toyota Further Commitment to U.S. Ma...,"HUNTSVILLE, Ala., (Aug. 13, 2020) – Today, Maz...",


www.nytimes.com       1508
                      1092
www.politico.com       623
www.cnn.com            602
thehill.com            550
www.foxnews.com        549
www.nbcnews.com        485
nyti.ms                422
www.cnbc.com           342
cnn.it                 341
secure.actblue.com     292
apnews.com             266
Name: netloc, dtype: int64


In [52]:
data['date'] =  pd.to_datetime(data['date'], format='%Y-%m-%d') #converting date to datetime to get min and max
print("First date of dataframe: ")
print(data.date.min())
print("Last date of dataframe")
print(data.date.max())

First date of dataframe: 
2020-08-13 00:00:00
Last date of dataframe
2020-11-14 13:46:00


This data set consists of tweets by US political representatives, posted between mid-August and mid-November 2020, i.e. before and shortly after the election, together with each representative's party. For each tweet we also have the site it links to, the title and description of the linked article, and a few other columns.

I will compare two news outlets, www.foxnews.com and www.cnn.com. First, I wanted outlets with a sufficient number of linked articles, and with roughly the same number for both sites, so that the data are rich and balanced (602 entries for www.cnn.com and 549 for www.foxnews.com). Second, I wanted an interesting pair of sites to analyse. In the United States some news media are known to lean towards one party, mainly the Republican or the Democratic party, and supporters of a party tend to get their content from the outlets that favour it. Consequently, the topics found on www.foxnews.com (shared mostly by Republicans) and on www.cnn.com (shared mostly by Democrats) should differ. My motivation is to look at the topics and see whether the data support this. Polls back up this reasoning: https://today.yougov.com/topics/media/articles-reports/2020/06/18/trust-news-republican-democrat-poll and https://www.pewresearch.org/journalism/2020/01/24/democrats-report-much-higher-levels-of-trust-in-a-number-of-news-sources-than-republicans/.

A first glance, below, suggests that the idea might hold: 94 percent of the tweets linking to www.cnn.com come from Democratic representatives, while 97 percent of the tweets linking to www.foxnews.com come from Republican representatives. The topic analysis should be interesting.

In [3]:
print(get_percentage_user_site("R", "www.cnn.com", data)) #call function defined earlier 
print(get_percentage_user_site("D", "www.cnn.com", data)) #same with different parameters
print(get_percentage_user_site("R", "www.foxnews.com", data)) #same with different parameters
print(get_percentage_user_site("D", "www.foxnews.com", data)) #same with different parameters

www.cnn.com is used in tweets by 6 percents of users which should support R party.
www.cnn.com is used in tweets by 94 percents of users which should support D party.
www.foxnews.com is used in tweets by 97 percents of users which should support R party.
www.foxnews.com is used in tweets by 3 percents of users which should support D party.
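
Note that these percentages are shares of tweets, not of distinct users: for a given site, the function divides the number of tweets from one party by all tweets linking to that site. A minimal pure-Python sketch of the same calculation follows. It runs on hypothetical toy rows rather than the real `framing.p` data, and the helper name `party_share` and its guard for unseen sites are assumptions added for illustration.

```python
def party_share(rows, party, netloc):
    """Percentage of tweets linking to `netloc` that come from `party`."""
    site_rows = [r for r in rows if r["netloc"] == netloc]
    if not site_rows:  # avoid a ZeroDivisionError for sites that never appear
        return 0.0
    party_rows = [r for r in site_rows if r["party"] == party]
    return 100 * len(party_rows) / len(site_rows)

# Hypothetical toy rows, not taken from the real data set
toy_rows = [
    {"party": "D", "netloc": "www.cnn.com"},
    {"party": "D", "netloc": "www.cnn.com"},
    {"party": "D", "netloc": "www.cnn.com"},
    {"party": "R", "netloc": "www.cnn.com"},
    {"party": "R", "netloc": "www.foxnews.com"},
]

print(party_share(toy_rows, "D", "www.cnn.com"))  # 75.0
```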


#### 2. Preprocessing

In this section I preprocess the article descriptions.

In [4]:
nlp = spacy.load("en_core_web_sm") #loading the spacy model
processed_texts = [text for text in nlp.pipe(data.description, 
                                              disable=["ner",
                                                       "parser"])] #pipeline which processes text
tokenized_text = [[word.lemma_.lower() for word in processed_text if not word.is_stop] for processed_text in processed_texts] #lemmatise and lowercase all words that are not stop words
tokenized_text = [[re.sub(r'\W+', '', word) for word in text] for text in tokenized_text] #strip non-word characters such as punctuation from each token

Piece of code to get the average number of words in the descriptions after lemmatisation:

In [5]:
sum_t = 0 #setting sum to zero
for token in tokenized_text: #for loop to go through all the descriptions
    sum_t += len(token) #sum up number of words in descriptions
print("Average number of words in description after lemmatisation, lowercasing and removing stopwords: ")    
print(sum_t/(len(tokenized_text)))

Average number of words in description after lemmatisation, lowercasing and removing stopwords: 
19.095871716137836


For preprocessing I decided to lemmatise the words and lowercase them. Lemmatisation should be beneficial because I expect many inflected variants of the same words, given the nature of the data set: it covers political and voting topics, which recur across the data because only a limited number of political topics are discussed among (American) citizens at any one time. Someone might call it populism. Beyond that, I only filter out stop words and keep all other words. I do not think it is beneficial to focus on a single part of speech such as adjectives, since the descriptions are quite formal and consist mostly of nouns and some verbs. The descriptions are also short, as can be seen above: after lemmatisation, lowercasing and stop-word removal, the average description has only about 19 words. Keeping all word types lets the model pick up on small differences.
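
To make the cleaning steps concrete, here is a rough sketch without spaCy. It lowercases, removes stop words and strips non-word characters with the same regular expression as the cell above. The stop-word set and the helper name `simple_tokens` are hypothetical, and lemmatisation is left out, so "votes" stays "votes". The sketch also drops tokens that become empty after stripping, which is an addition of mine: a token made only of punctuation can turn into an empty string and would otherwise still be counted as a word.

```python
import re

# Hypothetical, tiny stop-word list; spaCy's own list is much larger
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "for"}

def simple_tokens(text):
    """Lowercase, drop stop words, strip non-word characters, drop empties."""
    tokens = []
    for raw in text.lower().split():
        if raw in STOP_WORDS:
            continue
        cleaned = re.sub(r"\W+", "", raw)  # same pattern as in the cell above
        if cleaned:  # stripping can leave an empty string, e.g. for a lone dash
            tokens.append(cleaned)
    return tokens

print(simple_tokens("The Senate votes on COVID-19 relief – a bill for workers."))
# ['senate', 'votes', 'covid19', 'relief', 'bill', 'workers']
```

This also shows where a token such as `covid19` in the topics comes from: the hyphen in "COVID-19" is removed by the regular expression.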

Here I create the dictionary and the final corpus, filtering out words that occur in fewer than 3 documents and words that occur in more than 85 percent of the documents:

In [6]:
MIN_DF = 3 # minimum document frequency -> keep words which occur in at least 3 documents
MAX_DF = 0.85 # maximum document frequency -> keep words which occur in at most 85 percent of the documents

dictionary = Dictionary(tokenized_text) # get the vocabulary
dictionary.filter_extremes(no_below=MIN_DF, 
                           no_above=MAX_DF) #filtering out of the extreme values 
corpus = [dictionary.doc2bow(text) for text in tokenized_text] #creating corpus
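
The idea behind the two thresholds can be sketched in plain Python. This is only an illustration of document-frequency filtering on hypothetical toy documents, not gensim's implementation (which, for instance, also caps the vocabulary size by default); the function name `filter_by_document_frequency` is an assumption.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=3, no_above=0.85):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]

# Hypothetical toy documents
toy_docs = [
    ["trump", "election", "vote"],
    ["trump", "court", "vote"],
    ["trump", "pandemic", "vote"],
    ["trump", "pandemic"],
]
print(filter_by_document_frequency(toy_docs, no_below=3, no_above=0.85))
# [['vote'], ['vote'], ['vote'], []]
```

In the toy example "trump" is dropped because it occurs in every document, and "election", "court" and "pandemic" are dropped because they occur in fewer than three.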

#### 3. Model

In this section I first define some helper functions, then train models with different numbers of topics, and finally choose the best one.

In [113]:
def get_lda(topics, corpus, dictionary):
    """
    Returns the lda model.

    Parameters
    ----------
    topics : int
        number of topics to be found
    corpus : list
        corpus
    dictionary : Dictionary
        dictionary of tokenized text
    
    Returns
    -------
    LdaMallet instance
        trained model
    """
    PATH_TO_MALLET = '/Users/michalkubina/Downloads/mallet-2.0.8/bin/mallet' #my path to mallet 
    N_TOPICS = topics # set number of topics
    N_ITERATIONS = 1000 # parameter to set number of iterations

    lda = LdaMallet(PATH_TO_MALLET,
                    corpus=corpus,
                    id2word=dictionary,
                    num_topics=N_TOPICS,
                    optimize_interval=10,
                    iterations=N_ITERATIONS) #train the model
    return lda #return the model

def analyse_topics(N_TOPICS, lda):
    """
    Prints topics

    Parameters
    ----------
    N_TOPICS : int
        number of topics to be found
    lda : instance of LdaMallet
        trained model
        
    Returns
    -------
    """
    for topic in range(N_TOPICS): #for loop to go over all topics
        words = lda.show_topic(topic, 10) #get the words
        topic_n_words = ' '.join([word[0] for word in words]) 
        print('Topic {}: {}'.format(str(topic), topic_n_words)) #printing the topic
        
def visualize_coherence(score, labels):
    """
    Returns the figure where coherence across different number of topics is visualised.

    Parameters
    ----------
    score : list
        list of float coherence scores for the y axis
    labels : list
        list of strings with the number of topics for the x axis
    
    Returns
    -------
    plotly figure
    """
    fig = go.Figure(data=[
    go.Scatter(name='cnn', x=labels, y=score, mode='markers')]) #get a figure
    fig.update_layout(
    title="Coherence score plot", xaxis_title="Topic",
    yaxis_title="Coherence score") #update layout
    return fig #return fig
    
def get_coherence(lda, tokenized_text, dictionary):
    """
    Returns the coherence score of the model.
    Resource: https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

    Parameters
    ----------
    lda : instance of LdaMallet
        trained model
    tokenized_text : list
        list of lists of tokenized text
    dictionary : Dictionary
        dictionary of tokenized text
    
    Returns
    -------
    float
        coherence value
    """ 
    coherence_model = CoherenceModel(model=lda, texts=tokenized_text, dictionary=dictionary, coherence='c_v') #get coherence model with type of c_v coherence
    return coherence_model.get_coherence() #get coherence and return it 

def visualize_topics(topics, grouped):
    """
    Returns a grouped bar chart comparing the mean topic weights of cnn and fox news.

    Parameters
    ----------
    topics : list
        list of topic labels to be on x axis
    grouped : pandas dataframe
        mean topic weights grouped by netloc
    
    Returns
    -------
    plotly figure
    """
    fig = go.Figure(data=[
    go.Bar(name='cnn', x=topics, y=grouped[grouped.index == "www.cnn.com"].values[0]),
    go.Bar(name='fox news', x=topics, y=grouped[grouped.index == "www.foxnews.com"].values[0])
    ]) #set up a figure
    fig.update_layout(barmode='group') #update layout to groups
    return fig #return fig

def analyse_topic_py(lda, corpus, dictionary):
    """
    Returns an interactive pyLDAvis visualisation of the topics.

    Parameters
    ----------
    lda : instance of LdaMallet
        trained model
    corpus : list
        corpus
    dictionary : Dictionary
        dictionary of tokenized text
    
    Returns
    -------
    pyLDAvis prepared data, rendered as a visualisation in the notebook
    """
    pyLDAvis.enable_notebook() #enable notebook for pyLDAvis
    lda_conv = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda) #get a model
    return gensimvis.prepare(lda_conv, corpus, dictionary) #return visualisation 

Here I train models with different numbers of topics. For each one I store the coherence score and the model in a list, and print the topics. I am very sorry for the long output, which can be skipped by [clicking here.](#another_cell)

In [8]:
score = [] #list to store the score
labels = [] #list to store the label
models = [] #list to store the model
for k in range(2,30): #for loop to go from two topics to 29
    lda = get_lda(k, corpus, dictionary) #train the model
    analyse_topics(k, lda) #print the topics found
    print("------")
    score.append(get_coherence(lda, tokenized_text, dictionary)) #add coherence score to the list
    labels.append(str(k)) #add number of topics to the list
    models.append(lda) #add model to the list

Topic 0: covid19 state health pandemic coronavirus year million federal act people
Topic 1: trump president house election court vote senate biden donald official
------
Topic 0: coronavirus covid19 health pandemic state million federal act care program
Topic 1: year national police people day city woman state join rep
Topic 2: president trump election house court senate biden donald vote supreme
------
Topic 0: service election mail vote postal ballot day voter news year
Topic 1: trump president house court biden senate election donald supreme democrats
Topic 2: act rep bill house national legislation congress state join year
Topic 3: covid19 coronavirus pandemic health million state people federal county year
------
Topic 0: trump president house biden election senate donald joe official white
Topic 1: coronavirus pandemic covid19 million people trump state health case president
Topic 2: service bill house postal member act congress legislation mail change
Topic 3: court vote supreme

In [71]:
with open('models.pkl', 'wb') as f: #saving for my purposes
    pickle.dump(models, f) #saving for my purposes 
with open('scores.pkl', 'wb') as f: 
    pickle.dump(score, f)
with open('topics.pkl', 'wb') as f:
    pickle.dump(labels, f)
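
Since training takes a while, the saved objects can be loaded back later instead of retraining. The following sketch shows a pickle round trip. It uses hypothetical scores and a temporary directory so that it does not touch the real `scores.pkl` or `topics.pkl` files.

```python
import os
import pickle
import tempfile

scores = [0.41, 0.44, 0.47]   # hypothetical coherence values
labels = ["2", "3", "4"]      # number of topics as strings, as in the loop above

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.pkl")
    with open(path, "wb") as f:   # write in binary mode
        pickle.dump({"scores": scores, "labels": labels}, f)
    with open(path, "rb") as f:   # read back in binary mode
        restored = pickle.load(f)

print(restored["labels"], restored["scores"])
```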

<a id='another_cell'></a>

Here I visualise coherence, a metric that can be used to measure how well a topic model performs. The higher the coherence, the better. Resource: https://stats.stackexchange.com/questions/375062/how-does-topic-coherence-score-in-lda-intuitively-makes-sense However, I still need to look at the topics myself, as a more subjective judgement can be more valuable.

I plotted coherence against the number of topics, but I was afraid the interactive plot would not render. Thus, I saved it as an image and included it in the markdown cell below.

In [121]:
#plot(visualize_coherence(score, labels)) #visualising coherence


'temp-plot.html'

<img src="newplot.png" width="800" height="400">
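
Besides reading the plot, the candidates can be shortlisted directly from the two lists. This is a small sketch with hypothetical coherence values; the helper name `best_topic_counts` is an assumption, and the final choice should still rest on inspecting the topics themselves.

```python
def best_topic_counts(scores, labels, top_n=3):
    """Return the `top_n` (label, score) pairs with the highest coherence."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical coherence values; the real ones come from the training loop
toy_labels = ["2", "3", "4", "5"]
toy_scores = [0.40, 0.45, 0.43, 0.44]

for label, value in best_topic_counts(toy_scores, toy_labels):
    print(f"{label} topics: coherence {value:.2f}")
```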

Here I used the pyLDAvis visualisation to inspect the topics of individual models; I did not include all of them in this notebook.

In [22]:
index = 18 #<- set index of a model in the list
analyse_topic_py(models[index], corpus, dictionary) #show visualisation



Looking at the word distributions of the topics across the models, I believe the most interpretable model is the one with twenty topics. Its coherence score is also among the highest. With this number of topics the model seems to capture the main themes, such as healthcare, the coronavirus pandemic and voting by mail. Below I have labelled the topics after inspecting the visualisation above. One topic focuses on the death of George Floyd, which was a big theme in the news media. Healthcare, the pandemic and jobs are also discussed, and the California fires and issues around Israel are visible in the topics as well.

Topic 0: act bill legislation year bipartisan taxpayer congress hr federal agency -> taxes and legislation

Topic 1: police officer city death shoot black man department protester protest -> death of George Floyd

Topic 2: news woman year black day world story live talk american -> black lives

Topic 3: pandemic million coronavirus unemployment job benefit week worker americans accord -> jobs during corona

Topic 4: 2020 win election senate trump race fight president republican critical -> trump for the president 

Topic 5: court supreme justice barrett amy coney senate judge nominee trump ->  supreme court nomination

Topic 6: county city state fire day 2020 california community area census -> california fires

Topic 7: school student support year education family community university pandemic provide -> education

Topic 8: today american year honor family war join face veteran contribute -> war and veterans

Topic 9: biden president joe trump official presidential vice elect house democratic -> elections

Topic 10: committee house china foreign policy force member rep castro chair -> china 

Topic 11: trump president donald administration tax house coronavirus information decade obtain -> trump in general

Topic 12: department trump security homeland border group united president states government -> security

Topic 13: house bill senate democrats leader relief coronavirus republicans pelosi speaker -> Nancy Pelosi

Topic 14: sen senator rep president state united israel trump announce organization -> israel action

Topic 15: election mail service postal ballot vote voter voting general president -> mail election 

Topic 16: federal program million grant state department order receive funding announce -> government support 

Topic 17: health care public covid19 drug act vaccine community flu pandemic -> healthcare

Topic 18: climate change national energy america environmental world water gas oil -> climate change

Topic 19: covid19 coronavirus case state people pandemic test health report virus -> pandemic

Store the labels in the list:

In [53]:
my_labels = ["taxes", "george floyd", "black lives", "jobs during corona", "trump for president", "supreme court (barrett)", "ca fires", "education", "war and veterans", "elections", "china", "about trump", "security", "nancy pelosi", "israel problem", "mail ballots", "money support", "healthcare","climate change", "pandemic"]

#### 4. Comparing cnn and fox news

In this part of the assignment I will compare two news media foxnews and cnn based on distribution of topics:

In [39]:
num = 20 #set number of topics to be assessed
index = num - 2 #get index of the model in list
lda = models[index] #get the model
transformed_docs = lda.load_document_topics() #get the topic distribution of each document
topic_distributions = pd.DataFrame([[x[1] for x in doc] for doc in transformed_docs], 
             columns=['topic_{}'.format(i) for i in range(num)]) #get dataframe of topic distributions
display(topic_distributions.head(2)) #display the first two rows of the dataframe
joined_topic_dist = data.reset_index().join(topic_distributions) #join the topic distributions to the original data

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,0.000599,0.001757,0.002543,0.002931,0.002554,0.001723,0.39985,0.00269,0.002165,0.002915,0.001684,0.002514,0.002213,0.002798,0.318598,0.002962,0.003069,0.002521,0.240631,0.003282
1,0.000225,0.00066,0.000955,0.001101,0.000959,0.000647,0.150201,0.090204,0.000813,0.001095,0.000633,0.000944,0.000831,0.001051,0.000755,0.001113,0.268733,0.000947,0.476898,0.001233
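
The cell above takes the second element of each `(topic_id, weight)` pair, which relies on every document listing all topics in order. That holds for the output shown here. As a hedged sketch of a more defensive variant, the following helper (a hypothetical name, run on made-up toy pairs) places each weight by its topic id and fills missing topics with zero.

```python
def to_dense_rows(doc_topics, num_topics):
    """Turn per-document lists of (topic_id, weight) pairs into dense rows."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics  # topics missing from a document get weight 0
        for topic_id, weight in pairs:
            row[topic_id] = weight
        rows.append(row)
    return rows

# Hypothetical output for two documents and three topics
toy_doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(2, 0.9), (0, 0.1)],  # topic 1 missing, pairs out of order
]
print(to_dense_rows(toy_doc_topics, num_topics=3))
# [[0.7, 0.2, 0.1], [0.1, 0.0, 0.9]]
```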


In [40]:
grouped = joined_topic_dist.groupby('netloc').mean() #groupby netloc mean
grouped = grouped.loc[(grouped.index== "www.cnn.com") | (grouped.index== "www.foxnews.com") , ] #get just foxnews and cnn
grouped = grouped.iloc[:,2:] #do not include unnecessary columns
display(grouped) #display grouped dataframe

Unnamed: 0_level_0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
netloc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
www.cnn.com,0.008378,0.07072,0.019333,0.052169,0.019623,0.058532,0.043,0.020941,0.022513,0.092492,0.022237,0.105792,0.086881,0.057027,0.016024,0.072704,0.028321,0.037672,0.03914,0.126502
www.foxnews.com,0.002686,0.091371,0.025333,0.010433,0.052681,0.087723,0.026412,0.017451,0.02578,0.129346,0.02613,0.115462,0.057834,0.079249,0.082533,0.042521,0.020597,0.0497,0.018706,0.038053


Here I created another visualisation in plotly to see differences across topics. To be on the safe side I saved it and included the png as a picture in markdown.

In [123]:
#plot(visualize_topics(my_labels, grouped)) #visualize topics difference

'temp-plot.html'

<img src="newplot-2.png" width="800" height="400">
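
To read the differences off as numbers rather than from the bars, the mean topic weights of the two sites can be subtracted and ranked. The sketch below copies a subset of the values from the grouped table above; the choice of topics and the variable names are mine, for illustration only.

```python
# Mean topic weights copied from the grouped table above (subset of topics)
cnn = {"pandemic": 0.126502, "jobs during corona": 0.052169,
       "mail ballots": 0.072704, "elections": 0.092492,
       "israel problem": 0.016024}
fox = {"pandemic": 0.038053, "jobs during corona": 0.010433,
       "mail ballots": 0.042521, "elections": 0.129346,
       "israel problem": 0.082533}

# Positive difference: more prominent on cnn; negative: more prominent on fox news
differences = {label: cnn[label] - fox[label] for label in cnn}
ranked = sorted(differences.items(), key=lambda item: item[1], reverse=True)

for label, diff in ranked:
    print(f"{label:20s} {diff:+.3f}")
```

For this subset, pandemic ends up at the top (most cnn-leaning) and israel problem at the bottom (most fox-news-leaning).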

#### 5. Summary of comparison (interpretation and discussion)

After a thorough look, I labelled my topics with as much meaning as I could get out of them. To restate my hypothesis: since www.cnn.com is considered a Democratic-leaning outlet and www.foxnews.com a Republican-leaning one, the topics should reflect this. Based on the graph above, and on background knowledge of US politics (George Floyd, Israel, or individuals such as Nancy Pelosi and Amy Coney Barrett), it seems that I am on the right path. As a reminder, the data come from 2020, from the period before and a little after the US election.

To start, we can see that the biggest topics discussed in the news media were the pandemic, the elections, Trump and George Floyd.

(pandemic, jobs during corona, money support, china) <- covid related topics

The biggest difference is in the label pandemic, a topic about the coronavirus that appears mainly on cnn. This makes sense, since before the election it was a very common topic that Democrats used in their campaign against Republicans. The topic jobs during corona supports this too, as it is far more prominent on cnn. This was a huge problem for Republicans, so it was discussed less on fox news and more on cnn. The same holds, to a lesser degree, for money support. China, on the other hand, is mentioned slightly more on fox news, perhaps because Republicans wanted to blame the country for the coronavirus.

(George Floyd, black lives)

Surprisingly, George Floyd and Black Lives Matter were still a big theme in the news media. Even more surprisingly, they appear more on fox news. Maybe this was a power move, a "hot" topic that Republicans used to get minorities to vote for them.

(taxes, climate change, California fires)

These topics are discussed more on cnn than on fox news. This is quite logical, since many Republicans downplay or deny climate change. The same goes for taxes and the California fires.

(elections, about trump, mail ballots, trump for president)

Looking at the topics regarding the elections, we find that this was the main theme in the news media before the election. Moreover, mail ballots were heavily criticised by Republicans and preferred by Democrats (cnn), which we can see in the data too.

(supreme court (barrett), nancy pelosi, Israel problem)

https://eu.usatoday.com/in-depth/news/politics/2021/04/09/bracing-battle-inside-nancy-pelosis-war-donald-trump/7053833002/
https://spectator.clingendael.org/en/publication/trump-most-pro-israel-president-american-history
https://www.newyorker.com/humor/borowitz-report/trump-attempts-to-fire-amy-coney-barrett

These three topics are discussed more heavily on fox news than on cnn. I reason that these were topics that Trump used in his election campaign (the articles linked above give some background; note that the New Yorker piece comes from the satirical Borowitz Report). He has been described as the most pro-Israel and most anti-Palestinian American president. Moreover, he clashed repeatedly with Nancy Pelosi, and Amy Coney Barrett was his nominee for the Supreme Court. And Republicans followed him.

To sum up, I feel the topics were surprisingly interpretable, but domain knowledge is a must. The model could still be improved by tuning its parameters or by a more thorough analysis. I believe I can confirm my hypothesis about the division of topics between cnn (Democrats) and fox news (Republicans). Moreover, the main topics discussed before the election were the election itself, the coronavirus, Black Lives Matter, and topics around Trump such as the Amy Coney Barrett nomination and Nancy Pelosi.