# Political Sentiment Analyzer - Data Wrangling Notebook

## Written by: Michael Trent

This notebook imports and cleans data from ~7,800 news articles obtained from a 2020 article titled "Analyzing Political Bias and Unfairness in News Articles at Different Levels of Granularity".

In [1]:
import pandas as pd
import numpy as np
import tarfile
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
articles = pd.read_json('released_data.json', lines = True)

In [3]:
articles.shape

(7775, 8)

In [4]:
articles.head()

Unnamed: 0,source,title,event_id,adfontes_fair,adfontes_political,allsides_bias,content,misc
0,Fox News,"Trump blasts Howard Schultz, says ex-Starbucks...",0,bias,bias,From the Right,Obama administration alum Roger Fisk and Repub...,"{'time': '2019-01-28 16:10:44.680484', 'topics..."
1,USA TODAY,Trump blasts former Starbucks CEO Howard Schul...,0,bias,neutral,From the Center,WASHINGTON – President Donald Trump took a swi...,"{'time': 'None', 'topics': 'Election: Presiden..."
2,Washington Times,Mick Mulvaney: Trump to secure border 'with or...,0,bias,neutral,From the Right,Acting White House chief of staff Mick Mulvane...,"{'time': 'None', 'topics': 'White House', 'aut..."
3,Washington Times,Trump says 'we'll do the emergency' if border ...,0,bias,neutral,From the Right,President Trump repeated his vow Friday to dec...,"{'time': 'None', 'topics': 'White House, Polit..."
4,BBC News,Trump backs down to end painful shutdown tempo...,0,bias,neutral,From the Center,President Donald Trump has yielded to politica...,"{'time': '2019-01-26 00:00:00', 'topics': 'Whi..."


In [5]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7775 entries, 0 to 7774
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   source              7775 non-null   object
 1   title               7775 non-null   object
 2   event_id            7775 non-null   int64 
 3   adfontes_fair       7775 non-null   object
 4   adfontes_political  7775 non-null   object
 5   allsides_bias       7775 non-null   object
 6   content             7775 non-null   object
 7   misc                7775 non-null   object
dtypes: int64(1), object(7)
memory usage: 486.1+ KB


The first step is to tokenize the content of each article. At this time we will also split each article into sentences, remove stop words, and stem the remaining tokens.

In [6]:
articles['Tokenized'] = articles.content.apply(word_tokenize)

In [7]:
articles['sent_tokenized'] = articles.content.apply(sent_tokenize)

In [8]:
stop_words = set(stopwords.words('english'))

In [9]:
articles.insert(articles.shape[1], 'stemmed', np.nan)

In [10]:
porter = PorterStemmer()

def stem_tokens(tokens):
    # Keep alphabetic, non-stop-word tokens, then lowercase and stem them.
    # The stop-word check runs before lowercasing, so capitalized stop words (e.g. "She") are kept.
    return [porter.stem(w.strip().lower()) for w in tokens if w.isalpha() and w not in stop_words]

# Assign the whole column at once; chained indexing (articles.stemmed[row] = ...) raises SettingWithCopyWarning.
articles['stemmed'] = articles.Tokenized.apply(stem_tokens)


In [11]:
articles.head()

Unnamed: 0,source,title,event_id,adfontes_fair,adfontes_political,allsides_bias,content,misc,Tokenized,sent_tokenized,stemmed
0,Fox News,"Trump blasts Howard Schultz, says ex-Starbucks...",0,bias,bias,From the Right,Obama administration alum Roger Fisk and Repub...,"{'time': '2019-01-28 16:10:44.680484', 'topics...","[Obama, administration, alum, Roger, Fisk, and...",[Obama administration alum Roger Fisk and Repu...,"[obama, administr, alum, roger, fisk, republic..."
1,USA TODAY,Trump blasts former Starbucks CEO Howard Schul...,0,bias,neutral,From the Center,WASHINGTON – President Donald Trump took a swi...,"{'time': 'None', 'topics': 'Election: Presiden...","[WASHINGTON, –, President, Donald, Trump, took...",[WASHINGTON – President Donald Trump took a sw...,"[washington, presid, donald, trump, took, swip..."
2,Washington Times,Mick Mulvaney: Trump to secure border 'with or...,0,bias,neutral,From the Right,Acting White House chief of staff Mick Mulvane...,"{'time': 'None', 'topics': 'White House', 'aut...","[Acting, White, House, chief, of, staff, Mick,...",[Acting White House chief of staff Mick Mulvan...,"[act, white, hous, chief, staff, mick, mulvane..."
3,Washington Times,Trump says 'we'll do the emergency' if border ...,0,bias,neutral,From the Right,President Trump repeated his vow Friday to dec...,"{'time': 'None', 'topics': 'White House, Polit...","[President, Trump, repeated, his, vow, Friday,...",[President Trump repeated his vow Friday to de...,"[presid, trump, repeat, vow, friday, declar, n..."
4,BBC News,Trump backs down to end painful shutdown tempo...,0,bias,neutral,From the Center,President Donald Trump has yielded to politica...,"{'time': '2019-01-26 00:00:00', 'topics': 'Whi...","[President, Donald, Trump, has, yielded, to, p...",[President Donald Trump has yielded to politic...,"[presid, donald, trump, yield, polit, pressur,..."
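One detail of the `stemmed` column is worth illustrating. NLTK's English stop word list is lowercase, and the filter tests each token before lowercasing it, so a capitalized stop word at the start of a sentence survives (row 7771 further down begins with `she`). The sketch below shows the difference on toy data. The `toy_stop_words` set and both helper functions are assumptions made for illustration, and stemming is left out to keep the example dependency-free.

```python
# Toy stand-in for NLTK's lowercase English stop word list (assumption for illustration).
toy_stop_words = {"she", "is", "a", "with"}

tokens = ["She", "is", "a", "gracious", "warrior", "with", "a", "kind", "face"]

def filter_case_sensitive(tokens, stop_words):
    # Mirrors the notebook's check: the token is compared before it is lowercased.
    return [w.lower() for w in tokens if w.isalpha() and w not in stop_words]

def filter_case_insensitive(tokens, stop_words):
    # Lowercase first, so capitalized stop words are removed as well.
    return [w.lower() for w in tokens if w.isalpha() and w.lower() not in stop_words]

sensitive = filter_case_sensitive(tokens, toy_stop_words)
insensitive = filter_case_insensitive(tokens, toy_stop_words)

print(sensitive)    # ['she', 'gracious', 'warrior', 'kind', 'face']
print(insensitive)  # ['gracious', 'warrior', 'kind', 'face']
```

Whether the case-insensitive variant is preferable depends on the downstream model; it is shown here only to make the behaviour of the current filter visible.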


In [12]:
articles.allsides_bias.value_counts()

From the Left      3684
From the Right     2851
From the Center    1240
Name: allsides_bias, dtype: int64

In [13]:
articles.adfontes_political.value_counts()

neutral    5523
bias       1441
unknown     811
Name: adfontes_political, dtype: int64

In [14]:
articles.insert(articles.shape[1], 'count_vector', np.nan)

In [15]:
def count_vectorize(stems):
    # Fit a fresh CountVectorizer per article. Each stem is treated as its own document,
    # so the result has one row per token and one column per distinct term.
    return [CountVectorizer().fit_transform(stems)]

# Assign the whole column at once to avoid the SettingWithCopyWarning raised by chained indexing.
articles['count_vector'] = articles.stemmed.apply(count_vectorize)


In [16]:
articles.count_vector[1]

[<239x147 sparse matrix of type '<class 'numpy.int64'>'
 	with 229 stored elements in Compressed Sparse Row format>]
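The 239x147 shape above follows from how `CountVectorizer` reads its input: each element of the list it receives is treated as one document, so a list of stemmed tokens yields one row per token (presumably 239 tokens and 147 distinct terms here) rather than one row per article. The pure-Python sketch below mimics that behaviour on a toy token list. `count_matrix` is a hypothetical helper, not part of scikit-learn, and it ignores details such as scikit-learn's default token pattern.

```python
from collections import Counter

def count_matrix(documents):
    """Build a dense document-term count matrix; each string in `documents` is one document."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({term for doc in tokenized for term in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

stems = ["trump", "blast", "schultz", "trump"]

# One "document" per token: four rows, each containing a single 1.
vocab, per_token = count_matrix(stems)

# One document for the whole article: a single row of term counts.
_, per_article = count_matrix([" ".join(stems)])

print(vocab)        # ['blast', 'schultz', 'trump']
print(per_token)    # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(per_article)  # [[1, 1, 2]]
```

If one row per article is the goal, joining the stems back into a single string before vectorizing is one way to get it.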

In [17]:
# Concatenate every article body into a single string.
full_text = " ".join(articles.content)

In [18]:
# Tokenize the whole corpus, drop non-alphabetic tokens and stop words, then lowercase and stem.
full_stemmed = [porter.stem(w.strip().lower()) for w in word_tokenize(full_text) if w.isalpha() and w not in stop_words]

In [19]:
full_stemmed[0:10]

['obama',
 'administr',
 'alum',
 'roger',
 'fisk',
 'republican',
 'strategist',
 'chri',
 'turner',
 'weigh']

In [20]:
full_vectorizer = CountVectorizer()
full_vectorizer.fit(full_stemmed)

CountVectorizer()

In [21]:
vocab = full_vectorizer.vocabulary_
corpus_matrix = full_vectorizer.transform(full_stemmed)
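The cells above fit a single vocabulary on the whole corpus. One thing a shared vocabulary makes possible is encoding every article against the same columns, so that rows are directly comparable. The sketch below shows that idea in pure Python under stated assumptions: `toy_articles` is made-up data standing in for rows of `articles.stemmed`, and `encode` is a hypothetical helper rather than anything from scikit-learn.

```python
from collections import Counter

# Hypothetical stemmed articles (stand-ins for rows of articles.stemmed).
toy_articles = [
    ["trump", "blast", "schultz"],
    ["trump", "secur", "border"],
    ["border", "shutdown", "border"],
]

# Corpus-level vocabulary: every distinct stem, mapped to a column index.
all_stems = sorted({stem for art in toy_articles for stem in art})
vocab = {term: idx for idx, term in enumerate(all_stems)}

def encode(stems, vocab):
    """Return one count row for an article, using the shared column order."""
    row = [0] * len(vocab)
    for term, n in Counter(stems).items():
        if term in vocab:  # stems not in the fitted vocabulary are ignored
            row[vocab[term]] = n
    return row

matrix = [encode(art, vocab) for art in toy_articles]

print(vocab)
print(matrix)  # [[1, 0, 1, 0, 0, 1], [0, 1, 0, 1, 0, 1], [0, 2, 0, 0, 1, 0]]
```

Each row has the same length as the vocabulary, which is the property a downstream model would typically rely on.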

In [22]:
articles.columns

Index(['source', 'title', 'event_id', 'adfontes_fair', 'adfontes_political',
       'allsides_bias', 'content', 'misc', 'Tokenized', 'sent_tokenized',
       'stemmed', 'count_vector'],
      dtype='object')

In [23]:
articles.Tokenized.head()

0    [Obama, administration, alum, Roger, Fisk, and...
1    [WASHINGTON, –, President, Donald, Trump, took...
2    [Acting, White, House, chief, of, staff, Mick,...
3    [President, Trump, repeated, his, vow, Friday,...
4    [President, Donald, Trump, has, yielded, to, p...
Name: Tokenized, dtype: object

Rather than using the full AllSides bias indicator ("From the Left", "From the Right", or "From the Center"), for simplicity's sake we'll keep only the final word: "left", "right", or "center". Lowercasing that word also ensures the capitalization of the bias column is consistent.

In [24]:
articles['bias'] = articles.allsides_bias.apply(word_tokenize)
articles['bias'] = articles.bias.apply(lambda x: x[-1].lower())
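The label simplification can be sketched without NLTK, since these labels only need a whitespace split. The `simplify_bias` helper and the sample labels below are assumptions for illustration, not part of the notebook.

```python
def simplify_bias(label):
    """Return the last word of an AllSides label in lowercase, e.g. 'From the Right' -> 'right'."""
    return label.split()[-1].lower()

labels = ["From the Left", "From the Right", "From the Center", "from the center"]
simplified = [simplify_bias(label) for label in labels]

print(simplified)  # ['left', 'right', 'center', 'center']
```

On the DataFrame itself, the same logic could likely be expressed with the pandas string accessor, for example `articles.allsides_bias.str.split().str[-1].str.lower()`.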

In [25]:
articles.head()

Unnamed: 0,source,title,event_id,adfontes_fair,adfontes_political,allsides_bias,content,misc,Tokenized,sent_tokenized,stemmed,count_vector,bias
0,Fox News,"Trump blasts Howard Schultz, says ex-Starbucks...",0,bias,bias,From the Right,Obama administration alum Roger Fisk and Repub...,"{'time': '2019-01-28 16:10:44.680484', 'topics...","[Obama, administration, alum, Roger, Fisk, and...",[Obama administration alum Roger Fisk and Repu...,"[obama, administr, alum, roger, fisk, republic...","[ (0, 158)\t1\n (1, 0)\t1\n (2, 5)\t1\n (3...",right
1,USA TODAY,Trump blasts former Starbucks CEO Howard Schul...,0,bias,neutral,From the Center,WASHINGTON – President Donald Trump took a swi...,"{'time': 'None', 'topics': 'Election: Presiden...","[WASHINGTON, –, President, Donald, Trump, took...",[WASHINGTON – President Donald Trump took a sw...,"[washington, presid, donald, trump, took, swip...","[ (0, 141)\t1\n (1, 95)\t1\n (2, 41)\t1\n ...",center
2,Washington Times,Mick Mulvaney: Trump to secure border 'with or...,0,bias,neutral,From the Right,Acting White House chief of staff Mick Mulvane...,"{'time': 'None', 'topics': 'White House', 'aut...","[Acting, White, House, chief, of, staff, Mick,...",[Acting White House chief of staff Mick Mulvan...,"[act, white, hous, chief, staff, mick, mulvane...","[ (0, 0)\t1\n (1, 129)\t1\n (2, 53)\t1\n (...",right
3,Washington Times,Trump says 'we'll do the emergency' if border ...,0,bias,neutral,From the Right,President Trump repeated his vow Friday to dec...,"{'time': 'None', 'topics': 'White House, Polit...","[President, Trump, repeated, his, vow, Friday,...",[President Trump repeated his vow Friday to de...,"[presid, trump, repeat, vow, friday, declar, n...","[ (0, 40)\t1\n (1, 63)\t1\n (2, 45)\t1\n (...",right
4,BBC News,Trump backs down to end painful shutdown tempo...,0,bias,neutral,From the Center,President Donald Trump has yielded to politica...,"{'time': '2019-01-26 00:00:00', 'topics': 'Whi...","[President, Donald, Trump, has, yielded, to, p...",[President Donald Trump has yielded to politic...,"[presid, donald, trump, yield, polit, pressur,...","[ (0, 266)\t1\n (1, 100)\t1\n (2, 372)\t1\n...",center


At this point I'm not sure whether I will need to eliminate an article when it is the only one in the corpus by its author (the `source` column, which mixes outlets and individual authors). It doesn't seem necessary, because the author is largely irrelevant and may in fact add artificial bias: the neural network could learn to associate an author with a political bias and factor that into the sentiment, which is exactly the kind of bias we are trying to avoid in the first place. At any rate, at some point we may want to know which sources have only one article in the corpus. The following cells do just that.

In [26]:
articles.source.value_counts()

CNN (Web News)                  1021
Fox News                        1002
New York Times - News            781
Washington Times                 657
HuffPost                         539
                                ... 
John Gable, AllSides Founder       1
Rich Lowry                         1
Peter Roff                         1
George Will                        1
Aaron Carroll                      1
Name: source, Length: 113, dtype: int64

In [27]:
vals = articles.source.value_counts()

In [28]:
vals = pd.DataFrame(vals)

In [29]:
single_sources = vals[vals.source == 1]

In [30]:
# Boolean mask: True for articles whose source appears more than once.
# (Named keep_mask to avoid shadowing the built-in slice.)
keep_mask = ~articles.source.isin(single_sources.index)

In [31]:
int(keep_mask.sum())

7735

In [32]:
articles[keep_mask]

Unnamed: 0,source,title,event_id,adfontes_fair,adfontes_political,allsides_bias,content,misc,Tokenized,sent_tokenized,stemmed,count_vector,bias
0,Fox News,"Trump blasts Howard Schultz, says ex-Starbucks...",0,bias,bias,From the Right,Obama administration alum Roger Fisk and Repub...,"{'time': '2019-01-28 16:10:44.680484', 'topics...","[Obama, administration, alum, Roger, Fisk, and...",[Obama administration alum Roger Fisk and Repu...,"[obama, administr, alum, roger, fisk, republic...","[ (0, 158)\t1\n (1, 0)\t1\n (2, 5)\t1\n (3...",right
1,USA TODAY,Trump blasts former Starbucks CEO Howard Schul...,0,bias,neutral,From the Center,WASHINGTON – President Donald Trump took a swi...,"{'time': 'None', 'topics': 'Election: Presiden...","[WASHINGTON, –, President, Donald, Trump, took...",[WASHINGTON – President Donald Trump took a sw...,"[washington, presid, donald, trump, took, swip...","[ (0, 141)\t1\n (1, 95)\t1\n (2, 41)\t1\n ...",center
2,Washington Times,Mick Mulvaney: Trump to secure border 'with or...,0,bias,neutral,From the Right,Acting White House chief of staff Mick Mulvane...,"{'time': 'None', 'topics': 'White House', 'aut...","[Acting, White, House, chief, of, staff, Mick,...",[Acting White House chief of staff Mick Mulvan...,"[act, white, hous, chief, staff, mick, mulvane...","[ (0, 0)\t1\n (1, 129)\t1\n (2, 53)\t1\n (...",right
3,Washington Times,Trump says 'we'll do the emergency' if border ...,0,bias,neutral,From the Right,President Trump repeated his vow Friday to dec...,"{'time': 'None', 'topics': 'White House, Polit...","[President, Trump, repeated, his, vow, Friday,...",[President Trump repeated his vow Friday to de...,"[presid, trump, repeat, vow, friday, declar, n...","[ (0, 40)\t1\n (1, 63)\t1\n (2, 45)\t1\n (...",right
4,BBC News,Trump backs down to end painful shutdown tempo...,0,bias,neutral,From the Center,President Donald Trump has yielded to politica...,"{'time': '2019-01-26 00:00:00', 'topics': 'Whi...","[President, Donald, Trump, has, yielded, to, p...",[President Donald Trump has yielded to politic...,"[presid, donald, trump, yield, polit, pressur,...","[ (0, 266)\t1\n (1, 100)\t1\n (2, 372)\t1\n...",center
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7770,Politico,Ann Romney's task: Humanize Mitt,0,bias,neutral,From the Left,"TAMPA, Fla. — Ann Romney will take to the podi...","{'time': '2012-08-28 04:59:14', 'topics': 'Ele...","[TAMPA, ,, Fla., —, Ann, Romney, will, take, t...","[TAMPA, Fla. — Ann Romney will take to the pod...","[tampa, ann, romney, take, podium, tuesday, ni...","[ (0, 170)\t1\n (1, 7)\t1\n (2, 141)\t1\n ...",left
7771,Washington Times,'Mittigator' to make case for Romney,0,bias,neutral,From the Right,"She is a gracious warrior with a kind face, a ...","{'time': 'None', 'topics': 'Election: Presiden...","[She, is, a, gracious, warrior, with, a, kind,...","[She is a gracious warrior with a kind face, a...","[she, graciou, warrior, kind, face, polish, de...","[ (0, 403)\t1\n (1, 188)\t1\n (2, 496)\t1\n...",right
7772,Fox News,Convention-bound Ryan slams Obama for presidin...,0,bias,bias,From the Right,Republican VP pick on 'Special Report'\nMaking...,"{'time': '2012-08-27 00:00:00', 'topics': 'Ele...","[Republican, VP, pick, on, 'Special, Report', ...",[Republican VP pick on 'Special Report'\nMakin...,"[republican, vp, pick, make, one, last, stop, ...","[ (0, 234)\t1\n (1, 308)\t1\n (2, 198)\t1\n...",right
7773,Politico,Ryan seeks comfort of Ron Paul fans,0,bias,neutral,From the Left,Paul Ryan said Monday he expects Ron Paul supp...,"{'time': '2012-08-27 15:52:45', 'topics': 'Ele...","[Paul, Ryan, said, Monday, he, expects, Ron, P...",[Paul Ryan said Monday he expects Ron Paul sup...,"[paul, ryan, said, monday, expect, ron, paul, ...","[ (0, 49)\t1\n (1, 58)\t1\n (2, 59)\t1\n (...",left
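The filtering above keeps 7,735 of the 7,775 articles, so 40 articles come from sources that appear only once. The same count-then-mask logic can be sketched on toy data in pure Python. The `toy_rows` pairs below are made up for illustration and only borrow a few source names from the output above.

```python
from collections import Counter

# Hypothetical (source, title) pairs standing in for the articles DataFrame.
toy_rows = [
    ("Fox News", "a"), ("CNN (Web News)", "b"), ("Fox News", "c"),
    ("Rich Lowry", "d"), ("CNN (Web News)", "e"), ("George Will", "f"),
]

# Count articles per source, then collect the sources that appear exactly once.
source_counts = Counter(source for source, _ in toy_rows)
single_sources = {s for s, n in source_counts.items() if n == 1}

# Boolean mask: True where the row's source has more than one article.
keep_mask = [source not in single_sources for source, _ in toy_rows]
kept_rows = [row for row, keep in zip(toy_rows, keep_mask) if keep]

print(sorted(single_sources))  # ['George Will', 'Rich Lowry']
print(sum(keep_mask))          # 4
```

The mask is computed once and can be reused, either to drop the single-article sources or, inverted, to inspect them.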


In [39]:
len(articles.source.unique())

113