In [51]:
# initial imports
import pandas as pd
import numpy as np
import os

# nltk sentiment analysis
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# textblob sentiment analysis
from textblob import TextBlob


data_path = "~/Desktop/NewsGenerator/data/"

First, we load all of the data into one DataFrame.

In [28]:
df1 = pd.read_csv(data_path+"articles1.csv").drop("Unnamed: 0", axis=1)
df2 = pd.read_csv(data_path+"articles2.csv").drop("Unnamed: 0", axis=1)
df3 = pd.read_csv(data_path+"articles3.csv").drop("Unnamed: 0", axis=1)

In [32]:
df = pd.concat([df1, df2, df3])

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142570 entries, 0 to 42570
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   id           142570 non-null  int64  
 1   title        142568 non-null  object 
 2   publication  142570 non-null  object 
 3   author       126694 non-null  object 
 4   date         139929 non-null  object 
 5   year         139929 non-null  float64
 6   month        139929 non-null  float64
 7   url          85559 non-null   object 
 8   content      142570 non-null  object 
dtypes: float64(2), int64(1), object(6)
memory usage: 10.9+ MB
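
The index summary above (142570 entries labelled 0 to 42570) suggests that row labels repeat across the three files, because `pd.concat` keeps each frame's original index by default. The following is a minimal sketch on hypothetical toy frames (`a` and `b` are stand-ins, not the real article files) showing how `ignore_index=True` would produce unique labels instead.

```python
import pandas as pd

# Hypothetical toy frames standing in for articles1.csv, articles2.csv, ...
a = pd.DataFrame({"title": ["a0", "a1"]})
b = pd.DataFrame({"title": ["b0", "b1", "b2"]})

kept = pd.concat([a, b])                      # labels: 0, 1, 0, 1, 2
fresh = pd.concat([a, b], ignore_index=True)  # labels: 0, 1, 2, 3, 4

print(kept.index.is_unique)   # False
print(fresh.index.is_unique)  # True
```

Positional access with `.iloc` behaves the same either way; the difference only matters for label-based access such as `.loc[1]`, which returns several rows when labels repeat.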


### Types of Publications

We first need to figure out what kinds of articles the dataset contains. Printing the unique publication names helps define the project better. A couple of rudimentary analyses then give basic statistics, such as the number of articles per publication and how the dataset is divided up among the publications.

In [39]:
pubs = np.unique(df.publication)
pubs

array(['Atlantic', 'Breitbart', 'Business Insider', 'Buzzfeed News',
       'CNN', 'Fox News', 'Guardian', 'NPR', 'National Review',
       'New York Post', 'New York Times', 'Reuters',
       'Talking Points Memo', 'Vox', 'Washington Post'], dtype=object)

In [45]:
total_count = len(df)  # 142570 articles in the combined DataFrame
for x in pubs:
    print(x)
    n = df[df.publication == x].shape[0]
    print("\tNumber of Articles:",n)
    print("\tFraction of the dataset:",round(n/total_count*100,2))

Atlantic
	Number of Articles: 7179
	Fraction of the dataset: 5.04
Breitbart
	Number of Articles: 23781
	Fraction of the dataset: 16.68
Business Insider
	Number of Articles: 6757
	Fraction of the dataset: 4.74
Buzzfeed News
	Number of Articles: 4854
	Fraction of the dataset: 3.4
CNN
	Number of Articles: 11488
	Fraction of the dataset: 8.06
Fox News
	Number of Articles: 4354
	Fraction of the dataset: 3.05
Guardian
	Number of Articles: 8681
	Fraction of the dataset: 6.09
NPR
	Number of Articles: 11992
	Fraction of the dataset: 8.41
National Review
	Number of Articles: 6203
	Fraction of the dataset: 4.35
New York Post
	Number of Articles: 17493
	Fraction of the dataset: 12.27
New York Times
	Number of Articles: 7803
	Fraction of the dataset: 5.47
Reuters
	Number of Articles: 10710
	Fraction of the dataset: 7.51
Talking Points Memo
	Number of Articles: 5214
	Fraction of the dataset: 3.66
Vox
	Number of Articles: 4947
	Fraction of the dataset: 3.47
Washington Post
	Number of Articles: 11114
	Fraction of the dataset: 7.8


In [37]:
df[df.publication == "Breitbart"].shape[0]

23781

In [38]:
df[df.publication == "Fox News"].shape[0]

4354
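
The per-publication counts and shares printed above could also be obtained without an explicit loop. This is a minimal sketch using `Series.value_counts`; the toy `publication` series is a hypothetical stand-in for `df.publication`, and the `summary` frame is only one possible way to lay out the result.

```python
import pandas as pd

# Hypothetical toy column standing in for df.publication
publication = pd.Series(["Vox", "CNN", "CNN", "NPR", "CNN", "Vox"])

# Number of articles per publication, most frequent first
counts = publication.value_counts()

# Share of the dataset per publication, as a percentage
percent = (publication.value_counts(normalize=True) * 100).round(2)

summary = pd.DataFrame({"articles": counts, "percent": percent})
print(summary)
```

On the real data, replacing `publication` with `df.publication` should reproduce the same numbers as the loop.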

### Sentiment Analysis

In [68]:
sid = SentimentIntensityAnalyzer()

We can first take a look at a New York Times (NYT) article.

In [54]:
df.iloc[1].title

'Rift Between Officers and Residents as Killings Persist in South Bronx - The New York Times'

In [69]:
nyt = df.iloc[1]
nyt

id                                                         17284
title          Rift Between Officers and Residents as Killing...
publication                                       New York Times
author                             Benjamin Mueller and Al Baker
date                                                  2017-06-19
year                                                        2017
month                                                          6
url                                                          NaN
content        After the bullet shells get counted, the blood...
Name: 1, dtype: object

In [70]:
sid.polarity_scores(nyt.content)

{'neg': 0.157, 'neu': 0.784, 'pos': 0.059, 'compound': -1.0}

Then, we take a look at a random Breitbart article.

In [67]:
df.iloc[10001].title

'Watch: Spicer Asked How It Feels ’To Work for a Fascist?’ In Apple Store - Breitbart'

In [71]:
bb = df.iloc[10001]
bb

id                                                         28737
title          Watch: Spicer Asked How It Feels ’To Work for ...
publication                                            Breitbart
author                                              Ian Hanchett
date                                                  2017-03-12
year                                                        2017
month                                                          3
url                                                          NaN
content        Asking @PressSec questions in Apple Store sinc...
Name: 10001, dtype: object

In [72]:
sid.polarity_scores(bb.content)

{'neg': 0.156, 'neu': 0.755, 'pos': 0.089, 'compound': -0.9282}

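Despite the very different content and political slant, the sentiment scores barely differ between the two articles. Both are scored as mostly neutral (neu of 0.784 and 0.755) with nearly identical negative shares (neg of 0.157 and 0.156), and both compound scores are strongly negative (-1.0 and -0.9282).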
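
One likely reason both compound scores sit so close to -1 is article length. As far as I understand the VADER implementation, the compound score is the sum of the word-level valence scores squashed into [-1, 1] by x / sqrt(x^2 + alpha) with alpha = 15, then rounded to four decimals. The sketch below re-implements only that normalization step; the function name and the example sums are hypothetical, not values taken from these articles.

```python
import math

def normalize(score_sum, alpha=15):
    """Sketch of VADER-style compound normalization: x / sqrt(x^2 + alpha)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

# Hypothetical summed valence scores, growing as a text gets longer
for total in (-1, -5, -20, -400):
    print(total, round(normalize(total), 4))
```

As the summed valence grows in magnitude, the normalized value saturates toward -1, so long articles with even a modest negative lean can end up with near-identical compound scores. The neg, neu, and pos proportions are less affected by length and may be the more useful numbers to compare here.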

What if we wanted to use a different sentiment analysis tool?