# Data Import, Cleaning, and Preparation

The data used for this project comes from Kaggle:
https://www.kaggle.com/snapcrack/all-the-news

The dataset is described as roughly 150,000 articles from 15 different news providers (142,570 rows once the files are combined), and it comes as three CSV files.

In [121]:
import pandas as pd
import re

In [2]:
# Bring in the csv files
csv1 = pd.read_csv('articles1.csv')
csv2 = pd.read_csv('articles2.csv')
csv3 = pd.read_csv('articles3.csv')

In [58]:
# Concatenate the DataFrames (DataFrame.append was removed in pandas 2.0)
frames = [df.drop('Unnamed: 0', axis=1) for df in (csv1, csv2, csv3)]
data = pd.concat(frames).reset_index()

In [63]:
data.shape

(142570, 10)
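
One of the ten columns is `index`, which `reset_index()` creates from the old per-file row labels. As a minimal sketch on toy frames (the data here is made up, not taken from the article files), this compares `reset_index()` with `pd.concat(..., ignore_index=True)`, which renumbers the rows without adding a column:

```python
import pandas as pd

# Toy stand-ins for the article files (hypothetical data).
part1 = pd.DataFrame({'id': [1, 2], 'title': ['a', 'b']})
part2 = pd.DataFrame({'id': [3], 'title': ['c']})

# reset_index() keeps the old per-file row labels as a new 'index' column.
with_index_col = pd.concat([part1, part2]).reset_index()

# ignore_index=True renumbers the rows and adds no extra column.
renumbered = pd.concat([part1, part2], ignore_index=True)

print(with_index_col.columns.tolist())
print(renumbered.columns.tolist())
```

Which form is preferable depends on whether the original row position within each file is worth keeping.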

In [64]:
data.head()

Unnamed: 0,index,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [65]:
data['publication'].value_counts()

Breitbart              23781
New York Post          17493
NPR                    11992
CNN                    11488
Washington Post        11114
Reuters                10710
Guardian                8681
New York Times          7803
Atlantic                7179
Business Insider        6757
National Review         6203
Talking Points Memo     5214
Vox                     4947
Buzzfeed News           4854
Fox News                4354
Name: publication, dtype: int64

In [66]:
data['author'].value_counts().head(10)

Breitbart News      1559
Pam Key             1282
Associated Press    1231
Charlie Spiering     928
Jerome Hudson        806
John Hayward         747
Daniel Nussbaum      735
AWR Hawkins          720
Ian Hanchett         647
Joel B. Pollak       624
Name: author, dtype: int64

Breitbart has more articles than any other publication, and its articles are more likely not to list a named author: the most common byline overall is the generic "Breitbart News". Pam Key, Charlie Spiering, Jerome Hudson, John Hayward, and Daniel Nussbaum are all writers for Breitbart.
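
Note that `value_counts()` skips missing values by default, so the counts above say nothing about articles with no author at all. A rough way to check the "no named author" claim per publication might look like the sketch below. The toy data and the `house_bylines` set are assumptions for illustration, not something defined in this notebook:

```python
import pandas as pd

# Toy data standing in for the 'publication' and 'author' columns.
toy = pd.DataFrame({
    'publication': ['Breitbart', 'Breitbart', 'Breitbart', 'NPR', 'NPR'],
    'author': ['Breitbart News', 'Pam Key', None, 'Jane Doe', None],
})

# Hypothetical list of generic bylines that do not name a person.
house_bylines = {'Breitbart News', 'Associated Press'}

# An article has no named author if the field is missing or a house byline.
no_named_author = toy['author'].isna() | toy['author'].isin(house_bylines)

# Share of such articles within each publication.
share_unnamed = no_named_author.groupby(toy['publication']).mean()
print(share_unnamed)
```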

In [67]:
authors = data['author'].value_counts()

In [68]:
prolific_authors = authors[authors>100]
prolific_authors

Breitbart News       1559
Pam Key              1282
Associated Press     1231
Charlie Spiering      928
Jerome Hudson         806
                     ... 
Robinson Meyer        103
Cartel Chronicles     102
Ed Yong               102
Matt Zapotosky        102
John Wagner           102
Name: author, Length: 221, dtype: int64

In [69]:
data.loc[data['author'] == prolific_authors.index[1], 'publication'].value_counts().index[0]

'Breitbart'

In [70]:
# For each prolific author, find the publication they appear in most often
publishers = [
    data.loc[data['author'] == author, 'publication'].value_counts().index[0]
    for author in prolific_authors.index
]
authors_publishers = pd.DataFrame({'authors': prolific_authors.index, 'publisher': publishers})

In [71]:
authors_publishers['count'] = prolific_authors.values
authors_publishers.head(20)

Unnamed: 0,authors,publisher,count
0,Breitbart News,Breitbart,1559
1,Pam Key,Breitbart,1282
2,Associated Press,New York Post,1231
3,Charlie Spiering,Breitbart,928
4,Jerome Hudson,Breitbart,806
5,John Hayward,Breitbart,747
6,Daniel Nussbaum,Breitbart,735
7,AWR Hawkins,Breitbart,720
8,Ian Hanchett,Breitbart,647
9,Joel B. Pollak,Breitbart,624


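Interesting. Nearly all of the really prolific bylines belong to Breitbart. The only exception in the top ten is the Associated Press byline, which appears most often in the New York Post. The highest-ranked individual non-Breitbart author is Camila Domonoske of NPR.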
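
The per-author lookup above can also be written as a single `groupby`. This is a sketch on toy data; the threshold of 1 stands in for the 100 used in this notebook, and the names `toy`, `prolific`, and `summary` are only for illustration:

```python
import pandas as pd

# Toy data standing in for the 'author' and 'publication' columns.
toy = pd.DataFrame({
    'author': ['Pam Key', 'Pam Key', 'Associated Press',
               'Associated Press', 'Associated Press', 'Ed Yong'],
    'publication': ['Breitbart', 'Breitbart', 'New York Post',
                    'New York Post', 'Fox News', 'Atlantic'],
})

# Authors with more than one article (the notebook uses more than 100).
counts = toy['author'].value_counts()
prolific = counts[counts > 1]

# Most frequent publication for each prolific author.
top_publication = (
    toy[toy['author'].isin(prolific.index)]
    .groupby('author')['publication']
    .agg(lambda s: s.value_counts().index[0])
)

# Combine publisher and article count, aligned on the author name.
summary = pd.DataFrame({'publisher': top_publication, 'count': prolific})
summary = summary.sort_values('count', ascending=False)
print(summary)
```

Because both pieces are indexed by author, pandas aligns them by name, so the counts cannot drift out of order relative to the publishers.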

In [106]:
# Fill missing titles with a placeholder, then record each title's length in characters
data.loc[data['title'].isnull(), 'title'] = "No Title"
titles = pd.DataFrame({"title": data['title']})
titles['length'] = titles['title'].str.len()

In [107]:
# The placeholder is not a real title, so count its length as 0
titles.loc[titles['title']=='No Title', 'length'] = 0

In [108]:
data['title_length'] = titles['length']

In [109]:
data.head()

Unnamed: 0,index,id,title,publication,author,date,year,month,url,content,title_length
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,80
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",91
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",84
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",68
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",89


In [110]:
# Record each article's length in characters
contents = pd.DataFrame({'content': data['content']})
contents['length'] = contents['content'].str.len()
data['content_length'] = contents['length']
data.head()

Unnamed: 0,index,id,title,publication,author,date,year,month,url,content,title_length,content_length
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,80,5607
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",91,27834
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",84,14018
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",68,12274
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",89,4195
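
Both length columns can also be computed directly with `.str.len()`, treating a missing title as length 0. This is a sketch on toy data; unlike the cells above, it leaves missing titles as missing rather than replacing them with "No Title":

```python
import pandas as pd

# Toy data with one missing title (hypothetical values).
toy = pd.DataFrame({
    'title': ['Short title', None, 'Another one'],
    'content': ['abc', 'de', ''],
})

# .str.len() returns NaN for missing values, so fill those with 0.
toy['title_length'] = toy['title'].str.len().fillna(0).astype(int)
toy['content_length'] = toy['content'].str.len()
print(toy)
```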


Okay, now let's get into cleaning the text. We are going to tokenize the articles, so we want to change everything to lower case and remove punctuation. Let's try it on a single article first.

In [111]:
try_art = data.loc[0, 'content']
try_art

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 

In [118]:
try_art = try_art.lower()
try_art_list = try_art.split()

In [124]:
try_art_punc = [re.sub(r'[^\w\s]', '', word) for word in try_art_list]
try_art_punc

['washington',
 '',
 'congressional',
 'republicans',
 'have',
 'a',
 'new',
 'fear',
 'when',
 'it',
 'comes',
 'to',
 'their',
 'health',
 'care',
 'lawsuit',
 'against',
 'the',
 'obama',
 'administration',
 'they',
 'might',
 'win',
 'the',
 'incoming',
 'trump',
 'administration',
 'could',
 'choose',
 'to',
 'no',
 'longer',
 'defend',
 'the',
 'executive',
 'branch',
 'against',
 'the',
 'suit',
 'which',
 'challenges',
 'the',
 'administrations',
 'authority',
 'to',
 'spend',
 'billions',
 'of',
 'dollars',
 'on',
 'health',
 'insurance',
 'subsidies',
 'for',
 'and',
 'americans',
 'handing',
 'house',
 'republicans',
 'a',
 'big',
 'victory',
 'on',
 'issues',
 'but',
 'a',
 'sudden',
 'loss',
 'of',
 'the',
 'disputed',
 'subsidies',
 'could',
 'conceivably',
 'cause',
 'the',
 'health',
 'care',
 'program',
 'to',
 'implode',
 'leaving',
 'millions',
 'of',
 'people',
 'without',
 'access',
 'to',
 'health',
 'insurance',
 'before',
 'republicans',
 'have',
 'prepared',
 '

In [129]:
# Drop the empty strings left behind by tokens that were only punctuation
try_art_punc = [word for word in try_art_punc if word]

In [134]:
try_art_out = " ".join(try_art_punc)
try_art_out

'washington congressional republicans have a new fear when it comes to their health care lawsuit against the obama administration they might win the incoming trump administration could choose to no longer defend the executive branch against the suit which challenges the administrations authority to spend billions of dollars on health insurance subsidies for and americans handing house republicans a big victory on issues but a sudden loss of the disputed subsidies could conceivably cause the health care program to implode leaving millions of people without access to health insurance before republicans have prepared a replacement that could lead to chaos in the insurance market and spur a political backlash just as republicans gain full control of the government to stave off that outcome republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the obama health care law angering conservative voters who have been demanding an end to the la
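
It is worth seeing what the pattern `[^\w\s]` does at the edges. It removes apostrophes and hyphens as well as sentence punctuation, which is why "administration’s" became "administrations" above, but it keeps digits and underscores because both count as `\w`. The sketch below wraps the same steps in a function; the name `clean_text` and the sample sentence are made up for illustration:

```python
import re

def clean_text(text):
    """Lowercase, strip punctuation from each token, drop empty tokens."""
    words = text.lower().split()
    words = [re.sub(r'[^\w\s]', '', word) for word in words]
    return " ".join(word for word in words if word)

# Hypothetical sample covering apostrophes, dashes, hyphens, and underscores.
sample = "The administration’s plan — well-known, it isn’t new_1!"
print(clean_text(sample))
```

Whether merging "well-known" into "wellknown" is acceptable depends on the tokenization planned later; replacing hyphens with a space before this step would be one alternative.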

In [141]:
clean_content = []
for article in data['content']:
    # Lowercase the article and split it on whitespace
    words = article.lower().split()
    # Strip punctuation from each token
    words = [re.sub(r'[^\w\s]', '', word) for word in words]
    # Drop tokens that were only punctuation, then rejoin with single spaces
    clean_content.append(" ".join(word for word in words if word))

In [143]:
data['clean_content'] = clean_content
data.head()

Unnamed: 0,index,id,title,publication,author,date,year,month,url,content,title_length,content_length,clean_content
0,0.0,17283.0,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,80.0,5607.0,washington congressional republicans have a ne...
1,1.0,17284.0,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",91.0,27834.0,after the bullet shells get counted the blood ...
2,2.0,17285.0,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",84.0,14018.0,when walt disneys bambi opened in 1942 critics...
3,3.0,17286.0,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",68.0,12274.0,death may be the great equalizer but it isnt n...
4,4.0,17287.0,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",89.0,4195.0,seoul south korea north koreas leader kim said...
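
The same cleaning can be expressed with pandas string methods instead of an explicit Python loop. This is a sketch on a toy Series and has not been timed against the loop on the full dataset. Removing punctuation from the whole string first and then splitting on whitespace gives the same tokens, since punctuation-only tokens simply disappear:

```python
import pandas as pd

# Toy articles standing in for data['content'] (hypothetical text).
toy = pd.Series([
    'WASHINGTON  —   Congressional Republicans have a new fear.',
    'It isn’t new: They might win!',
])

cleaned = (
    toy.str.lower()                               # lowercase everything
       .str.replace(r'[^\w\s]', '', regex=True)   # remove punctuation
       .str.split()                               # split on any whitespace
       .str.join(' ')                             # rejoin with single spaces
)
print(cleaned.tolist())
```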


In [144]:
# Don't do this. The file is huge and makes everything run very slowly.
#data.to_csv('cleaned_news_data.csv')
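
If a saved copy is wanted anyway, one option is to write only the columns needed downstream and compress the output. This is a sketch on toy data; the choice of columns and the `.csv.gz` file name are assumptions, and it has not been run on the full dataset:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical values).
toy = pd.DataFrame({
    'id': [1, 2],
    'content': ['Raw Text, one.', 'Raw Text, two.'],
    'clean_content': ['raw text one', 'raw text two'],
})

# Hypothetical choice: keep the id and cleaned text, drop the raw text.
columns_to_keep = ['id', 'clean_content']

# Write a gzip-compressed CSV without the row index.
out_path = os.path.join(tempfile.mkdtemp(), 'cleaned_news_data.csv.gz')
toy[columns_to_keep].to_csv(out_path, index=False, compression='gzip')

# read_csv infers gzip compression from the .gz extension.
restored = pd.read_csv(out_path)
print(restored)
```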