### Joining Fake News data

This notebook joins datasets from three sources: the McIntire Kaggle dataset, the news and celebrity datasets from Veronica et al. 2017, and the BuzzFeed Fake News Corpus.

In [2]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
buzz = pd.read_csv("data/buzzfeedcorpus.csv")
kaggle = pd.read_csv("data/fake_or_real_news.csv")
news = pd.read_csv('data/news.csv')
celebrity = pd.read_csv("data/celebrity.csv")

buzz.shape
kaggle.shape
news.shape
celebrity.shape

(1627, 5)

(6335, 4)

(480, 4)

(500, 4)

In [5]:
buzz.columns
kaggle.columns
news.columns
celebrity.columns

Index(['title', 'author', 'text', 'url', 'veracity'], dtype='object')

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')

Index(['content', 'class', 'source', 'file_name'], dtype='object')

Index(['content', 'class', 'source', 'file_name'], dtype='object')

#### Column differences
We'll need to work around the differences in columns. Ideally we would keep publisher, source, and author, but some of the datasets don't include those fields.

We'll keep the BuzzFeed layout, [title, author, text, url, label] (with `veracity` renamed to `label`), and force the other datasets into that format, filling any column a dataset lacks with NaN.
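The coercion described above can be sketched generically. This is only an illustration: `to_common_schema` and `TARGET_COLUMNS` are hypothetical names that do not appear in this notebook, and the toy frame holds made-up values. The sketch renames the columns that exist and then uses `reindex` so that any missing column is created and filled with NaN.

```python
import pandas as pd

# Target layout, taken from the BuzzFeed data with 'veracity' renamed to 'label'.
TARGET_COLUMNS = ['title', 'author', 'text', 'url', 'label']

def to_common_schema(df, renames=None):
    """Rename columns per `renames`, then keep only TARGET_COLUMNS.

    Columns missing from `df` are created and filled with NaN by reindex;
    columns not in TARGET_COLUMNS are dropped.
    """
    renamed = df.rename(columns=renames or {})
    return renamed.reindex(columns=TARGET_COLUMNS)

# Toy frame shaped like the news/celebrity data (values are made up).
toy = pd.DataFrame({
    'content': ['some article text'],
    'class': ['fake'],
    'source': ['example'],
    'file_name': ['biz01.fake.txt'],
})
common = to_common_schema(
    toy, {'content': 'text', 'class': 'label', 'file_name': 'url'}
)
print(common.columns.tolist())
```

The cells below do the same thing by hand for each dataset; a helper like this mainly saves repetition if more sources are added later.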

In [6]:
buzz['veracity'].value_counts()
kaggle['label'].value_counts()
news['class'].value_counts()
celebrity['class'].value_counts()

mostly true                  1264
mixture of true and false     212
mostly false                   87
no factual content             64
Name: veracity, dtype: int64

REAL    3171
FAKE    3164
Name: label, dtype: int64

legit    240
fake     240
Name: class, dtype: int64

legit    250
fake     250
Name: class, dtype: int64

#### Modifying Kaggle Data

In [7]:
kaggle['author'], kaggle['url'] = np.nan, np.nan
kaggle = kaggle.drop('Unnamed: 0', axis=1)
kaggle = kaggle[['title', 'author', 'text', 'url', 'label']]

In [8]:
kaggle.head()

| | title | author | text | url | label |
|---|---|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | NaN | Daniel Greenfield, a Shillman Journalism Fello... | NaN | FAKE |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | NaN | Google Pinterest Digg Linkedin Reddit Stumbleu... | NaN | FAKE |
| 2 | Kerry to go to Paris in gesture of sympathy | NaN | U.S. Secretary of State John F. Kerry said Mon... | NaN | REAL |
| 3 | Bernie supporters on Twitter erupt in anger ag... | NaN | — Kaydee King (@KaydeeKing) November 9, 2016 T... | NaN | FAKE |
| 4 | The Battle of New York: Why This Primary Matters | NaN | It's primary day in New York and front-runners... | NaN | REAL |


#### Modifying Buzzfeed Data

In [10]:
buzz.rename(columns = {'veracity': 'label'}, inplace=True)

In [11]:
buzz.head()

| | title | author | text | url | label |
|---|---|---|---|---|---|
| 0 | The Impact of Debates? It's Debatable | GARY LANGER | With the Hillary Clinton-Donald Trump debates ... | http://abcnews.go.com/Politics/impact-debates-... | mostly true |
| 1 | Details Emerge About NYC Bomb Suspect Ahmad Kh... | Brian Ross Rhonda Schwartz Mike Levine Stephan... | As police today captured the man wanted for qu... | http://abcnews.go.com/US/source-suspect-wanted... | mostly true |
| 2 | Donald Trump Repeats Calls for Police Profilin... | ALANA ABRAMSON | One day after explosive devices were discovere... | http://abcnews.go.com/Politics/donald-trump-re... | mostly true |
| 3 | NY, NJ Bombings Suspect Charged With Attempted... | EMILY SHAPIRO Aaron Katersky Josh Margolin Mik... | Ahmad Khan Rahami, earlier named a person of i... | http://abcnews.go.com/US/bombing-incidences-ny... | mostly true |
| 4 | Trump Surrogates Push Narrative That Clinton S... | Candace Smith | Donald Trump's surrogates and leading supporte... | http://abcnews.go.com/Politics/trump-surrogate... | mostly true |


#### Modifying the news/celebrity data from Veronica et al. 2017

In [12]:
news.rename(columns = {'content': 'text', 'class': 'label', 'file_name':'url'}, inplace=True)
celebrity.rename(columns = {'content': 'text', 'class': 'label', 'file_name':'url'}, inplace=True)
news['title'], news['author'], celebrity['title'], celebrity['author'] = np.nan, np.nan, np.nan, np.nan
news, celebrity = news[['title', 'author', 'text', 'url', 'label']], celebrity[['title', 'author', 'text', 'url', 'label']]

In [13]:
news.head()
celebrity.head()

| | title | author | text | url | label |
|---|---|---|---|---|---|
| 0 | NaN | NaN | Alex Jones Vindicated in "Pizzagate" Controver... | biz01.fake.txt | fake |
| 1 | NaN | NaN | THE BIG DATA CONSPIRACYGovernment and Silicon ... | biz02.fake.txt | fake |
| 2 | NaN | NaN | California Surprisingly Lenient on Auto Emissi... | biz03.fake.txt | fake |
| 3 | NaN | NaN | Mexicans Are Chomping at the Bit to Stop NAFTA... | biz04.fake.txt | fake |
| 4 | NaN | NaN | Breaking News: Snapchat to purchase Twitter fo... | biz05.fake.txt | fake |


| | title | author | text | url | label |
|---|---|---|---|---|---|
| 0 | NaN | NaN | JENNIFER ANISTON HINTS AT FRIENDS REUNIONJenni... | 001fake.txt | fake |
| 1 | NaN | NaN | Brad Pitt Texts Jennifer Aniston Nonstop: Seek... | 002fake.txt | fake |
| 2 | NaN | NaN | Jennifer Aniston Pregnant With 'Miracle Baby' ... | 003fake.txt | fake |
| 3 | NaN | NaN | Heartbroken Jennifer Aniston Runs For The Bord... | 004fake.txt | fake |
| 4 | NaN | NaN | Jennifer Aniston Cheated On Brad Pitt With Mat... | 005fake.txt | fake |


#### Concatenating the separate datasets 

In [14]:
data = pd.concat([buzz, kaggle, news, celebrity])
data.head()

| | title | author | text | url | label |
|---|---|---|---|---|---|
| 0 | The Impact of Debates? It's Debatable | GARY LANGER | With the Hillary Clinton-Donald Trump debates ... | http://abcnews.go.com/Politics/impact-debates-... | mostly true |
| 1 | Details Emerge About NYC Bomb Suspect Ahmad Kh... | Brian Ross Rhonda Schwartz Mike Levine Stephan... | As police today captured the man wanted for qu... | http://abcnews.go.com/US/source-suspect-wanted... | mostly true |
| 2 | Donald Trump Repeats Calls for Police Profilin... | ALANA ABRAMSON | One day after explosive devices were discovere... | http://abcnews.go.com/Politics/donald-trump-re... | mostly true |
| 3 | NY, NJ Bombings Suspect Charged With Attempted... | EMILY SHAPIRO Aaron Katersky Josh Margolin Mik... | Ahmad Khan Rahami, earlier named a person of i... | http://abcnews.go.com/US/bombing-incidences-ny... | mostly true |
| 4 | Trump Surrogates Push Narrative That Clinton S... | Candace Smith | Donald Trump's surrogates and leading supporte... | http://abcnews.go.com/Politics/trump-surrogate... | mostly true |
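
Called this way, `pd.concat` keeps each frame's original row index, so index values repeat across datasets, and nothing records which dataset a row came from. A possible variant, shown here on made-up toy frames, tags each row with its origin and renumbers the index. The `dataset` column and the names `first` and `second` are assumptions for illustration, not part of this notebook.

```python
import pandas as pd

# Toy stand-ins for two of the source datasets (values are made up).
a = pd.DataFrame({'text': ['a1', 'a2'], 'label': ['real', 'fake']})
b = pd.DataFrame({'text': ['b1'], 'label': ['legit']})

frames = {'first': a, 'second': b}

# assign() returns a copy with the extra column, so the inputs are not modified;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
combined = pd.concat(
    [df.assign(dataset=name) for name, df in frames.items()],
    ignore_index=True,
)
print(combined)
```

Keeping the origin would make it possible to revisit the labelling decisions per dataset later without re-reading the source files.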


In [15]:
data['label'].value_counts()

REAL                         3171
FAKE                         3164
mostly true                  1264
legit                         490
fake                          490
mixture of true and false     212
mostly false                   87
no factual content             64
Name: label, dtype: int64

#### Changing labels
This is probably the most important step, and we're glossing over it for now: how do we reconcile the different ways of defining 'fake'? Each dataset defines it slightly differently. The BuzzFeed corpus is useful because it grades veracity ("mostly true", "mixture of true and false", and so on), whereas the McIntire dataset is simply real or fake.

We'll come back to this repeatedly, but this is just a first step to get us started.
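
One way to keep these decisions visible and easy to revise is to write them down as an explicit mapping. The sketch below is an assumption about how that could look, not the approach used in this notebook: `LABEL_MAP` and `harmonize_labels` are hypothetical names, and the mapping simply mirrors the first-pass choices made in the next cell. Labels that are not in the mapping, such as "mixture of true and false", are left unchanged.

```python
import pandas as pd

# Assumed first-pass mapping; revisit as the definition of 'fake' is refined.
LABEL_MAP = {
    'mostly true': 'real',
    'legit': 'real',
    'REAL': 'real',
    'FAKE': 'fake',
    'mostly false': 'fake',
    'no factual content': 'fake',
}

def harmonize_labels(labels):
    """Map known labels to 'real'/'fake'; labels not in LABEL_MAP pass through."""
    return labels.replace(LABEL_MAP)

toy_labels = pd.Series(
    ['REAL', 'legit', 'mostly false', 'fake', 'mixture of true and false']
)
harmonized = harmonize_labels(toy_labels)
print(harmonized.tolist())
```

Because the mapping is a plain dictionary, changing how a category such as "no factual content" is treated is a one-line edit.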

In [16]:
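# Restrict the replacement to the label column so that matching strings
# in other columns (title, text, ...) are left untouched.
data['label'] = data['label'].replace(['mostly true', 'legit', 'REAL'], 'real')
data['label'] = data['label'].replace(['FAKE', 'mostly false', 'no factual content'], 'fake')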

data['label'].value_counts()

real                         4925
fake                         3805
mixture of true and false     212
Name: label, dtype: int64

In [None]:
# data.to_csv('combinedData.csv', index=False)