### Joining Fake News data

This notebook joins several fake-news datasets: the McIntire Kaggle set, Pérez-Rosas et al. 2017, and the BuzzFeed Fake News Corpus.

In [1]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
buzz = pd.read_csv("data/buzzfeedcorpus.csv")
kaggle = pd.read_csv("data/fake_or_real_news.csv")
news = pd.read_csv('data/news.csv')
celebrity = pd.read_csv("data/celebrity.csv")

buzz.shape
kaggle.shape
news.shape
celebrity.shape

(1627, 5)

(6335, 4)

(480, 4)

(500, 4)

In [3]:
buzz.columns
kaggle.columns
news.columns
celebrity.columns

Index(['title', 'author', 'text', 'url', 'veracity'], dtype='object')

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')

Index(['content', 'class', 'source', 'file_name'], dtype='object')

Index(['content', 'class', 'source', 'file_name'], dtype='object')

#### Column differences
We'll need to work around the differences in columns. While I'd rather have the publisher, source, and author for every article, some of our data just isn't good enough for that.

We'll keep the BuzzFeed layout, [title, author, text, url, label] (with BuzzFeed's `veracity` column renamed to `label`), and force all the other data into that format.
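
The reshaping that the following cells do by hand for each source can be sketched as one helper. This is a minimal sketch on toy frames, not the notebook's own code: `to_common_schema`, `COMMON_COLUMNS`, and the toy data are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Target layout from the text above; a `dataset` column is appended to record each row's origin.
COMMON_COLUMNS = ["title", "author", "text", "url", "label"]


def to_common_schema(df, rename, dataset):
    """Hypothetical helper: rename columns, add missing ones as NaN, fix the column order."""
    out = df.rename(columns=rename)  # rename returns a new frame, the input is untouched
    for col in COMMON_COLUMNS:
        if col not in out.columns:
            out[col] = np.nan  # information this source does not provide
    out = out[COMMON_COLUMNS].copy()  # also drops extras such as 'Unnamed: 0'
    out["dataset"] = dataset
    return out


# Toy frames standing in for the real CSV files.
toy_kaggle = pd.DataFrame(
    {"Unnamed: 0": [7], "title": ["t"], "text": ["body"], "label": ["FAKE"]}
)
toy_news = pd.DataFrame(
    {"content": ["body"], "class": ["legit"], "source": ["s"], "file_name": ["biz01.legit.txt"]}
)

kaggle_std = to_common_schema(toy_kaggle, {}, "Risdal-McIntire")
news_std = to_common_schema(
    toy_news, {"content": "text", "class": "label", "file_name": "url"}, "Rosas-News"
)

print(list(kaggle_std.columns))
print(list(news_std.columns))
```

Both frames come out with the same six columns in the same order, which is what lets them be stacked later without misaligned fields.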

In [4]:
buzz['veracity'].value_counts()
kaggle['label'].value_counts()
news['class'].value_counts()
celebrity['class'].value_counts()

mostly true                  1264
mixture of true and false     212
mostly false                   87
no factual content             64
Name: veracity, dtype: int64

REAL    3171
FAKE    3164
Name: label, dtype: int64

legit    240
fake     240
Name: class, dtype: int64

legit    250
fake     250
Name: class, dtype: int64

#### Modifying Kaggle Data

In [5]:
kaggle['author'], kaggle['url'] = np.nan, np.nan
kaggle = kaggle.drop('Unnamed: 0', axis=1)
kaggle = kaggle[['title', 'author', 'text', 'url', 'label']]
kaggle['dataset'] = 'Risdal-McIntire'

In [6]:
kaggle.head()

,title,author,text,url,label,dataset
0,You Can Smell Hillary’s Fear,,"Daniel Greenfield, a Shillman Journalism Fello...",,FAKE,Risdal-McIntire
1,Watch The Exact Moment Paul Ryan Committed Pol...,,Google Pinterest Digg Linkedin Reddit Stumbleu...,,FAKE,Risdal-McIntire
2,Kerry to go to Paris in gesture of sympathy,,U.S. Secretary of State John F. Kerry said Mon...,,REAL,Risdal-McIntire
3,Bernie supporters on Twitter erupt in anger ag...,,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",,FAKE,Risdal-McIntire
4,The Battle of New York: Why This Primary Matters,,It's primary day in New York and front-runners...,,REAL,Risdal-McIntire


#### Modifying Buzzfeed Data

In [7]:
buzz.rename(columns = {'veracity': 'label'}, inplace=True)
buzz['dataset'] = 'Buzzfeed'

In [8]:
buzz.head()

,title,author,text,url,label,dataset
0,The Impact of Debates? It's Debatable,GARY LANGER,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,mostly true,Buzzfeed
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,Brian Ross Rhonda Schwartz Mike Levine Stephan...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,mostly true,Buzzfeed
2,Donald Trump Repeats Calls for Police Profilin...,ALANA ABRAMSON,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,mostly true,Buzzfeed
3,"NY, NJ Bombings Suspect Charged With Attempted...",EMILY SHAPIRO Aaron Katersky Josh Margolin Mik...,"Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,mostly true,Buzzfeed
4,Trump Surrogates Push Narrative That Clinton S...,Candace Smith,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,mostly true,Buzzfeed


#### Modifying the news/celebrity data from Pérez-Rosas et al. 2017

In [9]:
# file_name goes into the 'url' column because these files have no URL column
news.rename(columns = {'content': 'text', 'class': 'label', 'file_name':'url'}, inplace=True)
celebrity.rename(columns = {'content': 'text', 'class': 'label', 'file_name':'url'}, inplace=True)
news['title'], news['author'], celebrity['title'], celebrity['author'] = np.nan, np.nan, np.nan, np.nan
# .copy() gives independent frames, so adding 'dataset' below does not write to a slice
cols = ['title', 'author', 'text', 'url', 'label']
news, celebrity = news[cols].copy(), celebrity[cols].copy()

news['dataset'] = 'Rosas-News'
celebrity['dataset'] = 'Rosas-Celebrity'

In [10]:
news.head()
celebrity.head()

,title,author,text,url,label,dataset
0,,,"Alex Jones Vindicated in ""Pizzagate"" Controver...",biz01.fake.txt,fake,Rosas-News
1,,,THE BIG DATA CONSPIRACYGovernment and Silicon ...,biz02.fake.txt,fake,Rosas-News
2,,,California Surprisingly Lenient on Auto Emissi...,biz03.fake.txt,fake,Rosas-News
3,,,Mexicans Are Chomping at the Bit to Stop NAFTA...,biz04.fake.txt,fake,Rosas-News
4,,,Breaking News: Snapchat to purchase Twitter fo...,biz05.fake.txt,fake,Rosas-News


,title,author,text,url,label,dataset
0,,,JENNIFER ANISTON HINTS AT FRIENDS REUNIONJenni...,001fake.txt,fake,Rosas-Celebrity
1,,,Brad Pitt Texts Jennifer Aniston Nonstop: Seek...,002fake.txt,fake,Rosas-Celebrity
2,,,Jennifer Aniston Pregnant With 'Miracle Baby' ...,003fake.txt,fake,Rosas-Celebrity
3,,,Heartbroken Jennifer Aniston Runs For The Bord...,004fake.txt,fake,Rosas-Celebrity
4,,,Jennifer Aniston Cheated On Brad Pitt With Mat...,005fake.txt,fake,Rosas-Celebrity


In [11]:
news['text'][9]

'Uber to open new headquarters in Denmark despite new lawsDespite the introduction of new taxi laws in Denmark the Uber has said this is not a setback for the US ride-booking service in Eurpoe. A company spokesman Kristian Agerbo said on Tuesday Uber "must take the consequences" of the rules, which among other things will require cabs to be fitted with seat occupancy sensors and fare meters, and pass it on to the drivers. Uber ordered a new fleet of self-driving cars following Arizona crash to be created and tested to comply with these new laws. Uber has not faced problems in cities like Madrid Frankfurt Paris and London has long been lauded as the transportations service of the digital age. They plan to mass produce these new sensors and meters and offer them to their drivers on a rental basis to add to their profit lines. Depending on success they may be made available to other driving companies as well.'

#### Concatenating the separate datasets 
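One thing to keep in mind when stacking the frames: `pd.concat` keeps each frame's own row labels by default, so the combined index contains repeats. The toy example below (frames `a` and `b` are made up for illustration) shows the difference that `ignore_index=True` makes. Whether to renumber is a judgment call; the cell that follows keeps the default.

```python
import pandas as pd

# Two tiny stand-in frames, each with its own 0..1 index.
a = pd.DataFrame({"text": ["a0", "a1"], "dataset": "A"})
b = pd.DataFrame({"text": ["b0", "b1"], "dataset": "B"})

stacked = pd.concat([a, b])                         # index labels are kept: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)   # fresh index: 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(len(stacked.loc[0]))  # label 0 now selects one row from each source
```

With repeated labels, a lookup such as `data['text'][9]` returns one row per source frame rather than a single article.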

In [12]:
data = pd.concat([buzz, kaggle, news, celebrity])
data.head()

,title,author,text,url,label,dataset
0,The Impact of Debates? It's Debatable,GARY LANGER,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,mostly true,Buzzfeed
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,Brian Ross Rhonda Schwartz Mike Levine Stephan...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,mostly true,Buzzfeed
2,Donald Trump Repeats Calls for Police Profilin...,ALANA ABRAMSON,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,mostly true,Buzzfeed
3,"NY, NJ Bombings Suspect Charged With Attempted...",EMILY SHAPIRO Aaron Katersky Josh Margolin Mik...,"Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,mostly true,Buzzfeed
4,Trump Surrogates Push Narrative That Clinton S...,Candace Smith,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,mostly true,Buzzfeed


In [13]:
data['label'].value_counts()

REAL                         3171
FAKE                         3164
mostly true                  1264
legit                         490
fake                          490
mixture of true and false     212
mostly false                   87
no factual content             64
Name: label, dtype: int64

#### Changing labels
This is probably the most important step, and we're glossing over it for now: how do we reconcile the different ways each dataset defines 'fake'? The BuzzFeed scheme is the richest because it grades veracity ("mostly true", "mixture of true and false", and so on), whereas the McIntire data is simply real or fake.

We'll come back to this repeatedly, but this is just a first step to get us started.
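
One way to make this first-pass decision explicit and easy to revisit is to keep it in a single dictionary and apply it to the label column only. The sketch below runs on a toy frame; `LABEL_MAP` is a name introduced here, and its contents simply mirror the first-pass choices made in this notebook rather than any agreed standard.

```python
import pandas as pd

# Assumed first-pass mapping; anything not listed is left unchanged.
LABEL_MAP = {
    "mostly true": "real",
    "legit": "real",
    "REAL": "real",
    "FAKE": "fake",
    "mostly false": "fake",
    "no factual content": "fake",
}

toy = pd.DataFrame(
    {
        # The first text cell happens to equal a label string on purpose.
        "text": ["FAKE", "a", "b", "c"],
        "label": ["REAL", "mostly false", "legit", "mixture of true and false"],
    }
)

# Replacing on the Series touches only the label column, never the article text.
toy["label"] = toy["label"].replace(LABEL_MAP)

# Report labels that still fall outside the binary scheme.
unmapped = sorted(set(toy["label"]) - {"real", "fake"})

print(toy["label"].tolist())
print(unmapped)
```

Listing the unmapped labels keeps the open question visible: "mixture of true and false" survives untouched until a decision is made about it.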

In [14]:
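# First-pass mapping, applied to the label column only so article text is never touched.
# 'mixture of true and false' is left as-is for now.
data['label'] = data['label'].replace(['mostly true', 'legit', 'REAL'], 'real')
data['label'] = data['label'].replace(['FAKE', 'mostly false', 'no factual content'], 'fake')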

data['label'].value_counts()

real                         4925
fake                         3805
mixture of true and false     212
Name: label, dtype: int64

In [15]:
# data.to_csv('combinedData.csv', index=False)

### Adding another dataset from Wang 2018

In [16]:
data.head()

,title,author,text,url,label,dataset
0,The Impact of Debates? It's Debatable,GARY LANGER,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,real,Buzzfeed
1,Details Emerge About NYC Bomb Suspect Ahmad Kh...,Brian Ross Rhonda Schwartz Mike Levine Stephan...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,real,Buzzfeed
2,Donald Trump Repeats Calls for Police Profilin...,ALANA ABRAMSON,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,real,Buzzfeed
3,"NY, NJ Bombings Suspect Charged With Attempted...",EMILY SHAPIRO Aaron Katersky Josh Margolin Mik...,"Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,real,Buzzfeed
4,Trump Surrogates Push Narrative That Clinton S...,Candace Smith,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,real,Buzzfeed


In [17]:
df = pd.read_csv('liar_datasetWang2018/train.tsv', sep='\t', header=None)
df2 = pd.read_csv('liar_datasetWang2018/test.tsv', sep='\t', header=None)
df3 = pd.read_csv('liar_datasetWang2018/valid.tsv', sep='\t', header=None)

df = pd.concat([df, df2, df3])

In [18]:
df.columns = ['statement_id', 'label', 'text', 'subject', 'speaker', 'speaker_title', 'state_info', 'party affiliation', 'credit1', 'credit2', 'credit3', 'credit4', 'credit5', 'context']

In [19]:
# Keep only the four clear-cut ratings
df = df[df['label'].isin(['false', 'true', 'mostly-true', 'mostly-false'])]
# Drop any statement with a missing field, then fold the "mostly" ratings into true/false
df = df.dropna()
df = df.replace(['mostly-true', 'mostly-false'], ['true', 'false'])

In [20]:
df.head()

,statement_id,label,text,subject,speaker,speaker_title,state_info,party affiliation,credit1,credit2,credit3,credit4,credit5,context
0,2635.json,False,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
2,324.json,True,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
5,12465.json,True,The Chicago Bears have had more starting quart...,education,robin-vos,Wisconsin Assembly speaker,Wisconsin,republican,0.0,3.0,2.0,5.0,1.0,a an online opinion-piece
9,9741.json,True,Says GOP primary opponents Glenn Grothman and ...,"energy,message-machine-2014,voting-record",duey-stroebel,State representative,Wisconsin,republican,0.0,0.0,0.0,1.0,0.0,an online video
10,7115.json,True,"For the first time in history, the share of th...",elections,robert-menendez,U.S. Senator,New Jersey,democrat,1.0,3.0,1.0,3.0,0.0,a speech


In [21]:
len(df)
data.columns

4783

Index(['title', 'author', 'text', 'url', 'label', 'dataset'], dtype='object')

In [22]:
df2 = df[['speaker_title', 'speaker', 'text', 'context', 'label']]

In [23]:
df2.head()

,speaker_title,speaker,text,context,label
0,State representative,dwayne-bohac,Says the Annies List political group supports ...,a mailer,False
2,President,barack-obama,"Hillary Clinton agrees with John McCain ""by vo...",Denver,True
5,Wisconsin Assembly speaker,robin-vos,The Chicago Bears have had more starting quart...,a an online opinion-piece,True
9,State representative,duey-stroebel,Says GOP primary opponents Glenn Grothman and ...,an online video,True
10,U.S. Senator,robert-menendez,"For the first time in history, the share of th...",a speech,True


### Sample the Wang dataset
With around 5,000 rows (4,783 after filtering), this single dataset of short statements would be overrepresented in our data. We reduce it by randomly sampling 2,500 rows instead, 1,250 per label.
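
Because `sample` draws different rows on every run unless it is seeded, the subset is not reproducible as written. A minimal sketch of the same balanced draw with a fixed seed follows; the toy frame, `N_PER_CLASS`, and `SEED` are illustrative assumptions, not values from the notebook.

```python
import pandas as pd

# Toy stand-in for the filtered statements: 6 'true' rows and 4 'false' rows.
toy = pd.DataFrame(
    {
        "text": [f"statement {i}" for i in range(10)],
        "label": ["true"] * 6 + ["false"] * 4,
    }
)

N_PER_CLASS = 3  # the notebook draws 1250 per class
SEED = 0         # assumed value; any fixed integer makes the draw repeatable


def balanced_sample(df, n, seed):
    """Draw n rows per label with a fixed seed and stack them with a fresh index."""
    parts = [
        df[df["label"] == value].sample(n=n, random_state=seed)
        for value in ("false", "true")
    ]
    return pd.concat(parts, ignore_index=True)


balanced = balanced_sample(toy, N_PER_CLASS, SEED)
again = balanced_sample(toy, N_PER_CLASS, SEED)

print(balanced["label"].value_counts().to_dict())
print(balanced.equals(again))  # same seed, same rows
```

Fixing the seed matters here because the sampled rows end up in the saved CSV, so an unseeded rerun would silently produce a different file.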

In [24]:
df3 = df2[df2['label'] == 'false'].sample(n=1250)
df4 = df2[df2['label'] == 'true'].sample(n=1250)

# concatenate and reset index
df5 = pd.concat([df3, df4])
df5 = df5.reset_index(drop=True)

# Rename columns to the common layout:
# speaker_title -> title, speaker -> author, context -> url
df5.columns = ['title', 'author', 'text', 'url', 'label']

# Add in the dataset name for validation later
df5['dataset'] = 'Wang-PolitiFact'

df5 = df5.replace(['true', 'false'], ['real', 'fake'])

In [25]:
df5['label'].value_counts()

fake    1250
real    1250
Name: label, dtype: int64

### Combine datasets

In [30]:
df6 = pd.concat([data, df5])

# Reorder the columns
df6 = df6[['dataset', 'title', 'author', 'text', 'url', 'label']]

len(df6)
df6['label'].value_counts()
df6['dataset'].value_counts()

11442

real                         6175
fake                         5055
mixture of true and false     212
Name: label, dtype: int64

Risdal-McIntire    6335
Wang-PolitiFact    2500
Buzzfeed           1627
Rosas-Celebrity     500
Rosas-News          480
Name: dataset, dtype: int64

In [31]:
df6.head()

,dataset,title,author,text,url,label
0,Buzzfeed,The Impact of Debates? It's Debatable,GARY LANGER,With the Hillary Clinton-Donald Trump debates ...,http://abcnews.go.com/Politics/impact-debates-...,real
1,Buzzfeed,Details Emerge About NYC Bomb Suspect Ahmad Kh...,Brian Ross Rhonda Schwartz Mike Levine Stephan...,As police today captured the man wanted for qu...,http://abcnews.go.com/US/source-suspect-wanted...,real
2,Buzzfeed,Donald Trump Repeats Calls for Police Profilin...,ALANA ABRAMSON,One day after explosive devices were discovere...,http://abcnews.go.com/Politics/donald-trump-re...,real
3,Buzzfeed,"NY, NJ Bombings Suspect Charged With Attempted...",EMILY SHAPIRO Aaron Katersky Josh Margolin Mik...,"Ahmad Khan Rahami, earlier named a person of i...",http://abcnews.go.com/US/bombing-incidences-ny...,real
4,Buzzfeed,Trump Surrogates Push Narrative That Clinton S...,Candace Smith,Donald Trump's surrogates and leading supporte...,http://abcnews.go.com/Politics/trump-surrogate...,real


In [33]:
df6.to_csv('data_wang.csv', index=False)