In [1]:
import pandas as pd

In [2]:
news_df = pd.read_csv('./fake.csv')

# What is the structure of the data frame?

In [3]:
news_df.head(2)

Unnamed: 0,uuid,ord_in_thread,author,published,title,text,language,crawled,site_url,country,domain_rank,thread_title,spam_score,main_img_url,replies_count,participants_count,likes,comments,shares,type
0,6a175f46bcd24d39b3e962ad0f29936721db70db,0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,english,2016-10-27T01:49:27.168+03:00,100percentfedup.com,US,25689.0,Muslims BUSTED: They Stole Millions In Gov’t B...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
1,2bdc29d12605ef9cf3f09f9875040a7113be5d5b,0,reasoning with facts,2016-10-29T08:47:11.259+03:00,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,english,2016-10-29T08:47:11.259+03:00,100percentfedup.com,US,25689.0,Re: Why Did Attorney General Loretta Lynch Ple...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias


In [4]:
news_df.describe()

Unnamed: 0,ord_in_thread,domain_rank,spam_score,replies_count,participants_count,likes,comments,shares
count,12999.0,8776.0,12999.0,12999.0,12999.0,12999.0,12999.0,12999.0
mean,0.89153,38092.996582,0.026122,1.383183,1.727518,10.831833,0.038311,10.831833
std,6.486822,26825.487454,0.122889,9.656838,6.884239,79.798949,0.827335,79.798949
min,0.0,486.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,17423.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,0.0,34478.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,0.0,60570.0,0.0,0.0,1.0,0.0,0.0,0.0
max,100.0,98679.0,1.0,309.0,240.0,988.0,65.0,988.0


In [5]:
news_df.columns

Index([u'uuid', u'ord_in_thread', u'author', u'published', u'title', u'text',
       u'language', u'crawled', u'site_url', u'country', u'domain_rank',
       u'thread_title', u'spam_score', u'main_img_url', u'replies_count',
       u'participants_count', u'likes', u'comments', u'shares', u'type'],
      dtype='object')

# What labels are used to categorize news?

In [6]:
news_df['type'].unique()

array(['bias', 'conspiracy', 'fake', 'bs', 'satire', 'hate', 'junksci',
       'state'], dtype=object)

In [7]:
news_df.groupby(['type'])['type'].count()

type
bias            443
bs            11492
conspiracy      430
fake             19
hate            246
junksci         102
satire          146
state           121
Name: type, dtype: int64

None of the news items in this dataset are labelled "credible", so it looks like we need to supplement it with news from credible sources. We also still need to find out what the "state" label means.
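To put a number on how lopsided the labels are, here is a minimal sketch that rebuilds the counts from the output above and turns them into shares. The names `type_counts` and `type_share` are introduced here for illustration only; on the real frame, `news_df['type'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Label counts copied from the groupby output above.
type_counts = pd.Series(
    {'bias': 443, 'bs': 11492, 'conspiracy': 430, 'fake': 19,
     'hate': 246, 'junksci': 102, 'satire': 146, 'state': 121},
    name='type')

# Fraction of all rows carrying each label, largest first.
type_share = (type_counts / type_counts.sum()).sort_values(ascending=False)
print(type_share.round(3))
```

The "bs" label alone covers the large majority of rows, so any classifier trained on these labels would be dominated by that one class.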

In [8]:
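non_desc_mask = news_df['type'].isnull()
# sum() counts the True entries, i.e. rows with a missing type;
# count() would only return the number of entries in the mask (12999).
non_desc_mask.sum()

0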

In [9]:
news_df[non_desc_mask]

Unnamed: 0,uuid,ord_in_thread,author,published,title,text,language,crawled,site_url,country,domain_rank,thread_title,spam_score,main_img_url,replies_count,participants_count,likes,comments,shares,type
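
A boolean mask answers two different questions depending on the method called on it: `count()` reports how many entries the mask has (non-null ones), while `sum()` reports how many of them are True. A small sketch on a made-up label column (the data and the names `labels`, `n_entries`, `n_missing` are hypothetical):

```python
import pandas as pd

# Toy label column with one missing value (hypothetical data).
labels = pd.Series(['bias', None, 'fake', 'bs'])
mask = labels.isnull()

n_entries = mask.count()     # entries in the mask itself, not the missing ones
n_missing = int(mask.sum())  # True values, i.e. rows with a missing label

print(n_entries, n_missing)
```

Only the `sum()` figure says whether any label is actually missing.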


In [10]:
news_df.count()

uuid                  12999
ord_in_thread         12999
author                10575
published             12999
title                 12319
text                  12953
language              12999
crawled               12999
site_url              12999
country               12823
domain_rank            8776
thread_title          12987
spam_score            12999
main_img_url           9356
replies_count         12999
participants_count    12999
likes                 12999
comments              12999
shares                12999
type                  12999
dtype: int64
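
The non-null counts above can also be read the other way round, as missing values per column. A minimal sketch on a made-up miniature frame (`toy_df` and its values are hypothetical stand-ins for `news_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for news_df.
toy_df = pd.DataFrame({
    'author': ['a', None, None, 'd'],
    'domain_rank': [25689.0, np.nan, 486.0, 98679.0],
    'type': ['bias', 'bs', 'bs', 'fake'],
})

# Missing values per column, most incomplete column first.
missing = toy_df.isnull().sum().sort_values(ascending=False)
# The same information as a fraction of all rows.
missing_share = missing / len(toy_df)

print(missing)
```

Applied to the real frame, the same two lines would surface `domain_rank`, `main_img_url` and `author` as the columns with the most gaps, matching the counts listed above.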

There is no real news in this data set, which is a problem: we need "real" news to classify against. It is also unclear how the different types of "fake" news were assigned, and the definitions themselves are not clear. If we don't want to scrape data from scratch, it's probably best to find a data set that has both real and fake news. Such a thing exists:

- https://github.com/GeorgeMcIntire/fake_real_news_dataset
- https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/#

Let me download that dataset and load it.

In [11]:
fake_real_df = pd.read_csv('./fake_or_real_news.csv')

In [12]:
fake_real_df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
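
The `Unnamed: 0` column is what pandas produces when a CSV has a header-less first column. Assuming that column is just a saved row index (the source does not say what it is), it can be read as the index instead. A sketch using an inline CSV built from two abbreviated rows of the output above; `csv_text`, `default_df` and `indexed_df` are names introduced for illustration:

```python
import io
import pandas as pd

# Inline stand-in for a CSV whose first column has no header
# (rows abbreviated from the head() output above).
csv_text = u""",title,text,label
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
"""

# Default behaviour: the header-less column becomes 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```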


In [14]:
# Parenthesized print works on both Python 2 and Python 3.
print(fake_real_df.columns)
print(news_df.columns)

Index([u'Unnamed: 0', u'title', u'text', u'label'], dtype='object')
Index([u'uuid', u'ord_in_thread', u'author', u'published', u'title', u'text',
       u'language', u'crawled', u'site_url', u'country', u'domain_rank',
       u'thread_title', u'spam_score', u'main_img_url', u'replies_count',
       u'participants_count', u'likes', u'comments', u'shares', u'type'],
      dtype='object')


In [16]:
# .copy() makes new_df an independent frame rather than a view of news_df.
new_df = news_df[['title','text','type']].copy()

In [19]:
# apply() returns a new Series; new_df itself is left unchanged here.
new_df['type'].apply(lambda x: 'FAKE')

0        FAKE
1        FAKE
2        FAKE
3        FAKE
4        FAKE
5        FAKE
6        FAKE
7        FAKE
8        FAKE
9        FAKE
10       FAKE
11       FAKE
12       FAKE
13       FAKE
14       FAKE
15       FAKE
16       FAKE
17       FAKE
18       FAKE
19       FAKE
20       FAKE
21       FAKE
22       FAKE
23       FAKE
24       FAKE
25       FAKE
26       FAKE
27       FAKE
28       FAKE
29       FAKE
         ... 
12969    FAKE
12970    FAKE
12971    FAKE
12972    FAKE
12973    FAKE
12974    FAKE
12975    FAKE
12976    FAKE
12977    FAKE
12978    FAKE
12979    FAKE
12980    FAKE
12981    FAKE
12982    FAKE
12983    FAKE
12984    FAKE
12985    FAKE
12986    FAKE
12987    FAKE
12988    FAKE
12989    FAKE
12990    FAKE
12991    FAKE
12992    FAKE
12993    FAKE
12994    FAKE
12995    FAKE
12996    FAKE
12997    FAKE
12998    FAKE
Name: type, Length: 12999, dtype: object

In [20]:
new_df.columns

Index([u'title', u'text', u'type'], dtype='object')

In [21]:
fake_real_df.columns

Index([u'Unnamed: 0', u'title', u'text', u'label'], dtype='object')
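
The two frames now differ only in the label column name and the leftover `Unnamed: 0` column. One possible way to line them up and stack them is sketched below on made-up miniature frames. The names `news_toy`, `fake_real_toy`, `fake_part`, `mixed_part` and `combined` are hypothetical, and treating every row of the first data set as FAKE (including types such as "satire" and "state") is an assumption, not something the data confirms.

```python
import pandas as pd

# Hypothetical miniature stand-ins for news_df and fake_real_df.
news_toy = pd.DataFrame({
    'title': ['t1', 't2'],
    'text': ['body 1', 'body 2'],
    'type': ['bias', 'bs'],
})
fake_real_toy = pd.DataFrame({
    'Unnamed: 0': [8476, 3608],
    'title': ['t3', 't4'],
    'text': ['body 3', 'body 4'],
    'label': ['FAKE', 'REAL'],
})

# Keep the shared columns and add a coarse label
# (assumption: every row of the first data set counts as FAKE).
fake_part = news_toy[['title', 'text']].copy()
fake_part['label'] = 'FAKE'

# Keep only the columns both frames share.
mixed_part = fake_real_toy[['title', 'text', 'label']]

# Stack the two parts and renumber the rows.
combined = pd.concat([fake_part, mixed_part], ignore_index=True)

print(combined['label'].value_counts())
```

Assigning the constant directly avoids the `apply` call, whose result would otherwise have to be written back to the frame.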