### Load B.S. Detector "Fake news" dataset

We will follow the same basic approach we used for the "Liar, Liar" dataset.

In [2]:
import pandas as pd

bs_raw = pd.read_csv("./fake-news/fake.csv")

bs_raw.head()

Unnamed: 0,uuid,ord_in_thread,author,published,title,text,language,crawled,site_url,country,domain_rank,thread_title,spam_score,main_img_url,replies_count,participants_count,likes,comments,shares,type
0,6a175f46bcd24d39b3e962ad0f29936721db70db,0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,english,2016-10-27T01:49:27.168+03:00,100percentfedup.com,US,25689.0,Muslims BUSTED: They Stole Millions In Gov’t B...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
1,2bdc29d12605ef9cf3f09f9875040a7113be5d5b,0,reasoning with facts,2016-10-29T08:47:11.259+03:00,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,english,2016-10-29T08:47:11.259+03:00,100percentfedup.com,US,25689.0,Re: Why Did Attorney General Loretta Lynch Ple...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
2,c70e149fdd53de5e61c29281100b9de0ed268bc3,0,Barracuda Brigade,2016-10-31T01:41:49.479+02:00,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,english,2016-10-31T01:41:49.479+02:00,100percentfedup.com,US,25689.0,BREAKING: Weiner Cooperating With FBI On Hilla...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
3,7cf7c15731ac2a116dd7f629bd57ea468ed70284,0,Fed Up,2016-11-01T05:22:00.000+02:00,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,english,2016-11-01T15:46:26.304+02:00,100percentfedup.com,US,25689.0,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,0.068,http://100percentfedup.com/wp-content/uploads/...,0,0,0,0,0,bias
4,0206b54719c7e241ffe0ad4315b808290dbe6c0f,0,Fed Up,2016-11-01T21:56:00.000+02:00,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,english,2016-11-01T23:59:42.266+02:00,100percentfedup.com,US,25689.0,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,0.865,http://100percentfedup.com/wp-content/uploads/...,0,0,0,0,0,bias


In [25]:
len(bs_raw)

12999

In [52]:
bs_raw['language'].value_counts()

english       12403
russian         203
spanish         172
german          111
french           38
arabic           22
portuguese       11
turkish          10
italian           9
ignore            7
greek             4
norwegian         3
dutch             3
chinese           1
polish            1
finnish           1
Name: language, dtype: int64

Immediately we see a problem with using this dataset as planned: it is made up entirely of "fake news," with no "real news" examples. It may be suitable for unsupervised learning, but on its own it cannot be used to train a supervised model, which needs examples of both classes. We will still load it in and tokenize it, and decide later whether and how to use it.

We also have a lot of missing values. If we want to preserve the matching between headlines and text, we'll have to drop records where either is missing. 
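The extent of the missing data can be checked directly before dropping anything. The following is a minimal sketch on a small made-up frame: `toy` is a hypothetical stand-in for `bs_raw`, and its rows are invented for illustration. On the real data the same calls would be applied to `bs_raw`.

```python
import pandas as pd

# Hypothetical stand-in for bs_raw; the rows are invented for illustration.
toy = pd.DataFrame({
    "title": ["Headline A", None, "Headline C", "Headline D"],
    "text": ["Body A", "Body B", None, "Body D"],
    "language": ["english", "english", "english", "german"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that survive when both title and text are required.
complete = toy[["title", "text"]].dropna()
print(len(toy), "rows before,", len(complete), "rows after dropna()")
```

Comparing the row counts before and after `dropna()` shows how much data the headline/text pairing costs.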

There are also some examples that are not in English. These may confuse our model, so we will keep only the English-language examples, which make up the vast majority of the data.

In [60]:
bs_english = bs_raw.loc[bs_raw["language"] == "english"]
bs_keep = bs_english[["title","text"]].dropna()


In [62]:
bs_keep.head()

Unnamed: 0,title,text
0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...


In [63]:
bs_headlines = list(bs_keep["title"])
bs_text = list(bs_keep["text"])

In [64]:
print(len(bs_headlines), len(bs_text))

11677 11677


After removing missing data and non-English examples, we have 11,677 remaining pairs of headlines and text, down from 12,999 records in the raw file (12,403 of them in English).

As before, let's look at some random examples to familiarize ourselves with the data.

In [74]:
# Again, look at some random examples
import random
# randrange excludes the upper bound, so every index from 0 to len - 1 is valid
ex_num = random.randrange(len(bs_headlines))

print(bs_headlines[ex_num]) 
print(bs_text[ex_num])

Explosive Audio Surfaces Of GOP Senator Joking About Gun Owners Shooting Clinton
By Sean Colarossi on Mon, Oct 31st, 2016 at 7:56 pm CNN has obtained audio of North Carolina GOP Senator Richard Burr joking about gun owners putting a "bullseye" on Hillary Clinton. Share on Twitter Print This Post 
CNN has obtained brand new audio of North Carolina Republican Senator Richard Burr joking about gun owners putting a “bullseye” on Hillary Clinton. 
The despicable comments come as Burr faces a close Senate race against Democratic opponent Deborah Ross. Both candidates are within a point of each other, according to RealClearPolitics. 
More of the CNN report : 
The North Carolina Republican, locked in a tight race for reelection, quipped that as he walked into a gun shop “nothing made me feel better” than seeing a magazine about rifles “with a picture of Hillary Clinton on the front of it.” 
“I was a little bit shocked at that — it didn’t have a bullseye on it,” he said Saturday to GOP voluntee

Some of these text examples are very long, much longer than the statements in the "Liar, Liar" dataset. There are also some that are very short.
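One way to put numbers on this spread is to compare word counts. The sketch below uses only the standard library and a hypothetical `sample_texts` list with invented strings. On the real data, `bs_text` would take its place.

```python
import statistics

# Hypothetical stand-in for bs_text; these strings are invented.
sample_texts = [
    "Short note.",
    "A somewhat longer body of text with several more words in it.",
    "word " * 500,
]

# Approximate the length of each text by its whitespace-delimited word count.
word_counts = [len(text.split()) for text in sample_texts]

print("min:", min(word_counts))
print("median:", statistics.median(word_counts))
print("max:", max(word_counts))
```

A large gap between the median and the maximum would suggest truncating or bucketing the longest texts before modeling.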

Finally, let's tokenize our headlines and text, and look at a tokenized example.

In [75]:
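def tokenize(lines, token='word'):
    """Split each line into word or character tokens."""
    if token == 'word':
        # Splits on single spaces only, so newlines stay attached to tokens
        return [line.split(' ') for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        # Fail loudly instead of printing a message and returning None
        raise ValueError('unknown token type ' + token)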
        
bs_head_tokens = tokenize(bs_headlines, token = 'word')
bs_text_tokens = tokenize(bs_text, token = 'word')
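Because `tokenize` uses `line.split(' ')`, it splits on single spaces only: a newline stays glued to the following word (for example `'\nCNN'`), and consecutive spaces produce empty tokens. If that is not wanted, one possible variant is to call `str.split()` with no argument, which splits on any run of whitespace. `tokenize_whitespace` and `sample` below are hypothetical names introduced for illustration and are not part of the notebook.

```python
def tokenize_whitespace(lines):
    """Split each line on any run of whitespace (spaces, tabs, newlines)."""
    return [line.split() for line in lines]

# Invented example that mirrors the newline issue in the article text.
sample = ["Share on Twitter \nCNN has  obtained audio"]

# Single-space split: keeps '\nCNN' and yields an empty token for the double space.
print(sample[0].split(' '))

# Whitespace-agnostic split: clean word tokens.
print(tokenize_whitespace(sample)[0])
```

This still leaves punctuation attached to words (such as `'Clinton.'`), so it is only a partial cleanup.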

In [76]:
bs_head_tokens[ex_num]

['Explosive',
 'Audio',
 'Surfaces',
 'Of',
 'GOP',
 'Senator',
 'Joking',
 'About',
 'Gun',
 'Owners',
 'Shooting',
 'Clinton']

In [77]:
bs_text_tokens[ex_num]

['By',
 'Sean',
 'Colarossi',
 'on',
 'Mon,',
 'Oct',
 '31st,',
 '2016',
 'at',
 '7:56',
 'pm',
 'CNN',
 'has',
 'obtained',
 'audio',
 'of',
 'North',
 'Carolina',
 'GOP',
 'Senator',
 'Richard',
 'Burr',
 'joking',
 'about',
 'gun',
 'owners',
 'putting',
 'a',
 '"bullseye"',
 'on',
 'Hillary',
 'Clinton.',
 'Share',
 'on',
 'Twitter',
 'Print',
 'This',
 'Post',
 '\nCNN',
 'has',
 'obtained',
 'brand',
 'new',
 'audio',
 'of',
 'North',
 'Carolina',
 'Republican',
 'Senator',
 'Richard',
 'Burr',
 'joking',
 'about',
 'gun',
 'owners',
 'putting',
 'a',
 '“bullseye”',
 'on',
 'Hillary',
 'Clinton.',
 '\nThe',
 'despicable',
 'comments',
 'come',
 'as',
 'Burr',
 'faces',
 'a',
 'close',
 'Senate',
 'race',
 'against',
 'Democratic',
 'opponent',
 'Deborah',
 'Ross.',
 'Both',
 'candidates',
 'are',
 'within',
 'a',
 'point',
 'of',
 'each',
 'other,',
 'according',
 'to',
 'RealClearPolitics.',
 '\nMore',
 'of',
 'the',
 'CNN',
 'report',
 ':',
 '\nThe',
 'North',
 'Carolina',
 'Rep