# Intro

The goal of this EDA is to find extractable, useful features that exploit the known characteristics of fake news, and to uncover unexpected ones. There are tons of items to explore. Fun!

In [1]:
import pandas as pd
import numpy as np
import re
import time

from nltk.help import upenn_tagset
from nltk.tokenize import TreebankWordTokenizer

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag, RegexpParser
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from matplotlib import pyplot as plt
import seaborn as sns

def pline(word):
    print('\n===== ',word,' =====\n')
    
pd.options.display.max_colwidth = 200
pd.options.display.max_rows = 100

In [2]:
df0 = pd.read_csv('True.csv')
df1 = pd.read_csv('Fake.csv')

# Check data quality and contents
Provided data are already well organized, so let's just see what we have.

In [3]:
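# drop_duplicates() returns a new DataFrame; nothing is assigned here,
# so df0 and df1 stay unchanged and only the last expression is displayed
df0.drop_duplicates()
df1.drop_duplicates()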

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing,"Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former...",News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian Collusion Investigation,"House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia inve...",News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’,"On Friday, it was revealed that former Milwaukee Sheriff David Clarke, who was being considered for Homeland Security Secretary in Donald Trump s administration, has an email scandal of his own.In...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES),"On Christmas day, Donald Trump announced that he would be back to work the following day, but he is golfing for the fourth day in a row. The former reality show star blasted former President Bar...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump During His Christmas Speech,Pope Francis used his annual Christmas Day message to rebuke Donald Trump without even mentioning his name. The Pope delivered his message just days after members of the United Nations condemned T...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated US Sailors Well,"21st Century Wire says As 21WIRE reported earlier this week, the unlikely mishap of two US Naval vessels straying into Iranian waters just hours before the President s State of the Union speec...",Middle-east,"January 16, 2016"
23477,"JUSTICE? Yahoo Settles E-mail Privacy Class-action: $4M for Lawyers, $0 for Users","21st Century Wire says It s a familiar theme. Whenever there is a dispute or a change of law, and two tribes go to war, there is normally only one real winner after the tribulation the lawyers. A...",Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to Take Territorial Booty in Northern Syria,"Patrick Henningsen 21st Century WireRemember when the Obama Administration told the world how it hoped to identify 5,000 reliable non-jihadist moderate rebels hanging out in Turkey and Jordan, ...",Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America Finally Calls it Quits,21st Century Wire says Al Jazeera America will go down in history as one of the biggest failures in broadcast media history.Ever since the US and its allies began plotting to overthrow Libya and S...,Middle-east,"January 14, 2016"


In [4]:
%%script false --no-raise-error

print('\n===== Format =====\n')
print(df0.info())
print(df1.info())
print('\n  ---- Real, subject\n',df0.subject.value_counts())
print('\n  ---- Fake, subject\n',df1.subject.value_counts())
print('\n  ---- Real, date\n',df0.date.nunique())
print('\n  ---- Fake, date\n',df1.date.nunique())
print('\n  ---- Real, head\n',df0.head(5))
print('\n  ---- Fake, head\n',df1.head(5))

print('\n===== (Nearly) Empty contents =====\n')
print('\n  ---- Real, text <6 words\n')
print(df0[df0.text.str.split().str.len()<6].text.count())
try:
    print(df0[df0.text.str.split().str.len()<6].sample(20))
except ValueError:  # fewer than 20 matching rows, so sample(20) raises
    print(df0[df0.text.str.split().str.len()<6].head(20))

print('\n  ---- Fake, text <6 words\n')
print(df1[df1.text.str.split().str.len()<6].text.count())
try:
    print(df1[df1.text.str.split().str.len()<6].sample(20))
except ValueError:  # fewer than 20 matching rows, so sample(20) raises
    print(df1[df1.text.str.split().str.len()<6].head(20))
    
print('\n  ---- Real, title <3 words\n')
print(df0[df0.title.str.split().str.len()<3].title.count())
try:
    print(df0[df0.title.str.split().str.len()<3].sample(20))
except ValueError:  # fewer than 20 matching rows, so sample(20) raises
    print(df0[df0.title.str.split().str.len()<3].head(20))
    
print('\n  ---- Fake, title <3 words\n')
print(df1[df1.title.str.split().str.len()<3].title.count())
try:
    print(df1[df1.title.str.split().str.len()<3].sample(20))
except ValueError:  # fewer than 20 matching rows, so sample(20) raises
    print(df1[df1.title.str.split().str.len()<3].head(20))

print('\n===== Title, Real =====\n')
print(df0.title.sample(30))
print('\n===== Title, Fake =====\n')
print(df1.title.sample(30))
print('\n===== Text, Real =====\n')
print(df0.text.sample(1))
print('\n===== Text, Fake =====\n')
print(df1.text.sample(1))


print('\n===== Fake news samples of each subject =====\n')

for x in df1.subject.unique():
    print('\n  ----',x,' \n')
    print(df1[df1.subject==x].drop(['subject','date'], axis=1, inplace=False).sample(10))

## Observation

1. Data format
- About 21k real and 23k fake news articles.
- No duplicated rows and no NaN entries.

2. Title
- Every **real news** article has a title. These titles are **concise and informative summaries** of the main article, so the title alone is enough to guess the topic.
- **Fake news** articles have **longer** titles, but these are often **teasers**, i.e. you **can't tell what happened without clicking the title** and reading the contents.
- The **empty titles** of **fake news** (titles of fewer than 3 words) are mostly a **website address**. I don't know whether that's a **data-processing mistake**. I'll **discard** those rows because there are only a few of them (10 rows).


3. Text
- Only one real news article has empty text; it appears to consist of graphics only.
- **3.3% of fake news articles have no text**; they are mostly **videos**.


4. Other
- Unfortunately, this dataset doesn't contain the **author** name. An empty or fake author name could be a strong feature that yields high precision.
- An interesting observation about **names**: in real news they appear either as "title + full name" or as "last name" only, whereas in fake news they are sometimes **just full names without a title**. Extracting names and positions isn't trivial, so let's keep this as a note for now.
- In the selected examples, fake news cite **website addresses** or **social media accounts** as their **source** more often. Let's take a look.
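
The share of (nearly) empty texts quoted above can be computed directly instead of read off the printed counts. The following is a minimal sketch, assuming the same "fewer than 6 words" threshold used in the check cell; `short_share` and the toy frame are hypothetical names and data, not part of the dataset.

```python
import pandas as pd

def short_share(series: pd.Series, min_words: int) -> float:
    """Fraction of entries with fewer than `min_words` whitespace-separated tokens."""
    # Missing values are treated as empty strings, i.e. zero words
    n_words = series.fillna('').str.split().str.len()
    return float((n_words < min_words).mean())

# Hypothetical toy rows standing in for df1 (not taken from Fake.csv)
toy = pd.DataFrame({'text': ['',
                             'video only',
                             'a full article with more than six words in it',
                             None]})

share = short_share(toy['text'], 6)
print(f"{share:.1%} of rows have fewer than 6 words")
```

On the real data the call would look like `short_share(df1.text, 6)`.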

In [5]:
# drop rows with unreliable data
df1.drop(df1[df1.title.str.split().str.len()<3].index.tolist(), axis=0, inplace=True)

# Iterate cleaning and exploration

## Digital source

In [6]:
%%script false --no-raise-error

# check usage of digital source
print(df0[df0.text.str.contains('@.', regex= True, na=False)].text.count())
print(df1[df1.text.str.contains('@.', regex= True, na=False)].text.count())
print(df0[df0.text.str.contains('https?://.', regex= True, na=False)].text.count())
print(df1[df1.text.str.contains('https?://.', regex= True, na=False)].text.count())

# exclude articles about Twitter
print(df0[df0.text.str.contains('@.', regex= True, na=False) & (~df0.text.str.contains('Twitter', regex= False, na=False))].text.count())
print(df1[df1.text.str.contains('@.', regex= True, na=False) & (~df1.text.str.contains('Twitter', regex= False, na=False))].text.count())

#print(df0.iloc[31][1])

## Observation about digital source

**Fake news** cite **social media accounts (@...)** and **websites (http(s)://...)** as sources much **more often**.

1. @
- 282 real news and 6312 fake news articles contain a social media account.
- **Real news** cite social media accounts mostly when the **topic is a Twitter post**.
- After excluding articles that mention "Twitter", only 31 real news articles contain @, whereas 4038 fake news articles do.

2. http(s)://
- 3290 fake news articles contain a website address.
- None of the real news articles contain a website address.
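
The counting logic behind these numbers can be sketched without pandas. This is only an illustration on made-up sentences: the patterns `@\w+` and `https?://\S+` are assumptions that match a whole handle or address (the check cell uses the looser `@.` and `https?://.`), and `count_sources` is a hypothetical helper.

```python
import re

# Assumed patterns (not the notebook's exact ones): a handle is "@" plus word
# characters, a web address runs until the next whitespace.
HANDLE = re.compile(r'@\w+')
URL = re.compile(r'https?://\S+')

def count_sources(texts, skip_word=None):
    """Count texts that contain a handle or a URL.

    Texts containing `skip_word` are ignored when it is given.
    """
    kept = [t for t in texts if skip_word is None or skip_word not in t]
    return {
        'handle': sum(bool(HANDLE.search(t)) for t in kept),
        'url': sum(bool(URL.search(t)) for t in kept),
    }

# Hypothetical toy articles, not rows from True.csv or Fake.csv
toy_texts = [
    'The president wrote on Twitter: @realDonaldTrump thanks you',
    'Source: @someblog via http://example.com/story',
    'The committee will vote on the bill next week.',
]

all_counts = count_sources(toy_texts)
no_twitter = count_sources(toy_texts, skip_word='Twitter')
print(all_counts)
print(no_twitter)
```

Dropping the texts that mention "Twitter" removes the first toy article, which mirrors how the real-news count shrinks once Twitter stories are excluded.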

In [7]:
# replace digital sources
# match the whole handle / address; '@.' would swap only '@' plus one character
df0.replace(to_replace=r'@\w+', value='social_media_account', regex=True, inplace=True)
df1.replace(to_replace=r'@\w+', value='social_media_account', regex=True, inplace=True)
df0.replace(to_replace=r'https?://\S+', value='website_address', regex=True, inplace=True)
df1.replace(to_replace=r'https?://\S+', value='website_address', regex=True, inplace=True)

## Title and text sizes - before preprocessing

In [8]:
%%script false --no-raise-error
print("Average title length in number of words")
print("Real:", df0.title.str.split().str.len().mean(),"+-",df0.title.str.split().str.len().std())
print("Fake:", df1.title.str.split().str.len().mean(),"+-",df1.title.str.split().str.len().std(),"\n")

print("Average text length in number of words")
print("Real:", df0.text.str.split().str.len().mean(),"+-",df0.text.str.split().str.len().std())
print("Fake:", df1.text.str.split().str.len().mean(),"+-",df1.text.str.split().str.len().std(),"\n")

plt.hist(df0.title.str.split().str.len(), alpha=0.5, range=(0,50), bins=50)
plt.hist(df1.title.str.split().str.len(), alpha=0.5, range=(0,50), bins=50)
plt.title("Title word count")

plt.legend(['Real','Fake']) 
plt.show()

plt.hist(np.log10(df0.text.str.split().str.len()+1), alpha=0.5, range=(0,4), bins=50)
plt.hist(np.log10(df1.text.str.split().str.len()+1), alpha=0.5, range=(0,4), bins=50)
plt.title("Log10(Text word count)")

plt.legend(['Real','Fake'])
plt.show()

## Observation of raw text and title sizes
- I expected fake news to have much less text. However, in most cases real and fake news have a **similar amount of raw text**, although whether that text is informative is a different story.
- As we saw in the samples above, real news titles are shorter, with a smaller standard deviation, seemingly because **brevity** is required of a news title. **Long titles** can be a **useful feature** of fake news.
- Real news has a **bimodal text word count**. Probably there are two standard article lengths, and authors follow the word count for each strictly.
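
If title length is kept as a feature, it helps to store the word counts as columns once instead of recomputing `str.split().str.len()` in every expression. A minimal sketch, assuming whitespace tokenization as in the cell above; `add_length_features`, the column names and the toy rows are hypothetical.

```python
import pandas as pd

def add_length_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with whitespace-token counts for title and text."""
    out = df.copy()
    out['title_n_words'] = out['title'].str.split().str.len()
    out['text_n_words'] = out['text'].str.split().str.len()
    return out

# Hypothetical toy rows standing in for df0 (real) and df1 (fake)
toy_real = pd.DataFrame({
    'title': ['Senate passes tax bill', 'Court blocks travel ban'],
    'text': ['The Senate passed the bill on Friday.', 'A federal court blocked the ban.'],
})
toy_fake = pd.DataFrame({
    'title': ['WATCH: You Will Not Believe What This Senator Just Said On Live TV'],
    'text': ['Watch the video below.'],
})

both = pd.concat(
    [add_length_features(toy_real).assign(label='real'),
     add_length_features(toy_fake).assign(label='fake')],
    ignore_index=True,
)

# Mean and standard deviation of title length per class
summary = both.groupby('label')['title_n_words'].agg(['mean', 'std'])
print(summary)
```

With a single fake row the standard deviation is undefined (NaN); on the full frames both statistics are meaningful.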

## Frequent words - before preprocessing

In [9]:
%%script false --no-raise-error

# Word count without text processing, in case frequent keywords are not alphabetic, etc.
title0 = df0.sample(1000).title.tolist()
title1 = df1.sample(1000).title.tolist()
text0 = df0.sample(1000).text.tolist()
text1 = df1.sample(1000).text.tolist()

def text2words(text):
    
    words_list = []
    
    for sentence in text:
    
        words = word_tokenize(sentence)
    
        for word in words:
            words_list.append(word)

    return words_list

title0 = text2words(title0)
title1 = text2words(title1)
text0 = text2words(text0)
text1 = text2words(text1)

In [10]:
%%script false --no-raise-error

word_counter = Counter(title0)
print('\n===== Title, Real =====\n')
print(word_counter.most_common(100))
print('\n===== Title, Fake =====\n')
word_counter = Counter(title1)
print(word_counter.most_common(100))
print('\n===== Text, Real =====\n')
word_counter = Counter(text0)
print(word_counter.most_common(100))
print('\n===== Text, Fake =====\n')
word_counter = Counter(text1)
print(word_counter.most_common(100))
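
The nested loop in `text2words` followed by `Counter` can be collapsed into a single generator expression. The sketch below is an assumption-laden variant: it uses `str.split` instead of `word_tokenize` so that it runs without NLTK data, which means punctuation stays attached to words; `count_words` and the toy titles are hypothetical.

```python
from collections import Counter

def count_words(texts):
    """Count whitespace-separated tokens over an iterable of strings."""
    return Counter(word for text in texts for word in text.split())

# Hypothetical titles; str.split keeps punctuation attached to the word,
# unlike word_tokenize, which would split it off.
toy_titles = [
    'Trump says tax bill will pass',
    'Trump signs tax bill',
    'Senate delays vote',
]

title_counts = count_words(toy_titles)
print(title_counts.most_common(3))
```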

# Text preprocessing

In [11]:
#stop_words = set(stopwords.words('english'))
#print(stop_words,len(stop_words))

#stop_words.add('can')
#stop_words.add('could')
#stop_words.add('will')
#stop_words.add('would')
#stop_words.add('must')
#stop_words.add('might')

In [12]:
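def convert_pos(pos_only):
    """Map a Penn Treebank tag to the WordNet POS code used by the lemmatizer."""
    # Only the first two characters matter (e.g. 'VBD' -> 'VB');
    # every other tag is treated as a noun.
    tag_map = {'JJ': 'a', 'NN': 'n', 'RB': 'r', 'VB': 'v'}
    try:
        return tag_map.get(pos_only[:2], 'n')
    except TypeError:  # tag is not a string, e.g. None
        return 'n'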


def preprocess_text(corpus):
    
    word_list = []
    pos_list = []
    keypos_list = []
    test_list =[]
    
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    
    #shortword = re.compile(r'\W*\b\w{1,2}\b')
    
    # tokenize sentence
    sentences = sent_tokenize(corpus)
 
    for sentence in sentences:
        
        # cleaning characters (raw strings avoid invalid-escape warnings)
        #sentence = re.sub('\.\W',' ',sentence)
        sentence = re.sub(r'\.','',sentence) # drop periods so abbreviations stay one token, e.g. U.S. -> US
        #sentence = re.sub('[U]\.?[S]\.?[A]?\.?','usa',sentence)        
        sentence = re.sub(r'\bUS\b','USA',sentence) # whole word only, so USA does not become USAA
        sentence = re.sub(r'\w+\*+\w+','swear_tagged',sentence) # e.g. f*ck
        sentence = re.sub(r'\W+',' ',sentence)

        # lower
        sentence = sentence.lower()
        sentence = re.sub('reuters',' ',sentence)
        sentence = re.sub(r'\w*video\w*','video',sentence)
        
        # cleaning short words, rarely have meaning
        #shortword.sub('', sentence)
        
        #sentence = sentence[:-1]
    
        # word tokenize 
        words = word_tokenize(sentence)
        
        for word in words:
            
            pos = pos_tag([word])  # tagged one word at a time, so the tagger sees no sentence context
            test_list.append(pos)
            
            pos_only = pos[0][1]#[:2]
            pos_list.append(pos_only)
            
            
            if word not in stop_words:
                word_list.append(lemmatizer.lemmatize(word, pos=convert_pos(pos_only)))
                keypos_list.append(pos_only)

    return ' '.join(word_list)+' ', ' '.join(pos_list)+' ', ' '.join(keypos_list)+' '#, test_list, corpus

#print(preprocess_text(df1.iloc[12][0]))
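
The character-cleaning steps inside `preprocess_text` are easiest to check in isolation. Here is a minimal sketch of just those regex steps, run on a made-up sentence. It assumes a whole-word match for the `US` to `USA` step (a plain `'US'` pattern also hits the first two letters of `USA`); `clean_sentence` is a hypothetical helper and leaves out tokenizing, tagging and lemmatizing.

```python
import re

def clean_sentence(sentence: str) -> str:
    """Apply the cleaning regexes in the same order as preprocess_text."""
    sentence = re.sub(r'\.', '', sentence)                     # "U.S." -> "US"
    sentence = re.sub(r'\bUS\b', 'USA', sentence)              # assumption: whole word only
    sentence = re.sub(r'\w+\*+\w+', 'swear_tagged', sentence)  # e.g. f*ck
    sentence = re.sub(r'\W+', ' ', sentence)                   # punctuation -> space
    sentence = sentence.lower()
    sentence = re.sub(r'reuters', ' ', sentence)               # drop the agency tag
    sentence = re.sub(r'\w*video\w*', 'video', sentence)       # "#videos" -> "video"
    return sentence

# Hypothetical input, written to exercise every step except the swear tag
raw = 'WASHINGTON (Reuters) - The U.S. Senate watched the #videos, USA TODAY says.'
cleaned = clean_sentence(raw)
print(cleaned.split())

# What the unanchored pattern does to a string that already reads "USA"
naive = re.sub('US', 'USA', 'USA')
print(naive)
```

Since `_` counts as a word character, placeholder tokens such as `swear_tagged` survive the `\W+` step intact.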

In [13]:
#n_sample0 = 10*df1.subject.nunique()
#n_sample1 = 10*df0.subject.nunique()
#print(n_sample0,n_sample1)
#df0 = pd.concat([df0[df0.subject==x].sample(n=n_sample0) for x in df0.subject.unique().tolist()])
#df1 = pd.concat([df1[df1.subject==x].sample(n=n_sample1) for x in df1.subject.unique().tolist()])

df0 = df0.sample(n=20000) #20,000
df1 = df1.sample(n=20000)  #20,000

In [14]:
# Takes time 
# 150s per n=1000, expect 1h for n=20k

start = time.time()
df0[['title_lm','title_pos','title_lmpos']] = df0.apply(lambda x: preprocess_text(x.title), axis=1, result_type='expand')
print("time :", time.time() - start)

start = time.time()
df1[['title_lm','title_pos','title_lmpos']] = df1.apply(lambda x: preprocess_text(x.title), axis=1, result_type='expand')
print("time :", time.time() - start)

start = time.time()
df0[['text_lm','text_pos','text_lmpos']] = df0.apply(lambda x: preprocess_text(x.text), axis=1, result_type='expand')
print("time :", time.time() - start)

start = time.time()
df1[['text_lm','text_pos','text_lmpos']] = df1.apply(lambda x: preprocess_text(x.text), axis=1, result_type='expand')
print("time :", time.time() - start)

time : 51.010501861572266
time : 79.71051287651062
time : 1342.7784399986267
time : 1497.703408241272


In [15]:
%%script false --no-raise-error
lst = []

print('=======Real=========')
for x in df0.subject.unique().tolist():
    print('\n',x,'\n')
    print(df0[df0.subject==x].head(10).title)
    print(df0[df0.subject==x].head(10).title_lm)
    
    word_counter = Counter(df0[df0.subject==x].title_lm.sum().split())
    lst.append(['Real',x,word_counter])
    
print('=======Fake=========')    
for x in df1.subject.unique().tolist():
    print('\n',x,'\n')
    print(df1[df1.subject==x].head(10).title)
    print(df1[df1.subject==x].head(10).title_lm)
    
    word_counter = Counter(df1[df1.subject==x].title_lm.sum().split())
    lst.append(['Fake',x,word_counter])
    
for i in range(len(lst)):
    
    pline(lst[i][0])
    pline(lst[i][1])
    print(lst[i][2].most_common(20))

lst = []


word_counter = Counter(df0.title_lm.sum().split())
lst.append(['Real','title',word_counter.most_common(30)])
word_counter = Counter(df1.title_lm.sum().split())
lst.append(['Fake','title',word_counter.most_common(30)])
word_counter = Counter(df0.text_lm.sum().split())
lst.append(['Real','text',word_counter.most_common(30)])
word_counter = Counter(df1.text_lm.sum().split())
lst.append(['Fake','text',word_counter.most_common(30)])

for i in range(len(lst)):
    pline(lst[i][0]+', '+lst[i][1])
    print(lst[i][2])

## Findings of fake news characteristics from processed text

- **"Video", "Watch", or "Image"** in the **title**, probably to get readers to click.
- **First names** appear in the title and are used often in the text.
- Some words closely related to **discrimination based on demography** appear often; I'll take a closer look later.
- **"Boiler, room ep, sunday, episode"** appear in several irrelevant subject categories, so I don't know if I can trust the subject labels here. They seem to be related to **TV shows**.
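
The first finding translates into a simple boolean feature. A minimal sketch, assuming a case-insensitive whole-word match on the three keywords plus their plurals; the keyword list, `CLICK_WORDS` and `has_click_word` are illustrative choices, and the two titles are taken from the fake-news sample shown earlier.

```python
import re

# Assumed keyword list, taken from the finding above; plural forms are included
CLICK_WORDS = re.compile(r'\b(?:videos?|watch|images?)\b', re.IGNORECASE)

def has_click_word(title: str) -> bool:
    """True if the title contains one of the click-inviting keywords."""
    return bool(CLICK_WORDS.search(title))

titles = [
    'Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES)',
    'Pope Francis Just Called Out Donald Trump During His Christmas Speech',
]

flags = [has_click_word(t) for t in titles]
print(flags)
```

On a DataFrame, the same pattern could be applied with `df1.title.str.contains(CLICK_WORDS.pattern, case=False, regex=True)`.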

In [17]:
%%script false --no-raise-error

df0.to_csv('df0.csv',index=False)
df1.to_csv('df1.csv',index=False)