# Intro

The goal of this project is to develop **fake news filtering software**.
To build such **filters**, I **used known features** of fake news as well as **newly identified features** of fake news found through **exploratory data analysis (EDA) with natural language processing**.

# Data

The dataset is provided by the Information Security and Object Technology (ISOT) research lab at the University of Victoria.
https://www.uvic.ca/engineering/ece/isot/datasets/fake-news/index.php

The dataset contains 21k real news articles scraped from "Reuters.com" and 24k fake news articles collected from different sources, all of which were flagged as unreliable by PolitiFact (a fact-checking organization in the USA) and Wikipedia. The articles cover a variety of topics, but most are about politics.

The dataset comes as two files, **True.csv** (Reuters news) and **Fake.csv** (unreliable news). Both have **title, text, subject, and publication date** columns.
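
For modeling, the two files need to end up in one table with a class label. Below is a minimal sketch of one way to do that. The `label` column, its coding (0 = real, 1 = fake), and the tiny in-memory stand-ins for the CSV files are my assumptions for illustration, and the rows are made up.

```python
import io
import pandas as pd

# Tiny made-up stand-ins for True.csv and Fake.csv (same four columns)
true_csv = io.StringIO(
    "title,text,subject,date\n"
    "Sample real headline,Sample real body.,politics,2017-12-31\n"
)
fake_csv = io.StringIO(
    "title,text,subject,date\n"
    "Sample fake headline,Sample fake body.,politics,2017-12-31\n"
    "Another fake headline,Another fake body.,politics,2017-12-30\n"
)

# Assumed label convention: 0 = real, 1 = fake
real = pd.read_csv(true_csv).assign(label=0)
fake = pd.read_csv(fake_csv).assign(label=1)

# Stack both frames and rebuild the index so row numbers are unique
news = pd.concat([real, fake], ignore_index=True)
print(news[["title", "label"]])
```

With the real files, the `io.StringIO` objects would be replaced by the paths `'True.csv'` and `'Fake.csv'`.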

# Known characteristics of fake news

I grouped the criteria for judging the credibility of news into three categories.

1. Information-wise

    1. Incomplete information
        
        1. Lacks necessary information, such as the 5W1H
        2. Lacks context

    2. Not actually news
    
        1. Outdated
        
    3. Not valuable
    
        1. Not socially impactful/important
        2. Not a rare event
        3. Unrelated to the area the news provider covers
        
2. Tone

    1. Doesn't sound professional
    
        1. Contains slang
        2. Vocabulary is not specific
    
    2. Hateful
    
        1. Reinforces bias or discrimination
        2. Provocative
        
    3. Urgent
    
        1. Urges readers to spread the news as widely as possible
        2. Urges readers to act promptly
        
    4. Joke (or pretends to be a joke)
    
        1. Makes fun of a person/organization/policy
        
    5. Clickbait

        1. The title contains any of the above

3. Source-wise

    1. Author
    
        1. The author's name cannot be found
        2. The author is fake
        3. The author is not a reliable person/organization
        
    2. Media/Publishing organization
    
        1. The outlet is unreliable or fishy

    3. Supporting evidence
    
        1. The evidence supporting the news is inadequate
        2. Not provided by a relevant expert or organization
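
Some of the tone criteria can be turned into simple numeric features. The sketch below is one rough way to do it, not a tested feature set: the function name `tone_features`, the word list `URGENT_WORDS`, and the three signals (share of all-caps words, exclamation marks, urgency words) are my assumptions for illustration.

```python
import re

# Hypothetical urgency cues; a real list would need to be curated
URGENT_WORDS = {"breaking", "urgent", "share", "now", "immediately"}

def tone_features(title):
    """Return a few rough tone signals for a news title."""
    words = re.findall(r"[A-Za-z']+", title)
    n_words = max(len(words), 1)  # avoid division by zero on empty titles
    # Words written fully in capitals (length > 1 so "A" and "I" don't count)
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    urgent = [w for w in words if w.lower() in URGENT_WORDS]
    return {
        "caps_ratio": len(caps) / n_words,
        "exclamations": title.count("!"),
        "urgent_words": len(urgent),
    }

# A made-up clickbait-style title versus a plain, wire-style one
print(tone_features("BREAKING: Share this NOW before it gets deleted!!!"))
print(tone_features("White House, Congress prepare for talks on spending"))
```

Each signal maps to one item in the list (provocative tone, urgency, clickbait titles), so the features stay easy to interpret.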

# EDA

The goal of this EDA is to find extractable, useful features that capture the known characteristics of fake news, and to discover unexpected features of fake news.

In [10]:
import pandas as pd
import numpy as np
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag, RegexpParser
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB 

In [11]:
df0 = pd.read_csv('True.csv')
df1 = pd.read_csv('Fake.csv')

In [16]:
print(df0.info())
print(df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None


### Observation 1
- Unfortunately, this dataset doesn't contain the authors' names.
- There are no missing (N/A) entries.
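
One caveat: `info()` only counts null values, so an empty or whitespace-only string still shows up as non-null. Below is a minimal sketch of an extra check for that case. The frame `df` and its rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real data
df = pd.DataFrame({
    "title": ["Some headline", "Another headline", "Third headline"],
    "text": ["Body of the article.", "   ", ""],
})

# isna() does not flag empty or whitespace-only strings,
# so strip whitespace and compare against the empty string instead
blank_text = df["text"].str.strip().eq("")

print(df["text"].isna().sum())  # number of null entries
print(blank_text.sum())         # number of blank-but-not-null entries
```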

In [26]:
pd.options.display.max_colwidth = 200
print(df0.head(20))
print(df1.head(20))

                                                                             title  \
0                 As U.S. budget fight looms, Republicans flip their fiscal script   
1                 U.S. military to accept transgender recruits on Monday: Pentagon   
2                     Senior U.S. Republican senator: 'Let Mr. Mueller do his job'   
3                      FBI Russia probe helped by Australian diplomat tip-off: NYT   
4            Trump wants Postal Service to charge 'much more' for Amazon shipments   
5                 White House, Congress prepare for talks on spending, immigration   
6                  Trump says Russia probe will be fair, but timeline unclear: NYT   
7                     Factbox: Trump on Twitter (Dec 29) - Approval rating, Amazon   
8                                       Trump on Twitter (Dec 28) - Global Warming   
9     Alabama official to certify Senator-elect Jones today despite challenge: CNN   
10                      Jones certified U.S. Senate wi

### Observation 2
- The "subject" column doesn't seem to be useful.

In [9]:


# Look at the full text of one real article (row 4, column 1 = "text")
text = df0.iloc[4, 1]
print(text)



About 10% of the articles have no author. I wonder whether that could be a characteristic of fake news.

In [6]:
# Replace missing author/title/text values with a sentinel string
empty = '_EmptyAuthorTitleText_'
df_read.fillna({'author': empty, 'title': empty, 'text': empty}, inplace=True)

In [7]:
df_read[df_read.author == '_EmptyAuthorTitleText_'].describe()

                   id       label
count          1957.0      1957.0
mean     10292.462954    0.986714
std       6029.266152    0.114524
min               6.0         0.0
25%            5124.0         1.0
50%           10242.0         1.0
75%           15458.0         1.0
max           20786.0         1.0


All of the articles with no title or text were fake, and about 99% of the articles with no named author were fake (the mean of `label` above is 0.987). Both can be strong features with high precision.
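
Precision for a yes/no feature such as "no author" is the share of flagged articles that are actually fake, which is what the mean of `label` measures when fake is coded as 1. Below is a minimal sketch of that calculation with made-up values. The names `feature_precision`, `no_author`, and `labels` are hypothetical.

```python
def feature_precision(flags, labels):
    """Share of flagged articles that are fake (label 1)."""
    flagged = [label for flag, label in zip(flags, labels) if flag]
    if not flagged:
        return None  # the feature never fires, so precision is undefined
    return sum(flagged) / len(flagged)

# Made-up example: True means the author field is empty
no_author = [True, True, True, True, False, False]
labels = [1, 1, 1, 0, 0, 1]

print(feature_precision(no_author, labels))
```

High precision alone doesn't make a feature sufficient: it only fires on articles without an author, so it says nothing about the rest.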

In [22]:
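# Bag-of-words features. `text` and `label` are assumed to hold the article
# texts and their 0/1 labels (they are defined outside the cells shown here).
vectorizer = CountVectorizer()

# Learn the vocabulary from the texts
vectorizer.fit(text)

# Turn each text into a sparse vector of word counts
X = vectorizer.transform(text)
y = label

print(vectorizer.vocabulary_)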



In [23]:
classifier = MultinomialNB()
classifier.fit(X,y)

MultinomialNB()

In [27]:
x = df_read.iloc[100]

print(x)

classifier.predict_proba(vectorizer.transform([x.content]))


id                                                       100
title      Technocracy: The Real Reason Why The UN Wants ...
author                                         Activist Post
text       By Patrick Wood By its very nature, the Intern...
label                                                      1
content    By Patrick Wood By its very nature, the Intern...
Name: 100, dtype: object


array([[0.01342701, 0.98657299]])

In [None]:
# Pick one article's text (row 6, column 3)
text = df.iloc[6, 3]

# Sentence tokenize
text = sent_tokenize(text)

# Part of speech tagging
pos_text = []

# Stop word removal
stop_words = set(stopwords.words('english'))

t = ['this is good', 'that is good', 'good hehe', 'this is bad', 'that is bad', 'bad sucks']
l = [0, 0, 0, 1, 1, 1]


counter = CountVectorizer()
counter.fit(t)


print(counter.vocabulary_)
print(counter.transform(t))



After trying the 'chunk_counter' function on a few sentences, I noticed that real news tends to use more specific, objective terms (you can guess the topic from the keywords), whereas fake news tends to use more subjective, generic terms (the frequent keywords give little idea of the topic).

In [None]:


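for sentence in text:

    # Replace runs of non-word characters with a space, then lowercase
    sentence = re.sub(r'\W+', ' ', sentence)
    sentence = sentence.lower()

    # Split into word tokens and drop stop words
    sentence = word_tokenize(sentence)
    keywords = [word for word in sentence if word not in stop_words]

    # Tag the remaining words with their part of speech
    keywords = pos_tag(keywords)
    pos_text.append(keywords)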
    
    

def chunk_counter(pos_text, abbr='Noun', n_chunk=10):
    """Return the n_chunk most common chunks of the given type.

    pos_text is a list of POS-tagged sentences (lists of (word, tag) pairs);
    abbr is one of 'Noun', 'Verb', 'Adj', or 'Adv'.
    """

    grammar = ''

    if abbr == 'Noun':
        grammar = "Noun: {<NN.*>}"

    elif abbr == 'Verb':
        grammar = "Verb: {<VB.*>}"

    elif abbr == 'Adj':
        grammar = 'Adj: {<JJ.*>}'

    elif abbr == 'Adv':
        grammar = 'Adv: {<RB.*>}'

    #grammar = "NP: {<DT>?<JJ>*<NN.*>}" # noun
    #grammar = "VPa: {<DT>?<JJ>*<NN.*><VB.*><RB.?>?}"
    #grammar = "VPb: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"

    else:
        print('Incorrect abbr')
        return False

    # Chunk phrases; the chunk label in the grammar matches abbr
    parser = RegexpParser(grammar)

    chunks = []

    for sentence in pos_text:

        chunk = parser.parse(sentence)

        for subtree in chunk.subtrees(filter=lambda t: t.label() == abbr):
            chunks.append(tuple(subtree))

    # Count phrases and keep the most frequent ones
    counter = Counter(chunks)

    return counter.most_common(n_chunk)

print(chunk_counter(pos_text, abbr='Verb'))
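
To compare real and fake news side by side, the same kind of count can be run once per class. The sketch below is a simplified stand-in rather than the function above: it skips the chunk parser and counts words by tag prefix directly. The helper name `top_words_by_tag` and the hand-tagged toy sentences are my assumptions for illustration, not data from the dataset.

```python
from collections import Counter

def top_words_by_tag(tagged_sentences, prefix, n=3):
    """Most common words whose POS tag starts with `prefix` (e.g. 'NN', 'VB')."""
    counter = Counter(
        word
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag.startswith(prefix)
    )
    return counter.most_common(n)

# Hand-tagged toy sentences (made up, not from the dataset)
real_like = [
    [("senate", "NN"), ("passed", "VBD"), ("budget", "NN")],
    [("senate", "NN"), ("delayed", "VBD"), ("vote", "NN")],
]
fake_like = [
    [("people", "NNS"), ("really", "RB"), ("hate", "VBP"), ("thing", "NN")],
    [("people", "NNS"), ("love", "VBP"), ("thing", "NN")],
]

# Compare the most frequent nouns in each class
print(top_words_by_tag(real_like, "NN"))
print(top_words_by_tag(fake_like, "NN"))
```

In the toy data, the frequent nouns in the first group hint at a topic, while those in the second group do not, which is the pattern described earlier.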