# Text classification for fake news detection

In this notebook I train and evaluate three ML models from the scikit-learn library:
- Passive Aggressive Classifier
- Logistic Regression
- Linear SVC

All three models use a TF-IDF vectorizer, a frequency-based text vectorizer.
The models classify news based only on the text or only on the title of an article.

The investigation is structured in the following manner:
1. Read and preprocess data.
2. Split data into training and test sets, vectorize it as model input.
3. Train and evaluate the models.
4. Summarize the results.


Download the dataset from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset and extract it into the same folder as this notebook.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# text processing, term frequency based
from sklearn.feature_extraction.text import TfidfVectorizer 
# models
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.svm import LinearSVC

### Data preprocessing

The input data is not 'clean'. Apparently, many "true" news items contain a source at the beginning of their "text" field, and many "fake" news items contain related picture info at the end of their "text" field. This can bias a model, so we cut them out.

In [2]:
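def cutsource(s):
    ''' a function to cut out the news source in "true" texts.
        Luckily the source is separated from the body by '- ' (dash and space),
        so everything up to and including the first '- ' is dropped.
    '''
    if '- ' in s:
        # keep only what follows the first '- '
        s = s.split('- ', 1)[1]

    return s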

def cutgetty(s):
    ''' a function to cut out 'getty images' in "fake" texts
    '''
    s = re.sub('Getty Images', '', s)
    
    return s

def cutfactbox(s):
    ''' a function to cut out 'factbox' in "true" titles
    '''
    s = re.sub('factbox', '', s, flags=re.IGNORECASE)
    
    return s
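
One caveat of cutting at the first '- ' is that it also truncates texts that have no source prefix but happen to contain a dash. The sketch below compares that behaviour with a stricter variant that only strips a leading dateline. The pattern `SOURCE_PREFIX`, the helper names and both sample strings are hypothetical; I am assuming the source looks like `CITY (Reuters) - `, which should be checked against the actual data.

```python
import re

# Assumption: the source prefix is a short dateline ending in "(Reuters) - "
# at the very start of the text. Adjust the pattern to the real data.
SOURCE_PREFIX = re.compile(r'^.{0,120}?\(Reuters\)\s*-\s+')

def cut_first_dash(s):
    """Drop everything up to and including the first '- ' (as cutsource does)."""
    return s.split('- ', 1)[1] if '- ' in s else s

def cut_source_strict(s):
    """Remove only a leading '... (Reuters) - ' dateline, leave other text untouched."""
    return SOURCE_PREFIX.sub('', s, count=1)

# Made-up sample strings, not taken from the dataset
with_source = 'WASHINGTON (Reuters) - The senate voted on Tuesday.'
without_source = 'The vote was close - closer than expected.'

for sample in (with_source, without_source):
    print('first dash:', cut_first_dash(sample))
    print('strict    :', cut_source_strict(sample))
```

The strict variant leaves the second sample intact, at the price of missing sources that are formatted differently.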

In [3]:
fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')

In [4]:
true['text'] = true['text'].apply(cutsource)
fake['text'] = fake['text'].apply(cutgetty)
true['title'] = true['title'].apply(cutfactbox)
# combine data into 1 dataframe, discarding 'date' and 'subject' fields, 
# removing rows with empty text or title.
cols = ['title', 'text']
df = pd.concat([fake[cols], true[cols]], ignore_index=True)
df['text'] = df['text'].str.strip()
df['title'] = df['title'].str.strip()
label = len(fake)*['fake'] + len(true)*['true']
df['label'] = label
# drop news shorter than a tweet
df = df[df['text'].str.len() > 280]
df = df.replace('', np.nan)
df.dropna(inplace=True)
df['label'].value_counts()
example = df.iloc[42]
print(example['title'] + '\n' + example['text'] + '\n' + example['label'])

Leaked Email Proves Trump Officials Aware Russia Had ‘Thrown The USA Election’ To Trump
Donald Trump s current deputy national security adviser K.T. McFarland, a former Fox News personality, K. T. McFarland admitted in an email to a colleague during the 2016 presidential transition to Russia throwing the election to Trump. The leaked email was written just weeks before Trump s inauguration and it states that sanctions would make it difficult to ease relations with Russia,  which has just thrown the U.S.A. election to him. The New York Times reports:But emails among top transition officials, provided or described to The New York Times, suggest that Mr. Flynn was far from a rogue actor. In fact, the emails, coupled with interviews and court documents filed on Friday, showed that Mr. Flynn was in close touch with other senior members of the Trump transition team both before and after he spoke with the Russian ambassador, Sergey I. Kislyak, about American sanctions against Russia.A White H

### Machine learning time. 
Training will consider 2 cases: title only and text only classification.

For word processing I use TF-IDF, a frequency-based metric that weighs how often a term occurs in a given text against how common it is across the whole corpus.
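
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which as far as I know matches the defaults of `TfidfVectorizer`. The whitespace tokenizer, the `tfidf` helper and the three sample titles are simplifications of my own; stop words and `max_df` are ignored.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # rare terms get a larger idf, hence a larger weight
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Made-up titles for illustration only
docs = [
    'senate votes on budget',
    'shocking video of senate',
    'budget talks continue',
]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, '->', {t: round(x, 2) for t, x in sorted(w.items())})
```

Within the first title, 'votes' occurs in one document only and therefore gets a higher weight than 'senate', which occurs in two.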

Here I construct a function for evaluation of different models.

In [5]:
def data_split(df_col):
    '''split data into train and test, turn into vectors.
    '''
    x_train,x_test,y_train,y_test=train_test_split(df[df_col], df['label'], test_size=0.2, random_state=42, shuffle=True)
    # Learn vocabulary and idf, return document-term matrix.
    tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.75)
    vec_train=tfidf_vectorizer.fit_transform(x_train.values.astype('U')) 
    # Transform documents to document-term matrix.
    vec_test=tfidf_vectorizer.transform(x_test.values.astype('U'))
    
    return (tfidf_vectorizer, vec_train, vec_test, y_train, y_test)
    
def model_eval(input_data, model):
    '''function to fit a model and report its f1 score and top keywords.
        input_data is the tuple returned by data_split() for 'text' or 'title':
        (tfidf_vectorizer, vec_train, vec_test, y_train, y_test)
        https://github.com/satssehgal/FakeNewsDetector
    '''
    tfidf_vectorizer, vec_train, vec_test, y_train, y_test = input_data
    model.fit(vec_train,y_train)
    y_pred=model.predict(vec_test)
    f1 = f1_score(y_test, y_pred, pos_label='fake')
    # let's see the top terms for true and fake news (largest positive / negative coefficients)
    # https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
    terms = tfidf_vectorizer.get_feature_names_out()
    keywords = sorted(zip(model.coef_[0], terms), reverse=True)
    keywords_true = np.array(keywords[:20])
    keywords_fake = np.flip(np.array(keywords[-20:]), axis=0)
    true_out = keywords_true[:, 1]
    fake_out = keywords_fake[:, 1]

    print('F1 score', f1)
    print('\t True keywords: \n', true_out)
    print('\t Fake keywords: \n', fake_out)
    
    return (f1, true_out, fake_out)
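
The keyword extraction relies on the sign of the coefficients. For a binary linear model in scikit-learn, positive coefficients push a prediction towards `classes_[1]`, and since the labels are sorted alphabetically that is 'true' here. The sketch below shows the ranking step in isolation; the terms, the coefficient values and the `top_terms` helper are made up for illustration.

```python
# Hypothetical coefficients of a binary linear model whose classes_ are
# ['fake', 'true']: positive values lean 'true', negative values lean 'fake'.
terms = ['reuters', 'video', 'said', 'breaking', 'tuesday', 'watch']
coefs = [2.1, -3.0, 1.4, -1.8, 0.9, -2.2]

def top_terms(coefs, terms, k=2):
    """Return (k most 'true'-leaning terms, k most 'fake'-leaning terms)."""
    ranked = sorted(zip(coefs, terms), reverse=True)
    true_terms = [t for _, t in ranked[:k]]
    fake_terms = [t for _, t in ranked[-k:][::-1]]  # most negative first
    return true_terms, fake_terms

true_terms, fake_terms = top_terms(coefs, terms)
print('true:', true_terms)
print('fake:', fake_terms)
```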

In [6]:
# split data into training and test sets, vectorize.
title_data = data_split('title')
text_data = data_split('text')

# setting models and computing scores
pac = PassiveAggressiveClassifier()
lr = LogisticRegression()
lsvc = LinearSVC()
table = []
for model in (pac, lr, lsvc):
    print('*--------- '+type(model).__name__+' ---------*')
    print('## Title based classification ##')
    f1, true_m, fake_m = model_eval(title_data, model)
    print('## Text based classification ##') 
    f1txt, true_mtxt, fake_mtxt = model_eval(text_data, model)
    table.append([f1, true_m[:5], fake_m[:5], f1txt, true_mtxt[:5], fake_mtxt[:5]])


*--------- PassiveAggressiveClassifier ---------*
## Title based classification ##
F1 score 0.9396904870277653
	 True keywords: 
 ['says' 'exclusive' 'faults' 'spokesman' 'fame' 'rohingya' 'german'
 'kremlin' 'hindu' 'kabul' 'talks' 'pakistan' 'blitz' 'vulgar' 'myanmar'
 'employs' 'fights' 'collar' 'north' 'inspiration']
	 Fake keywords: 
 ['video' 'breaking' 'racist' 'gop' 'just' 'hillary' 'joe' 'illegals'
 'james' 'anonymous' 'ck' 'actually' 'watch' 'gitmo' 'ammo' 'globalist'
 'mails' 'dems' 'dog' 'shocking']
## Text based classification ##
F1 score 0.9878884826325411
	 True keywords: 
 ['thursday' 'tuesday' 'wednesday' 'nov' 'friday' 'republican' 'reuters'
 'monday' 'donald' 'spokeswoman' 'rival' 'representatives' 'comment'
 'spokesman' 'barack' 'statement' 'referring' 'reporters' 'saying'
 'television']
	 Fake keywords: 
 ['featured' 'image' 'read' 'gop' 'com' 'sen' 'just' 'breitbart' 'rep'
 'pic' 'watch' 'mr' 'wfb' 'https' 'daily' 'hillary' 'wire' '21st' 'mail'
 'reportedly']
*---

In [7]:
models_ev = pd.DataFrame(table, index=['PAC', 'LogReg', 'LinSVC'], 
                         columns=['F1 score (title)', 'Top true (title)', 'Top fake (title)', 
                                 'F1 score (text)', 'Top true (text)', 'Top fake (text)'])
models_ev

| Model | F1 score (title) | Top true (title) | Top fake (title) | F1 score (text) | Top true (text) | Top fake (text) |
|---|---|---|---|---|---|---|
| PAC | 0.93969 | [says, exclusive, faults, spokesman, fame] | [video, breaking, racist, gop, just] | 0.987888 | [thursday, tuesday, wednesday, nov, friday] | [featured, image, read, gop, com] |
| LogReg | 0.94172 | [says, china, house, talks, myanmar] | [video, hillary, watch, breaking, just] | 0.976499 | [wednesday, reuters, thursday, tuesday, republ... | [image, featured, just, read, gop] |
| LinSVC | 0.951412 | [says, britain, german, urges, myanmar] | [video, breaking, gop, just, hillary] | 0.988001 | [thursday, wednesday, tuesday, reuters, nov] | [featured, image, read, gop, just] |
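
The F1 scores above are computed with 'fake' as the positive class (`pos_label='fake'`), i.e. as the harmonic mean of precision and recall for detecting fake news. A minimal sketch of that computation on made-up labels follows; `f1_for_class` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
def f1_for_class(y_true, y_pred, pos_label):
    """F1 = harmonic mean of precision and recall for one chosen class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # how many predicted positives are correct
    recall = tp / (tp + fn)     # how many actual positives were found
    return 2 * precision * recall / (precision + recall)

# Made-up labels: one fake article is misclassified as true
y_true = ['fake', 'fake', 'fake', 'true', 'true', 'true']
y_pred = ['fake', 'fake', 'true', 'true', 'true', 'true']

print('F1 (fake as positive):', f1_for_class(y_true, y_pred, 'fake'))
print('F1 (true as positive):', f1_for_class(y_true, y_pred, 'true'))
```

The two values differ, so the choice of positive class matters when comparing scores across notebooks.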


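NOTE: the keywords and scores might differ slightly from run to run (e.g. after starting a new kernel). The split itself is reproducible, because train_test_split() is called with a fixed random_state=42. The variation most likely comes from the models, which are created without a random_state; PassiveAggressiveClassifier, for example, shuffles the training data by default.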

### Summary
From the text-based classification we can conclude that the models identify true news as articles that refer to a person or to other news, with words like 'said', 'showed', 'citing', 'comment', and that contain a date (day).
Fake news, on the other hand, seems to refer a lot to images (presumably in the news article) or to links on the internet (e.g. pic.twitter.com/*). Also, while true news refers to particular days, fake news tends to use time adverbs like 'just', 'daily', 'reportedly'.
Fake news also seems to rely on screenshots rather than text citations.

It is also plausible that fake news appeals to visual comprehension of information, whereas true news appeals to verbal or idea-based information.

The title-based classification shows that keywords like "shocking", "breaking", "racist" were identified for fake news. Fake news is similar to a virus: it aims to get as many clicks as possible in a short amount of time, and indeed the keywords for fake news look like clickbait. True news, in contrast, seems to have more neutral and passive keywords.

Within this investigation all models show similarly high F1 scores, with PAC and LinSVC being slightly better than LogReg most of the time. At the same time, text-based modelling shows higher scores than title-based, which is reasonable: the more information is available, the more precise one can be (to some extent).

We can conclude that all the models showed reasonable results, which we can also interpret in a clear, understandable manner.

PAC was my first choice since I didn't really know what to pick; then I took LogisticRegression for being the simplest and easiest to understand, and Linear SVC because, according to Google, it is one of the best for text classification.

### References
1. https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
2. https://cezannec.github.io/CNN_Text_Classification/
