# Text classification for fake news detection

In this notebook I train and evaluate three ML models from the scikit-learn library:
- Passive Aggressive Classifier
- Logistic Regression
- Linear SVC

All three models use a TF-IDF vectorizer, a term-frequency-based text vectorizer.
The models classify news based only on either the text or the title of the article.

The investigation is structured as follows:
1. Read and preprocess the data.
2. Split the data into training and test sets and vectorize it for model input.
3. Train and evaluate the models.
4. Summary


Download the dataset from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset and extract it into the same folder as this notebook.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# text processing, term frequency based
from sklearn.feature_extraction.text import TfidfVectorizer 
# models
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.svm import LinearSVC

### Data preprocessing

The input data is not 'clean'. Apparently many "true" news items contain a source at the beginning of their "text" field, and many "fake" news items contain picture credits at the end of their "text" field. This can bias a model, so we cut them out.

In [2]:
def cutsource(s):
    ''' a function to cut out news source in "true" texts
        luckily they are separated by '-' (dash sign)
    '''
    if '- ' in s:
        s1 = s.split('- ')[0]
        s = s[len(s1)+2:]
        
    return s

def cutgetty(s):
    ''' a function to cut out 'getty images' in "fake" texts
    '''
    s = re.sub('Getty Images', '', s)
    
    return s

def cutfactbox(s):
    ''' a function to cut out 'factbox' in "true" titles
    '''
    s = re.sub('factbox', '', s, flags=re.IGNORECASE)
    
    return s
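
To see what the cleaning helpers do, here is a minimal sketch on made-up strings (the sample sentences are hypothetical, not taken from the dataset). `cutsource` is restated with `str.partition`, which should behave the same as the version above. The last call shows a caveat: a text without a source prefix is still cut at its first `'- '`.

```python
import re

def cutsource(s):
    """Drop everything up to and including the first '- ' separator."""
    head, sep, tail = s.partition('- ')
    # if the separator is absent, return the string unchanged
    return tail if sep else s

def cutgetty(s):
    """Remove the literal credit 'Getty Images'."""
    return re.sub('Getty Images', '', s)

# hypothetical sample strings, for illustration only
true_text = "WASHINGTON (Reuters) - The Senate voted on Tuesday."
fake_text = "The senator was furious. Featured image via Getty Images"
no_source = "The vote was close - closer than expected."

cleaned_true = cutsource(true_text)
cleaned_fake = cutgetty(fake_text)
cleaned_no_source = cutsource(no_source)

print(cleaned_true)
print(cleaned_fake)
print(cleaned_no_source)  # the first half of the sentence is lost
```

The last case suggests that `cutsource` can also remove genuine content, which is worth keeping in mind when reading the results.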

In [3]:
fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')

In [4]:
true['text'] = true['text'].apply(cutsource)
fake['text'] = fake['text'].apply(cutgetty)
true['title'] = true['title'].apply(cutfactbox)
# combine data into 1 dataframe, discarding 'date' and 'subject' fields, 
# removing rows with empty text or title.
cols = ['title', 'text']
df = pd.concat([fake[cols], true[cols]], ignore_index=True)
df['text'] = df['text'].str.strip()
df['title'] = df['title'].str.strip()
label = len(fake)*['fake'] + len(true)*['true']
df['label'] = label
# drop news shorter than a tweet
df = df[df['text'].str.len() > 280]
df = df.replace('', np.nan)
df.dropna(inplace=True)
df['label'].value_counts()
example = df.iloc[42]
print(example['title'] + '\n' + example['text'] + '\n' + example['label'])

Leaked Email Proves Trump Officials Aware Russia Had ‘Thrown The USA Election’ To Trump
Donald Trump s current deputy national security adviser K.T. McFarland, a former Fox News personality, K. T. McFarland admitted in an email to a colleague during the 2016 presidential transition to Russia throwing the election to Trump. The leaked email was written just weeks before Trump s inauguration and it states that sanctions would make it difficult to ease relations with Russia,  which has just thrown the U.S.A. election to him. The New York Times reports:But emails among top transition officials, provided or described to The New York Times, suggest that Mr. Flynn was far from a rogue actor. In fact, the emails, coupled with interviews and court documents filed on Friday, showed that Mr. Flynn was in close touch with other senior members of the Trump transition team both before and after he spoke with the Russian ambassador, Sergey I. Kislyak, about American sanctions against Russia.A White H

### Machine learning time. 
Training will consider 2 cases: title only and text only classification.

For word processing I use TF-IDF, a frequency-based metric that weighs how often a term occurs in a given text against how many texts in the whole corpus contain it.
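
As a rough illustration of the metric, the sketch below computes TF-IDF by hand on a hypothetical three-document corpus. It uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which are the documented defaults of `TfidfVectorizer`. Tokenization is simplified to whitespace splitting, and stop words and `max_df` are ignored, so the numbers will not match the vectorizer used in this notebook.

```python
import math
from collections import Counter

# hypothetical toy corpus, for illustration only
docs = [
    "breaking video shows senator",
    "senator says talks continue",
    "watch breaking video",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized for term in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

idf, rows = tfidf(docs)
for term in ("senator", "talks"):
    print(term, round(idf[term], 3))
print(rows[1])
```

A term that occurs in fewer documents ('talks') gets a higher idf than one that occurs in more of them ('senator'), so it contributes more to the document vector.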

Here I construct a function for evaluation of different models.

In [5]:
def data_split(df_col):
    '''split data into train and test, turn into vectors.
    '''
    x_train,x_test,y_train,y_test=train_test_split(df[df_col], df['label'], test_size=0.2, random_state=42, shuffle=True)
    # Learn vocabulary and idf, return document-term matrix.
    tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.75)
    vec_train=tfidf_vectorizer.fit_transform(x_train.values.astype('U')) 
    # Transform documents to document-term matrix.
    vec_test=tfidf_vectorizer.transform(x_test.values.astype('U'))
    
    return (tfidf_vectorizer, vec_train, vec_test, y_train, y_test)
    
def model_eval(input_data, model):
    '''Fit a model, report its F1 score and its top keywords.
        input_data is the tuple returned by data_split (for text or title),
        including the fitted tfidf_vectorizer.
        Based on https://github.com/satssehgal/FakeNewsDetector
    '''
    tfidf_vectorizer, vec_train, vec_test, y_train, y_test = input_data
    model.fit(vec_train,y_train)
    y_pred=model.predict(vec_test)
    f1 = f1_score(y_test, y_pred, pos_label='fake')
    # top terms for true and fake news: the largest positive coefficients point
    # to 'true', the most negative ones to 'fake'
    # https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
    terms = tfidf_vectorizer.get_feature_names_out()
    keywords = sorted(zip(model.coef_[0], terms), reverse=True)
    keywords_true = np.array(keywords[:20])
    keywords_fake = np.flip(np.array(keywords[-20:]), axis=0)
    true_out = keywords_true[:, 1]
    fake_out = keywords_fake[:, 1]

    print('F1 score', f1)
    print('\t True keywords: \n', true_out)
    print('\t Fake keywords: \n', fake_out)
    
    return (f1, true_out, fake_out)
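
The keyword ranking in `model_eval` relies on the sign of the coefficients: for a binary linear model in scikit-learn, positive coefficients favour the second class in `classes_` (here 'true', since labels are sorted alphabetically) and negative ones the first ('fake'). A minimal sketch with hypothetical coefficients and terms standing in for `model.coef_[0]` and `get_feature_names_out()`:

```python
# hypothetical values, for illustration only
classes = ['fake', 'true']  # sorted labels, as stored in classes_
coefs = [-1.8, 2.1, 0.1, -0.9, 1.4]
terms = ['video', 'reuters', 'house', 'breaking', 'spokesman']

def top_keywords(coefs, terms, k):
    # sort (coefficient, term) pairs from most positive to most negative
    ranked = sorted(zip(coefs, terms), reverse=True)
    top_positive = [t for _, t in ranked[:k]]
    # reverse the ranking so the most negative coefficient comes first
    top_negative = [t for _, t in ranked[::-1][:k]]
    return top_positive, top_negative

true_terms, fake_terms = top_keywords(coefs, terms, k=2)
print(classes[1], true_terms)
print(classes[0], fake_terms)
```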

In [6]:
# split data into training and test sets, vectorize.
title_data = data_split('title')
text_data = data_split('text')

# setting models and computing scores
pac = PassiveAggressiveClassifier()
lr = LogisticRegression()
lsvc = LinearSVC()
table = []
for model in (pac, lr, lsvc):
    print('*--------- '+type(model).__name__+' ---------*')
    print('## Title based classification ##')
    f1, true_m, fake_m = model_eval(title_data, model)
    print('## Text based classification ##') 
    f1txt, true_mtxt, fake_mtxt = model_eval(text_data, model)
    table.append([f1, true_m[:5], fake_m[:5], f1txt, true_mtxt[:5], fake_mtxt[:5]])


*--------- PassiveAggressiveClassifier ---------*
## Title based classification ##
F1 score 0.9376854599406528
	 True keywords: 
 ['says' 'exclusive' 'faults' 'spokesman' 'fame' 'kremlin' 'rohingya'
 'hindu' 'german' 'kabul' 'blitz' 'urges' 'talks' 'pakistan' 'myanmar'
 'employs' 'fights' 'vulgar' 'collar' 'weigh']
	 Fake keywords: 
 ['video' 'breaking' 'racist' 'gop' 'just' 'joe' 'hillary' 'illegals'
 'james' 'anonymous' 'actually' 'watch' 'gitmo' 'ck' 'ammo' 'globalist'
 'dog' 'dems' 'destroy' 'knees']
## Text based classification ##
F1 score 0.9875528148909445
	 True keywords: 
 ['thursday' 'tuesday' 'wednesday' 'nov' 'friday' 'republican' 'reuters'
 'monday' 'donald' 'spokeswoman' 'rival' 'spokesman' 'statement' 'comment'
 'referring' 'representatives' 'barack' 'reporters' 'saying' 'edt']
	 Fake keywords: 
 ['featured' 'image' 'read' 'gop' 'just' 'com' 'sen' 'rep' 'watch' 'pic'
 'breitbart' 'wfb' 'mr' 'https' 'daily' 'hillary' 'mail' 'wire'
 'reportedly' '21st']
*--------- Logistic

In [7]:
models_ev = pd.DataFrame(table, index=['PAC', 'LogReg', 'LinSVC'], 
                         columns=['F1 score (title)', 'Top true (title)', 'Top fake (title)', 
                                 'F1 score (text)', 'Top true (text)', 'Top fake (text)'])
models_ev

| | F1 score (title) | Top true (title) | Top fake (title) | F1 score (text) | Top true (text) | Top fake (text) |
|---|---|---|---|---|---|---|
| PAC | 0.937685 | [says, exclusive, faults, spokesman, fame] | [video, breaking, racist, gop, just] | 0.987553 | [thursday, tuesday, wednesday, nov, friday] | [featured, image, read, gop, just] |
| LogReg | 0.94172 | [says, china, house, talks, myanmar] | [video, hillary, watch, breaking, just] | 0.976499 | [wednesday, reuters, thursday, tuesday, republ... | [image, featured, just, read, gop] |
| LinSVC | 0.951412 | [says, britain, german, urges, myanmar] | [video, breaking, gop, just, hillary] | 0.988001 | [thursday, wednesday, tuesday, reuters, nov] | [featured, image, read, gop, just] |


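NOTE: the keywords might differ from run to run (e.g. after starting a new kernel). The split itself is reproducible, since train_test_split() is called with a fixed random_state=42. The variation more likely comes from the models, which are created without a random_state (PassiveAggressiveClassifier, for example, shuffles the training data between epochs by default).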

### Summary
From the text-based classification we can conclude that the models associate true news with words that refer to a person or to other reporting, like 'said', 'showed', 'citing', 'comment', and with words that name a date (day of the week).
Fake news, on the other hand, seem to refer to images a lot (presumably in the news article) or to links on the internet (e.g. pic.twitter.com/*). Also, true news refer to particular days, while fake news lean towards vaguer time adverbs, like 'just', 'daily', 'reportedly'.
Alternatively, fake news may rely on screenshots rather than text citations.

It is also plausible that fake news appeal to visual comprehension of information, whereas true news appeal to verbal, idea-based information.

The title-based classification shows that keywords like "shocking", "breaking", "racist" were identified for fake news. Fake news are similar to a virus: they aim to get as many clicks as possible in a short amount of time, and indeed the keywords for fake news look like clickbait. True news, by contrast, seem to have more neutral and passive keywords.

Within this investigation all models show similarly high F1 scores, with PAC and LinSVC being slightly better than LogReg on text most of the time. Text-based modelling also scores higher than title-based modelling, which is reasonable: the more information is available, the more precise one can be (to some extent).
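
For reference, the F1 score compared above is the harmonic mean of precision and recall for the class chosen as positive (here 'fake', matching `pos_label='fake'` in `model_eval`). A minimal hand-rolled sketch on hypothetical labels; the notebook itself uses `sklearn.metrics.f1_score`:

```python
def f1_for_label(y_true, y_pred, pos_label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        # no true positives: precision and recall are both zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical labels, for illustration only
y_true = ['fake', 'fake', 'fake', 'true', 'true', 'true']
y_pred = ['fake', 'fake', 'true', 'fake', 'true', 'true']
score = f1_for_label(y_true, y_pred, pos_label='fake')
print(score)
```

Here two of the three 'fake' items are found and one 'true' item is mislabelled as 'fake', so precision and recall are both 2/3.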

We can conclude that all the models showed reasonable results which we can also interpret in a clear understandable manner.

PAC was my first choice since I didn't really know what to pick. Then I took LogisticRegression for being the simplest and easiest to understand, and LinearSVC because, according to a Google search, it is one of the best for text classification.

## CNN (Convolutional Neural Network) for text classification using PyTorch

This part of the investigation is an attempt to apply the CNN classification algorithm written for movie reviews [[2]](#references).

I'll go through the following steps:
1. Text preprocessing
    - Embedding layer and Tokenization
    - Padding
2. CNN model
3. Evaluation

### Text preprocessing

I am going to use a pre-trained embedding layer and map the words in the news onto it. For that I first need to tokenize the news, i.e. turn each text into a list of words (tokens), and then map those tokens to the pre-trained embedding layer.

In [8]:
# lowercase the texts and remove punctuation
df['text'] = df['text'].str.lower()
# remove everything that is not a word character (letter, digit, underscore) or whitespace;
# the pattern is a raw string so that \w and \s are not treated as string escapes
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
texts_split = df['text'].str.split().tolist()
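
The same normalization can be sketched on a single string with the standard `re` module. The sample reuses phrases from the example article printed earlier and shows a side effect of deleting punctuation outright: words separated only by punctuation are glued together ('reports:But' becomes 'reportsbut'), and abbreviations collapse ('K.T.' becomes 'kt').

```python
import re

def normalize(text):
    text = text.lower()
    # delete everything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

sample = "K.T. McFarland. The New York Times reports:But emails among top transition officials"
tokens = normalize(sample)
print(tokens)
```

Replacing punctuation with a space instead of an empty string would keep such words apart, at the cost of splitting abbreviations into single letters.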

In [9]:
# now the texts are lists of words
texts_split[42][:15]

['donald',
 'trump',
 's',
 'current',
 'deputy',
 'national',
 'security',
 'adviser',
 'kt',
 'mcfarland',
 'a',
 'former',
 'fox',
 'news',
 'personality']

In [10]:
# seems like I have 1 letter words which usually don't carry much semantic significance
# let's get rid of them
for i, text in enumerate(texts_split):
    texts_split[i] = [word for word in text if len(word)>1]

In [11]:
print(texts_split[42])

['donald', 'trump', 'current', 'deputy', 'national', 'security', 'adviser', 'kt', 'mcfarland', 'former', 'fox', 'news', 'personality', 'mcfarland', 'admitted', 'in', 'an', 'email', 'to', 'colleague', 'during', 'the', '2016', 'presidential', 'transition', 'to', 'russia', 'throwing', 'the', 'election', 'to', 'trump', 'the', 'leaked', 'email', 'was', 'written', 'just', 'weeks', 'before', 'trump', 'inauguration', 'and', 'it', 'states', 'that', 'sanctions', 'would', 'make', 'it', 'difficult', 'to', 'ease', 'relations', 'with', 'russia', 'which', 'has', 'just', 'thrown', 'the', 'usa', 'election', 'to', 'him', 'the', 'new', 'york', 'times', 'reportsbut', 'emails', 'among', 'top', 'transition', 'officials', 'provided', 'or', 'described', 'to', 'the', 'new', 'york', 'times', 'suggest', 'that', 'mr', 'flynn', 'was', 'far', 'from', 'rogue', 'actor', 'in', 'fact', 'the', 'emails', 'coupled', 'with', 'interviews', 'and', 'court', 'documents', 'filed', 'on', 'friday', 'showed', 'that', 'mr', 'flynn'

In [12]:
# one last preprocessing step is that I would like to cut the length of the texts to have a "light" model
lens = [len(text) for text in texts_split]
print("mean and median number of words in texts:", np.mean(lens), '&', np.median(lens))

mean and median number of words in texts: 404.93019340052854 & 356.0


In [13]:
# I set the limit to 400 words (between the median and the mean) and truncate longer texts to it
max_len = 400
for i, text in enumerate(texts_split):
    texts_split[i] = text[:max_len]

#### Embedding layer and Tokenization
You will need to download the word2vec model ```GoogleNews-vectors-negative300-SLIM.bin.gz``` (approx. 300MB) from

https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz

and put it into the same folder as the project.

This word2vec model was trained on Google News, which suits this project quite well.
After loading the model I tokenize the texts according to it, i.e. I map the words from the corpus to the integers from the model's lookup table. As output I get a 2D array of integers representing the words in the news, where each row is a separate news text. I also cut long news and left-pad short news.

In [14]:
## This cell needs to be run only once
## unzipping our word2vec model
# ! gzip -d GoogleNews-vectors-negative300-SLIM.bin.gz

In [15]:
from gensim.models import KeyedVectors

# loading the model
embed_lookup = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300-SLIM.bin', 
                                                 binary=True)
print(len(embed_lookup), 'words in the vocabulary')

299567 words in the vocabulary


In [16]:
word = 'news'
print("Length of embedding: ", len(embed_lookup[word]))  # dimension of the vector space of words
# embed_lookup.index_to_key[:11]
print("Index of '", word, "' in the lookup table:", embed_lookup.key_to_index[word])

Length of embedding:  300
Index of ' news ' in the lookup table: 283


In [17]:
# Tokenization: each word in a news text is represented by its index in the lookup table.
# Unknown words are mapped to 0; note that 0 may also be the index of a real word
# in the lookup table, in which case the two cannot be told apart later.
tokenized_news = []
for text in texts_split:
    # dict.get replaces the bare try/except and falls back to 0 for out-of-vocabulary words
    ints = [embed_lookup.key_to_index.get(word, 0) for word in text]
    tokenized_news.append(ints)
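
The lookup step can be sketched without loading the model by using a hypothetical miniature table in place of `embed_lookup.key_to_index`. The index for 'news' is the one printed above; the other entries, in particular a real word sitting at index 0, are assumptions made for illustration.

```python
# hypothetical miniature lookup table; the real one is embed_lookup.key_to_index
key_to_index = {'in': 0, 'that': 2, 'the': 9, 'news': 283}

def tokenize(words, key_to_index, unknown=0):
    # look each word up, falling back to `unknown` when it is out of vocabulary
    return [key_to_index.get(w, unknown) for w in words]

words = ['the', 'news', 'in', 'mcfarland']
ids = tokenize(words, key_to_index)
print(ids)
```

If a real word does occupy index 0, it ends up with the same id as every unknown word (and as the padding added later), so the model cannot tell them apart. Reserving a separate index for unknown words would be one way around this, though it would also require an extra row in the embedding matrix.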

In [18]:
print('An example of a tokenized text: \n', tokenized_news[42])

An example of a tokenized text: 
 [177069, 23049, 378, 2326, 387, 348, 4042, 78551, 0, 234, 24411, 283, 5376, 0, 1790, 0, 24, 2165, 0, 7126, 128, 9, 0, 1874, 3226, 0, 127297, 3215, 9, 600, 0, 23049, 9, 8176, 2165, 8, 1465, 71, 410, 93, 23049, 9432, 0, 13, 713, 2, 3656, 42, 103, 13, 777, 0, 3268, 2090, 6, 127297, 43, 22, 71, 3381, 9, 73282, 600, 0, 88, 9, 60, 65033, 337, 0, 0, 341, 200, 3226, 232, 836, 26, 1280, 0, 9, 60, 65033, 337, 2731, 2, 64499, 0, 8, 330, 15, 14090, 2595, 0, 562, 9, 0, 6749, 6, 3529, 0, 321, 1578, 969, 4, 61858, 753, 2, 64499, 0, 8, 0, 374, 2369, 6, 61, 537, 264, 0, 9, 23049, 3226, 96, 172, 93, 0, 50, 20, 1538, 6, 9, 103824, 4994, 0, 0, 41, 46734, 3656, 97, 0, 1151, 513, 1801, 907, 0, 2873, 0, 2165, 0, 9, 9, 337, 16, 451, 2, 69, 8, 3653, 0, 9, 27631, 15090, 0, 9, 600, 2, 68710, 103, 95, 1131, 16, 9, 0, 1053, 9, 2165, 0, 121330, 0, 28, 664, 2639, 12, 23049, 9475, 348, 4042, 138, 20, 13048, 13, 0, 329, 387, 348, 6319, 88689, 0, 92, 6348, 329, 683, 0, 524, 0, 0, 329, 

#### Padding
Since I have already cut the texts to at most ```max_len``` words, I now left-pad with 0s all the texts that are shorter than this number. This brings all the texts to the same length. I end up with a 2D array with one row per news text and ```max_len``` columns.

In [19]:
pttexts = np.zeros((len(tokenized_news), max_len), dtype=int)
for i, tok_text in enumerate(tokenized_news):
    pttexts[i, -len(tok_text):] = tok_text
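
Left padding can also be sketched with plain lists, which makes one edge case explicit. In the NumPy cell above, an empty token list would give the slice `-0:`, which selects the whole row, and assigning an empty list to it would fail. The helper below (a hypothetical `left_pad`, not part of the notebook) handles that case by construction.

```python
def left_pad(tokens, max_len, pad=0):
    # truncate first, then put the padding in front of the tokens
    tokens = tokens[:max_len]
    return [pad] * (max_len - len(tokens)) + tokens

# hypothetical token lists: a short one, a long one and an empty one
batch = [[5, 6, 7], [1, 2, 3, 4, 5, 6], []]
padded = [left_pad(t, max_len=4) for t in batch]
print(padded)
```

Since texts shorter than 280 characters were dropped earlier, empty token lists are unlikely in this dataset, but the guard costs nothing.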

In [20]:
print("padded and tokenized first 11 texts up to first 10 words \n", pttexts[:11, :10])

padded and tokenized first 11 texts up to first 10 words 
 [[177069  23049     71  93409   2157     47  69404   1013     60     32]
 [     0      0      0      0      0      0      0      0      0      0]
 [     4  61858     13      8   1744      2    234 178446   4053  81995]
 [     4  68834    107 177069  23049    317      2     20     42     14]
 [  7705 205961    219     23    647  68834    107   1197      0  24121]
 [     0      0      0      0      0      0      0      0      0      0]
 [     0      0      0      0      0      0      0      0      0      0]
 [     0      0      0      0      0      0      0      0      0      0]
 [   124     63     19    965      9   6764   1732      9    562      2]
 [     0      0      0      0      0      0      0      0      0      0]
 [     0      0      0      0      0      0      0      0      0      0]]


### CNN model
Now I take the preprocessed data, split it into training, test and validation sets. Then I take the CNN model, load the data and train. 
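
One possible way to obtain the three sets is to shuffle the row indices once and slice them. This is only a sketch: the 80/10/10 proportions, the seed and the helper name `three_way_split` are assumptions, not choices made in this notebook.

```python
import random

def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    indices = list(range(n_rows))
    # a dedicated Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists can then be used to select rows of the padded array and the matching labels.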

***TO BE CONTINUED...***

### References
1. https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
2. https://cezannec.github.io/CNN_Text_Classification/
3. https://github.com/eyaler/word2vec-slim
