<a href="https://www.kaggle.com/code/ishanmitra96/nlp-fake-news-cnn-lstm-and-transformers?scriptVersionId=175112667" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading the dataset

In [None]:
true_df = pd.read_csv("/kaggle/input/fake-news-detection/true.csv")
true_df

In [None]:
fake_df = pd.read_csv("/kaggle/input/fake-news-detection/fake.csv")
fake_df

# Data Preprocessing

https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb

Preprocessing of true and fake datasets for data optimisation and removal of blank and duplicate entries.

## Verified News Dataset

In [None]:
# TRUE DATASET
# Checking for NaN values
true_df.isna().sum()

In [None]:
true_df['text']

In [None]:
# Download stopwords from nltk library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = list(set(stopwords.words('english')))

In [None]:
# Define function for most common words
# https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb
from collections import Counter


def most_common(corpus, nb_words):
    articles = corpus.str.split()
    # Explanation of nested list comprehension:
    # Iterate through every article in articles
    # Iterate through every word in article (second for in loop)
    # Add word to the np.array if it is not a stopword (nltk)
    words = np.array([word for article in articles for word in article if word.lower() not in stopwords])
    counter = Counter(words)
    d = pd.DataFrame(counter, index=['occurrences']).transpose().reset_index()
    d.columns = ['word', 'occurrences']
    d = d.sort_values('occurrences', ascending=False)
    return d[:nb_words]

In [None]:
# CPU-intensive, but it only takes a few seconds
most_common(true_df['text'], 10)

Even without computing the occurrences, we notice the format "CITY (Reuters) - " at the beginning of each article. This pattern is absent from all the articles in fake_df.
We delete this pattern wherever it appears within the first 50 characters of an article.
Leaving it in would introduce a bias: the model could learn that the presence of the word "Reuters" equates to verified news.
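
The dateline removal described above can be sketched on a single string. This is a minimal, dependency-free illustration rather than the notebook's own cell: `strip_dateline`, `DATELINE` and the sample sentence are hypothetical names and data chosen for the example.

```python
import re

# Hypothetical helper pattern: everything up to and including "(Reuters) - ".
# The parentheses are escaped so that they match literally.
DATELINE = re.compile(r"^.*?\(Reuters\)\s*-\s*")


def strip_dateline(text, window=50):
    """Remove a leading 'CITY (Reuters) - ' dateline found within the first `window` characters."""
    match = DATELINE.match(text[:window])
    return text[match.end():] if match else text


# Made-up sample sentence, not taken from the dataset
sample = "WASHINGTON (Reuters) - The head of a conservative group said on Monday."
print(strip_dateline(sample))
print(strip_dateline("No dateline here."))
```

Texts without the pattern are returned unchanged, so the function can be applied to every row without a try/except.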

In [None]:
import re

In [None]:
# Strip the leading "CITY (Reuters) - " dateline when it appears
# within the first 50 characters of an article
dateline = re.compile(r'\(Reuters\)\s*-\s*')  # parentheses escaped to match literally
for i in range(len(true_df)):
    match = dateline.search(true_df.loc[i, 'text'][:50])
    if match:
        # .loc avoids chained assignment, which pandas may apply to a copy
        true_df.loc[i, 'text'] = true_df.loc[i, 'text'][match.end():]

In [None]:
true_df

In [None]:
# Check for duplicates or blank articles
# Conversion to dataframe to reduce output text clutter
duplicate = true_df['text'].value_counts()[true_df['text'].value_counts()>1]
duplicate = duplicate.rename_axis('unique_values').reset_index(name='counts')
duplicate

In [None]:
# Number of redundant rows (every occurrence of a text beyond its first)
true_df.duplicated(subset=['text']).sum()

In [None]:
# Deletion of duplicates
true_df = true_df.drop_duplicates(subset=['text'], ignore_index=True)

In [None]:
true_df.shape

In [None]:
# Concatenate title and text column
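# A space after the full stop keeps the last title word and the first text word as separate tokens
true_df['article'] = true_df['title'] + '. ' + true_df['text']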

In [None]:
true_df = true_df.drop(['title', 'text'], axis=1)

In [None]:
true_df

For the later EDA, we keep the date column for plotting. The dates are converted to a uniform format.

In [None]:
true_df['date_len'] = [len(x) for x in true_df['date']]
print(true_df['date_len'].value_counts())
true_df = true_df.drop(['date_len'], axis=1)

In [None]:
true_df

In [None]:
from datetime import datetime

In [None]:
# All data format is uniform (month date, year)
# Unifying the date format to datetime
dates = []
for x in true_df['date']:
    date = datetime.strptime(x,'%B %d, %Y ')
    dates.append(date)
true_df['date'] = dates

In [None]:
true_df

In [None]:
true_df['date_len'] = [len(str(x)) for x in true_df['date']]
print(true_df['date_len'].value_counts())
true_df = true_df.drop(['date_len'], axis=1)

Finally, we create a label column, where 1 means the news is verified.

In [None]:
true_df['label'] = 1

In [None]:
true_df

## False News Dataset

We preprocess the articles in a similar manner as the true dataset

In [None]:
# FAKE DATASET
# Checking for NaN values
fake_df.isna().sum()

In [None]:
# Check for duplicates or blank articles
# Conversion to dataframe to reduce output text clutter
duplicate = fake_df['text'].value_counts()[fake_df['text'].value_counts()>1]
duplicate = duplicate.rename_axis('unique_values').reset_index(name='counts')
duplicate

In [None]:
fake_df['text'].value_counts()[fake_df['text'].value_counts()>1].sum() - 4927 - 626

There are 4927 rows of articles that have duplicates. The total number of duplicate articles is 5401. There are 626 blank articles in the database as well.
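
The quantities in this kind of count (rows whose text repeats, distinct texts that repeat, and blank rows) are easy to mix up. Below is a minimal sketch on toy data, assuming a plain list of strings instead of the dataframe column; all variable names and values are illustrative and unrelated to the real counts.

```python
from collections import Counter

# Toy data (assumption): " " stands for a blank article
texts = ["a", "b", "a", " ", "c", " ", " ", "b"]

counts = Counter(texts)
duplicated_values = {t: n for t, n in counts.items() if n > 1}

# Rows whose text occurs more than once
rows_with_duplicates = sum(duplicated_values.values())
# Distinct texts that repeat
distinct_duplicated = len(duplicated_values)
# Rows removed when one copy of each repeated text is kept
redundant_rows = rows_with_duplicates - distinct_duplicated
# Rows that contain only whitespace
blank_rows = sum(n for t, n in counts.items() if t.strip() == "")

print(rows_with_duplicates, distinct_duplicated, redundant_rows, blank_rows)
```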

In [None]:
# Values with no text - only title
blank = fake_df.loc[fake_df["text"] == duplicate["unique_values"][0]]
blank

In [None]:
blank.index

In [None]:
# Dropping the blank news from the fake dataframe
fake_df = fake_df.drop(blank.index)

In [None]:
# Verification of removal (verified)
fake_df

In [None]:
# Deletion of duplicates
fake_df = fake_df.drop_duplicates(subset=['text'], ignore_index=True)

In [None]:
fake_df

In [None]:
# Concatenate title and text column
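# A space after the full stop keeps the last title word and the first text word as separate tokens
fake_df['article'] = fake_df['title'] + '. ' + fake_df['text']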

In [None]:
fake_df = fake_df.drop(['title', 'text'], axis=1)

In [None]:
fake_df

We process the date column once again
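
The fake dataset mixes several date formats, which the parsing cell further down handles with nested try/except blocks. An alternative sketch, assuming the same three format strings, tries them in a loop and returns `None` for values that are not dates; `parse_date` and `DATE_FORMATS` are hypothetical names.

```python
from datetime import datetime

# Formats tried in order (assumption: the same three as in this notebook's parsing cell)
DATE_FORMATS = ("%B %d, %Y", "%d-%b-%y", "%b %d, %Y")


def parse_date(raw):
    """Return a datetime for the first matching format, or None if none match."""
    cleaned = raw.strip()  # tolerate leading or trailing spaces
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


print(parse_date("December 31, 2017 "))
print(parse_date("19-Feb-18"))
print(parse_date("not a date"))
```

Returning `None` instead of raising makes it possible to flag bad rows and drop them in one step.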

In [None]:
fake_df['date_len'] = [len(x) for x in fake_df['date']]
print(fake_df['date_len'].value_counts())

In [None]:
fake_df.loc[fake_df["date_len"] > 18]

In [None]:
fake_df.loc[fake_df["date_len"] > 18].article

These rows hold invalid date values and are not proper articles, so they will be dropped.

In [None]:
bad_date = fake_df.loc[fake_df["date_len"] > 18]

In [None]:
bad_date.index

In [None]:
# Dropping entries with bad date and no news from the fake dataframe
fake_df = fake_df.drop(index=bad_date.index)

In [None]:
# Verifying entries are dropped (verified)
fake_df

In [None]:
# All data format is uniform (month date, year)
# Unifying the date format to datetime
# Nested try except for ValueError exceptions in the fake dataset
dates = []
for x in fake_df['date']:
    try:
        date = datetime.strptime(x, '%B %d, %Y')
    except ValueError:
        try:
            date = datetime.strptime(x, '%d-%b-%y')
        except ValueError:
            date = datetime.strptime(x, '%b %d, %Y')
    dates.append(date)
fake_df['date'] = dates

In [None]:
# Verification: number of distinct dates after parsing
fake_df['date'].nunique()

All the dates are now parsed into a uniform datetime format. Finally, the label for the fake dataset is added, where 0 means fake news.

In [None]:
fake_df['label'] = 0

In [None]:
fake_df = fake_df.drop(['date_len'], axis=1)
fake_df

Finally, we concatenate both dataframes.

In [None]:
dataset = pd.concat([true_df, fake_df])
dataset = dataset.reset_index(drop=True)

In [None]:
dataset

# Exploratory Data Analysis

In [None]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
# Plotting histogram of article subjects
fig, hist = plt.subplots(figsize = (11,7))
hist = sns.histplot(data=dataset, x = 'subject', hue="label")

The subjects in fake news are different from the subjects in verified news, so the subject alone would reveal the label. The subject column will therefore not be taken into account.

In [None]:
# Plotting histogram of article dates
fig, hist = plt.subplots(figsize = (11,7))
hist = sns.histplot(data=dataset, x = 'date', hue="label")

The date distributions of verified and fake articles differ. Fake articles date back to before 2016, whereas verified articles are only recorded from 2016 onwards. Also, most of the verified articles in the dataset seem to be dated between Sep 2017 and Jan 2018.
The date will therefore not be taken into account.

In [None]:
# Text analysis
av_t = dataset[dataset['label'] == 1]['article'].apply(lambda x: len(x)).mean()
av_f = dataset[dataset['label'] == 0]['article'].apply(lambda x: len(x)).mean()
av = pd.DataFrame(data = {'Average character length': [av_t, av_f], 'Label':['True', 'False']})
fig, bar = plt.subplots(figsize = (11,7))
bar = sns.barplot(y='Average character length', x='Label', data=av)

The average article length in characters is almost the same for fake and true news.

In [None]:
# Characters length of articles
len_cha_true = dataset[dataset['label'] == 1]['article'].apply(lambda x: len(x))
len_cha_fake = dataset[dataset['label'] == 0]['article'].apply(lambda x: len(x))

norm_weights_true = np.ones(len(len_cha_true))/len(len_cha_true)
norm_weights_fake = np.ones(len(len_cha_fake))/len(len_cha_fake)

bins = [i * 1000 for i in range(0,31)]

fig, (hist1, hist2) = plt.subplots(1,2, figsize = (11,7))
hist1.hist(len_cha_true, bins = bins, weights = norm_weights_true, color = 'C0')
hist1.set_ylim(0, top=0.4)
hist1.set_xlim(0, 30000)
hist1.set_xlabel('Number of characters')
hist1.set_ylabel('Proportion of articles')
hist1.set_title('True texts')

hist2.hist(len_cha_fake, bins = bins, weights = norm_weights_fake, color = 'C1')
hist2.set_ylim(0, top=0.4)
hist2.set_xlim(0, 30000)
hist2.set_xlabel('Number of characters')
hist2.set_ylabel('Proportion of articles')
hist2.set_title('False texts');

In [None]:
# Number of words per article
len_w_true = dataset[dataset['label'] == 1]['article'].str.split().map(lambda x: len(x))
len_w_fake = dataset[dataset['label'] == 0]['article'].str.split().map(lambda x: len(x))

norm_weights_true = np.ones(len(len_w_true))/len(len_w_true)
norm_weights_fake = np.ones(len(len_w_fake))/len(len_w_fake)

bins_ = [i * 200 for i in range(0,26)]

fig, (hist1, hist2) = plt.subplots(1,2, figsize = (11,7))
hist1.hist(len_w_true, bins = bins_, weights = norm_weights_true, color = 'C0')
hist1.set_ylim(0, top=0.4)
hist1.set_xlim(0, 5000)
hist1.set_xlabel('Number of words / article')
hist1.set_ylabel('Proportion of articles')
hist1.set_title('True texts')
hist2.hist(len_w_fake, bins = bins_, weights = norm_weights_fake, color = 'C1')
hist2.set_ylim(0, top=0.4)
hist2.set_xlim(0, 5000)
hist2.set_xlabel('Number of words / article')
hist2.set_ylabel('Proportion of articles')
hist2.set_title('False texts');

The normalized distributions of both words per article and characters per article are fairly similar between the two labels.

Next, we define functions for preprocessing: removal of stopwords and punctuation, and sanitization of web elements in the articles (useful for LSTM model training).

In [None]:
from string import punctuation
punctuation = list(punctuation)

In [None]:
from tqdm.notebook import tqdm

In [None]:
# https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb
# A set gives constant-time membership tests in remove_stop_punct
stop = set(stopwords + punctuation + ['’', '“', '”', '‘', '...'])
tqdm.pandas()

def lowerizer(article):
    """
    Lowercase a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article.lower() (str) : lowercased text
    """
    return article.lower()

def remove_html(article):
    """
    Remove HTML comments (<!-- ... -->) from a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text cleaned of HTML comments
    """
    article = re.sub("(<!--.*?-->)", "", article, flags=re.DOTALL)
    return article

def remove_url(article):
    """
    Remove URL tags from a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text cleaned of URL tags
    """
    article = re.sub(r'https?:\/\/.\S+', "", article)
    return article

def remove_hashtags(article):
    """
    Remove hashtag symbols (#) from a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text cleaned of hashtag symbols
    """
    article = re.sub("#"," ",article)
    return article

def remove_a(article):
    """
    Remove twitter account reference symbols (@) from a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text without twitter account reference symbols
    """
    article = re.sub("@"," ",article)
    return article

def remove_brackets(article):
    """
    Remove square brackets from a given text 
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text without square brackets
    """
    # Raw string avoids invalid escape sequence warnings
    article = re.sub(r'\[[^]]*\]', '', article)
    return article

def remove_stop_punct(article):
    """
    Remove punctuation and stopwords from a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : text without punctuation or stopwords
    """
    final_article = []
    for i in article.split():
        if i not in stop:
            final_article.append(i.strip())
    return " ".join(final_article)

def preprocessing(article, lowerizer_=False, remove_web=True, remove_brackets_=False, remove_stop_punct_=False):
    """
    Applies the above-defined steps to clean a given text
    ----
    Inputs : 
      article (str) : text to be pre-processed
    Outputs : 
      article (str) : pre-processed text
    """
    
    if lowerizer_:
        article = lowerizer(article)
    if remove_web:
        article = remove_html(article)
        article = remove_url(article)
        article = remove_hashtags(article)
        article = remove_a(article)
    if remove_brackets_:
        article = remove_brackets(article)
    if remove_stop_punct_:
        article = remove_stop_punct(article)
    return article

In [None]:
dataset['article_lstm'] = dataset['article'].progress_apply(
    lambda x : preprocessing(x, lowerizer_=True, remove_brackets_=True, remove_stop_punct_=True))

In [None]:
dataset

# Model prediction

## Data Preparation

In [None]:
# Declaring constants
SEED = 42
MAX_LENGTH = 100

We shuffle the dataset and then split the dataset into training, validation and test datasets.

In [None]:
# Shuffling the dataset
shuffled_data = dataset.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = shuffled_data['article_lstm']
y = shuffled_data['label']

In [None]:
# Train-Validation-Test set split into 80:10:10 ratio
train_X, temp_X, train_y, temp_y = train_test_split(X, y, random_state=SEED, test_size=0.2, stratify=y)
# Validation-Test split
valid_X, test_X, valid_y, test_y = train_test_split(temp_X, temp_y, random_state=SEED, test_size=0.5, stratify=temp_y)

In [None]:
# Return size of the split datasets
(len(train_X), len(train_y)), (len(valid_X), len(valid_y)), (len(test_X), len(test_y))

In [None]:
# Defining a function to ascertain balance of true and fake news
# https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb
def distribution_data(corpus): 
    """
    Returns number of fake and true news in a given dataset
    ----
    Inputs : 
    corpus (array) : labels of our dataset
    Outputs : 
    distrib (pd.DataFrame) : number of true and fake news in the dataset 
    """
    nb_true = corpus.sum()
    nb_false = len(corpus) - nb_true
    distrib = pd.DataFrame(data = {'Number of samples': [nb_true, nb_false], 'Label':['True', 'False']})
    return distrib

In [None]:
distrib = distribution_data(dataset['label'])
fig, bar = plt.subplots(figsize = (2,4))
bar = sns.barplot(y='Number of samples', x='Label',data=distrib);

There are more verified articles than fake ones in the dataset, but the classes are balanced enough for the model to be trained.

In [None]:
train_distrib = distribution_data(train_y)
valid_distrib = distribution_data(valid_y)
test_distrib = distribution_data(test_y)

In [None]:
fig, (bar1, bar2, bar3) = plt.subplots(1,3, figsize = (11,7))
sns.barplot(y='Number of samples', x='Label',data=train_distrib, ax = bar1)
sns.barplot(y='Number of samples', x='Label',data=valid_distrib, ax = bar2)
sns.barplot(y='Number of samples', x='Label',data=test_distrib, ax = bar3)
bar1.set_title("Training")
bar2.set_title("Validation")
bar3.set_title("Testing");

The same can be said for the shuffled and split datasets: because the split is stratified, they share the same ratio of true to fake news.
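
To see why a stratified split preserves the class ratio, here is a minimal pure-Python sketch. It is not how `train_test_split` is implemented; `stratified_split` is a hypothetical helper, and the toy labels (60% true, 40% fake) are an assumption for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split indices into train/valid/test parts while keeping each label's share constant."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, valid, test = [], [], []
    # Split every class separately, then merge the parts
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_valid = int(len(indices) * ratios[1])
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


labels = [1] * 60 + [0] * 40  # toy labels (assumption)
train, valid, test = stratified_split(labels)
share_true = [sum(labels[i] for i in part) / len(part) for part in (train, valid, test)]
print(len(train), len(valid), len(test), share_true)
```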

A class is created that builds the vocabulary from the training dataset and creates two dictionaries: indexes to words and words to indexes.  
Each cleaned article is tokenized using nltk, converted to indexes, turned into a tensor and cut or padded to a common length.
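
The steps above can be sketched without PyTorch on pre-tokenized toy data. This is a simplified illustration under stated assumptions: `build_vocab` and `encode` are hypothetical stand-ins, tokens are given as lists, and every sequence is padded to exactly `max_length`.

```python
from collections import Counter

PAD_IDX = 0  # index 0 is reserved for padding


def build_vocab(tokenized_corpus, min_freq=1):
    """Map each sufficiently frequent word to an index starting at 1; 'UNK' comes last."""
    counts = Counter(word for doc in tokenized_corpus for word in doc)
    words = [w for w, c in counts.most_common() if c >= min_freq] + ["UNK"]
    return {w: i + 1 for i, w in enumerate(words)}


def encode(tokens, word2idx, max_length):
    """Convert tokens to indices, truncate to max_length, then right-pad with PAD_IDX."""
    ids = [word2idx.get(t, word2idx["UNK"]) for t in tokens][:max_length]
    return ids + [PAD_IDX] * (max_length - len(ids))


# Toy corpus (assumption)
corpus = [["fake", "news", "spreads"], ["news", "travels", "fast"]]
word2idx = build_vocab(corpus)
encoded = encode(["breaking", "news"], word2idx, max_length=4)
print(word2idx)
print(encoded)
```

Words that never appeared in the training corpus, such as "breaking" here, fall back to the `UNK` index, which is why validation and test data must reuse the training vocabulary.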

In [None]:
import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
# Importing nltk (yet again)
import nltk
from nltk import word_tokenize
nltk.download('punkt')

In [None]:
# https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb
# Definition of a class that converts the words to indexes, transforms them to tensors and cuts/pads them to the desired sequence length
# These functions are automated when using HuggingFace tokenizers

class TextClassificationDataset(Dataset):
    def __init__(self, data, categories, vocab = None, max_length = 100, min_freq = 5):
        
        self.data = data
        self.max_length = max_length
        
        # Allow to import a vocabulary (validation and testing will use the training vocabulary)
        if vocab is not None:
            self.word2idx, self.idx2word = vocab
        else:
            # Build the vocabulary if none is imported
            self.word2idx, self.idx2word = self.build_vocab(self.data, min_freq)
        
        # We tokenize the articles
        tokenized_data = [word_tokenize(file.lower()) for file in self.data]
        # Transform words into lists of indexes
        indexed_data = [[self.word2idx.get(word, self.word2idx['UNK']) for word in file] for file in tokenized_data]
        # Transform into a list of Pytorch LongTensors
        tensor_data = [torch.LongTensor(file) for file in indexed_data]
        # Labels are passed into a FloatTensor (adding to_numpy() to resolve ValueError)
        # ValueError: could not determine the shape of object type 'Series'
        tensor_y = torch.FloatTensor(categories.to_numpy())
        # Finally we cut to the determined maximum length
        cut_tensor_data = [tensor[:max_length] for tensor in tensor_data]
        # We pad the sequences to have the whole dataset containing sequences of the same length
        self.tensor_data = pad_sequence(cut_tensor_data, batch_first=True, padding_value=0)
        self.tensor_y = tensor_y
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_data[idx], self.tensor_y[idx] 
    
    def build_vocab(self, corpus, count_threshold):
        word_counts = {}
        for sent in tqdm(corpus):
            for word in word_tokenize(sent.lower()):
                if word not in word_counts:
                    word_counts[word] = 0
                word_counts[word] += 1   
        filtered_word_counts = {word: count for word, count in word_counts.items() if count >= count_threshold}        
        words = sorted(filtered_word_counts.keys(), key=word_counts.get, reverse=True) + ['UNK']
        word_index = {words[i] : (i+1) for i in range(len(words))}
        idx_word = {(i+1) : words[i] for i in range(len(words))}
        return word_index, idx_word
    
    def get_vocab(self):
        return self.word2idx, self.idx2word

We start preprocessing the training data, and after building the vocabulary, it is used to prepare the validation and testing dataset.

In [None]:
training_dataset = TextClassificationDataset(train_X, train_y, max_length=MAX_LENGTH)

In [None]:
training_word2idx, training_idx2word = training_dataset.get_vocab()
valid_dataset = TextClassificationDataset(valid_X, valid_y, (training_word2idx, training_idx2word), max_length=MAX_LENGTH)
test_dataset = TextClassificationDataset(test_X, test_y, (training_word2idx, training_idx2word), max_length=MAX_LENGTH)

The datasets are then passed into a DataLoader

In [None]:
training_dataloader = DataLoader(training_dataset, batch_size = 200, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size = 25)
test_dataloader = DataLoader(test_dataset, batch_size = 25)

# Embeddings and Model Definition

A pretrained GloVe embedding with 300 features per word vector is loaded, which had been trained on Wikipedia and Gigaword.  
GloVe leverages global word-to-word co-occurrence counts, in contrast to Word2vec, which leverages co-occurrence within a local window of neighboring words.
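
To make "global co-occurrence counts" concrete, the sketch below aggregates word-pair counts over a whole toy corpus. It only illustrates the counting step and omits the rest of the GloVe procedure (for example distance weighting and fitting the vectors); `cooccurrence_counts` and the corpus are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(tokenized_corpus, window=2):
    """Count how often each ordered word pair appears within `window` tokens, over the whole corpus."""
    counts = Counter()
    for tokens in tokenized_corpus:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (assumption)
corpus = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "house", "passed", "the", "bill"],
]
counts = cooccurrence_counts(corpus)
print(counts[("passed", "bill")], counts[("senate", "house")])
```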

In [None]:
import gensim.downloader as api
loaded_glove_model = api.load("glove-wiki-gigaword-300")
loaded_glove_embeddings = loaded_glove_model.vectors

After the word-to-index dictionary is created, it is passed into the GloVe embedding function. Its inner workings are commented in the function definition.
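
The idea is to build a matrix whose row `i` holds the pretrained vector of the word with vocabulary index `i`, leaving zeros for the padding row and for words without a pretrained vector. Below is a minimal sketch with plain lists, assuming a toy dictionary in place of the GloVe model; `build_embedding_matrix` and `toy_vectors` are hypothetical.

```python
def build_embedding_matrix(word2idx, pretrained, dim):
    """Return rows where row i holds the vector of the word with index i (zeros if unknown)."""
    matrix = [[0.0] * dim for _ in range(len(word2idx) + 1)]  # +1 for the padding row 0
    for word, idx in word2idx.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
    return matrix


toy_vectors = {"news": [0.1, 0.2], "fake": [0.3, 0.4]}  # hypothetical stand-in for GloVe
word2idx = {"news": 1, "fake": 2, "UNK": 3}
matrix = build_embedding_matrix(word2idx, toy_vectors, dim=2)
print(matrix)
```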

In [None]:
# https://github.com/remydeshayes/NLP_Pytorch/blob/main/Notebook%20-%20Fake_News%20Detection%20Pytorch%20-%20Billiot_Deshayes.ipynb

def get_glove_adapted_embeddings(glove_model, input_voc):
    """
    Retrieve the GloVe embeddings of a vocabulary's words
    ----
    Inputs : 
    glove_model () : GloVe Embedding model
    input_voc (dict) : dictionary of our indexed vocabulary 
    Outputs : 
    embeddings (ndarray) : GloVe Embeddings for the given vocabulary with the vocabulary index
    """
    # Create a dict: Get the corresponding GloVe vocabulary from the word index
    keys = {i: glove_model.key_to_index[w] if w in glove_model.key_to_index else None for w, i in input_voc.items()}
    # Create a dict of index corresponding to index of the word in GloVe
    index_dict = {i: key for i, key in keys.items() if key is not None}
    # Create a matrix to retrieve GloVe vectors
    embeddings = np.zeros((len(input_voc)+1,glove_model.vectors.shape[1]))
    # Populate the matrix with the vectors and return this matrix
    for i, ind in index_dict.items():
        embeddings[i] = glove_model.vectors[ind]
    return embeddings

In [None]:
GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, training_word2idx)

The training model is now defined: an embedding layer of dimension (vocab_size + 1, 300) initialized with the GloVe vectors, two stacked LSTM layers with hidden size 256, and a fully connected layer that takes the concatenated final hidden states of both LSTM layers.

In [None]:
class LSTMModel(nn.Module):

    def __init__(self, embedding_dim, vocabulary_size, hidden_dim, embeddings=None, fine_tuning=False):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        if embeddings:
            self.embeddings = nn.Embedding.from_pretrained(torch.FloatTensor(GloveEmbeddings), freeze=not fine_tuning, padding_idx=0)
        else:
            self.embeddings = nn.Embedding(num_embeddings=vocabulary_size+1, embedding_dim=embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True, num_layers=2)
        self.linear = nn.Linear(in_features=2*hidden_dim, out_features=1)

    def forward(self, inputs):
        emb = self.embeddings(inputs)
        lstm_out, (ht, ct) = self.lstm(emb, None)
        h = torch.cat((ht[-2], ht[-1]), dim=1)
        # Squeeze only the feature dimension so that a batch of size 1 keeps its batch dimension
        x = self.linear(h).squeeze(1)
        return x

# Training

Three functions are defined: `train_epoch`, which iterates through the training DataLoader object and performs a complete forward and backward pass for every batch; `eval_model`, which performs only a forward pass for every batch of the validation/testing dataset; and `experiment`, which runs both over the chosen number of epochs for the entire training session.

In [None]:
def train_epoch(model, opt, criterion, dataloader):
    """
    Trains the model over one epoch
    ----
    Inputs : 
    model () : defined model to be trained
    opt () : chosen and defined optimizer 
    criterion () : chosen and defined loss
    dataloader() : iterable object with the batches
    Outputs : 
    losses (list) : list of training loss for each batch of the epoch
    accs (list) : list of training accuracy for each batch of the epoch
    """
    model.train()
    losses = []
    accs = []
    for i, (x, y) in enumerate(dataloader):
        opt.zero_grad()
        # Forward pass
        pred = model(x)
        # Loss Computation
        loss = criterion(pred, y)
        # Backward pass
        loss.backward()
        # Weights update
        opt.step()
        losses.append(loss.item())
        # Compute accuracy
        num_corrects = sum((torch.sigmoid(pred)>0.5) == y)
        acc = 100.0 * num_corrects/len(y)
        accs.append(acc.item())
        if (i%20 == 0):
            print("Batch " + str(i) + " : training loss = " + str(loss.item()) + "; training acc = " + str(acc.item()))
    return losses, accs

In [None]:
def eval_model(model, criterion, evalloader):
    """
    Evaluate the model  
    ----
    Inputs : 
    model () : defined model to be trained
    criterion () : chosen and defined loss
    evalloader() : iterable object with the batches 
    Outputs : 
    total_epoch_loss/(i+1) (float) : computed loss 
    total_epoch_acc/(i+1) (float) : computed accuracy 
    preds (list) : predictions made by the model
    """
    model.eval()
    total_epoch_loss = 0
    total_epoch_acc = 0
    preds = []
    with torch.no_grad():
        for i, (x, y) in enumerate(evalloader):
            pred = model(x)
            loss = criterion(pred, y)
            num_corrects = sum((torch.sigmoid(pred)>0.5) == y)
            acc = 100.0 * num_corrects/len(y)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()
            preds.append(pred)

    return total_epoch_loss/(i+1), total_epoch_acc/(i+1), preds

In [None]:
def experiment(model, opt, criterion, num_epochs = 5):
    """
    Trains & Evaluates the model over all epochs 
    ----
    Inputs : 
    model () : defined model to be trained
    opt () : chosen and defined optimizer 
    criterion () : chosen and defined loss
    num_epochs() : chosen number of epochs to go through
    Outputs : 
    train_losses (list): training losses of all batches for each epoch
    valid_losses (list): losses over validation data for all epochs
    test_loss (float): loss over test data once the model is trained 
    train_accs (list): training accuracies of all batches for each epoch
    valid_accs (list): accuracies over validation data for all epochs
    test_acc (float): accuracy over test data once the model is trained
    test_preds (): predictions on test dataset
    """
    train_losses = []
    valid_losses = []
    train_accs = []
    valid_accs = []
    print("Beginning training...")
    for e in range(num_epochs):
        print("Epoch " + str(e+1) + ":")
        losses, accs = train_epoch(model, opt, criterion, training_dataloader)
        train_losses.append(losses)
        train_accs.append(accs)
        valid_loss, valid_acc, val_preds = eval_model(model, criterion, valid_dataloader)
        valid_losses.append(valid_loss)
        valid_accs.append(valid_acc)
        print("Epoch " + str(e+1) + " : Validation loss = " + str(valid_loss) + "; Validation acc = " + str(valid_acc))
    test_loss, test_acc, test_preds = eval_model(model, criterion, test_dataloader)
    print("Test loss = " + str(test_loss) + "; Test acc = " + str(test_acc))
    return train_losses, valid_losses, test_loss, train_accs, valid_accs, test_acc, test_preds

## Training with LSTM Model

The model is defined, the optimizer used is Adam and the loss function used is binary cross entropy - since the task is binary classification.
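
`nn.BCEWithLogitsLoss` combines a sigmoid with binary cross entropy, so the model can output raw logits. For a single sample the loss can be sketched in plain Python using the usual numerically stable form; `bce_with_logits` and `predict` are hypothetical helpers for illustration, not PyTorch functions.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross entropy computed directly from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def predict(logit):
    """sigmoid(logit) > 0.5 is equivalent to logit > 0."""
    return int(logit > 0)


# A confident correct prediction costs little, a confident wrong one costs a lot
print(bce_with_logits(10.0, 1.0), bce_with_logits(-10.0, 1.0))
print(predict(2.0), predict(-2.0))
```

This is also why the accuracy computation applies `torch.sigmoid` before thresholding at 0.5: the model itself never applies the sigmoid.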

In [None]:
# Setting the hyperparameters of the model
EMBEDDING_DIM = 300
VOCAB_SIZE = len(training_word2idx)
HIDDEN_DIM = 256
learning_rate = 0.0025
num_epochs = 10

In [None]:
model_lstm = LSTMModel(EMBEDDING_DIM, VOCAB_SIZE, HIDDEN_DIM,  embeddings=True, fine_tuning=False)
opt = optim.Adam(model_lstm.parameters(), lr=learning_rate, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()

In [None]:
train_losses_lstm, valid_losses_lstm, test_loss_lstm, train_accs_lstm, valid_accs_lstm, test_acc_lstm, test_preds_lstm = experiment(model_lstm, opt, criterion, num_epochs)

A test accuracy of 99.63% is high, so the model weights are saved.

In [None]:
torch.save(model_lstm, '/kaggle/working/lstm_saved_weights.pt')
torch.save(model_lstm.state_dict(), '/kaggle/working/lstm_state_dict.pt')

The training and validation accuracy and loss are plotted against the number of epochs.

In [None]:
import statistics
from statistics import mean

In [None]:
train_losses = [mean(train_loss) for train_loss in train_losses_lstm]
train_accs = [mean(train_acc) for train_acc in train_accs_lstm]

In [None]:
epochs = [i for i in range(num_epochs)]
fig , ax = plt.subplots(1,2)
fig.set_size_inches(20,10)

ax[0].plot(epochs , train_accs , 'C0o-' , label = 'Training Accuracy (LSTM)')
ax[0].plot(epochs , valid_accs_lstm , 'C1o-' , label = 'Validation Accuracy (LSTM)')
ax[0].set_title('Training & Validation Accuracy (LSTM)')
ax[0].legend()
ax[0].set_xlabel("Epochs")
ax[0].set_ylabel("Accuracy")

ax[1].plot(epochs , train_losses , 'C0o-' , label = 'Training Loss (LSTM)')
ax[1].plot(epochs , valid_losses_lstm , 'C1o-' , label = 'Validation Loss (LSTM)')
ax[1].set_title('Training & Validation Loss (LSTM)')
ax[1].legend()
ax[1].set_xlabel("Epochs")
ax[1].set_ylabel("Loss")
plt.show()

## Training with CNN

A CNN model is defined to compare its performance against the LSTM. The CNN model has a similar embedding layer of size (vocab_size + 1, embedding_dim = 300), a 1D convolutional layer (300 input channels, 64 filters, kernel size 16) with a ReLU activation function, a global max-pooling layer, a dropout layer with probability 0.5 and a fully connected layer of size (64, 1).
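
The convolution and pooling steps can be illustrated for a single filter in plain Python. This is a sketch under stated assumptions (stride 1, no padding, no dilation, sequences of exactly 100 tokens); the helper names and the toy sequence are hypothetical.

```python
def conv1d_output_length(seq_len, kernel_size, stride=1, padding=0):
    """Length of the feature map produced by a 1D convolution without dilation."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1


def conv1d_single_filter(sequence, kernel):
    """Slide one kernel over a 1D sequence (stride 1, no padding) and return the responses."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


def relu_then_global_max(values):
    """Apply ReLU, then keep the single largest activation (global max pooling)."""
    return max(max(v, 0.0) for v in values)


# With 100 tokens and a window of 16, each filter produces 85 positions
positions = conv1d_output_length(seq_len=100, kernel_size=16)
response = conv1d_single_filter([1.0, -2.0, 3.0, 0.5], [1.0, 1.0])
pooled = relu_then_global_max(response)
print(positions, response, pooled)
```

Global max pooling reduces each filter to one number, which is why the fully connected layer takes 64 inputs regardless of the sequence length.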

In [None]:
from torch.nn import functional as F

class CNNModel(nn.Module):
    def __init__(self, embedding_dim, vocabulary_size, window_size: int = 16, filter_multiplier = 64, embeddings = None, fine_tuning = False):
        super().__init__()
        self.embedding_dim = embedding_dim 
        if embeddings:
            self.embeddings = nn.Embedding.from_pretrained(torch.FloatTensor(GloveEmbeddings), freeze=not fine_tuning, padding_idx=0)
        else:
            self.embeddings = nn.Embedding(num_embeddings=vocabulary_size+1, embedding_dim=embedding_dim, padding_idx=0)

        self.conv1d = nn.Conv1d(embedding_dim, filter_multiplier, window_size)
        self.dropout = nn.Dropout(0.5)
        self.linear = nn.Linear(filter_multiplier, 1)

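    def forward(self, inputs):
        x = self.embeddings(inputs)                 # (batch, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)                      # (batch, embedding_dim, seq_len), as Conv1d expects
        x = self.conv1d(x)                          # (batch, filters, seq_len - window_size + 1)
        x = F.relu(x)
        x = F.max_pool1d(x, x.shape[2]).squeeze(2)  # global max-pool over time: (batch, filters)
        x = self.dropout(x)
        # squeeze only the last dimension so that a batch of one keeps its batch axis
        output = self.linear(x).squeeze(1)          # (batch,)

        return output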

The Adam optimizer and the Binary Cross Entropy loss criterion are kept the same as before.
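
`nn.BCEWithLogitsLoss` expects raw logits, which is why the model above ends in a linear layer without a sigmoid. The sketch below reproduces the per-sample loss in plain Python, using the standard numerically stable formulation; the function names and the example logit are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy computed directly from a raw logit.

    max(x, 0) - x*y + log(1 + exp(-|x|)) equals
    -[y*log(s) + (1 - y)*log(1 - s)] with s = sigmoid(x),
    but never takes the log of a value rounded to zero.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit: float, target: float) -> float:
    s = sigmoid(logit)
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))


# Illustrative logit; both formulations agree for moderate values.
for target in (0.0, 1.0):
    print(target, bce_with_logits(2.0, target), naive_bce(2.0, target))
```

A side effect of working with logits is that thresholding `sigmoid(logit) > 0.5` is the same decision as `logit > 0`.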

In [None]:
model_cnn = CNNModel(300, len(training_word2idx), 16, 64, embeddings = True, fine_tuning=True)
optimizer = optim.Adam(model_cnn.parameters(), lr=0.0025, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()

In [None]:
train_losses_cnn, valid_losses_cnn, test_loss_cnn, train_accs_cnn, valid_accs_cnn, test_acc_cnn, test_preds_cnn = experiment(model_cnn, optimizer, criterion, num_epochs=7)

In [None]:
train_losses_cnn = [mean(train_loss) for train_loss in train_losses_cnn]
train_accs_cnn = [mean(train_acc) for train_acc in train_accs_cnn]

In [None]:
epochs = [i for i in range(7)] # epochs is 7
fig , ax = plt.subplots(1,2)
fig.set_size_inches(20,10)

ax[0].plot(epochs , train_accs_cnn , 'C0o-' , label = 'Training Accuracy (CNN)')
ax[0].plot(epochs , valid_accs_cnn , 'C1o-' , label = 'Validation Accuracy (CNN)')
ax[0].set_title('Training & Validation Accuracy (CNN)')
ax[0].legend()
ax[0].set_xlabel("Epochs")
ax[0].set_ylabel("Accuracy")

ax[1].plot(epochs , train_losses_cnn , 'C0o-' , label = 'Training Loss (CNN)')
ax[1].plot(epochs , valid_losses_cnn , 'C1o-' , label = 'Validation Loss (CNN)')
ax[1].set_title('Training & Validation Loss (CNN)')
ax[1].legend()
ax[1].set_xlabel("Epochs")
ax[1].set_ylabel("Loss")
plt.show()

The CNN model has overfitted to the training data: compared with the LSTM model, its validation accuracy does not converge as closely to its training accuracy.

# Evaluation of LSTM Model

The LSTM model is evaluated using confusion matrix and classification report metrics.
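
For reference, the figures in a classification report follow directly from the confusion matrix. The sketch below derives them for the positive class of a 2x2 matrix laid out as scikit-learn does (rows are actual classes, columns are predictions); the counts are made up for illustration and are not results from this notebook.

```python
def binary_metrics(cm):
    """cm is [[tn, fp], [fn, tp]]: rows are actual classes, columns are predictions."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts, purely illustrative.
example_cm = [[50, 10],
              [5, 35]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```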

In [None]:
preds = [(torch.sigmoid(t)>0.5).tolist() for t in test_preds_lstm]
preds = [int(t) for el in preds for t in el]

In [None]:
# Predictions
preds[:20]

In [None]:
# Ground Truth
test_y[:20]

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_y, preds)
cm_ = pd.DataFrame(cm , index = ['True','Fake'] , columns = ['True','Fake'])

In [None]:
plt.figure(figsize = (6,6))
sns.heatmap(cm,cmap= "Blues", linecolor = 'black' , linewidth = 1 , annot = True, fmt='' , xticklabels = ['True','Fake'] , yticklabels = ['True','Fake'])
plt.xlabel("Predicted")
plt.ylabel("Actual");

In [None]:
plt.figure(figsize = (6,6))
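# express each cell as a percentage of all test samples (cm.sum() avoids hard-coding the test-set size)
sns.heatmap(cm / cm.sum() * 100,cmap= "Blues", linecolor = 'black' , linewidth = 1 , annot = True, fmt='.2f' , xticklabels = ['True','Fake'] , yticklabels = ['True','Fake'])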
plt.xlabel("Predicted")
plt.ylabel("Actual");

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_y, preds, target_names = ['Predicted Fake','Predicted True']))

# Thoughts

Binary classification of fake news can nowadays also be tackled with Transformers. For the LSTM and CNN models, GloVe embeddings were used to convert words into their corresponding vectors. GloVe leverages global word-to-word co-occurrences, which affect the final word vector. This is loosely comparable to how the attention layers of Transformer models capture word context in a given passage of text, although GloVe vectors are fixed per word while attention produces context-dependent representations.  
Although the CNN model came close in validation accuracy, its overfitting made it a poorer architecture than the LSTM model. A recurrent model therefore appears better suited to this task than the CNN. A plain RNN would, however, run into the problem of vanishing or exploding gradients. Since the LSTM addresses this problem, it would be unwise to reinvent the wheel.

# Training on BERT

The data is preprocessed to suit the BERT tokenizer and the BERT Transformer model. As most of the preprocessing has already been done, the remaining steps are straightforward.

In [None]:
shuffled_data['article_bert'] = shuffled_data['article'].progress_apply(
    lambda x : preprocessing(x, lowerizer_=False, remove_web=True, remove_brackets_=False, remove_stop_punct_=False))

In [None]:
shuffled_data

## Split the training dataset

In [None]:
bert_X = shuffled_data['article_bert']
bert_y = shuffled_data['label']

In [None]:
# Train-Validation-Test set split into 80:10:10 ratio
# (stratify on the BERT labels rather than on variables left over from the earlier split)
b_train_X, b_temp_X, b_train_y, b_temp_y = train_test_split(bert_X, bert_y, random_state=SEED, test_size=0.2, stratify=bert_y)
# Validation-Test split
b_valid_X, b_test_X, b_valid_y, b_test_y = train_test_split(b_temp_X, b_temp_y, random_state=SEED, test_size=0.5, stratify=b_temp_y)

In [None]:
# Return size of the split datasets
(len(b_train_X), len(b_train_y)), (len(b_valid_X), len(b_valid_y)), (len(b_test_X), len(b_test_y))

## Import BERT Model and Tokenizer

In [None]:
import transformers
from transformers import AutoModel, BertTokenizerFast

In [None]:
import torch

In [None]:
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

## BERT Tokenization of the split datasets

In [None]:
# Tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
    b_train_X.tolist(),
    max_length = MAX_LENGTH,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)

# Tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    b_valid_X.tolist(),
    max_length = MAX_LENGTH,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)

# Tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    b_test_X.tolist(),
    max_length = MAX_LENGTH,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)
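
The tokenizer calls above bring every article to exactly `MAX_LENGTH` token IDs and return an attention mask that marks which positions hold real tokens. A simplified sketch of that padding and truncation step is shown below. `pad_or_truncate` is a hypothetical helper, the token IDs are illustrative, and the real tokenizer additionally takes care of the special `[CLS]` and `[SEP]` tokens when truncating, which this sketch does not.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], max_length: int,
                    pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]           # truncation: drop everything past max_length
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Illustrative token IDs for a short and a long input.
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=4)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside the input IDs in the forward pass.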

## Convert Integer Sequences to Tensors

In [None]:
# For train set
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
b_train_y_ = torch.tensor(b_train_y.tolist())

# For validation set
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
b_val_y_ = torch.tensor(b_valid_y.tolist())

# For test set
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
b_test_y_ = torch.tensor(b_test_y.tolist())

## Create DataLoaders

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [None]:
# Define a batch size
batch_size = 32
torch.manual_seed(SEED)

# train_data
train_data = TensorDataset(train_seq, train_mask, b_train_y_)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# val_data
val_data = TensorDataset(val_seq, val_mask, b_val_y_)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)

# test_data
test_data = TensorDataset(test_seq, test_mask, b_test_y_)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler = test_sampler, batch_size=batch_size)
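
The training and evaluation functions further down average the loss over `len(dataloader)`, which is the number of batches rather than the number of samples. A small sketch of that arithmetic follows. It assumes that 3864, the divisor used in the percentage heatmaps, is the test-set size; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields for a dataset of n_samples items."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


# Assumption: 3864 test samples (the divisor used in the percentage heatmaps).
n_test, batch_size = 3864, 32
n_batches = num_batches(n_test, batch_size)
last_batch_size = n_test - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)
```

Because the last batch is smaller, a mean of per-batch losses weights its samples slightly more than the others; the effect is negligible at this scale.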

## Freeze BERT Parameters

In [None]:
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False

# Define BERT Architecture

In [None]:
from torch import nn

In [None]:
class BERT_Arch(nn.Module):

    def __init__(self, bert):

        super(BERT_Arch, self).__init__()
        
        self.bert = bert 
        self.dropout = nn.Dropout(0.1)
        self.relu =  nn.ReLU()
        self.fc1 = nn.Linear(768,512)
        self.fc2 = nn.Linear(512,2)

        self.softmax = nn.LogSoftmax(dim=1)

    #define the forward pass
    def forward(self, sent_id, mask):

        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)

        x = self.softmax(x)

        return x

In [None]:
# pass the pre-trained BERT to our defined architecture
model = BERT_Arch(bert)

# push the model to GPU
model = model.to(device)
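
Since every BERT parameter was frozen earlier, only the two linear layers of the head receive gradient updates. The arithmetic below counts those trainable parameters from the layer sizes in `BERT_Arch`; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Layer sizes taken from BERT_Arch: fc1 = Linear(768, 512), fc2 = Linear(512, 2).
head = {
    "fc1": linear_params(768, 512),
    "fc2": linear_params(512, 2),
}
trainable = sum(head.values())

for name, count in head.items():
    print(f"{name}: {count:,}")
print(f"trainable parameters in the head: {trainable:,}")
```

In the notebook itself the same figure can be checked with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.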

In [None]:
# AdamW optimizer from PyTorch (the AdamW class shipped with transformers is deprecated)
from torch.optim import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr = 1e-3)

## Get Class Weights

In [None]:
from sklearn.utils.class_weight import compute_class_weight

# compute the class weights from the BERT training labels
class_wts = compute_class_weight(class_weight='balanced', classes=np.unique(b_train_y), y=b_train_y)

print(class_wts)
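
The `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. A minimal re-implementation on toy labels is sketched below; the labels are made up and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """n_samples / (n_classes * count) for every class, in sorted class order."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy labels: six samples of class 0 and four of class 1.
toy_labels = [0] * 6 + [1] * 4
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

For a dataset whose classes are close to evenly split, both weights end up near 1 and the weighting has little effect.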

In [None]:
# convert class weights to tensor
weights = torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)

# loss function
cross_entropy  = nn.NLLLoss(weight=weights)
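
The model ends in `LogSoftmax` and the criterion is `NLLLoss`; taken together the two amount to a class-weighted cross-entropy. The sketch below works this out in plain Python, assuming the default mean reduction, where the weighted sum is divided by the sum of the weights of the target classes. The logits and weights are illustrative.

```python
import math


def log_softmax(logits):
    """Log-probabilities via the log-sum-exp trick for numerical stability."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def weighted_nll(log_probs, labels, weights):
    """Weighted negative log-likelihood, normalised by the target-class weights."""
    num = sum(weights[y] * -lp[y] for lp, y in zip(log_probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Illustrative logits for two samples, with their labels and class weights.
batch_logits = [[2.0, 0.0], [0.5, 1.5]]
batch_labels = [0, 1]
log_probs = [log_softmax(row) for row in batch_logits]

print(weighted_nll(log_probs, batch_labels, weights=[1.0, 1.0]))
print(weighted_nll(log_probs, batch_labels, weights=[0.8, 1.25]))
```

With equal weights this reduces to the ordinary mean cross-entropy over the batch.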

In [None]:
# number of training epochs
BERT_EPOCHS = 40

## Fine tuning BERT

In [None]:
# function to train the model
def train():

    model.train()

    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(train_dataloader):

        # progress update after every 50 batches.
        if step % 50 == 0 and not step == 0:
            print("  Batch {:>5,}  of  {:>5,}.".format(step, len(train_dataloader)))

        # push the batch to gpu
        batch = [r.to(device) for r in batch]

        sent_id, mask, labels = batch

        # clear previously calculated gradients
        model.zero_grad()

        # get model predictions for the current batch
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # add on to the total loss
        total_loss = total_loss + loss.item()

        # backward pass to calculate the gradients
        loss.backward()

        # clip the gradients to a maximum norm of 1.0; this helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # model predictions are stored on GPU. So, push it to CPU
        preds = preds.detach().cpu().numpy()

        # append the model predictions
        total_preds.append(preds)

    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # predictions are in the form of (no. of batches, size of batch, no. of classes).
    # reshape the predictions in form of (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    # returns the loss and predictions
    return avg_loss, total_preds
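
The clipping step in `train()` rescales all gradients together whenever their combined L2 norm exceeds the threshold, leaving their direction unchanged. A plain-Python sketch of that rule, on a flat list of made-up gradient values, is given below; `clip_by_global_norm` is a hypothetical helper and the small `eps` mirrors the constant PyTorch adds to the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm: float, eps: float = 1e-6):
    """Scale grads so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * coef for g in grads], total_norm


# Made-up gradients: the first set has norm 5.0 and is scaled down,
# the second has norm 0.5 and is left untouched.
clipped_large, norm_large = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
clipped_small, norm_small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(norm_large, clipped_large)
print(norm_small, clipped_small)
```

Parameters without a gradient, such as the frozen BERT weights here, are simply skipped by `clip_grad_norm_`.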

In [None]:
# function for evaluating the model
def evaluate():

    print("\nEvaluating...")

    # deactivate dropout layers
    model.eval()

    total_loss, total_accuracy = 0, 0

    # empty list to save the model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(val_dataloader):

        # Progress update every 50 batches.
        if step % 50 == 0 and not step == 0:

            # Report progress.
            print("  Batch {:>5,}  of  {:>5,}.".format(step, len(val_dataloader)))

        # push the batch to gpu
        batch = [t.to(device) for t in batch]

        sent_id, mask, labels = batch

        # deactivate autograd
        with torch.no_grad():

            # model predictions
            preds = model(sent_id, mask)

            # compute the validation loss between actual and predicted values
            loss = cross_entropy(preds, labels)

            total_loss = total_loss + loss.item()

            preds = preds.detach().cpu().numpy()

            total_preds.append(preds)

    # compute the validation loss of the epoch
    avg_loss = total_loss / len(val_dataloader)

    # reshape the predictions in form of (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    return avg_loss, total_preds

# BERT Model Training

In [None]:
# set initial loss to infinite
best_valid_loss = float("inf")

# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []

# early stop
EARLY_STOP = 5

es_counter = 0

# for each epoch
for epoch in tqdm(range(BERT_EPOCHS)):

    print("\n Epoch {:} / {:}".format(epoch + 1, BERT_EPOCHS))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        # Reset the early-stop counter on improvement
        es_counter = 0
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "saved_weights.pt")
    else:
        # Count consecutive epochs without improvement
        es_counter += 1

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f"\nTraining Loss: {train_loss:.3f}")
    print(f"Validation Loss: {valid_loss:.3f}")

    # stop after EARLY_STOP consecutive epochs without improvement
    if es_counter >= EARLY_STOP:
        print(f"Training stopped early at epoch {epoch + 1}")
        break
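
Patience-based early stopping, in the spirit of the loop above, can be factored into a small helper that is easy to test on its own. The sketch below is one possible shape for it: `EarlyStopper` is a hypothetical class, the loss values are made up, and the counter tracks consecutive epochs without an improvement in validation loss.

```python
class EarlyStopper:
    """Signals a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, valid_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: the best value is reached in epoch 2.
toy_losses = [0.50, 0.40, 0.45, 0.41, 0.42, 0.43]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(toy_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

Keeping the counter update in the `else` branch means an improving epoch never counts against the patience budget.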

## Model Performance

In [None]:
# Visualise the train_loss vs valid_loss
data_preproc = pd.DataFrame({
    'train_loss': train_losses,
    'valid_loss': valid_losses})

In [None]:
loss_plot = sns.lineplot(data=data_preproc)
loss_plot.set(xlabel='Epoch', ylabel='Loss');

## Model Evaluation

In [None]:
# load weights of best model (map_location keeps this working on CPU-only machines)
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path, map_location=device))

In [None]:
from sklearn.metrics import classification_report

In [None]:
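# get predictions for test data
from sklearn.metrics import confusion_matrix

# make sure dropout is disabled at test time
model.eval()

cross_matrix = []
all_labels, all_preds = [], []

with torch.no_grad():
    for step, batch in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):

        # push the batch to gpu
        batch = [r.to(device) for r in batch]

        sent_id, mask, labels = batch

        # get model predictions for the current batch
        preds = model(sent_id, mask)
        preds = preds.detach().cpu().numpy()

        # predicted class = index of the largest log-probability
        preds = np.argmax(preds, axis = 1)
        labels = labels.cpu().numpy()

        all_labels.extend(labels)
        all_preds.extend(preds)

        # per-batch confusion matrix with a fixed label order, so that a batch
        # missing one class still yields a 2x2 matrix that can be summed
        cross_matrix.append(confusion_matrix(labels, preds, labels=[0, 1]))

# model's performance over the whole test set (one report instead of one per batch)
print(classification_report(all_labels, all_preds))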

In [None]:
final_cm = sum(cross_matrix)
final_cm

In [None]:
plt.figure(figsize = (6,6))
sns.heatmap(final_cm,cmap= "Blues", linecolor = 'black' , linewidth = 1 , annot = True, fmt='' , xticklabels = ['True','Fake'] , yticklabels = ['True','Fake'])
plt.xlabel("Predicted")
plt.ylabel("Actual");

In [None]:
plt.figure(figsize = (6,6))
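# express each cell as a percentage of all test samples (avoids hard-coding the test-set size)
sns.heatmap(final_cm / np.asarray(final_cm).sum() * 100,cmap= "Blues", linecolor = 'black' , linewidth = 1 , annot = True, fmt='.2f' , xticklabels = ['True','Fake'] , yticklabels = ['True','Fake'])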
plt.xlabel("Predicted")
plt.ylabel("Actual");

# Final Thoughts

The BERT model performs slightly worse than the LSTM model. The complexity of the model might be one reason it did not converge to an optimal result; note also that the BERT encoder was frozen, so only the classification head on top of it was trained. Tuning the learning rate and other hyperparameters could further improve its performance. The LSTM model also benefited from learning on pre-trained embeddings, with stopwords and unintelligible words removed.