# Fake News Detection using Deep Learning Methods

* Dataset: LIAR,LIAR Pants-On-Fire
* Deep Learning Methods: ANN, RNN, CNN, GRU, LSTM, Bi-Directional LSTM

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

print(tf.__version__)

2.2.0


### Importing Dataset

LIAR dataset is divided into three set: 1. train set 2.test set 3. validation set. We will import each separately and as column do not have names to it, we will add for clearer understanding

In [2]:
#Importing training set and adding column label names

train_data = pd.read_csv('train.tsv',sep = '\t',
                         names = ['File Type', 'Label', 'Statement', 'Context', 'Speaker',
                                  'Position', 'State', 'Party', 'n1','n2','n3','n4','n5','Source'])

In [3]:
#checking training set

train_data.head()

Unnamed: 0,File Type,Label,Statement,Context,Speaker,Position,State,Party,n1,n2,n3,n4,n5,Source
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


#### Removing all other redundant columns except 'Label' and 'Statement' as they will not required for fake news detection task 

In [4]:
td= train_data.filter(['Label', 'Statement'],axis=1)
td

Unnamed: 0,Label,Statement
0,false,Says the Annies List political group supports ...
1,half-true,When did the decline of coal start? It started...
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo..."
3,false,Health care reform legislation is likely to ma...
4,half-true,The economic turnaround started at the end of ...
...,...,...
10235,mostly-true,There are a larger number of shark attacks in ...
10236,mostly-true,Democrats have now become the party of the [At...
10237,half-true,Says an alternative to Social Security that op...
10238,false,On lifting the U.S. Cuban embargo and allowing...


#### Removing labels with HALF_TRUE 

Because half-true is equal to half-false, and therfore label with HALF-TRUE cannot substantially contribute to news classification into true and false

In [5]:
td1 = td[td.Label != 'half-true']
td1

Unnamed: 0,Label,Statement
0,false,Says the Annies List political group supports ...
2,mostly-true,"Hillary Clinton agrees with John McCain ""by vo..."
3,false,Health care reform legislation is likely to ma...
5,true,The Chicago Bears have had more starting quart...
6,barely-true,Jim Dunnam has not lived in the district he re...
...,...,...
10233,pants-fire,Mayor Fung wants to punish our childrens educa...
10235,mostly-true,There are a larger number of shark attacks in ...
10236,mostly-true,Democrats have now become the party of the [At...
10238,false,On lifting the U.S. Cuban embargo and allowing...


### Changing Labels into 0 and 1

* Labels with "TRUE", "MOSTLY TRUE" as True or 1
* Labels with "FALSE", "BARELY-TRUE", "PANTS-FIRE" as Fake or 0

In [6]:
mapping = {'true': 1, 'mostly-true': 1, 'false': 0, 'pants-fire': 0, 'barely-true': 0}
training_set =td1.replace({'Label': mapping})
training_set

Unnamed: 0,Label,Statement
0,0,Says the Annies List political group supports ...
2,1,"Hillary Clinton agrees with John McCain ""by vo..."
3,0,Health care reform legislation is likely to ma...
5,1,The Chicago Bears have had more starting quart...
6,0,Jim Dunnam has not lived in the district he re...
...,...,...
10233,0,Mayor Fung wants to punish our childrens educa...
10235,1,There are a larger number of shark attacks in ...
10236,1,Democrats have now become the party of the [At...
10238,0,On lifting the U.S. Cuban embargo and allowing...


## Test set and validation need to pre-processed as similar to Train Set

* Additional columns needs to be removed
* "Half-True" labels needs to be removed
* Labels needs to re-labelled as 0 and 1 for fake and real news respectively

### Pre-processing on test set

In [7]:
#Importing test set and adding column label names
test_data = pd.read_csv('test.tsv', sep = '\t',
                       names = ['File Type', 'Label', 'Statement', 'Context', 'Speaker',
                                  'Position', 'State', 'Party', 'n1','n2','n3','n4','n5','Source'])

test_data.head()

Unnamed: 0,File Type,Label,Statement,Context,Speaker,Position,State,Party,n1,n2,n3,n4,n5,Source
0,11972.json,true,Building a wall on the U.S.-Mexico border will...,immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview
1,11685.json,false,Wisconsin is on pace to double the number of l...,jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference
2,11096.json,false,Says John McCain has done nothing to help the ...,"military,veterans,voting-record",donald-trump,President-Elect,New York,republican,63,114,51,37,61,comments on ABC's This Week.
3,5209.json,half-true,Suzanne Bonamici supports a plan that will cut...,"medicare,message-machine-2012,campaign-adverti...",rob-cornilles,consultant,Oregon,republican,1,1,3,1,1,a radio show
4,9524.json,pants-fire,When asked by a reporter whether hes at the ce...,"campaign-finance,legal-issues,campaign-adverti...",state-democratic-party-wisconsin,,Wisconsin,democrat,5,7,2,2,7,a web video


In [8]:
# Removing redundant columns

test_set1= test_data.filter(['Label', 'Statement'],axis=1)
test_set1

Unnamed: 0,Label,Statement
0,true,Building a wall on the U.S.-Mexico border will...
1,false,Wisconsin is on pace to double the number of l...
2,false,Says John McCain has done nothing to help the ...
3,half-true,Suzanne Bonamici supports a plan that will cut...
4,pants-fire,When asked by a reporter whether hes at the ce...
...,...,...
1262,half-true,Says his budget provides the highest state fun...
1263,barely-true,Ive been here almost every day.
1264,barely-true,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,barely-true,Says an EPA permit languished under Strickland...


In [9]:
#Removing news with label as "HALF-TRUE"

tset = test_set1[test_set1.Label != 'half-true']
tset

Unnamed: 0,Label,Statement
0,true,Building a wall on the U.S.-Mexico border will...
1,false,Wisconsin is on pace to double the number of l...
2,false,Says John McCain has done nothing to help the ...
4,pants-fire,When asked by a reporter whether hes at the ce...
5,true,Over the past five years the federal governmen...
...,...,...
1261,pants-fire,Sen. Bob Menendez voted to enact a new tax on ...
1263,barely-true,Ive been here almost every day.
1264,barely-true,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,barely-true,Says an EPA permit languished under Strickland...


In [10]:
#chaning labels to 0 and 1 for fake and real news

mapping = {'true': 1, 'mostly-true': 1, 'false': 0, 'pants-fire': 0, 'barely-true': 0}
test_set =tset.replace({'Label': mapping})

In [11]:
test_set

Unnamed: 0,Label,Statement
0,1,Building a wall on the U.S.-Mexico border will...
1,0,Wisconsin is on pace to double the number of l...
2,0,Says John McCain has done nothing to help the ...
4,0,When asked by a reporter whether hes at the ce...
5,1,Over the past five years the federal governmen...
...,...,...
1261,0,Sen. Bob Menendez voted to enact a new tax on ...
1263,0,Ive been here almost every day.
1264,0,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,0,Says an EPA permit languished under Strickland...


### Pre-processing on Validation set

In [12]:
#Importing test set and adding column label names

valid_data = pd.read_csv('test.tsv', sep = '\t',
                       names = ['File Type', 'Label', 'Statement', 'Context', 'Speaker',
                                  'Position', 'State', 'Party', 'n1','n2','n3','n4','n5','Source'])
valid_data.head()

Unnamed: 0,File Type,Label,Statement,Context,Speaker,Position,State,Party,n1,n2,n3,n4,n5,Source
0,11972.json,true,Building a wall on the U.S.-Mexico border will...,immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview
1,11685.json,false,Wisconsin is on pace to double the number of l...,jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference
2,11096.json,false,Says John McCain has done nothing to help the ...,"military,veterans,voting-record",donald-trump,President-Elect,New York,republican,63,114,51,37,61,comments on ABC's This Week.
3,5209.json,half-true,Suzanne Bonamici supports a plan that will cut...,"medicare,message-machine-2012,campaign-adverti...",rob-cornilles,consultant,Oregon,republican,1,1,3,1,1,a radio show
4,9524.json,pants-fire,When asked by a reporter whether hes at the ce...,"campaign-finance,legal-issues,campaign-adverti...",state-democratic-party-wisconsin,,Wisconsin,democrat,5,7,2,2,7,a web video


In [13]:
#Removing all other labels expect statement and label

vd= valid_data.filter(['Label', 'Statement'],axis=1)
vd

Unnamed: 0,Label,Statement
0,true,Building a wall on the U.S.-Mexico border will...
1,false,Wisconsin is on pace to double the number of l...
2,false,Says John McCain has done nothing to help the ...
3,half-true,Suzanne Bonamici supports a plan that will cut...
4,pants-fire,When asked by a reporter whether hes at the ce...
...,...,...
1262,half-true,Says his budget provides the highest state fun...
1263,barely-true,Ive been here almost every day.
1264,barely-true,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,barely-true,Says an EPA permit languished under Strickland...


In [14]:
#removing news label with "HAF-LF_TRUE" label

vd1 = vd[vd.Label != 'half-true']
vd1

Unnamed: 0,Label,Statement
0,true,Building a wall on the U.S.-Mexico border will...
1,false,Wisconsin is on pace to double the number of l...
2,false,Says John McCain has done nothing to help the ...
4,pants-fire,When asked by a reporter whether hes at the ce...
5,true,Over the past five years the federal governmen...
...,...,...
1261,pants-fire,Sen. Bob Menendez voted to enact a new tax on ...
1263,barely-true,Ive been here almost every day.
1264,barely-true,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,barely-true,Says an EPA permit languished under Strickland...


In [15]:
#Changing news label into 0 and 1 for fake and real news
mapping = {'true': 1, 'mostly-true': 1, 'false': 0, 'pants-fire': 0, 'barely-true': 0}
valid_set = vd1.replace({'Label': mapping})

In [16]:
valid_set

Unnamed: 0,Label,Statement
0,1,Building a wall on the U.S.-Mexico border will...
1,0,Wisconsin is on pace to double the number of l...
2,0,Says John McCain has done nothing to help the ...
4,0,When asked by a reporter whether hes at the ce...
5,1,Over the past five years the federal governmen...
...,...,...
1261,0,Sen. Bob Menendez voted to enact a new tax on ...
1263,0,Ive been here almost every day.
1264,0,"In the early 1980s, Sen. Edward Kennedy secret..."
1265,0,Says an EPA permit languished under Strickland...


# Natural Language Processing with the Text Data of News

* Remove Punctuation
* Remove Stopwords
* Then, we will implement tokenizer to create padding sequences. Also, to calculate num_words, we will concatenate all news statement to gather exact number

### Removing Punctuation

In [17]:
import string

def remove_punc(text):
    table = str.maketrans("","",string.punctuation)
    return text.translate(table)

In [18]:
#removing punctuation from the news statement from the training set, test set, and validation set

training_set['Statement']=training_set['Statement'].map(lambda x: remove_punc(x))
test_set['Statement']=test_set['Statement'].map(lambda x: remove_punc(x))
valid_set['Statement']=valid_set['Statement'].map(lambda x: remove_punc(x))

### Removing stopwords

In [19]:
#importing stopwords from the NLTK to remove ENGLISH stopwords, and convert further into lower within 
#remove_Stopwords function defined

import nltk
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

In [20]:
#removing ENGLISH stopwords from the news statement from the training set, test set, and validation set

training_set['Statement']=training_set['Statement'].map(remove_stopwords)
test_set['Statement']=test_set['Statement'].map(remove_stopwords)
valid_set['Statement']=valid_set['Statement'].map(remove_stopwords)

In [21]:
#Training set after removing punctuation, stopwords, and lowercasing all the words in the corpus
training_set['Statement']

0        says annies list political group supports thir...
2        hillary clinton agrees john mccain voting give...
3        health care reform legislation likely mandate ...
5        chicago bears starting quarterbacks last 10 ye...
6               jim dunnam lived district represents years
                               ...                        
10233    mayor fung wants punish childrens education re...
10235    larger number shark attacks florida cases vote...
10236     democrats become party atlanta metro area blacks
10238        lifting us cuban embargo allowing travel cuba
10239    department veterans affairs manual telling vet...
Name: Statement, Length: 8126, dtype: object

In [22]:
#Test set after removing punctuation, stopwords, and lowercasing all the words in the corpus
test_set['Statement']

0       building wall usmexico border take literally y...
1               wisconsin pace double number layoffs year
2                 says john mccain done nothing help vets
4       asked reporter whether hes center criminal sch...
5       past five years federal government paid 601 mi...
                              ...                        
1261    sen bob menendez voted enact new tax sale home...
1263                                 ive almost every day
1264    early 1980s sen edward kennedy secretly offere...
1265    says epa permit languished strickland new epa ...
1266    says governor going around state talking fund ...
Name: Statement, Length: 1002, dtype: object

In [23]:
#Validation set after removing punctuation, stopwords, and lowercasing all the words in the corpus
valid_set['Statement']

0       building wall usmexico border take literally y...
1               wisconsin pace double number layoffs year
2                 says john mccain done nothing help vets
4       asked reporter whether hes center criminal sch...
5       past five years federal government paid 601 mi...
                              ...                        
1261    sen bob menendez voted enact new tax sale home...
1263                                 ive almost every day
1264    early 1980s sen edward kennedy secretly offere...
1265    says epa permit languished strickland new epa ...
1266    says governor going around state talking fund ...
Name: Statement, Length: 1002, dtype: object

### In the next step, we will be using TOKENIZER, and tokenizer requires num_words parameter

* To find num_words, we need to the know total count of unique words in the train set, test set, and validation set
* Therefore, we will be using counter collection of PYTHON. It counts the unique words

#### Creating new set to concatnate all three sets to get exact no. of words in dataset

In [24]:
new_set = pd.concat([training_set, test_set, valid_set], axis=0)
new_set

Unnamed: 0,Label,Statement
0,0,says annies list political group supports thir...
2,1,hillary clinton agrees john mccain voting give...
3,0,health care reform legislation likely mandate ...
5,1,chicago bears starting quarterbacks last 10 ye...
6,0,jim dunnam lived district represents years
...,...,...
1261,0,sen bob menendez voted enact new tax sale home...
1263,0,ive almost every day
1264,0,early 1980s sen edward kennedy secretly offere...
1265,0,says epa permit languished strickland new epa ...


In [25]:
from collections import Counter

def counter(text):
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] +=1
    return count

In [26]:
text = new_set['Statement']
total_count = counter(text)

In [27]:
len(total_count)

12633

#### Creating parameters

In [28]:
num_words = len(total_count)
oov_tok = "<oov>"
max_length = 20 #because its says average statement length is 17.9
trunc_type = 'post'
padding_type = 'post'

#### Creating training set, training label, test set, test label, validation set to be used in neural network

In [29]:
train_sentences = training_set['Statement']
train_labels = training_set['Label']

In [30]:
test_sentences = test_set['Statement']
test_labels = test_set['Label']

In [31]:
valid_sentences = valid_set['Statement']
valid_labels = valid_set['Label']

#### Importing TOKENIZER

In [32]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [33]:
#we will use all the words in the train set, test set, and validation set to tokenier
new_set_sentences = new_set['Statement']

In [34]:
tokenizer = Tokenizer(num_words = num_words, oov_token = oov_tok)
tokenizer.fit_on_texts(new_set_sentences)

In [35]:
#Creating word_index for each words tokenized

word_index = tokenizer.word_index
word_index

{'<oov>': 1,
 'says': 2,
 'percent': 3,
 'state': 4,
 'obama': 5,
 'tax': 6,
 'years': 7,
 'president': 8,
 'health': 9,
 'people': 10,
 'states': 11,
 'year': 12,
 'us': 13,
 'care': 14,
 'would': 15,
 'million': 16,
 'new': 17,
 'one': 18,
 'jobs': 19,
 'bill': 20,
 'federal': 21,
 'billion': 22,
 'texas': 23,
 'barack': 24,
 'said': 25,
 'every': 26,
 'voted': 27,
 'budget': 28,
 'country': 29,
 'government': 30,
 'taxes': 31,
 'since': 32,
 'last': 33,
 'law': 34,
 'united': 35,
 'pay': 36,
 'money': 37,
 'first': 38,
 'plan': 39,
 'wisconsin': 40,
 'clinton': 41,
 'rate': 42,
 'even': 43,
 'republican': 44,
 '1': 45,
 'time': 46,
 'get': 47,
 'senate': 48,
 'spending': 49,
 'public': 50,
 'americans': 51,
 'house': 52,
 'cut': 53,
 'insurance': 54,
 'florida': 55,
 'obamacare': 56,
 'american': 57,
 'office': 58,
 'school': 59,
 '10': 60,
 'debt': 61,
 'national': 62,
 'hillary': 63,
 'dont': 64,
 'congress': 65,
 'average': 66,
 'scott': 67,
 'never': 68,
 'america': 69,
 'three'

In [36]:
train_sequences = tokenizer.texts_to_sequences(train_sentences)

In [37]:
train_sequences[0]

[2, 7229, 997, 480, 426, 267, 5066, 416, 1555]

In [38]:
from keras.preprocessing.sequence import pad_sequences

In [39]:
train_padded = pad_sequences(
                train_sequences, maxlen =max_length, padding=padding_type,truncating = trunc_type)

In [40]:
train_padded[0]

array([   2, 7229,  997,  480,  426,  267, 5066,  416, 1555,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0])

In [41]:
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(
                test_sequences, maxlen=20, padding="post",
                truncating = "post")

In [42]:
valid_sequences = tokenizer.texts_to_sequences(valid_sentences)
valid_padded = pad_sequences(valid_sequences, maxlen= 20, padding ="post",truncating="post")

In [43]:
print(f"Shape of the train {train_padded.shape}")
print(f"shape of the test {test_padded.shape}")

Shape of the train (8126, 20)
shape of the test (1002, 20)


# Implementing DEEP LEARNING methods

We will use keras to implement deep learning methods. Methods used are

* Long Short Term Memory (LSTM)
* Artificial Neural Network
* Convolutional Neural Network (CNN)
* Gated Recurrent Unit (GRU)
* Bi-Directional LSTM
* Recurrent Neural Network (RNN)

### Modelling LSTM

In [44]:
#Importing relevant libraries to create desired neural networks

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.initializers import Constant
from keras.optimizers import Adam

In [45]:
model = Sequential()

#### Number of hidden layer, dropout, learning rate has been tried and tested many times over for better result

In [46]:
model.add(Embedding(num_words, 32, input_length=max_length))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2))
model.add(LSTM(32, activation = "relu", dropout = 0.1))
model.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=opt,metrics=['accuracy'])

In [47]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 32)            404256    
_________________________________________________________________
lstm_1 (LSTM)                (None, 20, 32)            8320      
_________________________________________________________________
lstm_2 (LSTM)                (None, 20, 32)            8320      
_________________________________________________________________
lstm_3 (LSTM)                (None, 20, 32)            8320      
_________________________________________________________________
lstm_4 (LSTM)                (None, 20, 32)            8320      
_________________________________________________________________
lstm_5 (LSTM)                (None, 20, 32)            8320      
_________________________________________________________________
lstm_6 (LSTM)                (None, 20, 32)           

#### Epochs and batch_size has been tested many times over for better results

In [48]:
model.fit(train_padded, train_labels, epochs=30, batch_size=32, validation_data=(valid_padded, valid_labels))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 8126 samples, validate on 1002 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.callbacks.History at 0x20677a62888>

#### Changing float value in output to 0 and 1 with 0.5 as classification point

In [50]:
predictions = model.predict(test_padded)
predictions = (predictions > 0.5)

In [51]:
predictions.astype('int64')

array([[0],
       [0],
       [0],
       ...,
       [1],
       [1],
       [0]], dtype=int64)

#### We will be focusing on Accuracy Score and Recall for fake news i.e 0 label for evaluating our model performance

In [52]:
#Importing confusion matrix and accuracy score from SCIKIT-LEARN to understand results

from sklearn.metrics import confusion_matrix,classification_report, accuracy_score
print(confusion_matrix(test_labels, predictions))
print(classification_report(test_labels,predictions))

[[326 227]
 [191 258]]
              precision    recall  f1-score   support

           0       0.63      0.59      0.61       553
           1       0.53      0.57      0.55       449

    accuracy                           0.58      1002
   macro avg       0.58      0.58      0.58      1002
weighted avg       0.59      0.58      0.58      1002



In [53]:
print("Accuracy Score for LSTM:",accuracy_score(test_labels,predictions))

Accuracy Score for LSTM: 0.5828343313373253


## Modelling ANN

In [54]:
from keras.layers import Flatten

In [55]:
ann = Sequential()

#First input and hidden layer
ann.add(Embedding(num_words,32, input_length=max_length))
ann.add(Dense(units=32, activation='relu'))
ann.add(Dense(units=32, activation='relu'))
ann.add(Dense(units=32, activation='relu'))
ann.add(Dense(units=32, activation='relu'))
ann.add(Dense(units=32, activation='relu'))
ann.add(Dense(units=32, activation='relu'))
ann.add(Flatten())
#Output layer
ann.add(Dense(units=1))

#compiling
opt = Adam(learning_rate=0.001)
ann.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

In [56]:
ann.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 32)            404256    
_________________________________________________________________
dense_2 (Dense)              (None, 20, 32)            1056      
_________________________________________________________________
dense_3 (Dense)              (None, 20, 32)            1056      
_________________________________________________________________
dense_4 (Dense)              (None, 20, 32)            1056      
_________________________________________________________________
dense_5 (Dense)              (None, 20, 32)            1056      
_________________________________________________________________
dense_6 (Dense)              (None, 20, 32)            1056      
_________________________________________________________________
dense_7 (Dense)              (None, 20, 32)           

#### Epochs and batch_size has been tested many times over for better results

In [57]:
#training ann
ann.fit(train_padded, train_labels, batch_size=32, validation_data=(valid_padded, valid_labels),epochs=40)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 8126 samples, validate on 1002 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.callbacks.History at 0x2067f8921c8>

In [58]:
predictions_ann = ann.predict(test_padded)
predictions_ann = (predictions_ann > 0.5)
predictions_ann.astype('int64')

array([[0],
       [0],
       [1],
       ...,
       [1],
       [0],
       [0]], dtype=int64)

In [63]:
print(confusion_matrix(test_labels, predictions_ann))
print(classification_report(test_labels,predictions_ann))
print("Accuracy Score for ANN :",accuracy_score(test_labels,predictions_ann))

[[342 211]
 [217 232]]
              precision    recall  f1-score   support

           0       0.61      0.62      0.62       553
           1       0.52      0.52      0.52       449

    accuracy                           0.57      1002
   macro avg       0.57      0.57      0.57      1002
weighted avg       0.57      0.57      0.57      1002

Accuracy Score for ANN : 0.5728542914171657


### Modelling CNN 1D

In [124]:
from keras.layers import Conv1D, GlobalAveragePooling1D,Dense

In [125]:
cnn = Sequential()

#First input and hidden layer
cnn.add(Embedding(num_words, 32, input_length=max_length))
cnn.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
cnn.add(GlobalAveragePooling1D())
cnn.add(Dense(24,activation='relu'))
cnn.add(Dense(24,activation='relu'))
#Output layer
cnn.add(Dense(1, activation='sigmoid'))

#compiling
opt = Adam(learning_rate=0.00001)
cnn.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

In [126]:
cnn.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 20, 32)            404256    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 16, 128)           20608     
_________________________________________________________________
global_average_pooling1d_2 ( (None, 128)               0         
_________________________________________________________________
dense_28 (Dense)             (None, 24)                3096      
_________________________________________________________________
dense_29 (Dense)             (None, 24)                600       
_________________________________________________________________
dense_30 (Dense)             (None, 1)                 25        
Total params: 428,585
Trainable params: 428,585
Non-trainable params: 0
_______________________________________________

#### Epochs and batch_size has been tested many times over for better results

In [141]:
#training cnn
cnn.fit(train_padded, train_labels, epochs=40, batch_size=32, validation_data=(valid_padded, valid_labels))

Train on 8126 samples, validate on 1002 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.callbacks.History at 0x20659ec3588>

In [142]:
predictions_cnn = cnn.predict(test_padded)
predictions_cnn = (predictions_cnn > 0.5)
predictions_cnn.astype('int64')

array([[0],
       [0],
       [0],
       ...,
       [1],
       [0],
       [0]], dtype=int64)

In [143]:
print(confusion_matrix(test_labels, predictions_cnn))
print(classification_report(test_labels,predictions_cnn))
print("\n")
print("Accuracy Score for CNN:",accuracy_score(test_labels,predictions_cnn))

[[350 203]
 [217 232]]
              precision    recall  f1-score   support

           0       0.62      0.63      0.63       553
           1       0.53      0.52      0.52       449

    accuracy                           0.58      1002
   macro avg       0.58      0.57      0.57      1002
weighted avg       0.58      0.58      0.58      1002



Accuracy Score for CNN: 0.5808383233532934


### Modelling GRU

In [133]:
from keras.layers import GRU

In [134]:
gru = Sequential()

In [135]:
gru.add(Embedding(num_words, 32, input_length=max_length))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, return_sequences=True, activation="relu", recurrent_activation='sigmoid', dropout = 0.2))
gru.add(GRU(32, activation = "relu", dropout = 0.1))
gru.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.0001)
gru.compile(loss='binary_crossentropy', optimizer=opt,metrics=['accuracy'])

In [136]:
#printing GRU Model Summary
gru.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 20, 32)            404256    
_________________________________________________________________
gru_21 (GRU)                 (None, 20, 32)            6240      
_________________________________________________________________
gru_22 (GRU)                 (None, 20, 32)            6240      
_________________________________________________________________
gru_23 (GRU)                 (None, 20, 32)            6240      
_________________________________________________________________
gru_24 (GRU)                 (None, 20, 32)            6240      
_________________________________________________________________
gru_25 (GRU)                 (None, 20, 32)            6240      
_________________________________________________________________
gru_26 (GRU)                 (None, 20, 32)          

#### Epochs and batch_size has been tested many times over for better results

In [137]:
gru.fit(train_padded, train_labels, epochs=25, batch_size=32, validation_data=(valid_padded, valid_labels))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 8126 samples, validate on 1002 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.callbacks.History at 0x2064f6a87c8>

In [138]:
predictions_gru = gru.predict(test_padded)
predictions_gru = (predictions_gru > 0.50)
predictions_gru.astype('int64')

array([[0],
       [0],
       [0],
       ...,
       [1],
       [1],
       [0]], dtype=int64)

In [140]:
print(confusion_matrix(test_labels, predictions_gru))
print(classification_report(test_labels,predictions_gru))
print("\n")
print("Accuracy Score for GRU :",accuracy_score(test_labels,predictions_gru))

[[321 232]
 [195 254]]
              precision    recall  f1-score   support

           0       0.62      0.58      0.60       553
           1       0.52      0.57      0.54       449

    accuracy                           0.57      1002
   macro avg       0.57      0.57      0.57      1002
weighted avg       0.58      0.57      0.57      1002



Accuracy Score for GRU : 0.5738522954091816


### Modelling Bi-Directional LSTM

In [78]:
from keras.layers import Bidirectional

In [79]:
bilstm = Sequential()

In [80]:
bilstm.add(Embedding(num_words, 32, input_length=max_length))
bilstm.add(Bidirectional(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2)))
bilstm.add(Bidirectional(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2)))
bilstm.add(Bidirectional(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2)))
bilstm.add(Bidirectional(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2)))
bilstm.add(Bidirectional(LSTM(32, return_sequences=True, activation="relu", dropout = 0.2)))
bilstm.add(Bidirectional(LSTM(32, activation = "relu", dropout = 0.2)))
bilstm.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.0001)
bilstm.compile(loss='binary_crossentropy', optimizer=opt,metrics=['accuracy'])

In [81]:
bilstm.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 20, 32)            404256    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 20, 64)            16640     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 20, 64)            24832     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 20, 64)            24832     
_________________________________________________________________
bidirectional_4 (Bidirection (None, 20, 64)            24832     
_________________________________________________________________
bidirectional_5 (Bidirection (None, 20, 64)            24832     
_________________________________________________________________
bidirectional_6 (Bidirection (None, 64)               

#### Epochs and batch_size has been tested many times over for better results

In [82]:
bilstm.fit(train_padded, train_labels, epochs=20, batch_size=32, validation_data=(valid_padded, valid_labels))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 8126 samples, validate on 1002 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x2061e8639c8>

In [83]:
predictions_bilstm = bilstm.predict(test_padded)
predictions_bilstm = (predictions_bilstm > 0.5)
predictions_bilstm.astype('int64')

array([[0],
       [0],
       [0],
       ...,
       [1],
       [1],
       [0]], dtype=int64)

In [84]:
from sklearn.metrics import confusion_matrix,classification_report, accuracy_score
print(confusion_matrix(test_labels, predictions_bilstm))
print(classification_report(test_labels,predictions_bilstm))

[[296 257]
 [173 276]]
              precision    recall  f1-score   support

           0       0.63      0.54      0.58       553
           1       0.52      0.61      0.56       449

    accuracy                           0.57      1002
   macro avg       0.57      0.57      0.57      1002
weighted avg       0.58      0.57      0.57      1002



In [85]:
print("Accuracy Score for Bi-Directional LSTM :",accuracy_score(test_labels,predictions_bilstm))

Accuracy Score for Bi-Directional LSTM : 0.5708582834331337


### Modelling RNN for Fake News Detection

In [100]:
from keras.layers import SimpleRNN
from keras.layers import Flatten

In [101]:
rnn = Sequential()

#First input and hidden layer
rnn.add(Embedding(num_words,32, input_length=max_length))
rnn.add(Dense(units=6, activation='relu'))
rnn.add(Dense(units=6, activation='relu'))
rnn.add(Dense(units=6, activation='relu'))
rnn.add(Dense(units=6, activation='relu'))
rnn.add(Dense(units=6, activation='relu'))
rnn.add(Flatten())
#Output layer
rnn.add(Dense(units=1))

#compiling
opt = Adam(learning_rate=0.0001)
rnn.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

#### RNN Model Summary

In [102]:
rnn.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 20, 32)            404256    
_________________________________________________________________
dense_20 (Dense)             (None, 20, 6)             198       
_________________________________________________________________
dense_21 (Dense)             (None, 20, 6)             42        
_________________________________________________________________
dense_22 (Dense)             (None, 20, 6)             42        
_________________________________________________________________
dense_23 (Dense)             (None, 20, 6)             42        
_________________________________________________________________
dense_24 (Dense)             (None, 20, 6)             42        
_________________________________________________________________
flatten_3 (Flatten)          (None, 120)              

#### Epochs and batch_size has been tested many times over for better results

In [106]:
rnn.fit(train_padded, train_labels, epochs=45, batch_size=32, validation_data=(valid_padded, valid_labels))

Train on 8126 samples, validate on 1002 samples
Epoch 1/45
Epoch 2/45
Epoch 3/45
Epoch 4/45
Epoch 5/45
Epoch 6/45
Epoch 7/45
Epoch 8/45
Epoch 9/45
Epoch 10/45
Epoch 11/45
Epoch 12/45
Epoch 13/45
Epoch 14/45
Epoch 15/45
Epoch 16/45
Epoch 17/45
Epoch 18/45
Epoch 19/45
Epoch 20/45
Epoch 21/45
Epoch 22/45
Epoch 23/45
Epoch 24/45
Epoch 25/45
Epoch 26/45
Epoch 27/45
Epoch 28/45
Epoch 29/45
Epoch 30/45
Epoch 31/45
Epoch 32/45
Epoch 33/45
Epoch 34/45
Epoch 35/45
Epoch 36/45
Epoch 37/45
Epoch 38/45
Epoch 39/45
Epoch 40/45
Epoch 41/45
Epoch 42/45
Epoch 43/45
Epoch 44/45
Epoch 45/45


<keras.callbacks.callbacks.History at 0x2062e00b5c8>

In [107]:
predictions_rnn = rnn.predict(test_padded)
predictions_rnn = (predictions_rnn > 0.5)
predictions_rnn.astype('int64')

array([[0],
       [1],
       [1],
       ...,
       [1],
       [1],
       [0]], dtype=int64)

In [108]:
print(confusion_matrix(test_labels, predictions_rnn))
print(classification_report(test_labels,predictions_rnn))
print("Accuracy Score for RNN :",accuracy_score(test_labels,predictions_rnn))

[[302 251]
 [207 242]]
              precision    recall  f1-score   support

           0       0.59      0.55      0.57       553
           1       0.49      0.54      0.51       449

    accuracy                           0.54      1002
   macro avg       0.54      0.54      0.54      1002
weighted avg       0.55      0.54      0.54      1002

Accuracy Score for RNN : 0.5429141716566867
