Application of LSTM and GRU Recurrent Neural Networks in Fake News Detection

With greater computational capacity and the ability to handle massive datasets, deep learning models perform better than traditional text mining and machine learning techniques. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are widely explored Deep Neural Network (DNN) architectures for various NLP tasks [6][7]. The current work is related to a number of research areas, such as text classification, rumor detection, spammer detection, and sentiment analysis.


## Imports

In [1]:
# importing necessary libraries 
import pandas as pd
import tensorflow as tf
import os
import re
import numpy as np
from string import punctuation
#from zipfile import ZipFile
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [2]:
# importing neural network libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, GRU, LSTM, RNN, SpatialDropout1D

## Read data

In [3]:
train = pd.read_csv('../fake-news/train.csv')
test = pd.read_csv('../fake-news/test.csv')
train_data = train.copy()
test_data = test.copy()

In [4]:
train_data = train_data.set_index('id', drop = True)

In [15]:
print(train_data.shape)
train_data.head()

(20800, 4)


| id | title | author | text | label |
|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [12]:
# take a slice of the training data (first 800 rows); .copy() avoids pandas' SettingWithCopyWarning on later edits
slice_train = train_data.iloc[:800].copy()

In [16]:
slice_train.shape

(800, 4)

In [149]:
print(test_data.shape)
test_data.head()

(5200, 4)


|  | id | title | author | text |
|---|---|---|---|---|
| 0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... |
| 1 | 20801 | Russian warships ready to strike terrorists ne... | NaN | Russian warships ready to strike terrorists ne... |
| 2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... |
| 3 | 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... |
| 4 | 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... |


In [150]:
# checking for missing values
train_data.isnull().sum()

title      558
author    1957
text        39
label        0
dtype: int64

Out of 20,800 training samples, only 39 have a missing `text` value, so those rows can simply be dropped. Missing titles and authors are filled with a placeholder instead.

In [17]:
# fill missing titles and authors with a placeholder, then drop the rows whose text is missing
train_data[['title', 'author']] = train_data[['title', 'author']].fillna(value = 'Missing')
train_data = train_data.dropna()
train_data.isnull().sum()

# apply the same cleaning to the slice
slice_train[['title', 'author']] = slice_train[['title', 'author']].fillna(value = 'Missing')
slice_train = slice_train.dropna()
slice_train.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64

In [18]:
# character length of each article
train_data['length'] = train_data['text'].astype(str).str.len()
train_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


| id | title | author | text | label | length |
|---|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | 4930 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | 4160 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | 7692 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | 3237 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | 938 |


In [19]:
# character length of each article in the slice
slice_train['length'] = slice_train['text'].astype(str).str.len()
slice_train.head()

| id | title | author | text | label | length |
|---|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | 4930 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | 4160 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | 7692 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | 3237 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | 938 |


In [153]:
min(train_data['length']), max(train_data['length']), round(sum(train_data['length'])/len(train_data['length']))

(1, 142961, 4553)

In [20]:
min(slice_train['length']), max(slice_train['length']), round(sum(slice_train['length'])/len(slice_train['length']))

(1, 92561, 4987)

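The average article is roughly 4,500 characters long, so we keep 4500 as `max_features`. It is used below both as the tokenizer vocabulary size and as the padded sequence length.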
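
Note that the `length` column counts characters, whereas the tokenizer and padding steps further down work on word tokens. A quick sketch of the difference, using one of the titles shown above as sample text (the variable names are illustrative only):

```python
# Sample text used only to contrast the two length measures.
sample = "Why the Truth Might Get You Fired"

n_chars = len(sample)           # what the `length` column measures
n_tokens = len(sample.split())  # roughly what a word-level tokenizer produces

print(n_chars, n_tokens)
```

A sequence length chosen from character counts is therefore a generous upper bound on the number of word tokens per article.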

**The minimum length is 1, so there appear to be some outliers.**

In [154]:
len(train_data[train_data['length'] < 50])

207

**There are 207 such outliers in this dataset. They can be removed, but it is good practice to inspect outliers before removing them.**

In [155]:
train_data['text'][train_data['length'] < 50]

id
82                                                   
169                                                  
173                                   Guest   Guest  
196            They got the heater turned up on high.
295                                                  
                             ...                     
20350                         I hope nobody got hurt!
20418                                 Guest   Guest  
20431    \nOctober 28, 2016 The Mothers by stclair by
20513                                                
20636                              Trump all the way!
Name: text, Length: 207, dtype: object

*These are mostly empty texts. They can be removed, since they carry almost no signal and would likely mislead the neural network.*

In [21]:
# dropping the outliers
train_data = train_data.drop(train_data['text'][train_data['length'] < 50].index, axis = 0)

In [22]:
# dropping the outliers
slice_train = slice_train.drop(slice_train['text'][slice_train['length'] < 50].index, axis = 0)

In [157]:
min(train_data['length']), max(train_data['length']), round(sum(train_data['length'])/len(train_data['length']))

(50, 142961, 4598)

In [23]:
max_features = 4500

## Preprocessing the Text before feeding it into the neural networks (NLP)

In [24]:
import nltk 
from nltk.corpus import stopwords 

In [34]:
# Tokenizing the text: each word is mapped to an integer index, keeping only the most frequent words (num_words = max_features).
# Punctuation does not need to be removed separately; the Tokenizer's `filters` argument handles it.
tokenizer = Tokenizer(num_words = max_features, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower = True, split = ' ')
tokenizer.fit_on_texts(texts = train_data['text'])
X = tokenizer.texts_to_sequences(texts = train_data['text'])

KeyboardInterrupt: 

In [31]:
# Tokenizing the text: each word is mapped to an integer index, keeping only the most frequent words (num_words = max_features).
# Punctuation does not need to be removed separately; the Tokenizer's `filters` argument handles it.
tokenizer = Tokenizer(num_words = max_features, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower = True, split = ' ')
tokenizer.fit_on_texts(texts = slice_train['text'])
X2 = tokenizer.texts_to_sequences(texts = slice_train['text'])
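
To make the tokenization step concrete, here is a simplified pure-Python sketch of what `fit_on_texts` and `texts_to_sequences` do: words are ranked by frequency (index 1 is the most frequent), and only words whose index is below `num_words` are kept. The helper names `fit_vocab` and `to_sequences` are hypothetical, and the sketch splits on whitespace only, ignoring the punctuation `filters`.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Toy documents, for illustration only
docs = ["the news is fake", "the news is real", "the end"]
word_index = fit_vocab(docs)
sequences = to_sequences(docs, word_index, num_words=4)
print(word_index)
print(sequences)
```

Rare words simply disappear from the sequences, which is why the choice of `num_words` affects how much of each article the network actually sees.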

In [32]:
# pad (or truncate) every sequence to the same length so the inputs form a rectangular array
X2 = pad_sequences(sequences = X2, maxlen = max_features, padding = 'pre')

In [33]:
X2.shape

(788, 4500)
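
With `padding='pre'`, `pad_sequences` left-pads short sequences with zeros; with its default `truncating='pre'`, longer sequences keep only their last `maxlen` tokens. A minimal pure-Python sketch of that behavior (the helper `pad_pre` is hypothetical, not part of Keras):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` items of long sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        tail = seq[max(len(seq) - maxlen, 0):]  # 'pre' truncation drops items from the front
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded

# Toy sequences, for illustration only
result = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
```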

In [35]:
print(X2.shape)
y = slice_train['label'].values
print(y.shape)

(788, 4500)
(788,)


In [36]:
# split the preprocessed training data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 101)

In [37]:
X_train.shape

(630, 4500)
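
The 630-row training split follows from how `train_test_split` rounds a fractional `test_size`: the held-out set gets the ceiling of `test_size * n_samples`, and the training set gets the rest. A quick arithmetic check:

```python
import math

n_samples = 788   # rows in X2 after cleaning
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # 157.6 rounds up to 158
n_train = n_samples - n_test

print(n_train, n_test)
```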

The training data is now preprocessed and ready for the neural network.

Next, we build the network.

In [38]:
from tensorflow.keras.layers import Embedding 
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential 
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM ,Dense, Bidirectional


In [39]:
voc_size = 500

In [47]:
# LSTM Neural Network (earlier draft, disabled)
# lstm_model = Sequential()
# #lstm_model.add(Embedding(voc_size, input_length = X_train.shape[1], output_dim = 120))
# lstm_model.add(LSTM(units = 120, return_sequences = True, input_shape = (X_train.shape[1], 1)))
# lstm_model.add(Dropout(rate = 0.5))
# lstm_model.add(Dense(120, activation = 'tanh'))
# lstm_model.add(layer = Dropout(rate = 0.5))
# # Dense takes `activation`, not `recurrent_activation`
# lstm_model.add(Dense(units = len(set(y)), activation = 'sigmoid'))
# # compiling the model
# lstm_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [66]:
embedding_vector_features = 40  # embedding size; other values can be tried
model1 = Sequential()
# vocabulary size and sequence length both equal max_features (4500), matching the tokenizer and padding above
model1.add(Embedding(max_features, embedding_vector_features, input_length=X_train.shape[1]))
model1.add(Dropout(0.7))
model1.add(Bidirectional(LSTM(100)))
model1.add(Dropout(0.7))
# single sigmoid unit for the binary label, paired with binary cross-entropy
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model1.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4500, 40)          180000    
_________________________________________________________________
dropout_8 (Dropout)          (None, 4500, 40)          0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 200)               112800    
_________________________________________________________________
dropout_9 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 201       
Total params: 293,001
Trainable params: 293,001
Non-trainable params: 0
_________________________________________________________________
None
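
The parameter counts in the summary can be reproduced by hand. An LSTM has four gates, each with an input kernel, a recurrent kernel, and a bias, and `Bidirectional` doubles that. A short check, with the sizes read off the summary above:

```python
vocab_size = 4500   # Embedding input_dim implied by 180,000 / 40
embed_dim = 40
lstm_units = 100

embedding_params = vocab_size * embed_dim
# 4 gates x (input kernel + recurrent kernel + bias), per direction
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
bilstm_params = 2 * lstm_params
# Dense(1) on the concatenated forward/backward states (2 * lstm_units inputs) plus one bias
dense_params = 2 * lstm_units * 1 + 1

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
```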


In [67]:
lstm_model_fit = model1.fit(X_train, y_train, epochs = 5)

# TODO: add early stopping once a stopping threshold is defined

Epoch 1/5


InvalidArgumentError:  Received a label value of 1 which is outside the valid range of [0, 1).  Label values: 1 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
	 [[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at <ipython-input-67-4c34c6bdc0e4>:1) ]] [Op:__inference_train_function_28824]

Function call stack:
train_function
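
The error above is what Keras raises when `sparse_categorical_crossentropy` is paired with a single output unit: with one unit there is only one class, so the only valid label is 0. A single sigmoid unit belongs with `binary_crossentropy`, and a two-unit softmax with `sparse_categorical_crossentropy`. The pure-Python sketch below (helper names and the probability 0.8 are hypothetical) shows that the two valid pairings give the same loss for a binary problem:

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one sigmoid output p = P(label == 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def sparse_categorical_crossentropy(y_true, probs):
    """Loss for a softmax output; y_true must be a valid index into probs."""
    if not 0 <= y_true < len(probs):
        raise ValueError(f"label {y_true} outside valid range [0, {len(probs)})")
    return -math.log(probs[y_true])

p = 0.8  # hypothetical predicted probability of class 1
for label in (0, 1):
    bce = binary_crossentropy(label, p)
    sce = sparse_categorical_crossentropy(label, [1 - p, p])
    print(label, round(bce, 6), round(sce, 6))

# With a single output unit there is only one class, so label 1 is rejected.
try:
    sparse_categorical_crossentropy(1, [1.0])
except ValueError as err:
    print(err)
```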


In [48]:
max_features = 4500


Constructing the GRU neural network: suited to short sequences of data (short memory)

In [49]:
# GRU neural Network
gru_model = Sequential(name = 'gru_nn_model')
gru_model.add(layer = Embedding(input_dim = max_features, output_dim = 40, name = '1st_layer'))
gru_model.add(layer = GRU(units = 40, dropout = 0.2, 
                          recurrent_dropout = 0.2, recurrent_activation = 'relu', 
                          activation = 'relu', name = '2nd_layer'))
gru_model.add(layer = Dropout(rate = 0.4, name = '3rd_layer'))
gru_model.add(layer = Dense(units = 40, activation = 'relu', name = '4th_layer'))
gru_model.add(layer = Dropout(rate = 0.2, name = '5th_layer'))
gru_model.add(layer = Dense(units = len(set(y_train)), activation = 'softmax', name = 'output_layer'))
# compiling the model: integer labels with a two-unit softmax output call for the sparse categorical loss
gru_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [50]:
gru_model.summary()

Model: "gru_nn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
1st_layer (Embedding)        (None, None, 40)          180000    
_________________________________________________________________
2nd_layer (GRU)              (None, 40)                9840      
_________________________________________________________________
3rd_layer (Dropout)          (None, 40)                0         
_________________________________________________________________
4th_layer (Dense)            (None, 40)                1640      
_________________________________________________________________
5th_layer (Dropout)          (None, 40)                0         
_________________________________________________________________
output_layer (Dense)         (None, 2)                 82        
Total params: 191,562
Trainable params: 191,562
Non-trainable params: 0
________________________________________________
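
As a sanity check, the GRU summary can be reproduced by hand. A GRU has three gates, and the 9,840 figure matches the variant with two bias vectors per gate (`reset_after=True`, the TensorFlow 2 default). Sizes are read off the summary above:

```python
vocab_size, embed_dim, gru_units, hidden_units, n_classes = 4500, 40, 40, 40, 2

embedding_params = vocab_size * embed_dim
# 3 gates x (input kernel + recurrent kernel + 2 bias vectors)
gru_params = 3 * (embed_dim * gru_units + gru_units * gru_units + 2 * gru_units)
# Dense layers: weights plus one bias per unit
hidden_dense_params = gru_units * hidden_units + hidden_units
output_params = hidden_units * n_classes + n_classes

total = embedding_params + gru_params + hidden_dense_params + output_params
print(gru_params, hidden_dense_params, output_params, total)
```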

In [52]:
gru_model_fit = gru_model.fit(X_train, y_train, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Now preparing the test dataset

In [53]:
print(test.shape)
test_data = test.copy()
print(test_data.shape)

(5200, 4)
(5200, 4)


In [54]:
test_data = test_data.set_index('id', drop = True)
test_data.shape

(5200, 3)

**Filling the Missing values**

In [55]:
test_data = test_data.fillna(' ')
print(test_data.shape)
test_data.isnull().sum()

(5200, 3)


title     0
author    0
text      0
dtype: int64

In [56]:
# reuse the tokenizer fitted on the training text; refitting on the test set would change the word indices
test_text = tokenizer.texts_to_sequences(texts = test_data['text'])

In [57]:
test_text = pad_sequences(sequences = test_text, maxlen = max_features, padding = 'pre')
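
The test text has to be encoded with the vocabulary learned from the training data, because fitting again on new text can reorder the word-to-index mapping the network was trained on. A small pure-Python sketch of the effect (the helper `rank_words` is hypothetical and only mimics frequency-ranked indexing; the toy sentences are made up):

```python
from collections import Counter

def rank_words(counts):
    """Map each word to its frequency rank (1 = most frequent)."""
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

train_counts = Counter("fake news fake story".split())
before = rank_words(train_counts)

# Accumulating counts from test text, as a second fit on the same tokenizer would
refit_counts = train_counts + Counter("real news real news real".split())
after = rank_words(refit_counts)

print(before)
print(after)
```

The word "fake" moves from index 1 to index 3 after the refit, so an embedding row learned for one word would end up being looked up for another.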

Prediction:

In [70]:
lstm_prediction = model1.predict(test_text)

KeyboardInterrupt: 

In [75]:
lstm_prediction

array([0.5046744 , 0.5039911 , 0.50524086, ..., 0.50563794, 0.50482506,
       0.503637  ], dtype=float32)

In [74]:
lstm_prediction = lstm_prediction.ravel()  # flatten the (n, 1) predictions to shape (n,)

The LSTM predictions were judged more accurate, so they are used for the submission.

In [76]:
submission = pd.DataFrame({'id':test_data.index, 'label':lstm_prediction})
submission.shape

(5200, 2)

In [77]:
submission.head()

|  | id | label |
|---|---|---|
| 0 | 20800 | 0.504674 |
| 1 | 20801 | 0.503991 |
| 2 | 20802 | 0.505241 |
| 3 | 20803 | 0.503716 |
| 4 | 20804 | 0.505237 |
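
The `label` column above holds sigmoid probabilities rather than class labels. If hard 0/1 labels are wanted, matching the training labels, the probabilities can be thresholded. A minimal sketch, assuming a cutoff of 0.5 (the cutoff and the helper `to_labels` are assumptions, not part of the original notebook):

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 class labels."""
    return [int(p >= threshold) for p in probabilities]

# Hypothetical probabilities, for illustration only
labels = to_labels([0.12, 0.50, 0.97, 0.49])
print(labels)
```

The helper accepts any iterable of floats, so it also works on a flattened NumPy array of predictions.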


In [None]:
submission.to_csv('submission.csv', index = False)