<a href="https://colab.research.google.com/github/kelvinfoo123/Natural-Language-Processing/blob/main/Fake_News_Classification_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, I experiment with using an LSTM to classify news as fake or real based on the headline alone. 



In [None]:
import pandas as pd 
import numpy as np 

In [None]:
news = pd.read_csv("data.csv")
news.head()

Unnamed: 0,URLs,Headline,Body,Label
0,http://www.bbc.com/news/world-us-canada-414191...,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1
1,https://www.reuters.com/article/us-filmfestiva...,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1
2,https://www.nytimes.com/2017/10/09/us/politics...,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1
3,https://www.reuters.com/article/us-mexico-oil-...,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1
4,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1


In [None]:
news.isnull().sum() # No null values in the headline. 

URLs         0
Headline     0
Body        21
Label        0
dtype: int64

In [None]:
X = news['Headline']
y = news['Label']

print(X.shape)
print(y.shape)

(4009,)
(4009,)


**Text Preprocessing**

In [None]:
import nltk 
import re 
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
corpus = []
stop_words = set(stopwords.words('english')) # Build the stopword set once instead of reloading it for every word. 

for i in range(0, len(X)): 
  review = re.sub('[^a-zA-Z]', ' ', X[i]) # For every headline, replace each character that is not a letter with ' '. 
  review = review.lower() # Make all words lower case. Otherwise the same word in different cases is treated as different words. 
  review = review.split()
  review = [word for word in review if word not in stop_words] # Remove stopwords 
  review = [stemmer.stem(word) for word in review] # Stemming 

  review = ' '.join(review)
  corpus.append(review)

In [None]:
for i in range(1,11): 
  print(corpus[i])

linklat war veteran comedi speak modern america say star
trump fight corker jeopard legisl agenda
egypt cheiron win tie pemex mexican onshor oil field
jason aldean open snl vega tribut
jetnat fanduel leagu week
kansa tri tax plan similar trump fail
india rbi chief growth import cost inflat newspap
epa chief sign rule clean power plan exit tuesday
talk sale air berlin plane easyjet risk collaps report
u presid donald trump quietli sign law allow warrantless search part va dc md


**One Hot Representation**

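Despite its name, the Keras `one_hot` function does not produce one-hot vectors. It encodes a text as a list of integer word indexes, hashing each word to a value in the range [1, n), where n is the size of the vocab. Because the indexes come from a hash, two different words can occasionally share the same index. 

We encode the text this way because the embedding layer requires its input to be integer encoded. 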
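
As a rough illustration of integer encoding by hashing, the sketch below maps each word to an index using an MD5 hash. The function name `hash_encode` is hypothetical, and this is not the Keras implementation of `one_hot` (which by default relies on Python's built-in `hash`); it only shows the general idea under that assumption. 

```python
import hashlib

def hash_encode(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding later on.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

encoded = hash_encode("trump fight corker jeopard legisl agenda", 10000)
print(encoded)
```

The same word always maps to the same index, but distinct words are not guaranteed to get distinct indexes. 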

In [None]:
import tensorflow as tf 
from tensorflow.keras.preprocessing.text import one_hot

In [None]:
# Vocabulary size 
voc_size = 10000

In [None]:
onehot_repre = [one_hot(words, voc_size) for words in corpus]

for i in range(11): 
  print(onehot_repre[i])

[5158, 6484, 6823, 1668, 6327, 7425, 7412]
[1926, 9453, 7126, 2586, 650, 4102, 1761, 625, 1150]
[7412, 8097, 1668, 7841, 2446, 3487]
[9447, 1487, 9641, 1285, 7460, 1575, 3554, 8378, 9970]
[2313, 7904, 2066, 7133, 7700, 6097]
[4913, 2247, 4938, 7649]
[7637, 4985, 4370, 8277, 2331, 7412, 2751]
[4202, 8294, 7452, 2864, 3376, 9890, 4293, 7879]
[1087, 7452, 6186, 6904, 4795, 6410, 8277, 6735, 6574]
[3723, 9017, 2443, 4556, 3568, 2867, 5526, 8141, 593]
[5026, 8720, 7425, 7412, 8876, 6186, 8525, 4206, 3298, 7037, 6436, 7052, 3169, 4887]


**Padding Sequences**

The LSTM is fed batches of sequences, and all sequences in a batch must have the same length. We bring every input to the stated length by padding shorter sequences with 0; sequences longer than that length are truncated. 
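
A minimal pure-Python sketch of pre-padding is shown below. The helper `pad_pre` is a hypothetical name used only for illustration; it is meant to mirror what `pad_sequences` does with `padding = 'pre'` and its default front truncation, not to replace it. 

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items if it is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front when too long (assumes maxlen > 0)
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

result = pad_pre([[4913, 2247, 4938, 7649], [1, 2, 3, 4, 5, 6, 7]], maxlen=6)
print(result)  # [[0, 0, 4913, 2247, 4938, 7649], [2, 3, 4, 5, 6, 7]]
```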

In [None]:
# Library for padding 

from tensorflow.keras.preprocessing.sequence import pad_sequences 

In [None]:
length = 30 # Require that all inputs have length 30 

padded = pad_sequences(onehot_repre, padding = 'pre', maxlen = length) # Padding = 'pre' means pad in front of the input. 

for i in range(11): 
  print(padded[i])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0 5158 6484 6823 1668 6327
 7425 7412]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0 1926 9453 7126 2586  650 4102 1761
  625 1150]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0 7412 8097 1668 7841
 2446 3487]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0 9447 1487 9641 1285 7460 1575 3554
 8378 9970]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0 2313 7904 2066 7133
 7700 6097]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0 4913 2247
 4938 7649]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0 7637 4985 4370 8277 2331
 7412 2751]

**Building Model**

Before the LSTM layer, we need an embedding layer. In a traditional bag-of-words model, the representation vectors are sparse because the vocabulary is large. In an embedding, each word is instead represented by a dense vector. 

The embedding layer requires that the input data be integer encoded, so that each word is represented by an integer index. 

We set 3 arguments of the embedding layer: 


*   input_dim: Size of the vocab. 
*   output_dim: Size of the vector space in which each word will be embedded. 
*   input_length: Length of each input sequence. 


The embedding layer will be trained as part of the neural network. For each input sequence, its output is a 2D array of shape (input_length, output_dim), with one embedding vector for each word in the input. 
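
Conceptually, an embedding layer is a lookup into a weight matrix with one row per word index. The NumPy sketch below illustrates the resulting shapes; `embedding_table` is a randomly initialised stand-in for the trained weights, and the index values in `batch` are chosen only for illustration. 

```python
import numpy as np

voc_size, vector_size, length = 10000, 40, 30

rng = np.random.default_rng(0)
# Stand-in for the trainable weight matrix: one row per word index.
embedding_table = rng.normal(size=(voc_size, vector_size))

# A batch of two pre-padded sequences of integer word indexes.
batch = np.zeros((2, length), dtype=int)
batch[0, -3:] = [5158, 6484, 6823]
batch[1, -2:] = [7412, 8097]

embedded = embedding_table[batch]  # row lookup for every index in the batch
print(embedded.shape)  # (2, 30, 40)
```

Each sequence of 30 indexes becomes a 30 x 40 array, which is what the LSTM layer then consumes. 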


In [None]:
from tensorflow.keras import regularizers 
from tensorflow.keras.layers import Embedding 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import LSTM 
from tensorflow.keras.layers import Dense, Dropout

In [None]:
padded_final = np.array(padded)
y_final = np.array(y)

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(padded_final, y_final, test_size = 0.3, random_state = 42)

In [None]:
vector_size = 40 # Size of vector space in which word will be embedded.

model = Sequential()
model.add(Embedding(voc_size, vector_size, input_length = length))
model.add(LSTM(100)) # One LSTM layer with 100 units 
model.add(Dense(1, activation = 'sigmoid'))

In [None]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [None]:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size = 64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fa52d567490>

The training accuracy is much higher than the test accuracy, which is a sign of overfitting. To reduce it, we add dropout layers and apply L2 regularization to the output layer. 
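
To make the two techniques concrete, here is a small NumPy sketch of what they compute during training. The helper names `dropout` and `l2_penalty` are hypothetical and this is not the Keras code; it assumes the usual inverted-dropout formulation and an L2 penalty equal to a factor times the sum of squared weights. 

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero out roughly a fraction `rate` of entries and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, factor):
    """L2 penalty added to the loss: factor * sum of squared weights."""
    return factor * np.sum(weights ** 2)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 or 2.0

penalty = l2_penalty(np.array([1.0, -2.0, 3.0]), factor=0.01)
print(penalty)
```

Dropout is only active during training, while the L2 penalty discourages large weights by making them increase the loss. 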

In [None]:
model = Sequential()
model.add(Embedding(voc_size, vector_size, input_length = length))
model.add(Dropout(0.5))
model.add(LSTM(100))
model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid', kernel_regularizer = regularizers.l2(0.01)))

In [None]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [None]:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size = 64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fa52a357bd0>

**Bidirectional LSTM**

In [None]:
from tensorflow.keras.layers import Bidirectional

In [None]:
vector_size = 40 

model = Sequential()
model.add(Embedding(voc_size, vector_size, input_length = length))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(1, activation = 'sigmoid'))

In [None]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [None]:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size = 64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f829ae174d0>