<h1 align='center'> Domain-Invariant Fake News Detection </h1>
<img src="images/fake_news.gif" alt="Fake News" align="middle">

### Import Modules

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split

import re
from keras.models import Sequential
from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization, LSTM, Embedding, Reshape
from keras.models import load_model, model_from_json


### Read the Preprocessed Dataset
We crawled websites to collect a dataset of real and fake news.<br>
The dataset is divided into five categories:
- India
- Politics
- Entertainment
- Sports
- Technology

The preprocessing step takes the raw dataset and performs the following operations:
- Convert the text to lowercase
- Remove quotes
- Remove all special characters
- Replace multiple spaces with a single space
- Tokenize the words
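
The preprocessing script itself is not shown here, so the following is only a minimal sketch of how the steps listed above could be chained. The function name `preprocess` and the exact regular expressions are assumptions for illustration and may differ from what was actually run on the crawled data.

```python
import re

def preprocess(text):
    """Hypothetical helper mirroring the listed steps; not the original script."""
    text = text.lower()                        # lowercase the text
    text = re.sub(r"[\"']", "", text)          # remove quotes
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with a space
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text.split()                        # tokenize on whitespace

tokens = preprocess('He said: "Hello,   World!"')
print(tokens)
```

Joining the tokens back with single spaces gives text in the same shape as the `title` and `text` columns read in the next cell.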

We use pandas to read the dataset, which contains three columns: 'title', 'text' and 'label'.

In [6]:
df = pd.read_csv('./Preprocessing/news_data_final3.csv')
df.head()

| Unnamed: 0 | title | text | label |
|---|---|---|---|
| 0 | mumbai man prays for single buttock in next bi... | while most people pray for wealth health job a... | Fake |
| 1 | just trying to fit into delhi culture by brand... | putting across his side of the story ex bsp mp... | Fake |
| 2 | in a bid to solicit support from govt lawyers ... | with fresh allegations against mj akbar croppi... | Fake |
| 3 | shashi tharoors new book comes with a pocket d... | shashi tharoor is ready with his new book the ... | Fake |
| 4 | fir against delhi man for threatening couple b... | a case has been registered against former bsp ... | Fake |


### Using GloVe Embedding
We use [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) to initialize our word embeddings.
GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We use the 50-dimensional word vectors.

In [14]:
# Read the raw GloVe file; each line holds a word followed by its 50 vector components.
with open('/scratch/nitin/glove.6B.50d.txt','rb') as f:
    lines = f.readlines()

glove_weights = np.zeros((len(lines), 50))
words = []
for i, line in enumerate(lines):
    word_weights = line.split()
    words.append(word_weights[0])    # the token (still bytes at this point)
    weight = word_weights[1:]        # the 50 vector components
    glove_weights[i] = np.array([float(w) for w in weight])
word_vocab = [w.decode("utf-8") for w in words]

# Map each word to its GloVe vector.
word2glove = dict(zip(word_vocab, glove_weights))

In [15]:
all_text = ' '.join(df.text.values)
words = all_text.split()
u_words = Counter(words).most_common()
u_words_counter = u_words
u_words_frequent = [word[0] for word in u_words if word[1]>5] # we will only consider words that have been used more than 5 times

u_words_total = [k for k,v in u_words_counter]
word_vocab = dict(zip(word_vocab, range(len(word_vocab))))
word_in_glove = np.array([w in word_vocab for w in u_words_total])

words_in_glove = [w for w,is_true in zip(u_words_total,word_in_glove) if is_true]
words_not_in_glove = [w for w,is_true in zip(u_words_total,word_in_glove) if not is_true]

print('Fraction of unique words in glove vectors: ', sum(word_in_glove)/len(word_in_glove))

# Create the word-to-index dictionary
word2num = dict(zip(words_in_glove,range(len(words_in_glove))))
len_glove_words = len(word2num)
freq_words_not_glove = [w for w in words_not_in_glove if w in u_words_frequent]
b = dict(zip(freq_words_not_glove,range(len(word2num), len(word2num)+len(freq_words_not_glove))))
word2num = dict(**word2num, **b)
word2num['<Other>'] = len(word2num)
num2word = dict(zip(word2num.values(), word2num.keys()))

int_text = [[word2num[word] if word in word2num else word2num['<Other>'] 
             for word in content.split()] for content in df.text.values]

print('The number of unique words are: ', len(u_words))
print('The first review looks like this: ')
print(int_text[0][:20])
print('And once this is converted back to words, it looks like: ')
print(' '.join([num2word[i] for i in int_text[0][:20]]))

Fraction of unique words in glove vectors:  0.9221395932264134
The number of unique words are:  22322
The first review looks like this: 
[84, 115, 56, 2450, 8, 2051, 959, 434, 3, 712, 1419, 2793, 38, 326, 37, 96, 0, 4761, 64, 21]
And once this is converted back to words, it looks like: 
while most people pray for wealth health job and relationship hiren desai had something which even the gods would be


In [16]:
num2word[len(word2num)] = '<PAD>'
word2num['<PAD>'] = len(word2num)

for i, t in enumerate(int_text):
    if len(t)<500:
        int_text[i] = [word2num['<PAD>']]*(500-len(t)) + t
    elif len(t)>500:
        int_text[i] = t[:500]
    else:
        continue

x = np.array(int_text)
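# Labels in the CSV are 'Real' and 'Fake', so compare against 'Real';
# comparing against 'REAL' would mark every row as 0.
y = (df.label.values=='Real').astype('int')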

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
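
The loop above forces every article to exactly 500 token ids by left-padding short sequences with the `<PAD>` id and cutting long ones after the first 500 tokens. The same logic can be sketched as a small standalone function; the name `pad_or_truncate` and the toy inputs are illustrative assumptions, not part of the notebook.

```python
def pad_or_truncate(seq, length, pad_id):
    """Return a list of exactly `length` ids: left-pad short inputs, truncate long ones."""
    if len(seq) < length:
        # Padding goes in front, so the real tokens sit at the end of the sequence.
        return [pad_id] * (length - len(seq)) + list(seq)
    # Keep only the first `length` tokens (also covers the exact-length case).
    return list(seq[:length])

print(pad_or_truncate([7, 8, 9], 5, pad_id=0))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4, pad_id=0))
```

Padding on the left keeps the actual words closest to the final LSTM step, which is the state the classifier reads.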

### An Example of the Dataset

In [17]:
df[df.label=='Real'].text.values[0]

'new delhi an etihad airways flight travelling from abu dhabi to jakarta was diverted to mumbai on wednesday morning after a passenger gave birth on board ani reported the woman was taken to hospital as soon as flight ey 474 touched down at the chhatrapati shivaji international airport'

In [18]:
df[df.label=='Fake'].text.values[0]

'while most people pray for wealth health job and relationship hiren desai had something which even the gods would be shocked to hear the 30 year old business man from mira road who is a regularly commutes by local train asked for a single buttock in his next birth in his prayers the reason for this unique request was to enable him to sit as the fourth passenger on a local train seat meant for three speaking to faking news hiren said i have been travelling by train since last ten years and all the seats are already taken when i enter the compartment leaving me with that uncomfortable fourth seat where one of my buttock in left hanging without support by the end of the day i am left with one sore butt so my only prayer to the almighty is to give me just one buttock hiren s request struck a chord with many other passengers who have been braving heavy rush in local trains for all these years a city based cosmetic surgeon even offered to surgically remove one of hiren s buttock free of cha

### Model Architecture
We use an LSTM-based model to classify news as fake or real. The architecture is as follows:
![](images/Project_Arch.jpg)
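
The `Embedding` layer in the model code is created with only a vocabulary size and a dimension. One way the GloVe initialization described earlier could be wired in is to first assemble a weight matrix whose rows follow the indices in `word2num`. The sketch below does this with NumPy only. The function `build_embedding_matrix`, the small random fallback for words without a GloVe vector, and the toy dictionaries are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

def build_embedding_matrix(word2num, word2glove, dim=50, seed=0):
    """Hypothetical helper: row i holds the GloVe vector of the word with index i."""
    rng = np.random.default_rng(seed)
    # Words missing from GloVe (e.g. '<Other>', '<PAD>') keep a small random vector.
    matrix = rng.normal(scale=0.1, size=(len(word2num), dim))
    for word, idx in word2num.items():
        vec = word2glove.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# Toy stand-ins for the real dictionaries (3-dimensional for readability).
toy_glove = {"news": np.ones(3), "fake": np.full(3, 2.0)}
toy_vocab = {"news": 0, "fake": 1, "<Other>": 2, "<PAD>": 3}
emb = build_embedding_matrix(toy_vocab, toy_glove, dim=3)
print(emb.shape)
```

Such a matrix could then be handed to the layer, for example through `embeddings_initializer=keras.initializers.Constant(emb)`; the exact mechanism depends on the Keras version in use.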

In [22]:
model = Sequential()
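model.add(Embedding(len(word2num), 50))    # one 50-dimensional vector per vocabulary entry
model.add(LSTM(32))                        # 32 LSTM units; only the final state is passed on
model.add(Dense(1, activation='sigmoid'))  # single sigmoid output, matching the 0/1 label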

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 50)          1034300   
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                10624     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 1,044,957
Trainable params: 1,044,957
Non-trainable params: 0
_________________________________________________________________
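
The parameter counts in the summary can be reproduced by hand, which is a quick check that the layer shapes are what one expects. The sketch below uses the standard formulas for an embedding table, an LSTM with a single bias vector per gate, and a dense layer. The helper names are illustrative, and the vocabulary size of 20,686 is inferred from the summary itself (1,034,300 / 50).

```python
def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

vocab_size = 1034300 // 50  # rows implied by the embedding parameter count
counts = [embedding_params(vocab_size, 50), lstm_params(50, 32), dense_params(32, 1)]
print(counts, sum(counts))
```

Almost all of the parameters sit in the embedding table, so the size of the vocabulary dominates the size of the model.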


In [23]:
batch_size = 128
epochs = 10
# Note: the held-out test split doubles as the validation set here.
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))

Train on 576 samples, validate on 64 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fcb9e857320>

In [21]:
model.evaluate(X_test, y_test)



[0.00011899125092895702, 1.0]
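
The second value returned by `model.evaluate` is the accuracy. With only 64 held-out samples, it helps to compare that number against the accuracy of always predicting the most frequent class. The helper below is a minimal sketch of that comparison; `majority_baseline` is a hypothetical name and the toy labels are made up for illustration.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent label in `y`."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=2)  # number of 0s and 1s
    return counts.max() / counts.sum()

print(majority_baseline([0, 0, 0, 1]))
```

Calling it on `y_test` shows how much of the reported accuracy could be explained by class imbalance alone; a baseline close to 1.0 would mean the labels contain almost a single class.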