# Fake news detection

**Authors:** Peter Mačinec, Simona Miková

## Model construction

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import sys
sys.path.append('..')

%load_ext tensorboard
%tensorboard --logdir logs --bind_all

### Reading data

In [2]:
df = pd.read_csv('../data/preprocessed/dataset.csv', index_col=0)

Let’s use only a sample of the data for now.

In [3]:
df = df.sample(300)

In [4]:
df.head()

Unnamed: 0,body,label
339022,alive! catholic monthly newspaper ireland by n...,unreliable
314059,new imaging technique could make brain tumor r...,reliable
299697,last september i sent you an e alert about a s...,unreliable
354092,"the answer to this question is complicated, so...",reliable
272561,"for a trip to the er, some are opting for uber...",reliable


Labels need to be encoded as integers so that the model can consume them: unreliable becomes 1 and every other label becomes 0. This is a binary label encoding rather than a one-hot encoding.

In [6]:
df['label_encoded'] = df['label'].apply(lambda label: 1 if label == 'unreliable' else 0)
labels = np.asarray(df['label_encoded'])

Let’s check the shape of labels.

In [7]:
labels.shape

(300,)
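The lambda above sends every label other than unreliable to 0, so a misspelled or unexpected label would silently be counted as reliable. A minimal sketch of a stricter encoding follows; LABEL_TO_ID and encode_labels are hypothetical names, and the sample labels are the five rows shown by df.head().

```python
from collections import Counter

# Hypothetical explicit mapping; the notebook itself uses a lambda instead.
LABEL_TO_ID = {"reliable": 0, "unreliable": 1}


def encode_labels(labels):
    """Map string labels to 0/1 and fail loudly on anything unexpected."""
    unknown = set(labels) - LABEL_TO_ID.keys()
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return [LABEL_TO_ID[label] for label in labels]


# Labels of the five rows displayed by df.head()
sample_labels = ["unreliable", "reliable", "unreliable", "reliable", "reliable"]
encoded = encode_labels(sample_labels)

# Counting the classes is a cheap way to spot a badly imbalanced sample.
class_counts = Counter(encoded)
```

On the full sample, the same class count would show how many of the 300 articles fall into each class.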

### Embeddings preprocessing

In [9]:
from src.model.preprocessing import get_sequences_and_word_index
from sklearn.model_selection import train_test_split

The maximum number of words to keep is set to 20000 for now.

In [10]:
max_words = 20000

Now we need to generate the sequences and the word index table from the article bodies.

In [12]:
%%time
sequences, word_index = get_sequences_and_word_index(df['body'], max_words)

Wall time: 310 ms


Let’s have a look at the number of unique tokens.

In [13]:
len(word_index)

18549

And at the shape of the input sequences.

In [14]:
sequences.shape

(300, 4886)
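The helper get_sequences_and_word_index lives in src.model.preprocessing and is not shown here. The sketch below is one plausible pure-Python reading of what such a helper does, assuming whitespace tokenization, frequency-ranked indexes starting at 1, and left-padding with 0 up to the longest text. All names in it are hypothetical. Under that assumption, the second dimension of the shape above (4886) would be the token count of the longest article in the sample.

```python
from collections import Counter


def build_word_index(texts, max_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_words)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}


def texts_to_padded_sequences(texts, word_index):
    """Turn each text into a list of word indexes, all of equal length."""
    sequences = [
        [word_index[word] for word in text.split() if word in word_index]
        for text in texts
    ]
    max_len = max(len(seq) for seq in sequences)
    # Left-pad with zeros so that every row has the same length.
    return [[0] * (max_len - len(seq)) + seq for seq in sequences]


toy_texts = [
    "new imaging technique",
    "brain tumor imaging technique could help",
]
toy_word_index = build_word_index(toy_texts, max_words=20000)
toy_sequences = texts_to_padded_sequences(toy_texts, toy_word_index)
```

Here toy_sequences has two rows of six entries each, and the shorter text starts with three padding zeros.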

### Loading fastText model

In [17]:
from src.model.fasttext import read_fasttext_model

Read the fastText model from a .vec file.

In [18]:
%%time
fasttext = read_fasttext_model('../models/fasttext/wiki-news-300d-1M.vec')

Wall time: 4min 52s


In [19]:
len(fasttext), len(fasttext['work'])

(999995, 300)
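A .vec file stores the vectors as plain text: a header line with the number of words and the dimension, then one line per word with the word followed by its values. read_fasttext_model is not shown, so the following is only a sketch of a reader for that layout, run on a two-word in-memory example. The name read_vec and the toy values are made up for illustration, and the sketch returns plain lists rather than NumPy arrays.

```python
import io


def read_vec(handle):
    """Parse the text layout: a 'count dim' header, then 'word v1 ... vd' lines."""
    _n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        # Split from the right so that only the last `dim` fields are values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"malformed line: {line!r}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors, dim


# Two made-up 3-dimensional vectors standing in for the real 300-dimensional file.
toy_vec_file = io.StringIO("2 3\nwork 0.1 0.2 0.3\ntumor -0.4 0.5 0.6\n")
toy_vectors, toy_dim = read_vec(toy_vec_file)
```

With the real file, the same two checks as in the cell above (number of words and length of one vector) are a quick way to confirm the file was read completely.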

The embeddings dimension (length of each vector) is set to 300, matching the length of the fastText vectors.

In [20]:
embeddings_dim = 300

### Embeddings matrix

In [23]:
from src.model.preprocessing import get_embeddings_matrix

Now we can build the embeddings matrix from the word index and the pre-trained embeddings (in our case fastText).

In [25]:
%%time
embeddings_matrix = get_embeddings_matrix(word_index, fasttext, embeddings_dim)

Number of words not found in pre-trained embeddings: 1951
Wall time: 14.5 s


In [21]:
embeddings_matrix.shape

(19401, 300)
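Since get_embeddings_matrix is imported from src.model.preprocessing and not shown, here is a rough sketch of how such a matrix is commonly assembled. It assumes that row i holds the vector of the word with index i, that row 0 is kept for padding, and that words missing from the pre-trained vectors keep an all-zero row. The function name and the toy data are hypothetical.

```python
import numpy as np


def build_embeddings_matrix(word_index, vectors, dim):
    """Place each word's pre-trained vector in the row given by its index."""
    # One extra row so that index 0 (padding) has a row of its own.
    matrix = np.zeros((max(word_index.values()) + 1, dim))
    missing = 0
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing += 1  # row stays all zeros
        else:
            matrix[idx] = vector
    return matrix, missing


# Made-up 3-dimensional vectors; "zzzunknown" has no pre-trained vector.
demo_word_index = {"imaging": 1, "tumor": 2, "zzzunknown": 3}
demo_vectors = {"imaging": [0.1, 0.2, 0.3], "tumor": [-0.4, 0.5, 0.6]}
demo_matrix, demo_missing = build_embeddings_matrix(demo_word_index, demo_vectors, dim=3)
```

In this toy run one word is missing, which corresponds to the kind of count printed by the cell above.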

Let’s check that the indexes used in the sequences match the right rows of the embeddings matrix for some randomly chosen word (for example the word 'tumor').

First, we look up the word in word_index.

In [26]:
word_index['tumor']

912

Next, we look at the fastText vector for this word.

In [36]:
fasttext['tumor'][:50]

array([ 0.0094, -0.2403, -0.0307,  0.066 , -0.1379,  0.0199,  0.004 ,
       -0.1303,  0.3053,  0.1848,  0.0616, -0.0794,  0.0279, -0.1055,
       -0.1279,  0.0466,  0.0365,  0.0953, -0.091 , -0.1541, -0.5253,
        0.0725, -0.0753,  0.2322, -0.0999, -0.0431, -0.1307,  0.0884,
       -0.0428, -0.0842, -0.0598, -0.0334, -0.0154,  0.0476,  0.309 ,
       -0.1065,  0.1463, -0.0541, -0.0502, -0.0209,  0.0089,  0.1604,
        0.1508,  0.153 ,  0.0445,  0.0035, -0.0885, -0.0259,  0.1005,
       -0.1053], dtype=float32)

And compare it to the row at that index in embeddings_matrix.

In [37]:
embeddings_matrix[912][:50]

array([ 0.0094    , -0.2403    , -0.0307    ,  0.066     , -0.13789999,
        0.0199    ,  0.004     , -0.1303    ,  0.3053    ,  0.1848    ,
        0.0616    , -0.0794    ,  0.0279    , -0.1055    , -0.1279    ,
        0.0466    ,  0.0365    ,  0.0953    , -0.091     , -0.1541    ,
       -0.52530003,  0.0725    , -0.0753    ,  0.2322    , -0.0999    ,
       -0.0431    , -0.13070001,  0.0884    , -0.0428    , -0.0842    ,
       -0.0598    , -0.0334    , -0.0154    ,  0.0476    ,  0.30899999,
       -0.1065    ,  0.1463    , -0.0541    , -0.0502    , -0.0209    ,
        0.0089    ,  0.1604    ,  0.1508    ,  0.153     ,  0.0445    ,
        0.0035    , -0.0885    , -0.0259    ,  0.1005    , -0.1053    ])

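**What have we shown by this?**   
Row 912 of embeddings_matrix holds the fastText vector for 'tumor', so the indexes in word_index (and therefore in the input sequences) point at the correct rows of the embeddings matrix that will be used in training.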
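The two printouts differ in the last digits (for example -0.1379 versus -0.13789999). This is consistent with the fastText array being float32, as its dtype shows, while the matrix prints with the precision of float64. A small standard-library sketch of the effect follows; as_float32 is a hypothetical helper, not part of the project.

```python
import struct


def as_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


# -0.1379 is not exactly representable in 32 bits, so the value stored in a
# float32 array is a close neighbour, visible once it is shown in 64 bits.
stored = as_float32(-0.1379)
difference = abs(stored - (-0.1379))
```

The difference is far below the four decimals shown in the float32 printout, so both arrays hold the same values.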

Split the data into train and test subsets, holding out 10% of the sample for testing.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(sequences, labels, test_size=0.10, random_state=1)

In [32]:
from src.model.model import FakeNewsDetectionNet

Now we can finally train our model on the sample of the dataset.

In [35]:
import tensorflow.keras as keras
import os
import datetime

model = FakeNewsDetectionNet(
    dim_input=len(word_index),
    dim_embeddings=300,
    embeddings=embeddings_matrix,
    lstm_units=64,
    num_hidden_layers=1
)

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    keras.callbacks.TensorBoard(
        log_dir=os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")),
        histogram_freq=1,
        profile_batch=0
    )
]

model.fit(
    x=X_train,
    y=y_train.reshape((-1,1)),
    batch_size=16,
    validation_data=(X_test, y_test.reshape((-1,1))),
    callbacks=callbacks,
    epochs=1
)

model.summary()

Train on 270 samples, validate on 30 samples
Model: "fake_news_detection_net_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      multiple                  5564700   
_________________________________________________________________
bidirectional_1 (Bidirection multiple                  186880    
_________________________________________________________________
dense_4 (Dense)              multiple                  8256      
_________________________________________________________________
dense_5 (Dense)              multiple                  65        
Total params: 5,759,901
Trainable params: 195,201
Non-trainable params: 5,564,700
_________________________________________________________________
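The parameter counts in the summary can be reproduced by hand, which is a useful check that the embedding layer is frozen and that the layer sizes are what we expect. The sketch below uses the standard LSTM and Dense parameter formulas. The value hidden_units = 64 is an assumption inferred from the 8,256 parameters of the hidden Dense layer, since FakeNewsDetectionNet itself is not shown.

```python
vocab_size = 18549    # len(word_index) reported earlier
embedding_dim = 300
lstm_units = 64
hidden_units = 64     # assumption: inferred from the Dense layer's 8,256 parameters

# Embedding: one vector of length embedding_dim per word index.
embedding_params = vocab_size * embedding_dim

# One LSTM direction has 4 gates, each with input weights,
# recurrent weights and a bias.
lstm_params_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params_one_direction

# The hidden Dense layer reads the concatenated forward and backward states.
hidden_params = (2 * lstm_units) * hidden_units + hidden_units

# Single sigmoid-style output unit: one weight per hidden unit plus a bias.
output_params = hidden_units * 1 + 1

trainable_params = bilstm_params + hidden_params + output_params
total_params = embedding_params + trainable_params

print(embedding_params, bilstm_params, hidden_params, output_params)
print(total_params, trainable_params)
```

The non-trainable count in the summary equals the embedding parameter count, which indicates that the pre-trained embeddings are not updated during training.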
