# Fake news detection

**Authors:** Peter Mačinec, Simona Miková

## Model construction

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import sys
sys.path.append('..')

### Reading data

In [2]:
df = pd.read_csv('../data/preprocessed/dataset.csv', index_col=0)

Let’s use only a sample of the data for now.

In [3]:
df = df.sample(300)

In [4]:
df.head()

| | body | label |
|---|---|---|
| 351651 | two people were bitten by a black mammal at a ... | unreliable |
| 353884 | vaccines are evaluated for safety in studies w... | reliable |
| 292248 | summary a new study in neuropsychopharmacology... | reliable |
| 270158 | at the recent conference of the committee for... | reliable |
| 236449 | the fukushima fallout awareness network ffan i... | unreliable |


Labels need to be encoded as 0 and 1 (binary label encoding: 1 for unreliable, 0 for reliable), so that the model can work with them.

In [5]:
df['label_encoded'] = df['label'].apply(lambda label: 1 if label == 'unreliable' else 0)
labels = np.asarray(df['label_encoded'])

Let’s check the shape of labels.

In [6]:
labels.shape

(300,)
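The lambda above maps every value other than 'unreliable' to 0, which would also cover a misspelled or missing label. A minimal sketch of a stricter encoding follows, assuming the dataset contains only the two labels visible in the sample output; `LABEL_TO_ID` and `encode_labels` are hypothetical names and not part of the project.

```python
import numpy as np

# Hypothetical explicit mapping; assumes only these two labels occur in the data.
LABEL_TO_ID = {"reliable": 0, "unreliable": 1}

def encode_labels(labels):
    """Encode string labels as 0/1 and fail loudly on anything unexpected."""
    unknown = sorted({repr(label) for label in labels if label not in LABEL_TO_ID})
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return np.asarray([LABEL_TO_ID[label] for label in labels], dtype=np.int64)

example = ["unreliable", "reliable", "reliable", "reliable", "unreliable"]
encoded = encode_labels(example)
# Class counts: position 0 = reliable, position 1 = unreliable.
counts = np.bincount(encoded, minlength=2)
print(encoded, counts)
```

Counting the classes right after encoding is a cheap way to see how balanced a small random sample is.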

### Embeddings preprocessing

In [7]:
from src.model.preprocessing import get_sequences_and_word_index
from sklearn.model_selection import train_test_split

The maximum number of words to keep is set to 20000 for now.

In [8]:
max_words = 20000

Now we need to generate the sequences and the word index table from the article bodies.

In [9]:
%%time
sequences, word_index = get_sequences_and_word_index(df['body'], max_words)

Wall time: 282 ms


Let’s have a look at the number of unique tokens.

In [10]:
len(word_index)

18843

And the shape of the input sequences.

In [11]:
sequences.shape

(300, 5387)
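The shape (300, 5387) suggests that every article becomes one row of integer word indexes, padded to the length of the longest article in the sample. The code of `get_sequences_and_word_index` is not shown here, so the following is only a sketch of the idea under stated assumptions: whitespace tokenization, indexes assigned by descending frequency starting at 1, index 0 reserved for padding, and padding added at the end. `build_sequences` is a hypothetical stand-in, not the project's function.

```python
from collections import Counter

import numpy as np

def build_sequences(texts, max_words):
    """Hypothetical stand-in for get_sequences_and_word_index (not the project's code)."""
    tokenized = [text.split() for text in texts]
    counts = Counter(token for tokens in tokenized for token in tokens)
    # The most frequent word gets index 1; index 0 is reserved for padding.
    word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
    # Drop words whose index exceeds max_words (assumed meaning of the limit).
    encoded = [
        [word_index[t] for t in tokens if word_index[t] <= max_words]
        for tokens in tokenized
    ]
    max_len = max(len(seq) for seq in encoded)
    sequences = np.zeros((len(encoded), max_len), dtype=np.int64)
    for row, seq in enumerate(encoded):
        sequences[row, :len(seq)] = seq  # remaining positions stay 0 (padding)
    return sequences, word_index

texts = ["the tumor was small", "the study of the tumor"]
sequences, word_index = build_sequences(texts, max_words=20000)
print(sequences.shape, word_index["tumor"])
```

With padding to the longest article, a single very long text determines the width of the whole matrix, which is worth keeping in mind before moving from the sample to more data.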

### Loading fastText model

In [12]:
from src.model.fasttext import read_fasttext_model

Read the fastText model from the .vec file.

In [13]:
%%time
fasttext = read_fasttext_model('../models/fasttext/wiki-news-300d-1M.vec')

Wall time: 2min 46s


In [14]:
len(fasttext), len(fasttext['work'])

(999995, 300)
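A fastText .vec file is plain text: the first line holds the number of words and the vector dimension, and each following line holds a word followed by its vector values separated by spaces. The sketch below parses that layout into a dictionary of NumPy arrays. `load_vec` is a hypothetical helper for illustration; `read_fasttext_model` may be implemented differently.

```python
import io

import numpy as np

def load_vec(lines, limit=None):
    """Parse .vec text: a 'count dim' header, then 'word v1 ... v_dim' per line."""
    lines = iter(lines)
    n_words, dim = map(int, next(lines).split())
    vectors = {}
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

# Toy data in the same layout; a real file would be opened with open(path, encoding="utf-8").
toy = io.StringIO("2 3\nwork 0.1 0.2 0.3\ntumor -0.5 0.0 0.25\n")
vectors = load_vec(toy)
print(len(vectors), len(vectors["work"]))
```

The optional `limit` argument reads only the first lines of the file, which can shorten loading while experimenting on a small sample.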

The embeddings dimension (length of the vectors) is set to 300, matching the loaded fastText vectors.

In [15]:
embeddings_dim = 300

### Embeddings matrix

In [16]:
from src.model.preprocessing import get_embeddings_matrix

Now we can build the embeddings matrix from the word index and the pre-trained embeddings (in our case fastText).

In [17]:
%%time
embeddings_matrix = get_embeddings_matrix(word_index, fasttext, embeddings_dim)

Number of words not found in pre-trained embeddings: 2077
Wall time: 2.68 s


In [18]:
embeddings_matrix.shape

(18843, 300)
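Conceptually, the embeddings matrix places the pre-trained vector of each word in the row given by that word's index, and leaves the rows of unknown words at zero. The sketch below illustrates this with toy data. `build_embeddings_matrix` is a hypothetical helper, and it assumes 1-based word indexes with row 0 kept free for padding, so it allocates one row more than there are words; the project's `get_embeddings_matrix` returns exactly `len(word_index)` rows, so its indexing convention may differ.

```python
import numpy as np

def build_embeddings_matrix(word_index, vectors, dim):
    """Hypothetical sketch: row i holds the vector of the word whose index is i."""
    # One extra row so the largest index is valid and row 0 stays free for padding.
    matrix = np.zeros((max(word_index.values()) + 1, dim), dtype=np.float32)
    missing = 0
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing += 1  # word has no pre-trained vector; its row stays zero
        else:
            matrix[idx] = vec
    return matrix, missing

vectors = {
    "the": np.array([0.1, 0.2], dtype=np.float32),
    "tumor": np.array([-0.5, 0.25], dtype=np.float32),
}
word_index = {"the": 1, "tumor": 2, "zzz": 3}
matrix, missing = build_embeddings_matrix(word_index, vectors, dim=2)
print(matrix.shape, missing)
```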

Let’s check whether the index used in the sequences points to the right vector for a randomly chosen word (for example the word 'tumor').

First we find the word in word_index.

In [19]:
word_index['tumor']

2271

Now we check the fastText vector for this word (first 50 values).

In [20]:
fasttext['tumor'][:50]

array([ 0.0094, -0.2403, -0.0307,  0.066 , -0.1379,  0.0199,  0.004 ,
       -0.1303,  0.3053,  0.1848,  0.0616, -0.0794,  0.0279, -0.1055,
       -0.1279,  0.0466,  0.0365,  0.0953, -0.091 , -0.1541, -0.5253,
        0.0725, -0.0753,  0.2322, -0.0999, -0.0431, -0.1307,  0.0884,
       -0.0428, -0.0842, -0.0598, -0.0334, -0.0154,  0.0476,  0.309 ,
       -0.1065,  0.1463, -0.0541, -0.0502, -0.0209,  0.0089,  0.1604,
        0.1508,  0.153 ,  0.0445,  0.0035, -0.0885, -0.0259,  0.1005,
       -0.1053], dtype=float32)

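And compare it to the vector at that index in embeddings_matrix. Note that the cell below reads the hard-coded row 912, while word_index['tumor'] returned 2271 above; the data sample is drawn randomly, so the index changes between runs and is safer to look up as word_index['tumor'].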

In [21]:
embeddings_matrix[912][:50]

array([-1.08599998e-01,  2.28300005e-01,  1.42299995e-01,  1.32300004e-01,
        4.85000014e-02, -2.95000002e-02, -1.27299994e-01, -1.31000001e-02,
        7.50000030e-02,  6.53000027e-02, -1.52600005e-01,  1.40000004e-02,
       -3.95999998e-02,  2.01999992e-02, -1.34100005e-01, -1.18000004e-02,
        7.63000026e-02,  3.77999991e-02, -1.20099999e-01, -3.29000019e-02,
       -5.58399975e-01, -1.53999999e-02,  1.71800002e-01,  6.62999973e-02,
        3.04000005e-02,  1.69499993e-01,  9.80000012e-03, -7.73999989e-02,
        2.64099985e-01,  1.09600000e-01, -8.61999989e-02, -1.13600001e-01,
       -1.74099997e-01,  1.58700004e-01,  2.83600003e-01,  7.51999989e-02,
       -3.94000001e-02,  1.29500002e-01,  2.23100007e-01, -3.99999990e-04,
       -8.96999985e-02, -4.60000010e-03,  1.36399999e-01, -4.21000011e-02,
        9.99999978e-03,  1.48100004e-01, -6.36999980e-02,  6.48000017e-02,
        7.37000033e-02,  2.23499998e-01])

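**What does this check show?**   
If the row of embeddings_matrix at the word's index equals the word's fastText vector, the words in the sequences are correctly mapped into the embedding matrix that will be used in training. The two vectors printed above differ, so the recorded outputs do not confirm this yet; the comparison has to be repeated with the matching index.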
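A spot check on one word depends on typing the right index by hand. The comparison can also be done for every word at once, as in the following sketch with toy data. `find_mismatches` is a hypothetical helper, and it assumes that row `i` of the matrix belongs to the word with index `i`.

```python
import numpy as np

def find_mismatches(word_index, vectors, matrix):
    """Return the words whose matrix row differs from their pre-trained vector."""
    return [
        word
        for word, idx in word_index.items()
        if word in vectors and not np.allclose(matrix[idx], vectors[word])
    ]

vectors = {
    "the": np.array([0.1, 0.2], dtype=np.float32),
    "tumor": np.array([-0.5, 0.25], dtype=np.float32),
}
word_index = {"the": 1, "tumor": 2, "zzz": 3}

matrix = np.zeros((4, 2))
matrix[1] = vectors["the"]
matrix[2] = vectors["tumor"]
consistent = find_mismatches(word_index, vectors, matrix)

matrix[2] = 0.0  # simulate a wrongly filled row
broken = find_mismatches(word_index, vectors, matrix)
print(consistent, broken)
```

In the notebook, the same function could be called with `word_index`, `fasttext` and `embeddings_matrix`; an empty result would mean that every word with a pre-trained vector is mapped consistently.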

### Training

Split data into train and test subsets.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    sequences, 
    labels, 
    test_size=0.10, 
    random_state=1
)

In [23]:
from src.model.model import FakeNewsDetectionNet

In [24]:
%load_ext tensorboard
%tensorboard --logdir logs --bind_all

ERROR: Timed out waiting for TensorBoard to start. It may still be running as pid 3360.

And now we can finally train our model on the sample of the dataset.

In [25]:
import tensorflow.keras as keras
import os
import datetime

model = FakeNewsDetectionNet(
    dim_input=len(word_index),
    dim_embeddings=embeddings_dim,
    embeddings=embeddings_matrix,
    lstm_units=64,
    num_hidden_layers=1
)

model.compile(
    optimizer='Adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    keras.callbacks.TensorBoard(
        log_dir=os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")),
        histogram_freq=1,
        profile_batch=0
    )
]

model.fit(
    x=X_train,
    y=y_train,
    batch_size=16,
    validation_data=(X_test, y_test),
    callbacks=callbacks,
    epochs=3
)

model.summary()

Train on 270 samples, validate on 30 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "fake_news_detection_net"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  5652900   
_________________________________________________________________
bidirectional (Bidirectional multiple                  186880    
_________________________________________________________________
dense (Dense)                multiple                  8256      
_________________________________________________________________
dense_1 (Dense)              multiple                  65        
Total params: 5,848,101
Trainable params: 195,201
Non-trainable params: 5,652,900
_________________________________________________________________
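The parameter counts in the summary can be reproduced by hand, which also shows why only 195,201 parameters are trainable: the embedding layer holds the pre-trained fastText vectors and accounts for all non-trainable parameters. The sketch below assumes that the hidden dense layer has 64 units and that the forward and backward LSTM outputs are concatenated; both are inferred from the numbers in the summary, not read from the model code.

```python
vocab_size = 18843    # len(word_index) in this run
embedding_dim = 300
lstm_units = 64
dense_units = 64      # assumption, inferred from the 8256 parameters of the first Dense layer

# Embedding: one vector per word.
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_direction = 4 * lstm_units * (embedding_dim + lstm_units + 1)
bilstm_params = 2 * lstm_direction

# Hidden dense layer on the concatenated forward and backward states.
dense_params = (2 * lstm_units) * dense_units + dense_units

# Output layer: one sigmoid unit.
output_params = dense_units * 1 + 1

trainable = bilstm_params + dense_params + output_params
total = embedding_params + trainable
print(embedding_params, bilstm_params, dense_params, output_params, trainable, total)
```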


### Conclusion
The model has been constructed and tested on a small sample of the data (proof of concept). Now we can use it in training.