In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from processing import Text_processing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Embedding,Activation, Dropout, Conv1D, MaxPooling1D, GlobalMaxPooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Loading of the training and test datasets, stored in Full_News and Full_News_test respectively.

In [38]:
path = 'C:\\Users\\paulo\\Desktop\\Mestrado\\2_Ano\\SIB\\Trabalho\\CoAID-master\\CoAID-master'

treated_reals_05 = pd.read_csv(path+'\\05-01-2020\\treated_reals.csv',index_col=0)
treated_reals_05.shape
treated_fakes_05 = pd.read_csv(path+'\\05-01-2020\\treated_fakes.csv',index_col=0)
treated_fakes_05.shape
treated_reals_07 = pd.read_csv(path+'\\07-01-2020\\treated_reals.csv',index_col=0)
treated_reals_07.shape
treated_fakes_07 = pd.read_csv(path+'\\07-01-2020\\treated_fakes.csv',index_col=0)
treated_fakes_07.shape
treated_reals_09 = pd.read_csv(path+'\\09-01-2020\\treated_reals.csv',index_col=0)
treated_reals_09.shape
treated_fakes_09 = pd.read_csv(path+'\\09-01-2020\\treated_fakes.csv',index_col=0)
treated_fakes_09.shape

Full_News = pd.concat([treated_reals_05, treated_fakes_05, treated_reals_07,
                       treated_fakes_07])
Full_News_test = pd.concat([treated_reals_09, treated_fakes_09])

The titles and contents of the news items are stacked into a single column in order to maximize the amount of text available to train the model. NA values and duplicates are then removed so that each input is unique.

The resulting dataset has two columns: one with the text (content or title) and one with the corresponding State, where State = 1 if the item is real and State = 0 if it is fake.

Before the collected contents and titles can be used, the text must undergo a preprocessing routine that removes irrelevant words and punctuation. This routine is defined in the processing.py module, also available in the GitHub repository.

In [42]:
contents_state = Full_News[['content','State']]
titles_state = Full_News[['title','State']]
titles_state=titles_state.rename(columns={'title':'content','State':'State'})

Full = pd.concat([contents_state,titles_state],axis=0)
Full = Full.dropna(subset = ['content'])
Full = Full.drop_duplicates(subset = ['content'])

## After removing NAs and duplicated contents/titles, the size of X dropped from 7162 to 6095

text = Full['content'].tolist()
labels = Full['State']

pre_processed_text = Text_processing(text, run_all=True)
pre_processed_text = pre_processed_text.get_processed_text()
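
To make the stacking step above concrete, here is a minimal sketch on a made-up three-row frame. Only the column names ('title', 'content', 'State') come from the notebook; the rows are invented for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; the rows are made up.
news = pd.DataFrame({
    "title":   ["Vaccine approved", "Miracle cure found", None],
    "content": ["Regulators approved the vaccine.", "Miracle cure found", "Masks reduce spread."],
    "State":   [1, 0, 1],
})

contents = news[["content", "State"]]
titles = news[["title", "State"]].rename(columns={"title": "content"})

# Stack titles under contents, then drop missing and repeated texts.
stacked = pd.concat([contents, titles], axis=0)
stacked = stacked.dropna(subset=["content"])
stacked = stacked.drop_duplicates(subset=["content"])

print(len(stacked))
print(stacked["State"].tolist())
```

Six stacked rows shrink to four here: one title is missing and one title repeats a content. Note that the index is repeated after the concatenation; `reset_index(drop=True)` can be applied if a unique index is needed.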

Definition of a function that returns the encoded sequences, the vocabulary size and the Tokenizer instance. It turns each text into a fixed-length array of integers, one per word.

In [44]:
def encoding(pre_processed_text, max_length = 30, pad = 'post'):
    #tokenizer to read all the words present in our corpus
    token = Tokenizer()
    token.fit_on_texts(pre_processed_text)

    #declaring the vocab_size
    vocab_size = len(token.word_index) + 1
    
    #conversion to numerical formats
    encoded_text = token.texts_to_sequences(pre_processed_text)
    X = pad_sequences(encoded_text, maxlen=max_length, padding=pad)
    
    return (X, vocab_size, token)

In [45]:
X, vocab_size, token = encoding(pre_processed_text)
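
As an illustration of what the encoding step produces, the sketch below mimics it in plain Python on two made-up sentences. It is not the Keras implementation: `toy_encode` is a hypothetical helper that only splits on whitespace, indexes words by descending frequency starting at 1, and pads with zeros on the right. Index 0 is reserved for padding, which is why the vocabulary size is the number of distinct words plus one.

```python
from collections import Counter

def toy_encode(texts, max_length=5):
    """Index words by descending frequency (1-based) and post-pad with zeros."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
    sequences = [[word_index[w] for w in text.split()] for text in texts]
    padded = []
    for seq in sequences:
        kept = seq[-max_length:]  # longer sequences keep their last max_length tokens
        padded.append(kept + [0] * (max_length - len(kept)))
    return padded, len(word_index) + 1, word_index

padded, vocab_size_toy, word_index = toy_encode(
    ["covid vaccine works", "covid cure fake cure"], max_length=4
)
print(word_index)
print(padded)
print(vocab_size_toy)
```

With max_length = 30, as in the notebook, shorter texts are filled with zeros up to 30 positions and longer ones are cut down to 30 tokens.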

Preparing a dictionary that maps each word in the pre-trained GloVe Twitter vocabulary to its vector representation.

In [46]:
#declaring dict to store all the words as keys in the dictionary and their vector representations as values
glove_vectors = dict()

# file = open('glove.twitter.27B.100d.txt', encoding='utf-8')
file = open('C:\\Users\\paulo\\Desktop\\Mestrado\\2_Ano\\SIB\\Trabalho\\glove.twitter.27B\\glove.twitter.27B.100d.txt', encoding='utf-8')
for line in file:
    values = line.split()
    #the first field of each line is the word
    word = values[0]
    #the remaining fields are the components of its vector, parsed as floats
    vectors = np.asarray(values[1: ], dtype='float32')
    #storing the vector representation of the respective word in the dictionary
    glove_vectors[word] = vectors
file.close()
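
Each line of the GloVe text file holds a token followed by the components of its vector. The sketch below parses two made-up lines in that layout from an in-memory buffer, converting the components to float32 while reading. The words and numbers are invented, and the toy vectors have 3 components instead of 100.

```python
import io
import numpy as np

# Two made-up lines in GloVe's text layout: a token, then its vector components.
sample = io.StringIO("covid 0.1 -0.2 0.3\nvaccine 0.4 0.5 -0.6\n")

toy_vectors = {}
for line in sample:
    parts = line.rstrip().split(" ")
    toy_vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(toy_vectors["covid"].dtype, toy_vectors["covid"].shape)
```

Storing numeric arrays rather than arrays of strings keeps the dictionary smaller and makes the vectors directly usable in arithmetic.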

Definition of a function that creates a matrix with one row per token present in our dataset. If a token is found among the glove_vectors words, its vector is stored in the corresponding row; otherwise the row is left as zeros and the word (misspelled or simply absent from GloVe) is appended to a list, which is returned as well.

In [47]:
def create_matrix_vec(vocab_size,dimentions):
    misspelled = []
    word_vector_matrix = np.zeros((vocab_size, dimentions))

    for word, index in token.word_index.items():
        vector = glove_vectors.get(word)
        if vector is not None:
            word_vector_matrix[index] = vector
        else:
            misspelled.append(word)
            
    return (word_vector_matrix,misspelled)

In [48]:
word_vector_matrix, misspelled_words = create_matrix_vec(vocab_size, dimentions = 100)

In [49]:
word_vector_matrix

array([[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       ...,
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [ 0.016987,  0.7899  , -0.26255 , ..., -0.12681 , -0.29667 ,
         0.31492 ]])
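
The all-zero rows in the output above belong either to index 0, which is reserved for padding, or to words without a GloVe entry. A quick way to quantify this is to compute the vocabulary coverage. The sketch below does so on toy stand-ins for token.word_index and glove_vectors; all words and values in it are made up.

```python
import numpy as np

# Toy stand-ins for token.word_index and glove_vectors (values are made up).
toy_word_index = {"covid": 1, "vaccine": 2, "vaxxine": 3}
toy_glove = {
    "covid": np.array([0.1, 0.2], dtype="float32"),
    "vaccine": np.array([0.3, 0.4], dtype="float32"),
}

matrix = np.zeros((len(toy_word_index) + 1, 2))
missing = []
for word, index in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        matrix[index] = vector
    else:
        missing.append(word)

coverage = 1 - len(missing) / len(toy_word_index)
zero_rows = int((~matrix.any(axis=1)).sum())
print(f"coverage: {coverage:.2f}, all-zero rows: {zero_rows}")
```

On the real data, the same ratio can be obtained from len(misspelled_words) and len(token.word_index). A low coverage would mean that many inputs reach the network as zero vectors.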

Creation of the Deep Learning model, using the Sequential class of the Keras package. We must provide the X and y datasets as well as:
- vocab_size = our vocabulary size
- vec_size = the dimensions of the word vectors
- weights = the word vector matrix
- input_length = the maximum length of each sequence
- trainable: as we are using pre-trained GloVe vectors, we do not want to update the word weights in this model, so this attribute is set to False.

Fitting of the model: inside the function, the data is split into a training set and a validation set (20% of the original training dataset, stratified by label). The model is trained on the former and validated on the latter after each epoch, for the specified number of epochs (30 in this case). Note that, as written, the function returns the model as it stands after the last epoch; it does not keep a checkpoint of the epoch with the best metrics.

In [50]:
vec_size = 100
def create_model(X, y, vec_size,vocab_size,max_length, word_vector_matrix,test_size=0.2, random_state=42 ):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = random_state, test_size = test_size, stratify = y )
    model = Sequential()
    model.add(Embedding(vocab_size, vec_size, input_length=max_length, weights = [word_vector_matrix], trainable = False))
    model.add(Conv1D(64, 8, activation = 'relu'))
    #here 64 is number of filters and 8 is size of filters
    model.add(MaxPooling1D(2))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(16, activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=Adam(learning_rate = 0.0001), loss = 'binary_crossentropy', metrics = ['accuracy'])
    model.fit(X_train, y_train, epochs = 30, validation_data = (X_test, y_test))
    return model
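
To see how the layers in create_model change the length of the sequence, the sketch below traces the shapes by hand. It assumes the Keras defaults for the arguments the notebook does not set: 'valid' padding, a stride of 1 for Conv1D, and a pooling stride equal to the pool size. The two helper functions are hypothetical and only do the arithmetic.

```python
def conv1d_out(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

steps = 30                       # max_length: Embedding output is (30, 100)
steps = conv1d_out(steps, 8)     # Conv1D(64, 8)   -> (23, 64)
steps = maxpool1d_out(steps, 2)  # MaxPooling1D(2) -> (11, 64)
# The Dense layers act on the last axis only: (11, 64) -> (11, 32) -> (11, 16)
# GlobalMaxPooling1D then takes the maximum over the 11 steps -> (16,)
print(steps)
```

Under these assumptions, the two intermediate Dense layers are applied independently at each of the 11 positions, and the final Dense(1) layer receives a vector of 16 values per text.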

In [51]:
max_length = 30
modelo = create_model(X = X, y = labels ,vec_size = vec_size, vocab_size = vocab_size, max_length = max_length, word_vector_matrix = word_vector_matrix )

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


Function that performs the preprocessing, encoding and padding of a single sequence, which is necessary in case we want to predict its class using the model.

In [52]:
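def get_encode(x):
    #same preprocessing as for the training data; x is assumed to be a single text, so it is wrapped in a list
    x = Text_processing([x], run_all=True).get_processed_text()
    #encoding with the tokenizer fitted on the training data, then padding to max_length
    x = token.texts_to_sequences(x)
    x = pad_sequences(x, maxlen=max_length, padding='post')
    return x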

Test dataset preparation (removal of NAs and duplicates), preprocessing and encoding, because the model only works with numerical vectors.
The model is then evaluated on this data, by predicting the class of each sequence and comparing it to the real value provided, which returns the overall accuracy of the model.

In [53]:
contents_state_test = Full_News_test[['content','State']]
titles_state_test = Full_News_test[['title','State']]
titles_state_test = titles_state_test.rename(columns={'title':'content','State':'State'})

Full_test = pd.concat([contents_state_test,titles_state_test],axis=0)
Full_test = Full_test.dropna(subset = ['content'])
Full_test = Full_test.drop_duplicates(subset = ['content'])

In [54]:
X_test = Full_test['content'].tolist()
Y_test = Full_test['State']

In [55]:
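pre_proc_test = Text_processing(X_test, run_all=True)
pre_proc_test = pre_proc_test.get_processed_text()
#encode with the tokenizer fitted on the training data, so that the word indices match the embedding matrix
X_test = token.texts_to_sequences(pre_proc_test)
X_test = pad_sequences(X_test, maxlen=max_length, padding='post')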

In [56]:
modelo.evaluate(X_test,Y_test)



[0.28081366419792175, 0.9296987056732178]
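
The two numbers returned by evaluate are the loss and the metric set in compile, in that order: binary cross-entropy and accuracy. As a rough sketch of how both are obtained from the predicted probabilities, the NumPy function below computes them on four made-up predictions. `binary_metrics` is a hypothetical helper, and the 0.5 threshold is assumed to match the default used by Keras for binary accuracy.

```python
import numpy as np

def binary_metrics(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy and accuracy at a 0.5 threshold."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to keep the logarithms finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    accuracy = np.mean((y_prob > 0.5) == (y_true == 1))
    return loss, accuracy

# Made-up labels (1 = real, 0 = fake) and predicted probabilities of being real.
loss, acc = binary_metrics([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.8])
print(round(float(loss), 4), float(acc))
```

In this toy case three of the four texts fall on the correct side of the threshold, and the third one (a real item predicted with probability 0.4) contributes most of the loss.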

The datasets used for this project correspond to three different time periods. We used the first two periods to train and validate the model, and the last one to test its predictive capabilities, obtaining an accuracy of 0.9297.

This is a promising result. Nevertheless, we plan to test the model once more with datasets from different sources, to truly evaluate its potential.