# Misinformation Verity - Project Demo

Mia Markovic, 39425669, mmarkovi@uci.edu

Connor Couture, 35751882, couturec@uci.edu

Justin Kang, 23736916, hyunkok1@uci.edu 

Here is an overview of how our final project runs, with code samples. We use only the COVID-19 dataset here, because the other datasets are too large to include in the zip. However, the preprocessed data is pickled, so the model will still be able to run. For the full experience of our project, open our website, which we show below.

In [1]:
import nltk 
from nltk import word_tokenize
import simplejson as json
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 

from sklearn import linear_model 
from sklearn import metrics 

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

from __future__ import unicode_literals, print_function, division
import torch
import torch.nn as nn

Here is the dataset we use in this notebook. As noted in the code below, a few rows are missing their labels, so we fill those in by hand. The other datasets can be obtained from https://ieee-dataport.org/open-access/fnid-fake-news-inference-dataset#files

In [2]:
coronafile =  pd.read_csv("./datasets/corona_fake.csv")
#in case unavailable, dataset available at
# raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/corona_fake.csv

#cleaning up broken data according to labels added by
# https://towardsdatascience.com/explore-covid-19-infodemic-2d1ceaae2306
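#use .loc[row, column] so the assignment writes to coronafile itself rather than to a temporary copy
coronafile.loc[5, 'label'] = 'fake'
coronafile.loc[15, 'label'] = 'true'
coronafile.loc[43, 'label'] = 'fake'
coronafile.loc[131, 'label'] = 'true'
coronafile.loc[242, 'label'] = 'fake'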

## Preprocessing Step

Before processing our data, we first had to run a few preprocessing functions. First, we removed any commas between digits (as in "1,000"), so that a number stays a single token instead of being split into less valuable pieces. Next, we wanted to lemmatize our words, so we built a simple tokenizer to do this for us.

In [3]:
lemmatizer = WordNetLemmatizer()
lemmStop = [lemmatizer.lemmatize(t) for t in stopwords.words('english')]
lemmStop += ['could', 'might', 'must', 'need', 'sha', 'wo', 'would']

def getLemmatizedStopwords():
    return lemmStop

def replaceCommas(strToRepl):
    '''function to search for numbers like 100,000 and replace them with 100000
        so that this number will stay combined when we perform tokenization
    '''
    #searches to see if there is [digit,digit] in the text
    a = re.search(r'[0-9],[0-9]', strToRepl) 
    while a is not None: #once there are no more matches, a will be None
        b = a.span()[0] + 1 #index of the comma (the match starts at the digit before it)
        strToRepl = strToRepl[:b] +  strToRepl[b+1:] #take everything but the comma
        a = re.search(r'[0-9],[0-9]', strToRepl)
    return strToRepl

#written based off of code found in 
# https://scikit-learn.org/stable/modules/feature_extraction.html
# under section 6.2.3.10
class LemmaTokenizer:
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        
    def __call__(self, text):
        return [self.lemmatizer.lemmatize(t.lower()) for t in word_tokenize(text) if t.isalnum()]
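
As a minimal sketch (not part of the project code), the comma removal described above can also be written as a single regular-expression substitution. The helper name `strip_number_commas` is hypothetical; the lookbehind and lookahead match only a comma that sits directly between two digits, which should give the same result as the loop in `replaceCommas`.

```python
import re

# Matches a comma only when a digit is directly before it and directly after it.
_DIGIT_COMMA = re.compile(r'(?<=[0-9]),(?=[0-9])')

def strip_number_commas(text: str) -> str:
    """Remove commas inside numbers (hypothetical helper for illustration)."""
    return _DIGIT_COMMA.sub('', text)

print(strip_number_commas("Over 1,000,000 cases, up from 93,200."))
# Over 1000000 cases, up from 93200.
```

The comma after "cases" is untouched because it is not surrounded by digits.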

Below are our preprocessing functions for the COVID-19 dataset. The first function returns only the text and labels (so that we can easily use it for testing), and the second returns the document-term matrix, the labels, and the corresponding fitted CountVectorizer.

In [4]:
def getCoronaText(isTrain = False):
    '''
    Parameters
    ----------
    isTrain : bool, optional
        Boolean to tell the program if we want to look at the training dataset (true)
        or the testing dataset (false). The default is False.

    Returns
    -------
    text : list of str
        A list of the important text (title, text) from the corresponding dataset
    Y : list of int
        A list of integers (0 or 1) describing which class a certain document is from.
        0 = fake article, 1 = true article

    '''
    coronafile_train = coronafile.sample(frac = 1, random_state=1).reset_index(drop = True)

    originalSize = coronafile.shape[0]
    splitSize = int(originalSize * .75) #873 of the 1164 documents go to coronafile_test below, the remaining 291 to coronafile_train

    coronafile_test = coronafile_train.loc[:splitSize-1,:] #.loc slicing includes the end label, so subtract 1
    coronafile_train = coronafile_train.loc[splitSize:,:].reset_index(drop = True) 

    text = []
    Y = []
    i = 0
    nanTitle = 0
    nanText = 0
    cFile = coronafile_test
    breakI = splitSize
    if (isTrain):
        cFile = coronafile_train
        breakI = originalSize - splitSize
    for index, d in cFile.iterrows():
        ftext = d['text']   # keep only the text and label
        ftitle = d['title']
        label = (d['label']).lower()
        
        score = 1 #1 for true, 0 for fake
        if (label == "fake"):
            score = 0
            
        #some documents might not have titles (or possibly text)
        #these are stored as NaN so replace with an empty string
        if (not isinstance(ftext, str) and np.isnan(ftext)):
            ftext = ""
            nanText += 1
        if (not isinstance(ftitle, str) and np.isnan(ftitle)):
            ftitle = ""
            nanTitle += 1
        
        ftext = ftext + ftitle #combining the text and title into one
        ftext = replaceCommas(ftext)
            
        text.append(ftext)
        Y.append(score)
        i += 1
        if (i == breakI):
            #for some reason the for loop doesnt know when to stop so put in a manual break
            break
    return text, Y

def getCoronaVocabulary(isTrain = False):
    '''

    Parameters
    ----------
    isTrain : bool, optional
        Boolean to tell the program if we want to look at the training dataset (true)
        or the testing dataset (false). The default is False.

    Returns
    -------
    X : NxM sparse matrix
        Returns an NxM matrix, where N = number of documents, M = size of vocabulary.
        The matrix is the document-term matrix for our current dataset.
    Y : list of int
        A list of integers (0 or 1) describing which class a certain document is from.
        0 = fake article, 1 = true article
    vectorizer : CountVectorizer
        The BOW for our current dataset.
    '''
    
    text, Y = getCoronaText(isTrain)    
    # create an instance of a CountVectorizer, using 
    # (1) the standard 'english' stopword set from nltk, but lemmatized
    # (2) only keeping terms in the vocabulary that occur in at least 1% of documents
    # (3) allowing both unigrams and bigrams in the vocabulary (use "ngram_range=(1,2)" to do this)
    vectorizerText = CountVectorizer(stop_words = getLemmatizedStopwords(), min_df=.01, ngram_range=(1,2), tokenizer= LemmaTokenizer() )
    # create a sparse BOW array from 'text' using vectorizer  
    X = vectorizerText.fit_transform(text)
    #print('Vocabulary for text: ', vectorizerText.get_feature_names())

    return X, Y, vectorizerText
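
The split arithmetic in `getCoronaText` can be checked in isolation. The sketch below is an illustration only: it shuffles plain indices with Python's `random` module, so it will not reproduce the row order that `DataFrame.sample` gives, and the function name `split_indices` is hypothetical. It mirrors the code above in that the first 75% block of the shuffled data becomes the test split and the remainder becomes the training split.

```python
import random

def split_indices(n_docs: int, first_block_frac: float = 0.75, seed: int = 1):
    """Shuffle document indices and cut them into two blocks.

    Mirrors the notebook: the first block (75%) plays the role of
    coronafile_test, the rest plays the role of coronafile_train.
    """
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)        # stand-in for DataFrame.sample(frac=1)
    cut = int(n_docs * first_block_frac)      # same arithmetic as splitSize
    test_idx, train_idx = order[:cut], order[cut:]
    return test_idx, train_idx

test_idx, train_idx = split_indices(1164)
print(len(test_idx), len(train_idx))  # 873 291
```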

In [5]:
X, Y, vectorizer = getCoronaVocabulary()
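# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, call get_feature_names() instead
print('Vocabulary for COVID-19 dataset: ', vectorizer.get_feature_names_out())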
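
To make the `min_df` and `ngram_range=(1,2)` settings concrete, here is a simplified pure-Python stand-in for what the vectorizer does. It is a sketch under stated assumptions, not scikit-learn's implementation: it takes already-tokenized documents, skips stop-word removal and lemmatization, and all function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """Yield unigrams and bigrams (joined with a space) from a token list."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_vocabulary(docs, min_df=0.01):
    """Keep terms that appear in at least a min_df fraction of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens)))   # count each term once per document
    threshold = min_df * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if df >= threshold)
    return {term: col for col, term in enumerate(kept)}

def count_vector(tokens, vocab):
    """Count in-vocabulary terms; terms outside the vocabulary are ignored."""
    row = [0] * len(vocab)
    for term in ngrams(tokens):
        if term in vocab:
            row[vocab[term]] += 1
    return row

train_docs = [
    "wash hand often".split(),
    "wash hand with soap".split(),
    "vaccine trial start".split(),
]
vocab = build_vocabulary(train_docs, min_df=0.5)   # term must occur in at least 2 of 3 docs
print(vocab)                                       # {'hand': 0, 'wash': 1, 'wash hand': 2}
print(count_vector("please wash hand and wash hand".split(), vocab))  # [2, 2, 2]
```

The last call shows why new text is only transformed, never refit: words the training vocabulary has not seen ("please", "and") simply contribute nothing to the vector.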



## Model

Next is our model. For the COVID-19 dataset, we use a simple feed-forward neural network with one hidden layer and a single output neuron.

In [6]:
class SimpleNeuralNet(nn.Module):
    # Simple Feed Forward Neural Network with One Hidden Layer that Outputs One Neuron (Binary Classification, can't handle more than 2 classes)
    
    def __init__(self, input_size, hidden_size):
        super(SimpleNeuralNet, self).__init__()
        #Written based off of the tutorial at
        #https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/01-basics/feedforward_neural_network/main.py#L37-L49
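        self.hidden1 = nn.Linear(input_size, hidden_size) #input layer -> hidden layer
        self.relu = nn.ReLU()   #kept from the tutorial but unused: forward() applies tanh instead
        self.oupt = nn.Linear(hidden_size, 1)  #hidden layer -> single output neuron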

    def forward(self, x):
        out = torch.tanh(self.hidden1(x))
        out = torch.sigmoid(self.oupt(out))
        return out
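
The forward pass of `SimpleNeuralNet` is a tanh hidden layer followed by a sigmoid on one output value. As an illustration only, the same computation can be written without PyTorch; the weights below are made-up toy values and the function name `forward` is a hypothetical stand-in, not the trained model.

```python
import math

def forward(x, W1, b1, w2, b2):
    """tanh hidden layer, then a sigmoid on a single output neuron."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))   # value strictly between 0 and 1

# Toy example: 3 input features (term counts), 2 hidden units.
x = [1.0, 0.0, 2.0]
W1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
b1 = [0.0, 0.1]
w2 = [1.0, -1.0]
b2 = 0.0
p = forward(x, W1, b1, w2, b2)
print(p)
```

Because of the final sigmoid, the output can be read as the probability that the article is true, which is what `nn.BCELoss` expects during training.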

In [7]:
def trainAndTestSimpleModel(num_epochs = 5, learning_rate = 0.001, print_epoch_mod = 5):
    '''
    Trains and tests a model that uses SimpleNeuralNet, computing both training and test accuracy.

    We used this article for help in writing the tensor parts of the code so that it works with the model:
    https://medium.com/analytics-vidhya/part-1-sentiment-analysis-in-pytorch-82b35edb40b8
    '''
    torch.manual_seed(1)
    X,Y = getCoronaText() #this function will give us the text array (not document term matrix) and Y
    X_train,Y_train, vectorizer_train = getCoronaVocabulary(True)
    
    #transform our testing dataset to match the vocabulary for the training dataset
    #transform will return the document-term matrix for X based on training dataset
    x_test = vectorizer_train.transform(X)
    
    vocabsize = X_train.shape[1]
    
    
    #transform our training and test data into tensors for the classifier to learn off of
    X_tensor = torch.from_numpy(X_train.todense()).float()
    Y_tensor = torch.from_numpy(np.array(Y_train)).float()
    
    X_test_tensor = torch.from_numpy(x_test.todense()).float()
    Y_test_tensor = torch.from_numpy(np.array(Y))
    
    device = torch.device('cpu')
    #use TensorDataset to be able to use our DataLoader
    train_data = torch.utils.data.TensorDataset(X_tensor, Y_tensor)
    train_loader = torch.utils.data.DataLoader(train_data,batch_size=16, shuffle=False)
    train_loader_batch_size_1 = torch.utils.data.DataLoader(train_data,batch_size=1, shuffle=False)
    
    test_data = torch.utils.data.TensorDataset(X_test_tensor, Y_test_tensor)
    test_loader = torch.utils.data.DataLoader(test_data,batch_size=1, shuffle=False)
    
    #initialize our model
    model = SimpleNeuralNet(vocabsize, 200).to(device)
    loss_fn = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)
    
    
    total_step = len(train_loader)
    for epoch in range(num_epochs):
        for i, (x_batch, labels) in enumerate(train_loader):
    
            # Forward pass
            # The forward process computes the loss of each iteration on each sample
            model.train()
            y_pred = model(x_batch)
            loss = loss_fn(y_pred, labels.reshape(-1, 1))
    
            # Backward pass, using the optimizer to update the parameters
            optimizer.zero_grad()
            loss.backward()    #compute gradients
            optimizer.step()   #apply the Adam parameter update
    
     
            # Below, an epoch corresponds to one pass through all of the samples.
            # Each training step corresponds to a parameter update using 
            # a gradient computed on a minibatch of 16 samples (the batch_size of train_loader)
            if (i + 1) % print_epoch_mod == 0: 
                #leaving it on 5 for corona dataset, probably want to change to % 50 or % 100
                # for the other datasets so don't get spammed 
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                      .format(epoch + 1, num_epochs, i + 1, total_step, loss.item()))
    
    # Test the model
    # In the test phase, we don't need to compute gradients (the model has already been learned)
    with torch.no_grad():
        correct = 0
        total = 0
        for inputs, label in test_loader:
            output = model(inputs)
            total += 1
            if label >= 0.5 and output >= 0.5:
                correct += 1
            elif label < 0.5 and output < 0.5:
                correct += 1
            
        print('Test accuracy of the network: {} %'.format(100 * correct / total))
        test_accuracy = 100 * correct / total
        
    # Print out training accuracy
    with torch.no_grad():
        correct = 0
        total = 0
        for inputs, label in train_loader_batch_size_1:
            output = model(inputs)
            total += 1
            if label >= 0.5 and output >= 0.5:
                correct += 1
            elif label < 0.5 and output < 0.5:
                correct += 1
                
        print('Train accuracy of the network: {} %'.format(100 * correct / total))
        train_accuracy = 100 * correct / total
    
    return test_accuracy, train_accuracy, model, vectorizer_train
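
The training loop minimizes binary cross-entropy through `nn.BCELoss`, which by default averages the per-sample loss over the batch. The sketch below computes the same quantity by hand on made-up probabilities; the function name `bce_loss` is hypothetical, and unlike PyTorch it does not clamp the logarithm, so probabilities must lie strictly between 0 and 1.

```python
import math

def bce_loss(probs, labels):
    """Mean binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)] averaged over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Confident, correct predictions give a small loss ...
print(bce_loss([0.9, 0.1], [1, 0]))
# ... while hesitant predictions for the same labels give a larger one.
print(bce_loss([0.6, 0.4], [1, 0]))
```

This is why the logged loss keeps shrinking over the epochs: the network's outputs on the training batches move closer and closer to 0 or 1.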

In [8]:
test_accuracy, train_accuracy, model, vectorizer_train = trainAndTestSimpleModel(num_epochs=50)

Epoch [1/50], Step [5/19], Loss: 0.3543
Epoch [1/50], Step [10/19], Loss: 0.5111
Epoch [1/50], Step [15/19], Loss: 0.2169
Epoch [2/50], Step [5/19], Loss: 0.0395
Epoch [2/50], Step [10/19], Loss: 0.1832
Epoch [2/50], Step [15/19], Loss: 0.0935
Epoch [3/50], Step [5/19], Loss: 0.0094
Epoch [3/50], Step [10/19], Loss: 0.0684
Epoch [3/50], Step [15/19], Loss: 0.0311
Epoch [4/50], Step [5/19], Loss: 0.0038
Epoch [4/50], Step [10/19], Loss: 0.0362
Epoch [4/50], Step [15/19], Loss: 0.0157
Epoch [5/50], Step [5/19], Loss: 0.0021
Epoch [5/50], Step [10/19], Loss: 0.0217
Epoch [5/50], Step [15/19], Loss: 0.0092
Epoch [6/50], Step [5/19], Loss: 0.0013
Epoch [6/50], Step [10/19], Loss: 0.0140
Epoch [6/50], Step [15/19], Loss: 0.0059
Epoch [7/50], Step [5/19], Loss: 0.0009
Epoch [7/50], Step [10/19], Loss: 0.0097
Epoch [7/50], Step [15/19], Loss: 0.0041
Epoch [8/50], Step [5/19], Loss: 0.0007
Epoch [8/50], Step [10/19], Loss: 0.0071
Epoch [8/50], Step [15/19], Loss: 0.0030
Epoch [9/50], Step [5/19
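
The "19" in the step counter of the log follows from the split size and the batch size. A quick check of that arithmetic, assuming the 1164-document dataset size stated in the code comments and the `DataLoader` default of keeping the final partial batch:

```python
import math

n_docs = 1164
n_train = n_docs - int(n_docs * 0.75)   # rows left for the training split
batch_size = 16                          # batch_size of train_loader

steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, steps_per_epoch)  # 291 19
```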

## Testing

Now that the model is trained (training should take 30 seconds to 1 minute), we can try running some sample texts through it.

In [9]:
def getTermMatrixTestData(textToTransform: str, vectorizer):
    noComText = replaceCommas(textToTransform)
    return vectorizer.transform([noComText])

def predict_model(model, vec, raw_text):
    text = getTermMatrixTestData(raw_text, vec).todense()
    X_test_tensor = torch.from_numpy(text).float()
    output_prob = float(model(X_test_tensor).detach().numpy()[0][0])

    print("Computed probability:",float(output_prob))

    return output_prob

In [10]:
print("Actual label: Fake")
print("Model returned:", predict_model(model, vectorizer_train, '''And what's next, everyone will swallow and sit in silence, only Russia and China will press, and the rest will be swallowed.''') > 0.5 )

Actual label: Fake
Computed probability: 0.09304328262805939
Model returned: False


In [11]:
print("Actual label: Fake")
print("Model returned:", predict_model(model, vectorizer_train, '''You just need to add water, and the drugs and vaccines are ready to be administered. There are two parts to the kit: one holds pellets containing the chemical machinery that synthesises the end product, and the other holds pellets containing instructions that telll the drug which compound to create. Mix two parts together in a chosen combination, add water, and the treatment is ready.''') > 0.5)

Actual label: Fake
Computed probability: 0.34015128016471863
Model returned: False


In [12]:
print("Actual label: True")
print("Model returned:", predict_model(model, vectorizer_train, '''No. Vaccines against pneumonia, such as pneumococcal vaccine and Haemophilus influenza type B (Hib) vaccine, do not provide protection against the new coronavirus. The virus is so new and different that it needs its own vaccine. Researchers are trying to develop a vaccine against 2019-nCoV, and WHO is supporting their efforts. Although these vaccines are not effective against 2019-nCoV, vaccination against respiratory illnesses is highly recommended to protect your health.''') > 0.5)

Actual label: True
Computed probability: 0.42558759450912476
Model returned: False


In [13]:
print("Actual label: True")
print("Model returned:", predict_model(model, vectorizer_train, '''Washing your hands decreases the number of microbes on your hands and helps prevent the spread of infectious diseases. Remember – coronavirus spreads easily by droplets from breathing, coughing and sneezing. As our hands touch many surfaces, they can pick up microbes, including viruses. Then by touching contaminated hands to your eyes, nose or mouth, the pathogens can infect the body. As a microbiologist, I think a lot about the differences between microbes, such as bacteria and viruses, and how they interact with animal hosts to drive health or disease. I was shocked to read a study that indicated that 93.2% of 2,800 survey respondents did not wash their hands after coughing or sneezing. Let me explain how washing your hands decreases the number of microbes on your hands and helps prevent the spread of infectious diseases.''') > 0.5)

Actual label: True
Computed probability: 0.9999998807907104
Model returned: True


Any text with a computed probability below .5 is classified as fake news, and anything above .5 as true news. As seen above, the model classifies some inputs well but struggles with others. We found that fake articles written in a scientific manner were harder to classify as fake, while true articles that discussed the same topics as some fake articles were harder to classify as true.
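
As a small worked example, the thresholding rule can be applied to the four probabilities printed in the cells above. The helper name `to_label` is hypothetical; the strict `>` comparison matches the `> 0.5` used in the demo cells.

```python
# (actual label, computed probability) pairs copied from the outputs above
samples = [
    ("fake", 0.09304328262805939),
    ("fake", 0.34015128016471863),
    ("true", 0.42558759450912476),
    ("true", 0.9999998807907104),
]

def to_label(prob, threshold=0.5):
    """Map a probability to a class name using the demo's threshold."""
    return "true" if prob > threshold else "fake"

correct = sum(to_label(p) == actual for actual, p in samples)
print(f"{correct} of {len(samples)} correct")  # 3 of 4 correct
```

The one miss is the WHO-style vaccine text, a true article whose probability of about 0.43 falls just under the threshold.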