<h2>CNN for Fake News Detection</h2>

This file contains the implementation of a CNN, of the kind commonly used for sentiment analysis, that classifies political statements into six truthfulness labels. Some of the code is adapted from https://cezannec.github.io/CNN_Text_Classification/. The steps taken to build this model are:

<ol>
    <li> Data Preprocessing
    <li> Tokenizing Political Statements
    <li> Train/Validation/Test Splitting (already set in the data folder)
    <li> Defining a CNN for Sentiment Analysis
    <li> Training and Evaluating the Model 

</ol>



<h3>Data Preprocessing</h3>

The following code comes from the preprocessing.ipynb file.

In [14]:
import pandas as pd
import numpy as np
import nltk
import re
from tqdm import tqdm
import os

numeric_labels = {'pants-fire':0, 'false':1, 'barely-true':2, 'half-true':3, 'mostly-true':4, 'true':5}
path = os.getcwd() + '/data'
headers = ['id', 'label', 'statement', 'subject', 'speaker', 'job_title', 'state_info', 'affiliation', 'barely_true',
           'false', 'half_true', 'mostly_true', 'pants-fire', 'context']
train = pd.read_csv(path + '/train.tsv', sep='\t', header=None, names=headers)
valid = pd.read_csv(path + '/valid.tsv', sep='\t', header=None, names=headers)
test = pd.read_csv(path + '/test.tsv', sep='\t', header=None, names=headers)

# lowercase and remove punctuation (digits are kept, because \w matches them)
def clean_text(text):
    # non-string values such as NaN are returned unchanged and dropped later
    if not isinstance(text, str):
        return text
    cleaned = text.lower()
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    return cleaned

def clean_labels(label):
    return numeric_labels[label]

# cols indicates which columns we want to KEEP, the rest are dropped
def drop_columns(df, cols):
    new_df = df[cols]
    new_df = new_df.dropna() # rows with incomplete information (NaN's) are dropped
    return new_df

For this model, we will only be using the statement text and the corresponding labels.

In [2]:
train['statement'] = train['statement'].apply(clean_text)
train['label'] = train['label'].apply(clean_labels)
train = drop_columns(train, ['label', 'statement'])
train['statement'].to_csv('text_only.csv', index=False, header=['text'])
train.head()

   label                                          statement
0      1  says the annies list political group supports ...
1      3  when did the decline of coal start it started ...
2      4  hillary clinton agrees with john mccain by vot...
3      1  health care reform legislation is likely to ma...
4      3  the economic turnaround started at the end of ...


In [3]:
statements_train = train['statement']
labels_train = np.array(train['label'])

statements_valid = valid['statement']
labels_valid = np.array(valid['label'])

statements_test = test['statement']
labels_test = np.array(test['label'])

Now, the data is processed and ready to use!
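
The cells above run `clean_text`, `clean_labels` and `drop_columns` on the training frame only; `valid` and `test` are read from disk and used as loaded. A minimal sketch of passing every split through the same steps is given here, assuming all three frames share the `label` and `statement` columns. The helper name `preprocess_split` and the toy frame are hypothetical and only stand in for the real TSV files.

```python
import re
import pandas as pd

numeric_labels = {'pants-fire': 0, 'false': 1, 'barely-true': 2,
                  'half-true': 3, 'mostly-true': 4, 'true': 5}

def clean_text(text):
    # lowercase and strip punctuation; non-string values (e.g. NaN) pass through
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text.lower())

def preprocess_split(df):
    # hypothetical helper: keep label + statement, drop incomplete rows,
    # then clean the text and map the string labels to integers
    out = df[['label', 'statement']].dropna().copy()
    out['statement'] = out['statement'].apply(clean_text)
    out['label'] = out['label'].map(numeric_labels).astype('int64')
    # a fresh 0..n-1 index keeps positional lookups such as out['statement'][0] safe
    return out.reset_index(drop=True)

# toy frame standing in for valid.tsv (made-up statements and labels)
toy_valid = pd.DataFrame({
    'label': ['barely-true', 'true', 'false'],
    'statement': ['Taxes went UP, again!', None, 'Jobs: more than ever.'],
    'speaker': ['a', 'b', 'c'],
})

clean_valid = preprocess_split(toy_valid)
print(clean_valid)
```

With a helper like this, the same call could be made for `train`, `valid` and `test`, so that all three label arrays are integers and all statements are cleaned the same way.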

<h3>Tokenizing Political Statements</h3>

Next, we will tokenize the political statements using a pretrained embedding model. Specifically, we will be using the slim version of the Google News word2vec model (https://github.com/eyaler/word2vec-slim/tree/master). According to the GitHub README, "the model was trained over a 3 billion word corpus, and contains 3 million words (of which ~930k are NOT phrases, i.e. do not contain underscores)." Using this model makes tokenizing the statements much easier, because we do not need to build the token dictionary by hand: each word is replaced by its index in the model's vocabulary. Some words are not listed in the pretrained embedding model; the tokenizing function below maps those words to index 0.

In [7]:
from gensim.models import KeyedVectors

# creating the pretrained embedding model
embed = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300-SLIM.bin', binary=True)

In [8]:
# find all words that are missing from the pretrained vocabulary
def unlisted_words(embed, statements):
    missing = []
    for statement in statements:
        for word in statement.split():
            # key_to_index maps each known word to its row in the embedding matrix
            if word not in embed.key_to_index:
                missing.append(word)
    return missing

# create a token dictionary for unlisted words, numbering them from `num` upward
def token_dict(unlisted, num):
    return {word: num + i for i, word in enumerate(unlisted)}

# convert political statements to lists of vocabulary indices
def tokenize_all_statements(embed, statements, t_dict):
    tokenized_statements = []
    for statement in statements:
        ints = []
        # split each statement into a list of words
        for word in statement.split():
            idx = embed.key_to_index.get(word)

            # words missing from the pretrained vocabulary all map to 0;
            # t_dict is currently unused (the lookup idx = t_dict[word] is disabled)
            if idx is None:
                idx = 0
            ints.append(idx)
        tokenized_statements.append(ints)

    return tokenized_statements

In [9]:
# find unlisted words and build token dictionary
#unlisted = []

#train_unlisted = unlisted_words(embed, statements_train)
#valid_unlisted = unlisted_words(embed, statements_valid)
#test_unlisted = unlisted_words(embed, statements_test)

#unlisted.extend(x for x in train_unlisted if x not in unlisted)
#unlisted.extend(x for x in valid_unlisted if x not in unlisted)
#unlisted.extend(x for x in test_unlisted if x not in unlisted)

num_tokens = len(embed.key_to_index)

t_dict = token_dict({}, num_tokens)

# tokenize the statements
tokenized_train = tokenize_all_statements(embed, statements_train, t_dict)
tokenized_valid = tokenize_all_statements(embed, statements_valid, t_dict)
tokenized_test = tokenize_all_statements(embed, statements_test, t_dict)

# check if the tokenizing works
print(statements_train[0])
print(tokenized_train[0])

print(statements_valid[0])
print(tokenized_valid[0])

print(statements_test[0])
print(tokenized_test[0])

says the annies list political group supports thirdtrimester abortions on demand
[109, 9, 0, 680, 424, 215, 2876, 0, 11132, 4, 656]
We have less Americans working now than in the 70s.
[57, 19, 350, 938, 322, 92, 55, 0, 9, 0]
Building a wall on the U.S.-Mexico border will take literally years.
[3720, 0, 2270, 4, 9, 0, 1473, 21, 135, 5220, 0]
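
In the output above, words missing from the pretrained vocabulary (`annies`, `thirdtrimester`) become 0, which is also the value used for padding later on and the index of a real entry in `embed.key_to_index`. One possible way to keep these apart is to shift every pretrained index up by two and reserve 0 for padding and 1 for unknown words. The sketch below uses a toy dictionary in place of `embed.key_to_index`; `PAD_IDX`, `UNK_IDX`, `OFFSET` and `tokenize` are hypothetical names, and an embedding matrix used with these ids would need two extra rows at the front.

```python
PAD_IDX = 0  # hypothetical: reserved for padding
UNK_IDX = 1  # hypothetical: reserved for out-of-vocabulary words
OFFSET = 2   # pretrained indices are shifted past the two reserved ids

def tokenize(statement, key_to_index):
    # map each whitespace-separated word to its shifted vocabulary index
    ids = []
    for word in statement.split():
        idx = key_to_index.get(word)
        ids.append(UNK_IDX if idx is None else idx + OFFSET)
    return ids

# toy vocabulary standing in for embed.key_to_index
toy_vocab = {'in': 0, 'the': 1, 'list': 2, 'says': 3}

tokens = tokenize('says the annies list', toy_vocab)
print(tokens)  # [5, 3, 1, 4]
```

Here the unknown word `annies` gets id 1, while the real word stored at index 0 (`in` in the toy vocabulary) gets id 2, so neither collides with padding.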


Now, we need to pad the tokenized statements list to make all the statements the same length. The final array should be 2D, with as many rows as statements and as many columns as the longest statement.

In [10]:
# left-pad (or truncate) the tokenized statements into a 2D array
def pad_features(tokenized_statements, max_length):

    # rows = number of statements, cols = max_length
    features = np.zeros((len(tokenized_statements), max_length), dtype=int)

    for i, row in enumerate(tokenized_statements):
        row = row[:max_length]
        # an empty statement stays all zeros (features[i, -0:] would select the whole row)
        if row:
            features[i, -len(row):] = row

    return features

In [11]:
# longest statement (in words) in each split
max_len_train = max(len(x.split()) for x in statements_train)
max_len_valid = max(len(x.split()) for x in statements_valid)
max_len_test = max(len(x.split()) for x in statements_test)

max_len = max(max_len_train, max_len_test, max_len_valid)

# pad_features already returns NumPy arrays
features_train = pad_features(tokenized_train, max_len)
features_valid = pad_features(tokenized_valid, max_len)
features_test = pad_features(tokenized_test, max_len)

# test statements to make sure dimensions are set
assert len(features_train)==len(tokenized_train), "Features should have as many rows as statements."
assert len(features_train[0])==max_len, "Each feature row should contain max_length values."
assert len(features_valid)==len(tokenized_valid), "Features should have as many rows as statements."
assert len(features_valid[0])==max_len, "Each feature row should contain max_length values."
assert len(features_test)==len(tokenized_test), "Features should have as many rows as statements."
assert len(features_test[0])==max_len, "Each feature row should contain max_length values."
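
To make the padding step concrete, here is a small pure-Python sketch of the same left-padding idea: shorter rows are filled with zeros at the front, and longer rows are cut to `max_length`. `left_pad` is a hypothetical stand-in for `pad_features` that works on plain lists instead of NumPy arrays.

```python
def left_pad(tokenized_statements, max_length, pad_value=0):
    padded = []
    for row in tokenized_statements:
        row = row[:max_length]                           # truncate rows that are too long
        padding = [pad_value] * (max_length - len(row))  # zeros needed to reach max_length
        padded.append(padding + row)                     # padding goes in front of the tokens
    return padded

rows = [[109, 9, 680], [57], [], [1, 2, 3, 4, 5]]
features = left_pad(rows, max_length=4)
for feature_row in features:
    print(feature_row)
# [0, 109, 9, 680]
# [0, 0, 0, 57]
# [0, 0, 0, 0]
# [1, 2, 3, 4]
```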


Now all three splits (train, valid, and test) are tokenized and stored as 2D arrays with the same number of columns.

<h3>Train/Validation/Test Splitting (already set in the data folder)</h3>

The data folder already contains separate train, validation, and test files, and the 2D feature arrays for all three sets have been built above. Next, we wrap them in data loaders for shuffling and batching.

<h3>Defining a CNN for Sentiment Analysis</h3>

CheckNewsCNN model implementation:

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CheckNewsCNN(nn.Module):
    def __init__(self, embed_model, embedding_dim, max_features, num_filters=36):
        super(CheckNewsCNN, self).__init__()
        filter_sizes = [1, 2, 3, 5]  # kernel heights, in words
        n_classes = 6                # one output per truthfulness label
        # embedding layer initialized from the pretrained word2vec vectors and frozen
        self.embedding = nn.Embedding(max_features, embedding_dim)
        self.embedding.weight = nn.Parameter(torch.tensor(embed_model.vectors, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        # one convolution per kernel height, each spanning the full embedding width
        self.convs1 = nn.ModuleList([nn.Conv2d(1, num_filters, (K, embedding_dim)) for K in filter_sizes])
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(len(filter_sizes)*num_filters, n_classes)

    def forward(self, x):
        x = self.embedding(x)                                     # (batch, seq_len, embedding_dim)
        x = x.unsqueeze(1)                                        # (batch, 1, seq_len, embedding_dim)
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # each (batch, num_filters, seq_len - K + 1)
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]    # each (batch, num_filters)
        x = torch.cat(x, 1)                                       # (batch, len(filter_sizes) * num_filters)
        x = self.dropout(x)
        logit = self.fc1(x)                                       # (batch, n_classes)
        return logit

CheckNewsCNN setup:

In [18]:
# store pretrained vocab
pretrained_words = list(embed.key_to_index)

vocab_size = len(pretrained_words)

embedding_dim = len(embed[pretrained_words[0]])
num_filters = 36  # filters per kernel height

model = CheckNewsCNN(embed, embedding_dim, vocab_size, num_filters=num_filters)

print(model)

CheckNewsCNN(
  (embedding): Embedding(299567, 300)
  (convs1): ModuleList(
    (0): Conv2d(1, 36, kernel_size=(1, 300), stride=(1, 1))
    (1): Conv2d(1, 36, kernel_size=(2, 300), stride=(1, 1))
    (2): Conv2d(1, 36, kernel_size=(3, 300), stride=(1, 1))
    (3): Conv2d(1, 36, kernel_size=(5, 300), stride=(1, 1))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=144, out_features=6, bias=True)
)
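
The printed layer sizes can be checked by hand. A convolution with kernel height K slides over a statement of L tokens and yields L - K + 1 positions per filter; max-pooling over time then keeps a single value per filter, so the fully connected layer sees `len(filter_sizes) * num_filters` inputs. The sketch below repeats that arithmetic. The sequence length of 60 is an arbitrary assumption for illustration, not the dataset's actual `max_len`.

```python
filter_sizes = [1, 2, 3, 5]
num_filters = 36
seq_len = 60  # assumed padded statement length, for illustration only

def conv_positions(seq_len, kernel_height):
    # output length of a convolution with stride 1 and no padding
    return seq_len - kernel_height + 1

for k in filter_sizes:
    positions = conv_positions(seq_len, k)
    print(f"kernel {k}: conv output {num_filters} x {positions} -> pooled {num_filters}")

# the pooled outputs of all kernel heights are concatenated before fc1
fc_in_features = len(filter_sizes) * num_filters
print(fc_in_features)  # 144
```

The result, 144, matches `in_features=144` in the printout of `fc1`.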


Training:

In [24]:
from torch.utils.data import TensorDataset, DataLoader

# labels must be integer class ids: string labels produce a numpy.object_ array,
# which torch.from_numpy rejects with a TypeError
def label_tensor(labels):
    # map any remaining string labels to their numeric codes; integers pass through
    return torch.tensor([numeric_labels.get(l, l) for l in labels], dtype=torch.long)

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(features_train), label_tensor(labels_train))
valid_data = TensorDataset(torch.from_numpy(features_valid), label_tensor(labels_valid))
test_data = TensorDataset(torch.from_numpy(features_test), label_tensor(labels_test))

# dataloaders
batch_size = 50

# shuffling and batching data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [21]:
def init_weights(module):
    # the model uses nn.Conv2d layers, so those are the ones to initialize
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(module.weight)

model.apply(init_weights)

CheckNewsCNN(
  (embedding): Embedding(299567, 300)
  (convs1): ModuleList(
    (0): Conv2d(1, 36, kernel_size=(1, 300), stride=(1, 1))
    (1): Conv2d(1, 36, kernel_size=(2, 300), stride=(1, 1))
    (2): Conv2d(1, 36, kernel_size=(3, 300), stride=(1, 1))
    (3): Conv2d(1, 36, kernel_size=(5, 300), stride=(1, 1))
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (fc1): Linear(in_features=144, out_features=6, bias=True)
)

In [23]:
# loss and optimization functions
lr=0.001

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def train(net, train_loader, epochs, print_every=100):
    # note: uses the global criterion, optimizer, and valid_loader defined above

    counter = 0 # for printing

    # train for some number of epochs
    net.train()
    for e in range(epochs):

        # batch loop
        for inputs, labels in train_loader:
            counter += 1

            # zero accumulated gradients
            net.zero_grad()

            # get the output from the model: logits of shape (batch, n_classes)
            output = net(inputs)

            # calculate the loss and perform backprop
            # (no squeeze: it would drop the batch dimension for a batch of size 1)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            # loss stats
            if counter % print_every == 0:
                # Get validation loss with dropout disabled and no gradient tracking
                val_losses = []
                net.eval()
                with torch.no_grad():
                    for val_inputs, val_labels in valid_loader:
                        val_output = net(val_inputs)
                        val_loss = criterion(val_output, val_labels)
                        val_losses.append(val_loss.item())

                net.train()
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.6f}...".format(loss.item()),
                      "Val Loss: {:.6f}".format(np.mean(val_losses)))

In [100]:
epochs = 2
print_every = 100

train(model, train_loader, epochs, print_every=print_every)

Epoch: 1/2... Step: 100... Loss: 1.532752... Val Loss: 1.762318
Epoch: 1/2... Step: 200... Loss: 1.213054... Val Loss: 1.747848
Epoch: 2/2... Step: 300... Loss: 1.186944... Val Loss: 1.750537
Epoch: 2/2... Step: 400... Loss: 1.079424... Val Loss: 1.770020


Testing:

In [101]:
def weighted_ordinal_accuracy(y_true, y_pred, weight=0.5):
    '''
    Weight determines how 'correct' an adjacent prediction is:
    an exact match scores 1, a prediction off by one class scores `weight`,
    and anything further off scores 0.
    To include predictions that were off by 2, extend the scores definition.
    '''
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    abs_diff = np.abs(y_true - y_pred)
    scores = (abs_diff == 0) + weight * (abs_diff == 1)
    accuracy = scores.mean()

    return accuracy

# Get test data loss and accuracy
ground_truth = []
predictions = []
test_losses = [] # track loss
num_correct = 0

model.eval()
# iterate over test data without tracking gradients
with torch.no_grad():
    for inputs, labels in test_loader:

        # get predicted outputs (logits)
        output = model(inputs)

        # calculate loss
        test_loss = criterion(output, labels)
        test_losses.append(test_loss.item())

        # convert logits to predicted class
        pred = output.argmax(dim=1)

        # count correct predictions in this batch
        num_correct += pred.eq(labels).sum().item()

        # collect labels and predictions as flat lists of ints
        ground_truth.extend(labels.tolist())
        predictions.extend(pred.tolist())

# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

# weighted ordinal accuracy
ordinal_acc = weighted_ordinal_accuracy(ground_truth, predictions, 0.5)

Test loss: 1.772
Test accuracy: 0.261
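
As a worked example of the weighted ordinal accuracy described in the docstring above, the sketch below scores an exact match as 1, a prediction one class away as `weight`, and anything further off as 0, then averages the scores. This is one reading of the docstring, written in plain Python under a hypothetical function name; the labels are made up.

```python
def ordinal_accuracy_sketch(y_true, y_pred, weight=0.5):
    scores = []
    for t, p in zip(y_true, y_pred):
        diff = abs(t - p)
        if diff == 0:
            scores.append(1.0)     # exact match
        elif diff == 1:
            scores.append(weight)  # adjacent class: partial credit
        else:
            scores.append(0.0)     # off by two or more: no credit
    return sum(scores) / len(scores)

y_true = [0, 1, 2, 3, 4, 5]
y_pred = [0, 2, 2, 5, 3, 0]

acc = ordinal_accuracy_sketch(y_true, y_pred)
print(acc)  # 0.5
```

Plain accuracy on the same toy data would be 2/6; the two adjacent misses add half a point each, which brings the score to 3/6.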


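The test accuracy is fairly low (0.261), although this is a six-label problem. The training log also shows the batch loss falling from 1.53 to 1.08 while the validation loss stays between 1.75 and 1.77, which suggests the model is starting to overfit rather than generalize. Maybe I need to add more layers or train for more or fewer epochs.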
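
For context on those numbers, two reference points can be computed directly: a model that guesses uniformly among six classes is right 1/6 of the time and has a cross-entropy of ln 6. The sketch below works this out. It assumes balanced classes, which the real label distribution may not satisfy.

```python
import math

n_classes = 6
chance_accuracy = 1 / n_classes
uniform_loss = -math.log(1 / n_classes)  # cross-entropy of a uniform prediction

reported_accuracy = 0.261  # test accuracy from the output above
reported_loss = 1.772      # test loss from the output above

print(f"chance accuracy: {chance_accuracy:.3f}")  # 0.167
print(f"uniform loss:    {uniform_loss:.3f}")     # 1.792
print(f"accuracy above chance: {reported_accuracy - chance_accuracy:.3f}")
print(f"loss below uniform:    {uniform_loss - reported_loss:.3f}")
```

Under that assumption the model is better than guessing on accuracy, while its test loss is only slightly below the uniform baseline.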