<h2>CNN for Fake News Detection</h2>

This file contains the implementation of a CNN that performs sentiment-analysis-style text classification on political statements, assigning each statement one of six truthfulness labels. I used some of the code from https://cezannec.github.io/CNN_Text_Classification/. The steps taken to build this model are:

<ol>
    <li>Data Preprocessing</li>
    <li>Tokenizing Political Statements</li>
    <li>Train/Validation/Test Splitting (already set in the data folder)</li>
    <li>Defining a CNN for Sentiment Analysis</li>
    <li>Training and Evaluating the Model</li>

</ol>



<h3>Data Preprocessing</h3>

The following code comes from the preprocessing.ipynb file.

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
from tqdm import tqdm
import os

numeric_labels = {'pants-fire':0, 'false':1, 'barely-true':2, 'half-true':3, 'mostly-true':4, 'true':5}
path = os.getcwd() + '/data'
headers = ['id', 'label', 'statement', 'subject', 'speaker', 'job_title', 'state_info', 'affiliation', 'barely_true',
           'false', 'half_true', 'mostly_true', 'pants-fire', 'context']
train = pd.read_csv(path + '/train.tsv', sep='\t', header=None, names=headers)
valid = pd.read_csv(path + '/valid.tsv', sep='\t', header=None, names=headers)
test = pd.read_csv(path + '/test.tsv', sep='\t', header=None, names=headers)

# lowercase and remove punctuation (digits are kept, since \w matches them)
def clean_text(text):
    # non-string values (e.g. NaN) are returned unchanged and dropped later
    if not isinstance(text, str):
        return text
    cleaned = text.lower()
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    return cleaned

def clean_labels(label):
    return numeric_labels[label]

# cols indicates which columns we want to KEEP, the rest are dropped
def drop_columns(df, cols):
    new_df = df[cols]
    new_df = new_df.dropna() # rows with incomplete information (NaN's) are dropped
    return new_df

For this model, we will only be using the statement text and the corresponding labels.

In [2]:
train['statement'] = train['statement'].apply(clean_text)
train['label'] = train['label'].apply(clean_labels)
train = drop_columns(train, ['label', 'statement'])
train['statement'].to_csv('text_only.csv', index=False, header=['text'])
train.head()

Unnamed: 0,label,statement
0,1,says the annies list political group supports ...
1,3,when did the decline of coal start it started ...
2,4,hillary clinton agrees with john mccain by vot...
3,1,health care reform legislation is likely to ma...
4,3,the economic turnaround started at the end of ...


In [3]:
valid['statement'] = valid['statement'].apply(clean_text)
valid['label'] = valid['label'].apply(clean_labels)
valid = drop_columns(valid, ['label', 'statement'])
# NOTE: this writes to the same text_only.csv as the training cell, overwriting it
valid['statement'].to_csv('text_only.csv', index=False, header=['text'])
valid.head()

Unnamed: 0,label,statement
0,2,we have less americans working now than in the...
1,0,when obama was sworn into office he did not us...
2,1,says having organizations parading as being so...
3,3,says nearly half of oregons children are poor
4,3,on attacks by republicans that various program...


In [4]:
test['statement'] = test['statement'].apply(clean_text)
test['label'] = test['label'].apply(clean_labels)
test = drop_columns(test, ['label', 'statement'])
# NOTE: this writes to the same text_only.csv again, so only the test statements remain in it
test['statement'].to_csv('text_only.csv', index=False, header=['text'])
test.head()

Unnamed: 0,label,statement
0,5,building a wall on the usmexico border will ta...
1,1,wisconsin is on pace to double the number of l...
2,1,says john mccain has done nothing to help the ...
3,3,suzanne bonamici supports a plan that will cut...
4,0,when asked by a reporter whether hes at the ce...


In [6]:
statements_train = train['statement']
labels_train = train['label']

statements_valid = valid['statement']
labels_valid = valid['label']

statements_test = test['statement']
labels_test = test['label']

Now, the data is processed and ready to use!
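
As a quick sanity check, the cleaning and label-mapping steps can be exercised on a few made-up rows. This is a minimal sketch, not part of preprocessing.ipynb: the DataFrame `toy` and its contents are hypothetical, and the helper `clean_text_demo` simply mirrors `clean_text` above.

```python
import re

import pandas as pd

# Same label mapping as in the preprocessing cell above.
numeric_labels = {'pants-fire': 0, 'false': 1, 'barely-true': 2,
                  'half-true': 3, 'mostly-true': 4, 'true': 5}

def clean_text_demo(text):
    """Lowercase and strip punctuation; non-string values pass through unchanged."""
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text.lower())

# Hypothetical rows, made up for illustration only.
toy = pd.DataFrame({
    'label': ['half-true', 'pants-fire', 'true'],
    'statement': ["Taxes rose 5% in the '90s!", None, 'U.S.-Mexico trade is up.'],
    'speaker': ['a', 'b', 'c'],
})

toy['statement'] = toy['statement'].apply(clean_text_demo)
toy['label'] = toy['label'].map(numeric_labels)

# Keep only the two columns the model uses; the row with a missing statement is dropped.
toy = toy[['label', 'statement']].dropna()

print(toy)
```

Digits survive the cleaning step, while the row with the missing statement is removed by `dropna()`.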

<h3>Tokenizing Political Statements</h3>

Next, we will tokenize the political statements using a pretrained embedding model. Specifically, we will be using the slim Google News word2vec model (https://github.com/eyaler/word2vec-slim/tree/master). According to the GitHub README, "the model was trained over a 3 billion word corpus, and contains 3 million words (of which ~930k are NOT phrases, i.e. do not contain underscores)." Using this model makes tokenizing the statements much easier, as we do not need to build the whole token dictionary by hand. Some words in the statements are not in the pretrained vocabulary, so the functions below assign each of those words a new token index.

In [7]:
from gensim.models import KeyedVectors

# load the pretrained embedding model
embed = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300-SLIM.bin', binary=True)

In [16]:
# find all words that are missing from the pretrained vocabulary
# (duplicates are kept here and removed later)
def unlisted_words(embed, statements):
    unlisted = []
    for statement in statements:
        for word in statement.split():
            if word not in embed.key_to_index:
                unlisted.append(word)
    return unlisted

# create a token dictionary for unlisted words;
# new tokens start at num, the size of the pretrained vocabulary
def token_dict(unlisted, num):
    return {word: num + i for i, word in enumerate(unlisted)}

# convert political statements to tokens
def tokenize_all_statements(embed, statements, t_dict):
    tokenized_statements = []
    for statement in statements:
        ints = []
        # split each statement into a list of words
        for word in statement.split():
            idx = embed.key_to_index.get(word)

            # if word not in the pretrained vocabulary, use its token from t_dict
            # (falling back to 0 if it is missing from t_dict as well)
            if idx is None:
                idx = t_dict.get(word, 0)
            ints.append(idx)
        tokenized_statements.append(ints)

    return tokenized_statements

In [21]:
# find unlisted words and build token dictionary
train_unlisted = unlisted_words(embed, statements_train)
valid_unlisted = unlisted_words(embed, statements_valid)
test_unlisted = unlisted_words(embed, statements_test)

# merge the three lists, dropping duplicates while keeping first-seen order
unlisted = list(dict.fromkeys(train_unlisted + valid_unlisted + test_unlisted))

num_tokens = len(embed.key_to_index)

t_dict = token_dict(unlisted, num_tokens)

# tokenize the statements
tokenized_train = tokenize_all_statements(embed, statements_train, t_dict)
tokenized_valid = tokenize_all_statements(embed, statements_valid, t_dict)
tokenized_test = tokenize_all_statements(embed, statements_test, t_dict)

# check if the tokenizing works
print(statements_train[0])
print(tokenized_train[0])

print(statements_valid[0])
print(tokenized_valid[0])

print(statements_test[0])
print(tokenized_test[0])

says the annies list political group supports thirdtrimester abortions on demand
[109, 9, 299567, 680, 424, 215, 2876, 299568, 11132, 4, 656]
we have less americans working now than in the 70s
[34, 19, 350, 69404, 322, 92, 55, 0, 9, 302829]
building a wall on the usmexico border will take literally years
[446, 299581, 2270, 4, 9, 299709, 1473, 21, 135, 5220, 72]
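
The numbering scheme is easier to see on a toy vocabulary. The sketch below is only an illustration under stated assumptions: a plain dict, `toy_vocab`, stands in for `embed.key_to_index`, and `build_token_ids` is a hypothetical condensed version of the three functions above. Words found in the vocabulary keep their pretrained index, and each new word gets the next free integer, starting at the vocabulary size.

```python
def build_token_ids(vocab, statements):
    """Map every word to an integer: pretrained index if known, otherwise a new id.

    Assumes vocab indices run from 0 to len(vocab) - 1, so new ids can
    start at len(vocab). New ids are assigned in order of first appearance.
    """
    extra = {}  # unlisted word -> newly assigned id
    tokenized = []
    for statement in statements:
        ids = []
        for word in statement.split():
            if word in vocab:
                ids.append(vocab[word])
            else:
                # setdefault stores the next free id only the first time a word is seen
                ids.append(extra.setdefault(word, len(vocab) + len(extra)))
        tokenized.append(ids)
    return tokenized, extra

# Hypothetical stand-in for embed.key_to_index.
toy_vocab = {'in': 0, 'the': 1, 'says': 2, 'border': 3}
toy_statements = ['says the usmexico border', 'in the 70s the usmexico wall']

tokenized, extra = build_token_ids(toy_vocab, toy_statements)
print(tokenized)
print(extra)
```

A repeated unlisted word such as `usmexico` receives the same id in both statements, which is the property the deduplication step in the cell above is there to guarantee.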


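Now, we need to pad the tokenized statements so that they all have the same length. The final array should be 2D, with as many rows as there are statements and as many columns as there are words in the longest statement. Shorter statements are left-padded with zeros. Note that 0 is also the token of a real word ("in" maps to 0 in the output above), so padding zeros cannot be distinguished from that word.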

In [24]:
# pad the features into a 2D representation
def pad_features(tokenized_statements, max_length):

    # getting the correct rows x cols shape
    features = np.zeros((len(tokenized_statements), max_length), dtype=int)

    for i, row in enumerate(tokenized_statements):
        # truncate first so the slice on the left always matches the row length
        row = row[:max_length]
        if row:  # an empty statement stays all zeros
            features[i, -len(row):] = row

    return features

In [26]:
# length (in words) of the longest statement in each split
max_len_train = max(len(x.split()) for x in statements_train)
max_len_valid = max(len(x.split()) for x in statements_valid)
max_len_test = max(len(x.split()) for x in statements_test)

# use one shared length so all splits get the same number of columns
max_len = max(max_len_train, max_len_test, max_len_valid)

features = pad_features(tokenized_train, max_len)

# test statements to make sure dimensions are set
assert len(features)==len(tokenized_train), "Features should have as many rows as statements."
assert len(features[0])==max_len, "Each feature row should contain max_len values."


Now, the training data is tokenized and stored in a 2D array. We repeat the same steps for the validation and test sets (valid.tsv and test.tsv).
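
A minimal sketch of that repetition, on made-up token lists rather than the real data: the helper `left_pad` is a hypothetical stand-in that behaves like `pad_features`, and `toy_splits` is invented for illustration. All three splits are padded to one shared `max_len`, so the resulting arrays have the same number of columns.

```python
import numpy as np

def left_pad(tokenized_statements, max_length):
    """Left-pad each token list with zeros (truncating to max_length) into a 2D array."""
    features = np.zeros((len(tokenized_statements), max_length), dtype=int)
    for i, row in enumerate(tokenized_statements):
        row = row[:max_length]
        if row:  # an empty statement stays all zeros
            features[i, -len(row):] = row
    return features

# Hypothetical tokenized splits, made up for illustration only.
toy_splits = {
    'train': [[5, 2, 9], [7]],
    'valid': [[3, 3, 3, 3]],
    'test': [[1, 8]],
}

# One shared length: the longest statement across all three splits.
max_len = max(len(row) for rows in toy_splits.values() for row in rows)

padded = {name: left_pad(rows, max_len) for name, rows in toy_splits.items()}

for name, features in padded.items():
    print(name, features.shape)
    print(features)
```

Using a single `max_len` for every split matters because the model expects inputs of one fixed width, whichever split they come from.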