<h2>CNN for Fake News Detection</h2>

This file contains the implementation of a CNN that performs sentiment analysis on political statements, assigning each statement one of six truthfulness labels. Some of the code is adapted from https://cezannec.github.io/CNN_Text_Classification/. The model is built in the following steps:

<ol>
    <li>Data Preprocessing</li>
    <li>Tokenizing Political Statements</li>
    <li>Train/Validation/Test Splitting (already set in the data folder)</li>
    <li>Defining a CNN for Sentiment Analysis</li>
    <li>Training and Evaluating the Model</li>
</ol>



<h3>Data Preprocessing</h3>

The following code comes from the preprocessing.ipynb file.

In [25]:
import pandas as pd
import numpy as np
import nltk
import re
from tqdm import tqdm
import os

numeric_labels = {'pants-fire':0, 'false':1, 'barely-true':2, 'half-true':3, 'mostly-true':4, 'true':5}
path = os.getcwd() + '/data'
headers = ['id', 'label', 'statement', 'subject', 'speaker', 'job_title', 'state_info', 'affiliation', 'barely_true',
           'false', 'half_true', 'mostly_true', 'pants-fire', 'context']
train = pd.read_csv(path + '/train.tsv', sep='\t', header=None, names=headers)
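# the validation file has no header row either, so pass the same column names
valid = pd.read_csv(path + '/valid.tsv', sep='\t', header=None, names=headers)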

# lowercase and remove punctuation (digits are kept, because \w matches them)
def clean_text(text):
    # non-string values (e.g. NaN) are returned unchanged and dropped later
    if not isinstance(text, str):
        return text
    cleaned = text.lower()
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    return cleaned

def clean_labels(label):
    return numeric_labels[label]

# cols indicates which columns we want to KEEP, the rest are dropped
def drop_columns(df, cols):
    new_df = df[cols]
    new_df = new_df.dropna() # rows with incomplete information (NaN's) are dropped
    return new_df
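
As a quick sanity check, the cleaning and label-mapping steps above can be tried on a few strings before running them on the full dataset. This is a minimal sketch: the sample statements are made up for illustration and are not taken from the data, and `clean_text` is restated in compact form so the block runs on its own.

```python
import re

# label mapping copied from the preprocessing cell above
numeric_labels = {'pants-fire': 0, 'false': 1, 'barely-true': 2,
                  'half-true': 3, 'mostly-true': 4, 'true': 5}

def clean_text(text):
    # same logic as the helper above: lowercase, then strip punctuation
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text.lower())

# HYPOTHETICAL statements, invented for this example
samples = ["Taxes rose 5% in 2010!", "It's a FACT.", None]
cleaned = [clean_text(s) for s in samples]

print(cleaned)                      # digits survive, punctuation does not
print(numeric_labels['half-true'])  # string label -> integer class
```

Non-string values pass through `clean_text` untouched; in the notebook, such rows are removed afterwards by the `dropna()` call in `drop_columns`.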

For this model, we will only be using the statement text and the corresponding labels.

In [26]:
train['statement'] = train['statement'].apply(clean_text)
train['label'] = train['label'].apply(clean_labels)
train = drop_columns(train, ['label', 'statement'])
train['statement'].to_csv('text_only.csv', index=False, header=['text'])
train.head()

Unnamed: 0,label,statement
0,1,says the annies list political group supports ...
1,3,when did the decline of coal start it started ...
2,4,hillary clinton agrees with john mccain by vot...
3,1,health care reform legislation is likely to ma...
4,3,the economic turnaround started at the end of ...


In [45]:
statements = train['statement']
labels = train['label']

Now, the data is processed and ready to use!

<h3>Tokenizing Political Statements</h3>

Next, we will tokenize the political statements using a pretrained embedding model. Specifically, we will be using the Google News word2vec model, loaded from the word2vec-slim repository (https://github.com/eyaler/word2vec-slim/tree/master). According to the GitHub README, "the model was trained over a 3 billion word corpus, and contains 3 million words (of which ~930k are NOT phrases, i.e. do not contain underscores)." Using this model makes tokenizing the statements much easier: its vocabulary already maps each word to an integer index, so we do not need to create the token dictionaries by hand. Some words in the statements are not in the pretrained model's vocabulary, so the tokenizing function assigns those words new indices.

In [80]:
from gensim.models import KeyedVectors

# creating the pretrained embedding model
embed = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300-SLIM.bin', binary=True)

In [81]:
# convert political statements to tokens
def tokenize_all_statements(embed, statements):
    # split each statement into a list of words
    statement_words = [statement.split() for statement in statements]
    em_len = len(embed.key_to_index)
    new_tokens = {}  # out-of-vocabulary word -> newly assigned index

    tokenized_statements = []
    for statement in statement_words:
        ints = []
        for word in statement:
            idx = embed.key_to_index.get(word)

            # if word not in the pretrained vocabulary, create a new token;
            # a repeated unknown word reuses the index it was given the first time
            if idx is None:
                idx = new_tokens.setdefault(word, em_len + len(new_tokens))
            ints.append(idx)
        tokenized_statements.append(ints)

    return tokenized_statements

In [82]:
# tokenize the statements
tokenized_states = tokenize_all_statements(embed, statements)

# check if the tokenizing works
print(statements.iloc[0])  # positional lookup, to match the list index used below
print(tokenized_states[0])

says the annies list political group supports thirdtrimester abortions on demand
[109, 9, 299567, 680, 424, 215, 2876, 299568, 11132, 4, 656]
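
Two of the ids in this output are much larger than the rest. Because the tokenizer numbers unknown words starting at the vocabulary size, large ids mark words that the pretrained model does not contain. The sketch below uses only the values printed above; the vocabulary size of 299567 is an assumption inferred from the first new id in that output (the first unknown word receives `em_len + 0`), not a value read from the model.

```python
# values copied from the printed output above
statement = ("says the annies list political group supports "
             "thirdtrimester abortions on demand")
tokens = [109, 9, 299567, 680, 424, 215, 2876, 299568, 11132, 4, 656]

# ASSUMPTION: vocabulary size inferred from the first newly created id
vocab_size = 299567

# words whose id falls outside the pretrained vocabulary
oov_words = [word for word, idx in zip(statement.split(), tokens)
             if idx >= vocab_size]
print(oov_words)
```

Here the unknown words are a possessive that lost its apostrophe and a compound that lost its hyphen during cleaning, which suggests that punctuation removal is one source of out-of-vocabulary tokens.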


Now, we need to pad the tokenized statements so that they all have the same length. The final array should be 2D, with as many rows as there are statements and as many columns as there are tokens in the longest statement.
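
One possible way to do this with NumPy is sketched below. The helper name `pad_statements`, the choice to pad on the left, and the pad value of 0 are all assumptions made for illustration; they are not necessarily what this notebook uses.

```python
import numpy as np

def pad_statements(tokenized, pad_value=0):
    """Left-pad each token list to the length of the longest one (hypothetical helper)."""
    max_len = max(len(tokens) for tokens in tokenized)
    padded = np.full((len(tokenized), max_len), pad_value, dtype=np.int64)
    for row, tokens in enumerate(tokenized):
        if tokens:  # skip empty statements: a `-0:` slice would select the whole row
            padded[row, -len(tokens):] = tokens
    return padded

# toy token lists, invented for this example
toy = [[109, 9, 680], [4], [424, 215]]
features = pad_statements(toy)
print(features.shape)
print(features)
```

Note that with the tokenizer above, 0 is also the id of a real vocabulary entry, so a dedicated padding id may be preferable if padding must be distinguishable from words.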