# Fake News Detection with NLP

Resham Gala / Javier Granda / Theo Tortorici

## Reading and preparing the data

In [2]:
import numpy as np
import pandas as pd

### Loading Dataset

In [3]:
train = pd.read_csv('fake_or_real_news_training.csv', index_col="ID", encoding='utf-8')
test = pd.read_csv('fake_or_real_news_test.csv', index_col="ID" , encoding='utf-8')

### Taking a look at the data

In [4]:
train.head()

| ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|
| 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | NaN | NaN |
| 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | NaN | NaN |
| 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL | NaN | NaN |
| 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | NaN | NaN |
| 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL | NaN | NaN |


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3999 entries, 8476 to 9673
Data columns (total 5 columns):
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: object(5)
memory usage: 187.5+ KB


In [35]:
test.head()

| ID | title | text |
|---|---|---|
| 10498 | September New Homes Sales Rise——-Back To 1992 ... | September New Homes Sales Rise Back To 1992 Le... |
| 2439 | Why The Obamacare Doomsday Cult Can't Admit It... | But when Congress debated and passed the Patie... |
| 864 | Sanders, Cruz resist pressure after NY losses,... | The Bernie Sanders and Ted Cruz campaigns vowe... |
| 4128 | Surviving escaped prisoner likely fatigued and... | Police searching for the second of two escaped... |
| 662 | Clinton and Sanders neck and neck in Californi... | No matter who wins California's 475 delegates ... |


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 10498 to 4330
Data columns (total 2 columns):
title    2321 non-null object
text     2321 non-null object
dtypes: object(2)
memory usage: 54.4+ KB


### Cleaning the dataset

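In some rows the values are shifted one or two columns to the right, so the `label` column does not contain `REAL` or `FAKE` and the real label ends up in `X1` or `X2`. We drop these rows from the training set, which reduces it from 3999 to 3966 rows.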

In [7]:
train = train[(train.label == 'REAL') | (train.label == 'FAKE')]

In [8]:
train.label.unique()

array(['FAKE', 'REAL'], dtype=object)
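
Before dropping the shifted rows it can help to look at them. The following is a minimal sketch on a toy frame (the rows are made up for illustration and are not taken from the dataset); it separates valid from shifted rows with `isin`, which gives the same result as the boolean-OR filter used above.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical rows): in the second
# row an extra field has pushed the real label out of `label` into `X1`.
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "text": ["body a", "second half of title b", "body c"],
        "label": ["FAKE", "body b", "REAL"],
        "X1": [None, "REAL", None],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

valid = toy["label"].isin(["REAL", "FAKE"])
shifted_rows = toy[~valid]  # rows to inspect (or repair) before dropping
clean = toy[valid]          # rows kept for training

print(len(shifted_rows), "shifted row(s);", len(clean), "row(s) kept")
```

Inspecting `shifted_rows` first makes it possible to decide whether repairing the rows is worth the effort instead of discarding them.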

### Combining the train and test data

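The idea is to build the vocabulary and the TF-IDF weights from as much text as possible. The classifiers themselves are still trained only on the labeled training rows.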

In [14]:
combined_df = pd.concat([train,test])

In [15]:
combined_df.head()

| ID | X1 | X2 | label | text | title |
|---|---|---|---|---|---|
| 8476 | NaN | NaN | FAKE | Daniel Greenfield, a Shillman Journalism Fello... | You Can Smell Hillary’s Fear |
| 10294 | NaN | NaN | FAKE | Google Pinterest Digg Linkedin Reddit Stumbleu... | Watch The Exact Moment Paul Ryan Committed Pol... |
| 3608 | NaN | NaN | REAL | U.S. Secretary of State John F. Kerry said Mon... | Kerry to go to Paris in gesture of sympathy |
| 10142 | NaN | NaN | FAKE | — Kaydee King (@KaydeeKing) November 9, 2016 T... | Bernie supporters on Twitter erupt in anger ag... |
| 875 | NaN | NaN | REAL | It's primary day in New York and front-runners... | The Battle of New York: Why This Primary Matters |
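
Because `pd.concat` stacks the frames in the order given, the training rows come first and the test rows follow with a missing `label`. A small sketch with toy frames (the names `train_toy` and `test_toy` are hypothetical) shows how the two parts can be recovered by position later on:

```python
import pandas as pd

train_toy = pd.DataFrame(
    {"title": ["t1", "t2"], "text": ["a", "b"], "label": ["FAKE", "REAL"]},
    index=pd.Index([10, 11], name="ID"),
)
test_toy = pd.DataFrame(
    {"title": ["t3"], "text": ["c"]},
    index=pd.Index([20], name="ID"),
)

# Rows keep the order of the list: all train rows, then all test rows.
combined = pd.concat([train_toy, test_toy], sort=False)

n_train = len(train_toy)
train_part = combined.iloc[:n_train]  # labeled rows
test_part = combined.iloc[n_train:]   # rows whose label is NaN

print(train_part.index.tolist(), test_part.index.tolist())
```

Slicing by `len(train_toy)` rather than by a literal number keeps the split correct if the cleaning step ever removes a different number of rows.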


### Joining the title and text field

In [16]:
df = combined_df.title + ' ' + combined_df.text
df.head()

ID
8476     You Can Smell Hillary’s Fear Daniel Greenfield...
10294    Watch The Exact Moment Paul Ryan Committed Pol...
3608     Kerry to go to Paris in gesture of sympathy U....
10142    Bernie supporters on Twitter erupt in anger ag...
875      The Battle of New York: Why This Primary Matte...
dtype: object

In [13]:
df.shape

(6287,)

## Tokenization and Lemmatization

We use NLTK with the WordNet corpus to build a callable class that tokenizes each document and lemmatizes every token.

In [19]:
import nltk
# nltk.download()  # uncomment to open the downloader if resources are missing (word_tokenize also needs the Punkt tokenizer models)
nltk.download('wordnet')
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/javiergranda/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
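
`CountVectorizer` accepts any callable as its `tokenizer` and calls it once per document, which is why `LemmaTokenizer` implements `__call__`. The sketch below illustrates only that mechanism: `SimpleTokenizer` is a hypothetical stand-in that strips a plural "s" with plain Python so the example runs without NLTK data. It is not a lemmatizer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


class SimpleTokenizer:
    """Hypothetical stand-in for LemmaTokenizer (crude plural stripping)."""

    def __call__(self, doc):
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Drop a trailing "s" from longer words; a real lemmatizer does far more.
        return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


# token_pattern=None signals that the default regex is deliberately unused.
vectorizer = CountVectorizer(tokenizer=SimpleTokenizer(), token_pattern=None)
counts = vectorizer.fit_transform(["Senators vote", "A senator votes"])
vocab = sorted(vectorizer.vocabulary_)

print(vocab)
```

Both documents map "Senators"/"senator" and "vote"/"votes" onto the same features, which is the effect the lemmatizer is meant to have on the real corpus.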


In [14]:
#from sklearn.pipeline import Pipeline
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.naive_bayes import MultinomialNB
#from sklearn.linear_model import SGDClassifier

#pipeline_object = Pipeline([('vect', CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', ngram_range=(1, 3))),('tfidf', TfidfTransformer()),('svm-clf', SGDClassifier(loss='perceptron', penalty='l2',alpha=1e-3,max_iter=5,random_state=42))])
#pipeline_object = pipeline_object.fit(df, train.label)

In [15]:
#pipeline_object

## Term document matrix

We build the term-document matrix with sklearn's CountVectorizer, removing English stop words and combining unigrams, bigrams and trigrams (`ngram_range=(1, 3)`) to capture word relationships. Tokenization uses the `LemmaTokenizer` defined above.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english', ngram_range=(1, 3)) # remove English stop words; use uni-, bi- and tri-grams
X_combined_counts = count_vect.fit_transform(df) # title + text from the combined train and test sets
X_combined_counts.shape # check how many features were created

(6287, 3989224)
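
The very large number of features comes mostly from the n-grams: every distinct sequence of up to three consecutive tokens becomes its own column. A minimal sketch on a single made-up sentence (default tokenizer, no stop-word removal) shows how the vocabulary grows with the upper bound of `ngram_range`:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["fake news spreads fast"]

sizes = {}
for max_n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, max_n))
    vec.fit(doc)
    # vocabulary_ maps each n-gram to its column index
    sizes[max_n] = len(vec.vocabulary_)

print(sizes)
```

A four-word sentence yields 4 unigrams, 3 bigrams and 2 trigrams, so the vocabulary more than doubles when going from `(1, 1)` to `(1, 3)`.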

## Re-weighting counts with TF-IDF

We use the tf-idf transform to re-weight the raw count features into floating point values better suited to the classifiers. The matrix stays sparse; only the values change.

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_combined_tfidf = tfidf_transformer.fit_transform(X_combined_counts)

# Subsetting for train set
X_train_tfidf = X_combined_tfidf[:len(train)]  # the first len(train) = 3966 rows are the training rows
X_train_tfidf.shape

(3966, 3989224)

In [28]:
# Subsetting for test set
X_test_tfidf = X_combined_tfidf[len(train):]  # the remaining rows are the test rows
X_test_tfidf.shape

(2321, 3989224)
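
To make the re-weighting concrete, the sketch below reproduces what `TfidfTransformer` does with its default settings (smoothed IDF, then L2 row normalization) on a tiny made-up count matrix and compares the result with sklearn's output. The matrix values are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# 2 documents x 3 terms, raw counts (toy values)
counts = np.array([[2, 1, 0],
                   [1, 0, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (smooth_idf=True)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2 per row

sk = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()

print(np.allclose(manual, sk))
```

Terms that occur in every document keep an IDF of 1, while rarer terms are weighted up, and each row ends with unit length.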

## Naive Bayes

Building the classifier

In [29]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, train.label)

### Cross-validation of the Naive Bayes classifier

In [30]:
import warnings; warnings.simplefilter('ignore')
from sklearn.model_selection import cross_val_score

cv_score_nb = cross_val_score(
    clf,
    X_train_tfidf,
    train.label,
    cv = 3,
    n_jobs = -1)

print('Train - Naive Bayes cross validated score is: '+ str(np.mean(cv_score_nb)))


Train - Naive Bayes cross validated score is: 0.7975270109869541
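
In the cell above, `cross_val_score` splits a matrix whose vocabulary and IDF weights were already computed on all rows. One possible variant, sketched here on a made-up toy corpus, wraps the vectorizer, the transformer and the classifier in a `Pipeline` so that every step is re-fit inside each fold. The step names and texts are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus: six documents per class
texts = ["shocking secret exposed"] * 6 + ["senate passes budget bill"] * 6
labels = ["FAKE"] * 6 + ["REAL"] * 6

nb_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Each fold fits the vocabulary and IDF weights on its own training part.
scores = cross_val_score(nb_pipeline, texts, labels, cv=3)

print("mean accuracy: %.3f (std %.3f)" % (np.mean(scores), np.std(scores)))
```

Reporting the standard deviation next to the mean gives a sense of how much the score varies between folds.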


## Max Entropy Classifier (Logistic Regression)

In [31]:
from sklearn.linear_model import LogisticRegression

max_ent_lr = LogisticRegression(
    penalty = 'l2', 
    tol = 0.0001, 
    C=1.0, 
    dual = False, 
    class_weight = None, 
    multi_class = 'multinomial', 
    solver = 'saga', 
    max_iter = 100, 
    n_jobs = 1).fit(X_train_tfidf, train.label)

In [32]:
cv_score_maxent = cross_val_score(
    max_ent_lr,
    X_train_tfidf,
    train.label,
    cv = 5,
    n_jobs = -1)

print('Train - Max Entropy cross validated score is: '+ str(np.mean(cv_score_maxent)))

Train - Max Entropy cross validated score is: 0.9074661474298094


## SVM Classifier 

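Regularized linear model trained with stochastic gradient descent (SGD). Note that `loss='perceptron'` fits a regularized perceptron; a linear SVM in the strict sense would use `loss='hinge'`. We keep the name SVM below for consistency with the variable names.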

In [35]:
from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(
    loss='perceptron', 
    penalty='l2',
    alpha=1e-3,
    max_iter=5,
    random_state=42).fit(X_train_tfidf, train.label)

In [36]:
import warnings; warnings.simplefilter('ignore')
from sklearn.model_selection import cross_val_score

cv_score_svm = cross_val_score(
    svm_clf,
    X_train_tfidf,
    train.label,
    cv = 5,
    n_jobs = -1)

print('Train - SVM cross validated score is: '+ str(np.mean(cv_score_svm)))

Train - SVM cross validated score is: 0.9392321350862872
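
`SGDClassifier` fits a linear SVM when `loss='hinge'`, whereas the cell above uses `loss='perceptron'`. The sketch below compares the two losses on synthetic data from `make_classification` (not the news data), with the other settings taken from the cell above except for a larger `max_iter`. It only shows how such a comparison could be set up and says nothing about which loss is better on the real corpus.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (stand-in for the TF-IDF matrix)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

results = {}
for loss in ("hinge", "perceptron"):
    clf = SGDClassifier(loss=loss, penalty="l2", alpha=1e-3,
                        max_iter=1000, tol=1e-3, random_state=42)
    results[loss] = cross_val_score(clf, X, y, cv=5).mean()

for loss, score in results.items():
    print(loss, round(score, 3))
```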


In [54]:
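# We tried grid search for hyper-parameter tuning, without success.
# A likely cause: the parameter names below (vect__..., tfidf__..., clf-svm__...)
# address pipeline steps, but svm_clf is a bare SGDClassifier, not a Pipeline.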

#from sklearn.model_selection import GridSearchCV
#params_svm = {
#    'vect__ngram_range': [(1, 1), (1, 2)],
#    'tfidf__use_idf': (True, False),
#    'clf-svm__alpha': (1e-2, 1e-3),}

#gs_clf_svm = GridSearchCV(svm_clf, params_svm, n_jobs=-1)
#gs_clf_svm = gs_clf_svm.fit(X_train_tfidf, train.label)
#gs_clf_svm.best_score_
#gs_clf_svm.best_params_
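
Parameter names of the form `step__param` are only understood by `GridSearchCV` when the estimator is a `Pipeline` whose step names match the prefixes. The following is a minimal sketch of a grid search set up that way, on a made-up toy corpus; the step names `vect`, `tfidf` and `clf` are assumptions chosen for this example, and the grid mirrors the values in the commented-out cell.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus: six documents per class
texts = ["shocking secret exposed"] * 6 + ["senate passes budget bill"] * 6
labels = ["FAKE"] * 6 + ["REAL"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Each key is "<step name>__<parameter of that step>".
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-2, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)  # raw texts go in; the pipeline vectorizes them

print(search.best_params_, search.best_score_)
```

On the full corpus with the lemmatizing tokenizer and tri-grams this search would refit the vectorizer for every parameter combination and fold, so it may be slow.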

## Passive Aggressive Classifier

In [41]:
from sklearn.linear_model import PassiveAggressiveClassifier

pac = PassiveAggressiveClassifier(loss = 'squared_hinge').fit(X_train_tfidf, train.label)

In [42]:
cv_score_pac = cross_val_score(
    pac,
    X_train_tfidf,
    train.label,
    cv = 5,
    n_jobs = -1)

print('Train - PAC cross validated score is: '+ str(np.mean(cv_score_pac)))


Train - PAC cross validated score is: 0.9389815164807939
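
The scores above were computed with `cv = 3` for Naive Bayes and `cv = 5` for the other models. For a like-for-like comparison, the models can be scored on the same folds. The sketch below does this for three of the model types on synthetic, non-negative data (a stand-in for the TF-IDF matrix); other classifiers can be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X - X.min()  # MultinomialNB requires non-negative features

models = {
    "naive_bayes": MultinomialNB(),
    "max_entropy": LogisticRegression(solver="saga", max_iter=1000),
    "sgd": SGDClassifier(random_state=42),
}

# One splitter shared by all models, so every model sees identical folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {name: cross_val_score(model, X, y, cv=folds).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(name, round(score, 3))
```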


## Outputting the predictions

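Of the models above, the SVM classifier has the highest cross-validated accuracy (0.9392), narrowly ahead of the Passive Aggressive classifier (0.9390) and clearly ahead of logistic regression (0.9075) and Naive Bayes (0.7975, with 3 folds instead of 5). We therefore predict the test data with the SVM classifier.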


In [48]:
final_pred = svm_clf.predict(X_test_tfidf)
final_pred

array(['FAKE', 'REAL', 'REAL', ..., 'FAKE', 'REAL', 'REAL'], dtype='<U4')

In [50]:
# Convert the predictions into a dataframe
final_df = pd.DataFrame(final_pred, index = test.index)
final_df = final_df.reset_index()
final_df.head()

|  | ID | 0 |
|---|---|---|
| 0 | 10498 | FAKE |
| 1 | 2439 | REAL |
| 2 | 864 | REAL |
| 3 | 4128 | REAL |
| 4 | 662 | REAL |


In [57]:
# Write the submissions to a csv file
final_df.to_csv('submission2.csv', sep=',', index=False) 
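
The file written above has a prediction column named `0`. If a descriptive header is preferred, the predictions can be given a name before the index is reset. A minimal sketch with made-up IDs and predictions; the column name `label` is an assumption, since the required submission format is not stated here.

```python
import pandas as pd

# Toy stand-ins for test.index and final_pred
test_index = pd.Index([10498, 2439, 864], name="ID")
preds = ["FAKE", "REAL", "REAL"]

# A named Series becomes a named column after reset_index().
submission = pd.Series(preds, index=test_index, name="label").reset_index()

# With no path given, to_csv returns the CSV content as a string.
csv_text = submission.to_csv(index=False)
print(csv_text)
```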