## Objective
Develop a machine learning program to identify when an article might be fake news.

The metric used to tune this model is 'accuracy'.

accuracy = (correct predictions/(correct predictions + incorrect predictions))
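
As a minimal sketch of the formula above, the function below computes accuracy from two label lists. The function name `accuracy` and the sample labels are made up for illustration; they are not part of the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative labels only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```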

In [1]:
#importing the required libraries
import pandas as pd
import numpy as np

In [2]:
#reading the data
df = pd.read_csv('train.csv')

In [3]:
#displaying first 5 rows of the data set
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [4]:
df.shape

(20800, 5)

In [5]:
#checking for null values
df.isna().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [6]:
#filling the missing values with Missing
df.fillna('Missing', inplace=True)

In [7]:
df.isna().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [8]:
df.shape

(20800, 5)

The main body of a news article plays a major role in determining whether it is fake, so rows where the body text is 'Missing' are dropped.

In [9]:
#.copy() avoids SettingWithCopyWarning in the in-place edits that follow
df = df[df.text != 'Missing'].copy()

In [10]:
df.drop('id', axis = 1, inplace = True)

In [11]:
df['text'] = df.title + ' ' +df.author + ' ' + df.text

In [12]:
df.drop(['title', 'author'], axis = 1, inplace = True)

In [13]:
df.head()

Unnamed: 0,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",0
2,Why the Truth Might Get You Fired Consortiumne...,1
3,15 Civilians Killed In Single US Airstrike Hav...,1
4,Iranian woman jailed for fictional unpublished...,1


In [14]:
#checking for null values
df.isna().sum()

text     0
label    0
dtype: int64

In [16]:
#splitting the data into train and test sets
from sklearn.model_selection import train_test_split

In [17]:
trainx, testx, trainy, testy = train_test_split(df.text, df.label, test_size = 0.3, random_state = 123)

In [18]:
#checking if the dataset is balanced or not
trainy.value_counts()

1    7289
0    7243
Name: label, dtype: int64

In [19]:
#importing the TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## TFIDF Vectorizer
The words in the text are tokenized.
The values in the TF-IDF document-term matrix are calculated using the following formula.

value = TF * IDF

TF - Term Frequency 
IDF - Inverse Document Frequency

TF = number of times the term occurred in the document / total number of terms in the document
IDF = log(number of documents/ numbers of documents in which the term occurs)
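
To make the formula concrete, here is a small worked sketch in plain Python, taking TF as the term count divided by the number of terms in the document. The toy corpus and the helper names `tf`, `idf` and `tfidf` are invented for illustration, and the sketch assumes every queried term occurs in at least one document.

```python
import math

# Toy corpus, made up for illustration
docs = [
    "senate passes budget bill",
    "budget talks stall again",
    "storm delays budget vote",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log(number of documents / number of documents containing the term)
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "budget" occurs in every document, so IDF = log(3/3) = 0 and the value is 0.0
print(tfidf("budget", tokenized[0], tokenized))
# "stall" occurs in one document only: (1/4) * log(3/1)
print(tfidf("stall", tokenized[1], tokenized))
```

Note that scikit-learn's `TfidfVectorizer` does not use exactly this textbook form: by default it uses raw term counts for TF, a smoothed IDF, and L2-normalizes each row, so the values it produces will differ from this sketch.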

In [20]:
tfidf = TfidfVectorizer(stop_words='english', max_df = 0.7)
tfidf.fit(trainx)
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vec = pd.DataFrame(tfidf.transform(trainx).todense(), columns=tfidf.get_feature_names_out())

In [21]:
#importing regular expressions
import re

In [22]:
#checking for columns where there are only numbers
pattern = r'\D+'
columns_to_drop = []
for col in vec.columns:
    #no non-digit character found means the token is purely numeric
    if not re.search(pattern, str(col)):
        columns_to_drop.append(col)
#the list is only inspected and then discarded; no columns are dropped from vec
del columns_to_drop

<strong>Passive aggressive classifier</strong> is used to build the model for the document term matrix.

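The passive aggressive classifier is an online learning algorithm. It processes one training example at a time: if the example is already classified correctly with enough margin, the weights stay unchanged (passive); otherwise the weights are updated just enough to correct the error (aggressive). The updated weights are carried over to the next example, and the example itself does not need to be kept in memory.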
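
The idea behind the classifier can be sketched as a single update step. The code below is a simplified illustration of the PA-I rule with hinge loss (the variant that `loss='hinge'` corresponds to in scikit-learn), not scikit-learn's actual implementation. The function name `pa_update`, the list-based vectors, and the use of labels +1/-1 are assumptions made for the sketch.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step. w, x: lists of floats; y: +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)      # hinge loss
    if loss == 0.0:
        return w                       # passive: margin is already large enough
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return w                       # all-zero features: nothing to update
    tau = min(C, loss / norm_sq)       # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # margin is 0, so the weights move
print(w)                           # [1.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # margin is now 1, so the weights stay
print(w)                           # [1.0, 0.0]
```

Only the current weight vector is needed between steps, which is why the algorithm suits large or streaming text datasets.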

In [23]:
from sklearn.linear_model import PassiveAggressiveClassifier

In [24]:
pac=PassiveAggressiveClassifier(max_iter=50, random_state = 123, n_jobs = -1)
pac.fit(vec,trainy)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=-1, random_state=123, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

In [25]:
from sklearn.metrics import classification_report

In [26]:
#memory release
del df
del trainx

## Classification Report - Train

In [27]:
print(classification_report(trainy, pac.predict(vec)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7243
           1       1.00      1.00      1.00      7289

    accuracy                           1.00     14532
   macro avg       1.00      1.00      1.00     14532
weighted avg       1.00      1.00      1.00     14532



## Classification Report - Test

In [28]:
vec_test = pd.DataFrame(tfidf.transform(testx).todense(), columns=tfidf.get_feature_names_out())

In [29]:
print(classification_report(testy, pac.predict(vec_test)))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      3144
           1       0.97      0.97      0.97      3085

    accuracy                           0.97      6229
   macro avg       0.97      0.97      0.97      6229
weighted avg       0.97      0.97      0.97      6229
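
For readers unfamiliar with the columns of the report, the sketch below shows how precision, recall and F1 for one class are derived from raw counts. The helper name `precision_recall_f1` and the sample labels are invented for illustration and are unrelated to the results above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Per-class precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative labels only: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```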



In [30]:
del testx

Preprocessing the given test data for submission on Kaggle.

In [31]:
test_df = pd.read_csv('test.csv')

In [32]:
test_df.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [33]:
test_df.isna().sum()

id          0
title     122
author    503
text        7
dtype: int64

In [35]:
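#assigning back instead of calling fillna(inplace=True) on a column, which may not update test_df in newer pandas
test_df['title'] = test_df['title'].fillna('Missing')
test_df['author'] = test_df['author'].fillna('Missing')
test_df['text'] = test_df['text'].fillna(' ')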

In [37]:
test_df['text'] = test_df.title + ' ' +test_df.author + ' ' + test_df.text

In [39]:
test_df.drop(['title', 'author'], axis = 1, inplace = True)

In [38]:
vec_test_df = pd.DataFrame(tfidf.transform(test_df.text).todense(), columns=tfidf.get_feature_names_out())

In [44]:
submission = pd.DataFrame({'id':test_df.id, 'label': pac.predict(vec_test_df)})

In [45]:
submission.to_csv('submission1.csv', index = False)

##### After submission, my accuracy on the Kaggle leaderboard was 97%