### 1. Problem Description

The problem given is __Fake News Detection using Natural Language Processing__.<br>
Given Data has 4 features:<br>
__title__ - Title for the news article<br>
__author__ - author of the article<br>
__text__ - Body of the article<br>
__label__ - Whether the news provided in the article is fake or not fake

### 2. Reading Data

In [1]:
import pandas as pd

In [2]:
train_data=pd.read_csv("train.csv")

In [3]:
train_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [4]:
#dropping the column 'id'
train_data.drop(columns={'id'},inplace=True)

In [5]:
train_data.shape

(20800, 4)

In [6]:
train_data['label'].value_counts()

1    10413
0    10387
Name: label, dtype: int64

We can observe that the dataset is almost balanced as there are almost equal number of data points with label 0 and label 1.

### 3. Data Cleaning

#### 3.1. Checking for Duplicates and Dropping them

In [7]:
train_data.duplicated().sum()

109

There are 109 duplicate rows in train data and 6 duplicate rows in test data.__

In [8]:
#dropping duplicate rows
train_data.drop_duplicates(inplace=True)

In [9]:
train_data.duplicated().sum()

0

#### 3.2. Checking for Missing Values

In [10]:
train_data.isnull().sum()

title      518
author    1932
text        39
label        0
dtype: int64

In [11]:
train_data.isnull().sum()*100/train_data.shape[0]

title     2.503504
author    9.337393
text      0.188488
label     0.000000
dtype: float64

We can observe that there are many missing values. As there are many rows, we shouldn't drop all the rows. As 'author' is a categorical feature, we create a new category('missing') for missing authors. For missing values of title, text, we just replace NAN with ' '(space).

In [12]:
train_data['title'].fillna(' ',inplace=True)
train_data['text'].fillna(' ',inplace=True)
train_data['author'].fillna('missing',inplace=True)

In [13]:
train_data.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64

### 4. Data Preprocessing

In [14]:
# epanding English language contractions: https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"’","'",phrase)
    phrase = re.sub(r"”",'"',phrase)
    phrase = re.sub(r"“",'"',phrase)
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", "s", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [15]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [16]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\VS
[nltk_data]     Chaitanya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\VS
[nltk_data]     Chaitanya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
stop_words=stopwords.words('english')

In [19]:
lemmatizer=WordNetLemmatizer()

#### 4.1. title

In [20]:
from tqdm import tqdm
preprocessed_titles = []
# tqdm is for printing the status bar
for sentance in tqdm(train_data['title'].values):
    sent = decontracted(sentance)
    sent=re.sub(r'https?:\/\/.*[\r\n]*', '', sent) # remove hyperlinks
    sent = re.sub('[^A-Za-z]+', ' ', sent) #remove spacial character, numbers: https://stackoverflow.com/a/5843547/4084039
    sent = ' '.join(e for e in sent.split() if e not in stop_words) #removing stop words
    sent=' '.join(lemmatizer.lemmatize(e) for e in sent.split()) #lemmatization
    preprocessed_titles.append(sent.lower().strip())

100%|██████████| 20691/20691 [00:02<00:00, 8518.16it/s]


In [21]:
train_data['title']=preprocessed_titles

#### 4.2. text

In [22]:
preprocessed_texts = []
# tqdm is for printing the status bar
for sentance in tqdm(train_data['text'].values):
    sent = decontracted(sentance)
    sent=re.sub(r'https?:\/\/.*[\r\n]*', '', sent) # remove hyperlinks
    sent = re.sub('[^A-Za-z]+', ' ', sent) #remove spacial characters, numbers: https://stackoverflow.com/a/5843547/4084039
    sent = ' '.join(e for e in sent.split() if e not in stop_words) #removing stop words
    sent=' '.join(lemmatizer.lemmatize(e) for e in sent.split()) #lemmatization
    preprocessed_texts.append(sent.lower().strip())

100%|██████████| 20691/20691 [00:55<00:00, 371.14it/s]


In [24]:
train_data['text']=preprocessed_texts

### 6. Splitting data into train, cv, test

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
output=train_data['label']

In [27]:
train,test,train_output,test_output=train_test_split(train_data.drop(columns={'label'}),output,test_size=0.3,stratify=output,random_state=0)
train,cv,train_output,cv_output=train_test_split(train,train_output,test_size=0.3,stratify=train_output,random_state=0)

In [28]:
train.shape,cv.shape,test.shape

((10138, 3), (4345, 3), (6208, 3))

In [29]:
train_output.shape,cv_output.shape,test_output.shape

((10138,), (4345,), (6208,))

### 6. Data Encoding

#### 6.1. title - TFIDF Vectorization

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
title_tfidf_vectorizer = TfidfVectorizer(min_df=5)
train_title_tfidf=title_tfidf_vectorizer.fit_transform(train['title'].values)
cv_title_tfidf=title_tfidf_vectorizer.transform(cv['title'].values)
test_title_tfidf=title_tfidf_vectorizer.transform(test['title'].values)

In [33]:
#saving the tfidf vectorizer
import pickle
with open("title_tfidf_vectorizer.pickle","wb") as fp:
    pickle.dump(title_tfidf_vectorizer,fp,protocol=pickle.HIGHEST_PROTOCOL)

In [34]:
title_tfidf_vectorizer.get_feature_names()[:10]

['abandoned',
 'abc',
 'abedin',
 'abedins',
 'able',
 'abortion',
 'about',
 'absolutely',
 'abuse',
 'abuses']

#### 6.2. author - Response Coding

In [38]:
train['author'].unique().shape

(2707,)

We can observe that there are 2707 authors in total train data. As dimensions will be high with one hot encoding, we use response coding.

In [36]:
prob_dict={}
train_author=train.copy()
train_author['label']=train_output
train_author_1=train_author.groupby('author')
for i in (train_author_1.groups):
    group=train_author_1.get_group(i)
    tot=group.shape[0]
    fake=group[group['label']==1].shape[0]
    prob_fake=fake/tot
    prob_not_fake=1-prob_fake
    prob_dict.update({i:[prob_not_fake,prob_fake]})

keys=prob_dict.keys()

train_author_response_code=[]
for author in train['author']:
    if author not in keys:
        train_author_response_code.append([0.5,0.5])
    else:
        train_author_response_code.append(prob_dict.get(author))

cv_author_response_code=[]
for author in cv['author']:
    if author not in keys:
        cv_author_response_code.append([0.5,0.5])
    else:
        cv_author_response_code.append(prob_dict.get(author))
        
test_author_response_code=[]
for author in test['author']:
    if author not in keys:
        test_author_response_code.append([0.5,0.5])
    else:
        test_author_response_code.append(prob_dict.get(author))

In [37]:
#saving the probability dictionary
with open("prob_dict.pickle","wb") as fp:
    pickle.dump(prob_dict,fp,protocol=pickle.HIGHEST_PROTOCOL)

#### 6.3. text - TFIDF Vectorization

In [39]:
text_tfidf_vectorizer = TfidfVectorizer(min_df=10)
train_text_tfidf=text_tfidf_vectorizer.fit_transform(train['text'].values)
cv_text_tfidf=text_tfidf_vectorizer.transform(cv['text'].values)
test_text_tfidf=text_tfidf_vectorizer.transform(test['text'].values)

In [40]:
#saving text_tfidf_vectorizer
with open("text_tfidf_vectorizer.pickle","wb") as fp:
    pickle.dump(text_tfidf_vectorizer,fp,protocol=pickle.HIGHEST_PROTOCOL)

In [41]:
text_tfidf_vectorizer.get_feature_names()[:10]

['aa',
 'aaron',
 'aaronkleinshow',
 'ab',
 'aba',
 'aback',
 'abadi',
 'abandon',
 'abandoned',
 'abandoning']

#### 6.4. Word2Vec

In [42]:
from gensim.models import Word2Vec

In [43]:
# Reading glove vectors in python: https://stackoverflow.com/a/38230349/4084039
#using pre-trained glove model
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    f = open(gloveFile,'r',errors = 'ignore',encoding="utf8")
    model = {}
    for line in tqdm(f):
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.asarray([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
model = loadGloveModel("glove.42B.300d.txt")

1166it [00:00, 10486.95it/s]

Loading Glove Model


1917494it [02:47, 11448.18it/s]

Done. 1917494  words loaded!





In [44]:
#saving the glove model
with open("model.pickle","wb") as fp:
    pickle.dump(model,fp,protocol=pickle.HIGHEST_PROTOCOL)

In [45]:
glove_words =  set(model.keys())

#### 6.4.1. title - Word2Vec

In [46]:
avg_w2v_vectors_title_train = []; # the avg-w2v for each title is stored in this list
avg_w2v_vectors_title_cv = [];
avg_w2v_vectors_title_test = [];

for sentance in tqdm(train['title'].values): # for each title
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_title_train.append(vector)
avg_w2v_vectors_title_train=np.array(avg_w2v_vectors_title_train)

for sentance in tqdm(cv['title'].values): # for each title
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_title_cv.append(vector)
avg_w2v_vectors_title_cv=np.array(avg_w2v_vectors_title_cv)

for sentance in tqdm(test['title'].values): # for each title
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_title_test.append(vector)
avg_w2v_vectors_title_test=np.array(avg_w2v_vectors_title_test)

100%|██████████| 10138/10138 [00:00<00:00, 15087.73it/s]
100%|██████████| 4345/4345 [00:00<00:00, 15204.97it/s]
100%|██████████| 6208/6208 [00:00<00:00, 14999.67it/s]


#### 6.4.2. text - Word2Vec

In [47]:
avg_w2v_vectors_text_train = []; # the avg-w2v for each text is stored in this list
avg_w2v_vectors_text_cv = [];
avg_w2v_vectors_text_test = [];

for sentance in tqdm(train['text'].values): # for each text
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a text
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_text_train.append(vector)
avg_w2v_vectors_text_train=np.array(avg_w2v_vectors_text_train)

for sentance in tqdm(cv['text'].values): # for each text
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a text
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_text_cv.append(vector)
avg_w2v_vectors_text_cv=np.array(avg_w2v_vectors_text_cv)

for sentance in tqdm(test['text'].values): # for each text
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentance: # for each word in a text
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_text_test.append(vector)
avg_w2v_vectors_text_test=np.array(avg_w2v_vectors_text_test)

100%|██████████| 10138/10138 [00:30<00:00, 331.89it/s]
100%|██████████| 4345/4345 [00:13<00:00, 332.09it/s]
100%|██████████| 6208/6208 [00:18<00:00, 335.60it/s]


### 7. Combining all Encoded features

In [48]:
from scipy.sparse import hstack

#### 7.1. Combining TFIDF encoded features

In [49]:
train_data_final_tfidf=hstack((train_title_tfidf,train_author_response_code,train_text_tfidf))
cv_data_final_tfidf=hstack((cv_title_tfidf,cv_author_response_code,cv_text_tfidf))
test_data_final_tfidf=hstack((test_title_tfidf,test_author_response_code,test_text_tfidf))

In [50]:
train_data_final_tfidf.shape,cv_data_final_tfidf.shape,test_data_final_tfidf.shape

((10138, 23423), (4345, 23423), (6208, 23423))

#### 7.2. Combining Word2Vec encoded features

In [51]:
train_data_final_w2v=np.concatenate((avg_w2v_vectors_title_train,np.array(train_author_response_code),avg_w2v_vectors_text_train),axis=1)
cv_data_final_w2v=np.concatenate((avg_w2v_vectors_title_cv,np.array(cv_author_response_code),avg_w2v_vectors_text_cv),axis=1)
test_data_final_w2v=np.concatenate((avg_w2v_vectors_title_test,np.array(test_author_response_code),avg_w2v_vectors_text_test),axis=1)

In [52]:
#standardizing the data
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
train_data_final_w2v=scaler.fit_transform(train_data_final_w2v)
cv_data_final_w2v=scaler.transform(cv_data_final_w2v)
test_data_final_w2v=scaler.transform(test_data_final_w2v)

In [53]:
#saving the standard scaler
with open("scaler.pickle","wb") as fp:
    pickle.dump(scaler,fp,protocol=pickle.HIGHEST_PROTOCOL)

### 8. Modelling

In [54]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score,f1_score,classification_report

#### 8.1. Multinomail Naive Bayes with TFIDF encoded features

In [55]:
alpha_range=[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
for i in alpha_range:
    nb_clf=MultinomialNB(alpha=i, fit_prior=True)
    nb_clf.fit(train_data_final_tfidf,train_output)
    
    train_prob=nb_clf.predict_proba(train_data_final_tfidf)[:,1]
    train_AUC=roc_auc_score(train_output,train_prob)
    print("For alpha=%f, train AUC=%f" % (i,train_AUC))
    
    cv_prob=nb_clf.predict_proba(cv_data_final_tfidf)[:,1]
    cv_AUC=roc_auc_score(cv_output,cv_prob)
    print("For alpha=%f, cv AUC=%f" % (i,cv_AUC))
    
    train_scores=nb_clf.predict(train_data_final_tfidf)
    train_f1=f1_score(train_output,train_scores)
    print("For alpha=%f, train f1 score=%f" % (i,train_f1))
    
    cv_scores=nb_clf.predict(cv_data_final_tfidf)
    cv_f1=f1_score(cv_output,cv_scores)
    print("For alpha=%f, cv f1 score=%f" % (i,cv_f1))

    print("-"*50)

For alpha=0.000010, train AUC=0.999995
For alpha=0.000010, cv AUC=0.998262
For alpha=0.000010, train f1 score=0.999010
For alpha=0.000010, cv f1 score=0.979391
--------------------------------------------------
For alpha=0.000100, train AUC=0.999995
For alpha=0.000100, cv AUC=0.998986
For alpha=0.000100, train f1 score=0.999010
For alpha=0.000100, cv f1 score=0.982948
--------------------------------------------------
For alpha=0.001000, train AUC=0.999994
For alpha=0.001000, cv AUC=0.999408
For alpha=0.001000, train f1 score=0.999010
For alpha=0.001000, cv f1 score=0.986474
--------------------------------------------------
For alpha=0.010000, train AUC=0.999994
For alpha=0.010000, cv AUC=0.999616
For alpha=0.010000, train f1 score=0.999010
For alpha=0.010000, cv f1 score=0.987867
--------------------------------------------------
For alpha=0.100000, train AUC=0.999993
For alpha=0.100000, cv AUC=0.999662
For alpha=0.100000, train f1 score=0.998911
For alpha=0.100000, cv f1 score=0.990

We can observe that with alpha=1, we got the best AUC and f1_score.

In [56]:
nb_clf_best=MultinomialNB(alpha=1, fit_prior=True)
nb_clf_best.fit(train_data_final_tfidf,train_output)

test_prob=nb_clf_best.predict_proba(test_data_final_tfidf)[:,1]
test_AUC=roc_auc_score(test_output,test_prob)
print("For alpha=0.1, test AUC=%f" % (test_AUC))
    
test_scores=nb_clf_best.predict(test_data_final_tfidf)
test_f1=f1_score(test_output,test_scores)
print("For alpha=0.1, test f1 score=%f" % (test_f1))

For alpha=0.1, test AUC=0.999408
For alpha=0.1, test f1 score=0.989590


In [57]:
print(classification_report(test_output,test_scores,target_names=['Label 0','Label 1']))

              precision    recall  f1-score   support

     Label 0       0.98      1.00      0.99      3116
     Label 1       1.00      0.98      0.99      3092

    accuracy                           0.99      6208
   macro avg       0.99      0.99      0.99      6208
weighted avg       0.99      0.99      0.99      6208



#### 8.2. Logistic Regression with Word2Vec encoded features

In [58]:
c_range=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
for i in c_range:
    logistic_clf=LogisticRegression(C=i,max_iter=300)
    logistic_clf.fit(train_data_final_w2v,train_output)
    
    train_prob=logistic_clf.predict_proba(train_data_final_w2v)[:,1]
    train_AUC=roc_auc_score(train_output,train_prob)
    print("For C=%f, train AUC=%f" % (i,train_AUC))
    
    cv_prob=logistic_clf.predict_proba(cv_data_final_w2v)[:,1]
    cv_AUC=roc_auc_score(cv_output,cv_prob)
    print("For C=%f, cv AUC=%f" % (i,cv_AUC))
    
    train_scores=logistic_clf.predict(train_data_final_w2v)
    train_f1=f1_score(train_output,train_scores)
    print("For C=%f, train f1 score=%f" % (i,train_f1))
    
    cv_scores=logistic_clf.predict(cv_data_final_w2v)
    cv_f1=f1_score(cv_output,cv_scores)
    print("For C=%f, cv f1 score=%f" % (i,cv_f1))

    print("-"*50)

For C=0.000100, train AUC=0.988264
For C=0.000100, cv AUC=0.980405
For C=0.000100, train f1 score=0.953616
For C=0.000100, cv f1 score=0.929505
--------------------------------------------------
For C=0.001000, train AUC=0.999853
For C=0.001000, cv AUC=0.996697
For C=0.001000, train f1 score=0.997722
For C=0.001000, cv f1 score=0.973856
--------------------------------------------------
For C=0.010000, train AUC=0.999894
For C=0.010000, cv AUC=0.997323
For C=0.010000, train f1 score=0.998714
For C=0.010000, cv f1 score=0.975450
--------------------------------------------------
For C=0.100000, train AUC=0.999924
For C=0.100000, cv AUC=0.995751
For C=0.100000, train f1 score=0.998714
For C=0.100000, cv f1 score=0.966313
--------------------------------------------------
For C=1.000000, train AUC=0.999920
For C=1.000000, cv AUC=0.993544
For C=1.000000, train f1 score=0.999010
For C=1.000000, cv f1 score=0.955160
--------------------------------------------------
For C=10.000000, train AU

We can observe that with c=0.01, we got good AUC, f1_score

In [59]:
logistic_clf_best=LogisticRegression(C=0.01,max_iter=300)
logistic_clf_best.fit(train_data_final_w2v,train_output)

test_prob=logistic_clf_best.predict_proba(test_data_final_w2v)[:,1]
test_AUC=roc_auc_score(test_output,test_prob)
print("For C=0.01, test AUC=%f" % (test_AUC))
    
test_scores=logistic_clf_best.predict(test_data_final_w2v)
test_f1=f1_score(test_output,test_scores)
print("For C=0.01, test f1 score=%f" % (test_f1))

For C=0.01, test AUC=0.996436
For C=0.01, test f1 score=0.970540


In [60]:
print(classification_report(test_output,test_scores,target_names=['Label 0','Label 1']))

              precision    recall  f1-score   support

     Label 0       0.96      0.98      0.97      3116
     Label 1       0.98      0.96      0.97      3092

    accuracy                           0.97      6208
   macro avg       0.97      0.97      0.97      6208
weighted avg       0.97      0.97      0.97      6208



### 9. Conclusion

After observing the classification reports of the above two models, we can say that MultiNomial Naive Bayes Model provided best results on test data.

__test AUC=0.999408__<br>
__test f1 score=0.98959__


In [61]:
#saving the best model
with open("nb_clf_best.pickle","wb") as fp:
    pickle.dump(nb_clf_best,fp,protocol=pickle.HIGHEST_PROTOCOL)