**Problem Statement**

The problem is to identify whether a given news item is fake or real.
The independent variables (features) can be the news title or the news text itself; this notebook uses the title.
We will look at different models to solve this problem.
In this notebook we will use a Bag of Words model with TF-IDF.

**Dataset**

The news dataset is taken from Kaggle: https://www.kaggle.com/c/fake-news/.
The dataset is fairly large, containing thousands of rows, so we expect the model to reach good accuracy.

# Step 1 - Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Step 2 - Importing the dataset

In [2]:
data = pd.read_csv('/content/drive/My Drive/Data Science Projects/Dataset Fakenews.csv') # 1 is for fake 0 for real
data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


# Step 3 - Dealing with null values

In [3]:
print(data.isnull().sum())
data.dropna(inplace = True)
data.reset_index(inplace = True)

id           0
title      558
author    1957
text        39
label        0
dtype: int64


In [4]:
print(data.isnull().sum())

index     0
id        0
title     0
author    0
text      0
label     0
dtype: int64
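
The second output above contains a new `index` column: `reset_index(inplace = True)` keeps the old row labels as a column. If that column is not wanted, passing `drop=True` discards it. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and only stands in for the news table):

```python
import pandas as pd

# Hypothetical toy frame with missing values, standing in for the news data
toy = pd.DataFrame({
    "title": ["Some headline", None, "Another headline"],
    "author": ["A. Writer", "B. Writer", None],
    "label": [1, 0, 1],
})

print(toy.isnull().sum())  # Missing values per column

# Drop incomplete rows, then renumber rows from 0 without keeping the old index
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

Either variant works for the rest of the notebook, because `Y` is taken from the last column and the text from the `title` column.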


# Step 4 - Making dependent variable Y

In [5]:
Y = data.iloc[:, -1]
# We will be dealing with Independent Variable X later on

In [6]:
Y.head()

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

# Step 5 - Cleaning the text

In [7]:
corpus = []                                           # Cleaned titles, one string per news item
ps = PorterStemmer()                                  # Create the stemmer once, outside the loop
all_stopwords = stopwords.words('english')            # English stopwords to remove from the news
all_stopwords.remove('not')                           # Keep "not": removing it from a title is not ideal, it changes the meaning
all_stopwords = set(all_stopwords)                    # A set gives fast membership tests
for i in range(0, data.shape[0]):
  review = re.sub('[^a-zA-Z]', ' ', data['title'][i])   # Keep only a-z and A-Z characters
  review = review.lower()                               # Lowercase all characters
  review = review.split()                               # Split the title into words
  review = [ps.stem(word) for word in review if word not in all_stopwords]  # Drop stopwords, stem the rest
  review = ' '.join(review)                             # Convert the word list back to a string
  corpus.append(review)
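
To see what the cleaning does to a single title, the same steps can be wrapped in a small function. This is only a sketch: `clean_title` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set replaces the NLTK list so that the example runs without downloading any NLTK data.

```python
import re
from nltk.stem import PorterStemmer

# Hypothetical, tiny stopword list so the sketch runs without NLTK data;
# the notebook itself uses stopwords.words('english') minus "not"
TOY_STOPWORDS = {"the", "is", "a", "in", "of", "for", "to"}
stemmer = PorterStemmer()

def clean_title(title):
    """Keep letters only, lowercase, drop stopwords, and stem the remaining words."""
    words = re.sub('[^a-zA-Z]', ' ', title).lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in TOY_STOPWORDS)

cleaned = clean_title("The Truth is NOT for Sale in 2016!")
print(cleaned)
```

Digits and punctuation disappear, the stopwords are dropped, and "not" survives because it is not in the stopword set.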

# Step 6 - Creating Bag of Words with TF-IDF

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features = 5000, ngram_range = (1,3)) # Features are 1- to 3-word n-grams; keep the 5000 most frequent
X = tv.fit_transform(corpus).toarray()  # Dense TF-IDF matrix: the independent variable X is ready
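
A small sketch may help to show what `ngram_range = (1,3)` produces. The three titles in `toy_corpus` are made up, written in the same stemmed style as `corpus`; the names `toy_tv` and `X_toy` are assumptions for this illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned titles, in the same stemmed style as `corpus`
toy_corpus = ["new york time report", "poll show trump need vote", "new york poll show"]

toy_tv = TfidfVectorizer(ngram_range=(1, 3))
X_toy = toy_tv.fit_transform(toy_corpus)   # Sparse matrix: one row per title

print(X_toy.shape)                         # (number of titles, number of n-gram features)
print(sorted(toy_tv.vocabulary_))          # Unigrams, bigrams and trigrams found in the toy corpus

# With the default norm='l2', every non-empty row has unit Euclidean length
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

`fit_transform` returns a sparse matrix; the notebook converts it with `.toarray()`, but both classifiers used below also accept sparse input, which may save memory on larger vocabularies.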

# Step 7 - Splitting dataset into Training and Test

In [9]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X, Y, test_size = 0.20, random_state = 0)
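
The split above is purely random. If the share of fake and real news should be the same in both parts, `train_test_split` also accepts a `stratify` argument. A minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 samples, 30 of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)   # 40 training rows and 10 test rows
print(np.bincount(y_te))        # Class counts in the test set
```

With stratification the 60/40 class ratio of the toy labels carries over to the test set.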

# Step 8 - Creating Machine Learning models

In [10]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:

from sklearn.linear_model import PassiveAggressiveClassifier
classifier = PassiveAggressiveClassifier()
classifier.fit(X_train, Y_train)


PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=1000, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
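
Both cells above assign to the same name `classifier`, so the later steps only see the model from whichever cell ran last. One way to keep both models around for a comparison is a dictionary. The sketch below uses made-up titles and labels (`toy_titles`, `toy_labels`) and is not the notebook's actual training run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy titles and labels (1 = fake, 0 = real), only to make the sketch runnable
toy_titles = [
    "shock secret reveal", "secret plot expos", "shock claim viral", "viral hoax reveal",
    "senat vote budget", "court rule appeal", "budget report releas", "senat hear report",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_toy = TfidfVectorizer().fit_transform(toy_titles)

models = {
    "Multinomial NB": MultinomialNB(),
    "Passive Aggressive": PassiveAggressiveClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_toy, toy_labels)
    # Scored on the training data only because the toy set is tiny;
    # with the real data, score on X_test / Y_test instead
    scores[name] = accuracy_score(toy_labels, model.predict(X_toy))

print(scores)
```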

# Step 9 - Predicting results and checking accuracy

In [17]:
yhat = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(Y_test, yhat)
accuracy = accuracy_score(Y_test, yhat)

print(accuracy)
print(cm)

0.9207000273448182
[[1890  150]
 [ 140 1477]]
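
The accuracy can be recomputed by hand from the confusion matrix, which also gives precision and recall for the fake class. The sketch below copies the matrix printed above; treating label 1 (fake) as the positive class is a choice made for this illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predicted labels
cm = np.array([[1890, 150],
               [140, 1477]])

tn, fp, fn, tp = cm.ravel()          # Label 1 (fake) is treated as the positive class

accuracy = (tp + tn) / cm.sum()      # Share of all test titles classified correctly
precision = tp / (tp + fp)           # Of the titles predicted fake, the share that are fake
recall = tp / (tp + fn)              # Of the truly fake titles, the share that were caught

print(accuracy, precision, recall)
```

The accuracy matches the printed value of about 0.921; precision is about 0.908 and recall about 0.913, so the two error types are roughly balanced.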


# Step 10 - Checking which words are most indicative of real and fake news

In [18]:
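feature_names = tv.get_feature_names()  # In scikit-learn 1.0 and later, use tv.get_feature_names_out()
coefficients = classifier.coef_[0]

# Top 20 fake news words: the largest positive coefficients push towards label 1 (fake)
sorted(zip(coefficients, feature_names), reverse = True)[:20]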

[(12.802859970702288, 'hillari'),
 (10.838551735104053, 'comment'),
 (10.263076298814536, 'journal'),
 (9.77526564700751, 'trump need'),
 (9.13576006356684, 'idiot'),
 (8.726138807450635, 'american peopl'),
 (8.535729682723796, 'daesh'),
 (8.500468783884587, 'invad'),
 (8.276049081889935, 'video'),
 (8.126878392741988, 'migrant crisi'),
 (7.714572687667521, 'negoti'),
 (7.520025741141372, 'gap'),
 (7.512996184645416, 'poll show'),
 (7.441698730599627, 'humili'),
 (7.402140791154557, 'bill clinton'),
 (7.27973202765165, 'jame matti'),
 (6.987827688628461, 'report new york'),
 (6.906191740127908, 'dog'),
 (6.744162558797096, 'clinton'),
 (6.736782350174677, 'aleppo')]

In [19]:
# Top 20 real news words: the most negative coefficients push towards label 0 (real)
sorted(zip(coefficients, feature_names), reverse = False)[:20]

[(-43.73383177793861, 'breitbart'),
 (-18.865996662652257, 'new york time'),
 (-18.865996662652257, 'york time'),
 (-13.286545999455958, 'hillari clinton'),
 (-12.91900260809555, 'new york'),
 (-12.61222552418386, 'york'),
 (-10.748890980291863, 'delingpol'),
 (-10.633370055418885, 'penc'),
 (-10.222231066561756, 'gorka'),
 (-9.952497607419604, 'virgil'),
 (-9.863513057260814, 'espn'),
 (-9.13326771375477, 'cartel'),
 (-9.104193188198208, 'ross'),
 (-9.082313000978253, 'town hall'),
 (-9.033985801155316, 'abort'),
 (-8.943108109793014, 'streisand'),
 (-8.837915555065766, 'clinton aid'),
 (-8.807560361386352, 'westminst'),
 (-8.765110368371158, 'march'),
 (-8.606292961885638, 'potenti')]
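
When reading these lists it helps to check which sign belongs to which label. For a binary linear classifier in scikit-learn, positive entries of `coef_[0]` push towards the second class in `classes_`, which is label 1 (fake) here. A minimal sketch on made-up titles, where the hypothetical word "alpha" occurs only in label-1 titles and "beta" only in label-0 titles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical toy data: "alpha" occurs only in label-1 titles, "beta" only in label-0 titles
toy_titles = ["alpha one", "alpha two", "alpha three", "beta four", "beta five", "beta six"]
toy_labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(toy_titles)
clf = PassiveAggressiveClassifier(random_state=0).fit(X_toy, toy_labels)

print(clf.classes_)                        # Class order used by the model
coef = clf.coef_[0]                        # One weight per feature in the binary case
alpha_w = coef[vec.vocabulary_["alpha"]]   # Weight of the word seen only with label 1
beta_w = coef[vec.vocabulary_["beta"]]     # Weight of the word seen only with label 0
print(alpha_w, beta_w)
```

The weight for "alpha" comes out positive and the weight for "beta" negative, so large positive coefficients mark words typical of label 1 and large negative ones mark words typical of label 0.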

# Step 11 - Saving model for future use and deployment

In [20]:
import pickle 

with open("model.pkl", "wb") as filename:
  pickle.dump(classifier, filename)

with open("tv.pkl" , "wb") as filename:
  pickle.dump(tv, filename)
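
For deployment, both files have to be loaded again, and new titles must pass through the same fitted vectorizer with `transform` (not `fit_transform`). The sketch below shows the round trip with a made-up toy model; `tv_toy`, `clf_toy` and the temporary directory are assumptions that stand in for the real `tv`, `classifier` and file locations.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical toy model standing in for the trained `tv` and `classifier`
toy_titles = ["alpha one", "alpha two", "beta three", "beta four"]
toy_labels = [1, 1, 0, 0]
tv_toy = TfidfVectorizer()
clf_toy = PassiveAggressiveClassifier(random_state=0)
clf_toy.fit(tv_toy.fit_transform(toy_titles), toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    tv_path = os.path.join(tmp, "tv.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(clf_toy, f)
    with open(tv_path, "wb") as f:
        pickle.dump(tv_toy, f)

    # Later, for example in a deployed service: load both objects back
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)
    with open(tv_path, "rb") as f:
        loaded_tv = pickle.load(f)

# New titles go through the same fitted vectorizer before prediction
new_titles = ["alpha five", "beta six"]
predictions = loaded_clf.predict(loaded_tv.transform(new_titles))
print(predictions)
```

In a real deployment the new titles would also need the cleaning from Step 5 first. Pickle files should only be loaded from a trusted source, ideally with the same scikit-learn version that wrote them.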

**Conclusion**

Let's compare the accuracies of the different models.

**Multinomial NB**

Accuracy - 0.8881596937380366

**Passive Aggressive Classifier**

Accuracy - 0.9207000273448182

Both models, Multinomial NB and the Passive Aggressive Classifier, perform well on this text classification problem.
Since the Passive Aggressive Classifier gave about 3 percentage points higher accuracy, we choose it for this specific problem.

