# Fake news Classifier 📰

This notebook looks into various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting weather or not a news is fake.

we are going to take following approch:

1. problem defination
2. data
3. evaluation
4. features
5. modelling

### 1. Problem

Build a system to identify unreliable news articles.

### 2. Data

Original data came from the kaggle competition https://www.kaggle.com/c/fake-news/data 

1. train.csv: A full training dataset with the following attributes:

  * id: unique id for a news article
  * title: the title of a news article
  * author: author of the news article
  * text: the text of the article; could be incomplete
  * label: a label that marks the article as potentially unreliable
     1: unreliable   0: reliable

2. test.csv: A testing training dataset with all the same attributes at train.csv without the label.

### 3. Evaluation

The evaluation metric for this competition is accuracy, a very straightforward metric.

Accuracy measures false positives and false negeatives equally, and really should only be used in simple cases and when classes are of (generally) equal class size

### 4. Features

  * id: unique id for a news article
  * title: the title of a news article
  * author: author of the news article
  * text: the text of the article; could be incomplete
  * label: a label that marks the article as potentially unreliable
     1: unreliable   0: reliable

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv('/content/drive/My Drive/fake-news-classifier/train.csv')
df

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


In [5]:
# Split dsata into independent and dependent variables
x = df.drop('label', axis =1)
y = df['label']

In [6]:
df.shape

(20800, 5)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [9]:
# Check weather there are null values or not
df.isna().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [10]:
df.fillna('Missing data', inplace=True)

In [11]:
df.isna().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [None]:
df['text'][10]

'Organizing for Action, the activist group that morphed from Barack Obama’s first presidential campaign, has partnered with the   Indivisible Project for “online trainings” on how to protest President Donald Trump’s agenda. [Last week, Breitbart News extensively reported that Indivisible leaders are openly associated with groups financed by billionaire George Soros.  Politico earlier this month profiled Indivisible in an article titled, “Inside the protest movement that has Republicans reeling. ”  The news agency not only left out the Soros links, but failed to note that the organizations cited in its article as helping to amplify Indivisible’s message are either financed directly by Soros or have close ties to groups funded by the billionaire, as Breitbart News documented. Organizing for Action (OFA) is a   community organizing project that sprung from Obama’s 2012 campaign organization, Organizing for America, becoming a nonprofit described by the Washington Post as “advocate[ing] fo

## Stemming with BOW

To preprocess data we remove all special characters and use stopwords to remove words present in stopword dictionary
Then we use contVectorizer to convert words into Vectors

In [14]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
nltk.download('stopwords')
ps = PorterStemmer()
lm = WordNetLemmatizer()
corpus = []
for i in range(0, len(df)):
  review = re.sub('[^a-zA-Z]', ' ', df['title'][i])
  review = review.lower()
  review = review.split()

  review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
  review = ' '.join(review)
  corpus.append(review)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
corpus

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'life life luxuri elton john favorit shark pictur stare long transcontinent flight',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'excerpt draft script donald trump q ampa black church pastor new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag',
 'russian research discov secret nazi militari base treasur hunter arctic photo',
 'us offici see link trump russia',
 'ye paid govern troll social media blog forum websit',
 'major leagu soccer argentin find hom

In [16]:
cv = CountVectorizer(max_features= 5000, ngram_range=(1,3))
x = cv.fit_transform(corpus).toarray()

In [17]:
from sklearn.model_selection import  train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 30)

In [None]:
cv.get_feature_names()[:20]

['abandon',
 'abc',
 'abc news',
 'abduct',
 'abe',
 'abedin',
 'abl',
 'abort',
 'abroad',
 'absolut',
 'absurd',
 'abus',
 'abus new',
 'abus new york',
 'academi',
 'accept',
 'access',
 'access pipelin',
 'access pipelin protest',
 'accid']

In [None]:
count_df = pd.DataFrame(xtrain, columns= cv.get_feature_names())
count_df.head(10)

Unnamed: 0,abandon,abc,abc news,abduct,abe,abedin,abl,abort,abroad,absolut,absurd,abus,abus new,abus new york,academi,accept,access,access pipelin,access pipelin protest,accid,accident,accord,account,accus,accus trump,achiev,acknowledg,acknowledg emf,acknowledg emf damag,aclu,acquit,acquitt,acr,across,act,act like,act new,act new york,action,activ,...,xi,xi jinp,yahoo,yale,ye,year,year ago,year breitbart,year eve,year later,year new,year new york,year old,year old girl,yemen,yemeni,yet,yet anoth,yiannopoulo,yield,york,york citi,york new,york new york,york time,yorker,young,youth,youtub,zealand,zero,zika,zika viru,zionist,zone,zone new,zone new york,zoo,zu,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### MultinominalNB

In [None]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha= 0.4)
clf.fit(xtrain, ytrain)

MultinomialNB(alpha=0.4, class_prior=None, fit_prior=True)

In [None]:
ypred = clf.predict(xtest)
ypred

array([0, 1, 1, ..., 0, 0, 0])

In [None]:
ytest

18432    1
14787    1
4143     1
13744    1
16140    0
        ..
3957     1
14638    0
16170    0
8961     0
14941    0
Name: label, Length: 4160, dtype: int64

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(ytest, ypred)

array([[1893,  181],
       [ 149, 1937]])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)

0.9206730769230769

### PassiveAggressiveClassifier

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
clf1 = PassiveAggressiveClassifier(random_state=23)
clf1.fit(xtrain, ytrain)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=1000, n_iter_no_change=5,
                            n_jobs=None, random_state=23, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

In [None]:
ypred1 = clf1.predict(xtest)
ypred1

array([1, 1, 0, ..., 0, 0, 1])

In [None]:
accuracy_score(ytest, ypred1)

0.9223557692307692

In [None]:
confusion_matrix(ytest, ypred1)

array([[1905,  169],
       [ 154, 1932]])

## LSTM

In [None]:
from tensorflow.keras.layers import Embedding, Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot

In [None]:
voc_size = 5000

In [None]:
x = df.drop('label', axis=1)
y = df['label']

In [None]:
import nltk
import re
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):
  review = re.sub('[^a-zA-Z]', ' ', df['title'][i])
  review = review.lower()
  review = review.split()

  review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
  review = ' '.join(review)
  corpus.append(review)


In [None]:
onehot_rep = [one_hot(words, voc_size) for words in corpus]
onehot_rep

[[2906, 866, 700, 693, 203, 2853, 2867, 532, 3884, 4120],
 [4774, 2769, 4182, 551, 775, 673, 4035],
 [737, 1752, 4083, 3229],
 [1515, 1824, 2537, 576, 673, 4726],
 [4320, 775, 4486, 2792, 1111, 1279, 775, 347, 464, 3097],
 [3379,
  466,
  1329,
  1439,
  2784,
  279,
  2534,
  3011,
  3151,
  4262,
  3242,
  2950,
  1010,
  4864,
  4035],
 [147, 147, 3452, 2481, 2085, 1126, 451, 493, 3582, 1318, 962, 913],
 [2625, 1697, 4696, 4023, 4465, 3981, 1297, 1158, 3633, 751, 597],
 [29, 666, 2502, 668, 279, 3657, 2690, 570, 2675, 51, 3633, 751, 597],
 [4255, 1205, 3846, 4001, 1386, 2710, 279, 974, 3633, 751, 597],
 [884, 1079, 236, 1908, 3895, 3285, 3223, 4952, 279, 4442],
 [212, 3435, 1380, 2145, 1683, 855, 500, 3905],
 [4549, 1102, 3148, 3434, 2194, 1788, 1027, 4363, 3269, 2196, 4216],
 [576, 783, 203, 3285, 279, 1386],
 [3958, 53, 2473, 1238, 3299, 404, 2829, 2782, 3663],
 [2258, 3692, 4953, 1759, 4857, 2525, 3011, 3633, 751, 597],
 [3698, 3373, 4402, 50, 314, 3633, 751, 597],
 [1354, 2819, 

In [None]:
sen_length = 40
embedded_doc = pad_sequences(onehot_rep, padding= 'pre', maxlen= sen_length)
embedded_doc

array([[   0,    0,    0, ...,  532, 3884, 4120],
       [   0,    0,    0, ...,  775,  673, 4035],
       [   0,    0,    0, ..., 1752, 4083, 3229],
       ...,
       [   0,    0,    0, ..., 3633,  751,  597],
       [   0,    0,    0, ..., 1407, 3476, 1767],
       [   0,    0,    0, ..., 1191, 1511,  141]], dtype=int32)

In [None]:
embedding_vector_features = 40
model= Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sen_length))
model.add(LSTM(100))
model.add(Dense(1, activation ='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 40, 40)            200000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               56400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________


In [None]:
import numpy as np

xfinal = np.array(embedded_doc)
yfinal = np.array(y)

In [None]:
xfinal

array([[   0,    0,    0, ...,  532, 3884, 4120],
       [   0,    0,    0, ...,  775,  673, 4035],
       [   0,    0,    0, ..., 1752, 4083, 3229],
       ...,
       [   0,    0,    0, ..., 3633,  751,  597],
       [   0,    0,    0, ..., 1407, 3476, 1767],
       [   0,    0,    0, ..., 1191, 1511,  141]], dtype=int32)

In [None]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(xfinal, yfinal, test_size = 0.2, random_state = 32)

In [None]:
model.fit(xtrain, ytrain, validation_data=(xtest, ytest), epochs = 10, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fce1b279080>

In [None]:
y_pred = model.predict_classes(xtest)
y_pred

array([[1],
       [0],
       [0],
       ...,
       [1],
       [0],
       [1]], dtype=int32)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(ytest, y_pred)

array([[1914,  144],
       [ 180, 1922]])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_pred)

0.9221153846153847

## Testing data for submission with LSTM

In [None]:
test_df = pd.read_csv('/content/drive/My Drive/fake-news-classifier/test.csv')
test_df.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [None]:
test_df.isna().sum()

id          0
title     122
author    503
text        7
dtype: int64

In [None]:
test_df.fillna('missing', inplace=True)

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
corpus = []
for i in range(0, len(test_df)):
  review = re.sub('[^a-zA-Z]', ' ', df['title'][i])
  review = review.lower()
  review = review.split()

  review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
  review = ' '.join(review)
  corpus.append(review)


In [None]:
onehot_rep = [one_hot(words, voc_size) for words in corpus]
onehot_rep

[[2906, 866, 700, 693, 203, 2853, 2867, 532, 3884, 4120],
 [4774, 2769, 4182, 551, 775, 673, 4035],
 [737, 1752, 4083, 3229],
 [1515, 1824, 2537, 576, 673, 4726],
 [4320, 775, 4486, 2792, 1111, 1279, 775, 347, 464, 3097],
 [3379,
  466,
  1329,
  1439,
  2784,
  279,
  2534,
  3011,
  3151,
  4262,
  3242,
  2950,
  1010,
  4864,
  4035],
 [147, 147, 3452, 2481, 2085, 1126, 451, 493, 3582, 1318, 962, 913],
 [2625, 1697, 4696, 4023, 4465, 3981, 1297, 1158, 3633, 751, 597],
 [29, 666, 2502, 668, 279, 3657, 2690, 570, 2675, 51, 3633, 751, 597],
 [4255, 1205, 3846, 4001, 1386, 2710, 279, 974, 3633, 751, 597],
 [884, 1079, 236, 1908, 3895, 3285, 3223, 4952, 279, 4442],
 [212, 3435, 1380, 2145, 1683, 855, 500, 3905],
 [4549, 1102, 3148, 3434, 2194, 1788, 1027, 4363, 3269, 2196, 4216],
 [576, 783, 203, 3285, 279, 1386],
 [3958, 53, 2473, 1238, 3299, 404, 2829, 2782, 3663],
 [2258, 3692, 4953, 1759, 4857, 2525, 3011, 3633, 751, 597],
 [3698, 3373, 4402, 50, 314, 3633, 751, 597],
 [1354, 2819, 

In [None]:
sen_length = 40
embedded_doc = pad_sequences(onehot_rep, padding= 'pre', maxlen= sen_length)
embedded_doc

array([[   0,    0,    0, ...,  532, 3884, 4120],
       [   0,    0,    0, ...,  775,  673, 4035],
       [   0,    0,    0, ..., 1752, 4083, 3229],
       ...,
       [   0,    0,    0, ..., 4894, 3560, 2642],
       [   0,    0,    0, ..., 2422,  315,  288],
       [   0,    0,    0, ..., 3633,  751,  597]], dtype=int32)

In [None]:
xfinal = np.array(embedded_doc)

In [None]:
y_preds = model.predict_classes(xfinal)
y_preds

array([[1],
       [0],
       [1],
       ...,
       [1],
       [1],
       [0]], dtype=int32)

In [None]:
submission = pd.DataFrame()
submission['id'] = test_df['id']
submission['label'] = y_preds
submission.head()

Unnamed: 0,id,label
0,20800,1
1,20801,0
2,20802,1
3,20803,1
4,20804,1


In [None]:
submission.to_csv('fake-news-identidfier', index=False)