# Fake News Classifier

## A Python project by Hritik Bhandari

This project distinguishes fake news from real news using Python and its libraries. The dataset has the shape 6335×4: the first column identifies the news item, the second and third hold its title and text, and the fourth holds a label marking the news as REAL or FAKE.

1. Importing the required libraries and packages

In [1]:
import numpy as np
import pandas as pd
import itertools
import seaborn as sns
import matplotlib.pyplot as plt




2. Reading the data from the csv file

In [2]:
df = pd.read_csv('news.csv')

3. Now, I will describe the dataset and preprocess it

In [3]:
print(df.shape)
df.head()

(6335, 4)


Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [4]:
# Call the method; a bare `df.describe` only echoes the bound method object.
df.describe()

In [5]:
# Call the method; a bare `df.info` only echoes the bound method object.
df.info()

4. The label column tells us whether each news item is REAL or FAKE. I store it in a new variable called 'labels', which I will use to train and test my model.

In [6]:
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

5. Now, I will split the data into train and test sets (80% and 20% of the rows), so that the model is trained on one part and the accuracy of its predictions is checked on the other.

In [7]:
from sklearn.model_selection import train_test_split 

In [8]:
x_train, x_test, y_train, y_test = train_test_split(df.text, labels, test_size=0.2, random_state=7)
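As a rough illustration of what a shuffled 80/20 split does, the sketch below splits a toy list using only the standard library. The helper `simple_split` and the toy `articles` list are hypothetical, and this is not how scikit-learn is implemented: `train_test_split` uses its own random permutation, so the same seed will not select the same rows.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=7):
    """Hypothetical helper: shuffle the indices, then cut off a test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = math.ceil(len(items) * test_size)  # number of held-out items
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

articles = [f"article {i}" for i in range(10)]  # toy stand-in for df.text
train, test = simple_split(articles)
print(len(train), len(test))  # 8 2
```

Fixing the seed, as `random_state=7` does in the cell above, makes the split repeatable from run to run.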

6. Let’s initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7. Stop words are very common words in a language (such as "the" or "and") that are filtered out before processing, and a maximum document frequency of 0.7 additionally discards terms that appear in more than 70% of the documents. The TfidfVectorizer then turns the collection of raw documents into a matrix of TF-IDF features.

I will fit the vectorizer on the train set and transform it, then use the same fitted vectorizer to transform the test set.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

In [11]:
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
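To make the document-frequency cutoff and the TF-IDF weighting more concrete, here is a minimal standard-library sketch on a made-up three-sentence corpus. It assumes plain whitespace tokenization, skips the stop-word list, and uses the smoothed IDF formula and L2 normalization that, as far as I know, are scikit-learn's defaults. The names `doc_freq`, `idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "aliens endorsed the president",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# A cutoff of 0.7 keeps only terms found in at most 70% of the documents.
max_df = 0.7
vocab = sorted(t for t, c in doc_freq.items() if c <= max_df * n_docs)

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Term counts restricted to the vocabulary, weighted by IDF, L2-normalized.
    counts = Counter(t for t in tokens if t in vocab)
    weights = {t: n * idf(t) for t, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}  # no vocabulary terms in this document
    return {t: w / norm for t, w in weights.items()}

print(vocab)                # "the" occurs in every document, so it is dropped
print(tfidf(tokenized[2]))  # "aliens" outweighs the more common "president"
```

Note that the vocabulary and IDF values come from the corpus the vectorizer was fitted on, which is why the test set is only transformed and never fitted.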

7. The classification of news can be carried out using many different classification algorithms.

I will be using the Passive-Aggressive Classifier for this project. It is an online-learning algorithm that remains passive when an example is classified correctly and turns aggressive on a misclassification, updating its weights just enough to correct the mistake. Rather than converging to a fixed solution, it keeps adjusting as new examples arrive. It can be implemented very easily using the Scikit-Learn library.

In [12]:
from sklearn.linear_model import PassiveAggressiveClassifier

In [13]:
pac = PassiveAggressiveClassifier(max_iter=50)

In [14]:
pac.fit(tfidf_train, y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
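The passive and aggressive behaviour described in this step can be sketched as a single weight update. This is a simplified illustration of the PA-I update rule with hinge loss, which I believe corresponds to the `loss='hinge'` and `C=1.0` settings shown above. It is not scikit-learn's actual code. The function `pa_update` is hypothetical, and it assumes labels encoded as +1 and -1 (for example FAKE as +1 and REAL as -1).

```python
def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss on this example
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                     # passive: weights stay unchanged
    tau = min(C, loss / sq_norm)           # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
x = [1.0, 0.0]
w = pa_update(w, x, y=+1)   # margin is 0, so the weights move
print(w)                    # [1.0, 0.0]
w = pa_update(w, x, y=+1)   # margin is now 1, so nothing changes
print(w)                    # [1.0, 0.0]
```

A smaller `C` caps the step size and makes each correction more cautious, which can help when some training labels are noisy.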

8. In this step, the trained model predicts a label (REAL or FAKE) for every article in the test set.

In [15]:
y_pred = pac.predict(tfidf_test)

9. Now, I will measure the accuracy score of my model, that is, the fraction of test articles it classified correctly. This tells us how effective this particular algorithm has been.

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
score = accuracy_score(y_test, y_pred)

In [18]:
score_percent = round(score*100, 2)
print(f'{score_percent}%')

92.5%


### The accuracy percentage of this Fake News Classifier is 92.5%

10. As the last step of my project, I will build a confusion matrix to compare the actual and the predicted labels, and finally save the predictions.

In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
Matrix = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

In [21]:
print(Matrix)

[[587  51]
 [ 44 585]]


So, treating FAKE as the positive class, my model gives 587 true positives, 585 true negatives, 44 false positives, and 51 false negatives.
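As a cross-check, the accuracy can be recomputed from the four counts in the printed matrix. The short sketch below does that and also derives precision and recall for the FAKE class. It assumes the usual scikit-learn layout, where rows are the true labels and columns are the predicted labels, in the order `['FAKE', 'REAL']`.

```python
# Counts read off the printed confusion matrix, with FAKE as the positive class.
tp, fn = 587, 51   # true FAKE articles predicted as FAKE / as REAL
fp, tn = 44, 585   # true REAL articles predicted as FAKE / as REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the articles flagged FAKE, how many were FAKE
recall = tp / (tp + fn)      # of the FAKE articles, how many were flagged

print(total)                     # 1267
print(round(accuracy * 100, 2))  # 92.5
print(round(precision, 3), round(recall, 3))
```

The total of 1267 matches the 20% test split of the 6335 rows, and the recomputed accuracy agrees with the score printed in step 9.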


In [22]:
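# np.save writes NumPy's binary .npy format (and appends '.npy' to the file name),
# so np.savetxt is used here to write plain text, one predicted label per line.
np.savetxt('classified.csv', y_pred, fmt='%s')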