# Fake News Detection
## (Self-Guided Project)

## by Justin Sierchio

In this project, we will determine whether certain news articles are legitimate (REAL) or not (FAKE).

This data is in .csv file format and is from DataFlair at: https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/. More information related to the dataset can be found at the same link.

Note: this is a self-guided project following the tutorial provided by the contributors at DataFlair.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Upload Datasets for Study
df = pd.read_csv('news.csv')

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Display the first 5 rows of the news dataset
df.shape
df.head(5)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


Now we get the labels from the DataFrame:

In [4]:
# Obtain the labels from the DataFrame
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

## Machine Learning Algorithm

At this juncture, we can split the dataset into training and test sets.

In [5]:
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)

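Now we will use what is called a "TF-IDF vectorizer" (term frequency, inverse document frequency). It converts each article into a vector of term weights: a term's weight grows with how often it appears in that article and shrinks with how many articles in the collection contain it. Two settings trim the vocabulary. The stop_words='english' setting removes common English "stop words" (e.g. 'the', 'and'), and max_df=0.7 discards terms that appear in more than 70% of the documents.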
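To make the weighting concrete, here is a minimal pure-Python sketch of the idea. It is not the scikit-learn implementation: the toy corpus and the helper name `tfidf` are made up for illustration, tokenization is a plain whitespace split, and the smoothed IDF formula is assumed to match scikit-learn's default behavior.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from news.csv)
docs = [
    "senate passes budget bill",
    "senate rejects budget amendment",
    "aliens endorse senate candidate",
]

def tfidf(docs, max_df=1.0):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Drop terms that occur in more than max_df (a fraction) of the documents
    vocab = {t for t, count in df.items() if count / n_docs <= max_df}

    weights = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in vocab)
        # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2-normalize so that document length does not dominate
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

weights = tfidf(docs, max_df=0.7)
print(weights[0])
```

In this toy corpus 'senate' occurs in every document, so max_df=0.7 removes it, while 'budget' (two of three documents) is kept but weighted lower than the rarer 'passes' and 'bill'.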

In [6]:
# Start a Tfidf Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and Transform the training set; Transform the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train) 
tfidf_test = tfidf_vectorizer.transform(x_test)

Next, we use what is called a 'Passive-Aggressive Classifier'. This is an online learning algorithm that processes the training examples one at a time and works as follows:

<ul>
    <li>It stays 'passive' when the prediction is correct (by a sufficient margin): the model is kept without any changes.</li>
    <li>It turns 'aggressive' when the prediction is incorrect: the model is updated to correct for the mistake.</li>
</ul>
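The two behaviors can be sketched with a single update step. This is a simplified illustration of the PA-I variant of the algorithm for labels encoded as -1/+1, not scikit-learn's actual code; the function name `pa_update` and the toy vectors are hypothetical.

```python
def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)        # hinge loss
    if loss == 0.0:
        return w                         # passive: leave the model untouched
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return w                         # an all-zero example carries no signal
    tau = min(C, loss / sq_norm)         # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, x=[1.0, 0.0], y=+1)         # loss is 1, so the weights move
w_after = pa_update(w, x=[1.0, 0.0], y=+1)   # loss is now 0, so nothing changes
print(w, w_after)
```

Note that in this formulation a correct prediction that falls inside the margin still triggers a (small) update. The cap C limits how far a single, possibly mislabeled, example can move the weights.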

In [7]:
# Instantiate a Passive-Aggressive Classifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

# Predict on the test set and compute the accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.74%


So we can see that our model obtained an accuracy of almost 93% on the test set. To wrap up this project, let's build a confusion matrix.

In [8]:
# Build a confusion matrix (rows: actual labels, columns: predicted labels)
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[590,  48],
       [ 44, 585]], dtype=int64)

Treating FAKE as the positive class, this matrix shows 590 true positives, 585 true negatives, 44 false positives (REAL articles flagged as FAKE) and 48 false negatives (FAKE articles labeled REAL).
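As a quick check, the four counts can be turned into the usual summary metrics by hand. The sketch below copies the numbers from the matrix output and assumes FAKE is the positive class; the variable names are only illustrative.

```python
# Counts copied from the confusion matrix output
# (rows = actual label, columns = predicted label, order: FAKE, REAL)
cm = [[590, 48],   # actual FAKE: predicted FAKE, predicted REAL
      [44, 585]]   # actual REAL: predicted FAKE, predicted REAL

tp, fn = cm[0]
fp, tn = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all articles labeled correctly
precision = tp / (tp + fp)                   # share of FAKE predictions that are right
recall = tp / (tp + fn)                      # share of FAKE articles that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {round(accuracy * 100, 2)}%')
print(f'Precision: {round(precision * 100, 2)}%')
print(f'Recall:    {round(recall * 100, 2)}%')
print(f'F1 score:  {round(f1 * 100, 2)}%')
```

The accuracy computed this way agrees with the 92.74% reported by accuracy_score, which confirms that the matrix and the score describe the same predictions.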

## Conclusion

The goal of this self-guided project (via DataFlair) was to explore and detect fake news. In this project, we were able to load a real-world dataset, run a machine learning algorithm and present our results. It is the author's hope that others find this exercise useful. Thanks for reading!