# Fake News Detection

## by Justin Sierchio

In this project, we determine whether news articles are legitimate or fake, using first their headlines and then their full text.

This data is in .csv file format and is from Kaggle at: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset/download. More information related to the dataset can be found at https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Upload Datasets for Study
df_NEWS_Fake = pd.read_csv('Fake.csv')
df_NEWS_True = pd.read_csv('True.csv')

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Display 1st 5 rows of the fake news headline dataset
df_NEWS_Fake.head(5)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
# Display 1st 5 rows of the true news headline dataset
df_NEWS_True.head(5)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


There are only four columns for this data:

<ul>
    <li>title: the headline of the article.</li>
    <li>text: the text of the article.</li>
    <li>subject: the subject matter of the article.</li>
    <li>date: the date of the article in Month Day, Year format.</li>
</ul>

## Data Cleaning

Before beginning our detection and analysis, let's make sure this dataset is sufficiently cleaned. Let's check for any 'NaN' or 'null' values.

In [5]:
# Check for 'NaN' or 'null' values
df_NEWS_Fake.isnull().sum()

title      0
text       0
subject    0
date       0
dtype: int64

In [6]:
# Check for 'NaN' or 'null' values
df_NEWS_True.isnull().sum()

title      0
text       0
subject    0
date       0
dtype: int64

We see there are no null values, so we are safe to proceed.
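One caveat: isnull() only flags missing values, so an empty or whitespace-only string would still pass the check above. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `blank_counts` name are illustrative assumptions, not part of the Kaggle dataset) showing how blank strings could be counted alongside nulls.

```python
import pandas as pd

# Hypothetical toy frame: no NaN values, but some blank strings
toy = pd.DataFrame({
    "title": ["Headline A", "Headline B", "   "],
    "text": ["Body of article A", "", " "],
})

# isnull() reports zero missing values for both columns
null_counts = toy.isnull().sum()

# Count entries that are empty once surrounding whitespace is stripped
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts.to_dict())
print(blank_counts.to_dict())
```

The same two lines could be pointed at df_NEWS_Fake and df_NEWS_True if blank entries are a concern.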

## Data Merge

To train a single classifier, we need to merge the two datasets. Let's first add a column to each set identifying whether the article is true or fake.

In [7]:
# Add a column to state whether the article is true or fake news
df_NEWS_Fake['True_or_Fake'] = 1
df_NEWS_True['True_or_Fake'] = 0

In [8]:
# Merge the data frames
df_frames = [df_NEWS_Fake, df_NEWS_True] 
df_NEWS = pd.concat(df_frames) 
display(df_NEWS) 

Unnamed: 0,title,text,subject,date,True_or_Fake
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1
...,...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",0
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",0
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",0
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",0


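Lastly, let's reset the index for the new combined dataset. Note that reset_index() returns a new DataFrame, so the call below only displays the result; df_NEWS itself keeps its original (repeating) index unless the result is assigned back.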

In [9]:
# Reset index for combined dataframe
df_NEWS.reset_index()

Unnamed: 0,index,title,text,subject,date,True_or_Fake
0,0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1
...,...,...,...,...,...,...
44893,21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",0
44894,21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",0
44895,21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",0
44896,21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",0
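As a small illustration of how reset_index() behaves, here is a sketch on two hypothetical toy frames (`fake`, `true` and `combined` are stand-in names, not the notebook's variables). It shows that the index repeats after concatenation and that the reset only persists when the result is assigned.

```python
import pandas as pd

# Hypothetical stand-ins for the two labelled datasets
fake = pd.DataFrame({"title": ["f0", "f1"], "True_or_Fake": 1})
true = pd.DataFrame({"title": ["t0", "t1", "t2"], "True_or_Fake": 0})

combined = pd.concat([fake, true])

# Calling reset_index without assignment leaves the frame untouched
combined.reset_index(drop=True)
index_before = list(combined.index)

# Assigning the result replaces the repeating index with 0..n-1;
# drop=True avoids keeping the old index as an extra column
combined = combined.reset_index(drop=True)
index_after = list(combined.index)

print(index_before)
print(index_after)
```

Passing ignore_index=True to pd.concat would be another way to get a fresh index in one step.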


## Machine Learning Algorithm

Before proceeding, we need to put the "True or Fake" status into a variable.

In [10]:
# Obtain labels from dataset
True_or_Fake = df_NEWS.True_or_Fake
True_or_Fake.head()

0    1
1    1
2    1
3    1
4    1
Name: True_or_Fake, dtype: int64

Here, 1 = Fake and 0 = True.

### Title

Let's begin by splitting the overall dataset into training and test sets.

In [11]:
# Split the dataset into training and test sets
x_trainTITLE, x_testTITLE, y_trainTITLE, y_testTITLE = train_test_split(df_NEWS['title'], True_or_Fake, test_size=0.2, random_state=7)

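Next, we will employ a TF-IDF Vectorizer, which converts each title into a vector of term weights. Setting stop_words='english' removes common stop words (e.g. 'the', 'and'), and max_df=0.7 discards terms that appear in more than 70% of the titles.

In [12]:
# Create a TF-IDF Vectorizer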
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

In [13]:
# Fit, Transform the training set; Transform the Test set
tfidf_trainTITLE = tfidf_vectorizer.fit_transform(x_trainTITLE) 
tfidf_testTITLE = tfidf_vectorizer.transform(x_testTITLE)
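To make the fit/transform split above more concrete, here is a rough pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches scikit-learn's defaults; the tiny `STOP_WORDS` set, the whitespace tokenizer, and the toy documents are simplifying assumptions, and the max_df filter is not modelled.

```python
import math
from collections import Counter

# Tiny stand-in for the English stop-word list (assumption)
STOP_WORDS = {"the", "and", "a", "of"}

def tokenize(doc):
    # Simplified tokenizer: lowercase, split on whitespace, drop stop words
    return [w for w in doc.lower().split() if w not in STOP_WORDS]

def fit_idf(train_docs):
    # Learn the vocabulary and idf weights from the training documents only
    n = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}

def transform(doc, idf):
    # Weight term counts by idf, ignore unseen terms, then L2-normalize
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {t: v / norm for t, v in weights.items()} if norm else {}

train = ["the senate passes the budget",
         "shocking budget scandal",
         "senate vote and budget"]
idf = fit_idf(train)
vec = transform("the shocking senate hoax", idf)
print(vec)
```

The word "hoax" never appears in the training documents, so it is dropped at transform time. This is why the test set is only transformed, never fitted: its vocabulary and weights come entirely from the training set.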

Now we will use a Passive-Aggressive Classifier. The algorithm leaves the model unchanged when a training example is classified correctly with sufficient margin (passive) and updates the weights when it is not (aggressive).
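As a rough illustration of that update rule, the sketch below implements a single passive-aggressive step with hinge loss (the variant often called PA-I) in plain Python. It is a simplified assumption-laden version, not scikit-learn's implementation: labels are taken as +1/-1 rather than the notebook's 1/0, there is no intercept, and `pa_update` and `C` are illustrative names.

```python
def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for weights w, features x, label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                           # passive: prediction is good enough
    tau = min(C, loss / norm_sq)           # aggressive: step size capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)   # loss of 1, so the weights move
w2 = pa_update(w1, [1.0, 0.0], +1)   # margin already 1, so nothing changes
w3 = pa_update(w2, [1.0, 0.0], -1)   # same features, opposite label: corrected
print(w1, w2, w3)
```

The cap C limits how far a single noisy example can pull the weights, which is the usual reason for preferring this form over an uncapped update.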

In [14]:
# Deploy the Passive-Aggressive Classifier
pacTITLE = PassiveAggressiveClassifier(max_iter=50)
pacTITLE.fit(tfidf_trainTITLE,y_trainTITLE)

# Predict on the test set and compute the accuracy
y_predTITLE = pacTITLE.predict(tfidf_testTITLE)
scoreTITLE = accuracy_score(y_testTITLE,y_predTITLE)
print(f'Title Accuracy: {round(scoreTITLE*100,2)}%')

Title Accuracy: 93.98%


So our model obtained about 94% accuracy on the titles in the test set.
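Accuracy alone does not show which kind of mistake the model makes. The confusion_matrix function imported at the top of the notebook can break the errors down by class. The sketch below uses hypothetical toy labels (not the notebook's predictions) with the same convention of 1 = Fake and 0 = True.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for five articles (1 = Fake, 0 = True)
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
true_as_true, true_as_fake, fake_as_true, fake_as_fake = cm.ravel()

acc = accuracy_score(y_true, y_pred)
print(cm)
print(f'Accuracy: {round(acc*100,2)}%')
```

In the notebook itself, the same call could be applied to y_testTITLE and y_predTITLE to see how many fake headlines were passed as true and vice versa.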

### Text

Now let's look at the text of each of the articles. We'll repeat the process as before by starting with a training and test set split.

In [15]:
# Split the dataset into training and test sets
x_trainTEXT, x_testTEXT, y_trainTEXT, y_testTEXT = train_test_split(df_NEWS['text'], True_or_Fake, test_size=0.2, random_state=7)

Next, we employ the TF-IDF Vectorizer as before.

In [16]:
# Create a TF-IDF Vectorizer
tfidf_vectorizer2 = TfidfVectorizer(stop_words='english', max_df=0.7)

In [17]:
# Fit, Transform the training set; Transform the Test set
tfidf_trainTEXT = tfidf_vectorizer2.fit_transform(x_trainTEXT) 
tfidf_testTEXT = tfidf_vectorizer2.transform(x_testTEXT)

Now we use the Passive-Aggressive Classifier as before.

In [18]:
# Deploy the Passive-Aggressive Classifier
pacTEXT = PassiveAggressiveClassifier(max_iter=50)
pacTEXT.fit(tfidf_trainTEXT,y_trainTEXT)

# Predict on the test set and compute the accuracy
y_predTEXT = pacTEXT.predict(tfidf_testTEXT)
scoreTEXT = accuracy_score(y_testTEXT,y_predTEXT)
print(f'Text Accuracy: {round(scoreTEXT*100,2)}%')

Text Accuracy: 99.43%


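We see that the text gave us an accuracy of over 99%. A possible reason the text outperformed the title is that each article body contains far more words than its headline, giving the classifier more terms that separate "true" from "fake" articles. The Reuters dateline at the start of the true articles shown earlier may also be an easy cue for the classifier.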
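Since the title and text runs repeat the same vectorize-then-classify steps, they could be wrapped in a single scikit-learn pipeline. The sketch below is one possible arrangement, not the approach used in this notebook: `build_detector` is a hypothetical helper, the toy texts and labels are made up for illustration, and random_state=7 is added only to make the toy run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

def build_detector():
    # Hypothetical helper: same vectorizer and classifier settings as above
    return make_pipeline(
        TfidfVectorizer(stop_words='english', max_df=0.7),
        PassiveAggressiveClassifier(max_iter=50, random_state=7),
    )

# Made-up toy data (1 = Fake, 0 = True)
toy_texts = [
    "shocking secret exposed celebrity scandal",
    "unbelievable miracle cure doctors hate",
    "shocking miracle scandal exposed",
    "senate passes budget bill vote",
    "central bank raises interest rates",
    "senate vote budget interest rates",
]
toy_labels = [1, 1, 1, 0, 0, 0]

detector = build_detector()
detector.fit(toy_texts, toy_labels)   # fits the vectorizer, then the classifier
preds = detector.predict(["shocking celebrity scandal", "senate budget vote"])
print(preds)
```

With this arrangement, calling build_detector() once for df_NEWS['title'] and once for df_NEWS['text'] would keep the two experiments identical apart from their input column.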

## Conclusions

The goal of this project was to explore and detect fake news articles. We successfully deployed both TF-IDF Vectorizers and Passive-Aggressive Classifiers, showing accuracies of 94% and 99% for the titles and text, respectively.

In this project, we were able to upload a real-world dataset, perform classifications and draw useful conclusions. It is the author's hope that others find this exercise useful. Thanks for reading!