# Fake News Detector using TF-IDF

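The aim of this project is to classify news articles as real or fake. The analysis uses TF-IDF (term frequency-inverse document frequency), which weights each word by how often it appears in an article relative to how common it is across all articles, so that a classifier can pick up differences in word usage between real and fake news.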
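To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The three sentences are hypothetical and not taken from the dataset, and the formula is the smoothed IDF with L2 normalisation that scikit-learn documents as its default. It is meant as an illustration, not as a reimplementation of `TfidfVectorizer` (no stop-word removal or `max_df` filtering is done here).

```python
import math
from collections import Counter

# Hypothetical toy corpus; these sentences are not from news.csv.
docs = [
    "senate passes budget bill",
    "shocking secret senate hides",
    "budget bill vote delayed",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second toy document, "senate" also occurs in another document, so it receives a lower weight than "shocking", "secret" or "hides", which are unique to that document.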

The data used has roughly 6,300 entries (the 20% test split evaluated below contains 1,267 articles) and four columns: an id for the news item, its title, its text, and a label stating whether the news is real or fake.

In [10]:
# Imports

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [11]:
# Read in the data

df = pd.read_csv('data/news.csv')

# Peek into the data to check its format
df.shape
df.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [12]:
# Get the labels of the data
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [13]:
# Perform a train test split with 0.8 train and 0.2 test
# Seed is set for reproducibility
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size = 0.2, random_state = 7)

# Initialize TfidfVectorizer: drop English stop words and ignore terms
# that appear in more than 70% of the documents (max_df = 0.7)
tfidf_vectorizer = TfidfVectorizer(stop_words = "english", max_df = 0.7)

# Fit the vocabulary on the train set only, then transform both train and test sets
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)

Next, a PassiveAggressiveClassifier is fitted on the training features and used to make predictions on the test set.
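For intuition about what a passive-aggressive classifier does while fitting, here is a minimal sketch of the textbook PA-I update rule in plain Python. The function `pa_update`, the parameter `C` and the two-feature samples are hypothetical illustrations. scikit-learn's implementation works on sparse matrices over several epochs and differs in detail, so this is a sketch of the idea rather than the library's code.

```python
def pa_update(weights, x, y, C=1.0):
    """Apply one PA-I step. y is +1 or -1; x and weights are equal-length lists."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        # Passive: the example is already classified with enough margin.
        return weights
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return weights  # an all-zero example carries no information
    # Aggressive: move just far enough to fix the margin, capped by C.
    tau = min(C, loss / sq_norm)
    return [w + tau * y * xi for w, xi in zip(weights, x)]


# Hypothetical two-feature examples: (feature vector, label).
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([0.9, 0.1], 1)]

weights = [0.0, 0.0]
for x, y in samples:
    weights = pa_update(weights, x, y)

print(weights)
```

When the step size is not capped by `C`, the update leaves the current example with a margin of exactly 1, which is why the rule is called aggressive. Examples that already satisfy the margin leave the weights unchanged.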

In [15]:
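# Initialize PassiveAggressiveClassifier and fit it on the training features
# (no random_state is set here, so the exact accuracy can vary slightly between runs)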

pac = PassiveAggressiveClassifier(max_iter = 50)
pac.fit(tfidf_train, y_train)

# Predict on the test set and calculate the accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

# Build a confusion matrix to check whether the classifier leans towards false positives or false negatives
# Rows are true labels and columns are predicted labels, both in the order FAKE, REAL
confusion_matrix(y_test, y_pred, labels = ['FAKE', 'REAL'])


Accuracy: 92.98%


array([[592,  46],
       [ 43, 586]])

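That is quite a good accuracy for such a simple classifier! Reading the confusion matrix with rows as true labels and columns as predictions, 592 fake and 586 real articles were classified correctly, while 46 fake articles were labelled real and 43 real articles were labelled fake. The errors are therefore nearly balanced between false positives and false negatives.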
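The accuracy and the balance of errors can be re-derived directly from the four counts in the confusion matrix. The sketch below copies those counts from the output above and treats FAKE as the positive class, which is an assumption made only for naming precision and recall.

```python
# Counts copied from the confusion matrix above
# (rows: true label, columns: predicted label, order FAKE, REAL).
fake_as_fake, fake_as_real = 592, 46
real_as_fake, real_as_real = 43, 586

total = fake_as_fake + fake_as_real + real_as_fake + real_as_real
accuracy = (fake_as_fake + real_as_real) / total

# Treating FAKE as the positive class (an assumption for illustration):
precision_fake = fake_as_fake / (fake_as_fake + real_as_fake)
recall_fake = fake_as_fake / (fake_as_fake + fake_as_real)

print(f"Test articles:    {total}")
print(f"Accuracy:         {accuracy:.2%}")
print(f"Precision (FAKE): {precision_fake:.2%}")
print(f"Recall (FAKE):    {recall_fake:.2%}")
```

Precision and recall for the FAKE class land within about half a percentage point of each other, which backs up the reading that the classifier does not favour one type of error.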