# Detecting Fake News with PassiveAggressiveClassifier

The objective of this project is to build a model that classifies news stories as fake or real, and to assess its accuracy.

The pre-cleaned dataset of political news was made available by data-flair.training.

To build the model, a TfidfVectorizer (Term Frequency - Inverse Document Frequency vectorizer) is applied to the data, turning the dataset's raw documents into a matrix of TF-IDF features. Next, a PassiveAggressiveClassifier is fit on the training data and its accuracy score on the test data is calculated. Finally, the confusion matrix is printed to count the true and false positives and negatives and to learn more about the reliability of the model.
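
The workflow described above can be sketched end to end on a tiny corpus. The four documents and their labels below are invented for illustration and are not taken from the dataset. `make_pipeline` is used here only to chain the two steps compactly; the notebook itself calls the vectorizer and the classifier separately, and `random_state=0` is an added assumption to make the sketch repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for illustration (not taken from news.csv)
toy_docs = [
    "Shocking secret the government does not want you to know",
    "Senate passes annual budget after lengthy debate",
    "Miracle cure discovered, doctors hate this trick",
    "Secretary of State travels to Paris for talks",
]
toy_labels = ["FAKE", "REAL", "FAKE", "REAL"]

# Step 1 (TF-IDF features) and step 2 (classifier) chained into one estimator
toy_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
toy_model.fit(toy_docs, toy_labels)

# Predicting on the training documents only demonstrates the mechanics;
# it says nothing about how well a model generalizes
toy_pred = toy_model.predict(toy_docs)
print(toy_pred)
```

With only four documents the predictions carry no meaning; the point is the order of operations: vectorize, fit, predict.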


## Importing the necessary packages and loading the data

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix 


news_df = pd.read_csv('news.csv', index_col = 0)

## Exploring and preparing the data

In [2]:
news_df.head()

Unnamed: 0,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [3]:
news_df.reset_index(drop = True, inplace = True)

In [4]:
news_df.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [5]:
news_df.shape

(6335, 3)

In [6]:
news_df.columns

Index(['title', 'text', 'label'], dtype='object')

### Data description

The dataset has 6335 rows and 3 columns: 

- **title** - contains titles of news stories
- **text** - contains news documents
- **label** - indicates if the news stories are real or fake

Below, I double-check that the dataset is clean and can be used to build an ML model.

In [7]:
news_df.title.nunique()

6256

In [8]:
news_df.title.isna().sum()

0

In [9]:
news_df.text.nunique()

6060

In [10]:
news_df.text.isna().sum()

0

In [11]:
news_df.label.unique()

array(['FAKE', 'REAL'], dtype=object)

In [12]:
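# Note: drop_duplicates() returns a new DataFrame; the result is not assigned here,
# so news_df itself is left unchanged
news_df.drop_duplicates()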

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...
6330,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL
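
`DataFrame.drop_duplicates()` returns a new frame rather than modifying the original, so a bare call only displays the de-duplicated result. If the goal were to actually remove duplicate rows, the result would need to be assigned. A minimal sketch on a hypothetical toy frame (the column names mirror the dataset, the rows are invented):

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row (rows 0 and 2)
toy_df = pd.DataFrame({
    "title": ["A", "B", "A"],
    "text": ["first story", "second story", "first story"],
    "label": ["FAKE", "REAL", "FAKE"],
})

# Without assignment the original frame keeps all of its rows
toy_df.drop_duplicates()
print(len(toy_df))  # 3

# Assigning the result (and rebuilding the index) makes the change stick
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
print(len(deduped_df))  # 2
```

Applying this to `news_df` would change the number of rows used for training and testing, so the figures reported further down would no longer match.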


### Preparing the data for model building

In [13]:
# Extracting the labels from the dataframe

labels = news_df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [14]:
# Splitting the data into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(news_df.text, labels, train_size = 0.8, test_size = 0.2, random_state = 5)
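
Fixing `random_state` makes the shuffle repeatable, so rerunning the notebook yields the same partition. A small sketch with hypothetical stand-in data (ten invented "stories" instead of `news_df.text`) illustrates both the 80/20 sizes and the reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for news_df.text and news_df.label
toy_texts = [f"story {i}" for i in range(10)]
toy_labels = ["FAKE", "REAL"] * 5

# Two calls with the same random_state produce the same partition
split_a = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=5)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=5)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))  # 8 2
print(split_a == split_b)    # True
```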

In [15]:
# Initializing a tf-idf vectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

### Building the PAC model

In [16]:
# Fitting the vectorizer on the train set and transforming it
tfidf_train = tfidf_vectorizer.fit_transform(x_train)

# Transforming the test set with the vocabulary learned from the train set
tfidf_test = tfidf_vectorizer.transform(x_test)


In [17]:
# Initializing a PassiveAggressiveClassifier and fitting it on the train set

pac = PassiveAggressiveClassifier(max_iter = 50)
pac.fit(tfidf_train, y_train)

# Predicting on the test set
y_predict = pac.predict(tfidf_test)


### Inspecting the reliability of the model

In [18]:
# Calculating the accuracy of the model

print('Accuracy: ' + str(round(accuracy_score(y_test, y_predict) * 100, 2)) + '%')


Accuracy: 94.24%


In [19]:
# Generating confusion matrix

cm = confusion_matrix(y_test, y_predict, labels = ['FAKE', 'REAL'])
print(cm)

[[601  36]
 [ 37 593]]
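
In scikit-learn's `confusion_matrix`, rows correspond to true labels and columns to predicted labels, in the order given by `labels`. The sketch below uses hypothetical labels, chosen so that the two kinds of error differ in count, to show which cell is which when FAKE is treated as the positive class:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so that the two error types differ in count
toy_true = ["FAKE", "FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "FAKE", "FAKE", "REAL", "FAKE", "FAKE", "REAL"]

toy_cm = confusion_matrix(toy_true, toy_pred, labels=["FAKE", "REAL"])
print(toy_cm)
# [[3 1]
#  [2 1]]

# Rows are true labels, columns are predicted labels. With FAKE as positive:
tp = toy_cm[0, 0]  # true FAKE, predicted FAKE
fn = toy_cm[0, 1]  # true FAKE, predicted REAL
fp = toy_cm[1, 0]  # true REAL, predicted FAKE
tn = toy_cm[1, 1]  # true REAL, predicted REAL
print(tp, fn, fp, tn)  # 3 1 2 1
```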


In [20]:
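# FAKE is treated as the positive class; rows of cm are true labels, columns are predicted labels,
# so false positives (true REAL, predicted FAKE) sit in cm[1, 0] and false negatives in cm[0, 1]
print('''Implementing the PAC model results in {tp} true positives, {tn} true negatives, {fp} false positives 
and {fn} false negatives.'''.format(tp = cm[0, 0], tn = cm[1, 1], fp = cm[1, 0], fn = cm[0, 1]))

Implementing the PAC model results in 601 true positives, 593 true negatives, 37 false positives 
and 36 false negatives.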


In [21]:
# Calculating precision, recall and F1 score (FAKE is the positive class)

# Precision = TP / (TP + FP): share of stories predicted FAKE that are actually FAKE
precision = cm[0, 0] / (cm[0, 0] + cm[1, 0])
print('Precision: ' + str(round(precision * 100, 2)) + '%')

# Recall = TP / (TP + FN): share of actual FAKE stories that are predicted FAKE
recall = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Recall: ' + str(round(recall * 100, 2)) + '%')

f1 = (2 * recall * precision) / (recall + precision)
print('F1 score: ' + str(round(f1 * 100, 2)) + '%')


Precision: 94.2%
Recall: 94.35%
F1 score: 94.27%
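
The notebook imports `precision_score`, `recall_score` and `f1_score` but computes the metrics by hand. As a cross-check, the same quantities can be obtained from scikit-learn directly. Since the original test set is not available here, the sketch below rebuilds label lists from the four counts in the confusion matrix printed earlier; the variable names are hypothetical, and FAKE is assumed to be the positive class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Label lists rebuilt from the reported confusion matrix [[601, 36], [37, 593]]
# (rows = true labels, columns = predicted labels, order FAKE, REAL)
y_true_rebuilt = ["FAKE"] * (601 + 36) + ["REAL"] * (37 + 593)
y_pred_rebuilt = ["FAKE"] * 601 + ["REAL"] * 36 + ["FAKE"] * 37 + ["REAL"] * 593

# pos_label tells scikit-learn which class counts as "positive"
precision_fake = precision_score(y_true_rebuilt, y_pred_rebuilt, pos_label="FAKE")
recall_fake = recall_score(y_true_rebuilt, y_pred_rebuilt, pos_label="FAKE")
f1_fake = f1_score(y_true_rebuilt, y_pred_rebuilt, pos_label="FAKE")
accuracy = accuracy_score(y_true_rebuilt, y_pred_rebuilt)

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision_fake:.2%}")
print(f"Recall:    {recall_fake:.2%}")
print(f"F1 score:  {f1_fake:.2%}")
```

In the notebook itself the same calls would take `y_test` and `y_predict` instead of the rebuilt lists.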
