# Building a Machine Learning Model to Predict Fake News

Can a computer tell the difference between real and fake news? In this project, I use natural language processing and machine learning to build a model that tries to do just that. In this era of information overload and misinformation, it's a small step in the fight -- and a fun way to explore how computers understand text.

## Table of Contents
- Import pandas and load the dataset
- Clean the text
- Vectorize the cleaned text
- Transform the cleaned text into a TF-IDF Matrix
- Split the data into training and test sets
- Train a logistic regression model
- Make predictions and evaluate the model
- The takeaway

## Import pandas and load the dataset

In [88]:
import pandas as pd

In [90]:
fake_news = pd.read_csv('fake_news.csv', index_col=0)

In [92]:
# Check to make sure the dataset loaded properly
fake_news.head()

| | title | text | subject | date | fake |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 1 |


In [94]:
# Check the length of the dataset
len(fake_news)

44898

Nearly 45K rows! Now to make sure there aren't any missing values and to do a little digging...

In [96]:
# Check for missing data
fake_news.isna().any()

title      False
text       False
subject    False
date       False
fake       False
dtype: bool

In [98]:
# Check the shape
fake_news.shape

(44898, 5)

There are, of course, five columns. Let's zero in on the last one, "fake", and check what proportion of stories are fake (1) versus real (0).

In [100]:
# Create a boolean mask and use .mean() to see the proportion of fake stories
is_fake = (fake_news.fake == 1).mean()
is_fake

0.5229854336496058

In [102]:
# Now check the proportion of real stories
is_real = (fake_news.fake == 0).mean()
is_real

0.47701456635039424

In [104]:
# And just to confirm that math...
is_fake + is_real

1.0
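
Both proportions can also be read off in one call with `value_counts(normalize=True)`. Here's a minimal sketch on a tiny made-up `fake` column (the `toy` DataFrame is a stand-in, not the real dataset):

```python
import pandas as pd

# Hypothetical toy labels standing in for fake_news['fake'] (1 = fake, 0 = real)
toy = pd.DataFrame({'fake': [1, 1, 1, 0, 0]})

# normalize=True returns proportions instead of raw counts
proportions = toy['fake'].value_counts(normalize=True)
print(proportions)

# The boolean-mask approach gives the same number for the fake class
mask_proportion = (toy['fake'] == 1).mean()
print(mask_proportion)
```

On the real data, `fake_news['fake'].value_counts(normalize=True)` should report the same 52/48 split found above.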

## Clean the text

In [106]:
# Import regular expressions to clean up the text
import re

Time to create the function...

In [108]:
# Define a function to clean the text
# This will: lowercase everything, remove punctuation, 
# remove numbers and remove extra whitespace

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
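
To see what each step does, here's the same function run on a single headline. The sample string is adapted from the last row of the dataset, with extra spaces and an exclamation mark added for illustration:

```python
import re

def clean_text(text):
    text = text.lower()                       # lowercase everything
    text = re.sub(r'[^\w\s]', '', text)       # drop punctuation
    text = re.sub(r'\d+', '', text)           # drop digits
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text

sample = "Indonesia to buy $1.14 billion worth of   Russian jets!"
cleaned = clean_text(sample)
print(cleaned)  # indonesia to buy billion worth of russian jets
```

Note that the cleaning is lossy: "$1.14 billion" becomes just "billion". That's a reasonable trade-off for a bag-of-words model, but worth keeping in mind.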

In [110]:
# Combine title and text into one column so that the model will have more content to learn from
fake_news['combined_text'] = fake_news['title'] + ' ' + fake_news['text']
fake_news['combined_text']

0         Donald Trump Sends Out Embarrassing New Year’...
1         Drunk Bragging Trump Staffer Started Russian ...
2         Sheriff David Clarke Becomes An Internet Joke...
3         Trump Is So Obsessed He Even Has Obama’s Name...
4         Pope Francis Just Called Out Donald Trump Dur...
                               ...                        
44893    'Fully committed' NATO backs new U.S. approach...
44894    LexisNexis withdrew two products from Chinese ...
44895    Minsk cultural hub becomes haven from authorit...
44896    Vatican upbeat on possibility of Pope Francis ...
44897    Indonesia to buy $1.14 billion worth of Russia...
Name: combined_text, Length: 44898, dtype: object

In [112]:
# Clean 'combined_text' using the above function
fake_news['cleaned_text'] = fake_news['combined_text'].apply(clean_text)

In [113]:
# Preview 'combined_text' and 'cleaned_text'
fake_news[['combined_text', 'cleaned_text']].head()

| | combined_text | cleaned_text |
|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | donald trump sends out embarrassing new years ... |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | drunk bragging trump staffer started russian c... |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | sheriff david clarke becomes an internet joke ... |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | trump is so obsessed he even has obamas name c... |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | pope francis just called out donald trump duri... |


## Vectorize the cleaned text

Before we can train our machine learning model, we need to **turn the text into numbers**. We'll use TfidfVectorizer for this.

TF-IDF stands for **Term Frequency-Inverse Document Frequency**:
- TF: How often a word shows up in a single document
- IDF: How rare that word is across *all* documents
- The result, TF x IDF, is a score that rewards useful, distinctive words and downgrades common ones

TfidfVectorizer creates a matrix of these scores for all our text. And that's what we'll feed into the model.
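
To make the scoring concrete, here's a hand-rolled sketch on a three-document toy corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, and it skips the row normalization that TfidfVectorizer also applies, so the numbers won't match the library's output exactly. The corpus and the helper names (`idf`, `tfidf`) are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny "documents"
docs = [
    "trump tweets about the election",
    "the senate votes on the budget",
    "the pope speaks about peace",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed IDF (assumed to mirror scikit-learn's default, smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF; no normalization in this sketch
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}

scores = tfidf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In the first document, "the" appears in every document and gets the lowest score, while words unique to that document ("trump", "tweets", "election") get the highest.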

In [116]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [118]:
# Initialize the vectorizer and set a max number of features
# Worth noting: This dataset has 45K rows (documents), so let's
# keep the size of the matrix manageable. Instead of keeping all 100K or so
# unique words, let's keep only the 5K most frequent

vectorizer = TfidfVectorizer(max_features=5000)
vectorizer

## Transform the cleaned text into a TF-IDF Matrix

In [128]:
# Grab the 'cleaned_text' column
X = fake_news.cleaned_text
X

0        donald trump sends out embarrassing new years ...
1        drunk bragging trump staffer started russian c...
2        sheriff david clarke becomes an internet joke ...
3        trump is so obsessed he even has obamas name c...
4        pope francis just called out donald trump duri...
                               ...                        
44893    fully committed nato backs new us approach on ...
44894    lexisnexis withdrew two products from chinese ...
44895    minsk cultural hub becomes haven from authorit...
44896    vatican upbeat on possibility of pope francis ...
44897    indonesia to buy billion worth of russian jets...
Name: cleaned_text, Length: 44898, dtype: object

In [139]:
# Use .fit_transform() to turn 'cleaned_text' into a sparse matrix
X_tfidf = vectorizer.fit_transform(X)
X_tfidf

<44898x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 7730991 stored elements in Compressed Sparse Row format>

Doing some quick math, the above sparse matrix has **224 MILLION** possible entries. But we see that only 7.7 million are non-zero. Hence the brilliance of the sparse matrix in the context of NLP.
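
That quick math, spelled out using only the shape and stored-element count printed above:

```python
n_rows, n_cols = 44898, 5000   # shape of X_tfidf reported above
stored = 7_730_991             # stored (non-zero) elements reported above

possible = n_rows * n_cols
density = stored / possible

print(f"possible entries: {possible:,}")  # possible entries: 224,490,000
print(f"density: {density:.2%}")          # density: 3.44%
```

In the notebook itself, the same figure should be available as `X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])`.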

## Split the data into training and test sets

Now we split the TF-IDF matrix into a training set and a test set. The model will learn from the training data, and then we'll use the test data to see how well it performs on unseen examples!

In [146]:
# Import train_test_split to divide up the data
from sklearn.model_selection import train_test_split

In [148]:
# Set the target variable, y, to the 'fake' column
# Note: 1 = fake and 0 = real
y = fake_news.fake

In [154]:
# Split the data 80% training and 20% testing
# X_tfidf: Feature matrix (TF-IDF of the cleaned text)
# y: target labels, 1 or 0
# test_size=0.2: Hold out 20% for testing
# random_state=42: Ensures the same split every time
# X_train and y_train: Used to train the model
# X_test and y_test: Used to evaluate how well the model performs on new data

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

Here's an intuitive way to think about this rather dense block of code above: Shuffle the deck, then deal 80% of the cards into the training pile, and 20% into the test pile -- for both features (the article text) and labels (whether it's real or fake).
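
The card-dealing analogy can be sketched in plain Python. This only illustrates the idea; it is not how `train_test_split` is implemented, and the shuffle order won't match scikit-learn's even with the same seed. The `deal` helper and the toy data are hypothetical:

```python
import random

def deal(features, labels, test_size=0.2, seed=42):
    """Shuffle row indices once, then split features and labels the same way."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical toy data: ten "articles" with alternating labels
articles = [f"article {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = deal(articles, labels)
print(len(X_tr), len(X_te))  # 8 2
```

The key detail is that features and labels are picked with the same shuffled indices, so every article stays paired with its own label.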

## Train a logistic regression model

Now to train a logistic regression model -- one of the most foundational and intuitive techniques in the machine learning toolkit. It's all about classification, predicting 1's and 0's. Or, in our case, whether a news story is real or fake.

In [162]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

In [164]:
# Initialize and train the model
# Note: Increase max_iter from the default 100 to 1000 so that
# the solver has enough iterations to converge on this large dataset
fake_news_model = LogisticRegression(max_iter=1000)
fake_news_model.fit(X_train, y_train)
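
Under the hood, a fitted logistic regression scores each article as a weighted sum of its features plus an intercept, squashes that score through the sigmoid function to get a probability, and predicts "fake" when the probability reaches 0.5. A minimal sketch with invented weights (the words and numbers below are made up, not taken from `fake_news_model`):

```python
import math

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1 / (1 + math.exp(-z))

# Hypothetical weights: positive pushes toward fake (1), negative toward real (0)
weights = {"embarrassing": 2.0, "reuters": -3.0, "said": -0.5}
intercept = 0.1

def predict_proba_fake(features):
    # features maps word -> TF-IDF score; words without a weight contribute nothing
    z = intercept + sum(weights.get(w, 0.0) * x for w, x in features.items())
    return sigmoid(z)

article = {"embarrassing": 0.6, "said": 0.2}
p = predict_proba_fake(article)
print(f"P(fake) = {p:.3f}, predicted label = {int(p >= 0.5)}")
```

In the notebook, the real counterparts are `fake_news_model.coef_` (the learned weights) and `fake_news_model.predict_proba(X_test)` (the probabilities).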

## Make predictions and evaluate the model

After training the model, it's time to use it to make predictions on the test set and see how well it performs. This will tell us how accurately it can classify unseen articles.

In [168]:
# Import evaluation tools: classification_report and confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix

Now to make the predictions.

In [171]:
# Use the trained model to make predictions on the test set
y_pred = fake_news_model.predict(X_test)

In [175]:
# Display 'y_pred'
y_pred

array([1, 0, 0, ..., 1, 1, 1])

**The confusion matrix:**

Here's what the confusion matrix tells us:
- **True negatives:** 4212 articles correctly predicted as real
- **False positives:** 35 articles incorrectly predicted as fake
- **False negatives:** 46 articles incorrectly predicted as real
- **True positives:** 4687 articles correctly predicted as fake

In [177]:
# Check the confusion matrix
confusion_matrix(y_test, y_pred)

array([[4212,   35],
       [  46, 4687]])

**The classification report aka the Super Report:**
- **Precision:** Of all the articles predicted as fake, how many were *actually* fake?
- **Recall:** Of all the fake articles in the test set, how many did the model catch?
- **F1-score:** The balance between precision and recall -- provides a fuller picture of performance
- **Support:** The number of actual examples of each class in the test set
- **Accuracy:** Overall -- How often did the model get the prediction right?
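
These definitions can be checked by hand against the four counts in the confusion matrix above. The sketch below computes the metrics for the fake class (1) with plain arithmetic:

```python
# Counts copied from the confusion matrix: rows are actual, columns are predicted
tn, fp = 4212, 35
fn, tp = 46, 4687

precision = tp / (tp + fp)   # of the articles predicted fake, the share that were fake
recall = tp / (tp + fn)      # of the actual fake articles, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
support_fake = tp + fn       # actual fake articles in the test set

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name:9s} {value:.4f}")
print("support  ", support_fake)
```

Each value rounds to 0.99, and the support for the fake class comes out to 4733.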

In [179]:
# Print the classification report so the table renders line by line
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4247
           1       0.99      0.99      0.99      4733

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980

## The takeaway

With precision, recall, F1-score and accuracy all at 0.99 on the held-out test set, this model -- using nothing more than the language in each article -- is doing exactly what it was trained to do: predict whether a news story in this dataset is real or fake.