# How to Make a Text Classifier: Fake News Edition
<br><br>

<b>Agenda:</b>

We are going to build a Naive Bayes classifier that labels news articles as "FAKE" or "REAL":

- Prepare the corpus (documents) for modeling using count and TF-IDF vectorizers.
- Train a Naive Bayes model on the vectorized documents.
- Use grid search to optimize our model.

[Article about this project](https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/)

## “A lie gets halfway around the world before the truth has a chance to get its pants on.” – Winston Churchill


<b>“What is fake news?”</b>
<br><br>
<b>Can you build a model that can differentiate between “Real” news and “Fake” news?</b>

Requirements: pandas, numpy, matplotlib, seaborn, sklearn, nltk

This notebook uses Python 3.

In [None]:
#Imports
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV  #sklearn.grid_search was removed in newer releases
pd.set_option("display.max_columns", 100)

Load in the data set

In [None]:
df = 
#view data


In [None]:
#View tail


In [None]:
#Print 10 random titles



In [None]:
#Print article



### The Articles

- 4594 articles published between October 2015 and December 2016.
- The "FAKE" articles came from this [Kaggle page](https://www.kaggle.com/mrisdal/fake-news).
- The "REAL" articles came from www.allsides.com and are from publications like the New York Times, WSJ, Bloomberg, NPR, and the Guardian.

### Tokenizing text with Count and TFIDF Vectorizers

Before we can build a model, we have to turn words into numbers.

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
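To make the idea of tokenizing concrete, here is a minimal sketch that uses only the standard library. The sample sentence is made up for illustration, and the word pattern mirrors the default `token_pattern` of scikit-learn's vectorizers (runs of two or more word characters); real tokenizers handle many more edge cases.

```python
import re

# Hypothetical sample text, not taken from the article data set
text = "Fake news spreads fast. Real news takes time!"

# Sentence units: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word units: lowercase, then keep runs of two or more word characters
words = re.findall(r"\b\w\w+\b", text.lower())

print(sentences)
print(words)
```

Note that punctuation disappears from the word tokens and single-character words would be dropped by this pattern.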

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
#Assign text to X variable and labels to y

X = 
y = 

In [None]:
#Initialize Count Vectorizer

#Fit Count Vectorizer

#Convert it to a pandas data frame


In [None]:
#Look at df_cv


In [None]:
#How big is data


In [None]:
#Print first 100 feature names


In [None]:
#Print random slice of feature names


Let's configure our count vectorizer

- **lowercase:** boolean, True by default
    - Convert all characters to lowercase before tokenizing.

- **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

- **stop_words:** string {‘english’}, list, or None (default)
    - If ‘english’, a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts.

Train a Count Vectorizer that lowercases the text, includes single words and two-word phrases, filters out stop words, and keeps only terms that appear in at least 3 documents.

In [None]:
#Initialize Count Vectorizer

#Fit Count Vectorizer

#Convert it to a pandas data frame


In [None]:
#Look at data


In [None]:
#Look at random slice of features


Time for TFIDF Vectorizer

- **What:** Weights how often a word appears in a document against how common that word is across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering
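As a rough illustration, the sketch below runs `TfidfVectorizer` with its default settings on an invented corpus in which "election" appears in every document and "fraud" in only one. With the defaults, scikit-learn computes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and scales each row to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "election" is common, "fraud" is rare
docs = [
    "election fraud claims",
    "election results",
    "election coverage",
]

vect = TfidfVectorizer()
weights = vect.fit_transform(docs).toarray()
vocab = vect.vocabulary_  # maps each token to its column index

fraud_weight = weights[0, vocab["fraud"]]
election_weight = weights[0, vocab["election"]]
print(round(fraud_weight, 3), round(election_weight, 3))
```

Both words occur once in the first document, yet "fraud" receives the larger weight because it is rarer across the corpus.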

In [None]:
#Initialize TFIDF Vectorizer

#Fit TFIDF Vectorizer

#Convert it to a pandas data frame


In [None]:
#View data 


Now that we have our two vectorized datasets, let's get into the modeling.

### Naive Bayes

Bayes' theorem describes the probabilistic relationship between variables: it lets us express one conditional probability in terms of the inverse conditional and the underlying marginal probabilities. It can be written as:

$$P(y|x) = P(y)P(x|y)/P(x)$$

In words: the probability of y given x equals the probability of y, times the probability of x given y, divided by the probability of x.

The theorem extends to the case where x is a vector (containing the multiple x variables used as inputs for the model):

$$P(y|x_1,...,x_n) = P(y)P(x_1,...,x_n|y)/P(x_1,...,x_n)$$

Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as **ham or spam.**

$$P(spam \ | \ \text{send money now}) = \frac {P(\text{send money now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

By assuming that the features (the words) are **conditionally independent**, we can simplify the likelihood function:

$$P(spam \ | \ \text{send money now}) \approx \frac {P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

We can calculate all of the values in the numerator by examining a corpus of **spam email**:

$$P(spam \ | \ \text{send money now}) \approx \frac {0.2 \times 0.1 \times 0.1 \times 0.9} {P(\text{send money now})} = \frac {0.0018} {P(\text{send money now})}$$

We would repeat this process with a corpus of **ham email**:

$$P(ham \ | \ \text{send money now}) \approx \frac {0.05 \times 0.01 \times 0.1 \times 0.1} {P(\text{send money now})} = \frac {0.000005} {P(\text{send money now})}$$

All we care about is whether spam or ham has the **higher probability**, and so we predict that the email is **spam**.
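The comparison can be checked with a few lines of arithmetic. This sketch reuses the example numbers above and assumes spam and ham are the only two classes, so the shared denominator is the sum of the two numerators.

```python
# Numerators from the worked example: likelihoods times the class prior
spam_score = 0.2 * 0.1 * 0.1 * 0.9
ham_score = 0.05 * 0.01 * 0.1 * 0.1

# With only two classes, P(send money now) is the sum of both numerators
evidence = spam_score + ham_score

p_spam = spam_score / evidence
p_ham = ham_score / evidence
prediction = "spam" if p_spam > p_ham else "ham"

print(round(p_spam, 4), round(p_ham, 4), prediction)
```

Dividing by the evidence rescales both scores so they sum to one, but it never changes which class wins.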

#### Key takeaways

- The **"naive" assumption** of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
- The **normalization constant** (the denominator) can be ignored since it's the same for all classes.
- The **prior probability** is much less relevant once you have a lot of features.

<b>Pros:</b>

- Very fast. Adept at handling tens of thousands of features, which is why it's used for text classification.
- Works well with a small number of observations.
- Isn't negatively affected by "noise".

<b>Cons:</b>

- Its probability estimates are unreliable. Most of the time it assigns probabilities that are close to zero or one.
- It is literally "naive", meaning it assumes features are independent.
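For a feel of how the model behaves, here is a minimal sketch that fits `MultinomialNB` on a six-message toy corpus echoing the spam example. The messages, labels, and variable names are invented for illustration; the default smoothing (`alpha=1.0`) is left unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages and labels
docs = [
    "send money now",
    "win money now",
    "send cash now",
    "meeting at noon",
    "lunch at noon tomorrow",
    "see you at the meeting",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vect = CountVectorizer()
X_toy = vect.fit_transform(docs)

nb = MultinomialNB()
nb.fit(X_toy, labels)

# Reuse the fitted vocabulary for the new message
new = vect.transform(["send money tomorrow"])
pred = nb.predict(new)[0]
proba = nb.predict_proba(new)[0]  # ordered like nb.classes_

print(pred, dict(zip(nb.classes_, proba.round(3))))
```

On a corpus this small the probabilities are fairly moderate; with thousands of features, the many multiplied likelihoods tend to push them toward zero or one.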

Test the Naive Bayes model on the count-vectorized data. First, re-run the Count Vectorizer with stop_words = "english".

In [None]:
#Initialize Count Vectorizer

#Fit Count Vectorizer

#Convert it to a pandas data frame


Fit model

In [None]:
#Initialize model

#Fit model with df_cv and y

#score the model


Re-run the TFIDF Vectorizer with stop_words = "english", then fit and score the model again.

In [None]:
#Initialize TFIDF Vectorizer

#Fit TFIDF Vectorizer

#Convert it to a pandas data frame


In [None]:
#Initialize model

#Fit model with df_tf and y

#score the model


Some good scores! Or are they?

Time for some cross validation.
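Scoring a model on the same rows it was trained on tends to flatter it; cross validation instead scores each fold on rows the model has not seen. The sketch below shows the call on a made-up twelve-document corpus (the texts, labels, and `cv=3` are assumptions for illustration, not the article data).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: six "FAKE" and six "REAL" headlines
texts = [
    "shocking secret cure doctors hide",
    "aliens endorse candidate in shocking twist",
    "secret plot exposed by anonymous insider",
    "miracle cure banned by shocking secret order",
    "anonymous insider reveals alien plot",
    "shocking miracle exposed in secret memo",
    "senate passes budget after long debate",
    "central bank holds interest rates steady",
    "city council approves new transit budget",
    "court hears arguments in voting rights case",
    "senate committee schedules budget hearing",
    "bank reports quarterly earnings and rates outlook",
]
labels = ["FAKE"] * 6 + ["REAL"] * 6

X_counts = CountVectorizer().fit_transform(texts).toarray()

# Three stratified folds; each score is accuracy on the held-out fold
scores = cross_val_score(MultinomialNB(), X_counts, labels, cv=3)
print(scores, scores.mean())
```

The mean of the fold scores is usually a more honest estimate than a single score computed on the training data.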

In [None]:
#Call cross_val_score on the count vectorized dataset. Call .values on df_cv


In [None]:
#Call cross_val_score on the tfidf vectorized dataset. Call .values on df_tf


Which one wins? Count or TFIDF vectorizer?

Now we're going to optimize our model by testing every combination of settings in a predefined grid.

### Grid Searching

Background reading: [How to tune algorithm parameters with scikit-learn](https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/)

Create dictionaries for our "grids", that is, every combination of settings we want to try.

In [None]:
#Grid dictionary for count vectorized data
param_grid_cv = {}
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__max_features"] = [1000,5000, 10000]

In [None]:
#Grid dictionary for tfidf vectorized data
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 5000, 10000]
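The prefixes in these keys follow the `step__parameter` convention: `make_pipeline` names each step after its lowercased class name, and the double underscore routes a value to that step. The sketch below shows one way the pieces can fit together on a made-up corpus; the texts, the reduced grid (`toy_grid`), and `cv=3` are assumptions chosen so the example runs quickly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six "FAKE" and six "REAL" headlines
texts = [
    "shocking secret cure doctors hide",
    "aliens endorse candidate in shocking twist",
    "secret plot exposed by anonymous insider",
    "miracle cure banned by shocking secret order",
    "anonymous insider reveals alien plot",
    "shocking miracle exposed in secret memo",
    "senate passes budget after long debate",
    "central bank holds interest rates steady",
    "city council approves new transit budget",
    "court hears arguments in voting rights case",
    "senate committee schedules budget hearing",
    "bank reports quarterly earnings and rates outlook",
]
labels = ["FAKE"] * 6 + ["REAL"] * 6

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
step_names = list(pipe.named_steps)
print(step_names)

# Reduced grid for the toy data: 2 x 2 = 4 combinations
toy_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "countvectorizer__max_features": [5, 20],
}

# The vectorizer is refit inside every fold, so held-out text never shapes the vocabulary
search = GridSearchCV(pipe, toy_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Passing raw text to the grid search, rather than a pre-vectorized matrix, is what allows the vectorizer settings themselves to be tuned.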

Make a pipeline

In [None]:
#Create pipeline for count vectorized data


In [None]:
#Create pipeline for tfidf vectorized data


Establish the grids

In [None]:
#Create grid for count vectorized data


In [None]:
#Create grid for tfidf vectorized data


This is gonna take a while so let's measure how long it takes

In [None]:
#Import time library
from time import time 

In [None]:
#Fit and time the grid_cv

#Fit grid_cv on X and y


In [None]:
#Fit and time the grid_tf

#Fit grid_tf on X and y


In [None]:
#Look at the best parameters and the best scores for count vectorized data


In [None]:
#Look at the best parameters and the best scores for tfidf vectorized data


## Bonus Section: How to find the "fakest" and "realest" words

In [None]:
#Fit count vectorizer and NB model

count_vec = 
#Fit Count Vectorizer
dtm_cv = 

model = 


In [None]:
#Assign feature list to tokens
tokens = 

In [None]:
#Counts words in fake articles
fake_token_count = 
fake_token_count

In [None]:
#Counts words in real articles 
real_token_count = 
real_token_count

In [None]:
#Input tokens, fake_token_count, and real_token_count into a pandas data frame
tok_df = pd.DataFrame({"token":tokens, 
                       "fake":fake_token_count, 
                       "real":real_token_count}).set_index("token")

In [None]:
#Add 1 to fake and real columns


In [None]:
#Divide each value in the fake and real columns by their corresponding class_count value
tok_df.fake = 
tok_df.real = 

In [None]:
#Derive the ratio between fake and real
tok_df["ratio"] = 

Time to see the "fakest" words

"Realest" words

Let's plot them

In [None]:
top_20_fake = 
top_20_fake

In [None]:
tok_df["real_ratio"] = 

In [None]:
top_20_real = 
top_20_real

Plot fakes

Plot reals