### Fake News Classifier using [scikit-learn](http://scikit-learn.org/stable/), [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [Bayesian model](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

[Miguel's post on fake news](https://miguelmalvarez.com/2017/03/23/how-can-machine-learning-and-ai-help-solving-the-fake-news-problem/)

The term “fake news” was almost absent from general discourse and media coverage prior to October 2016.

Fake news is a term that has been used to describe very different issues, from satirical articles to completely fabricated news and plain government propaganda in some outlets. Fake news, information bubbles, news manipulation and the lack of trust in the media are growing problems with huge ramifications for our society. However, in order to start addressing this problem, we need to understand what fake news is. Only then can we look into the different techniques and fields of machine learning (ML), natural language processing (NLP) and artificial intelligence (AI) that could help us fight this situation.

[Dataset link](https://github.com/GeorgeMcIntire/fake_real_news_dataset)

#### Contents:
1. Data Exploration
2. Creating Variables for model
3. Vectorizing the data
4. Building the model
5. Model evaluation
6. Compare with Logistic regression
7. Model improvement

**1. Data Exploration**

Let's take a quick look at the data and get a feel for its content using pandas (`info`, `head`, etc.).

In [1]:
# import numpy and pandas
import numpy
import pandas as pd

In [66]:
#load the data using pandas
data=pd.read_csv('fake_or_real_news.csv')

In [58]:
# Check the statistics of data [total samples, features, datatype, null values etc]
data.info()
# total entries: 6335
# no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 5 columns):
Unnamed: 0    6335 non-null int64
title         6335 non-null object
text          6335 non-null object
label         6335 non-null object
label_no      6335 non-null int64
dtypes: int64(2), object(3)
memory usage: 247.5+ KB


In [59]:
# Check the class distribution
data['label'].value_counts()
# REAL: 50.1%, FAKE: 49.9%, i.e. the dataset has a balanced class distribution

REAL    3171
FAKE    3164
Name: label, dtype: int64
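
The percentages quoted in the comment can be computed directly rather than by hand. Here is a minimal sketch that rebuilds a stand-in Series from the counts printed above (the name `labels` is only illustrative; on the real data you would use `data['label']`):

```python
import pandas as pd

# Stand-in for data['label'], rebuilt from the counts printed above
labels = pd.Series(["REAL"] * 3171 + ["FAKE"] * 3164, name="label")

# normalize=True returns fractions instead of raw counts
shares = labels.value_counts(normalize=True) * 100
print(shares.round(1))
```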

In [62]:
# Check the first two samples of data
data.head(2)

| Unnamed: 0.1 | Unnamed: 0 | title | text | label | label_no |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | 0 |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | 0 |


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

In [67]:
# Convert label to a numerical variable in a new column
data['label_num']=data['label'].map({'REAL':1,'FAKE':0})

In [68]:
# Check the conversion
data.head(2)

| Unnamed: 0.1 | Unnamed: 0 | title | text | label | label_num |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | 0 |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | 0 |
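
One thing worth checking after a dictionary-based `map` is that every label was actually covered, because unmapped values silently become `NaN`. A small sketch on a toy frame (the frame `toy` and the mismatched spelling `'Real'` are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the real dataset; 'Real' is a deliberately mismatched spelling
toy = pd.DataFrame({"label": ["REAL", "FAKE", "Real"]})

# map() returns NaN for any value missing from the dictionary
toy["label_num"] = toy["label"].map({"REAL": 1, "FAKE": 0})

# Count rows whose label was not covered by the mapping
unmapped = int(toy["label_num"].isna().sum())
print(unmapped)
```

On the real data, a count of zero here would confirm that only `REAL` and `FAKE` occur in the `label` column.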


**2. Creating Variables for model** [feature and target variable]
- **"Features"** are also known as predictors, inputs, or attributes. The **"Target"** is also known as the response, label, or output.
- **"Observations"** are also known as samples, instances, or records.

In [63]:
# X is the feature variable; it contains all observations of the "text" column.
# I am using the full article body ("text") rather than the shorter "title".
# Longer text gives the model more distinct words to separate real from fake news.
X=data.iloc[:,2]

In [9]:
# Check the feature variable
X.head(2)

0    Daniel Greenfield, a Shillman Journalism Fello...
1    Google Pinterest Digg Linkedin Reddit Stumbleu...
Name: text, dtype: object

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [10]:
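# Create the target variable, i.e. the numeric label
# scikit-learn expects a numeric target here. Note that column index 4 is 'label_no' (see the output below),
# the numeric label already present in this CSV; the 'label_num' column created above is appended after it.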
y=data.iloc[:,4]

In [11]:
# Check the target variable
y.head(2)

0    0
1    0
Name: label_no, dtype: int64
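
Selecting columns by position works, but it breaks quietly if the column order changes. A minimal sketch of the name-based alternative, using a toy frame that mirrors the column layout from `data.info()` above (the values in `toy` are placeholders, not real rows):

```python
import pandas as pd

# Toy frame with the same column layout as the data.info() output above
toy = pd.DataFrame({
    "Unnamed: 0": [8476, 10294],
    "title": ["t1", "t2"],
    "text": ["first article body", "second article body"],
    "label": ["FAKE", "FAKE"],
    "label_no": [0, 0],
})

# Positional and name-based selection return the same columns here...
X_by_pos, y_by_pos = toy.iloc[:, 2], toy.iloc[:, 4]
X_by_name, y_by_name = toy["text"], toy["label_no"]

# ...but only the name-based form survives a change in column order
assert X_by_pos.equals(X_by_name) and y_by_pos.equals(y_by_name)
```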

In [69]:
# split X and y into training and testing sets using scikit-learn's train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=.25, random_state=2018)

In [70]:
# Check the shape
X_train.shape, X_test.shape

((4751,), (1584,))
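
The split above is purely random. Since the classes are almost perfectly balanced that is probably fine, but `train_test_split` also has a `stratify` parameter that keeps the class ratio identical in both splits. A sketch on toy data (the names `docs` and `target` are illustrative):

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 documents with a balanced 0/1 target
docs = [f"document {i}" for i in range(8)]
target = [0, 1] * 4

# stratify=target keeps the class ratio the same in both splits
d_train, d_test, t_train, t_test = train_test_split(
    docs, target, test_size=0.25, random_state=2018, stratify=target
)
print(len(d_train), len(d_test))
```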

**3. Vectorizing the data** 

CountVectorizer: Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

- `CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)`
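
To see what "a matrix of token counts" means in practice, here is a tiny sketch with two made-up sentences (not taken from the dataset) and the default settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, not taken from the dataset
docs = ["fake news spreads fast", "real news spreads"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs)  # learn the vocabulary and count tokens in one step

# vocabulary_ maps each token to its column index (columns follow alphabetical order)
print(sorted(cv_demo.vocabulary_))
print(counts.toarray())
```

Each row is one document and each column is one word of the learned vocabulary, so the real matrix ends up with one column per distinct token in the training articles.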

In [14]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create an instance of CountVectorizer
cv=CountVectorizer()

In [74]:
# learn the training data vocabulary, then use it to create a document-term matrix
cv.fit(X_train) # fit() updates cv in place, so the result does not need to be stored
X_train_cv=cv.transform(X_train) # use the learned vocabulary to create a document-term matrix

In [75]:
# examine the document-term matrix
X_train_cv

<4751x59857 sparse matrix of type '<class 'numpy.int64'>'
	with 1603659 stored elements in Compressed Sparse Row format>

1. A **sparse matrix** is a matrix in which most of the entries are zero.
2. The sparse format stores only the non-zero values and their locations.
3. To store every value explicitly, zeros included, we convert it into a **dense matrix** with the **.toarray()** method.
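
The difference matters for memory. The training matrix above has 4751 × 59857 ≈ 284 million cells, which is roughly 2.3 GB as a dense `int64` array, while only 1,603,659 of those cells are non-zero. The sketch below illustrates the gap on a smaller random matrix (the shape and the 1% density are arbitrary choices for illustration):

```python
import numpy as np
from scipy import sparse

# Toy stand-in for a document-term matrix: 1000 x 5000 with about 1% non-zero entries
rng = np.random.default_rng(0)
dense = (rng.random((1000, 5000)) < 0.01).astype(np.int64)
csr = sparse.csr_matrix(dense)

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = dense.nbytes
print(sparse_bytes, dense_bytes)
```

scikit-learn estimators such as `MultinomialNB` and `LogisticRegression` also accept sparse input directly, so the dense conversion is optional and may be worth skipping when memory is tight.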

In [17]:
# convert sparse matrix to a dense matrix using toarray
X_train_cv=X_train_cv.toarray()

In [18]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_cv=cv.transform(X_test).toarray()

**4. Building the model**

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
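
Before fitting on the real data, it may help to see the classifier on a matrix small enough to reason about by hand. A minimal sketch, assuming a hypothetical three-word vocabulary and made-up counts:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: columns are counts of three hypothetical words
counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
classes = np.array([0, 0, 1, 1])

nb_demo = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing, the default
nb_demo.fit(counts, classes)

# A new document dominated by the first word should land in class 0
new_doc = np.array([[2, 0, 1]])
print(nb_demo.predict(new_doc), nb_demo.predict_proba(new_doc).round(3))
```

The smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.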

In [19]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()

In [76]:
# train the model using training data. Check the training timing with an IPython "magic command"
%time mnb.fit(X_train_cv,y_train)

Wall time: 34 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [89]:
# Make the predictions for X_test_cv
%time y_pred_mnb=mnb.predict(X_test_cv)

Wall time: 703 ms


**5. Model evaluation**

We will evaluate the model using the accuracy score, the confusion matrix and the classification report.

In [22]:
# Import evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [90]:
# Check model accuracy
accuracy_score(y_test, y_pred_mnb)*100

88.131313131313121

In [91]:
# Check the confusion matrix
print(confusion_matrix(y_test, y_pred_mnb, labels=[0, 1]))

[[675 130]
 [ 58 721]]


In [92]:
# Check the classification report
print(classification_report(y_test, y_pred_mnb))

             precision    recall  f1-score   support

          0       0.92      0.84      0.88       805
          1       0.85      0.93      0.88       779

avg / total       0.88      0.88      0.88      1584
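
The figures in this report can be re-derived from the confusion matrix printed earlier, which is a useful sanity check on how precision and recall are defined. A short sketch using those four numbers:

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions (order 0, 1)
cm = np.array([[675, 130],
               [58, 721]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(float(accuracy), 4), precision.round(2), recall.round(2), f1.round(2))
```

The row sums (805 and 779) are the `support` column of the report, i.e. the number of test articles that truly belong to each class.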



**6. Compare with Logistic regression**

[Logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
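
To make "modeled using a logistic function" concrete, here is a rough sketch of how a linear score over word counts turns into a probability. The weights, bias and counts are invented for illustration; a fitted model learns its own values:

```python
import numpy as np

def logistic(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a three-word vocabulary (illustrative values only)
weights = np.array([0.8, -1.2, 0.1])
bias = -0.3

# Word counts for one toy document
x = np.array([2, 0, 1])

score = weights @ x + bias   # linear part of the model
prob = logistic(score)       # estimated probability of the positive class
print(round(float(prob), 3))
```

A probability above 0.5 corresponds to a positive score, which is why the decision boundary of the model is linear in the word counts.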

In [85]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()

In [94]:
%time lr.fit(X_train_cv,y_train)

Wall time: 3.18 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [95]:
%time y_pred_lr=lr.predict(X_test_cv)

Wall time: 708 ms


In [96]:
accuracy_score(y_test, y_pred_lr)*100

91.603535353535349

In [93]:
# Check the classification report
print(classification_report(y_test, y_pred_lr))

             precision    recall  f1-score   support

          0       0.92      0.91      0.92       805
          1       0.91      0.92      0.92       779

avg / total       0.92      0.92      0.92      1584



**Exploring the classification report**, i.e. precision, recall, f1-score and support

To be updated

**7. Model improvement**

To be updated