# Training a machine to spot fake news

In this tutorial, we will train a program to spot fake news. The tutorial runs in the Jupyter Notebook environment, which works a bit differently from a conventional coding environment. If you are not familiar with Jupyter Notebook, this tutorial is also a good way to get started with it!

This Python solution is divided into many code blocks (the grey boxes). To run a code block, click into it and then click the Run button at the top of this page. Alternatively, click into the block and press Shift-Enter. Run the blocks in order, from top to bottom, because later blocks rely on variables defined in earlier ones.

You will not be writing any code in this tutorial; your primary task is to run through the code and examine the output. If you have extra time, try to understand what each block of code is doing!

Dataset retrieved from: https://github.com/ajayjindal/Fake-News-Detection

## Importing and preparing data for training

In [None]:
# import required modules and functions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.linear_model import PassiveAggressiveClassifier
import numpy as np
import itertools
from matplotlib import pyplot as plt

In [None]:
# read the data into a dataframe
df = pd.read_csv('fake_or_real_news.csv', index_col = 'number', usecols = ['number', 'title', 'text', 'label'], nrows = 2000)

# fills any empty cells with a common word
df.fillna("the", inplace = True)

# examine dataframe df
df.head()

In [None]:
# store the label of every article (FAKE or REAL) in y
y = df["label"]

# examine the first few labels in y
y.head()

In [None]:
# split the data into 2 subsets: one for training our model (X_train, y_train) and one for testing it (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size = 0.33, random_state = 52)

# check the size of the training subset after the split
print(y_train.shape)
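
To see what `test_size = 0.33` does to the row counts, here is a minimal sketch on made-up data. The `toy_` names and the placeholder articles are assumptions for illustration only; they are not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 2000 rows read above (not real articles)
toy_text = pd.Series([f"article {i}" for i in range(2000)])
toy_label = pd.Series(["FAKE" if i % 2 == 0 else "REAL" for i in range(2000)])

# Same split settings as the tutorial cell
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_text, toy_label, test_size=0.33, random_state=52
)

# Roughly one third of the rows go to testing, the rest to training
print("training rows:", toy_y_train.shape[0])
print("testing rows: ", toy_y_test.shape[0])
```

Because `random_state` is fixed, the same rows land in each subset every time the cell is run, which keeps the accuracy figures later on reproducible.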


## Training and testing of data using different classification models

In [None]:
count_vectorizer = CountVectorizer()

#Generate word count for each unique word for X_train
count_train = count_vectorizer.fit_transform(X_train.values.astype('U'))

#Generate word counts for X_test, using only the vocabulary learned from X_train
count_test = count_vectorizer.transform(X_test.values.astype('U'))
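
The difference between `fit_transform` and `transform` is easier to see on a few short sentences. The sketch below uses invented sentences and `demo_` names (both are assumptions, not part of the tutorial data) and prints the word-count table that `CountVectorizer` builds.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini "articles" for illustration
demo_train = ["the senate passed the bill", "aliens passed the senate"]
demo_test = ["aliens wrote the bill"]

demo_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training text and counts the words
demo_count_train = demo_vectorizer.fit_transform(demo_train)

# transform reuses that vocabulary; words never seen in training are ignored
demo_count_test = demo_vectorizer.transform(demo_test)

# List the vocabulary in column order
demo_vocab = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(demo_vocab)
print(demo_count_train.toarray())
print(demo_count_test.toarray())
```

Each row is one article and each column is one word. The word "wrote" appears only in the test sentence, so it has no column and is left out of the counts.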

In [None]:
#by declaring max_df = 0.95 the program removes words which appear in more than 95% of the articles
tfidf_vectorizer = TfidfVectorizer(max_df=0.95)

#Generate TFIDF-values for X_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train.values.astype('U')) 

#Generate TFIDF-values for X_test
tfidf_test = tfidf_vectorizer.transform(X_test.values.astype('U')) 
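
As a rough illustration of `max_df=0.95`, the sketch below runs `TfidfVectorizer` on three invented sentences (the sentences and the `demo_` names are assumptions). The word "the" appears in all three, which is more than 95% of them, so it is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini "articles"; every one of them contains the word "the"
demo_docs = [
    "the senate passed the bill",
    "the aliens landed",
    "the bill failed",
]

demo_tfidf = TfidfVectorizer(max_df=0.95)
demo_matrix = demo_tfidf.fit_transform(demo_docs)

# Words that survived the max_df filter
demo_kept = sorted(demo_tfidf.vocabulary_)
print(demo_kept)

# One row per article, one column per remaining word
print(demo_matrix.shape)
```

Words that appear almost everywhere say little about whether an article is fake or real, which is why filtering them out can help the classifier focus on more distinctive words.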

In [None]:
#define a function that plots a confusion matrix
def plot_confusion_matrix(number, cm, classes, title, normalize=False,
                          cmap=plt.cm.Blues):
    # normalize before drawing so that the colors and the cell text agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.figure(number)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # write each cell's value on top of its colored square
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.title(title)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
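
Before plotting real results, it may help to see how a confusion matrix is read. The sketch below uses eight invented labels and predictions (the `demo_` names and values are assumptions) and shows how accuracy and the row-normalized matrix relate to the raw counts.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for eight articles
demo_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]
demo_pred = ["FAKE", "REAL", "FAKE", "REAL", "REAL", "FAKE", "REAL", "FAKE"]

# Rows are the true labels, columns are the predicted labels
demo_cm = metrics.confusion_matrix(demo_true, demo_pred, labels=["FAKE", "REAL"])
print(demo_cm)

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total
demo_accuracy = np.trace(demo_cm) / demo_cm.sum()
print("accuracy: %0.3f" % demo_accuracy)

# Dividing each row by its total gives the fraction of each true class
demo_cm_normalized = demo_cm.astype("float") / demo_cm.sum(axis=1)[:, np.newaxis]
print(demo_cm_normalized)
```

The top-right cell counts FAKE articles that were predicted as REAL, and the bottom-left cell counts REAL articles that were predicted as FAKE.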

In [None]:
#initialize text classifier using Multinomial N-B classification system on tfidf model
clf = MultinomialNB()
#instruct program to find patterns between text articles and real/fake label in training pool
clf.fit(tfidf_train, y_train)

#use the learned patterns to predict a label for each article in the testing pool
prediction = clf.predict(tfidf_test)

#compare the predicted real/fake labels to the actual labels
score = metrics.accuracy_score(y_test, prediction)
print("accuracy N-B on TFIDF: %0.3f" % score)

#calculate Confusion Matrix (CM) and generate data for plotting CM
cm = metrics.confusion_matrix(y_test, prediction, labels=['FAKE', 'REAL'])
plot_confusion_matrix("1", cm, classes=['FAKE', 'REAL'], title="N-B on TFIDF")

plt.show()
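
The fit-then-predict pattern used above can also be tried on brand-new text. This is a minimal sketch on four invented training sentences; the sentences, labels and `demo_` names are assumptions, and a model trained on so little text is only useful for seeing the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training articles and labels (not taken from the dataset)
demo_train_text = [
    "shocking secret aliens cure",
    "miracle cure shocking doctors",
    "senate passed budget bill",
    "court ruling budget vote",
]
demo_train_label = ["FAKE", "FAKE", "REAL", "REAL"]

# Hypothetical new articles that the model has never seen
demo_new_text = ["shocking miracle cure", "senate budget vote"]

# Turn the training text into TF-IDF values and fit the classifier on them
demo_vectorizer = TfidfVectorizer()
demo_features = demo_vectorizer.fit_transform(demo_train_text)
demo_clf = MultinomialNB()
demo_clf.fit(demo_features, demo_train_label)

# New text must go through the same fitted vectorizer before prediction
demo_prediction = demo_clf.predict(demo_vectorizer.transform(demo_new_text))
print(list(demo_prediction))
```

The first new sentence only uses words seen in FAKE training articles and the second only uses words seen in REAL ones, so the classifier labels them accordingly.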

In [None]:
#initialize text classifier using Passive Aggressive classification system on tfidf model

clf = PassiveAggressiveClassifier()
clf.fit(tfidf_train, y_train)
prediction = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, prediction)
print("accuracy P-A on TFIDF: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, prediction, labels=['FAKE', 'REAL'])
plot_confusion_matrix("2", cm, classes=['FAKE', 'REAL'], title="P-A on TFIDF")

plt.show()

In [None]:
#initialize text classifier using Multinomial N-B classification system on count vectorizer model

clf = MultinomialNB()
clf.fit(count_train, y_train)
prediction = clf.predict(count_test)
score = metrics.accuracy_score(y_test, prediction)
print("accuracy N-B on CountVec: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, prediction, labels=['FAKE', 'REAL'])
plot_confusion_matrix("3", cm, classes=['FAKE', 'REAL'], title="N-B on CountVec")

plt.show()

In [None]:
#initialize text classifier using Passive Aggressive classification system on count vectorizer model

clf = PassiveAggressiveClassifier()
clf.fit(count_train, y_train)
prediction = clf.predict(count_test)
score = metrics.accuracy_score(y_test, prediction)
print("accuracy P-A on CountVec: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, prediction, labels=['FAKE', 'REAL'])
plot_confusion_matrix("4", cm, classes=['FAKE', 'REAL'], title="P-A on CountVec")

plt.show()
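
The four cells above repeat the same fit, predict and score steps with different inputs. If you would like to compare the accuracies side by side, the same steps could be written as a loop. The sketch below is one possible way to do it on invented data; the `demo_` names, the sentences and the `random_state=0` setting are assumptions, and accuracy on such a tiny sample says nothing about the real dataset.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for X_train, y_train, X_test and y_test
demo_X_train = [
    "shocking secret aliens cure",
    "miracle cure shocking doctors",
    "senate passed budget bill",
    "court ruling budget vote",
]
demo_y_train = ["FAKE", "FAKE", "REAL", "REAL"]
demo_X_test = ["shocking miracle cure", "senate budget vote"]
demo_y_test = ["FAKE", "REAL"]

# The two ways of turning text into numbers, and the two classifiers
demo_vectorizers = {"TFIDF": TfidfVectorizer(), "CountVec": CountVectorizer()}
demo_classifiers = {
    "N-B": lambda: MultinomialNB(),
    "P-A": lambda: PassiveAggressiveClassifier(random_state=0),
}

demo_results = {}
for vec_name, vectorizer in demo_vectorizers.items():
    # Learn the vocabulary from the training text, then reuse it for testing
    train_features = vectorizer.fit_transform(demo_X_train)
    test_features = vectorizer.transform(demo_X_test)
    for clf_name, make_clf in demo_classifiers.items():
        # Build a fresh classifier for every combination
        model = make_clf()
        model.fit(train_features, demo_y_train)
        prediction = model.predict(test_features)
        score = metrics.accuracy_score(demo_y_test, prediction)
        demo_results["%s on %s" % (clf_name, vec_name)] = score

for name, score in demo_results.items():
    print("accuracy %s: %0.3f" % (name, score))
```

Replacing the `demo_` data with `X_train`, `y_train`, `X_test` and `y_test` from the earlier cells would print the four accuracies from this tutorial in one place.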