# In this notebook, we document the necessary basic exploratory analysis of the Fake and Real News Dataset provided by Kaggle. After that, we will attempt to apply various Machine Learning Models to classify our data into Real or Fake news based on their title text. The best parameters for each model is obtained.

### IMPORT ALL MODULES
##### All relevant modules used for Machine Learning analysis is imported in this section.

In [54]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
import time
import pandas as pd
import numpy as np

### CREATE THE DATASET 
##### We obtain 2 seperate data csv files representing all real or fake data points.

In [2]:
realdataobj = pd.read_csv("newsdataset/True.csv")
fakedataobj = pd.read_csv("newsdataset/Fake.csv")

### VISUALISE THE DATASET
##### We plot the dataframe of the raw data file to have a glimpse of what it looks like. Upon plot, we figure out that there are 4 columns within the dataframe. We only utilise the "title" column in this classification analysis. 

In [3]:
print("DATASET LOOKS LIKE : ")
print(realdataobj.head(4))

DATASET LOOKS LIKE : 
                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   

                 date  
0  December 31, 2017   
1  December 29, 2017   
2  December 31, 2017   
3  December 30, 2017   


##### The data type of the title column entries belongs to the String Class. As such, we can potentially use a vectorizer to obtain word counts and utilise word counts as input data for classification analysis.

In [13]:
type(realdataobj.title[0])

str

### MEASURE OUR RAW DATASET
##### The length of the 2 data files amount to over 44'000 data points, which by industry standards is moderately huge. We have to consider the computational time of our models later.

In [14]:
print("- ", len(realdataobj.index), " Real News")
print("- ", len(fakedataobj.index), " Fake News")

-  21417  Real News
-  23481  Fake News


### DATA CLEANING
##### We will carry out the necessary procedures in order to ensure the integrity of our data is upheld. More specifically, we will scan for any null data points with regards to the "title" column. No null columns are detected at all.

In [20]:
for columns in realdataobj.columns:
    if pd.isnull(columns):
        print("Column ", columns, " has null values.")
print("Search completed")

Search completed


### OBTAIN WORKING DATA SET WITH RELEVANT COLUMNS
##### We will extract the "title" column from both real and fake data files. Next, a new column called "Status" will be created for each data point holding their respective classification targets such as "Real" or "Fake". We proceed by merging both data sets horizontally and creating our numpy array matrix from the values. Like every machine learning model preparation, 2 matrices containing our data and our target is derived and this will be followed by a further split into training and test sets.

In [4]:
realdataobj["Status"] = "Real"
fakedataobj["Status"] = "Fake"
joineddf = pd.concat([realdataobj, fakedataobj])
joinarray = joineddf.values
joindata = joinarray[:,0]
jointarget = joinarray[:,-1]
xtrain, xtest, ytrain, ytest = train_test_split(joindata, jointarget, random_state = 8)

### VECTORISE COUNTS OF EACH WORD IN DATASET
##### Our machine learning models will be trained based on frequency of word appearance in each news article title. Each unique word will be awarded a column, containing the number of appearances in each data point title. For instance, "word A" that appear 10 times in 2 seperate titles will have the value of 10 in the "word A" column of both of their respective rows. This creates a sparse matrix of integers, sized at about 33673 titles and 19196 unique words/columns. 

In [6]:
countvec = CountVectorizer().fit(xtrain)
xtrainvec = countvec.transform(xtrain)
print("\n",repr(xtrainvec))
words = countvec.get_feature_names()
xtestvec = countvec.transform(xtest)
print(words[2000:2020])


 <33673x19196 sparse matrix of type '<class 'numpy.int64'>'
	with 409456 stored elements in Compressed Sparse Row format>
['bharara', 'bhumibol', 'bi', 'biafra', 'bias', 'biased', 'bibi', 'bible', 'biblical', 'bicker', 'bicycle', 'bid', 'biden', 'bids', 'big', 'bigger', 'biggest', 'biggie', 'bigly', 'bigot']


### ALL MACHINE LEARNING MODELS INSTANTIATED
##### We will begin our implementation of the models by first instantiating objects of all the relevant classification algorithms. All these models are capable of using word counts to classify if a news article is likely real or fake. Some default parameters are included within instantiation while variable parameters that affect accuracy of modelling is specified in the parameter grid. We will attempt to obtain the best parameters for each algorithm using GridSearch with Cross Validation.

In [73]:
modelnames = ["K Nearest Neighbour", "Logistic Regression", "Decision Tree", "Random Forest", "Kernel SVC", "Neural Network MLP"]
modellist = [KNeighborsClassifier(n_jobs = 4), LogisticRegression(max_iter = 10000, n_jobs = 4), DecisionTreeClassifier(), RandomForestClassifier(n_jobs = 4), SVC(), MLPClassifier()]
param_gridlist = [{'n_neighbors': [3, 4, 5, 6, 7]},{'C': [0.001, 0.01, 0.1, 1, 10]}, {'max_depth': [90, 100, 110]}, {'n_estimators': [80]}, {'C': [0.001, 0.01, 0.1, 1, 10]}, {'hidden_layer_sizes': [[4], [5],[8], [10]]}]

#### BEST OF K NEAREST NEIGHBOURS
##### K nearest neighbors is a classification technique that identifies class membership based on their nearest neighboring points. The algorithm concludes when all data points are classified. Out of all the algorithms tested, K nearest neighbours provides the least accuracy in both training and test sets and the code takes 6.7 minutes to run. The optimal number of neighbours is 4, as found out by gridsearch.

In [9]:
start = time.time()
grid = GridSearchCV(modellist[0], param_gridlist[0], cv=5, n_jobs = 4)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[0], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
xtestvec = countvec.transform(xtest)
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))
end = time.time()


Best K Nearest Neighbour ML algorithm best parameters : 
Best cross-validation score: 0.79
Best parameters:  {'n_neighbors': 4}
Train Set Accuracy : 
0.8990585929379622
Test Set Accuracy : 
0.8038307349665924


### BEST OF LOGISTIC REGRESSIONS
##### Logistic Regression is a popular classification technique that builds on Linear Regression with binary output value based on inequalities. On a binary classsification problem like ours, I expect logistic regression to perform very well in terms of accuracy of prediction. It proves to be the case as it gives an extremely high cross validation score of 0.96, and an equally impressive test set accuracy. On my run, Logistic Regression code is the fastest algorithm to execute at only 13 seconds, with C = 10 as the best regularisation parameter.

In [12]:
start = time.time()
grid = GridSearchCV(modellist[1], param_gridlist[1], cv=5, n_jobs = 4)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[1], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
xtestvec = countvec.transform(xtest)
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))
end = time.time()


Best Logistic Regression ML algorithm best parameters : 
Best cross-validation score: 0.96
Best parameters:  {'C': 10}
Train Set Accuracy : 
0.9993169601758085
Test Set Accuracy : 
0.9614253897550111


### BEST OF DECISION TREE CLASSIFIERS
##### Decision Trees offer comprehensive classification analysis using branch decisions and predictions to seperate data points into classes. Trees are susceptible to overfitting on the training set if the max depth of their roots is not regularised enough. Decision Trees provide disappointing cross validation marks on training, but moderate accuracy on both sets with an optimal max_depth of 100. It takes 31 seconds to run the gridsearch.

In [31]:
grid = GridSearchCV(modellist[2], param_gridlist[2], cv=5, n_jobs = 4)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[2], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
xtestvec = countvec.transform(xtest)
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Decision Tree ML algorithm best parameters : 
Best cross-validation score: 0.90
Best parameters:  {'max_depth': 100}
Train Set Accuracy : 
0.9735990259258159
Test Set Accuracy : 
0.9128730512249443


### BEST OF RANDOM FORESTS
##### Random Forests is basicially a conglomeration of multiple decision trees to obtain averages. This is often attributed to decision biases and randomisation that can hinder individual decision trees and their outputs. The larger the number of trees built, the longer the computation time but the more accurate the decision because of a more representative average. The highest accuracy here occurs when we allow the trees to grow infinitely with an optimal number of 80 trees in total. Random Forest here takes 2 minutes to execute.

In [52]:
grid = GridSearchCV(modellist[3], param_gridlist[3], cv=5, n_jobs = 4)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[3], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
xtestvec = countvec.transform(xtest)
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Random Forest ML algorithm best parameters : 
Best cross-validation score: 0.95
Best parameters:  {'n_estimators': 80}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.9506458797327394


## BEST OF KERNEL SVM CLASSIFIER
##### Kernelised Standard Vector Machines (SVMs) are a class of mathematical models that transform data inputs into centralised vectors. These vectors use magnitude(radial) and direction to determine the class of each data point. It is a high level machine learning algorithm that is known to be computationally expensive and parameter dependant, but produces excellent accuracies. For consistency across all algorithms, our data is not initially scaled down. However, this causes early data convergence for SVM. As such, 2 variants of this algorithm is ran. The non standardised dataset variant runs for a staggering 29 mins giving best parameters of C = 10 and the highest accuracy of all algorithms I've obtained despite overfitting. The standardised variant takes even longer for 17 mins to give a more unsatisfactory accuracy.

In [64]:
grid = GridSearchCV(modellist[4], param_gridlist[4], cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[4], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
xtestvec = countvec.transform(xtest)
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Kernel SVC ML algorithm best parameters : 
Best cross-validation score: 0.97
Best parameters:  {'C': 10}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.9692650334075724


In [60]:
scalerobj = StandardScaler(with_mean = False)
xtrainvecscaled = scalerobj.fit_transform(xtrainvec)
xtestvec = countvec.transform(xtest)
xtestvecscaled = scalerobj.fit_transform(xtestvec)

In [61]:
grid = GridSearchCV(modellist[4], param_gridlist[4], cv=5, n_jobs = -1)
grid.fit(xtrainvecscaled, ytrain)
print("STANDARDISED AND SCALED :")
print("\nBest", modelnames[4], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvecscaled, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvecscaled, ytest))
end = time.time()



STANDARDISED AND SCALED :

Best Kernel SVC ML algorithm best parameters : 
Best cross-validation score: 0.91
Best parameters:  {'C': 10}
Train Set Accuracy : 
0.9985151308169751
Test Set Accuracy : 
0.94815144766147


### BEST OF NEURAL NETWORK MLP CLASSIFIER
##### MultiLayer Perceptron Neural Network is an artificial Neural Network consisting of many hidden layers of neurons to predict and classify data points based on an activation function specified. This is an extremely parameter dependant algorithm that can give vastly different weights to the hidden layer units and thus different results altogether. We will run 5 different variants of our MLP model with varying settings. After the execution, we found that
##### Variant 1: Single Layer of neurons, "relu" activation function, "adam" solver, Takes 11 mins to run
##### Variant 2: Double Layer of neurons, "relu" activation function, "adam" solver, Takes 6 mins to run
##### Variant 3: Triple Layer of neurons, "relu" activation function, "adam" solver, Takes 5 mins to run
##### Variant 4: 100 neurons, different alpha values, others default, Takes 1 hour to run
##### Variant 5: 5 neurons, different activation functions and solvers, Takes 12 mins to run

In [74]:
grid = GridSearchCV(modellist[5], param_gridlist[5], cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[5], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Neural Network MLP ML algorithm best parameters : 
Best cross-validation score: 0.96
Best parameters:  {'hidden_layer_sizes': [5]}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.955456570155902


In [78]:
grid = GridSearchCV(modellist[5], {'hidden_layer_sizes': [[7,7], [8,8], [9,9]]}, cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[5], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Neural Network MLP ML algorithm best parameters : 
Best cross-validation score: 0.96
Best parameters:  {'hidden_layer_sizes': [8, 8]}
Train Set Accuracy : 
0.997505419772518
Test Set Accuracy : 
0.9571492204899777


In [94]:
grid = GridSearchCV(modellist[5], {'hidden_layer_sizes': [[10,10,10], [10,10,20], [10,10,30]]}, cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[5], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Neural Network MLP ML algorithm best parameters : 
Best cross-validation score: 0.96
Best parameters:  {'hidden_layer_sizes': [10, 10, 20]}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.9576837416481069


In [97]:
grid = GridSearchCV(modellist[5], {"alpha": [0.00001, 0.0001, 0.001]}, cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[5], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Neural Network MLP ML algorithm best parameters : 
Best cross-validation score: 0.95
Best parameters:  {'alpha': 0.0001}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.9567037861915367


In [100]:
grid = GridSearchCV(modellist[5], {'hidden_layer_sizes': [5], "activation": ["relu", "tanh"], "solver": ["lbfgs", "adam", "sgd"]}, cv=5, n_jobs = -1)
grid.fit(xtrainvec, ytrain)
print("\nBest", modelnames[5], "ML algorithm best parameters : ")
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Train Set Accuracy : ")
print(grid.score(xtrainvec, ytrain))
print("Test Set Accuracy : ")
print(grid.score(xtestvec, ytest))


Best Neural Network MLP ML algorithm best parameters : 
Best cross-validation score: 0.96
Best parameters:  {'activation': 'tanh', 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
Train Set Accuracy : 
1.0
Test Set Accuracy : 
0.958663697104677
