# Predictive Modelling
Using several classification algorithms, let's create some baseline models, then perform feature engineering and hyperparameter tuning to arrive at the most effective predictive model.

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('df_clean.csv')

In [4]:
data.head()

| Unnamed: 0 | A1 | A2 | A3 | A8 | A9 | A10 | A11 | A12 | A14 | A15 | ... | A7_h | A7_j | A7_n | A7_o | A7_v | A7_z | A13_g | A13_p | A13_s | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.661 | 0.002 | -0.957 | -0.291 | 0.955 | 1.157 | -0.288 | -0.919 | 0.107 | -0.195 | ... | -0.5 | -0.108 | -0.076 | -0.054 | 0.831 | -0.108 | 0.322 | -0.108 | -0.3 | 1 |
| 1 | -1.512 | 1.228 | -0.06 | 0.244 | 0.955 | 1.157 | 0.741 | -0.919 | -0.817 | -0.088 | ... | 2.0 | -0.108 | -0.076 | -0.054 | -1.203 | -0.108 | 0.322 | -0.108 | -0.3 | 1 |
| 2 | -1.512 | 0.002 | -0.856 | -0.216 | 0.955 | -0.864 | -0.494 | -0.919 | 0.56 | -0.037 | ... | 2.0 | -0.108 | -0.076 | -0.054 | -1.203 | -0.108 | 0.322 | -0.108 | -0.3 | 1 |
| 3 | 0.661 | 0.002 | -0.647 | 0.457 | 0.955 | 1.157 | 0.535 | 1.088 | -0.486 | -0.195 | ... | -0.5 | -0.108 | -0.076 | -0.054 | 0.831 | -0.108 | 0.322 | -0.108 | -0.3 | 1 |
| 4 | 0.661 | -1.224 | 0.174 | -0.154 | 0.955 | -0.864 | -0.494 | -0.919 | -0.369 | -0.195 | ... | -0.5 | -0.108 | -0.076 | -0.054 | 0.831 | -0.108 | -3.101 | -0.108 | 3.332 | 1 |


## Baseline Models
Baseline models used are:
- Gaussian and Bernoulli Naive Bayes
- Decision Tree
- Random Forest
- K Neighbours Classifier
- Logistic Regression

### Gaussian Naive Bayes

In [6]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [33]:
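# Features are every column except the label.
# Note: this keeps the 'Unnamed: 0' column shown in data.head() as a feature.
X = data.drop(columns='target')
y = data.target.values
# Hold out 30% of the rows for testing; random_state fixes the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)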

gnb = GaussianNB()

y_pred = gnb.fit(X_train, y_train).predict(X_test)

print(f'Gaussian Naive Bayes accuracy score: {round(metrics.accuracy_score(y_test, y_pred), 3)}')

Gaussian Naive Bayes accuracy score: 0.667


### Bernoulli Naive Bayes

In [31]:
from sklearn.naive_bayes import BernoulliNB

In [36]:
bernoulli = BernoulliNB()

y_pred = bernoulli.fit(X_train, y_train).predict(X_test)

print('Bernoulli Naive Bayes accuracy score: ' + str(round(metrics.accuracy_score(y_test, y_pred), 3)))

Bernoulli Naive Bayes accuracy score: 0.816


Bernoulli Naive Bayes probably works better in this instance because of the one-hot encoding performed in the data preprocessing stage. Gaussian Naive Bayes assumes every feature is normally distributed within each class, which binary indicator columns are not. Bernoulli Naive Bayes instead treats each feature as binary (by default it thresholds values at 0) and explicitly penalizes the non-presence of the categorical features created.
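
To illustrate why standardised one-hot columns can still suit the Bernoulli model, here is a small sketch on made-up indicator data (the arrays `onehot` and `y_toy` are hypothetical, not taken from the dataset). It relies on the `binarize` parameter of `BernoulliNB`, whose default of `0.0` maps every value above zero to 1 and everything else to 0.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical one-hot style indicator columns (0/1), not the real dataset.
onehot = rng.integers(0, 2, size=(200, 4)).astype(float)
y_toy = onehot[:, 0].astype(int)  # toy label driven by the first indicator

# Standardising turns each 0 into a negative value and each 1 into a positive one.
scaled = StandardScaler().fit_transform(onehot)

# Thresholding at 0.0 (what BernoulliNB does by default) recovers the indicators.
recovered = (scaled > 0.0).astype(float)
indicators_recovered = np.array_equal(recovered, onehot)
print('Indicators recovered after scaling:', indicators_recovered)

# Both models therefore see the same binary inputs and predict identically.
nb_scaled = BernoulliNB().fit(scaled, y_toy)
nb_raw = BernoulliNB().fit(onehot, y_toy)
same_predictions = np.array_equal(nb_scaled.predict(scaled), nb_raw.predict(onehot))
print('Same predictions on raw and scaled data:', same_predictions)
```

The same thresholding is applied to the continuous columns as well, which reduces them to "above or below the mean" once the data is standardised.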

### Decision Tree Classifier
Using the default values, which include Gini impurity as the measure of node purity and a minimum of 2 samples required to split an internal node.
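
The defaults can be read straight from an unfitted estimator with `get_params()`. This short check is an addition for illustration; the selection of keys printed is arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

# Inspect the default hyperparameters of an unfitted decision tree.
defaults = DecisionTreeClassifier().get_params()
for key in ('criterion', 'min_samples_split', 'min_samples_leaf', 'max_depth'):
    print(key, '=', defaults[key])
```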

In [42]:
from sklearn.tree import DecisionTreeClassifier

In [43]:
dt_clf = DecisionTreeClassifier()

y_pred = dt_clf.fit(X_train, y_train).predict(X_test)

print('Decision Tree accuracy score: ' + str(round(metrics.accuracy_score(y_test, y_pred), 3)))

Decision Tree accuracy score: 0.744


The Decision Tree algorithm might do better without the extensive encoding and normalisation that was performed in the data preprocessing stage.
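
One part of this hypothesis can be probed in isolation. The sketch below, on synthetic data rather than the notebook's dataset, fits the same tree on raw and on standardised features and measures how often the two agree. Because tree splits depend only on the ordering of values within a feature, the agreement is expected to be at or very near 1.0, which would suggest that any gain is more likely to come from the encoding than from the normalisation. The names `X_raw`, `scales` and `agreement` are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately mismatched feature scales (illustrative only).
X_raw, y_raw = make_classification(n_samples=500, n_features=10, random_state=0)
scales = np.array([1, 10, 100, 1000, 1, 10, 100, 1000, 1, 10], dtype=float)
X_raw = X_raw * scales

Xtr, Xte, ytr, yte = train_test_split(X_raw, y_raw, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(Xtr)

# Same tree settings and seed; only the feature scaling differs.
tree_raw = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(Xtr), ytr)

# Fraction of test rows on which the two trees make the same prediction.
agreement = np.mean(tree_raw.predict(Xte) == tree_scaled.predict(scaler.transform(Xte)))
print(f'Prediction agreement between raw and scaled trees: {agreement:.3f}')
```

This says nothing about the effect of one-hot encoding itself, which would need the pre-encoding data to test.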

### Random Forest Classifier
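Using the default values: a forest of Decision Trees, each trained on a bootstrap sample of the training data (bagging), with only a random subset of the features considered at each split (subspace sampling).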

In [45]:
from sklearn.ensemble import RandomForestClassifier

In [46]:
random_forest = RandomForestClassifier()

y_pred = random_forest.fit(X_train, y_train).predict(X_test)

print('Random Forest accuracy score: ' + str(round(metrics.accuracy_score(y_test, y_pred), 3)))

Random Forest accuracy score: 0.836
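
To make the two ingredients named above concrete, the following sketch builds a small forest by hand on synthetic data. It is a simplified illustration and not how `RandomForestClassifier` is implemented internally: the forest size `n_trees` is an arbitrary choice, the data comes from `make_classification`, and the trees are combined by a hard majority vote, whereas scikit-learn's forest averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical forest size; odd, so a 0/1 vote cannot tie
trees = []
for i in range(n_trees):
    # Bagging: draw a bootstrap sample (with replacement) of the training rows.
    rows = rng.integers(0, len(Xtr), size=len(Xtr))
    # Subspace sampling: each split considers only sqrt(n_features) random features.
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    trees.append(tree.fit(Xtr[rows], ytr[rows]))

# Majority vote across the trees (labels are 0 and 1 in this toy problem).
votes = np.mean([t.predict(Xte) for t in trees], axis=0)
forest_pred = (votes > 0.5).astype(int)
forest_accuracy = np.mean(forest_pred == yte)
print(f'Hand-built forest accuracy score: {round(forest_accuracy, 3)}')
```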


### K Neighbours Classifier

With default values: the 5 nearest neighbours, measured by the Minkowski distance (with p = 2, which is the Euclidean distance), with uniform weighting.

In [53]:
from sklearn.neighbors import KNeighborsClassifier

In [54]:
knn = KNeighborsClassifier()

y_pred = knn.fit(X_train, y_train).predict(X_test)

print('K Neighbors Classifier accuracy score: ' + str(round(metrics.accuracy_score(y_test, y_pred), 3)))

K Neighbors Classifier accuracy score: 0.783
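
Since the number of neighbours affects the score, it can help to spell the defaults out and try more than one value of k. The sketch below does this on synthetic data (an assumption; it does not reproduce the score above). It first confirms that a default `KNeighborsClassifier()` behaves like one with the settings written explicitly, then compares k = 3 with k = 5. The names `default_knn`, `explicit_knn` and `knn_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The default estimator and one with the defaults written out explicitly.
default_knn = KNeighborsClassifier().fit(Xtr, ytr)
explicit_knn = KNeighborsClassifier(
    n_neighbors=5, weights='uniform', metric='minkowski', p=2
).fit(Xtr, ytr)
same_predictions = bool((default_knn.predict(Xte) == explicit_knn.predict(Xte)).all())
print('Default matches explicit settings:', same_predictions)

# Compare two neighbourhood sizes on the same split.
knn_scores = {}
for k in (3, 5):
    knn_scores[k] = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr).score(Xte, yte)
    print(f'{k} Nearest Neighbors accuracy score: {round(knn_scores[k], 3)}')
```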
