## II. Model creation and Algorithm Testing

In this notebook, I shall used prepared data from the previous notebook to create and train a model.

To create the model, I shall check for collinearity (and remove redundant columns) and apply Smote to deal with data imbalance (on the label). 

The model will be trained using the following algorithms: 
    * Linear Regression (l1 penalty)
    * Linear Regression (l2 penalty)
    * Linear Discriminant Analysis 
    * Gaussian Naive Bayes
    * Decision Tree Classifier
    * Random Forest Classifier
    * SVM classifier

First, the algorithms will be tested without tunning its parameters

Secondly, I will find the best paramaters for some of the above algorithms and re-train the model to see if there was an improvement. 

Then, I will compare resulst and select the best one.

Conclusion and cross validation

### 1 Importing libraries

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from glob import glob
from sklearn.model_selection import train_test_split

from sklearn.ensemble import AdaBoostClassifier
from numpy import loadtxt
import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')


%matplotlib inline

### 2. Import datasets 

In [2]:
# Merge the two datasets
X_train=pd.read_csv('X_train.csv',sep= ',')
X_train.shape

(807, 22)

In [3]:
X_test=pd.read_csv('X_test.csv',sep= ',')
X_test.shape

(143, 22)

In [4]:
# Merge the two datasets
y_train=pd.read_csv('y_train.csv',sep= ',')
y_train.shape

(807, 1)

In [5]:
# Merge the two datasets
y_test=pd.read_csv('y_test.csv',sep= ',')
y_test.shape

(143, 1)

### 3. Data balancing and collinearity


#### 3.3 Smote for data imbalance

In [6]:
train_input = X_train
train_output = y_train

In [7]:
from imblearn.over_sampling import SMOTE
from collections import Counter
print('Original dataset shape {}'.format(Counter(train_output)))
smt = SMOTE(random_state=20)
train_input_new, train_output_new = smt.fit_sample(train_input, train_output)
print('New dataset shape {}'.format(Counter(train_output_new)))

Original dataset shape Counter({'CreditStatus': 1})
New dataset shape Counter({0: 439, 1: 439})


### 4. Algorithm Testing 

**Logistic Regression**

**LR classifier (l1) penalty**

In [8]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=123)


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression(fit_intercept=True, penalty='l1')
logreg.fit(X_train, Y_train)

logregprediction=logreg.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))

Lasso Accuracy: 0.7622377622377622


**LR Classifier (l2) penalty**

In [9]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg2 = LogisticRegression(fit_intercept=True, penalty='l2')
logreg2.fit(X_train, Y_train)
logreg2prediction=logreg2.predict(X_test)
#evaluation(Accuracy)
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))

Lasso Accuracy: 0.7622377622377622


**Linear Discriminant Analysis** 

In [10]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, Y_train)
lda.fit(X_train, Y_train)


lda_prediction=lda.predict(X_test)
#evaluation(Accuracy)
print("LDA Accuracy:",metrics.accuracy_score(lda_prediction,y_test))

LDA Accuracy: 0.7762237762237763


**Gaussian Naive Bayes** 

In [11]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, Y_train)

gnbprediction=logreg2.predict(X_test)
print("GNB Accuracy:",metrics.accuracy_score(gnbprediction,y_test))

GNB Accuracy: 0.7622377622377622


**Decision Tree Classifier**

In [12]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.tree import DecisionTreeClassifier
#making the instance
dtc= DecisionTreeClassifier(random_state=1234)
#learning
dtc.fit(X_train,Y_train)
#Prediction
dtcprediction=dtc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Decision Tree Classifier Accuracy:",metrics.accuracy_score(dtcprediction,y_test))



Decision Tree Classifier Accuracy: 0.6993006993006993


**Random Forest Classifier**

In [13]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.ensemble import RandomForestClassifier
#making the instance
rfc=RandomForestClassifier(n_jobs=-1,random_state=123)
#learning
rfc.fit(X_train,Y_train)
#Prediction
rfcprediction=rfc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Random Forest Classifier Accuracy:",metrics.accuracy_score(rfcprediction,y_test))


Random Forest Classifier Accuracy: 0.7972027972027972


**SVM Classifier**

In [14]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn import svm
#making the instance
svc = svm.SVC(random_state=123)
#learning
svc.fit(X_train,Y_train)
#Prediction
svcprediction=svc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Accuracy of SVM classifier:",metrics.accuracy_score(svcprediction,y_test))


Accuracy of SVM classifier: 0.7412587412587412


**K-NearestNeighbours Classifier**

In [15]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.neighbors import KNeighborsClassifier
#making the instance
knn = KNeighborsClassifier()
#learning
knn.fit(X_train,Y_train)
#Prediction
knnprediction=knn.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("KNN classifier Accuracy:",metrics.accuracy_score(knnprediction,y_test))




KNN classifier Accuracy: 0.7622377622377622


**Ada Boost**

In [16]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

ada_model = AdaBoostClassifier(n_estimators=200, random_state=44)


#learning
ada_model.fit(X_train,Y_train)
#Prediction
ada_modelprediction=ada_model.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("NN classifier Accuracy:",metrics.accuracy_score(ada_modelprediction,y_test))



NN classifier Accuracy: 0.8041958041958042


In [17]:
### 6. Compare models (baseline vs tuned models)
print("KNN classifier Accuracy:",metrics.accuracy_score(knnprediction,y_test))
print("Random Forest Classifier Accuracy:",metrics.accuracy_score(rfcprediction,y_test))
print("Accuracy of SVM classifier:",metrics.accuracy_score(svcprediction,y_test))
print("Decision Tree Classifier Accuracy:",metrics.accuracy_score(dtcprediction,y_test))
print("GNB Accuracy:",metrics.accuracy_score(gnbprediction,y_test))
print("LDA Accuracy:",metrics.accuracy_score(lda_prediction,y_test))
print("Ridge Accuracy:",metrics.accuracy_score(logreg2prediction,y_test))
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))




KNN classifier Accuracy: 0.7622377622377622
Random Forest Classifier Accuracy: 0.7972027972027972
Accuracy of SVM classifier: 0.7412587412587412
Decision Tree Classifier Accuracy: 0.6993006993006993
GNB Accuracy: 0.7622377622377622
LDA Accuracy: 0.7762237762237763
Ridge Accuracy: 0.7622377622377622
Lasso Accuracy: 0.7622377622377622


## 5. Conclusion and Cross Validation

Sector Random Forest Classifier as the best classifier.

Now in the next notebook, I shall try to improve it by:
- Selecting the best feature set
- Tuning its parameters

In [18]:
# Cross Validation 
print("Cross Validation Score: {:.2%}".format(np.mean(cross_val_score(rfc, X_train, Y_train, cv=10))))
lda.fit(X_train, Y_train)
print("Dev Set score: {:.2%}".format(rfc.score(X_dev, Y_dev)))

Cross Validation Score: 80.49%
Dev Set score: 83.52%
