## II. Model creation and Algorithm Testing

In this notebook, I shall used prepared data from the previous notebook to create and train a model.

To create the model, I shall check for collinearity (and remove redundant columns) and apply Smote to deal with data imbalance (on the label). 

The model will be trained using the following algorithms: 
    * Linear Regression (l1 penalty)
    * Linear Regression (l2 penalty)
    * Linear Discriminant Analysis 
    * Gaussian Naive Bayes
    * Decision Tree Classifier
    * Random Forest Classifier
    * SVM classifier
    * KNN classifier

First, the algorithms will be tested without tunning its parameters

Secondly, I will find the best paramaters for some of the above algorithms and re-train the model to see if there was an improvement. 

Then, I will compare resulst and select the best one.

Conclusion and cross validation

### 1 Importing libraries

In [1]:
# import libraries
import pandas as pd
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model
import sklearn.metrics as sklm
import numpy as np
import numpy.random as nr
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import math
from glob import glob
from sklearn.model_selection import train_test_split



%matplotlib inline
%matplotlib inline

### 2. Import datasets 

In [2]:
# Merge the two datasets
df=pd.read_csv('dfprepared.csv')
df.shape

(937, 36)

### 3. Model creation, data balancing and collinearity

#### 3.1 Model Creation

In [3]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['CreditStatus'], axis=1),
    df['CreditStatus'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape


((655, 35), (282, 35))

#### 3.3 Smote for data imbalance

In [4]:
train_input = X_train
train_output = y_train

In [5]:
from imblearn.over_sampling import SMOTE
from collections import Counter
print('Original dataset shape {}'.format(Counter(train_output)))
smt = SMOTE(random_state=20)
train_input_new, train_output_new = smt.fit_sample(train_input, train_output)
print('New dataset shape {}'.format(Counter(train_output_new)))

Original dataset shape Counter({1: 369, 0: 286})
New dataset shape Counter({1: 369, 0: 369})




### 4. Algorithm Testing 

**Logistic Regression**

**LR classifier (l1) penalty**

In [6]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression(fit_intercept=True, penalty='l1')
logreg.fit(X_train, Y_train)

logregprediction=logreg.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))

Lasso Accuracy: 0.7375886524822695




**LR Classifier (l2) penalty**

In [7]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg2 = LogisticRegression(fit_intercept=True, penalty='l2')
logreg2.fit(X_train, Y_train)
logreg2prediction=logreg2.predict(X_test)
#evaluation(Accuracy)
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))

Lasso Accuracy: 0.7375886524822695




**Linear Discriminant Analysis** 

In [8]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, Y_train)
lda.fit(X_train, Y_train)


lda_prediction=lda.predict(X_test)
#evaluation(Accuracy)
print("LDA Accuracy:",metrics.accuracy_score(lda_prediction,y_test))

LDA Accuracy: 0.7269503546099291




**Gaussian Naive Bayes** 

In [9]:
#Given smote, we have to do a little adjustment
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)


from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, Y_train)

gnbprediction=logreg2.predict(X_test)
print("GNB Accuracy:",metrics.accuracy_score(gnbprediction,y_test))

GNB Accuracy: 0.7198581560283688


**Decision Tree Classifier**

In [10]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.tree import DecisionTreeClassifier
#making the instance
dtc= DecisionTreeClassifier(random_state=1234)
#learning
dtc.fit(X_train,Y_train)
#Prediction
dtcprediction=dtc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Decision Tree Classifier Accuracy:",metrics.accuracy_score(dtcprediction,y_test))



Decision Tree Classifier Accuracy: 0.7375886524822695


**Random Forest Classifier**

In [11]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.ensemble import RandomForestClassifier
#making the instance
rfc=RandomForestClassifier(n_jobs=-1,random_state=123)
#learning
rfc.fit(X_train,Y_train)
#Prediction
rfcprediction=rfc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Random Forest Classifier Accuracy:",metrics.accuracy_score(rfcprediction,y_test))




Random Forest Classifier Accuracy: 0.8120567375886525


**SVM Classifier**

In [12]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn import svm
#making the instance
svc = svm.SVC(random_state=123)
#learning
svc.fit(X_train,Y_train)
#Prediction
svcprediction=svc.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("Accuracy of SVM classifier:",metrics.accuracy_score(svcprediction,y_test))


Accuracy of SVM classifier: 0.7163120567375887




**K-NearestNeighbours Classifier**

In [13]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
from sklearn.neighbors import KNeighborsClassifier
#making the instance
knn = KNeighborsClassifier()
#learning
knn.fit(X_train,Y_train)
#Prediction
knnprediction=knn.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("KNN classifier Accuracy:",metrics.accuracy_score(knnprediction,y_test))




KNN classifier Accuracy: 0.7056737588652482


**XGBoost Classifier**

In [14]:
#remember becaus of the new data after imbalance adjustment, we use the following
X_train, X_dev, Y_train, Y_dev = train_test_split(train_input_new, train_output_new, test_size=0.20, random_state=0)

#importing module
import xgboost as xgb

#making the instance
xgb = xgb.XGBClassifier()
#learning
xgb.fit(X_train,Y_train)
#Prediction
xgbprediction=knn.predict(X_test)
#importing the metrics module
from sklearn import metrics
#evaluation(Accuracy)
print("XGB classifier Accuracy:",metrics.accuracy_score(xgbprediction,y_test))





XGB classifier Accuracy: 0.7056737588652482


In [15]:
### 6. Compare models (baseline vs tuned models)
print("KNN classifier Accuracy:",metrics.accuracy_score(knnprediction,y_test))
print("Random Forest Classifier Accuracy:",metrics.accuracy_score(rfcprediction,y_test))
print("Accuracy of SVM classifier:",metrics.accuracy_score(svcprediction,y_test))
print("Decision Tree Classifier Accuracy:",metrics.accuracy_score(dtcprediction,y_test))
print("GNB Accuracy:",metrics.accuracy_score(gnbprediction,y_test))
print("LDA Accuracy:",metrics.accuracy_score(lda_prediction,y_test))
print("Ridge Accuracy:",metrics.accuracy_score(logreg2prediction,y_test))
print("Lasso Accuracy:",metrics.accuracy_score(logregprediction,y_test))
print("XGB classifier Accuracy:",metrics.accuracy_score(xgbprediction,y_test))



KNN classifier Accuracy: 0.7056737588652482
Random Forest Classifier Accuracy: 0.8120567375886525
Accuracy of SVM classifier: 0.7163120567375887
Decision Tree Classifier Accuracy: 0.7375886524822695
GNB Accuracy: 0.7198581560283688
LDA Accuracy: 0.7269503546099291
Ridge Accuracy: 0.7198581560283688
Lasso Accuracy: 0.7375886524822695
XGB classifier Accuracy: 0.7056737588652482


## 5. Conclusion and Cross Validation

Random Forest Classifier as the best classifier.

Now in the next notebook, I shall try to improve it by:
- Selecting the best feature set
- Tuning its parameters

In [16]:
# Cross Validation 
print("Cross Validation Score: {:.2%}".format(np.mean(cross_val_score(rfc, X_train, Y_train, cv=10))))
lda.fit(X_train, Y_train)
print("Dev Set score: {:.2%}".format(rfc.score(X_dev, Y_dev)))

Cross Validation Score: 76.79%
Dev Set score: 77.03%


