# <font color='#2F4F4F'>1. Defining the Question</font>

## a) Specifying the Data Analysis Question

What is the question or problem you are trying to solve?

<!-- Predicting whether a loan will be paid off, to help manage defaults and non-performing loans -->

## b) Defining the Metric for Success

How will you know your project has succeeded?

In [None]:
# Accurately predicting the repayment status of a loan application, to guide the decision to issue or decline the loan

## c) Understanding the context 

What is the background information surrounding the research question?

In [None]:
# Use historical loan data to build a model that predicts how likely an applied-for loan is to be repaid

## d) Recording the Experimental Design

What steps will you take to answer the research question?

## e) Data Relevance

Is the data provided relevant to the research question?

# <font color='#2F4F4F'>2. Data Cleaning & Preparation</font>

In [1]:
# load libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

In [3]:
# load dataset
df = pd.read_csv('https://bit.ly/KoppoKoppoDS')
df.head(3)

Unnamed: 0,Selected,LoanNr_ChkDgt,Name,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,New,RealEstate,Portion,daysterm,xx,Default
0,0,1004285007,SIMPLEX OFFICE SOLUTIONS,92801,532420,15074,2001,36,1,1.0,0,0,1,0,Y,N,,15095.0,32812,0,P I F,0,30000,15000,0,0,0.5,1080,16175.0,0
1,1,1004535010,DREAM HOME REALTY,90505,531210,15130,2001,56,1,1.0,0,0,1,0,Y,N,,15978.0,30000,0,P I F,0,30000,15000,0,0,0.5,1680,17658.0,0
2,0,1005005006,"Winset, Inc. dba Bankers Hill",92103,531210,15188,2001,36,10,1.0,0,0,1,0,Y,N,,15218.0,30000,0,P I F,0,30000,15000,0,0,0.5,1080,16298.0,0


In [7]:
# Check data types
df.dtypes

Selected               int64
LoanNr_ChkDgt          int64
Name                  object
Zip                    int64
ICS                    int64
ApprovalDate           int64
ApprovalFY             int64
Term                   int64
NoEmp                  int64
NewExist             float64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
UrbanRural             int64
RevLineCr             object
LowDoc                object
ChgOffDate           float64
DisbursementDate     float64
DisbursementGross      int64
BalanceGross           int64
MIS_Status            object
ChgOffPrinGr           int64
GrAppv                 int64
SBA_Appv               int64
New                    int64
RealEstate             int64
Portion              float64
daysterm               int64
xx                   float64
Default                int64
dtype: object

In [None]:
# load glossary
glossary = pd.read_csv('Koppokoppo - Glossary - t0001-10.1080_10691898.2018.1434342 (1).csv')
glossary

Our target variable appears to be 'MIS_Status', which records whether a loan was paid in full ('P I F') or charged off ('CHGOFF').

Using the glossary, we will remove the following variables either because they are not needed or because their definitions are not included in the glossary:
- Selected
- LoanNr_ChkDgt
- Name
- New
- RealEstate
- Portion
- daysterm
- xx
- Default

In [8]:
# dropping unneeded columns
df = df.drop(columns = ['Selected', 'LoanNr_ChkDgt', 'Name', 'New', 'RealEstate', 
                        'Portion', 'daysterm', 'xx', 'Default'])
df.shape

(2102, 21)

In [9]:
# dropping duplicates, if any
df.drop_duplicates(inplace = True)
df.shape

(2102, 21)

In [10]:
# checking for missing values
df.isnull().sum()

Zip                     0
ICS                     0
ApprovalDate            0
ApprovalFY              0
Term                    0
NoEmp                   0
NewExist                1
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr               2
LowDoc                  3
ChgOffDate           1405
DisbursementDate        3
DisbursementGross       0
BalanceGross            0
MIS_Status              0
ChgOffPrinGr            0
GrAppv                  0
SBA_Appv                0
dtype: int64

'ChgOffDate' has missing values for more than half of the records so we will drop that variable. As for the rest of the variables with missing values, we will remove only the records that have null values.
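
The drop-a-column versus drop-a-row decision described above can be driven by the share of missing values per column. The following is a minimal sketch on a toy frame; the 50% threshold and the toy values are assumptions for illustration, not figures taken from the dataset.

```python
import numpy as np
import pandas as pd

# toy stand-in for the loans frame (values are illustrative only)
toy = pd.DataFrame({
    "ChgOffDate": [np.nan, np.nan, 17000.0, np.nan],
    "LowDoc": ["N", "Y", None, "N"],
    "Term": [36, 56, 36, 240],
})

missing_share = toy.isna().mean()   # fraction of nulls per column
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

# drop the sparse columns first, then the few rows that still contain nulls
cleaned = toy.drop(columns=mostly_missing).dropna()
print(mostly_missing)
print(cleaned.shape)
```

Dropping the sparse column first matters: calling `dropna()` before it would discard every row where 'ChgOffDate' is empty.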

In [11]:
df.drop(columns = ['ChgOffDate'], inplace = True)

df.dropna(inplace = True)
df.isna().sum()

Zip                  0
ICS                  0
ApprovalDate         0
ApprovalFY           0
Term                 0
NoEmp                0
NewExist             0
CreateJob            0
RetainedJob          0
FranchiseCode        0
UrbanRural           0
RevLineCr            0
LowDoc               0
DisbursementDate     0
DisbursementGross    0
BalanceGross         0
MIS_Status           0
ChgOffPrinGr         0
GrAppv               0
SBA_Appv             0
dtype: int64

In [12]:
# check datatypes
df.dtypes

Zip                    int64
ICS                    int64
ApprovalDate           int64
ApprovalFY             int64
Term                   int64
NoEmp                  int64
NewExist             float64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
UrbanRural             int64
RevLineCr             object
LowDoc                object
DisbursementDate     float64
DisbursementGross      int64
BalanceGross           int64
MIS_Status            object
ChgOffPrinGr           int64
GrAppv                 int64
SBA_Appv               int64
dtype: object

In [13]:
# changing the 'ApprovalDate' and 'DisbursementDate' variables to datetime
df['ApprovalDate'] = pd.to_datetime(df['ApprovalDate'], unit = 'd')
df['DisbursementDate'] = pd.to_datetime(df['DisbursementDate'], unit = 'd')
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2011-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2011-05-01,32812,0,P I F,0,30000,15000
1,90505,531210,2011-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2013-09-30,30000,0,P I F,0,30000,15000
2,92103,531210,2011-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2011-09-01,30000,0,P I F,0,30000,15000
3,92108,531312,2013-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2013-01-31,50000,0,P I F,0,50000,25000
4,91345,531390,2016-02-09,2006,240,65,1.0,3,65,1,1,0,N,2016-04-12,343000,0,P I F,0,343000,343000


In [14]:
# our datetime conversion is 10 years ahead. We'll offset this
df['ApprovalDate'] = df['ApprovalDate'] - pd.DateOffset(years = 10)
df['DisbursementDate'] = df['DisbursementDate'] - pd.DateOffset(years = 10)
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2001-05-01,32812,0,P I F,0,30000,15000
1,90505,531210,2001-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2003-09-30,30000,0,P I F,0,30000,15000
2,92103,531210,2001-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2001-09-01,30000,0,P I F,0,30000,15000
3,92108,531312,2003-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2003-01-31,50000,0,P I F,0,50000,25000
4,91345,531390,2006-02-09,2006,240,65,1.0,3,65,1,1,0,N,2006-04-12,343000,0,P I F,0,343000,343000
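
A ten-year gap is what one would expect if the integers count days from 1 January 1960 (the convention used by SAS date values) rather than from the Unix epoch of 1 January 1970. That origin is an assumption here; the glossary is not shown confirming it. If it holds, `pd.to_datetime` can apply it directly through its `origin` argument. The sketch below compares both approaches on the first three 'ApprovalDate' values; the results differ by one day because of how leap years fall in the two ten-year spans.

```python
import pandas as pd

raw_days = pd.Series([15074, 15130, 15188])  # first three ApprovalDate values from the data

# assumption: the integers count days since 1960-01-01 (SAS-style dates)
sas_style = pd.to_datetime(raw_days, unit="D", origin="1960-01-01")

# the notebook's approach: Unix epoch, then shift back ten calendar years
shifted = pd.to_datetime(raw_days, unit="D") - pd.DateOffset(years=10)

print(sas_style.dt.strftime("%Y-%m-%d").tolist())
print(shifted.dt.strftime("%Y-%m-%d").tolist())
```

Since only the year of 'DisbursementDate' is kept afterwards, the one-day difference would only matter for dates on 1 January.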


In [15]:
# extracting the year in 'DisbursementDate'
df['DisbursementDate'] = pd.DatetimeIndex(df['DisbursementDate']).year
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2001,32812,0,P I F,0,30000,15000
1,90505,531210,2001-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2003,30000,0,P I F,0,30000,15000
2,92103,531210,2001-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2001,30000,0,P I F,0,30000,15000
3,92108,531312,2003-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2003,50000,0,P I F,0,50000,25000
4,91345,531390,2006-02-09,2006,240,65,1.0,3,65,1,1,0,N,2006,343000,0,P I F,0,343000,343000


In [16]:
# drop 'ApprovalDate' since it is no longer needed
df.drop(columns = ['ApprovalDate'], inplace = True)

In [17]:
# check unique values in each variable to ensure there is no inconsistency
my_cols = df.columns.to_list()

for col in my_cols:
    print("Variable:", col)
    print("Number of unique values:", df[col].nunique())
    print(df[col].unique())
    print()

Variable: Zip
Number of unique values: 810
[92801 90505 92103 92108 91345 95831 90255 90808 92704 94583 91360 94103
 95746 91354 95965 91505 94550 92101 95695 91356 92507 90662 91604 94080
 90305 95030 94901 94707 95037 91208 95630 93610 92656 92545 92708 92868
 92307 93274 93546 92590 92301 91941 91789 92831 92648 92083 91730 94117
 93611 92260 94931 92701 95945 92688 92887 95628 91502 93604 91221 94070
 92879 94526 92115 90210 93010 94546 92807 91301 92056 91311 92130 90036
 94588 91607 90504 91914 92173 90274 90006 92649 90016 90242 91030 91780
 94706 94115 94519 93906 91606 92612 91201 90010 95521 92084 92627 91342
 91364 92411 95337 90211 90266 95678 91601 94523 91709 92553 91776 95010
 92549 90201 94606 90631 92110 92544 92234 90004 90293 90303 92883 91106
 92501 91303 92404 93433 94104 95821 95621 93905 94510 93654 90039 92630
 95603 92009 91762 91362 92660 92026 94541 90706 95361 94089 92504 94301
 92653 95677 95825 90025 90063 95128 95123 94111 92663 93060 91205 92408
 90606 9

The 'BalanceGross' variable has only one unique value (0) so we will remove it.

In [18]:
df.drop(columns = ['BalanceGross'], inplace = True)
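
Rather than reading through the printed unique values, constant columns can be found programmatically. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "BalanceGross": [0, 0, 0],
    "Term": [36, 56, 36],
    "LowDoc": ["N", "N", "Y"],
})

# a column whose value never varies carries no information for a classifier
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```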

We will encode some of our categorical variables for modeling.

In [19]:
df['RevLineCr'] = df['RevLineCr'].replace({'Y' : '1', 'N' : '2', 'T' : '3'})
df['LowDoc'] = df['LowDoc'].replace({'N' : '1', 'Y' : '2', 'S' : '3', 'A' : '4'})
df['MIS_Status'] = df['MIS_Status'].replace({'P I F' : '0', 'CHGOFF' : '1'})

df.head()

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,2001,32812,0,0,30000,15000
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,2003,30000,0,0,30000,15000
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,2001,30000,0,0,30000,15000
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,2003,50000,0,0,50000,25000
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,2006,343000,0,0,343000,343000
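
Note that `replace` leaves any value that is not in the mapping untouched, which is why row 4 above still shows 0 for 'RevLineCr'. A small sketch, using toy values, of how to list the codes a mapping does not cover before encoding:

```python
import pandas as pd

rev = pd.Series(["Y", "N", "0", "T", "Y"], name="RevLineCr")  # toy values
mapping = {"Y": "1", "N": "2", "T": "3"}

encoded = rev.replace(mapping)               # same call as the notebook: unknown codes pass through
unmapped = sorted(set(rev) - set(mapping))   # codes the mapping does not cover
print(encoded.tolist())
print(unmapped)
```

Whether an uncovered code such as '0' should become its own category or be treated as missing is a modeling decision this sketch does not make.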


In [20]:
# changing the variable data types to their appropriate formats
to_obj = ['Zip', 'ICS', 'ApprovalFY', 'NewExist', 'FranchiseCode', 'UrbanRural', 'DisbursementDate']

for item in to_obj:
    df[item] = df[item].astype('object')
    
# confirming our data types have been appropriately modified
df.dtypes

Zip                  object
ICS                  object
ApprovalFY           object
Term                  int64
NoEmp                 int64
NewExist             object
CreateJob             int64
RetainedJob           int64
FranchiseCode        object
UrbanRural           object
RevLineCr            object
LowDoc               object
DisbursementDate     object
DisbursementGross     int64
MIS_Status           object
ChgOffPrinGr          int64
GrAppv                int64
SBA_Appv              int64
dtype: object

There are no other anomalies or inconsistencies with the data so we will now move on to analysis.

In [23]:
df.to_csv('kopokopo_clean.csv', index = False)

df = pd.read_csv('kopokopo_clean.csv')
df.head()

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,2001,32812,0,0,30000,15000
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,2003,30000,0,0,30000,15000
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,2001,30000,0,0,30000,15000
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,2003,50000,0,0,50000,25000
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,2006,343000,0,0,343000,343000


# <font color='#2F4F4F'>3. Data Analysis</font>

Because there are too many variables to carry out individual univariate and bivariate analysis, we will use Pandas Profiling to generate a report for this section.

In [24]:
import pandas_profiling

df.profile_report()

AttributeError: ignored

In [26]:
df.head()

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,MIS_Status,ChgOffPrinGr
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,32812,0,0
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,30000,0,0
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,30000,0,0
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,50000,0,0
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,343000,0,0


From the report, we can note the following correlations:
- DisbursementDate is highly correlated with ApprovalFY
- ApprovalFY is highly correlated with DisbursementDate and SBA_Appv
- GrAppv is highly correlated with DisbursementGross and SBA_Appv
- DisbursementGross is highly correlated with GrAppv and SBA_Appv
- SBA_Appv is highly correlated with DisbursementGross and GrAppv

We will remove DisbursementDate, GrAppv, and SBA_Appv.
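
The correlations listed above can also be checked directly with pandas, without the profiling report. The sketch below uses synthetic data and an assumed threshold of 0.9; neither comes from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
gr_appv = rng.integers(10_000, 500_000, size=200)
toy = pd.DataFrame({
    "GrAppv": gr_appv,
    "DisbursementGross": gr_appv + rng.integers(0, 5_000, size=200),  # nearly identical by construction
    "NoEmp": rng.integers(1, 100, size=200),
})

corr = toy.corr().abs()
# keep the upper triangle so each pair appears once, then filter by the threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]
print(high)
```

From each highly correlated pair, one variable can then be dropped.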

In [41]:
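# errors = 'ignore' keeps this cell re-runnable if the columns were already dropped in an earlier run
df.drop(columns = ['DisbursementDate', 'GrAppv', 'SBA_Appv'], inplace = True, errors = 'ignore')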

# <font color='#2F4F4F'>4. Data Modeling</font>

In [42]:
# split into features (X) and target (Y)
X = df.iloc[:, 0:13].values  # Independent variables
y = df.iloc[:, 13].values          # Dependent variable
print(y)

[0 0 0 ... 0 0 0]
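
Positional slicing depends on the column order staying fixed. Selecting by name is a more robust alternative, sketched here on a toy frame. Treating 'ChgOffPrinGr' as a column to exclude is an assumption: it appears to describe the charged-off amount, which would only be known after the outcome, and the positional slice above leaves it out as well.

```python
import pandas as pd

toy = pd.DataFrame({
    "Term": [36, 56, 240],
    "NoEmp": [1, 1, 65],
    "MIS_Status": [0, 0, 1],
    "ChgOffPrinGr": [0, 0, 12000],
})

target = "MIS_Status"
excluded = ["ChgOffPrinGr"]   # assumed to reflect the loan outcome, so not a valid feature

X_named = toy.drop(columns=[target] + excluded).values   # features
y_named = toy[target].values                             # target
print(X_named.shape, y_named)
```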


In [43]:
# splitting into 75-25 training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [44]:
# scaling our features
from sklearn.preprocessing import MinMaxScaler  
norm = MinMaxScaler().fit(X_train) 
X_train = norm.transform(X_train) 
X_test = norm.transform(X_test)
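
The scaler is fitted on the training set only, so the test set is rescaled with the training minimum and maximum. A small worked example with made-up loan terms shows the consequence: a test value outside the training range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_toy = np.array([[36.0], [56.0], [240.0]])   # e.g. loan terms in months
X_test_toy = np.array([[300.0], [36.0]])

scaler = MinMaxScaler().fit(X_train_toy)            # learns min=36 and max=240 from the training data only
train_scaled = scaler.transform(X_train_toy).ravel()
test_scaled = scaler.transform(X_test_toy).ravel()  # 300 is above the training max, so it maps above 1
print(train_scaled)
print(test_scaled)
```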

In [53]:
# instantiate the classifiers (they are fitted and evaluated in the cells below)

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_classifier = LogisticRegression()

# Decision Tree 
from sklearn.tree import DecisionTreeClassifier
decision_classifier = DecisionTreeClassifier()

# Support Vector Machine
from sklearn.svm import SVC
svm_classifier = SVC()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
naive_classifier = GaussianNB()

# K-Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier()

# Bagging
from sklearn.ensemble import BaggingClassifier
bagging_classifier = BaggingClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier()

# Ada Boosting
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier()

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier()

# XG Boosting
from xgboost import XGBClassifier
xgb_classifier = XGBClassifier()

In [55]:
# fitting each classifier to the training data
from sklearn.metrics import classification_report, confusion_matrix

# Logistic Regression
logistic_classifier.fit(X_train, y_train)

# Decision Tree
decision_classifier.fit(X_train, y_train)

# Support Vector Machine
svm_classifier.fit(X_train, y_train)

# Naive Bayes
naive_classifier.fit(X_train, y_train)

# K-Neighbors
knn_classifier.fit(X_train, y_train)

# Bagging
bagging_classifier.fit(X_train, y_train)

# Random Forest
random_forest_classifier.fit(X_train, y_train)

# Ada Boosting
adaboost.fit(X_train, y_train)

# Gradient Boosting
gradient_boost.fit(X_train, y_train)

# XG Boosting
xgb_classifier.fit(X_train, y_train)

XGBClassifier()

In [60]:
# Testing the models
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
svm_y_prediction = svm_classifier.predict(X_test) 
naive_y_prediction = naive_classifier.predict(X_test)
knn_y_prediction = knn_classifier.predict(X_test) 
bagging_y_prediction = bagging_classifier.predict(X_test)
random_forest_y_prediction = random_forest_classifier.predict(X_test)
adaboost_y_prediction = adaboost.predict(X_test)
gradient_y_prediction = gradient_boost.predict(X_test)
x_gb_y_prediction = xgb_classifier.predict(X_test)



In [61]:
# Accuracy scores
from sklearn.metrics import classification_report, accuracy_score 


print(accuracy_score(logistic_y_prediction, y_test))
print(accuracy_score(decision_y_prediction, y_test))
print(accuracy_score(svm_y_prediction, y_test))
print(accuracy_score(knn_y_prediction, y_test))
print(accuracy_score(naive_y_prediction, y_test))
print(accuracy_score(bagging_y_prediction, y_test))
print(accuracy_score(random_forest_y_prediction, y_test))
print(accuracy_score(adaboost_y_prediction, y_test))
print(accuracy_score(gradient_y_prediction, y_test))
print(accuracy_score(x_gb_y_prediction, y_test))

0.8454198473282443
0.9122137404580153
0.8912213740458015
0.8912213740458015
0.6774809160305344
0.9255725190839694
0.9312977099236641
0.9370229007633588
0.9561068702290076
0.9484732824427481
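
The scores above are printed without labels. The sketch below attaches the model names in the order the `print` calls run (note that K-Neighbors is printed before Naive Bayes) and sorts them; the numbers are copied from the output above.

```python
import pandas as pd

# accuracies copied from the output above, labelled in print order
accuracy = pd.Series({
    "Logistic Regression": 0.8454198473282443,
    "Decision Tree": 0.9122137404580153,
    "Support Vector Machine": 0.8912213740458015,
    "K-Neighbors": 0.8912213740458015,
    "Naive Bayes": 0.6774809160305344,
    "Bagging": 0.9255725190839694,
    "Random Forest": 0.9312977099236641,
    "Ada Boosting": 0.9370229007633588,
    "Gradient Boosting": 0.9561068702290076,
    "XG Boosting": 0.9484732824427481,
})

ranked = accuracy.sort_values(ascending=False).round(3)
print(ranked)
```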


In [58]:
# Confusion matrices (predictions are passed first, so rows are predicted labels and columns are true labels)
from sklearn.metrics import confusion_matrix 
 
print('Logistic Regression classifier:')
print(confusion_matrix(logistic_y_prediction, y_test))

print('Decision Tree classifier:')
print(confusion_matrix(decision_y_prediction, y_test))

print('KNN Classifier:')
print(confusion_matrix(knn_y_prediction, y_test))

print('SVM classifier:')
print(confusion_matrix(svm_y_prediction, y_test))

print('Naive Bayes classifier:')
print(confusion_matrix(naive_y_prediction, y_test))

Logistic Regression classifier:
[[315  35]
 [ 46 128]]
Decision Tree classifier:
[[339  24]
 [ 22 139]]
KNN Classifier:
[[336  32]
 [ 25 131]]
SVM classifier:
[[335  31]
 [ 26 132]]
Naive Bayes classifier:
[[210  18]
 [151 145]]
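
Accuracy alone hides how each model treats the charged-off class. As a worked example, precision and recall for class 1 (charged off) can be read from the Logistic Regression matrix above. Because the predictions were passed as the first argument, rows are predicted labels and columns are true labels.

```python
import numpy as np

# Logistic Regression matrix from the output above (rows = predicted, columns = true)
cm = np.array([[315, 35],
               [46, 128]])

tp = cm[1, 1]   # predicted charged off, actually charged off
fp = cm[1, 0]   # predicted charged off, actually paid in full
fn = cm[0, 1]   # predicted paid in full, actually charged off

accuracy = np.trace(cm) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy matches the 0.8454 reported earlier, while recall shows that 35 of the 163 charged-off loans in the test set were predicted as paid in full.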


# <font color='#2F4F4F'>5. Summary of Findings</font>

What can you conclude?

In [None]:
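# The ensemble models gave the highest test accuracy: Gradient Boosting (0.956), XG Boosting (0.948),
# Ada Boosting (0.937), Random Forest (0.931) and Bagging (0.926). Naive Bayes was lowest (0.677).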

# <font color='#2F4F4F'>6. Recommendations</font>

What recommendations can you provide?

# <font color='#2F4F4F'>7. Challenging your Solution</font>

#### a) Did we have the right question?


#### b) Did we have the right data?

#### c) What can be done to improve the solution?


### 7.1 Improving the Solution

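We will use cross-validation, first with `cross_val_score` and then with an explicit K-Fold loop, to get a more reliable estimate of each model's performance than a single train/test split gives.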

#### 7.1.1 Cross Validation

In [62]:
from sklearn.model_selection import cross_val_score as cvs

# short aliases for the classifiers instantiated in section 4
log, tree, svm, nb, knn = (logistic_classifier, decision_classifier, svm_classifier,
                           naive_classifier, knn_classifier)
bag, rf, ada, grad, xgb = (bagging_classifier, random_forest_classifier, adaboost,
                           gradient_boost, xgb_classifier)

log_scores = cvs(log, X, y, cv = 3)
print("Logistic Regression CV scores:", log_scores)

tree_scores = cvs(tree, X, y, cv = 3)
print("Decision Tree CV scores:", tree_scores)

svm_scores = cvs(svm, X, y, cv = 3)
print("Support Vector Machine CV scores:", svm_scores)

nb_scores = cvs(nb, X, y, cv = 3)
print("Naive Bayes CV scores:", nb_scores)

knn_scores = cvs(knn, X, y, cv = 3)
print("K-Neighbors CV scores:", knn_scores)

bag_scores = cvs(bag, X, y, cv = 3)
print("Bagging CV scores:", bag_scores)

rf_scores = cvs(rf, X, y, cv = 3)
print("Random Forest CV scores:", rf_scores)

ada_scores = cvs(ada, X, y, cv = 3)
print("Ada Boosting CV scores:", ada_scores)

grad_scores = cvs(grad, X, y, cv = 3)
print("Gradient Boosting CV scores:", grad_scores)

xgb_scores = cvs(xgb, X, y, cv = 3)
print("XG Boosting CV scores:", xgb_scores)
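
Cross-validating on the raw `X` skips the Min-Max scaling applied in section 4. One way to keep scaling inside the validation loop is to wrap the scaler and the classifier in a pipeline, so the scaler is re-fitted on the training part of every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features; the stratified folds and `max_iter` value are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the loan features (not the notebook's data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# the scaler is re-fitted on the training part of every fold
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = cross_val_score(model, X_toy, y_toy, cv=folds)
print(scores, scores.mean())
```

The same pattern applies to any of the classifiers above by swapping the last step of the pipeline.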

#### 7.1.2 K-Folds Cross Validation

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# KFold with 3 shuffled splits; the folds themselves are generated by kf3.split(X) in the loop below
kf3 = KFold(n_splits = 3, shuffle = True)

print("We are using " + str(kf3.get_n_splits(X)) + " folds.\n")

model_count = 1 # to keep track of the model we are working with

# creating training and test sets using these folds
for train_index, test_index in kf3.split(X):
    print("Training fold", model_count)
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    log.fit(X_train, y_train)
    log_pred = log.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Logistic Regression Confusion Matrix:\n", confusion_matrix(y_test, log_pred))
    print("Accuracy:", accuracy_score(y_test, log_pred))
    print()
    
    tree.fit(X_train, y_train)
    tree_pred = tree.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Decision Tree Confusion Matrix:\n", confusion_matrix(y_test, tree_pred))
    print("Accuracy:", accuracy_score(y_test, tree_pred))
    print()
    
    svm.fit(X_train, y_train)
    svm_pred = svm.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Support Vector Machine Confusion Matrix:\n", confusion_matrix(y_test, svm_pred))
    print("Accuracy:", accuracy_score(y_test, svm_pred))
    print()
    
    nb.fit(X_train, y_train)
    nb_pred = nb.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Naive Bayes Confusion Matrix:\n", confusion_matrix(y_test, nb_pred))
    print("Accuracy:", accuracy_score(y_test, nb_pred))
    print()
    
    knn.fit(X_train, y_train)
    knn_pred = knn.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("K-Neighbors Confusion Matrix:\n", confusion_matrix(y_test, knn_pred))
    print("Accuracy:", accuracy_score(y_test, knn_pred))
    print()
    
    bag.fit(X_train, y_train)
    bag_pred = bag.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Bagging Confusion Matrix:\n", confusion_matrix(y_test, bag_pred))
    print("Accuracy:", accuracy_score(y_test, bag_pred))
    print()
    
    rf.fit(X_train, y_train)
    rf_pred = rf.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Random Forest Confusion Matrix:\n", confusion_matrix(y_test, rf_pred))
    print("Accuracy:", accuracy_score(y_test, rf_pred))
    print()
    
    ada.fit(X_train, y_train)
    ada_pred = ada.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Ada Boosting Confusion Matrix:\n", confusion_matrix(y_test, ada_pred))
    print("Accuracy:", accuracy_score(y_test, ada_pred))
    print()
    
    grad.fit(X_train, y_train)
    grad_pred = grad.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("Gradient Boosting Confusion Matrix:\n", confusion_matrix(y_test, grad_pred))
    print("Accuracy:", accuracy_score(y_test, grad_pred))
    print()
    
    xgb.fit(X_train, y_train)
    xgb_pred = xgb.predict(X_test)
    print("Assessing accuracy of model", model_count)
    print("XG Boosting Confusion Matrix:\n", confusion_matrix(y_test, xgb_pred))
    print("Accuracy:", accuracy_score(y_test, xgb_pred))
    print()
    
    model_count += 1

We are using 3 folds.

Training fold 1
Assessing accuracy of model 1
Logistic Regression Confusion Matrix:
 [[480   4]
 [  8 206]]
Accuracy: 0.9828080229226361

Assessing accuracy of model 1
Decision Tree Confusion Matrix:
 [[480   4]
 [  4 210]]
Accuracy: 0.9885386819484241

Assessing accuracy of model 1
Support Vector Machine Confusion Matrix:
 [[484   0]
 [ 87 127]]
Accuracy: 0.8753581661891118

Assessing accuracy of model 1
Naive Bayes Confusion Matrix:
 [[481   3]
 [  2 212]]
Accuracy: 0.9928366762177651

Assessing accuracy of model 1
K-Neighbors Confusion Matrix:
 [[481   3]
 [ 10 204]]
Accuracy: 0.9813753581661891

Assessing accuracy of model 1
Bagging Confusion Matrix:
 [[480   4]
 [  2 212]]
Accuracy: 0.9914040114613181

Assessing accuracy of model 1
Random Forest Confusion Matrix:
 [[480   4]
 [  2 212]]
Accuracy: 0.9914040114613181

Assessing accuracy of model 1
Ada Boosting Confusion Matrix:
 [[480   4]
 [  4 210]]
Accuracy: 0.9885386819484241

Assessing accuracy of model 1
G