# Data Mining Coursework - Spring 2023

Katerina Marie (Katya) Reichert - 33781583

I worked and submitted alone :)

# Part 2 - Credit Risk Analysis

This task is based on real credit risk data. The goal is to predict a response variable Y, which represents a credit card default payment (Yes = 1, No = 0), using the following 23 input variables:

| Variable | Description |
| :- | :- |
| X1 | Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. |
| X2 | Gender (1 = male; 2 = female). |
| X3 | Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). |
| X4 | Marital status (1 = married; 2 = single; 3 = others). |
| X5 | Age (year). |
| X6 - X11 | History of past payment. The past monthly payment records (from April to September, 2005) are tracked as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above. |
| X12 - X17 | Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005. |
| X18 - X23 | Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005. |

### Main Task:

Using Python and any relevant libraries, you are required to build the best predictive model by tuning models using cross validation on the training dataset with each of the following algorithms discussed in this module: 

- k-Nearest Neighbours
- Decision Trees
- Random Forest
- Bagging
- AdaBoost
- SVM

Out of the models tuned with the above algorithms, select the best model, clearly justify your choice, and evaluate its performance on the test set.

Moreover, for each algorithm mentioned above, include 1 chart in the notebook illustrating how the accuracy of the models varies when you vary the values of one numeric hyperparameter only (of your choice).

In [13]:
import pandas as pd
import numpy as np

from sklearn import svm
from sklearn.model_selection import GridSearchCV

import sys
import os

# Prepare training data set

In [9]:
train_df = pd.read_csv(os.path.join('csv', 'creditdefault_train.csv'), header=0)
print(train_df.shape)
train_df.head()

(15000, 24)


Unnamed: 0,Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
1,0,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
2,0,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
3,0,50000,1,1,2,37,0,0,0,0,...,57608,19394,19619,20024,2500,1815,657,1000,1000,800
4,0,500000,1,1,2,29,0,0,0,0,...,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770


### Preprocess data

In [10]:
x, y = train_df.drop(columns='Y'), train_df['Y']

Standardize the feature columns (subtract each column's mean and divide by its standard deviation)

In [11]:
x = (x - x.mean())/x.std()
x.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,-1.133274,0.808216,0.190673,-1.064004,-1.241838,1.795894,1.778073,-0.695989,-0.66538,-1.517486,...,-0.671103,-0.672907,-0.665631,-0.653726,-0.361115,-0.238118,-0.362679,-0.30288,-0.310301,-0.288755
1,-0.9027,0.808216,0.190673,-1.064004,0.178288,0.018192,0.109252,0.135539,0.181662,0.223682,...,0.031416,-0.230618,-0.186997,-0.154726,-0.232512,-0.176421,-0.274631,-0.236211,-0.241172,-0.233506
2,-0.9027,-1.23721,0.190673,-1.064004,2.363097,-0.870659,0.109252,-0.695989,0.181662,0.223682,...,-0.163084,-0.345806,-0.349186,-0.330635,-0.232512,1.431521,0.371049,0.242596,-0.265746,-0.251241
3,-0.9027,-1.23721,-1.080482,0.848984,0.178288,0.018192,0.109252,0.135539,0.181662,0.223682,...,0.151634,-0.369956,-0.341368,-0.315553,-0.200361,-0.185884,-0.314473,-0.242272,-0.245634,-0.244556
4,2.555914,-1.23721,-1.080482,0.848984,-0.695636,0.018192,0.109252,0.135539,0.181662,0.223682,...,5.751307,7.803806,7.317449,7.350405,3.175474,1.585487,2.425487,0.923774,0.578864,0.472025


# Prepare test data set

In [26]:
test_df = pd.read_csv(os.path.join('csv', 'creditdefault_test.csv'), header=0)
print(test_df.shape)
test_df.head()

(15000, 24)


Unnamed: 0,Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,1,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
1,0,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
2,0,140000,2,3,1,28,0,0,2,0,...,12108,12211,11793,3719,3329,0,432,1000,1000,1000
3,0,20000,1,3,2,35,-2,-2,-2,-2,...,0,0,13007,13912,0,0,0,13007,1122,0
4,0,200000,2,3,2,34,0,0,2,0,...,5535,2513,1828,3731,2306,12,50,300,3738,66


### Preprocess data

In [28]:
x_test, y_test = test_df.drop(columns='Y'), test_df['Y']

Standardize the test feature columns, using the test set's own column means and standard deviations

In [29]:
x_test = (x_test - x_test.mean())/x_test.std()
x_test.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,-0.367253,0.812054,0.181022,0.868163,-1.034806,-0.879285,1.786526,0.142218,0.195981,0.24651,...,-0.636191,-0.621456,-0.603928,-0.597272,-0.326039,-0.205451,-0.21638,-0.247204,-0.318092,-0.183647
1,-0.599113,0.812054,0.181022,0.868163,-0.172743,0.011521,0.114219,0.142218,0.195981,0.24651,...,-0.479728,-0.450394,-0.4158,-0.392114,-0.23938,-0.184989,-0.21638,-0.247204,-0.251827,-0.011724
2,-0.21268,0.812054,1.440455,-1.050571,-0.81929,0.011521,0.114219,1.821291,0.195981,0.24651,...,-0.5006,-0.483186,-0.467444,-0.589625,-0.135994,-0.246375,-0.243641,-0.247204,-0.251827,-0.240954
3,-1.140118,-1.231363,1.440455,0.868163,-0.064985,-1.770091,-1.558089,-1.536856,-1.531742,-0.648371,...,-0.67477,-0.672068,-0.447572,-0.419445,-0.326039,-0.246375,-0.264375,0.564922,-0.243743,-0.298262
4,0.251039,0.812054,1.440455,0.868163,-0.172743,0.011521,0.114219,1.821291,0.195981,0.24651,...,-0.595151,-0.633197,-0.63056,-0.589425,-0.194395,-0.245884,-0.261975,-0.29455,-0.070393,-0.294479
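The cell above scales the test set with its own mean and standard deviation, so the training and test features end up on slightly different scales. A common alternative is to compute the statistics on the training set only and reuse them for the test set. The sketch below illustrates that idea on synthetic data; the frames `x_train_raw` and `x_test_raw` and their values are hypothetical stand-ins, not the coursework data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the raw training and test feature frames
# (hypothetical data, used only to illustrate the scaling step).
rng = np.random.default_rng(0)
cols = ["X1", "X5", "X12"]
x_train_raw = pd.DataFrame(rng.normal(50, 10, size=(200, 3)), columns=cols)
x_test_raw = pd.DataFrame(rng.normal(60, 12, size=(100, 3)), columns=cols)

# Compute the statistics once, on the training set only.
train_mean = x_train_raw.mean()
train_std = x_train_raw.std()

# Apply the same shift and scale to both sets.
x_train_scaled = (x_train_raw - train_mean) / train_std
x_test_scaled = (x_test_raw - train_mean) / train_std

# The training columns are centred exactly; the test columns are not,
# because they are expressed in training-set units.
print(x_train_scaled.mean().round(6))
print(x_test_scaled.mean().round(3))
```

With this approach a test row is transformed exactly the way a training row with the same raw values would be, which is what a fitted model expects at prediction time.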


# Create Predictions

Create dataframe to record prediction results

In [38]:
results_df = pd.DataFrame(test_df['Y'])
results_df.rename(columns={'Y': 'Actual'}, inplace=True)

results_df.head()

Unnamed: 0,Actual
0,1
1,0
2,0
3,0
4,0


## K-Nearest Neighbors Model

## SVM Model

Create the SVM model and fix the random_state parameter for repeatability

In [32]:
svm_model = svm.SVC(random_state = 3)
svm_params = {'kernel':('linear', 'rbf', 'sigmoid'), 'C':[1, 2, 5]}

In [17]:
# verbose=2 prints one "[CV] END" line per fold, as in the output below
svm_gs = GridSearchCV(svm_model, svm_params, refit = True, cv=4, verbose=2)
svm_gs.fit(x, y)

Fitting 4 folds for each of 9 candidates, totalling 36 fits
[CV] END .................................C=1, kernel=linear; total time=   2.7s
[CV] END .................................C=1, kernel=linear; total time=   2.6s
[CV] END .................................C=1, kernel=linear; total time=   4.3s
[CV] END .................................C=1, kernel=linear; total time=   2.9s
[CV] END ....................................C=1, kernel=rbf; total time=   2.4s
[CV] END ....................................C=1, kernel=rbf; total time=   2.4s
[CV] END ....................................C=1, kernel=rbf; total time=   2.6s
[CV] END ....................................C=1, kernel=rbf; total time=   2.5s
[CV] END ................................C=1, kernel=sigmoid; total time=   1.7s
[CV] END ................................C=1, kernel=sigmoid; total time=   2.1s
[CV] END ................................C=1, kernel=sigmoid; total time=   2.2s
[CV] END ................................C=1, ker
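The "9 candidates, totalling 36 fits" line in the log follows directly from the grid: 3 kernels times 3 values of C gives 9 combinations, each fitted once per fold. As a small check, the same count can be reproduced with scikit-learn's `ParameterGrid`; the variable names `candidates`, `cv_folds` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as in the grid search above.
svm_params = {'kernel': ('linear', 'rbf', 'sigmoid'), 'C': [1, 2, 5]}
cv_folds = 4

# ParameterGrid enumerates every combination of the listed values.
candidates = list(ParameterGrid(svm_params))
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 9 36
```

Because `refit = True`, the best candidate is fitted one more time on the full training set after these 36 cross-validation fits.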

In [25]:
print('best params:', svm_gs.best_params_)
print('best score:', svm_gs.best_score_)

best params: {'C': 1, 'kernel': 'rbf'}
best score: 0.8184666666666667
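The task also asks for a chart of accuracy against one numeric hyperparameter. One possible way to get the numbers for such a chart is to read the mean cross-validated accuracy per value of C out of `cv_results_`. The sketch below does this on a small synthetic dataset from `make_classification`, so the names `x_demo`, `y_demo`, `demo_gs` and `acc_by_c` and all resulting scores are illustrative only and say nothing about the credit data.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the standardized training data (hypothetical).
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=3)

# Same hyperparameter grid as the notebook, run on the toy data.
demo_params = {'kernel': ['linear', 'rbf', 'sigmoid'], 'C': [1, 2, 5]}
demo_gs = GridSearchCV(svm.SVC(random_state=3), demo_params, cv=4)
demo_gs.fit(x_demo, y_demo)

# cv_results_ holds one row per candidate; mean_test_score is the mean
# accuracy over the folds (accuracy is the default score for SVC).
cv_df = pd.DataFrame(demo_gs.cv_results_)

# One row per value of C, one column per kernel.
acc_by_c = cv_df.pivot(index='param_C', columns='param_kernel',
                       values='mean_test_score')
print(acc_by_c)
```

With matplotlib installed, `acc_by_c.plot(marker='o')` would draw one accuracy-versus-C line per kernel. The same pattern should apply to `svm_gs.cv_results_` from the real grid search.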


Fit the SVM model with the best parameters from the grid search

In [39]:
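# Rebuild the model with the best hyperparameters found by the grid search
# and fit it on the full training set
svm_model = svm.SVC(C=1, kernel='rbf', random_state=3)
svm_model.fit(x, y)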

results_df['SVM'] = svm_model.predict(x_test)
results_df.head()

Unnamed: 0,Actual,SVM
0,1,0
1,0,0
2,0,0
3,0,0
4,0,0
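To evaluate the model on the test set, the `Actual` and `SVM` columns can be compared directly. The first row above is a defaulter predicted as a non-defaulter, so a confusion matrix is worth looking at alongside plain accuracy. The sketch below uses a tiny hand-made frame, `results_demo`, as a hypothetical stand-in for `results_df`; its values are invented for illustration and are not results from the coursework data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-in for results_df (hypothetical values, not real predictions).
results_demo = pd.DataFrame({
    'Actual': [1, 0, 0, 0, 0, 1, 1, 0],
    'SVM':    [0, 0, 0, 0, 0, 1, 0, 1],
})

# Fraction of rows where the prediction matches the true label.
acc = accuracy_score(results_demo['Actual'], results_demo['SVM'])

# Rows are true classes (0, 1), columns are predicted classes (0, 1):
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(results_demo['Actual'], results_demo['SVM'])

print(acc)  # 0.625
print(cm)   # [[4 1]
            #  [2 1]]
```

On the real data the same two calls would take `results_df['Actual']` and `results_df['SVM']` as arguments.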
