## 1. Variable Identification
- **Numerical**
    - X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit - **Discrete**
    - X5: Age (year) - **Discrete**
    - X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005 - **Discrete**
    - X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005 - **Discrete**
- **Categorical**
    - **Y: default payment (Yes = 1, No = 0) - Target Variable/Nominal**
    - X2: Gender (1 = male; 2 = female) - **Nominal**
    - X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others) - **Nominal/Ordinal**
    - X4: Marital status (1 = married; 2 = single; 3 = others) - **Nominal**
    - X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above - **Nominal/Ordinal**
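
The categorical codes listed above can be decoded into readable labels for exploration. The following is a minimal sketch, assuming the column names `SEX`, `EDUCATION` and `MARRIAGE` used later in this notebook; the `example` frame contains made-up rows and is not taken from the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the variable descriptions above
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}

# Tiny illustrative frame (hypothetical rows, not real records)
example = pd.DataFrame({"SEX": [2, 1], "EDUCATION": [2, 4], "MARRIAGE": [1, 3]})

# Series.map leaves any code missing from the mapping as NaN,
# which makes undocumented codes easy to spot
decoded = pd.DataFrame({
    "SEX": example["SEX"].map(SEX_LABELS),
    "EDUCATION": example["EDUCATION"].map(EDUCATION_LABELS),
    "MARRIAGE": example["MARRIAGE"].map(MARRIAGE_LABELS),
})
print(decoded)
```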

In [61]:
# ---------------------- import libraries
import pandas as pd
import numpy as np


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics

# ---------------- set random seed to produce reproducible results
from numpy.random import seed
seed(1)

In [62]:
# ----------------------- import dataset
data = pd.read_csv('C:\\Users\\Inno Mvula\\Desktop\\Kaggle files\\Projects - Classification\\CreditDefault\\default-of-credit-card-clients.csv', skiprows = 1)
data = data.copy()  # work on an independent copy of the loaded frame

In [63]:
data.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

In [64]:
data.columns = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_SEP',
       'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR', 'BILL_AMT_SEP', 'BILL_AMT_AUG',
       'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR', 'PAY_AMT_SEP',
       'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR',
       'def_pay']

In [65]:
# ------------------------ variables
feat = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_SEP',
       'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR', 'BILL_AMT_SEP', 'BILL_AMT_AUG',
       'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR', 'PAY_AMT_SEP',
       'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']
num_data = ['LIMIT_BAL', 'AGE', 'BILL_AMT_SEP', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR', 'PAY_AMT_SEP',
            'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']
cat_data = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_SEP', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']
target = 'def_pay'
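
A quick consistency check on these lists can catch a column that was left out or listed twice. This sketch rebuilds the same three lists from a hypothetical `months` helper (the names and order match the cell above) and verifies that `num_data` and `cat_data` together cover `feat` exactly once.

```python
# Hypothetical helper: month suffixes in the order used by the renamed columns
months = ["SEP", "AUG", "JUL", "JUN", "MAY", "APR"]

pay_status = [f"PAY_{m}" for m in months]
bill_amt = [f"BILL_AMT_{m}" for m in months]
pay_amt = [f"PAY_AMT_{m}" for m in months]

feat = ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"] + pay_status + bill_amt + pay_amt
num_data = ["LIMIT_BAL", "AGE"] + bill_amt + pay_amt
cat_data = ["SEX", "EDUCATION", "MARRIAGE"] + pay_status

# Every feature should appear in exactly one of the two groups
assert set(num_data).isdisjoint(cat_data)
assert set(num_data) | set(cat_data) == set(feat)
print(len(feat), len(num_data), len(cat_data))  # 23 14 9
```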

In [66]:
data.head()

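| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_SEP | PAY_AUG | PAY_JUL | PAY_JUN | ... | BILL_AMT_JUN | BILL_AMT_MAY | BILL_AMT_APR | PAY_AMT_SEP | PAY_AMT_AUG | PAY_AMT_JUL | PAY_AMT_JUN | PAY_AMT_MAY | PAY_AMT_APR | def_pay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |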


## 2. Split the data into a training and test set
- Training data: the first 25,000 rows
- Test data: the remaining 5,000 rows
- Exclude the ID column from the features, as it has no relevance for now.

In [67]:
# ---------------------------- split the data into training and test data
train_data = data[0:25000]
test_data = data[25000:]

In [68]:
train_data.shape, test_data.shape

((25000, 25), (5000, 25))

In [69]:
train_data['def_pay'].value_counts()

0    19422
1     5578
Name: def_pay, dtype: int64

In [70]:
test_data['def_pay'].value_counts()

0    3942
1    1058
Name: def_pay, dtype: int64
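
The two counts above translate into default rates as follows. This is plain arithmetic on the printed numbers; the `default_rate` helper is hypothetical and only illustrates the calculation.

```python
# Class counts copied from the two value_counts() outputs above
train_counts = {0: 19422, 1: 5578}
test_counts = {0: 3942, 1: 1058}

def default_rate(counts):
    """Share of rows with def_pay == 1."""
    return counts[1] / (counts[0] + counts[1])

print(f"train: {default_rate(train_counts):.4f}")  # train: 0.2231
print(f"test:  {default_rate(test_counts):.4f}")   # test:  0.2116
```

Both sets are imbalanced at roughly one default in five rows, so a model that always predicts 0 would already reach about 78% accuracy on the training data.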

In [71]:
# ------------------------------ split training set into features and target
X = train_data[feat]
y = train_data[target]

In [72]:
# ------------------------------ split training set into training and validation sets
# (train_test_split was already imported in the first cell)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify = y)

In [73]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((20000, 23), (20000,), (5000, 23), (5000,))
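
The effect of `stratify = y` can be checked in isolation. The sketch below assumes synthetic labels with the same class counts as `train_data['def_pay']` and a placeholder feature column, so it does not need the CSV file; it only shows that the default rate is preserved in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the training data (19422 / 5578)
y = np.array([0] * 19422 + [1] * 5578)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# With stratification the three default rates should be nearly identical
print(y.mean(), y_train.mean(), y_val.mean())
```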

## 3. Build and train models
- Task: Classification
    - Decision Tree
    - Random Forest
    - Logistic Regression
    - Support Vector Machine classifier
    - KNN
    - Naive Bayes
    - Adaptive Boost
    - Gradient Boost
    - Multi-Layer Perceptron
    

In [74]:
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
lr = LogisticRegression()
svc = SVC()
knn = KNeighborsClassifier()
nb = GaussianNB()
gb = GradientBoostingClassifier()
ab = AdaBoostClassifier()
mlp = MLPClassifier()
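
Since every model is fitted and evaluated the same way, the per-model cells can also be written as one loop. This is a sketch under stated assumptions, not the notebook's method: it uses a synthetic dataset from `make_classification` as a stand-in for the credit data, only three of the models, and fixed `random_state` values chosen here for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (assumption)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training part and report on the validation part
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    reports[name] = classification_report(y_val, y_pred, zero_division=0)
    print(name)
    print(reports[name])
```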

### 1. Decision Tree

In [75]:
dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)
#evaluate
class_rep_forest = classification_report(y_val, y_pred)
print(class_rep_forest)

              precision    recall  f1-score   support

           0       0.83      0.81      0.82      3884
           1       0.38      0.41      0.39      1116

    accuracy                           0.72      5000
   macro avg       0.60      0.61      0.61      5000
weighted avg       0.73      0.72      0.72      5000
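
The f1-score column is the harmonic mean of precision and recall. As a worked check on the decision tree report above, the sketch below recomputes it from the rounded precision and recall values; the `f1` helper is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the decision tree report
rows = {0: (0.83, 0.81), 1: (0.38, 0.41)}

f1_by_class = {label: f1(p, r) for label, (p, r) in rows.items()}
for label, value in f1_by_class.items():
    print(label, round(value, 2))
```

The recomputed values agree with the report to two decimals, which confirms that the weak class 1 score comes from both low precision and low recall rather than from one of them alone.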



### 2. Random Forest

In [76]:
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)
#evaluate
class_rep_forest2 = classification_report(y_val, y_pred)
print(class_rep_forest2)

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      3884
           1       0.66      0.37      0.48      1116

    accuracy                           0.82      5000
   macro avg       0.75      0.66      0.68      5000
weighted avg       0.80      0.82      0.80      5000



### 3. Logistic Regression

In [77]:
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
#evaluate
class_rep_forest3 = classification_report(y_val, y_pred)
print(class_rep_forest3)

              precision    recall  f1-score   support

           0       0.78      1.00      0.87      3884
           1       0.11      0.00      0.00      1116

    accuracy                           0.78      5000
   macro avg       0.44      0.50      0.44      5000
weighted avg       0.63      0.78      0.68      5000



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
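
The warning suggests scaling the data or raising `max_iter`. One way to do both is to put a `StandardScaler` in front of the model in a pipeline. The sketch below is an illustration on synthetic data, not a result for the credit dataset; the enlarged first feature only mimics the wide range of the NT-dollar columns, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption); one feature is blown up to a much larger scale
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1e5

# Standardize features, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X, y)

print("iterations used:", scaled_lr[-1].n_iter_[0])
print("training accuracy:", scaled_lr.score(X, y))
```

Fitting the scaler inside the pipeline means it learns its statistics from the training data only, so the same object can later be applied to `X_val` without leaking validation information.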


### 4. Support Vector Machine

In [78]:
svc.fit(X_train, y_train)
y_pred = svc.predict(X_val)
#evaluate
class_rep_forest4 = classification_report(y_val, y_pred)
print(class_rep_forest4)

              precision    recall  f1-score   support

           0       0.78      1.00      0.87      3884
           1       0.00      0.00      0.00      1116

    accuracy                           0.78      5000
   macro avg       0.39      0.50      0.44      5000
weighted avg       0.60      0.78      0.68      5000



  _warn_prf(average, modifier, msg_start, len(result))
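
The 0.00 precision for class 1 means the model never predicted a default on the validation set, so precision is undefined and scikit-learn substitutes 0 with a warning. The toy example below reproduces that situation with hand-written labels (not real data) and uses the `zero_division` argument of `classification_report` to set the substituted value explicitly.

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]  # a classifier that never predicts class 1

# zero_division=0 reports undefined precision as 0 without emitting a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["1"])
print(report["accuracy"])
```

Accuracy stays at the majority-class share even though no default is detected, which is why accuracy alone is a poor guide on this dataset.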


### 5. K-Nearest Neighbor

In [79]:
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val)
#evaluate
class_rep_forest5 = classification_report(y_val, y_pred)
print(class_rep_forest5)

              precision    recall  f1-score   support

           0       0.80      0.91      0.85      3884
           1       0.39      0.19      0.26      1116

    accuracy                           0.75      5000
   macro avg       0.59      0.55      0.55      5000
weighted avg       0.71      0.75      0.72      5000



### 6. Naive Bayes

In [80]:
nb.fit(X_train, y_train)
y_pred = nb.predict(X_val)
#evaluate
class_rep_forest6 = classification_report(y_val, y_pred)
print(class_rep_forest6)

              precision    recall  f1-score   support

           0       0.86      0.27      0.42      3884
           1       0.25      0.85      0.39      1116

    accuracy                           0.40      5000
   macro avg       0.56      0.56      0.40      5000
weighted avg       0.73      0.40      0.41      5000
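
The low accuracy follows directly from the per-class recall: accuracy is the support-weighted average of recall. A short arithmetic check using the rounded values from the report above:

```python
# Recall and support per class, copied from the Naive Bayes report
recall = {0: 0.27, 1: 0.85}
support = {0: 3884, 1: 1116}

# Correctly classified rows per class, summed and divided by the total
correct = sum(recall[c] * support[c] for c in recall)
accuracy = correct / sum(support.values())
print(round(accuracy, 2))  # 0.4
```

Naive Bayes finds most defaults (recall 0.85) but misclassifies most non-defaults, and because non-defaults make up most of the validation set, overall accuracy drops to 0.40.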



### 7. Adaptive Boost

In [81]:
ab.fit(X_train, y_train)
y_pred = ab.predict(X_val)
#evaluate
class_rep_forest7 = classification_report(y_val, y_pred)
print(class_rep_forest7)

              precision    recall  f1-score   support

           0       0.83      0.96      0.89      3884
           1       0.66      0.30      0.41      1116

    accuracy                           0.81      5000
   macro avg       0.74      0.63      0.65      5000
weighted avg       0.79      0.81      0.78      5000



### 8. Gradient Boost

In [82]:
gb.fit(X_train, y_train)
y_pred = gb.predict(X_val)
#evaluate
class_rep_forest8 = classification_report(y_val, y_pred)
print(class_rep_forest8)

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      3884
           1       0.66      0.35      0.46      1116

    accuracy                           0.82      5000
   macro avg       0.75      0.65      0.67      5000
weighted avg       0.80      0.82      0.79      5000



### 9. Multi-layer perceptron

In [83]:
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_val)
#evaluate
class_rep_forest9 = classification_report(y_val, y_pred)
print(class_rep_forest9)

              precision    recall  f1-score   support

           0       0.83      0.68      0.75      3884
           1       0.32      0.52      0.40      1116

    accuracy                           0.65      5000
   macro avg       0.58      0.60      0.57      5000
weighted avg       0.72      0.65      0.67      5000
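
To compare the nine models side by side, the headline numbers from the reports above can be collected and ranked. The values below are transcribed by hand from the printed reports; since most models were created without a fixed `random_state`, a rerun may give slightly different figures.

```python
# (model, accuracy, f1-score for class 1) transcribed from the reports above
results = [
    ("Decision Tree", 0.72, 0.39),
    ("Random Forest", 0.82, 0.48),
    ("Logistic Regression", 0.78, 0.00),
    ("Support Vector Machine", 0.78, 0.00),
    ("K-Nearest Neighbor", 0.75, 0.26),
    ("Naive Bayes", 0.40, 0.39),
    ("Adaptive Boost", 0.81, 0.41),
    ("Gradient Boost", 0.82, 0.46),
    ("Multi-layer perceptron", 0.65, 0.40),
]

# Rank by class 1 f1-score, since defaults are the class of interest
ranked = sorted(results, key=lambda row: row[2], reverse=True)
for name, acc, f1_default in ranked:
    print(f"{name:<24} accuracy={acc:.2f}  f1(class 1)={f1_default:.2f}")
```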

