<a href="https://colab.research.google.com/github/iamNirmeshGupta/Credit-Card-Default-Prediction/blob/main/Credit_Card_Default_Prediction_Nirmesh_Gupta_Individual_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install xlrd==1.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlrd==1.2.0
  Downloading xlrd-1.2.0-py2.py3-none-any.whl (103 kB)
     |████████████████████████████████| 103 kB 6.0 MB/s
Installing collected packages: xlrd
  Attempting uninstall: xlrd
    Found existing installation: xlrd 1.1.0
    Uninstalling xlrd-1.1.0:
      Successfully uninstalled xlrd-1.1.0
Successfully installed xlrd-1.2.0


##**This project aims to predict credit card payment defaults among customers in Taiwan. From a risk-management perspective, an accurate estimate of each client's probability of default is more valuable than a binary classification of clients as credible or not credible.**
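A minimal sketch of why an estimated probability carries more information than a binary label. The probabilities and the 0.5 threshold below are made-up values chosen for illustration; they are not outputs of any model in this notebook.

```python
import numpy as np

# Hypothetical estimated default probabilities for four clients (illustrative only).
estimated_prob = np.array([0.05, 0.45, 0.55, 0.95])

# Assumed decision threshold; 0.5 is a common default, not a value from this project.
threshold = 0.5
binary_label = (estimated_prob >= threshold).astype(int)

# The first two clients share the same label even though their risk differs widely.
print(binary_label)

# Ranking by probability keeps the ordering that the binary label discards.
riskiest_first = np.argsort(-estimated_prob)
print(riskiest_first)
```

With only the binary labels, the clients at 0.05 and 0.45 look identical; the probabilities let a risk team rank them or choose a different cut-off later.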

In [2]:
# Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

from sklearn import metrics  
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier


import warnings
warnings.simplefilter("ignore")
from pprint import pprint
import joblib
import imblearn

In [3]:
# Importing the dataset

df = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/alma better/Credit Card Default prediction - Capstone Project - Classification/Copy of default of credit card clients.xls")

In [4]:
# Basic Inspection

df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


In [17]:
# Promoting row 0 (which holds the real column names) to the header,
# then dropping that row and resetting the index

df1 = df.rename(columns=df.iloc[0]).loc[1:]
df1 = df1.reset_index(drop=True)
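The header-promotion step can be tried on a small in-memory frame. This is a sketch under stated assumptions: the toy frame `raw` and its values are invented to mimic the layout shown above, and the `pd.to_numeric` conversion is an extra step that may or may not be needed for the real file.

```python
import pandas as pd

# Toy frame mimicking the layout above: generic headers (X1, Y) with the
# real column names stored in the first data row. Values are illustrative.
raw = pd.DataFrame({
    "Unnamed: 0": ["ID", 1, 2],
    "X1": ["LIMIT_BAL", 20000, 120000],
    "Y": ["default payment next month", 1, 1],
})

# Promote row 0 to the header, drop it, and rebuild a 0-based index in one chain.
clean = raw.rename(columns=raw.iloc[0]).loc[1:].reset_index(drop=True)

# Each column held strings and numbers together, so it is object dtype;
# convert every column to a numeric dtype explicitly.
clean = clean.apply(pd.to_numeric)

print(clean.dtypes)
```

An alternative worth considering is passing `header=1` to `pd.read_excel`, which would likely read the second spreadsheet row as the header at load time and make this step unnecessary; that has not been checked against this file.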

In [19]:
# Basic Inspection

df1.head(10)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,6,50000,1,1,2,37,0,0,0,0,...,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,7,500000,1,1,2,29,0,0,0,0,...,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,8,100000,2,2,2,23,0,-1,-1,0,...,221,-159,567,380,601,0,581,1687,1542,0
8,9,140000,2,3,1,28,0,0,2,0,...,12211,11793,3719,3329,0,432,1000,1000,1000,0
9,10,20000,1,3,2,35,-2,-2,-2,-2,...,0,13007,13912,0,0,0,13007,1122,0,0


In [20]:
# Checking the shape of the data

df1.shape

(30000, 25)

##**Data Description**

###**There are 25 variables:**
**• ID:** ID of each client. \
**• LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit). \
**• SEX:** Gender (1=male, 2=female) \
**• EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others) \
**• MARRIAGE:** Marital status (1=married, 2=single, 3=others) \
**• AGE:** Age in years \
**• PAY_0:** Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above) \
**• PAY_2:** Repayment status in August, 2005 (scale same as above) \
**• PAY_3:** Repayment status in July, 2005 (scale same as above) \
**• PAY_4:** Repayment status in June, 2005 (scale same as above) \
**• PAY_5:** Repayment status in May, 2005 (scale same as above) \
**• PAY_6:** Repayment status in April, 2005 (scale same as above) \
**• BILL_AMT1:** Amount of bill statement in September, 2005 (NT dollar) \
**• BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar) \
**• BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar) \
**• BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar) \
**• BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar) \
**• BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar) \
**• PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar) \
**• PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar) \
**• PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar) \
**• PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar) \
**• PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar) \
**• PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar) \
**• default payment next month:** Default payment (1=yes, 0=no)
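The coded categorical variables described above can be decoded into readable labels. The sketch below copies the mappings from the description; the `sample` frame and its rows are hypothetical, including the deliberately undocumented `EDUCATION` code 5.

```python
import pandas as pd

# Code-to-label mappings copied from the data description above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}

# Small illustrative sample; the rows are not taken from the dataset.
sample = pd.DataFrame({"SEX": [2, 1], "EDUCATION": [2, 5], "MARRIAGE": [1, 3]})

decoded = pd.DataFrame({
    "SEX": sample["SEX"].map(SEX_LABELS),
    "EDUCATION": sample["EDUCATION"].map(EDUCATION_LABELS),
    "MARRIAGE": sample["MARRIAGE"].map(MARRIAGE_LABELS),
})

# Codes missing from the description (EDUCATION=5 here, a made-up case) map to NaN,
# which makes undocumented categories easy to spot.
print(decoded)
```

Using `Series.map` with a dictionary leaves any unmapped code as `NaN`, so counting missing values per decoded column is a quick check for codes the description does not cover.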

In [21]:
# Printing all the columns in the data

df1.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

In [23]:
# Checking the distribution of repayment-status codes in PAY_0
df1['PAY_0'].value_counts()

 0    14737
-1     5686
 1     3688
-2     2759
 2     2667
 3      322
 4       76
 5       26
 8       19
 6       11
 7        9
Name: PAY_0, dtype: int64
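The counts above can be compared against the repayment scale given in the data description (-1 and 1 through 9). The sketch below rebuilds the counts by hand from the printed output rather than from `df1`, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Repayment-status codes listed in the data description: -1 and 1 through 9.
documented_codes = {-1, *range(1, 10)}

# Counts copied from the value_counts() output above.
pay_0_counts = pd.Series(
    {0: 14737, -1: 5686, 1: 3688, -2: 2759, 2: 2667, 3: 322,
     4: 76, 5: 26, 8: 19, 6: 11, 7: 9},
    name="PAY_0",
)

# Keep only the codes that the description does not define.
undocumented = pay_0_counts[~pay_0_counts.index.isin(documented_codes)]
share = undocumented.sum() / pay_0_counts.sum()

print(undocumented)
print(f"{share:.1%} of rows use a code the description does not define")
```

The codes 0 and -2 appear in the data but not in the documented scale, so their meaning would need to be confirmed before treating `PAY_0` as an ordered delay measure.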