Dataset Information
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Content
There are 25 variables:

ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)
Inspiration
Some ideas for exploration:

How does the probability of default payment vary by categories of different demographic variables?
Which variables are the strongest predictors of default payment?
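
For the first question, the default rate within each category is simply the mean of the 0/1 target per group. The sketch below uses a handful of made-up rows (the `demo` frame is hypothetical); only the column names come from the variable list above.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the variable list above.
demo = pd.DataFrame({
    "SEX": [1, 1, 2, 2, 2, 1],
    "default.payment.next.month": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 target within each group is that group's default rate.
rate_by_sex = demo.groupby("SEX")["default.payment.next.month"].mean()
print(rate_by_sex)
```

The same pattern should work for EDUCATION or MARRIAGE by changing the grouping column.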
Acknowledgements
Any publications based on this dataset should acknowledge the following:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The original dataset can be found at the UCI Machine Learning Repository.

In [1]:
import pandas as pd
df=pd.read_csv('D:/Dropbox/Machine Learning/Data/Credit Default/UCI_Credit_Card.csv')
df.sample(5)

          ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default.payment.next.month
23156  23157    30000.0    2          3         2   38      2      0      0      2  ...    28538.0    28208.0    29326.0    1426.0    3700.0     718.0    1500.0    1500.0    3000.0                           1
14294  14295    20000.0    1          2         2   24      2      2      4      4  ...     1650.0     1650.0     1650.0       0.0       0.0       0.0       0.0       0.0       0.0                           1
8859    8860   180000.0    1          1         2   29     -1     -1     -1     -1  ...      998.0      847.0     1912.0    1597.0    2604.0    1001.0     849.0    1915.0    4055.0                           1
8387    8388    30000.0    1          2         2   28      0      0      0      0  ...    18585.0        0.0        0.0    2000.0    2000.0    2000.0       0.0       0.0       0.0                           0
1104    1105    20000.0    2          2         1   35      1      2      2      3  ...    20419.0    19289.0    19600.0       0.0    3292.0       0.0       0.0     781.0     701.0                           1


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

In [3]:
# categorical variables description
df[['SEX','EDUCATION','MARRIAGE']].describe()


            SEX  EDUCATION  MARRIAGE
count   30000.0    30000.0   30000.0
mean   1.603733   1.853133  1.551867
std    0.489129   0.790349   0.52197
min         1.0        0.0       0.0
25%         1.0        1.0       1.0
50%         2.0        2.0       2.0
75%         2.0        2.0       2.0
max         2.0        6.0       3.0


SEX: Gender (1=male, 2=female) 
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown) 
MARRIAGE: Marital status (1=married, 2=single, 3=others)

EDUCATION: undocumented 0 => solution: change 0 to 6 (unknown)

MARRIAGE: undocumented 0 => solution: change 0 to 3 (others)
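
The two recodings noted above could be applied with `Series.replace`. This is a minimal sketch on a small made-up frame (`demo` is a hypothetical stand-in for `df`); it assumes the undocumented 0 should simply be folded into the existing catch-all codes, as the notes suggest.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
demo = pd.DataFrame({
    "EDUCATION": [0, 1, 2, 6, 0],
    "MARRIAGE": [0, 1, 2, 3, 1],
})

# Fold the undocumented 0 into the documented catch-all codes.
demo["EDUCATION"] = demo["EDUCATION"].replace(0, 6)  # 6 = unknown
demo["MARRIAGE"] = demo["MARRIAGE"].replace(0, 3)    # 3 = others

print(demo["EDUCATION"].tolist())
print(demo["MARRIAGE"].tolist())
```

On the real data the same two `replace` calls would be run on `df` before building `X`.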

In [4]:
# payment delay description
df[['PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']].describe()

           PAY_0      PAY_2      PAY_3      PAY_4      PAY_5      PAY_6
count    30000.0    30000.0    30000.0    30000.0    30000.0    30000.0
mean     -0.0167  -0.133767    -0.1662  -0.220667    -0.2662    -0.2911
std     1.123802   1.197186   1.196868   1.169139   1.133187   1.149988
min         -2.0       -2.0       -2.0       -2.0       -2.0       -2.0
25%         -1.0       -1.0       -1.0       -1.0       -1.0       -1.0
50%          0.0        0.0        0.0        0.0        0.0        0.0
75%          0.0        0.0        0.0        0.0        0.0        0.0
max          8.0        8.0        8.0        8.0        8.0        8.0


PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_?: undocumented values -2 and 0 (the minimum is -2 and the median is 0, but the scale above defines neither)
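
One way to list which codes fall outside the documented scale is to filter against the documented set and count what remains. The sketch below uses a made-up `pay` series as a stand-in for a PAY_? column; the documented set (-1 and 1 to 9) is taken from the variable description.

```python
import pandas as pd

# Made-up sample of repayment-status codes, standing in for a PAY_? column.
pay = pd.Series([-2, -1, 0, 0, 1, 2, -2, 8])

# Codes listed in the dataset description: -1 (pay duly) and 1..9 (months of delay).
documented = {-1} | set(range(1, 10))

# Frequency of every code that the description does not cover.
undocumented = pay[~pay.isin(documented)].value_counts().sort_index()
print(undocumented)
```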

In [5]:
# bill statement description
df[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']].describe()

          BILL_AMT1     BILL_AMT2     BILL_AMT3     BILL_AMT4     BILL_AMT5     BILL_AMT6
count       30000.0       30000.0       30000.0       30000.0       30000.0       30000.0
mean     51223.3309  49179.075167      47013.15  43262.948967  40311.400967    38871.7604
std    73635.860576  71173.768783      69349.39  64332.856134   60797.15577  59554.107537
min       -165580.0      -69777.0     -157264.0     -170000.0      -81334.0     -339603.0
25%         3558.75       2984.75       2666.25       2326.75        1763.0        1256.0
50%         22381.5       21200.0       20088.5       19052.0       18104.5       17071.0
75%         67091.0      64006.25      60164.75       54506.0       50190.5      49198.25
max        964511.0      983931.0     1664089.0      891586.0      927171.0      961664.0


BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

Negative values observed: treat them as credit?
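
Before deciding how to treat them, it may help to see how many negative statements there are per month. A minimal sketch on made-up numbers (the `bills` frame is hypothetical and only mimics two of the BILL_AMT columns):

```python
import pandas as pd

# Made-up bill amounts standing in for the BILL_AMT columns.
bills = pd.DataFrame({
    "BILL_AMT1": [3558.0, -165.0, 22381.0, 0.0],
    "BILL_AMT2": [-69.0, -200.0, 21200.0, 500.0],
})

# Count of negative statements in each column.
negative_counts = (bills < 0).sum()
print(negative_counts)
```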

In [6]:
# previous payment description
df[['PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']].describe()

           PAY_AMT1      PAY_AMT2      PAY_AMT3      PAY_AMT4      PAY_AMT5      PAY_AMT6
count       30000.0       30000.0       30000.0       30000.0       30000.0       30000.0
mean      5663.5805      5921.163     5225.6815   4826.076867   4799.387633   5215.502567
std    16563.280354      23040.87   17606.96147  15666.159744  15278.305679  17777.465775
min             0.0           0.0           0.0           0.0           0.0           0.0
25%          1000.0         833.0         390.0         296.0         252.5        117.75
50%          2100.0        2009.0        1800.0        1500.0        1500.0        1500.0
75%          5006.0        5000.0        4505.0       4013.25        4031.5        4000.0
max        873552.0     1684259.0      896040.0      621000.0      426529.0      528666.0


In [7]:
df.LIMIT_BAL.describe()

count      30000.000000
mean      167484.322667
std       129747.661567
min        10000.000000
25%        50000.000000
50%       140000.000000
75%       240000.000000
max      1000000.000000
Name: LIMIT_BAL, dtype: float64

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)


In [8]:
# rename two columns
df=df.rename(columns={'default.payment.next.month':'def_pay',
                     'PAY_0':'PAY_1'})
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'def_pay'],
      dtype='object')

In [9]:
# Calculate default rate
df.def_pay.sum()/len(df.def_pay)

0.2212

In [10]:
# prepare X and y for machine learning
y=df['def_pay'].copy()
y.sample(5)

18198    1
16678    0
19134    0
11838    0
24273    0
Name: def_pay, dtype: int64

In [11]:
features=['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1',
       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
X = df[features].copy()
X.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
      dtype='object')

In [12]:
# split data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [13]:
df.def_pay.describe()

count    30000.000000
mean         0.221200
std          0.415062
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: def_pay, dtype: float64

In [14]:
y_train.describe()

count    24000.000000
mean         0.221792
std          0.415460
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: def_pay, dtype: float64

In [15]:
y_test.describe()

count    6000.000000
mean        0.218833
std         0.413490
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: def_pay, dtype: float64
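
The default rates in the training set (0.2218) and test set (0.2188) differ slightly from the overall 0.2212 because the split above is purely random. If matching class proportions were wanted, `train_test_split` accepts a `stratify` argument. A sketch on made-up labels (`X_demo` and `y_demo` are hypothetical, not the credit data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```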

# Dummy Classifier

In [23]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy="most_frequent")
clf.fit(X_train, y_train)
y_predicted=clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)

0.7811666666666667
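
This baseline can be checked by hand: a classifier that always predicts the majority class ("no default") is correct for every non-defaulter in the test set. The short calculation below uses only the figures from the `y_test` summary above (6000 rows, mean 0.218833).

```python
# Figures taken from the y_test summary above: 6000 rows, mean 0.218833.
n_test = 6000
n_default = round(0.218833 * n_test)  # number of defaulters in the test set

# Always predicting "no default" is right for every non-defaulter.
baseline_accuracy = (n_test - n_default) / n_test
print(n_default, baseline_accuracy)
```

Any model worth keeping should beat this figure on the same test set.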

# Decision Tree Classifier

In [16]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_predicted=clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)

0.7196666666666667

In [17]:
import numpy as np
from sklearn.model_selection import GridSearchCV
param_grid={
    'max_depth':[2,4,5,10,15,20],
    'criterion':['gini','entropy'],
    'max_leaf_nodes':[5,10,20,50,100],
    'min_samples_split':[5,10,15,20]
}

grid_tree = GridSearchCV(DecisionTreeClassifier(),param_grid,cv=5,scoring='accuracy')

grid_tree.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 4, 5, 10, 15, 20],
                         'max_leaf_nodes': [5, 10, 20, 50, 100],
                         'min_samples_split': [5, 10, 15, 20]},
             scoring='accuracy')

In [23]:
print(grid_tree.best_estimator_)
print(np.abs(grid_tree.best_score_))

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20, min_samples_split=20)
0.8212916666666666
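
Note that `best_score_` is the mean cross-validated accuracy over the training folds, so it is not directly comparable with the test-set accuracies printed for the untuned models. A sketch of how the refit best estimator could also be scored on held-out data follows; it uses synthetic data from `make_classification` and a deliberately tiny grid, both of which are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

cv_acc = grid.best_score_          # mean accuracy over the 5 training folds
test_acc = grid.score(X_te, y_te)  # accuracy of the refit best model on held-out rows
print(cv_acc, test_acc)
```

With the notebook's objects, the equivalent call would be `grid_tree.score(X_test, y_test)`.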


# K Nearest Neighbour

In [22]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_jobs=-1)
clf.fit(X_train, y_train)
y_predicted=clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)

0.756

In [25]:
import numpy as np
from sklearn.model_selection import GridSearchCV
param_grid={
    'n_neighbors':[1,3,5,10,20],
    'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
    'weights':['uniform', 'distance'],
    'leaf_size':[2,5,10,15,20]
}

grid_knn = GridSearchCV(KNeighborsClassifier(n_jobs=-1),param_grid,cv=5,scoring='accuracy')

grid_knn.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(n_jobs=-1),
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'leaf_size': [2, 5, 10, 15, 20],
                         'n_neighbors': [1, 3, 5, 10, 20],
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')

In [26]:
print(grid_knn.best_estimator_)
print(np.abs(grid_knn.best_score_))

KNeighborsClassifier(leaf_size=20, n_jobs=-1, n_neighbors=20)
0.7782083333333334


# Support Vector Machine

In [66]:
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y_train)
y_predicted=clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)

0.7811666666666667
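
This accuracy is identical to the dummy classifier's 0.7811666666666667, which is consistent with the default `SVC` predicting "no default" for every test row. One common response, not tried in this notebook, is to standardize the features first, since the RBF kernel is sensitive to feature scale and the columns here range from single-digit codes to amounts in the hundreds of thousands. A minimal sketch on synthetic data (`X_demo`, `y_demo` and the inflated first column are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; one column is inflated to mimic a currency-scale feature.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo[:, 0] *= 1e5

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit().
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_demo, y_demo)

pred = scaled_svc.predict(X_demo)
print(pred.shape)
```

Whether scaling actually improves on the baseline for the credit data would need to be checked on `X_test`; the same pipeline could also be passed to `GridSearchCV` in place of the bare `SVC`.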

In [21]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid={
    'C':[0.1,1,10],
    'gamma':[1,0.1,0.01,0.001,0.0001],
    'kernel':['rbf']
}

grid_svm = GridSearchCV(SVC(gamma='auto'),param_grid,cv=3,scoring='accuracy',verbose=5, n_jobs=8)

grid_svm.fit(X_train,y_train)

Fitting 3 folds for each of 15 candidates, totalling 45 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done  40 out of  45 | elapsed: 15.0min remaining:  1.9min
[Parallel(n_jobs=8)]: Done  45 out of  45 | elapsed: 15.8min finished


GridSearchCV(cv=3, estimator=SVC(gamma='auto'), n_jobs=8,
             param_grid={'C': [0.1, 1, 10],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             scoring='accuracy', verbose=5)

In [22]:
print(grid_svm.best_estimator_)
print(np.abs(grid_svm.best_score_))

SVC(C=1, gamma=0.001)
0.7799166666666667
