# About Dataset
## Dataset Information
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

### Content
There are 25 variables:

<b>ID:</b> ID of each client <br>
<b>LIMIT_BAL:</b> Amount of given credit in NT dollars (includes individual and family/supplementary credit) <br>
<b>SEX:</b> Gender (1=male, 2=female) <br>
<b>EDUCATION:</b> Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown) <br>
<b>MARRIAGE:</b> Marital status (1=married, 2=single, 3=others) <br>
<b>AGE:</b> Age in years <br>
<b>PAY_0:</b> Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above) <br>
<b>PAY_2:</b> Repayment status in August, 2005 (scale same as above) <br>
<b>PAY_3:</b> Repayment status in July, 2005 (scale same as above) <br>
<b>PAY_4:</b> Repayment status in June, 2005 (scale same as above) <br>
<b>PAY_5:</b> Repayment status in May, 2005 (scale same as above) <br>
<b>PAY_6:</b> Repayment status in April, 2005 (scale same as above) <br>
<b>BILL_AMT1:</b> Amount of bill statement in September, 2005 (NT dollar) <br>
<b>BILL_AMT2:</b> Amount of bill statement in August, 2005 (NT dollar) <br>
<b>BILL_AMT3:</b> Amount of bill statement in July, 2005 (NT dollar) <br>
<b>BILL_AMT4:</b> Amount of bill statement in June, 2005 (NT dollar) <br>
<b>BILL_AMT5:</b> Amount of bill statement in May, 2005 (NT dollar) <br>
<b>BILL_AMT6:</b> Amount of bill statement in April, 2005 (NT dollar) <br>
<b>PAY_AMT1:</b> Amount of previous payment in September, 2005 (NT dollar) <br>
<b>PAY_AMT2:</b> Amount of previous payment in August, 2005 (NT dollar) <br>
<b>PAY_AMT3:</b> Amount of previous payment in July, 2005 (NT dollar) <br>
<b>PAY_AMT4:</b> Amount of previous payment in June, 2005 (NT dollar) <br>
<b>PAY_AMT5:</b> Amount of previous payment in May, 2005 (NT dollar) <br>
<b>PAY_AMT6:</b> Amount of previous payment in April, 2005 (NT dollar) <br>
<b>default.payment.next.month:</b> Default payment (1=yes, 0=no) <br>
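
The coded demographic columns are easier to read once the integer codes are replaced by their labels. The following is a minimal sketch of that step, assuming the mappings listed above; the helper name `add_label_columns`, the `*_LABEL` column names, and the two sample rows are illustrative and not part of the dataset. Codes that are not listed above are left as missing values rather than guessed.

```python
import pandas as pd

# Code-to-label mappings copied from the variable descriptions above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {
    1: "graduate school",
    2: "university",
    3: "high school",
    4: "others",
    5: "unknown",
    6: "unknown",
}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def add_label_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with readable *_LABEL columns added.

    Codes that do not appear in the mappings become NaN.
    """
    out = df.copy()
    out["SEX_LABEL"] = out["SEX"].map(SEX_LABELS)
    out["EDUCATION_LABEL"] = out["EDUCATION"].map(EDUCATION_LABELS)
    out["MARRIAGE_LABEL"] = out["MARRIAGE"].map(MARRIAGE_LABELS)
    return out


# Two made-up rows for illustration; MARRIAGE=0 stands in for an unlisted code.
sample = pd.DataFrame({"SEX": [2, 1], "EDUCATION": [2, 6], "MARRIAGE": [1, 0]})
labeled = add_label_columns(sample)
print(labeled)
```

Keeping the original integer columns alongside the labels means the frame can still be passed to a model unchanged.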

### Inspiration
Some ideas for exploration:

How does the probability of default payment vary by categories of different demographic variables? <br>
Which variables are the strongest predictors of default payment?
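
One possible starting point for the first question is a grouped default rate. The sketch below assumes the data is loaded in a pandas DataFrame with the column names listed above; the helper name `default_rate_by` and the five-row `toy` frame are illustrative only and do not come from the dataset.

```python
import pandas as pd

TARGET = "default.payment.next.month"


def default_rate_by(df: pd.DataFrame, column: str, target: str = TARGET) -> pd.DataFrame:
    """Share of defaulting clients and group size for each value of `column`."""
    return (
        df.groupby(column)[target]
        # The target is 0/1, so its mean is the fraction of defaults.
        .agg(default_rate="mean", clients="count")
        .sort_values("default_rate", ascending=False)
    )


# Made-up rows for illustration, not real records.
toy = pd.DataFrame({"SEX": [1, 1, 2, 2, 2], TARGET: [1, 0, 0, 0, 1]})
rates = default_rate_by(toy, "SEX")
print(rates)
```

On the real data the same call could be repeated for `EDUCATION`, `MARRIAGE`, or a binned `AGE`. Reporting the group size next to the rate helps avoid over-reading small categories.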

### Acknowledgements
Any publications based on this dataset should acknowledge the following:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The original dataset can be found at the UCI Machine Learning Repository.

Data_Exploration_and_Preprocessing.ipynb

Model_Selection_and_Tuning.ipynb

Final_Model_Training_and_Evaluation.ipynb

In [10]:
import pandas as pd

In [11]:
df=pd.read_csv("UCI_Credit_Card.csv")

In [12]:
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [13]:
df.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

In [14]:
# Features: every column except the target (the ID column is kept as a feature here)
X = df.drop('default.payment.next.month', axis=1)

In [16]:
y = df['default.payment.next.month']

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
# Hold out 33% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [22]:
rf = RandomForestClassifier()

In [23]:
classifier = rf.fit(X_train, y_train)

In [35]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [25]:
predict = classifier.predict(X_test)


In [36]:
print(confusion_matrix(y_test, predict))
print(classification_report(y_test, predict))
print(accuracy_score(y_test, predict))

[[7300  442]
 [1395  763]]
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      7742
           1       0.63      0.35      0.45      2158

    accuracy                           0.81      9900
   macro avg       0.74      0.65      0.67      9900
weighted avg       0.79      0.81      0.79      9900

0.8144444444444444
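
The class 1 (default) row of the report can be re-derived by hand from the confusion matrix printed above. This is a small arithmetic check, assuming scikit-learn's layout in which rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 7300, 442
fn, tp = 1395, 763

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions over all 9900 test rows
precision = tp / (tp + fp)                  # predicted defaults that really defaulted
recall = tp / (tp + fn)                     # actual defaults that the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.8144 precision=0.63 recall=0.35 f1=0.45
```

The recall of 0.35 means roughly two out of three actual defaulters in the test split are missed, which the overall accuracy of 0.81 does not show on its own.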


In [28]:
#!pip install pycaret

In [29]:

# init setup
from pycaret.classification import setup, compare_models, create_model, tune_model
clf1 = setup(data=df, target='default.payment.next.month', session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,default.payment.next.month
2,Target type,Binary
3,Original data shape,"(30000, 25)"
4,Transformed data shape,"(30000, 25)"
5,Transformed train set shape,"(21000, 25)"
6,Transformed test set shape,"(9000, 25)"
7,Numeric features,24
8,Preprocess,True
9,Imputation type,simple


In [30]:
best_model = compare_models() 

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.82,0.7798,0.3636,0.6727,0.4719,0.3749,0.4008,2.663
lightgbm,Light Gradient Boosting Machine,0.8181,0.7782,0.3595,0.6645,0.4665,0.3684,0.3937,0.595
catboost,CatBoost Classifier,0.8166,0.7792,0.3589,0.6565,0.4639,0.3646,0.3889,5.785
ada,Ada Boost Classifier,0.815,0.7718,0.316,0.6738,0.43,0.3365,0.371,0.927
rf,Random Forest Classifier,0.8143,0.7637,0.3593,0.6446,0.4613,0.3599,0.3824,1.717
xgboost,Extreme Gradient Boosting,0.8115,0.7602,0.3531,0.6323,0.453,0.35,0.3718,1.58
et,Extra Trees Classifier,0.8109,0.7575,0.3587,0.6272,0.4562,0.3519,0.3721,0.949
lda,Linear Discriminant Analysis,0.8101,0.7142,0.2512,0.6951,0.3689,0.2852,0.3366,0.348
ridge,Ridge Classifier,0.7985,0.0,0.1432,0.725,0.2388,0.1791,0.2591,0.304
lr,Logistic Regression,0.7789,0.6467,0.0002,0.1,0.0004,0.0003,0.0041,1.116


In [31]:
gbc = create_model('gbc')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8157,0.7837,0.3793,0.64,0.4763,0.3733,0.3921
1,0.8295,0.789,0.3858,0.7103,0.5,0.4079,0.4356
2,0.8171,0.7794,0.3534,0.6613,0.4607,0.3626,0.3884
3,0.8133,0.7539,0.3341,0.6513,0.4416,0.3432,0.3708
4,0.8252,0.7837,0.3664,0.6996,0.4809,0.3879,0.4173
5,0.8224,0.7805,0.3806,0.6756,0.4869,0.3895,0.413
6,0.8105,0.7739,0.3484,0.6304,0.4488,0.3456,0.3678
7,0.8271,0.7974,0.3871,0.6977,0.4979,0.4037,0.4293
8,0.82,0.7851,0.3548,0.679,0.4661,0.3704,0.3987
9,0.8195,0.7711,0.3462,0.6822,0.4593,0.3646,0.3949


In [33]:
tuned_gbc = tune_model(gbc, optimize='Accuracy', n_iter=20)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8148,0.7837,0.2888,0.6943,0.4079,0.3196,0.363
1,0.8171,0.7889,0.2759,0.7273,0.4,0.317,0.3691
2,0.8181,0.7777,0.2737,0.7384,0.3994,0.3178,0.3725
3,0.8095,0.7574,0.2672,0.6739,0.3827,0.2941,0.3383
4,0.8229,0.7877,0.2974,0.75,0.4259,0.3436,0.3952
5,0.8157,0.7857,0.286,0.7074,0.4074,0.3207,0.367
6,0.8076,0.7783,0.2624,0.6667,0.3765,0.2874,0.3313
7,0.8157,0.7957,0.2796,0.7143,0.4019,0.3167,0.3657
8,0.8195,0.7862,0.2903,0.7337,0.416,0.3322,0.3823
9,0.8186,0.7735,0.2753,0.7442,0.4019,0.3206,0.376


Fitting 10 folds for each of 20 candidates, totalling 200 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [34]:
df

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3,1,39,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29996,29997,150000.0,1,3,2,43,-1,-1,-1,-1,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29997,29998,30000.0,1,2,2,37,4,3,2,-1,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29998,29999,80000.0,1,3,1,41,1,-1,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1
