<a href="https://colab.research.google.com/github/parmarnarayan31/credit-card-default-prediction/blob/main/Credit_Card_Default_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Predicting whether a customer will default on his/her credit card </u></b>

## <b> Problem Description </b>

### This project aims to predict customer default payments in Taiwan. From a risk-management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). We can use the [K-S chart](https://www.listendata.com/2019/07/KS-Statistics-Python.html) to evaluate which customers will default on their credit card payments.
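
As a rough illustration of the K-S idea mentioned above, the sketch below computes the K-S statistic as the largest gap between the true-positive rate and the false-positive rate over all score thresholds. The helper name `ks_statistic` and the small label and score arrays are made up for this example; they are not taken from the project data.

```python
import numpy as np
from sklearn.metrics import roc_curve


def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of the two classes."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))


# Hypothetical labels and predicted default probabilities, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.45, 0.90])

ks_demo = ks_statistic(y_true_demo, y_score_demo)
print(f"KS statistic: {ks_demo:.3f}")
```

A value close to 1 means the scores separate defaulters from non-defaulters well; a value close to 0 means the two score distributions overlap almost completely.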


## <b> Data Description </b>

### <b>Attribute Information: </b>

### This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
* ### X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* ### X2: Gender (1 = male; 2 = female).
* ### X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* ### X4: Marital status (1 = married; 2 = single; 3 = others).
* ### X5: Age (year).
* ### X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
* ### X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
* ### X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
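
The X1 to X23 labels above can be tied to the column names used in the code cells of this notebook (`LIMIT_BAL`, `PAY_0`, `BILL_AMT1`, and so on). The sketch below builds that lookup; the one-to-one ordering is an assumption based on the attribute list, and the name `x_to_column` is introduced here only for illustration.

```python
# Assumed mapping from the X1..X23 labels to the spreadsheet's column names
column_names = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]   # repayment status, September to April
    + [f"BILL_AMT{i}" for i in range(1, 7)]           # bill statements, September to April
    + [f"PAY_AMT{i}" for i in range(1, 7)]            # previous payments, September to April
)
x_to_column = {f"X{i}": name for i, name in enumerate(column_names, start=1)}

print(x_to_column["X6"], x_to_column["X12"], x_to_column["X23"])
```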

In [None]:
# Import the libraries used throughout the project
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.model_selection import (train_test_split, GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, KFold)
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score, confusion_matrix,
                             classification_report, mean_squared_error)


In [None]:
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !pip install xlrd==1.2.0

In [None]:
!pip install --upgrade --force-reinstall xlrd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlrd
  Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Installing collected packages: xlrd
  Attempting uninstall: xlrd
    Found existing installation: xlrd 1.2.0
    Uninstalling xlrd-1.2.0:
      Successfully uninstalled xlrd-1.2.0
Successfully installed xlrd-2.0.1


In [None]:
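# Load the data
# Note: read_excel needs the path of the Excel file itself; passing the Drive mount
# directory, as here, raises the IsADirectoryError shown below
df = pd.read_excel('/content/drive')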

IsADirectoryError: ignored

# Exploratory Data Analysis

In [None]:
# data head for analysis
df.head()

In [None]:
# all columns 
df.columns

In [None]:
df.tail()

In [None]:
# describe dataset
df.describe()

In [None]:
# data info for analysis
df.info()

In [None]:
df.shape

# Null values

In [None]:
# The first data row holds the real column names: promote it to the header, then drop it
df.columns = df.iloc[0]
df.drop(labels= 0, axis = 0, inplace = True)
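
Promoting a data row to the header leaves every column with the `object` dtype, because each column held a text label together with the numbers. The sketch below reproduces this on a tiny made-up frame (the `raw` table and its `X1`, `X5`, `Y` labels are hypothetical) and converts the columns back with `pd.to_numeric`. Reading the sheet with the `header` argument of `pd.read_excel` pointing at the second row is another option that may avoid the problem altogether.

```python
import pandas as pd

# Hypothetical miniature of the raw sheet: the first data row holds the real column names
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "AGE", "default payment next month"],
     [1, 20000, 24, 1],
     [2, 120000, 26, 0]],
    columns=["Unnamed: 0", "X1", "X5", "Y"],
)

demo = raw.copy()
demo.columns = demo.iloc[0]                       # promote the first row to the header
demo = demo.drop(index=0).reset_index(drop=True)  # remove the row that is now the header
demo.columns.name = None                          # drop the leftover label on the column index
print(demo.dtypes.unique())                       # object: text and numbers were mixed

demo = demo.apply(pd.to_numeric)                  # convert every column to a numeric dtype
print(demo.dtypes)
```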

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# Rename the target column to a shorter name
df.rename(columns = {'default payment next month': 'defaulters'}, inplace = True)

In [None]:
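# Percentage of non-defaulters (0) and defaulters (1)
default_pct = df['defaulters'].value_counts(normalize = True)*100
default_pct.plot.bar(figsize=(6,6), color = ('r','y'))

plt.title("Defaulters Percentage", fontsize=15)
# Label each bar with its percentage, rounded to two decimals
for x,y in zip([0,1],default_pct):
    plt.text(x,y,f"{y:.2f}",fontsize=12)
plt.show()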

So 22.12% of the clients in our dataset are defaulters and 77.88% are non-defaulters.
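
With this class balance, accuracy alone can be misleading. The sketch below uses synthetic labels with roughly the same share of defaulters (no project data is involved) and a `DummyClassifier` that always predicts the majority class: its accuracy lands near the share of non-defaulters while it finds no defaulters at all. The variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the class balance reported above (illustration only)
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.2212).astype(int)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class (non-defaulter)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_pred)
baseline_recall = recall_score(y_demo, baseline_pred)
print(f"accuracy={baseline_acc:.3f}, recall={baseline_recall:.3f}")
```

Any model evaluated later is worth comparing against this baseline, and recall on the defaulter class is a useful companion metric.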

In [None]:
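# histplot replaces the deprecated distplot; the cast guards against object-typed columns
sns.histplot(df['LIMIT_BAL'].astype(float), kde=True)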

In [None]:
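# histplot replaces the deprecated distplot; the cast guards against object-typed columns
sns.histplot(df['AGE'].astype(float), kde=True)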

The data shows that most clients are aged 20-40, and only a few fall in the 50-60 age group.

In [None]:
# Count defaulters and non-defaulters in each age group
bins = [20,30,40,50,60,70,80]
names = ['21-30','31-40','41-50','51-60','61-70','71-80']
df['AGE_BIN'] = pd.cut(x=df.AGE, bins=bins, labels=names, right=True)

age_cnt = df.AGE_BIN.value_counts()
# sort_index() keeps the counts in age-group order so that bars and text labels line up
age_0 = df.AGE_BIN[df['defaulters'] == 0].value_counts().sort_index()
age_1 = df.AGE_BIN[df['defaulters'] == 1].value_counts().sort_index()

plt.subplots(figsize=(8,5))
plt.bar(age_0.index, age_0.values, label='0', color = ('r'))
plt.bar(age_1.index, age_1.values, label='1', color = ('y'))
# Write the count on top of each bar
for x,y in zip(names,age_0):
    plt.text(x,y,y,fontsize=12)
for x,y in zip(names,age_1):
    plt.text(x,y,y,fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title("Number of clients in each age group", fontsize=15)
plt.legend(loc='upper right', fontsize=15)
plt.show()


Most clients fall in the 21-30 age group, followed by 31-40.
The number of clients who default on next month's payment decreases with each older age group.
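
Counts per age group shrink partly because older groups simply contain fewer clients, so it can help to look at the default rate next to the counts. The sketch below does this with `groupby` on a handful of made-up rows; `demo` and `summary` are illustrative names, and in the notebook the same aggregation could be applied to `df` after the `AGE_BIN` column has been created.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook would use df instead
demo = pd.DataFrame({
    "AGE": [22, 25, 28, 33, 37, 39, 44, 48, 55, 63],
    "defaulters": [1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
})
bins = [20, 30, 40, 50, 60, 70, 80]
names = ["21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]
demo["AGE_BIN"] = pd.cut(demo["AGE"], bins=bins, labels=names, right=True)

# Clients, defaulters and default rate (mean of the 0/1 target) per age group
summary = (
    demo.groupby("AGE_BIN", observed=False)["defaulters"]
        .agg(clients="count", defaulters="sum", default_rate="mean")
)
print(summary)
```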

In [None]:
# `height` is the current name of the former `size` argument of FacetGrid
sns.FacetGrid(df, hue = 'defaulters', height = 5).map(sns.histplot, 'SEX').add_legend()

In [None]:
sns.barplot(x ='SEX', y ='defaulters', data = df)

So the default rate is higher among male clients.

In [None]:
names = ['21-30','31-40','41-50','51-60','61-70','71-80']

# Defaulters per age group, split by sex (1 = male, 2 = female).
# A single combined mask avoids chained boolean indexing; sort_index() keeps age-group order.
sex_1 = df[(df['SEX'] == 1) & (df['defaulters'] == 1)]['AGE_BIN'].value_counts().sort_index()
sex_2 = df[(df['SEX'] == 2) & (df['defaulters'] == 1)]['AGE_BIN'].value_counts().sort_index()

plt.bar(sex_2.index, sex_2.values, label='FEMALE', color = ('r'))
plt.bar(sex_1.index, sex_1.values, label='MALE', color = ('y'))

# Write the count on top of each bar
for x,y in zip(names,sex_2):
    plt.text(x,y,y,fontsize=10)

for x,y in zip(names,sex_1):
    plt.text(x,y,y,fontsize=10)

plt.xticks(fontsize=10)
plt.yticks(fontsize= 10)

plt.legend(loc='upper right', fontsize=10)
plt.title("Number of defaulters by sex", fontsize=15)
plt.show()




* The EDUCATION attribute contains three categories (0, 5, and 6) that are not listed in the dataset description provided by the UCI website.
* MARRIAGE contains a category 0 that does not correspond to any of the categories described above.
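
Undocumented codes like these can be found programmatically by comparing the values present in a column with the documented ones. The sketch below does this on a few made-up rows; `documented`, `sample` and `undocumented` are names introduced only for this example, and the documented code sets are copied from the attribute description above.

```python
import pandas as pd

# Documented category codes, taken from the attribute description
documented = {"EDUCATION": {1, 2, 3, 4}, "MARRIAGE": {1, 2, 3}}

# Hypothetical sample rows standing in for the real dataframe
sample = pd.DataFrame({
    "EDUCATION": [1, 2, 0, 5, 6, 3],
    "MARRIAGE": [1, 2, 0, 3, 1, 2],
})

# Codes that occur in the data but not in the documentation
undocumented = {
    col: sorted(set(sample[col].unique().tolist()) - codes)
    for col, codes in documented.items()
}
print(undocumented)
```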

In [None]:
df['MARRIAGE'].value_counts()

In [None]:
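# Fold the undocumented category 0 into category 2; assigning the result back
# avoids relying on an in-place edit of a column view
df['MARRIAGE'] = df['MARRIAGE'].replace(0, 2)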

In [None]:
ax = df[df['defaulters'] == 1]['MARRIAGE'].value_counts(normalize = True)*100
ax.plot.bar(figsize=(6,6), color = ('r','y'))

plt.title("Defaulters Percentage order by Marriage", fontsize=15)
for x,y in zip([0,1,2],ax):
    plt.text(x,y,y,fontsize=12)
plt.show()

In [None]:
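# Age histograms in a FacetGrid: one row per default status, one column per marital status
# (FacetGrid creates its own figure, so no separate plt.figure call is needed)
sns.FacetGrid(df, row='defaulters', col = 'MARRIAGE').map(sns.histplot, 'AGE')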


Married clients between the ages of 30 and 45 have the highest chance of being defaulters, and the same holds for unmarried clients. So I think marriage is not the deciding factor; age is.

In [None]:
df['EDUCATION'].value_counts()

In [None]:
# Select the EDUCATION codes 4, 5 and 6, which the next cell merges into category 3
edu_condition = df['EDUCATION'].isin([4, 5, 6])

In [None]:
df.loc[edu_condition, 'EDUCATION' ] = 3

In [None]:
ax = df[df['defaulters'] == 1]['EDUCATION'].value_counts(normalize = True)*100
ax.plot.bar(figsize=(6,6), color = ('r','y'))

plt.title("Defaulters Percentage order by Education", fontsize=15)
for x,y in zip([0,1,2],ax):
    plt.text(x,y,y,fontsize=12)
plt.show()

  graduate and high school students

In [None]:
# Payment delay description
df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']].describe()

In [None]:
# Amount of given credit limit
df.LIMIT_BAL.describe()

In [None]:
plt.figure(figsize = (14,6))
plt.title('Amount of credit limit - Density Plot')
sns.set_color_codes("pastel")
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['LIMIT_BAL'].astype(float), kde=True, bins=200, color="blue", stat="density")
plt.show()

In [None]:
df['LIMIT_BAL'].value_counts().head(10)

In [None]:
x1 = list(df[df['defaulters'] == 1]['LIMIT_BAL'])
x2 = list(df[df['defaulters'] == 0]['LIMIT_BAL'])

plt.figure(figsize=(12,4))
sns.set_context('notebook', font_scale=1.2)
sns.set_color_codes("pastel")
plt.hist([x1, x2], bins = 40, color=['r', 'b'])
plt.xlim([0,600000])
plt.legend(['Yes', 'No'], title = 'defaulters', loc='upper right', facecolor='white')
plt.xlabel('Limit Balance (NT dollar)')
plt.ylabel('Frequency')
plt.title('LIMIT BALANCE HISTOGRAM BY DEFAULT STATUS', fontsize=15)
plt.box(False)
plt.savefig('ImageName', format='png', dpi=200, transparent=True);

In [None]:
# Credit limit of defaulters (class 1) and non-defaulters (class 0)
class_1 = df.loc[df['defaulters'] == 1]["LIMIT_BAL"].astype(float)
class_0 = df.loc[df['defaulters'] == 0]["LIMIT_BAL"].astype(float)
plt.figure(figsize = (14,6))
plt.title('Amount of credit limit - grouped by default payment next month (Density Plot)')
sns.set_color_codes("pastel")
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(class_1, kde=True, bins=200, color="red", stat="density")
sns.histplot(class_0, kde=True, bins=200, color="green", stat="density")
plt.savefig('Fig - Density plot LIMIT_BAL grouped by label.png')

In [None]:
#distribution correlated features -- scatter interaction
import matplotlib.patches as mpatches


df_np=df.to_numpy()
target=df.defaulters

# Extract the feature columns as NumPy arrays for plotting
BILL_AMT1 = df['BILL_AMT1'].to_numpy()
BILL_AMT2 = df['BILL_AMT2'].to_numpy()
BILL_AMT3 = df['BILL_AMT3'].to_numpy()
BILL_AMT4 = df['BILL_AMT4'].to_numpy()
BILL_AMT5 = df['BILL_AMT5'].to_numpy()
BILL_AMT6 = df['BILL_AMT6'].to_numpy()
AGE = df['AGE'].to_numpy()
LIMIT_BAL = df['LIMIT_BAL'].to_numpy()
PAY_AMT1 = df['PAY_AMT1'].to_numpy()

fig, ax = plt.subplots(1,3, figsize= (15,6))

labels=["Non defaulters","defaulters"]
pop_a = mpatches.Patch(color='steelblue', label='Non defaulters')
pop_b = mpatches.Patch(color='crimson', label='defaulters')
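# Colormap order follows the class value: 0 (non-defaulter) = steelblue, 1 (defaulter) = crimson,
# which matches the legend patches above
colors=['steelblue', 'crimson']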
ax[0].scatter(BILL_AMT1, BILL_AMT2, c=target, cmap=matplotlib.colors.ListedColormap(colors), label=labels, alpha=0.5)
ax[0].grid()
ax[0].set_xlabel('BILL_AMT1')
ax[0].set_ylabel('BILL_AMT2')
ax[0].legend(handles= [pop_a,pop_b])

ax[1].scatter(BILL_AMT2, BILL_AMT3, c=target, cmap=matplotlib.colors.ListedColormap(colors), alpha=0.5)
ax[1].grid()
ax[1].set_xlabel('BILL_AMT2')
ax[1].set_ylabel('BILL_AMT3')
ax[1].legend(handles= [pop_a,pop_b])

ax[2].scatter(BILL_AMT4,BILL_AMT5, c=target, cmap=matplotlib.colors.ListedColormap(colors), alpha=0.5)
ax[2].grid()
ax[2].set_xlabel('BILL_AMT4')
ax[2].set_ylabel('BILL_AMT5')
ax[2].legend(handles= [pop_a,pop_b])

plt.tight_layout()# let's make good plots
plt.show()

In [None]:
#distribution un-correlated features -- scatter interaction

fig, ax = plt.subplots(1,3, figsize= (15,6))

labels=["Non defaulters","defaulters"]
pop_a = mpatches.Patch(color='steelblue', label='Non defaulters')
pop_b = mpatches.Patch(color='crimson', label='defaulters')
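# Colormap order follows the class value: 0 (non-defaulter) = steelblue, 1 (defaulter) = crimson,
# which matches the legend patches above
colors=['steelblue', 'crimson']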

ax[0].scatter(AGE, LIMIT_BAL, c=target, cmap=matplotlib.colors.ListedColormap(colors), label=labels, alpha=0.5)
ax[0].grid()
ax[0].set_xlabel('AGE')
ax[0].set_ylabel('LIMIT_BAL')
ax[0].legend(handles= [pop_a,pop_b])

ax[1].scatter(AGE, BILL_AMT1, c=target, cmap=matplotlib.colors.ListedColormap(colors), alpha=0.5)
ax[1].grid()
ax[1].set_xlabel('AGE')
ax[1].set_ylabel('BILL_AMT1')
ax[1].legend(handles= [pop_a,pop_b])

ax[2].scatter(PAY_AMT1,BILL_AMT1, c=target, cmap=matplotlib.colors.ListedColormap(colors), alpha=0.5)
ax[2].grid()
ax[2].set_xlabel('PAY_AMT1')
ax[2].set_ylabel('BILL_AMT1')
ax[2].legend(handles= [pop_a,pop_b])

plt.tight_layout()# let's make good plots
plt.show()

In [None]:
df.head()

In [None]:
df_final = df.drop(['AGE_BIN', 'ID'], axis = 1) 

In [None]:
df_final.LIMIT_BAL  = df_final.LIMIT_BAL.astype("int64")
df_final.AGE  = df_final.AGE.astype("int64")


In [None]:
df_final.head()

In [None]:
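# One-hot encode SEX and EDUCATION for inspection only: the result is displayed, not assigned,
# so df_final (and the models below) keep the original integer-coded columns
pd.get_dummies(df_final, columns=['SEX', 'EDUCATION'], prefix=['SEX', 'EDUCATION'], drop_first=True)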

In [None]:
#df_final_2 = pd.get_dummies(df_final, drop_first= True)

# Classifiers

In [None]:
X = df_final.iloc[:, :-1]
Y = df_final['defaulters']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 24,stratify = Y )

In [None]:
x_train

In [None]:

y_train

In [None]:
y_train = y_train.astype('int')

In [None]:
Y = Y.astype('int')


# Random Forest

In [None]:
classifier = RandomForestClassifier() 
grid_values = {'n_estimators':[50,60,70,80,90,100], 'max_depth':[3, 5, 7,9,11,14]}
classifier = GridSearchCV(classifier, param_grid = grid_values, scoring = 'roc_auc', cv=5)

# Fit the object to train dataset
classifier.fit(x_train, y_train)

In [None]:
classifier.best_estimator_

In [None]:
classifier.best_params_

# Testing Accuracy

In [None]:
pred = classifier.predict(x_test)

In [None]:
y_test = y_test.astype('int')

In [None]:
accuracy_score(y_test,pred)

In [None]:
classifier.predict_proba(x_test)

In [None]:
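# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test, classifier.predict_proba(x_test)[:, 1])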

In [None]:
# Both metrics expect the true labels first and the predictions second
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

In [None]:
recall_score(y_test, pred)

Recall is 38%: the model identifies only about 38% of the actual defaulters.
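
One common way to raise recall is to lower the probability threshold at which a client is flagged as a defaulter, at the cost of more false alarms. The sketch below illustrates the trade-off on synthetic data from `make_classification`, not on the project data; the thresholds 0.5 and 0.3 and all variable names are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.78, 0.22], random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=24
)

rf_demo = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=24)
rf_demo.fit(X_tr, y_tr)
proba_default = rf_demo.predict_proba(X_te)[:, 1]   # probability of class 1

recalls = {}
for threshold in (0.5, 0.3):
    # Flag a client as a defaulter when the probability reaches the threshold
    pred_demo = (proba_default >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred_demo)
    precision = precision_score(y_te, pred_demo, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}, precision={precision:.3f}")
```

Lowering the threshold can only add predicted defaulters, so recall never decreases, while precision typically drops.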


In [None]:
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, random_state= seed,shuffle=True)
model = KNeighborsClassifier()


In [None]:
results = cross_val_score(model, X, Y, cv=kfold)
print("Mean cross-validated accuracy (KNeighbors): %f " % (results.mean()))

In [None]:
neighbors = np.arange(1,9)
Train_accuracy = np.empty(len(neighbors))
Test_accuracy = np.empty(len(neighbors))

for i, K in enumerate(neighbors):
  knn = KNeighborsClassifier(n_neighbors = K)
  knn.fit(x_train,y_train)
  Train_accuracy[i] = knn.score(x_train,y_train)
  Test_accuracy[i] = knn.score(x_test,y_test)



In [None]:
Train_accuracy

In [None]:
Test_accuracy

In [None]:
classifier_knn = KNeighborsClassifier()
grid_values_knn = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10,11]}
classifier_knn = RandomizedSearchCV(classifier_knn, param_distributions = grid_values_knn, scoring = 'roc_auc', cv=5)

# Fit the object to train dataset
classifier_knn.fit(x_train, y_train)

In [None]:
classifier_knn.best_params_

In [None]:
x_test.isna().sum()

In [None]:
pred_knn = classifier_knn.predict(x_test)

In [None]:
print("Accuracy score %s" %accuracy_score(y_test,pred_knn))
#print("F1 score %s" %f1_score(y_test,pred_knn))
print("Classification report  \n %s" %(classification_report(y_test, pred_knn)))

In [None]:
print(confusion_matrix( y_test, pred_knn))

In [None]:
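# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test, classifier_knn.predict_proba(x_test)[:, 1])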

In [None]:
recall_score(y_test,pred_knn)

On this dataset, KNN gives a better accuracy.

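# Implementing SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) increases the number of minority-class cases in the dataset by generating synthetic samples, which balances the two classes.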
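
To show the idea behind this kind of oversampling, the sketch below implements a simplified SMOTE-like interpolation with NumPy and scikit-learn only. It is a stand-in for illustration, not the `imblearn` implementation, and `smote_like_oversample` is a hypothetical helper. It runs on synthetic data and resamples only the training split, so the test split keeps its original class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is normally returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                            # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


# Synthetic, imbalanced stand-in for the credit data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.78, 0.22], random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=24
)

# Oversample the minority class of the training split until both classes are equal in size
X_min = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote_like_oversample(X_min, n_new)

X_tr_bal = np.vstack([X_tr, X_syn])
y_tr_bal = np.concatenate([y_tr, np.ones(n_new, dtype=y_tr.dtype)])
print(np.bincount(y_tr_bal), np.bincount(y_te))
```

Splitting first and resampling afterwards keeps synthetic points derived from test rows out of the training data, which gives a more honest estimate of performance on unseen clients.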

In [None]:
X.shape

In [None]:
Y.shape

In [None]:
Y.value_counts()

In [None]:
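from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class. Note: this resamples the full dataset before the
# train/test split, so synthetic points derived from test rows can reach the training set.
smote = SMOTE()
X_sm, Y_sm = smote.fit_resample(X, Y)
# Summarize the new class distribution
counter = Counter(Y_sm)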

In [None]:
counter

In [None]:
x_train_sm, x_test_sm, y_train_sm, y_test_sm = train_test_split(X_sm,Y_sm, test_size = 0.2, random_state = 24,stratify = Y_sm)

## Random Forest with SMOTE

In [None]:
classifier_sm = RandomForestClassifier()
grid_values_sm = {'n_estimators':[50,60,70,80,90,100], 'max_depth':[3, 5, 7,9,11,14]}
# Wrap the SMOTE-specific estimator in the grid search
classifier_sm = GridSearchCV(classifier_sm, param_grid = grid_values_sm, scoring = 'roc_auc', cv=5)

# Fit the object to the SMOTE-resampled training set
classifier_sm.fit(x_train_sm, y_train_sm)

In [None]:
pred_sm = classifier_sm.predict(x_test_sm)

In [None]:
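# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test_sm, classifier_sm.predict_proba(x_test_sm)[:, 1])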

In [None]:
accuracy_score(y_test_sm,pred_sm)

In [None]:
confusion_matrix(y_test_sm,pred_sm)

In [None]:
recall_score(y_test_sm,pred_sm)

## KNN with SMOTE

In [None]:
# Fit the object to train dataset
classifier_knn.fit(x_train_sm, y_train_sm)

In [None]:
pred_knn_sm = classifier_knn.predict(x_test_sm)

In [None]:
classifier_knn.best_params_

In [None]:
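# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test_sm, classifier_knn.predict_proba(x_test_sm)[:, 1])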

In [None]:
accuracy_score(y_test_sm,pred_knn_sm)

In [None]:
confusion_matrix(y_test_sm,pred_knn_sm)

In [None]:
recall_score(y_test_sm,pred_knn_sm)

# XGBoost

In [None]:
# Cast the categorical, repayment-status, bill and payment columns to integers
int_cols = (['SEX', 'EDUCATION', 'PAY_0']
            + [f'PAY_{i}' for i in range(2, 7)]
            + [f'BILL_AMT{i}' for i in range(1, 7)]
            + [f'PAY_AMT{i}' for i in range(1, 7)])
x_train = x_train.astype({col: 'int' for col in int_cols})

In [None]:
# Apply the same integer cast to the test features
int_cols = (['SEX', 'EDUCATION', 'PAY_0']
            + [f'PAY_{i}' for i in range(2, 7)]
            + [f'BILL_AMT{i}' for i in range(1, 7)]
            + [f'PAY_AMT{i}' for i in range(1, 7)])
x_test = x_test.astype({col: 'int' for col in int_cols})

In [None]:
xgb = XGBClassifier()

xgb.fit(x_train,y_train)

In [None]:
xgb_pred = xgb.predict(x_test)

In [None]:
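# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test, xgb.predict_proba(x_test)[:, 1])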

In [None]:
recall_score(y_test,xgb_pred)

## XGBoost with SMOTE

In [None]:
# Cast the categorical, repayment-status, bill and payment columns to integers
int_cols = (['SEX', 'EDUCATION', 'PAY_0']
            + [f'PAY_{i}' for i in range(2, 7)]
            + [f'BILL_AMT{i}' for i in range(1, 7)]
            + [f'PAY_AMT{i}' for i in range(1, 7)])
x_train_sm = x_train_sm.astype({col: 'int' for col in int_cols})

In [None]:
# Apply the same integer cast to the SMOTE test features
int_cols = (['SEX', 'EDUCATION', 'PAY_0']
            + [f'PAY_{i}' for i in range(2, 7)]
            + [f'BILL_AMT{i}' for i in range(1, 7)]
            + [f'PAY_AMT{i}' for i in range(1, 7)])
x_test_sm = x_test_sm.astype({col: 'int' for col in int_cols})

In [None]:
xgb_sm = XGBClassifier()

xgb_sm.fit(x_train_sm,y_train_sm)

In [None]:
xgb_pred_sm = xgb_sm.predict(x_test_sm)

In [None]:
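# ROC AUC is computed from the predicted probability of default rather than from hard labels
roc_auc_score(y_test_sm, xgb_sm.predict_proba(x_test_sm)[:, 1])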

# Performance Improvement: Ensemble Voting


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:

seed = 7
num_trees = 100
max_features = 3
kfold = KFold(n_splits=10, random_state=7,shuffle=True)

In [None]:
# ADA Boost model
model = AdaBoostClassifier(n_estimators=num_trees,random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy ADA Boost: %f " % (results.mean()))


In [None]:
# Gradient Boosting model
model =GradientBoostingClassifier(n_estimators=num_trees,random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy Gradient Boost: %f " % (results.mean()))


In [None]:
# Bagging-style ensembles for classification
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier





In [None]:
# Bagged decision trees model
cart = DecisionTreeClassifier()
# The base estimator is passed positionally because its keyword name differs across scikit-learn versions
model = BaggingClassifier(cart, n_estimators= num_trees,random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy Bagged Decision Trees: %f" % (results.mean()))

In [None]:
# Random forest model
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy Random Forest: %f " % (results.mean()))

In [None]:
# Extra Trees Model
model =ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy Extra Trees: %f " % (results.mean()))
