<a href="https://colab.research.google.com/github/kajalwasnik/Credit-card-default-predictions/blob/main/Credit_card_default_predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Credit Card Default**



##### **Project Type**    - Classification
##### **Contribution**    - Individual

**Name** - Kajal Wasnik

# **Project Summary -**

The aim of this project was to analyze a dataset on credit card defaults and develop a predictive model to identify potential defaulters. The dataset included information about credit card holders, covering demographics, credit card usage patterns, and payment history.

The initial steps involved data exploration, cleaning, and organization. This included renaming columns (e.g., changing PAY_0 to Pay_sep and `default payment next month` to Is_defulter). Exploratory data analysis (EDA) was then conducted to understand the data and highlight connections between the features and the target variable. Noteworthy predictors of default were identified, such as credit limit, payment history, and age.

To validate the EDA findings, two hypothesis tests were conducted. A one-tailed two-sample z-test for proportions was used to determine whether females were less likely than males to default, and a two-sample t-test was used to assess whether the average credit limit for defaulters was lower than that of non-defaulters. The results of these tests supported the EDA conclusions.

After EDA and hypothesis testing, data pre-processing was carried out. This involved checking for missing values, removing redundant columns, and selecting features based on low correlation and the variance inflation factor (VIF). To address class imbalance in the target variable, SMOTE was applied to oversample the minority class.

Several machine learning models, including logistic regression, decision trees, random forests, XGBoost, and K-nearest neighbors, were trained and evaluated using metrics such as recall, accuracy, and F1 scores. Cross-validation and hyperparameter tuning were performed to enhance model performance.

The XGBoost model, demonstrating the highest performance, was selected to estimate the likelihood of new credit card users defaulting on their payments.







# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The primary objective of this project is to forecast customer default payments in Taiwan. From a risk management standpoint, the predictive accuracy of the estimated probability of default holds greater significance than a binary classification outcome distinguishing between credible and non-credible clients. The K-S chart can be employed to assess and identify customers likely to default on their credit card payments.
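As a rough illustration of the K-S idea mentioned above, the sketch below computes the K-S statistic as the largest gap between the cumulative share of defaulters and the cumulative share of non-defaulters when clients are ranked by predicted probability. The function name `ks_statistic` and the example scores are hypothetical and are not taken from this notebook's models.

```python
def ks_statistic(scores, labels):
    """Largest gap between the cumulative share of defaulters (label 1)
    and non-defaulters (label 0), scanning from the highest score down."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        # Consume every client sharing the same score before measuring the gap
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            cum_pos += ranked[j][1]
            cum_neg += 1 - ranked[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


# Hypothetical predicted probabilities and true labels (1 = default)
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
example_labels = [1, 1, 0, 1, 0, 0, 0, 0]
ks_value = ks_statistic(example_scores, example_labels)
print(f"K-S statistic: {ks_value:.3f}")
```

A value close to 1 means the predicted probabilities separate defaulters from non-defaulters well; a value close to 0 means the two groups are ranked almost identically.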

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# libraries that are used for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import missingno

# libraries to do statistical analysis and tests
import scipy.stats  # explicit submodule import so scipy.stats.ttest_ind is available later
from statsmodels.stats.proportion import proportions_ztest

# libraries for data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

# libraries for performance analysis
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay, roc_auc_score, roc_curve, auc

# ML model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
# Hyperparameter tuning libraries
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# libraries for model interpretation
!pip install shap==0.40.0
import shap
import graphviz

import pickle

sns.set(style='whitegrid')
pd.set_option('display.max_columns', None)


### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Data set
Credit_cf = pd.read_excel('/content/drive/MyDrive/default of credit card clients.xls')


### Dataset First View

In [None]:
# Dataset First Look

In [None]:
Credit_cf.head() # Top 5 rows of our dataset.


In [None]:
Credit_cf.tail() # last 5 rows of our data set

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Check how many rows and columns are in our dataset.
print(f'Credit_Card = {Credit_cf.shape[0]} Rows , {Credit_cf.shape[1]} columns.')


### Dataset Information

In [None]:
# Dataset Info
Credit_cf.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
# Check Duplicate values
Credit_cf.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# Check Missing/null values
Credit_cf.isnull().sum().sort_values(ascending = False)

In [None]:
# Visualizing the missing values
import missingno as msno

In [None]:
# Using the missingno matrix chart
msno.matrix(Credit_cf)

### What did you know about your dataset?

Our credit card default dataset has no null values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Credit_cf.keys()

In [None]:
# Dataset Describe
Credit_cf.describe()

### Variables Description



*   ID: ID of each client
*   LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
*   SEX: Gender (1 = male, 2 = female)
*   EDUCATION: (1 = graduate school, 2 = university, 3 = high school, 0, 4, 5, 6 = others)
*   MARRIAGE: Marital status (0 = others, 1 = married, 2 = single, 3 = others)
*   AGE: Age in years
*   Scale for PAY_0 to PAY_6: (-2 = no consumption, -1 = paid in full, 0 = use of revolving credit (paid minimum only), 1 = payment delay for one month, 2 = payment delay for two months, 3 = payment delay for three months, ..., 8 = payment delay for eight months, 9 = payment delay for nine months and above). There is no PAY_1 column: the series runs PAY_0, PAY_2, ..., PAY_6.
*   PAY_0: Repayment status in September, 2005 (scale same as above)
*   PAY_2: Repayment status in August, 2005 (scale same as above)
*   PAY_3: Repayment status in July, 2005 (scale same as above)
*   PAY_4: Repayment status in June, 2005 (scale same as above)
*   PAY_5: Repayment status in May, 2005 (scale same as above)
*   PAY_6: Repayment status in April, 2005 (scale same as above)
*   BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
*   BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
*   BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
*   BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
*   BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
*   BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
*   PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
*   PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
*   PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
*   PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
*   PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
*   PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
*   default payment next month: Default payment (1 = yes, 0 = no)



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
# Print the unique values of every column
for col in Credit_cf.columns:
    print(f'The unique values in column {col} are', Credit_cf[col].unique())

In [None]:
Credit_cf.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

# Rename some columns for better understanding
Credit_cf.rename(columns ={'default payment next month' : 'Is_defulter'},inplace = True) # rename the target column to Is_defulter
# Rename the repayment-status columns according to their months
Credit_cf.rename(columns = {'PAY_0' : 'Pay_sep', 'PAY_2' : 'Pay_aug','PAY_3': 'Pay_jul','PAY_4':'Pay_Jun','PAY_5':'Pay_may' , 'PAY_6': 'Pay_Apr'},inplace = True)
# Rename the bill amount
Credit_cf.rename(columns = {'BILL_AMT1':'Bill_amt_sept','BILL_AMT2':'Bill_amt_aug','BILL_AMT3':'Bill_amt_jul','BILL_AMT4':'Bill_amt_jun','BILL_AMT5' : 'Bill_amt_may','BILL_AMT6': 'Bill_amt_apr'},inplace = True)
# Rename the payment amount
Credit_cf.rename(columns={'PAY_AMT1':'Pay_amt_sept','PAY_AMT2':'Pay_amt_aug','PAY_AMT3':'Pay_amt_jul','PAY_AMT4':'Pay_amt_jun','PAY_AMT5':'Pay_amt_may','PAY_AMT6':'PAY_amt_apr'},inplace=True)


Education

In [None]:
# Check how many values in Education
Credit_cf['EDUCATION'].value_counts()


1 = graduate school; 2 = university; 3 = high school; 4 = others

In [None]:
fill = Credit_cf['EDUCATION'].isin([0, 5, 6])  # undocumented education codes
Credit_cf.loc[fill,'EDUCATION'] = 4
Credit_cf['EDUCATION'].value_counts()

As we can see from the dataset, EDUCATION contains values such as 5, 6, and 0 for which there is no explanation, so we merge them into category 4 (others).

Marriage

In [None]:
# Check how many values in Marriage
Credit_cf['MARRIAGE'].value_counts()


1 = married , 2 = single , 3 = others

In [None]:
M_fill = (Credit_cf['MARRIAGE'] == 0)
Credit_cf.loc[M_fill , 'MARRIAGE'] = 3
Credit_cf['MARRIAGE'].value_counts()


A few rows have MARRIAGE = 0, which is undetermined, so they have been added to the others category (3).
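The two merges above follow the same pattern, which can be sketched as a small helper. This is only an illustration on a toy frame; `collapse_codes` is a hypothetical name and is not part of the notebook.

```python
import pandas as pd


def collapse_codes(series, undocumented, target):
    """Return a copy of `series` with every value in `undocumented` replaced by `target`."""
    # Series.where keeps values where the condition is True and substitutes `target` elsewhere
    return series.where(~series.isin(undocumented), target)


# Toy data that mimics the coded columns (not the real dataset)
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4, 5, 6, 0],
    "MARRIAGE": [1, 2, 3, 0, 1, 2, 0],
})
toy["EDUCATION"] = collapse_codes(toy["EDUCATION"], [0, 5, 6], 4)
toy["MARRIAGE"] = collapse_codes(toy["MARRIAGE"], [0], 3)
print(toy["EDUCATION"].value_counts().sort_index())
print(toy["MARRIAGE"].value_counts().sort_index())
```

Because the helper returns a new Series instead of modifying the frame in place, the original codes can still be inspected if the mapping needs to be revisited.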

In [None]:
Credit_cf.head()



### What all manipulations have you done and insights you found?



*   First, columns were renamed to make the features easier to understand: `default payment next month` became Is_defulter, and the numbered PAY_0 to PAY_6, BILL_AMT1 to BILL_AMT6, and PAY_AMT1 to PAY_AMT6 columns were renamed after their months (September back to April).
*   EDUCATION contains the values 5, 6, and 0, for which there is no explanation, so they were merged into category 4 (others).
*   A few MARRIAGE values are 0, which is undetermined, so they were added to the others category (3).






## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

# Bar chart and pie chart of the target variable
# Check how many defaulters and non-defaulters there are.
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(20,6))
ax = Credit_cf['Is_defulter'].value_counts().plot(kind='bar',title="Is_defulter",ax=axes[0])
Credit_cf['Is_defulter'].value_counts().plot(kind='pie',title="Is_defulter",autopct='%1.1f%%',ax=axes[1])
ax.set_ylabel("Count")
ax.set_xlabel("Is_defulter")
fig.tight_layout()



##### 1. Why did you pick the specific chart?

A bar chart and a pie chart were chosen because together they show the number and the percentage of clients in each class of the target variable.

##### 2. What is/are the insight(s) found from the chart?

The charts clearly show that the dataset is imbalanced.

Class 0 (non-defaulters) is much larger than class 1 (defaulters).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The dataset is imbalanced: the number of non-defaulters is high and the number of defaulters is low.
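To put a number on the imbalance, the sketch below computes the share of each class. It uses the class counts quoted in the hypothesis-testing section of this notebook (6636 defaulters and 23364 non-defaulters); `class_shares` is a hypothetical helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of observations that falls in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Labels rebuilt from the counts quoted in this notebook (0 = non-defaulter, 1 = defaulter)
labels = [0] * 23364 + [1] * 6636
shares = class_shares(labels)
imbalance_ratio = shares[0] / shares[1]
print(f"Non-defaulters: {shares[0]:.2%}, defaulters: {shares[1]:.2%}")
print(f"Non-defaulters per defaulter: {imbalance_ratio:.2f}")
```

With roughly 3.5 non-defaulters for every defaulter, accuracy alone can look good for a model that rarely predicts default, which is why recall and F1 are also reported later.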

#### Chart - 2

In [None]:
# Check how many males or females are defaulter and non defaulter
plt.figure(figsize=(12,7), dpi=80)
Credit_cf['SEX'].value_counts().plot(kind = 'pie',autopct='%1.1f%%',explode = (0.2, 0.0), colors = ['y','red', 'green','orange'],startangle=360,fontsize=14,shadow=True)
plt.title("Sex")
fig=plt.gcf()
plt.legend(loc="best")
fig.set_size_inches(6,6)
plt.show()
# Share of all defaulters contributed by each sex (groupby sorts SEX as 1 = male, 2 = female)
Credit_cf.groupby("SEX")["Is_defulter"].sum().plot.pie(title='Share of defaulters by sex', legend=True, autopct='%1.1f%%', labels=['Male','Female'], shadow=True)
plt.show()

##### 1. Why did you pick the specific chart?

Pie chart: to show how many males and females use credit cards.

Pie chart: to show each sex's share of defaulters (1 = male, 2 = female).

##### 2. What is/are the insight(s) found from the chart?

There are more women than men in the dataset, and men appear to have a slightly higher default rate than women. Note that the second pie shows each sex's share of all defaulters, so it has to be read against the group sizes in the first pie.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

On its own, this insight may not have a significant positive business impact.
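A group's share of all defaulters and its default rate are different quantities, and the toy example below illustrates the difference. The data are made up for illustration only; the column names match the notebook's.

```python
import pandas as pd

# Toy data: 4 males (SEX = 1) and 6 females (SEX = 2), not the real dataset
toy = pd.DataFrame({
    "SEX":         [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "Is_defulter": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})
by_sex = toy.groupby("SEX")["Is_defulter"]

# Share of all defaulters contributed by each sex (what a pie of sums shows)
share_of_defaulters = by_sex.sum() / toy["Is_defulter"].sum()

# Default rate within each sex (defaulters divided by group size)
default_rate = by_sex.mean()

print(share_of_defaulters)
print(default_rate)
```

In this toy frame females contribute 60% of the defaulters only because there are more of them; the default rate is 50% for both sexes. The within-group rate is the quantity that the z-test in the hypothesis-testing section compares.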

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# Using bar chart
fig,ax=plt.subplots(figsize=(20,5))
sns.barplot(data= Credit_cf , x='SEX' , y = 'Is_defulter' ,hue = 'EDUCATION' , ax=ax , palette = 'pastel')
ax.set(title = 'Is Defulter According to Sex and Education')
plt.xlabel("Sex (1 = male, 2 = female); bar colour = Education (1=graduate school, 2=university, 3=high school, 4=others)")
plt.show()



##### 1. Why did you pick the specific chart?

Bar chart: to compare the default rate across education levels within each sex.

##### 2. What is/are the insight(s) found from the chart?

The default rate is highest for high-school-educated clients and lowest for the Others category. Among high-school-educated clients, females default more than males.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
# Using Bar chart
fig, axes  = plt.subplots(ncols=2,figsize = (20,6))
sns.countplot(x = 'MARRIAGE',ax = axes[0] ,data = Credit_cf)
sns.countplot (x ='MARRIAGE',hue = 'Is_defulter',ax= axes [1], data = Credit_cf )



##### 1. Why did you pick the specific chart?

Count plot: to compare the number of clients, and the number of defaulters and non-defaulters, in each marital status.

##### 2. What is/are the insight(s) found from the chart?

Most people fall under the Married and Single categories, with Single being the largest.

The default rate is almost the same in all categories, including Others and Married.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(x = 'AGE', data = Credit_cf,palette= "Set1" )

In [None]:
plt.figure(figsize = (20,5))
sns.countplot(x = 'AGE' , hue = 'Is_defulter' , data = Credit_cf , palette = 'pastel')


##### 1. Why did you pick the specific chart?

To compare customer numbers by age

##### 2. What is/are the insight(s) found from the chart?

Most clients are between 23 and 37 years old, and defaulters are also most prevalent in this age range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
Credit_cf['LIMIT_BAL'].value_counts()


In [None]:
plt.figure(figsize=(20,5))
sns.histplot(x = 'LIMIT_BAL', hue = 'Is_defulter' ,data = Credit_cf, kde = True)
plt.show()


##### 1. Why did you pick the specific chart?

Histogram to visualize distribution of LIMIT_BAL.

##### 2. What is/are the insight(s) found from the chart?

The distribution is right-skewed, as expected, and the majority of clients have credit limits of 200k or less, with a higher rate of default in that range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.
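A histogram shows counts, so one way to check a statement about default rates is to bin the credit limit and compute the rate per band. The sketch below does this on made-up values; the 200k cut-off mirrors the observation above, while the band labels are hypothetical.

```python
import pandas as pd

# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "LIMIT_BAL":   [20_000, 50_000, 120_000, 200_000, 260_000, 360_000, 500_000, 30_000],
    "Is_defulter": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Right-inclusive bands: (0, 200k] and (200k, 1M]
bands = pd.cut(toy["LIMIT_BAL"], bins=[0, 200_000, 1_000_000], labels=["<=200k", ">200k"])

# Mean of a 0/1 column is the default rate within each band
rate_by_band = toy.groupby(bands, observed=True)["Is_defulter"].mean()
print(rate_by_band)
```

Applied to `Credit_cf`, the same two lines would give the default rate per band directly instead of reading it off overlapping histogram bars.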

#### Chart - 7

In [None]:
Pay_col = ['Pay_sep', 'Pay_aug', 'Pay_jul', 'Pay_Jun', 'Pay_may', 'Pay_Apr']
for col in Pay_col:
  plt.figure(figsize = (20,5))
  sns.countplot (x = col , hue = 'Is_defulter', data = Credit_cf , palette = 'pastel')


##### 1. Why did you pick the specific chart?

Count plots to compare defaulters and non-defaulters for each repayment status, month by month.

##### 2. What is/are the insight(s) found from the chart?

The chance of default rises as the payment delay grows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
pay_amnt_df = Credit_cf[['Pay_amt_sept', 'Pay_amt_aug', 'Pay_amt_jul', 'Pay_amt_jun', 'Pay_amt_may', 'PAY_amt_apr', 'Is_defulter']]



In [None]:
sns.pairplot(data = pay_amnt_df, hue ='Is_defulter' )

##### 1. Why did you pick the specific chart?

Pair plot (scatter plots): to visualize default status across the monthly payment amounts and look for any pattern.

##### 2. What is/are the insight(s) found from the chart?

In the plot, orange points are defaulters and blue points are non-defaulters, shown across the payment amounts for the six months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
# Check correlation between the monthly bill amounts
Bill_amount = Credit_cf[['Bill_amt_sept', 'Bill_amt_aug', 'Bill_amt_jul', 'Bill_amt_jun', 'Bill_amt_may', 'Bill_amt_apr']]


In [None]:
sns.pairplot(data = Bill_amount)

##### 1. Why did you pick the specific chart?

The pair plot shows the bill amount of every month against every other month.

##### 2. What is/are the insight(s) found from the chart?

Negative bill statements are associated with a reduced likelihood of default than positive ones, as would be predicted. What is notable is that those who didn't have a bill in the preceding months had a somewhat higher likelihood of defaulting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(20,7))
plt.title("Credit Limit vs Sex")
sns.boxplot(x='SEX',y='LIMIT_BAL',hue='Is_defulter',data=Credit_cf)
plt.ylabel("Credit Limit")
plt.xlabel('Sex  1:Male   2: Female')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot shows the median, quartiles, and outliers of the credit limit for each group, so it was used here.

##### 2. What is/are the insight(s) found from the chart?

Females have higher credit limits than males, and the pattern for defaulters is the same in both categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.
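The centre line of a box plot is the median, and the box edges are the first and third quartiles. The sketch below computes those values for a small made-up sample and compares the median with the mean; the numbers are hypothetical.

```python
from statistics import mean, median, quantiles

# Hypothetical credit limits (in thousands of NT dollars) with one very large value
limits = [10, 20, 30, 40, 50, 60, 70, 80, 500]

# Quartiles using the inclusive method, which treats the data as the whole population
q1, q2, q3 = quantiles(limits, n=4, method="inclusive")
sample_median = median(limits)
sample_mean = mean(limits)

print(f"Q1 = {q1}, median = {sample_median}, Q3 = {q3}")
print(f"mean = {sample_mean:.2f}")
```

The single large value pulls the mean well above the median, which is why a box plot is better read as a summary of the median and spread than of the mean.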

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
plt.style.use('ggplot')
plt.figure(figsize=(20,7))
plt.title("Age vs Marriage")
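ax = sns.boxplot(x='Is_defulter',hue='MARRIAGE', y='AGE',data=Credit_cf)
plt.ylabel("Age")
plt.xlabel('0:Non-Default,1:Default')
# Relabel seaborn's own legend entries so colours and labels stay aligned (1 = married, 2 = single, 3 = others)
for text, label in zip(ax.get_legend().get_texts(), ['Married','Single','Others']):
    text.set_text(label)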
plt.show()


##### 1. Why did you pick the specific chart?

A box plot shows the median, quartiles, and outliers of age for each group, so it was used here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
plt.figure(figsize=(20,10))
correlation = Credit_cf.corr()
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

To check the correlation between each features in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The BILL_AMT columns are strongly correlated with each other, which is understandable given that a client's spending tends to be similar from month to month. Similarly, the PAY_AMT columns are correlated with each other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.
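With this many columns, the strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below is a minimal example on synthetic data; `correlated_pairs` and the 0.8 threshold are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return sorted(pairs, key=lambda p: -p[2])


# Synthetic data: two bill columns that move together, and an unrelated age column
rng = np.random.default_rng(0)
base = rng.normal(50_000, 20_000, size=500)
toy = pd.DataFrame({
    "Bill_amt_sept": base,
    "Bill_amt_aug": base + rng.normal(0, 2_000, size=500),
    "AGE": rng.integers(21, 70, size=500).astype(float),
})
pairs = correlated_pairs(toy)
print(pairs)
```

A list like this is a convenient starting point for the correlation-based feature selection mentioned in the project summary.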

#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
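# Columns 12-17 are the six bill amounts and columns 18-23 the six payment amounts (column 0 is ID)
sns.pairplot(Credit_cf, vars=Credit_cf.columns[12:18], kind='scatter',hue= 'Is_defulter')
sns.pairplot(Credit_cf, vars=Credit_cf.columns[18:24],hue = 'Is_defulter')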


##### 1. Why did you pick the specific chart?

A pair plot is used to check all of these variables for outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It will help to gain insights that create a positive business impact.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The proportion of defaulters is the same for females and males.

Alternate hypothesis: The proportion of defaulters is lower for females than for males.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
# Create two groups based on sex
male_defaults = Credit_cf[( Credit_cf['SEX'] == 1) & ( Credit_cf['Is_defulter'] == 1)]
female_defaults =  Credit_cf[( Credit_cf['SEX'] == 2) & ( Credit_cf['Is_defulter'] == 1)]

# Calculate the proportions of defaulters in each group
male_proportion = len(male_defaults) / len(Credit_cf[Credit_cf['SEX'] == 1])
female_proportion = len(female_defaults) / len(Credit_cf[ Credit_cf['SEX'] == 2])

# Perform the test for the difference in proportions
z_score, p_value = proportions_ztest([len(female_defaults), len(male_defaults)],
                                           [len( Credit_cf[ Credit_cf['SEX'] == 2]), len( Credit_cf[ Credit_cf['SEX'] == 1])],
                                           alternative='smaller')

# Print the results
print("Male default rate:", round(male_proportion, 4))
print("Female default rate:", round(female_proportion, 4))
print("Z-score:", z_score)
print("P-value:", p_value)
print()
if p_value < 0.05:
  print(f"Since p-value ({p_value}) is less than 0.05, we reject null hypothesis.\nHence, The proportion of defaulters is lower for females than for males.")
else:
  print(f"Since p-value ({p_value}) is greater than 0.05, we fail to reject null hypothesis.\nHence, The proportion of defaulters is the same for males and females.")



##### Which statistical test have you done to obtain P-Value?

I used a one-tailed two-sample z-test for proportions to obtain the p-value. The null hypothesis was rejected, indicating that the proportion of defaulters is smaller for women than for men.

##### Why did you choose the specific statistical test?

The two-sample z-test for proportions is used to detect whether there is a significant difference between two groups (in this example, males and females) in the proportion of a particular outcome (in this case, default payment). The test is one-tailed because the question is directional: whether females have a smaller proportion of defaulters than males, not merely whether the two proportions differ.
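For readers who want to see what the test computes, the sketch below reproduces a pooled two-proportion z-test by hand using only the standard library. The function names and the counts are hypothetical and are not the notebook's results.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_proportion_ztest_smaller(count_a, n_a, count_b, n_b):
    """One-tailed test of H1: proportion in group A < proportion in group B."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Under H0 both groups share one proportion, so the variance uses the pooled estimate
    pooled = (count_a + count_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For the "smaller" alternative the p-value is the lower tail
    return z, normal_cdf(z)


# Hypothetical counts: 200 of 1000 defaults in group A, 250 of 1000 in group B
z_score, p_value = two_proportion_ztest_smaller(200, 1000, 250, 1000)
print(f"z = {z_score:.4f}, p = {p_value:.4f}")
```

A negative z-score with a small p-value supports the alternative that group A has the lower proportion, which is the same reading applied to the female and male default rates above.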

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The average credit limit for defaulters is equal to the average credit limit for non-defaulters.

Alternative hypothesis: The average credit limit for defaulters is lower than the average credit limit for non-defaulters.

Test Type: Two-sample t-test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

In [None]:
# Perform Statistical Test to obtain P-Value
# Split the data into defaulters and non-defaulters
defaulters = Credit_cf[Credit_cf['Is_defulter'] == 1]
non_defaulters = Credit_cf[Credit_cf['Is_defulter'] == 0]

# Calculating the mean credit limit for defaulters and non-defaulters
mean_credit_limit_defaulters = defaulters['LIMIT_BAL'].mean()
mean_credit_limit_non_defaulters = non_defaulters['LIMIT_BAL'].mean()

# Calculating the standard deviation of credit limit for defaulters and non-defaulters
std_credit_limit_defaulters = defaulters['LIMIT_BAL'].std()
std_credit_limit_non_defaulters = non_defaulters['LIMIT_BAL'].std()

# Calculate the sample sizes for defaulters and non-defaulters
n_defaulters = len(defaulters)
n_non_defaulters = len(non_defaulters)

# Calculate the standard error of the mean difference
se_mean_difference = ((std_credit_limit_defaulters ** 2 / n_defaulters) + (std_credit_limit_non_defaulters ** 2 / n_non_defaulters)) ** 0.5

# Calculate the t-statistic and p-value using Welch's two-sample t-test (one-sided, matching the alternative hypothesis; requires SciPy >= 1.6)
t_stat, p_value = scipy.stats.ttest_ind(defaulters['LIMIT_BAL'], non_defaulters['LIMIT_BAL'], equal_var=False, alternative='less')

# Print the results
print('Mean credit limit for defaulters:', mean_credit_limit_defaulters)
print('Mean credit limit for non-defaulters:', mean_credit_limit_non_defaulters)
print('t-statistic:', t_stat)
print('p-value:', p_value)
print()
if p_value < 0.05:
  print(f"Since p-value ({p_value}) is less than 0.05, we reject null hypothesis.\nHence, The average credit limit for defaulters is lower than the average credit limit for non-defaulters.")
else:
  print(f"Since p-value ({p_value}) is greater than 0.05, we fail to reject null hypothesis.\nHence, The average credit limit for defaulters is equal to the average credit limit for non-defaulters.")





##### Which statistical test have you done to obtain P-Value?

I used a two-sample t-test to obtain the p-value. The null hypothesis was rejected, indicating that the average credit limit for defaulters is lower than the average credit limit for non-defaulters.

##### Why did you choose the specific statistical test?

The two-sample t-test compares the means of two independent groups. It formally assumes normally distributed data, but with sample sizes as large as these (6636 and 23364) the sample means are approximately normal, so the test can still be applied. The code sets `equal_var=False`, so the two groups are not assumed to share the same variance.
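The unequal-variance (Welch) form of the test can be sketched by hand to show how the standard error computed in the cell above enters the statistic. The helper name `welch_t` and the sample values are hypothetical; the p-value step is left to SciPy because the standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean
    va = variance(sample_a) / n_a
    vb = variance(sample_b) / n_b
    se = sqrt(va + vb)  # standard error of the difference in means
    t = (mean(sample_a) - mean(sample_b)) / se
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Hypothetical credit limits (in thousands) for two small groups
group_a = [100.0, 120.0, 90.0, 110.0, 80.0]
group_b = [150.0, 170.0, 160.0, 180.0, 140.0, 200.0]
t_value, dof = welch_t(group_a, group_b)
print(f"t = {t_value:.3f}, degrees of freedom = {dof:.2f}")
```

A negative t-statistic here means the first group has the lower mean, which is the direction stated in the alternative hypothesis for defaulters.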

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

There are no missing values in our dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
Numerical_columns = ['ID', 'LIMIT_BAL', 'AGE', 'Bill_amt_sept', 'Bill_amt_aug', 'Bill_amt_jul', 'Bill_amt_jun', 'Bill_amt_may', 'Bill_amt_apr', 'Pay_amt_sept', 'Pay_amt_aug', 'Pay_amt_jul', 'Pay_amt_jun', 'Pay_amt_may', 'PAY_amt_apr']


In [None]:
def draw_histograms(df, columns, bins=50):
    # Plot one histogram (with KDE) per requested column on a shared figure
    fig=plt.figure(figsize=(18,25))
    for i, col in enumerate(columns):
      plt.subplot(10, 4, i+1)
      sns.histplot(df[col], kde=True, bins=bins)
      plt.title(col)
    fig.tight_layout()


In [None]:
draw_histograms(Credit_cf , Numerical_columns)

We can see that most of the numerical columns are right-skewed.

Also, late_payment_count has 6 values, with 0 being the majority, and we can't treat high values (5 or 6) as outliers because these high values increase the chance of defaulting.

In [None]:
# Look closely at cases with high values for BILL_AMT1 to check whether they are genuine observations or data entry errors
Credit_cf[Credit_cf['Bill_amt_sept'] > 400000][['LIMIT_BAL', 'Pay_sep', 'Pay_aug', 'Pay_jul', 'Bill_amt_sept','Bill_amt_aug', 'Bill_amt_jul', 'Pay_amt_sept', 'Pay_amt_aug', 'Pay_amt_jul', 'Is_defulter']].head(20)


These rows have very high BILL_AMT values, but their LIMIT_BAL is also very high. They most likely represent a small number of very wealthy clients, so the data look genuine rather than erroneous. Hence they are not treated as outliers.

Those who defaulted have significantly lower PAY_AMT compared to BILL_AMT which is expected.

In [None]:
Credit_cf[Credit_cf['Bill_amt_sept'] > 300000][['LIMIT_BAL', 'Pay_sep', 'Pay_aug', 'Pay_jul', 'Bill_amt_sept','Bill_amt_aug', 'Bill_amt_jul', 'Pay_amt_sept', 'Pay_amt_aug', 'Pay_amt_jul', 'Is_defulter']].head(20)



Similarly, clients with very high PAY_AMT values have very high BILL_AMT in previous months and have made all payments duly, so they most likely represent very wealthy clients and the data are genuine rather than errors. Hence they are not treated as outliers.

##### What all outlier treatment techniques have you used and why did you use those techniques?

No outlier treatment was applied, because the extreme values appear to be genuine observations rather than errors.
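One common way to shortlist extreme rows for the kind of manual inspection done above is the 1.5 × IQR rule. The sketch below applies it to made-up values; `iqr_flags` is a hypothetical helper, and the notebook itself inspects fixed thresholds instead.

```python
import pandas as pd


def iqr_flags(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "LIMIT_BAL":     [50_000, 80_000, 100_000, 120_000, 150_000, 800_000],
    "Bill_amt_sept": [20_000, 30_000, 45_000, 60_000, 70_000, 610_000],
})
flags = iqr_flags(toy["Bill_amt_sept"])

# Inspect flagged rows next to their credit limit before deciding how to treat them
print(toy[flags])
```

In this toy frame the flagged bill belongs to the client with the highest credit limit, which is the same pattern that led to keeping the extreme values here.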

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

In [None]:
Credit_cf.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)



In [None]:
Credit_cf['EDUCATION']

In [None]:
Credit_cf

**One Hot encoding**



In [None]:
# Dummy-encode (one-hot) the EDUCATION and MARRIAGE variables
Credit_cf = pd.get_dummies(Credit_cf,columns=['EDUCATION','MARRIAGE'])


In [None]:
Credit_cf.head()

In [None]:
# Drop the EDUCATION_others and MARRIAGE_others dummy columns
Credit_cf.drop(['EDUCATION_others','MARRIAGE_others'],axis = 1, inplace = True)



In [None]:
# Dummy-encode the payment status variables
Credit_cf = pd.get_dummies(Credit_cf, columns = ['Pay_sep', 'Pay_aug', 'Pay_jul', 'Pay_Jun', 'Pay_may', 'Pay_Apr'], drop_first = True)


In [None]:
Credit_cf.head()

In [None]:
# LABEL ENCODING FOR SEX
encoders_nums = {
                 "SEX":{"FEMALE": 0, "MALE": 1}
}
Credit_cf = Credit_cf.replace(encoders_nums)



In [None]:
Credit_cf.head()

In [None]:
Credit_cf.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used one-hot encoding for EDUCATION, MARRIAGE, and the monthly PAY status columns because they are treated as unordered categories and the number of categories is not too high. SEX has only two categories, so it was label-encoded as 0/1.
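As a minimal illustration of what `pd.get_dummies` followed by dropping one column does, the sketch below one-hot encodes a small made-up list in plain Python. The `one_hot` helper and the sample values are hypothetical and are not part of the notebook.

```python
def one_hot(values, drop=None):
    """Return (column_names, rows) of 0/1 indicators for a list of category labels."""
    categories = sorted(set(values))
    if drop is not None:
        # The dropped category becomes the reference level (all zeros)
        categories = [c for c in categories if c != drop]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

marriage = ["married", "single", "others", "single"]  # made-up sample
columns, encoded = one_hot(marriage, drop="others")
print(columns)   # ['married', 'single']
print(encoded)   # [[1, 0], [0, 1], [0, 0], [0, 1]]
```

The dropped category is still recoverable: a row of all zeros means "others", which is why removing one dummy column per variable loses no information.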

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

In [None]:
plt.figure(figsize = (15,11))
sns.heatmap(abs(Credit_cf[Numerical_columns].corr()), annot = True, cmap = 'cool').set_title('Correlation Heatmap to analyze the features', fontsize = 18)
plt.show()


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
  vif = pd.DataFrame()
  vif["variable"] = X.columns
  vif["VIF"] = [variance_inflation_factor(X.values ,i) for i in range (X.shape[1])]

  return(vif)



In [None]:
calc_vif(Credit_cf[Numerical_columns])

In [None]:
# Recompute VIF after excluding most of the bill-amount columns
excluded = ['Bill_amt_sept', 'Bill_amt_aug', 'Bill_amt_jun', 'Bill_amt_may', 'Bill_amt_apr']
calc_vif(Credit_cf[[i for i in (Credit_cf[Numerical_columns]).describe().columns if i not in excluded]])



In [None]:
Cr = Credit_cf.drop('ID',axis = 1)

In [None]:
Cr

##### What all feature selection methods have you used  and why?

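To select features, I checked multicollinearity among the numerical columns with a correlation heatmap and the variance inflation factor (VIF), recomputing the VIF after excluding most of the bill-amount columns. I also dropped the ID column. Since there are only 9 categorical features, I used only one-hot encoding for them. One-hot encoding creates as many new columns as there are unique values; it makes the training data more expressive, and the resulting columns can be rescaled easily.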
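For reference, the quantity that the `calc_vif` helper above reports for feature $j$ is the standard variance inflation factor, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A value of 1 means the feature is uncorrelated with the others, and the value grows without bound as $R_j^2$ approaches 1. Cut-offs such as 5 or 10 are common rules of thumb rather than fixed thresholds.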

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

In [None]:
X = Cr.drop('Is_defulter', axis=1)
y = Cr['Is_defulter']

In [None]:
#split the Dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print(X_train.shape)
print(X_test.shape)

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The class counts above show that the data is imbalanced:

*   Only 22% of the 24,000 training samples are defaults.
*   Only 21.6% of the 6,000 test samples are defaults.

Therefore, we balance the datasets we use for training and testing.

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE
from collections import Counter



In [None]:
# Upsample the minority class using SMOTE
print("Before oversampling: ",Counter(y_train))
# Use a lowercase instance name so the SMOTE class is not overwritten
smote = SMOTE()
# Resample the minority class
X_train, y_train = smote.fit_resample(X_train, y_train)
print("After oversampling: ",Counter(y_train))


In [None]:

# Oversample the test set as well; test metrics below are measured on this balanced, partly synthetic sample
print("Before oversampling: ",Counter(y_test))

X_test, y_test = smote.fit_resample(X_test, y_test)
print("After oversampling: ",Counter(y_test))

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I did not use undersampling because it discards data: it would remove around 13,000 rows from our training dataset.

To balance our data, I instead used the Synthetic Minority Oversampling Technique (SMOTE). This method generates synthetic samples for the minority class. SMOTE selects a point at random from the minority class, finds its k nearest minority-class neighbors, and inserts synthetic points on the line segments between the selected point and those neighbors.
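The interpolation step described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the `smote_sketch` function, its parameters, and the sample points are all hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen point (excluding itself)
        others = [p for p in minority if p is not point]
        neighbors = sorted(others, key=lambda p: math.dist(p, point))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment point -> neighbor
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(point, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]  # made-up minority samples
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point is a convex combination of two real minority samples, it always falls within the region already occupied by the minority class.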

# **7. Data Scaling**

In [None]:
# Scaling your data

In [None]:
# Data Scaling
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
# Fit the scaler on the training data only, then apply the same transform to the test data
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)


**Which method have you used to scale you data and why?**

I used standardization (scikit-learn's StandardScaler). Feature scaling brings features measured on very different ranges onto a comparable scale. Since data values might fluctuate considerably, scaling the data before applying machine learning algorithms becomes essential.

Standardization and Normalization

Normalization (min-max scaling) rescales each feature to a fixed range, typically 0 to 1. Standardization, also known as z-score normalization, transforms each feature so that it has a mean of 0 and a standard deviation of 1.
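To make the z-score transformation concrete, here is a small standard-library sketch of standardizing one column. The helper names and the credit-limit values are made up for illustration; the point is that the mean and standard deviation are learned from the training values and then reused for the test values.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned statistics."""
    return [(x - mu) / sigma for x in column]

train_limit = [20000.0, 50000.0, 80000.0, 250000.0]  # made-up credit limits
test_limit = [30000.0, 500000.0]

mu, sigma = fit_scaler(train_limit)               # statistics come from the training data only
train_scaled = transform(train_limit, mu, sigma)
test_scaled = transform(test_limit, mu, sigma)    # the same mu and sigma are reused here
```

After this step the training column has mean 0 and standard deviation 1, while the test column is expressed in the same units without influencing the learned statistics.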

**7. Dimensionality Reduction**

Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

**Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)**


Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
log = LogisticRegression(random_state = 42)

# Fit the Algorithm
log.fit(X_train, y_train)

# Predict on the model


In [None]:
# Predict on the model
log_train  = log.predict(X_train)
log_test  = log.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
train_accuracy = accuracy_score(log_train,y_train)
test_accuracy = accuracy_score(log_test,y_test)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)

In [None]:
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_train, log_train)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Training Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_test, log_test)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Test Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

**1. Explain the ML Model used and its performance using Evaluation metric Score Chart.**

The first model is Logistic Regression. It is a classification method that estimates the probability of an outcome that can take only one of two values (a dichotomy). It passes a linear combination of the features through the logistic (sigmoid) function, so its output always lies between 0 and 1.
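A minimal sketch of that logistic curve follows. The weights, bias, and feature values are hypothetical numbers chosen for illustration, not coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one customer; weights and bias are hypothetical."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

p, label = predict_default(features=[1.2, -0.4], weights=[0.8, 0.5], bias=-0.3)
print(round(p, 3), label)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default threshold for turning probabilities into class labels.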

In [None]:
print(classification_report(y_test,log_test))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid = {'penalty':['l1','l2'], 'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
# Search for the best parameters; liblinear supports both l1 and l2 penalties (the default lbfgs solver does not support l1)
grid_lr_clf = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
# Fit the Algorithm
grid_lr_clf.fit(X_train, y_train)


In [None]:
# Best parameters (the grid search above is scored on accuracy)
print("The best parameters are:" ,grid_lr_clf.best_params_)
print("\nUsing ",grid_lr_clf.best_params_, " the cross-validated accuracy is: ", grid_lr_clf.best_score_)


In [None]:
# Refit on the training data using the best parameters
logit = LogisticRegression(C=0.01, fit_intercept=False, penalty='l2')
logit.fit(X_train, y_train)



In [None]:
# Predict the values
train_log_hp = logit.predict(X_train)
test_log_hp = logit.predict(X_test)

In [None]:

# Check the accuracy of the tuned model
train_accuracy2 = accuracy_score(train_log_hp,y_train)
test_accuracy2 = accuracy_score(test_log_hp,y_test)

print("The accuracy on train data is ", train_accuracy2)
print("The accuracy on test data is ", test_accuracy2)


In [None]:
# Check recall score
print(classification_report(y_test,test_log_hp))

In [None]:
# ROC AUC CURVE
fpr, tpr, _ = roc_curve(y_test,test_log_hp)
auc = roc_auc_score(y_test,test_log_hp )
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

I used the Logistic Regression algorithm to create the model. The results were modest.

The precision, recall, and F1 score on the test data are 0.88, 0.76, and 0.81, respectively; the ROC AUC is shown in the legend of the plot above.
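The ROC cells in this notebook pass hard 0/1 predictions to `roc_curve` and `roc_auc_score`. As a hedged illustration of why that matters, the sketch below computes AUC by pairwise ranking on made-up data, once from probabilities and once from thresholded labels. The `roc_auc` helper is hypothetical and stands in for the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]      # made-up predicted probabilities
labels = [int(p >= 0.5) for p in probs]       # hard 0/1 predictions at a 0.5 threshold

auc_from_probs = roc_auc(y_true, probs)       # 8/9, about 0.889
auc_from_labels = roc_auc(y_true, labels)     # 2/3, about 0.667
print(auc_from_probs, auc_from_labels)
```

With hard labels the AUC collapses to the average of the true positive rate and the true negative rate at a single threshold. With scikit-learn classifiers, the probabilities would typically come from `predict_proba(X_test)[:, 1]`.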

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV to tune the Logistic Regression model. GridSearchCV performs an exhaustive search over a specified grid of parameter values (here the penalty type and the regularization strength C), evaluates each combination with cross-validation, and keeps the combination with the best score.
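The idea behind a grid search can be sketched without scikit-learn. In the sketch below, `grid_search` and `toy_score` are hypothetical names: `toy_score` is a stand-in for the mean cross-validated score that GridSearchCV would compute by fitting a model on each fold.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # stands in for the mean cross-validated score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical scoring function that peaks at C=0.1 with the l2 penalty."""
    penalty_bonus = 0.02 if params["penalty"] == "l2" else 0.0
    return 0.80 - abs(params["C"] - 0.1) * 0.01 + penalty_bonus

param_grid = {"penalty": ["l1", "l2"], "C": [0.001, 0.01, 0.1, 1, 10]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)  # {'penalty': 'l2', 'C': 0.1}
```

The cost grows with the product of the list lengths (here 2 x 5 = 10 candidates), multiplied by the number of cross-validation folds, which is why the grids in this notebook are kept small.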

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.



*   There is no improvement in the accuracy score after hyperparameter tuning.
*   The recall score has increased by 2% after using hyperparameter tuning.



### ML Model - 2

In [None]:
# ML Model-2 Implementation
clf_dt = DecisionTreeClassifier(criterion='gini', random_state=0, max_leaf_nodes=10)

# Fit the Algorithm
clf_dt.fit(X_train, y_train)


In [None]:
# Predict the model
train_dt = clf_dt.predict(X_train)
test_dt = clf_dt.predict (X_test)

In [None]:
# Check accuracy of our model
train_dt_acc = accuracy_score(train_dt,y_train)
test_dt_acc = accuracy_score(test_dt,y_test)
print('The accuracy of model train data set is',train_dt_acc)
print('The accuracy of model test data set is ',test_dt_acc)


In [None]:

# Auc_roc score
fpr, tpr, _ = roc_curve(y_test,test_dt)
auc = roc_auc_score(y_test,test_dt)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The second model is a Decision Tree Classifier. At each node, the tree identifies the variable and the split value that produce the most homogeneous groups of the population, measured here by Gini impurity.
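How a tree chooses a split can be illustrated with a small sketch that scores candidate thresholds on a single feature by weighted Gini impurity. The helper names and the tiny dataset are made up for illustration; a real tree repeats this search over every feature at every node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class proportions."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

late_payments = [0, 0, 1, 2, 3, 4]  # made-up feature values
defaulted = [0, 0, 0, 1, 1, 1]      # made-up target
threshold, impurity = best_threshold(late_payments, defaulted)
print(threshold, impurity)  # 1 0.0
```

In this toy data, splitting at `late_payments <= 1` separates the two classes perfectly, so the weighted impurity drops to 0.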

In [None]:
# Check the recall score on the test data set
print(classification_report(y_test,test_dt))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (GridSearch CV)
# Parameter grid
dt_param = {'criterion':['gini','entropy','log_loss'], 'max_depth': np.arange(3, 15)}

dt_grid = GridSearchCV(DecisionTreeClassifier(), param_grid = dt_param, cv=10, verbose=2, scoring='recall')

# Fit the Algorithm
dt_grid.fit(X_train,y_train)

In [None]:
# Check the best parameters (the grid search above is scored on recall)
print("The best parameters are:" ,dt_grid.best_params_)
print("\nUsing ",dt_grid.best_params_, " the cross-validated recall is: ", dt_grid.best_score_)


In [None]:
# Refit using the best parameters
dt_tree_grid = DecisionTreeClassifier(criterion='gini', max_depth=13)
dt_tree_grid.fit(X_train,y_train)



In [None]:
# predict model
dt_train_grid = dt_tree_grid.predict(X_train)
dt_test_grid = dt_tree_grid.predict(X_test)


In [None]:
# Check accuracy of our model
dt_train_grid_acc = accuracy_score(dt_train_grid,y_train)
dt_test_grid_acc = accuracy_score(dt_test_grid,y_test)
print('The accuracy of the tuned model on the training set is ', dt_train_grid_acc)
print('The accuracy of the tuned model on the test set is ',dt_test_grid_acc)


In [None]:
# Plot the confusion matrix as a heatmap
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_train,dt_train_grid)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Training Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_test,dt_test_grid)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Test Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Check the evaluation metrics of the tuned model
print(classification_report(y_test,dt_test_grid))


In [None]:
# Check auc_roc curve by graph
fpr, tpr, _ = metrics.roc_curve(y_test,dt_test_grid)
auc = metrics.roc_auc_score(y_test,dt_test_grid)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

**Explain the ML Model used and its performance**

I used the Decision Tree Classifier algorithm to create the model. The results were modest, similar to Logistic Regression.

The precision, recall, F1, and ROC AUC scores on the test data are 0.83, 0.68, 0.74, and 0.77, respectively.

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which performs an exhaustive search over a specified parameter grid to find the hyperparameters that give the best model performance. Here the grid covered the split criterion and max_depth values from 3 to 14, scored by recall with 10-fold cross-validation.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The third model is Random Forest. In contrast to the single tree of a CART model, a random forest grows many trees. Each tree is built from a subset of the original dataset, which may include only some of the rows and columns. To classify a new object from its features, each tree gives a classification, and we say the tree "votes" for that class. For classification, the forest picks the class with the most votes across all its trees; for regression, it averages the outputs of the trees.
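The aggregation step can be sketched as follows. The function names and the per-tree outputs are hypothetical; the sketch only shows how individual tree outputs are combined, not how the trees are trained.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the class with the most votes across trees (classification)."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_average(tree_outputs):
    """Return the mean of the trees' numeric outputs (regression)."""
    return sum(tree_outputs) / len(tree_outputs)

votes = [1, 0, 1, 1, 0]                    # hypothetical predictions from five trees
print(forest_predict(votes))               # 1
print(forest_average([0.25, 0.5, 0.75]))   # 0.5
```

Because each tree sees a different sample of rows and columns, their individual errors tend to differ, and combining many trees usually gives a more stable prediction than any single tree.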

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train , y_train)



In [None]:
train_rf = rf_clf.predict(X_train)
test_rf  = rf_clf.predict(X_test)

In [None]:
train_accuracy_rf = accuracy_score(train_rf, y_train)
test_accuracy_rf = accuracy_score(test_rf, y_test)
print("The accuracy on train data is ", train_accuracy_rf)
print("The accuracy on test data is ", test_accuracy_rf)


In [None]:
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_train, train_rf)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Training Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_test,test_rf)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Test Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
print(classification_report(y_test,test_rf))


In [None]:
fpr, tpr, _ = roc_curve(y_test,test_rf)
auc = roc_auc_score(y_test,test_rf)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid = {'criterion':['gini','entropy','log_loss'],'n_estimators': [5,10], 'max_depth': [3,15]}
# Fit the Algorithm
grid_rf_clf = GridSearchCV(RandomForestClassifier(), param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
grid_rf_clf.fit(X_train , y_train)


In [None]:
# Check the best parameters (the grid search above is scored on accuracy)
print("The best parameters are:" ,grid_rf_clf.best_params_)
print("\nUsing ",grid_rf_clf.best_params_, " the cross-validated accuracy is: ", grid_rf_clf.best_score_)


In [None]:
# Refit the model with the best parameters
rf_grid_clf = RandomForestClassifier(criterion='gini', max_depth=15, n_estimators=10)
rf_grid_clf.fit(X_train, y_train)


In [None]:
# predict model
train_rf_grid = rf_grid_clf.predict(X_train)
test_rf_grid = rf_grid_clf.predict(X_test)



In [None]:
# Check the accuracy of our test model
train_rf_clf= accuracy_score(train_rf_grid, y_train)
test_rf_clf = accuracy_score(test_rf_grid, y_test)
print('The accuracy of train set is ',train_rf_clf)
print('The accuracy of test set is ',test_rf_clf)


In [None]:
# Training confusion matrix
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_train,train_rf_grid)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Training Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Testing Confusion Matrix
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_test,test_rf_grid)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Test Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Check the matrix score
print(classification_report(y_test,test_rf_grid))


In [None]:
# Check the auc_roc curve
fpr, tpr, _ = roc_curve(y_test,test_rf_grid)
auc = roc_auc_score(y_test,test_rf_grid )
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which performs an exhaustive search over a specified parameter grid to find the hyperparameters that give the best model performance. Here the grid covered the split criterion, n_estimators (5 or 10), and max_depth (3 or 15), scored by accuracy with 3-fold cross-validation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

I used the Random Forest Classifier algorithm to create the model. The results were modest, similar to Logistic Regression.

The precision, recall, F1, and ROC AUC scores on the test data are 0.80, 0.81, 0.80, and 0.80, respectively.

### ML Model - 4 XGBoost

In [None]:
from xgboost import XGBClassifier
import xgboost as xgb


In [None]:
dtrain=xgb.DMatrix(X_train,label=y_train)
dtest=xgb.DMatrix(X_test)

In [None]:
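# Parameters for XGBoost; eta is an alias of learning_rate, so only learning_rate is set,
# and verbosity replaces the deprecated silent flag
parameters={'max_depth':7, 'verbosity':0, 'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}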


In [None]:
#training our model
num_round=50
from datetime import datetime
start = datetime.now()
xg=xgb.train(parameters,dtrain,num_round)
stop = datetime.now()

In [None]:
#Execution time of the model
execution_time_xgb = stop-start
execution_time_xgb


In [None]:
#now predicting our model on train set
train_class_preds_probs=xg.predict(dtrain)
#now predicting our model on test set
test_class_preds_probs =xg.predict(dtest)


In [None]:
len(train_class_preds_probs)

In [None]:
# Convert predicted probabilities to 0/1 class labels with a 0.5 threshold
train_class_preds = (train_class_preds_probs >= 0.5).astype(int)
test_class_preds = (test_class_preds_probs >= 0.5).astype(int)


In [None]:

test_class_preds_probs[:20]

In [None]:
# Check accuracy on the train and test sets
train_accuracy_xgb = accuracy_score(train_class_preds,y_train)
test_accuracy_xgb = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy_xgb)
print("The accuracy on test data is ", test_accuracy_xgb)

In [None]:
# Check the matrix score
print(classification_report(y_test,test_class_preds ))

In [None]:
# Check the auc and roc curve
fpr, tpr, _ = metrics.roc_curve(y_test,test_class_preds)
auc = metrics.roc_auc_score(y_test,test_class_preds)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 4 Implementation with hyperparameter optimization techniques (GridSearch CV)
param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
 param_grid = param_test1, scoring='accuracy',n_jobs=-1,cv=3, verbose = 2)
# Fit the Algorithm
gsearch1.fit(X_train, y_train)


In [None]:
# Best cross-validated accuracy from the XGBoost grid search
gsearch1.best_score_

In [None]:
# run with best estimator
optimal_xgb = gsearch1.best_estimator_


In [None]:
# Predict on the model
train_class_pred = optimal_xgb.predict(X_train)
test_class_pred = optimal_xgb.predict(X_test)

In [None]:
# Check accuracy score
train_accuracy_xgb_tuned = accuracy_score(train_class_pred,y_train)
test_accuracy_xgb_tuned = accuracy_score(test_class_pred,y_test)

print("The accuracy on train data is ", train_accuracy_xgb_tuned)
print("The accuracy on test data is ", test_accuracy_xgb_tuned)

In [None]:
# Training Confusion matrix
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_train,train_class_pred)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of Training Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Testing confusion matrix
labels = ['NO-DEFAULT', 'DEFAULT']
cm = confusion_matrix(y_test,test_class_pred)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of test Data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
print(classification_report(y_test,test_class_pred))

In [None]:
fpr, tpr,_ = metrics.roc_curve(y_test,test_class_pred)
auc = metrics.roc_auc_score(y_test,test_class_pred)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

**Explain the ML Model used and its performance using Evaluation metric Score Chart.**

I used the XGBoost Classifier algorithm to create the model. It achieved the highest recall and F1 score among the models in this notebook.

The precision, recall, F1, and ROC AUC scores on the test data are 0.80, 0.84, 0.82, and 0.81, respectively.

**Which hyperparameter optimization technique have you used and why?**

I used GridSearchCV, which performs an exhaustive search over a specified parameter grid to find the hyperparameters that give the best model performance. Here the grid covered max_depth (3, 5, 7, 9) and min_child_weight (1, 3, 5), scored by accuracy with 3-fold cross-validation.


### ML Model - 5 K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier


In [None]:
# Parameter grid for the number of neighbors
param_grid = {'n_neighbors' : [3,4]}
knn = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs = -1, verbose = 3, cv = 4)
# Train the model
knn.fit(X_train, y_train)

In [None]:

# best parameter
knn.best_params_

In [None]:
# Predict with the model
y_pred_knn_train = knn.predict(X_train)
y_pred_knn_test = knn.predict(X_test)


In [None]:
# Check accuracy of model
train_accuracy_knn = accuracy_score(y_pred_knn_train,y_train)
test_accuracy_knn = accuracy_score(y_pred_knn_test,y_test)

print("The accuracy on train data is ",train_accuracy_knn)
print("The accuracy on test data is ",test_accuracy_knn)

In [None]:
# Check matrix score
print(classification_report(y_test,y_pred_knn_test))

In [None]:
# Check roc_auc curve
fpr, tpr,_ =metrics.roc_curve(y_test,y_pred_knn_test)
auc = metrics.roc_auc_score(y_test,y_pred_knn_test )
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()


I used the K-Nearest Neighbors Classifier algorithm to create the model. The results were modest, similar to Logistic Regression.

The precision, recall, F1, and ROC AUC scores on the test data are 0.88, 0.72, 0.79, and 0.89, respectively.

**Which hyperparameter optimization technique have you used and why?**

I used GridSearchCV, which performs an exhaustive search over a specified parameter grid to find the hyperparameters that give the best model performance. Here the grid covered n_neighbors (3 or 4) with 4-fold cross-validation.

# 1. Which Evaluation metrics did you consider for a positive business impact and why?

The evaluation metrics for classification problems include:

*   Accuracy: the proportion of correctly predicted cases. Accuracy balances specificity and sensitivity, and recall and precision, reasonably well as long as the classes are roughly balanced.
*   Confusion Matrix: a table of actual versus predicted class combinations, used to describe the performance of a classification model on test data for which the true values are known. It works for two or more classes and is the basis for computing precision, recall, accuracy, and the AUC-ROC curve.
*   Recall / Sensitivity: also called the True Positive Rate (TPR). It is the percentage of actual "YES" cases that the model correctly identified. Use it when each instance of what you are looking for is too valuable to miss.
*   Precision: the percentage of cases the model flagged as "YES" that are actually "YES".
*   F1-Score: the harmonic mean of recall and precision, taking values between 0 and 1.
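The metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic; the function name is hypothetical and the counts are made up for illustration, not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of everything flagged as default, how much really was
    recall = tp / (tp + fn)      # of all real defaults, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only
m = classification_metrics(tn=80, fp=20, fn=25, tp=75)
print({name: round(value, 3) for name, value in m.items()})
```

For a default-prediction problem, a false negative (a defaulter predicted as safe) is usually the costlier error, which is one reason to watch recall alongside accuracy.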








# 2. Which ML model did you choose from the above created models as your final prediction model and why?

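The tuned XGBoost model seems to be the best choice for this dataset: it has the highest recall (0.84) and F1 score (0.82) of the models compared, although Logistic Regression and KNN report higher precision (0.88) and KNN a higher AUC (0.89). Recall across models is shown in the chart below: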

In [None]:
# Test recall (%) for each model, before and after GridSearchCV tuning
data = {'Logistic':78, 'LogisticGSCV': 76,
                'Decision_Tree':68, 'Decision_TreeGSCV':68,
               'Random_Forest': 81, 'Random_ForestGSCV': 81,
                'XGBoost': 82 , 'XGBoostGCV':84 , 'KnnGCV': 72}
models = list(data.keys())
values = list(data.values())
plt.figure(figsize=(15,8))
plt.title('Comparing Recall of ML Models',fontsize=20)
colors=['red','blue','grey','green','black']
plt.bar(models, values, color =colors,alpha=0.5,width = 0.4)
plt.ylabel('Recall (%)')
plt.xticks(rotation = 45)
plt.show()


The chart compares test recall across the models: the tuned XGBoost model has the highest value (84%), followed by the untuned XGBoost model (82%) and Random Forest (81%).

# 3. Explain the model which you have used and the feature importance using any model explainability tool?

A SHAP summary plot displays the overall importance of each feature in a machine learning model. It can be used to identify the features that most influence the model's predictions and to learn how they affect those predictions.

We can observe that the two most important features for forecasting default are Pay_sept and Marriage.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this assignment, our task was to classify and predict whether a credit card customer is likely to experience payment delays. For credit card companies, this is of paramount importance as it empowers them to identify high-risk borrowers and implement necessary measures to mitigate potential losses.

Within the dataset, approximately 22% are defaulters, while 78% are non-defaulters. Men exhibit a slightly higher frequency of default compared to women. Both younger and older customers have higher default rates. The default rate correlates with increased credit usage, and the likelihood of default rises with the number of late payments. Customers who haven't made any payments in previous months also have a higher default rate.

To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied during data preparation. The XGBoost model performed best at forecasting customer default, with the highest recall (0.84) and F1 score (0.82), an accuracy of 0.95, and an area under the curve (AUC) of 0.81.

A feature significance analysis using SHAP values identified PAY_1, Marriage, Education, and other factors as the most crucial for forecasting customer default.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***