<a href="https://colab.research.google.com/github/kratikajawariya28/classification_Credit_card_default_prediction-/blob/main/KJ_of_classification_Credit_card_default_prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit card default prediction


##### **Project Type**    - EDA/Classification/supervised
##### **Contribution**    - Team
##### **Team Member 1 -**  Dharmendra Yadav
##### **Team Member 2 -**  Pranita Tiwari
##### **Team Member 3 -**  Kratika Jawariya


# **Project Summary -**


This project revolves around the crucial task of predicting credit card payment defaults among customers in Taiwan. Rather than focusing solely on binary classification (credibility or not), our primary objective is to estimate the probability of default, which offers deeper insights into risk assessment. The dataset used encompasses 23 explanatory variables, including credit amount, gender, education, marital status, age, and extensive payment history, with the ultimate goal of developing a predictive model that can effectively identify customers at risk of defaulting on their credit card payments.

**Data Overview:**

The dataset at the core of this project consists of a binary response variable, "Default Payment," where 1 indicates a default, and 0 signifies no default. It's paired with a comprehensive set of explanatory variables. These variables encompass a wide range of aspects, such as the credit amount extended, gender of the cardholder, their educational background, marital status, age, and the history of past payments over several months. Additionally, it includes data on bill statement amounts and previous payment amounts, providing a holistic view of each customer's financial behavior.


**Business Objective:**

The primary aim of this project is to develop a robust predictive model capable of identifying customers who are likely to default on their credit card payments in the upcoming months. Credit card default, in this context, refers to the scenario where individuals consistently fail to pay the Minimum Amount Due for consecutive months. By predicting potential defaults proactively, our objective is to empower credit card companies with the tools to make informed decisions. This, in turn, can significantly reduce the incidence of defaults and facilitate targeted engagement with low-risk customer segments.

**Key Insights and Impact:**

**The impact of this project extends to several crucial areas:**

**Accurate Prediction:**

The developed predictive models offer the ability to identify potential defaulters at an early stage of delinquency.

**Risk Reduction:**

Enhanced risk assessment and management strategies enable credit card companies to minimize the impact of defaults on their financial health.

**Targeted Marketing:**

With the ability to predict potential defaults, credit card companies can tailor their credit offerings and engage effectively with low-risk customers.

**Operational Efficiency:**

Efficient resource allocation and streamlined credit approval processes lead to a more optimized and cost-effective operation.

**In conclusion,** this project addresses a significant concern within the credit card industry. It provides insights and tools that can help manage credit card default risk effectively, resulting in improved financial outcomes and operational efficiency. By predicting and proactively addressing defaults, credit card companies can navigate the challenges of risk management more effectively and offer better services to their customers.






# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



This project is dedicated to forecasting customer payment defaults in Taiwan. From a risk management viewpoint, the precision of predicting the probability of default holds greater significance than simply categorizing clients as either credible or not. We can employ the K-S chart to assess which customers are likely to experience credit card payment defaults.
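
As a rough illustration of the K-S idea mentioned above, the sketch below computes the K-S statistic: the largest gap between the cumulative score distributions of defaulters and non-defaulters. The function name `ks_statistic` and the sample labels and scores are hypothetical and are not taken from this notebook.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of the two classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    thresholds = np.unique(y_score)           # sorted distinct scores
    pos = np.sort(y_score[y_true == 1])       # scores of defaulters
    neg = np.sort(y_score[y_true == 0])       # scores of non-defaulters
    # Empirical CDF of each class evaluated at every observed score
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

# Hypothetical labels (1 = default) and predicted default probabilities
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.8, 0.9]
ks = ks_statistic(y_true, y_score)
print(f"K-S statistic: {ks:.2f}")
```

A value close to 1 means the predicted probabilities separate the two classes well, while a value close to 0 means the two score distributions are nearly identical.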

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:

!pip install seaborn

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')



In [None]:
df = pd.read_excel('/content/default of credit card clients.xls')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

The dataset has 30,000 rows and 25 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

There are no duplicate rows in the data.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

There is no missing value in the data.
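
The shape, duplicate, and missing-value checks above can be gathered into one small helper so they are easy to rerun after each cleaning step. This is only a sketch: `quality_report` and the toy frame are hypothetical names and values used for illustration.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks into one dictionary."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isna().sum().sum()),
    }

# Hypothetical toy frame with one duplicated row and one missing value
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 20000, 50000, None],
    "AGE": [24, 24, 37, 41],
})
report = quality_report(toy)
print(report)
```

On the real data, `quality_report(df)` would be expected to report zero duplicates and zero missing cells, in line with the observations above.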

In [None]:
# Visualizing the missing values
missing_values_per = pd.DataFrame((df.isnull().sum()/len(df))*100).reset_index()
plt.figure(figsize=(15,5))
plt.stem(missing_values_per['index'],missing_values_per[0])
plt.xticks(rotation=45,fontsize=10)
plt.title('Percentage of Missing Values')
plt.ylabel('%')
plt.show()


### What did you know about your dataset?

There are 4 object type variables which need to be converted to numerical data type for applying a machine learning algorithm. Additionally, it's worth noting that all columns have no missing values, and the dataset comprises 30,000 rows and 25 columns.
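
One possible way to convert object-type columns to numbers is `pd.to_numeric`, sketched below on a hypothetical frame. The column values here are made up, and which columns are actually of object type depends on how the file was read.

```python
import pandas as pd

# Hypothetical frame: numeric codes that were read in as strings (object dtype)
raw = pd.DataFrame({
    "SEX": pd.Series(["1", "2", "2"], dtype="object"),
    "AGE": pd.Series(["24", "26", "34"], dtype="object"),
    "LIMIT_BAL": [20000, 120000, 90000],
})

# Find the object-type columns and convert each one to a numeric dtype
object_cols = raw.select_dtypes(include="object").columns
converted = raw.copy()
for col in object_cols:
    # errors="raise" (the default) stops on any value that is not a number
    converted[col] = pd.to_numeric(converted[col], errors="raise")

print(converted.dtypes)
```

Keeping `errors="raise"` makes unexpected text values visible immediately instead of silently turning them into missing values.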

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.describe().T

### Variables Description

**Breakdown of Our Features:**

We possess data for 30,000 customers, and the following describes all the available features.


**ID:** ID of each client


**LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)


**SEX:** Gender (1 = male, 2 = female)


**EDUCATION:** (1 = graduate school, 2 = university, 3 = high school, 0,4,5,6 = others)


**MARRIAGE:** Marital status (0 = others, 1 = married, 2 = single, 3 = others)


**AGE:** Age in years


**Scale for PAY_0 to PAY_6 :**


(-2 = No consumption, -1 = paid in full, 0 = use of revolving credit (paid minimum only), 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)


**PAY_0:** Repayment status in September, 2005 (scale same as above)


**PAY_2:** Repayment status in August, 2005 (scale same as above)


**PAY_3:** Repayment status in July, 2005 (scale same as above)


**PAY_4:** Repayment status in June, 2005 (scale same as above)


**PAY_5:** Repayment status in May, 2005 (scale same as above)


**PAY_6:** Repayment status in April, 2005 (scale same as above)


**BILL_AMT1:**  Amount of bill statement in September, 2005 (NT dollar)


**BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)


**BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)


**BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)


**BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)


**BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)


**PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)


**PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)


**PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)


**PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)


**PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)


**PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)


**default payment next month:** Default payment (1=yes, 0=no)
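
The repayment-status scale described above can be turned into readable labels with a small lookup function. The helper name `describe_pay_status` and the sample values are hypothetical; the wording follows the scale listed in this section.

```python
import pandas as pd

def describe_pay_status(code: int) -> str:
    """Translate a repayment-status code into the wording of the scale above."""
    special = {
        -2: "no consumption",
        -1: "paid in full",
        0: "revolving credit (minimum paid)",
    }
    if code in special:
        return special[code]
    if 1 <= code <= 8:
        return f"payment delayed {code} month(s)"
    if code == 9:
        return "payment delayed 9+ months"
    raise ValueError(f"unexpected repayment code: {code}")

# Hypothetical sample of PAY_0 values
pay_0 = pd.Series([-2, -1, 0, 2, 9], name="PAY_0")
decoded = pay_0.map(describe_pay_status)
print(decoded.tolist())
```

Raising on unknown codes is a deliberate choice in this sketch, so that values outside the documented scale are noticed rather than silently relabeled.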

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### **Analysis of Dependent Variable:**

In [None]:
#renaming
df.rename(columns={'default payment next month' : 'default_payment_next_month'}, inplace=True)

In [None]:
# counts the dependent variable data set
df['default_payment_next_month'].value_counts()

### What all manipulations have you done and insights you found?

Renamed the column 'default payment next month' to 'default_payment_next_month'.

Counted the values of the dependent variable.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Get the proportion of customers who had default payment in the next month
# About 22% customers had default payment next month

df['default_payment_next_month'].value_counts(normalize=True)

In [None]:
# Chart - 1 visualization code

# Plot a count plot to visualize the distribution of the target variable
plt.figure(figsize=(10,5))
sns.countplot(x = 'default_payment_next_month', data = df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Class 0 represents "Not Default."
Class 1 represents "Default."

There are fewer defaulters than non-defaulters.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 There is an imbalance between the two classes in the dataset. In the next step, we must rebalance the classes (for example, by resampling) so that the model does not simply favor the majority class.
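
One common way to handle the class imbalance noted above is random oversampling of the minority class. The sketch below uses a hypothetical toy frame and is not necessarily the method used later in this project; in practice, resampling is usually applied to the training split only.

```python
import pandas as pd

target = "default_payment_next_month"

# Hypothetical toy data: 8 non-defaulters (0) and 2 defaulters (1)
toy = pd.DataFrame({
    "LIMIT_BAL": range(10),
    target: [0] * 8 + [1] * 2,
})

# Size of the largest class
majority_size = int(toy[target].value_counts().max())

# Resample every class with replacement up to the majority class size
balanced = pd.concat(
    [
        group.sample(n=majority_size, replace=True, random_state=42)
        for _, group in toy.groupby(target)
    ],
    ignore_index=True,
)
print(balanced[target].value_counts())
```

After resampling, both classes contain the same number of rows, at the cost of repeating minority-class observations.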

## **Analysis of Independent Variable:**


---




## **Categorical  Features**

***We have a few categorical features in our dataset:***
*   sex
*   education
*   marriage
*   age


Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership.
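
To illustrate the point above, the following sketch maps numeric codes to labels and stores them as a pandas categorical, where only frequency counts are meaningful. The sample values are hypothetical; the coding (1 = male, 2 = female) follows this dataset's description.

```python
import pandas as pd

# Hypothetical sample of coded values (1 = male, 2 = female)
sex_codes = pd.Series([1, 2, 2, 1, 2], name="SEX")

# Replace the codes with labels and store them as a categorical variable
sex_labels = sex_codes.map({1: "male", 2: "female"}).astype("category")

# Frequency counts are the meaningful summary for a categorical variable
frequencies = sex_labels.value_counts()
print(frequencies)

# The arithmetic mean of the raw codes has no real interpretation
print(sex_codes.mean())
```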


 ***Let's inspect how they are related to our target class.***

### **SEX**



*   1 - Male
*   2 - Female


#### Chart - 2

In [None]:
# Count the values of the SEX variable
df['SEX'].value_counts()

In [None]:
# Plot a count plot to visualize the distribution of SEX
plt.figure(figsize=(10,5))
sns.countplot(x = 'SEX', data = df)

The percentage of defaulters is smaller than that of non-defaulters in the given dataset.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?


1 - Male

2 - Female

There are fewer male credit card holders than female ones.
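
A count plot shows group sizes only. To relate a categorical feature to the target, a row-normalized cross-tabulation gives the default rate within each group. The data below is a hypothetical toy sample, not the project's results.

```python
import pandas as pd

# Hypothetical toy sample: 4 male (1) and 6 female (2) customers
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "default_payment_next_month": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# normalize="index" makes every row sum to 1, so column 1 is the default rate
rates = pd.crosstab(
    toy["SEX"], toy["default_payment_next_month"], normalize="index"
)
print(rates)
```

The same call with `df["SEX"]` and `df["default_payment_next_month"]` would give the per-group default rates for the real data.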

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

### **Education**



 **1 = graduate school; 2 = university; 3 = high school; 0 = others**

#### Chart - 3

In [None]:
# Count the values of the EDUCATION variable
df['EDUCATION'].value_counts()


In the 'EDUCATION' column, we observe that both 5 and 6 are labeled as 'unknown,' and there's a 0 category that lacks an explanation in the dataset description. Given the relatively small quantities in these categories, let's group 0, 4, 5, and 6 into a single category labeled as '0,' which signifies "other."

In [None]:
# Change values 4, 5, 6 to 0 and define 0 as 'others'
# 1=graduate school, 2=university, 3=high school, 0=others

df["EDUCATION"] =df["EDUCATION"].replace({4:0,5:0,6:0})
df["EDUCATION"].value_counts()

In [None]:
## Chart - 3 visualization code
plt.figure(figsize=(10,5))
sns.countplot(x = 'EDUCATION', data = df)

1 = graduate school; 2 = university; 3 = high school; 0 = others

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?


Customers with a university education make up the largest group of credit card holders, followed by those with a graduate school education and then those with a high school education.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

### **Marriage**


**1 = married; 2 = single; 3 = others**

#### Chart - 4

In [None]:
# From dataset description: MARRIAGE: Marital status (1=married, 2=single, 3=others), but the data also contains the value 0
df["MARRIAGE"].unique()

In [None]:
# Count the values of the MARRIAGE variable
df['MARRIAGE'].value_counts()

In [None]:
# What share of customers have a MARRIAGE status of 0?
df["MARRIAGE"].value_counts(normalize=True)

'MARRIAGE' column: what does 0 mean in 'MARRIAGE'?

 Since only 0.18% (54) of the observations have the value 0, we will combine 0 and 3 into one value, 'others'.

In [None]:
# Combine 0 and 3 by replacing the value 0 with 3 ('others')

df["MARRIAGE"] = df["MARRIAGE"].replace({0:3})
df["MARRIAGE"].value_counts(normalize=True)
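
The same idea, folding very small categories into 'others', can be written as a reusable helper. This is a sketch only: `merge_rare_codes`, the 2% threshold, and the toy series are hypothetical and chosen for illustration.

```python
import pandas as pd

def merge_rare_codes(series: pd.Series, min_share: float, other_code: int) -> pd.Series:
    """Replace codes whose relative frequency is below min_share with other_code."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare and overwrite the rare ones
    return series.where(~series.isin(rare), other_code)

# Hypothetical toy column: code 0 appears in only 1% of the rows
toy = pd.Series([1] * 50 + [2] * 45 + [3] * 4 + [0] * 1, name="MARRIAGE")
merged = merge_rare_codes(toy, min_share=0.02, other_code=3)
print(merged.value_counts())
```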

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10,5))
sns.countplot(x = 'MARRIAGE', data = df)

1 = married; 2 = single; 3 = others

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

1 - married

2 - single

3 - others

Most credit card holders are single.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

### **AGE**

**Plotting the number of credit card holders at each age, irrespective of gender.**

#### Chart - 5

In [None]:
# Count the values of the AGE variable
df['AGE'].value_counts()

In [None]:
# Check the mean age for each value of default_payment_next_month
df.groupby('default_payment_next_month')['AGE'].mean()

In [None]:
df = df.astype('int')

In [None]:
# Chart - 5 visualization code
# Plot a count plot to visualize the distribution of AGE
plt.figure(figsize=(15,7))
sns.countplot(x = 'AGE', data = df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

We observe a higher proportion of credit card holders in the age group of 26-30 years. Conversely, there are comparatively few credit card holders above 60 years of age.
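
Age-group statements like the one above can be checked by binning ages with `pd.cut` and counting each bin. The bin edges, labels, and sample ages below are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical sample of customer ages
ages = pd.Series([22, 26, 27, 29, 30, 31, 35, 44, 58, 63], name="AGE")

# Right-inclusive intervals: (20, 25], (25, 30], (30, 40], ...
bins = [20, 25, 30, 40, 50, 60, 80]
labels = ["21-25", "26-30", "31-40", "41-50", "51-60", "61-80"]
age_group = pd.cut(ages, bins=bins, labels=labels)

# Number of customers in each age group, in bin order
group_counts = age_group.value_counts().sort_index()
print(group_counts)
```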

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Plot a box plot to visualize the distribution of AGE for each default status
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="default_payment_next_month", y="AGE", data=df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows that there are outliers in the AGE data.
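
The points a box plot draws as outliers follow the 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn box plots. The sketch below applies that rule directly; `iqr_outliers` and the sample ages are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the values lying beyond whisker * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample of ages with one unusually high value
ages = pd.Series([25, 27, 29, 30, 32, 34, 35, 38, 40, 75], name="AGE")
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

Whether such values should be removed is a separate decision: a high but valid age is a legitimate observation rather than a data error.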

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Female customers have a lower default rate overall.

## **Numerical features**

### **Limit Balance**

#### Chart - 7

In [None]:
# Summary statistics of the LIMIT_BAL variable
df['LIMIT_BAL'].describe()

***Distribution of LIMIT_BAL.***

In [None]:
# Chart - 7 visualization code
# Histogram with a KDE overlay to visualize the distribution of LIMIT_BAL
# (sns.distplot is deprecated; sns.histplot is its replacement)
plt.figure(figsize=(10,5))
sns.histplot(df['LIMIT_BAL'], kde=True)
plt.show()

# Bar plot of the mean LIMIT_BAL for each default status
plt.figure(figsize=(10,5))
sns.barplot(x='default_payment_next_month', y='LIMIT_BAL', data=df)
plt.show()

# Box plot of LIMIT_BAL for each default status
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="default_payment_next_month", y="LIMIT_BAL", data=df)
plt.show()



##### 1. Why did you pick the specific chart?

**Distribution Plot:**

The distribution plot is chosen to visualize the overall distribution of credit limits in the dataset

**Bar Plot:**

The bar plot is chosen to compare the average credit limits between customers who did not default ('default_payment_next_month' = 0) and those who defaulted ('default_payment_next_month' = 1).

**Box Plot :**

The box plot is selected to visualize the distribution of credit limits for each default status category.

##### 2. What is/are the insight(s) found from the chart?

The distribution plot is right-skewed, indicating that most customers have relatively low credit limits.

In the bar plot, customers who did not default have slightly higher average credit limits than those who defaulted.

The box plot highlights the presence of outliers, particularly in the non-defaulted category.
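
The visual impressions above can be backed by numbers: a positive skewness value indicates a right-skewed distribution, and a group-wise mean is what the bar plot displays. The toy values below are hypothetical and do not reproduce the project's figures.

```python
import pandas as pd

# Hypothetical toy sample of credit limits and default flags
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 30000, 50000, 50000, 80000, 100000, 200000, 500000],
    "default_payment_next_month": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Positive skewness means a long tail towards high credit limits
skewness = toy["LIMIT_BAL"].skew()

# Mean credit limit for non-defaulters (0) and defaulters (1)
mean_limit = toy.groupby("default_payment_next_month")["LIMIT_BAL"].mean()

print(f"skewness: {skewness:.2f}")
print(mean_limit)
```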


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Overall, these insights can positively impact business decisions related to credit risk management, marketing strategies, and customer segmentation.

## **Renaming columns**

#### Chart - 8

In [None]:
#renaming columns

df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

In [None]:
#check details about the data set

df.info()

### **Total Bill Amount**

In [None]:
#assign the bill amount variable to a single variable
total_bill_amnt_df = df[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']]

In [None]:
# Chart - 8 visualization code

sns.pairplot(data = total_bill_amnt_df) #plotting the pair plot for bill amount


##### 1. Why did you pick the specific chart?

It is suitable to visualize the relationships between multiple variables simultaneously.

##### 2. What is/are the insight(s) found from the chart?

If bill amounts increase in one month, they tend to increase in other months as well. This could signify consistent spending patterns.
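
The relationship seen in the pair plot can be quantified with a Pearson correlation matrix. The sketch below uses a hypothetical toy frame with three of the renamed bill columns; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical bill amounts that move together across months
bills = pd.DataFrame({
    "BILL_AMT_SEPT": [1000, 2000, 3000, 4000, 5000],
    "BILL_AMT_AUG": [1100, 1900, 3200, 3900, 5100],
    "BILL_AMT_JUL": [900, 2100, 2900, 4200, 4800],
})

# Pearson correlation between every pair of columns (values between -1 and 1)
corr = bills.corr()
print(corr.round(3))
```

Applying `total_bill_amnt_df.corr()` to the real data would give the corresponding matrix for all six bill columns. Values close to 1 would confirm the pattern, and would also suggest that the columns carry overlapping information.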

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If there is a strong positive correlation between bill amounts in different months, it can help in targeting customers who consistently use their credit cards.