<a href="https://colab.research.google.com/github/nailaimtiyaz/Credit_Card_Default_Almabetter/blob/main/Credit_Card_Default_Prediction_Almabetter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#project name

Credit card default prediction

#Project Type - EDA/Regression/Classification/Unsupervised

Contribution - Naila Imtiyaz,
project type- classification

#Project Summary -

Write the summary here within 500-600 words.



Project Summary:

1. Introduction:

Credit card default prediction is a critical task in the financial industry, aimed at identifying individuals who are at risk of defaulting on their credit card payments. This project leverages machine learning techniques to predict credit card defaults, ultimately helping financial institutions make informed decisions and manage risk effectively.

2. Objectives:

The primary objectives of this project were:

To build a predictive model that accurately identifies individuals likely to default on their credit card payments.
To evaluate the model's performance using relevant metrics and assess its generalization to unseen data.
To provide insights into the key factors influencing credit card default.
3. Data:

The project utilized a dataset containing historical credit card transaction and payment information, along with demographic and financial features for a sample of credit card users. The dataset was preprocessed, which included handling missing data, feature engineering, and encoding categorical variables.

4. Methodology:

Several machine learning algorithms, including logistic regression, decision trees, random forests, and gradient boosting, were explored and evaluated for their predictive performance. Hyperparameter tuning and cross-validation techniques were used to optimize the models. Feature importance analysis was conducted to understand the factors driving predictions.

5. Results:

The predictive models achieved promising results:

The model demonstrated high accuracy, with an AUC-ROC score of X% on the validation set.
Key predictors of credit card default included 'LIMIT_BAL,' 'PAY_STATUS,' 'BILL_AMT,' and 'PAY_AMT.'
The model's performance was validated on an independent test set, yielding consistent results.
6. Implications:

The developed model can be deployed by financial institutions to proactively identify customers at risk of default, allowing for timely intervention and risk management.
Insights gained from the analysis can inform credit risk assessment and credit limit decisions.
The project highlights the significance of accurate feature engineering and model selection in credit card default prediction.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

Predict the probability of a customer defaulting payment for the credit card the subsequent month, based on past information. The past information is provided in the dataset. This probability will help the collections team to prioritise follow up with customers who have a high propensity of defaulting.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
import seaborn           as sns
import statsmodels.api   as sm

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection   import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model      import LogisticRegression
from sklearn.metrics           import classification_report
from sklearn.tree              import DecisionTreeClassifier
from sklearn.ensemble          import RandomForestClassifier
from scipy.stats               import randint as sp_randint
from imblearn.over_sampling    import SMOTE


### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.flush_and_unmount()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Credit card clients - Data.csv')

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
df.shape[0]   # obtaining the number of rows

In [None]:
df.shape[1]   # obtaining the number of columns

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
df.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
df.isna().sum()

In [None]:
# Visualizing the missing values

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Answer Here

There are 0 missing records in the dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
df.columns

In [None]:
# Dataset Describe

In [None]:
df.describe().T

### Variables Description

Answer Here

- There are around 30000 distict credit card clients.
- The average value of credit card Limits is Rs 1,67,484.
- The Limited Balance has a high Standard deviation as the meadian value is Rs 1,40,000 and the extreme values as Rs 10,00,000.
- Here the average is about 35 and meadian is 28 with a standard deviation of 9.2. This difference is explained by some very old people in the data set as given that the maximum age is 79.
- Bill Amount and Pay Amount also shows us that there some people with extremely high bill amount which may be because for the higher Credit Limit or because of the pending dues added up.
- Bill amount for all the months, the mean is around 40,000 to 50,000 with some extreme amount in bill amount 3 of Rs 16,64,089.
- Pay amount for all the months, the mean is around Rs 4800 to Rs 5800, with some extreme values such as Rs 16,64,089.
- As the value 0 for default payment means 'not default' and value 1 means 'default', the mean of 0.221 means that there are 22.1% of credit card contracts that will default next month (will verify this in the next sections of this analysis).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
print("There are {} missing records in the dataset.".format(df.isnull().sum().sum()))

In [None]:
# Storing feature names in variable 'cols'

cols = df.columns.tolist()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
# Boxplot for Bill_Amt vs Limit_bal

plt.figure(figsize=(10,7))
sns.boxplot(data=df[['LIMIT_BAL','BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

 boxplot is a useful tool for summarizing and visualizing the distribution of numerical data, making it easier to identify key features and potential outliers in our dataset

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Here are some insights i can gain from the chart:

Central Tendency: You can see the median (middle line inside the box) for each variable. The median represents the middle value in the data, and it gives you an idea of the central tendency. For example, you can see the median 'LIMIT_BAL' (credit limit) and the median bill amounts for each month.

Spread: The boxes in the plot represent the interquartile range (IQR), which spans from the 25th percentile (lower edge of the box) to the 75th percentile (upper edge of the box). The length of the box indicates the spread of the data. A longer box indicates greater variability.

Outliers: Individual data points that are plotted as dots outside of the boxes are potential outliers. Outliers can provide insights into extreme values or anomalies in the data. They may be worth investigating further

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
# Boxplot for Pay_Amt vs Limit_bal

plt.figure(figsize=(10,7))
sns.boxplot(data=df[['LIMIT_BAL','PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Payment Amounts: You can see the distribution of payment amounts for each of the six months. The boxes represent the interquartile range (IQR), with the middle line inside the box representing the median payment amount for each month.

Variability in Payments: The length of the boxes (IQR) for each month gives you an idea of the variability in payment amounts. A longer box indicates greater variability in payments, while a shorter box suggests less variability.

Median Payments: The median (middle line inside the box) for each month provides insight into the central tendency of payment amounts. You can compare the medians across different months to see if there are any significant changes in payment behavior over time.

Outliers: Data points plotted as dots outside of the boxes are potential outliers. These outliers could indicate unusually large or small payment amounts for specific months. Investigating these outliers may help identify exceptional payment behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# Boxplot for column 'AGE'
plt.figure(figsize=(5,5))
sns.boxplot(data=df['AGE'])
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Central Tendency: The boxplot displays the median (middle line inside the box) of the age distribution. This median age represents the central tendency of the dataset, giving you an idea of the age that divides the dataset into two equal halves.

Variability in Ages: The length of the box (interquartile range, IQR) indicates the spread or variability in ages. A longer box suggests a wider range of ages within the dataset, while a shorter box implies that most ages are close to the median.

Skewness: You can assess the skewness of the age distribution by observing the direction in which the tails of the box extend. If the right tail (above the box) is longer than the left tail (below the box), it suggests positive skewness, indicating that there are relatively few older individuals with ages significantly above the median. Conversely, if the left tail is longer, it suggests negative skewness, indicating more younger individuals.

Outliers: Any data points plotted as dots outside of the box may be considered outliers. In the case of age, outliers might represent individuals with exceptionally high or low ages compared to the majority.

Age Range: The whiskers extending from the box indicate the range of ages within a reasonable limit. Data points beyond the whiskers are potential outliers. The boxplot helps you understand the range of ages in your dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
# Outliers on numberical columns

num_var = df.select_dtypes(exclude='object')
for i in num_var:

    q1 = df[i].quantile(0.25)
    q3 = df[i].quantile(0.75)

    IQR = q3 - q1
    UL = q3 + 1.5*IQR
    LL = q1 - 1.5*IQR

    print('IQR of',i,'= ',IQR)
    print('UL of',i,'= ',UL)
    print('LL of',i,'= ',LL)
    print('Number of Outliers in',i,' = ',(df.shape[0] - df[(df[i]<UL) & (df[i]>LL)].shape[0]))
    print(' ')

In [None]:
mi0 = df[df['default payment next month']==0]
mi1 = df[df['default payment next month']==1]

In [None]:
con_col=['AGE','LIMIT_BAL','BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

for i in con_col:
    plt.figure(figsize=(20,5))
    sns.distplot(mi0[i],color='g')
    sns.distplot(mi1[i],color='r')
    plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

these overlaid distribution plots are useful for understanding how various continuous variables differentiate between two groups (mi0 and mi1) and for identifying the key features that might contribute to group separation or classification. They provide a visual means to explore the data and gain insights into the characteristics of each group.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Here are the insights you can gather from these overlaid distribution plots:

Comparison of Distributions: The plots enable you to visually compare the distribution of each continuous variable between the two groups (mi0 and mi1). The green and red curves represent the distributions for each group. You can observe whether there are significant differences or similarities in the distributions.

Central Tendency: You can compare the central tendency (mean or median) of each variable for the two groups. If one group's curve is shifted to the right or left compared to the other, it suggests differences in the central tendency.

Spread and Variability: You can assess the spread and variability of each variable within each group. A wider distribution suggests greater variability, while a narrower distribution suggests less variability.

Skewness: By examining the shape of the curves, you can identify differences in skewness between the two groups. A skewed distribution may have a longer tail on one side, indicating an asymmetric distribution.








##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
#renaming for better convinience
df['IsDefaulter'] =df ['default payment next month']
df.drop('default payment next month',axis = 1)
# df.rename({'default.payment.next.month' : 'IsDefaulter'}, inplace=True)

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x = 'IsDefaulter', data = df)

##### 1. Why did you pick the specific chart?

Answer Here.

the countplot provides a visual representation of the distribution of the 'IsDefaulter' variable, helping you understand the class balance and make informed decisions about data analysis and modeling strategies.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

As we can see in dataset we have values like 5,6,0 as well for which we are not having description so we can add up them in 4, which is Others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']

In [None]:
df_cat = df[categorical_features]
df_cat['Defaulter'] = df['IsDefaulter']

In [None]:
df_cat.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)

In [None]:
for col in categorical_features:
  plt.figure(figsize=(10,5))
  fig, axes = plt.subplots(ncols=2,figsize=(13,8))
  df[col].value_counts().plot(kind="pie",ax = axes[0],subplots=True)
  sns.countplot(x = col, hue = 'Defaulter', data = df_cat)

##### 1. Why did you pick the specific chart?

Answer Here.

these paired plots provide a comprehensive view of the categorical features, including their distribution and how they relate to the 'Defaulter' class. They are useful for understanding the characteristics of the data, identifying potential relationships between categorical variables and the target variable ('Defaulter'), and gaining insights into the factors that may influence default behavior.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Below are few observations for categorical features:

There are more females credit card holder,so no. of defaulter have high proportion of females.
No. of defaulters have a higher proportion of educated people (graduate school and university)
No. of defaulters have a higher proportion of Singles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

Limit Balance

In [None]:
sns.barplot(x='IsDefaulter', y='LIMIT_BAL', data=df)

##### 1. Why did you pick the specific chart?

Answer Here.

the bar plot helps you assess whether there are differences in credit limits between defaulters and non-defaulters. It is a valuable tool for initial exploratory data analysis and understanding the potential impact of 'LIMIT_BAL' as a predictor of default behavior in your dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

 Here are the insights you can gather from this chart:

Comparison of Means: The bar plot displays the means (average values) of 'LIMIT_BAL' for each category of 'IsDefaulter' (e.g., 0 for non-defaulters and 1 for defaulters). The height of each bar represents the mean 'LIMIT_BAL' for the corresponding category.

Differences in Credit Limits: By comparing the heights of the bars, you can assess whether there are significant differences in the average credit limits between defaulters and non-defaulters. If one bar is noticeably higher than the other, it suggests that the mean credit limit differs between the two groups.

Credit Limit as a Predictor: This plot provides insights into the potential predictive power of 'LIMIT_BAL' with respect to default behavior. For example, if the bar for defaulters is significantly lower (or higher) than the bar for non-defaulters, it suggests that 'LIMIT_BAL' may be a relevant factor in predicting defaults.

Decision Threshold: Depending on the problem context, this chart may help you determine an appropriate decision threshold for classifying individuals as defaulters or non-defaulters based on their credit limits. For instance, if the default rate is lower for individuals with higher credit limits, you might set a higher credit limit as a threshold for identifying non-defaulters.










##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="IsDefaulter", y="LIMIT_BAL", data=df)

##### 1. Why did you pick the specific chart?

Answer Here.

the boxplot offers a comprehensive view of how credit limits are distributed between defaulters and non-defaulters. It allows you to assess central tendency, variability, skewness, and the presence of outliers, helping you understand the potential role of 'LIMIT_BAL' in predicting default behavior in your dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Here are the insights you can gather from this chart:

Comparison of Credit Limits: The boxplot displays the distribution of 'LIMIT_BAL' for each category of 'IsDefaulter' (e.g., 0 for non-defaulters and 1 for defaulters). Specifically, it shows the distribution of credit limits for both groups.

Central Tendency: You can compare the central tendency (median) of credit limits between defaulters and non-defaulters. The median is represented by the horizontal line inside each box. If one box is significantly higher or lower than the other, it suggests a difference in median credit limits.

Variability in Credit Limits: The length of the boxes (interquartile range, IQR) indicates the spread or variability of credit limits within each category. A longer box suggests greater variability, while a shorter box suggests less variability.

Skewness: You can assess the skewness of the credit limit distribution within each category by examining the direction of the tails of the boxes. Positive skewness (rightward tail) may indicate that a few individuals within a category have exceptionally high credit limits, while negative skewness (leftward tail) suggests the opposite.







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

Plotting graph of number of ages of all people with credit card irrespective of gender.

In [None]:
#renaming columns

df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

In [None]:
df['AGE']=df['AGE'].astype('int')

In [None]:
fig, axes = plt.subplots(ncols=2,figsize=(20,10))
Day_df=df['AGE'].value_counts().reset_index()
df['AGE'].value_counts().plot(kind="pie",ax = axes[0],subplots=True)
sns.barplot(x='index',y='AGE',data=Day_df,ax = axes[1],orient='v')

In [None]:
df.groupby('IsDefaulter')['AGE'].mean()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="IsDefaulter", y="AGE", data=df)


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

Bill Amount

In [None]:
bill_amnt_df = df[['BILL_AMT_SEPT',	'BILL_AMT_AUG',	'BILL_AMT_JUL',	'BILL_AMT_JUN',	'BILL_AMT_MAY',	'BILL_AMT_APR']]

In [None]:
sns.pairplot(data = bill_amnt_df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

History payment status

In [None]:
pay_col = ['PAY_SEPT',	'PAY_AUG',	'PAY_JUL',	'PAY_JUN',	'PAY_MAY',	'PAY_APR']
for col in pay_col:
  plt.figure(figsize=(10,5))
  sns.countplot(x = col, hue = 'IsDefaulter', data = df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

Paid Amount

In [None]:
pay_amnt_df = df[['PAY_AMT_SEPT',	'PAY_AMT_AUG',	'PAY_AMT_JUL',	'PAY_AMT_JUN',	'PAY_AMT_MAY',	'PAY_AMT_APR', 'IsDefaulter']]

In [None]:
sns.pairplot(data = pay_amnt_df, hue='IsDefaulter')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
plt.figure(figsize=(25,20))
sns.heatmap(df.corr(),annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

This heatmap allows you to explore the pairwise correlations between different numerical variables in your dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Here are the insights i gather from this chart:

Strength and Direction of Correlations: The heatmap displays the correlation coefficients between pairs of numerical variables. Positive correlations are indicated by warmer colors (e.g., yellow or red), while negative correlations are represented by cooler colors (e.g., blue or green). The intensity of the color reflects the strength of the correlation.

Highly Correlated Variables: You can identify pairs of variables that are strongly positively or negatively correlated. Strong positive correlations suggest that as one variable increases, the other tends to increase as well, and vice versa for negative correlations. This information can help you identify potential multicollinearity issues when building predictive models.

Correlations with the Target Variable: If 'IsDefaulter' (or a similar target variable) is included in your DataFrame, you can assess the correlations between the predictor variables and the target. Positive correlations indicate that the predictor variable tends to increase as the likelihood of default increases, while negative correlations suggest the opposite.

Independent Variables: Variables with low or near-zero correlations with other variables are likely to be more independent and may have unique information to contribute to predictive modeling.

Feature Selection: You can use the heatmap to identify potential candidate variables for feature selection or dimensionality reduction techniques. Variables with weak correlations with the target variable or other predictors may be considered for removal.



#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

the pairplot is a valuable tool for initial data exploration, providing a visual summary of relationships and distributions among numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

Feature Selection: Pairplots can help identify potential candidates for feature selection. Variables that show little variation or do not appear to be strongly related to the target variable may be considered for removal.

Multicollinearity: High correlations between pairs of variables, as indicated by strong linear relationships in the scatterplots, may suggest multicollinearity. Multicollinearity can impact the interpretability of regression models, and addressing it might involve selecting one variable from a correlated pair or using dimensionality reduction techniques.

Data Exploration: The pairplot serves as an exploratory tool to get a quick overview of the relationships within the dataset. It can guide further, more targeted analyses based on the patterns and insights observed.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for col in df.select_dtypes(include=object).columns:
    df[col] = le.fit_transform(df[col])

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

Label Encoding is typically used when the categorical variable is ordinal in nature, meaning there is an inherent order or ranking among the categories. Label Encoding preserves this order by assigning integers in a sequential manner. However, it's important to note that some machine learning algorithms may misinterpret the encoded integers as ordinal values, which may not be appropriate for nominal variables (categories without inherent order).

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

Binning the 'AGE' column

In [None]:
def age(x):
    if x in range(21,41):
        return 1
    elif x in range(41,61):
        return 2
    elif x in range(61,80):
        return 3

df['AGE']=df['AGE'].apply(age)

Replacing 0,5,6 to 4 in education columns

In [None]:
def rep(x):
    if x in [0,4,5,6]:
        return 4
    else:
        return x
df['EDUCATION']=df.EDUCATION.apply(rep)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
X =df.drop('default payment next month',axis=1)
y = df['default payment next month']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

##### What data splitting ratio have you used and why?

Answer Here.

The data splitting ratio i've specified is test_size=0.3, which means that i am allocating 30% of data to the testing set and the remaining 70% to the training set.
 This is a commonly used ratio, and it serves specific purposes:

Training Data: The training set (70% of the data) is used to train your machine learning model. During the training process, the model learns patterns and relationships within the data, making it capable of making predictions or classifications.

Testing Data: The testing set (30% of the data) is used to evaluate the performance of your trained model. After training, the model is tested on this independent dataset to assess how well it generalizes to unseen data. The testing set helps you estimate the model's performance in real-world scenarios.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

In [None]:
print((df['default payment next month'].value_counts()/df['default payment next month'].shape)*100)
sns.histplot(df['default payment next month'])
plt.show()


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

Here we can see that the data is imbalanced.

In [None]:
print('Before OverSampling, the shape of train_X: {}'.format(X_train.shape))
print('Before OverSampling, the shape of train_y: {} \n'.format(y_train.shape))

In [None]:
smote = SMOTE(sampling_strategy='minority')
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

In [None]:
print('After OverSampling, the shape of train_X: {}'.format(X_train_sm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_sm.shape))


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

Base Model using LogisticRegression:



In [None]:
logreg = LogisticRegression(solver='liblinear', fit_intercept=True)

logreg.fit(X_train_sm, y_train_sm)

y_prob_train = logreg.predict_proba(X_train_sm)[:,1]
y_pred_train = logreg.predict (X_train_sm)

print('Classification report - Train: ', '\n', classification_report(y_train_sm, y_pred_train))

y_prob = logreg.predict_proba(X_test)[:,1]
y_pred = logreg.predict (X_test)

print('Classification report - Test: ','\n', classification_report(y_test, y_pred))

Feature selection- Backward Elimination:

In [None]:
Xc=sm.add_constant(X_train_sm)
model = sm.Logit ( y_train_sm , Xc ).fit ( )

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

Decission Tree:

In [None]:
# Defining an object for DTC and fitting for whole dataset
dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=1 )
dt.fit(X_train_sm, y_train_sm)

y_pred_train = dt.predict(X_train_sm)
y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)

In [None]:
#Classification for test before hyperparameter tuning
print(classification_report(y_test,y_pred))

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
dt = DecisionTreeClassifier(random_state=1)

params = {'criterion': ['gini','entropy'],
          'splitter' : ["best", "random"],
          'max_depth' : [2,4,6,8,10,12],
          'min_samples_split': [2,3,4,5],
          'min_samples_leaf': [1,2,3,4,5]}

rand_search_dt = RandomizedSearchCV(dt, param_distributions=params, cv=3)

rand_search_dt.fit(X_train_sm,y_train_sm)

rand_search_dt.best_params_

In [None]:
# Passing best parameter for the Hyperparameter Tuning
dt = DecisionTreeClassifier(**rand_search_dt.best_params_, random_state=1)

dt.fit(X_train_sm, y_train_sm)

y_pred = dt.predict(X_test)

In [None]:
#Classification for test after hyperparameter tuning
print(classification_report(y_test,y_pred))


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

Random Forest:

In [None]:
#Create a Gaussian Classifier
rfc=RandomForestClassifier(n_estimators=100, random_state=1)

#Train the model using the training sets y_pred=clf.predict(X_test)
rfc.fit(X_train_sm,y_train_sm)

y_pred = rfc.predict(X_test)

In [None]:
#Classification for test after hyperparameter tuning
print(classification_report(y_test,y_pred))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
rfc = RandomForestClassifier(random_state=1)

params = {'n_estimators': sp_randint(5,30),
          'criterion' : ['gini','entropy'],
          'max_depth' : sp_randint(2,10),
          'min_samples_split' : sp_randint(2,20),
          'min_samples_leaf' : sp_randint(1,20),
          'max_features' : sp_randint(2,18)}

rand_search_rfc = RandomizedSearchCV(rfc, param_distributions=params, random_state=1, cv=3)

rand_search_rfc.fit(X_train_sm,y_train_sm)

rand_search_rfc.best_params_

In [None]:
# Passing best parameter for the Hyperparameter Tuning
rfc = RandomForestClassifier(**rand_search_rfc.best_params_, random_state=1)

rfc.fit(X_train_sm, y_train_sm)

y_pred = rfc.predict(X_test)

In [None]:
#Classification for test after hyperparameter tuning
print(classification_report(y_test,y_pred))


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

In conclusion, credit card default prediction is a critical task in the financial industry aimed at assessing the likelihood that a borrower will fail to make credit card payments. Accurate prediction of credit card defaults is essential for financial institutions to manage their risk effectively and make informed lending decisions. Here are some key takeaways:

Data is Crucial: Credit card default prediction relies on historical customer data, including credit scores, income, payment history, and demographic information. Collecting and maintaining high-quality data is a fundamental requirement for building effective predictive models.

Data Preprocessing: Data preprocessing involves cleaning, transforming, and encoding the data to make it suitable for machine learning. This step often includes handling missing values, feature engineering, and dealing with outliers.

Model Selection: Various machine learning models can be used for credit card default prediction, including logistic regression, decision trees, random forests, gradient boosting, and neural networks. The choice of model depends on the specific characteristics of the dataset and the business objectives.

Model Evaluation: Model performance should be rigorously evaluated using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Understanding the trade-offs between different metrics is essential, as the cost of false positives and false negatives can vary significantly.

Continuous Monitoring: Models for credit card default prediction should be continuously monitored to ensure they remain accurate and relevant. Changes in data distributions or external factors can impact model performance over time, so regular updates and retraining may be necessary.

Regulatory Compliance: Financial institutions must adhere to regulatory requirements and industry standards when developing and deploying credit card default prediction models. Compliance with laws like the Fair Credit Reporting Act (FCRA) and data protection regulations is essential.

Ethical Considerations: Addressing bias and ensuring fairness in lending decisions is crucial. Steps should be taken to mitigate any biases in data and model predictions, promoting responsible and ethical lending practices.

Customization: The specific features, preprocessing steps, and modeling techniques used in credit card default prediction should be tailored to the unique characteristics of the institution's data and business goals.

In summary, credit card default prediction is a multifaceted process that combines data science expertise, domain knowledge, and a commitment to responsible lending practices. Accurate prediction models can help financial institutions manage risk, reduce defaults, and make more informed decisions about extending credit to applicants, ultimately contributing to a healthier and more stable financial ecosystem.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***