# **Project Name**    - Credit Risk and Customer Behaviour Analysis: Paisabazaar Dataset



##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 -**  Faisal Khan
##### **Team Member 2 -** Harsh Jaisawal


# **Project Summary -**

This project focuses on performing an in-depth Exploratory Data Analysis (EDA) of the Paisabazaar Credit Score dataset to understand customer credit behavior, identify risk patterns, and uncover insights that can support better decision-making in the credit and lending ecosystem. Since credit score is one of the most critical indicators in financial services, analyzing the factors that influence it helps organizations optimize loan approvals, reduce default risk, and personalize financial offerings. The primary objective of this project was not to build a predictive machine learning model but to thoroughly analyze the dataset, identify key trends, and highlight patterns across demographic, financial, and behavioral attributes.

The dataset consists of various features such as age, occupation, annual income, credit utilization, number of credit cards, payment delays, credit history age, loan activity, EMI amounts, and credit inquiries. Through data cleaning, preprocessing, and visualization techniques, the project examines how these attributes relate to different credit score categories—Good, Standard, and Poor. Each visualization is crafted to answer a specific analytical question and uncover actionable business insights.

One of the major findings in the analysis is that the Standard credit score category dominates across almost all segments, including age groups, occupations, and customer profiles. This indicates that while most customers maintain average credit behavior, relatively few achieve a Good credit score due to factors such as delayed payments, higher credit utilization, and lower income levels. Adults form the largest customer segment and consistently show the highest counts across all credit score categories. Teens have the lowest credit participation, and their Poor credit score count is noticeably high, suggesting early challenges in managing credit. Senior citizens show more balanced credit behavior with fewer Poor scores, likely due to reduced credit usage and better financial discipline over time.

Income-related insights reveal that annual income and credit score have a positive correlation. Customers with higher income tend to maintain better repayment discipline, lower credit utilization, and fewer delays, resulting in stronger credit profiles. On the other hand, a large portion of the dataset consists of low-income customers earning between ₹10,000 and ₹40,000. This segment is more prone to delayed payments and higher financial stress, which may increase the default risk if not managed properly by lenders.

Behavioral factors such as number of delayed payments, credit card usage, and EMI burden strongly influence credit score outcomes. The analysis shows a clear inverse relationship between delayed payments and credit scores: as delayed payments increase, credit scores drop sharply. Similarly, customers with a high number of credit cards and high utilization ratios tend to fall into the Poor credit category. These variables highlight the importance of disciplined credit usage and timely repayments.

The visual analysis also uncovers patterns across occupation, where most professions reflect a concentration in the Standard score category. Good credit scores remain relatively low across all job types, which emphasizes the need for better financial literacy and repayment planning across segments.

From a business perspective, these insights can help financial institutions improve loan underwriting, identify high-risk customers, refine credit policies, personalize product offerings, and create targeted interventions such as credit-building programs or flexible repayment plans. Understanding demographic and behavioral factors also helps enhance customer segmentation and risk-adjusted lending strategies.

Overall, this project provides a comprehensive, data-driven understanding of the credit landscape within the Paisabazaar dataset. It establishes a strong foundation for building future predictive models and offers valuable insights for lenders, credit analysts, and financial decision-makers.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The financial services industry relies heavily on credit scores to assess customer credibility, manage lending risks, and make loan approval decisions. However, many customers exhibit diverse financial behaviors influenced by income levels, credit utilization, delayed payments, credit history age, occupation, and banking activity. Without proper analysis, it becomes difficult for lenders to identify high-risk customers, understand key credit score drivers, and design effective credit strategies.
This project aims to analyze the Paisabazaar Credit Score dataset using Exploratory Data Analysis (EDA) to uncover patterns, detect risk indicators, and understand the factors that influence Good, Standard, and Poor credit scores.

#### **Define Your Business Objective?**

The main business objective of this project is to provide data-driven insights that help financial institutions:

* Understand customer credit behavior across demographic, financial, and behavioral attributes.
* Identify high-risk customer segments based on delayed payments, high utilization, and low income.
* Support better lending decisions by highlighting key factors associated with Poor and Good credit scores.
* Enable targeted financial strategies, such as personalized credit products, risk control measures, and customer education programs.

By achieving these objectives, lenders can reduce default risk, enhance portfolio quality, and improve customer engagement.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
dataset  = pd.read_csv("/content/Paisabazaar dataset.csv")
dataset

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape


### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

dataset.isnull().sum()
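
The raw counts above can be easier to read as percentages. The following is a minimal sketch on a small toy frame (the values and the `toy` and `missing` names are illustrative assumptions, not taken from the Paisabazaar data) that lists only the columns with missing values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Age": [23, np.nan, 41, 35],
    "Occupation": ["Engineer", "Doctor", None, "Teacher"],
    "Annual_Income": [19114.12, 34847.84, 143162.64, 30689.89],
})

# Missing count and percentage per column
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean().mul(100).round(2),
})

# Keep only the columns that actually have missing values
missing = missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)
print(missing)
```

On the real data, `toy` would be replaced by `dataset`.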

In [None]:
# Visualizing the missing values (done before dropping rows, otherwise the heatmap is always empty)
plt.figure(figsize=(10,8))
sns.heatmap(dataset.isnull(),cbar = False)

In [None]:
# Drop the rows that contain at least one missing value

dataset.dropna(inplace=True)

### What did you know about your dataset?

After performing the initial exploration, I observed that the dataset contains 100000 rows and 28 columns. It includes attributes such as customer demographics (Customer_ID, Name, Age, Occupation), loan details (Type_of_Loan, Interest_Rate, Num_of_Loan), and credit information (Credit_Score, Credit_Mix, Credit_History_Age, Credit_Utilization_Ratio, and so on).

The dataset does not contain any null or duplicated values. It includes a mix of numerical, categorical, and datetime variables. While checking the data types, I found that some attributes required type conversion, such as Age, Num_Bank_Accounts, Interest_Rate, and Delay_from_due_date, among others.
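
One way to spot columns that need an integer conversion is to check which float columns hold only whole numbers. This is a rough sketch on toy data (the values and the `int_like` helper name are assumptions for illustration), not the exact check used in this notebook:

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Age": [23.0, 35.0, 52.0],
    "Num_Bank_Accounts": [3.0, 5.0, 2.0],
    "Annual_Income": [19114.12, 34847.84, 143162.64],
})

# Float columns whose values are all whole numbers are candidates for int
int_like = [col for col in toy.columns if (toy[col] % 1 == 0).all()]

# Convert only those columns; Annual_Income keeps its decimals
toy = toy.astype({col: int for col in int_like})
print(toy.dtypes)
```

This assumes the candidate columns have no missing values, since `astype(int)` raises an error on NaN.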

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(dataset.columns)

In [None]:
# Dataset Describe
dataset.describe(include="all")

### Variables Description



| Variable Name | Description |
|---|---|
| ID | Unique identifier for each record. |
| Customer_ID | Unique identifier assigned to each customer. |
| Month | Month in which the data was recorded. |
| Name | Name of the customer. |
| Age | Age of the customer (in years). |
| SSN | Social Security Number of the customer (identifier). |
| Occupation | Occupation or job role of the customer. |
| Annual_Income | Customer’s annual income. |
| Monthly_Inhand_Salary | Customer’s monthly in-hand salary. |
| Num_Bank_Accounts | Number of bank accounts the customer holds. |
| Num_Credit_Card | Number of credit cards owned by the customer. |
| Interest_Rate | Interest rate applied on the customer's credit card. |
| Num_of_Loan | Number of loans the customer has taken. |
| Type_of_Loan | Types of loans taken by the customer. |
| Delay_from_due_date | Number of days the customer delayed their payment. |
| Num_of_Delayed_Payment | Total number of payment delays by the customer. |
| Changed_Credit_Card | Percentage change in the customer's credit card limit. |
| Num_Credit_Inquiries | Number of credit card or loan inquiries made by the customer. |
| Credit_Mix | Classification of the customer's credit mix (Good/Bad/Standard). |
| Outstanding_Debt | Total outstanding debt of the customer. |
| Credit_Utilization_Ratio | Ratio of the credit utilized by the customer compared to the credit limit. |
| Credit_History_Age | Age/duration of the customer’s credit history. |
| Payment_of_Min_Amount | Indicator of whether the customer paid the minimum amount (Yes/No). |
| Total_EMI_per_month | Total EMI amount paid by the customer per month. |
| Amount_invested_monthly | Monthly amount invested by the customer. |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in list(dataset.columns):

  print("No. of unique values in ",i,"is",dataset[i].nunique())

## 3. ***Data Wrangling***




### Data Wrangling Code

In [None]:
# Creating a copy of dataset

data = dataset.copy()

In [None]:
# pd.reset_option("display.max_rows")
pd.set_option("display.max_columns",None)
# pd.set_option("display.max_colwidth", None)
# pd.reset_option("display.max_colwidth")

In [None]:
# Dropping insignificant columns


data.drop(["SSN", "ID","Name"],axis=1,inplace=True,errors = 'ignore')

In [None]:
# Replace the ambiguous "NM" value in Payment_of_Min_Amount with "Unknown"

data["Payment_of_Min_Amount"] = data["Payment_of_Min_Amount"].replace("NM","Unknown")

In [None]:
# Replace "No Data" with "Unknown" in Type_of_Loan

data["Type_of_Loan"] = data["Type_of_Loan"].replace("No Data", "Unknown")

In [None]:
# Changing the data type of multiple columns to int
int_columns = ["Age","Num_Bank_Accounts","Num_Credit_Card","Interest_Rate","Num_of_Loan",
               "Delay_from_due_date","Num_of_Delayed_Payment","Num_Credit_Inquiries","Credit_History_Age"]

data[int_columns] = data[int_columns].astype(int)

In [None]:
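# Create age group buckets (pd.cut excludes the left edge and includes the right edge of each bin):
# - 14 to 25 -> Teen
# - 26 to 40 -> Adult
# - 41 to 65 -> Senior Citizen
# Ages outside these bins (13 or below, above 65) become NaN


data["Age_group"] = pd.cut(data["Age"], bins = [13,25,40,65],
                             labels = ["Teen","Adult","Senior Citizen"])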
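
Because `pd.cut` builds right-closed intervals by default, the lowest bin edge itself is not assigned to any group. The sketch below uses a made-up `ages` series (an assumption for illustration) to show which boundary values land in which bucket, and how `include_lowest=True` changes the first bin:

```python
import pandas as pd

# Boundary ages chosen only to illustrate the bin edges
ages = pd.Series([13, 14, 25, 26, 40, 41, 65, 66])
labels = ["Teen", "Adult", "Senior Citizen"]

# Default behaviour: bins are (13, 25], (25, 40], (40, 65]
groups = pd.cut(ages, bins=[13, 25, 40, 65], labels=labels)

# include_lowest=True also places the lowest edge (13) in the first bin
inclusive = pd.cut(ages, bins=[13, 25, 40, 65], labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": groups, "include_lowest": inclusive}))
```

The same edge behaviour applies to the income, salary, and credit history buckets created in the following cells.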


In [None]:
#  Create Annual income group buckets (each bin excludes its lower edge and includes its upper edge):
#  - Above 5000, up to 50000    -> Economy_segment
#  - Above 50000, up to 95000   -> Mass Market Segment
#  - Above 95000, up to 140000  -> Affluent Segment
#  - Above 140000, up to 180000 -> High Net-Worth Segment


data["Anuual_income_group"] = pd.cut(data["Annual_Income"],
                                     bins = [5000,50000,95000,140000,180000],
                                     labels = ["Economy_segment","Mass Market Segment","Affluent Segment","High Net-Worth Segment"])

In [None]:
#  Create Monthly Inhand group buckets (each bin excludes its lower edge and includes its upper edge):
#  - Above 300, up to 3500     -> Economy Segment
#  - Above 3500, up to 7000    -> Mass Market Segment
#  - Above 7000, up to 10500   -> Affluent Segment
#  - Above 10500, up to 15300  -> High Net-Worth Segment

data["Monthly _Inhand_Group"] = pd.cut(data["Monthly_Inhand_Salary"],
                                       bins =[300,3500,7000,10500,15300],
                                       labels =["Economy Segment","Mass Market Segment","Affluent Segment","High Net-Worth Segment"])

In [None]:
# Create Credit History Age group buckets (each bin excludes its lower edge and includes its upper edge):
# - 1 to 100   -> New
# - 101 to 200 -> Developing
# - 201 to 300 -> Establishing
# - 301 to 450 -> Mature

data["Credit_History_Age_Group"] = pd.cut(data["Credit_History_Age"],
                                          bins = [0,100,200,300,450],
                                          labels = ["New","Developing","Establishing","Mature"])

In [None]:
# Mapping numerical values to different credit score
# Poor : 0, Standard : 1, Good : 2

data['Score_Code'] = data['Credit_Score'].map({'Poor':0,'Standard':1,'Good':2})

Data Manipulation :



In [None]:
# 1. How many times has each loan type been sanctioned?


# Step 1: Replace ' and ' with a comma
data["Type_of_Loan_clean"] = data["Type_of_Loan"].str.replace(" and ", ",")

# Step 2: Split into a list
data["Type_of_Loan_clean"] = data["Type_of_Loan_clean"].str.split(",")


# Step 3: Explode into one row per loan type and strip surrounding whitespace
exploded = data["Type_of_Loan_clean"].explode().str.strip()


# Step 4: Convert it into a dataframe
exploded_df = exploded.reset_index(drop=True).to_frame(name="Type_of_Loan_cleaned")

# Step 5: Remove empty values
final_result = exploded_df.loc[exploded_df["Type_of_Loan_cleaned"] != '']

# Step 6: Count
final_result.value_counts().head(20)
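
To make the split-and-explode steps easier to follow, here is a compact sketch of the same idea on three made-up loan strings (the `loans` series is an assumption; the real `Type_of_Loan` values may be formatted differently):

```python
import pandas as pd

# Made-up examples of multi-valued loan strings
loans = pd.Series([
    "Auto Loan, Personal Loan, and Student Loan",
    "Auto Loan",
    "Unknown",
])

loan_counts = (
    loans.str.replace(" and ", ",", regex=False)  # ", and X" becomes ",,X"
         .str.split(",")                          # one list of loan types per row
         .explode()                               # one row per loan type
         .str.strip()                             # drop surrounding spaces
         .loc[lambda s: s != ""]                  # drop the empty piece left by ",,"
         .value_counts()
)
print(loan_counts)
```

Chaining the steps avoids storing the intermediate list column on the main dataframe.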



In [None]:
# 2. How many loans are there based on different age groups?   --> [Age group, Num_of_loan]

#  Chart no. - 5


# Calculating total number of loans sanctioned based on different age group
a = data.groupby("Age_group",observed = True)["Num_of_Loan"].sum().reset_index()
a


In [None]:
# 3. How many loans have been sanctioned across each annual income group?  -- > [Annual income group, Num of_loan]
# Chart no. -6

# Calculating total number of loans based on different annual income group
b =data.groupby("Anuual_income_group",observed = True)["Num_of_Loan"].sum().reset_index()
b


In [None]:
# 4. What is the total number of loans sanctioned across different months  ?

# Calculating total number of loans based on different month
data.groupby("Month")["Num_of_Loan"].sum().reset_index()

In [None]:
# 5. What is the total number of bank accounts and credit cards across different age groups?

# Chart no. - 7

# Calculating total num of credit card based on different age group
c = data.groupby("Age_group",observed = True)[["Num_Bank_Accounts","Num_Credit_Card"]].sum().reset_index()

# Melt the data to get the output in desired format
c_melt = c.melt(
    id_vars="Age_group",
    value_vars=["Num_Bank_Accounts", "Num_Credit_Card"],
    var_name="Account_Type",
    value_name="Count"
)


c_melt

In [None]:
# 6. How many credit cards are held by each monthly in-hand salary group?
# Chart no. - 8


# Calculating total number of credit card based on different monthly inhand group
c = data.groupby("Monthly _Inhand_Group",observed = True)["Num_Credit_Card"].sum().reset_index()
c

In [None]:
# 7. What is the distribution of number of bank accounts across various annual income groups ?

# Calculating total number of bank account based on different annual income group
d = data.groupby('Anuual_income_group',observed = True)["Num_Bank_Accounts"].sum().reset_index()
d

In [None]:
# 8. Which credit score  has the highest number of delayed payments ?

# Calculating total number of delayed payment based on different credit score
e = data.groupby("Credit_Score")["Num_of_Delayed_Payment"].sum().reset_index()
e

In [None]:
# 9. How are the credit score categories distributed across the different age groups?
# chart - 18

# Counting customers for each credit score and age group
credit_score_counts = data.groupby(["Age_group", "Credit_Score"],observed=True).size().reset_index(name="Count")



# Pivot the data
pivot_df = credit_score_counts.pivot(index="Age_group", columns="Credit_Score", values="Count")
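
Raw counts are dominated by the largest age group, so row-wise shares can make the groups easier to compare. A minimal sketch, assuming a toy frame with the same two column names (the rows are invented for illustration):

```python
import pandas as pd

# Toy records standing in for the real data (illustrative only)
toy = pd.DataFrame({
    "Age_group": ["Teen", "Teen", "Adult", "Adult", "Adult", "Senior Citizen"],
    "Credit_Score": ["Poor", "Standard", "Standard", "Good", "Standard", "Good"],
})

# normalize="index" converts each row of counts into shares that sum to 1
share = pd.crosstab(toy["Age_group"], toy["Credit_Score"], normalize="index")
print(share.round(2))
```

With `normalize="index"`, each age group is read on its own scale, regardless of how many customers it contains.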

In [None]:
# 10. What is the distribution of credit score based on different payment behaviour ?



# Counting customers for each payment behaviour and credit score category
ae = data.groupby(["Payment_Behaviour", "Credit_Score"]).size().reset_index(name="Count")
print(ae)







In [None]:
# 11. Distribution of credit score based on occupation:
# chart - 17


# Creating a cross-tab to show the distribution of credit scores across occupations
df =pd.crosstab(data["Occupation"], data["Credit_Score"])
df


In [None]:
# 12. Which monthly salary group has the highest total outstanding debt ?   -- > [Monthly_inhand_group , Outstanding debt]


# Calculating total outstanding debt for each Monthly inhand salary group
debt = data.groupby("Monthly _Inhand_Group",observed = True)["Outstanding_Debt"].sum().reset_index()
debt

In [None]:
# 13. In which month were the maximum delayed payments recorded?

# Calculating total number of delayed payment for each month
data.groupby("Month")["Num_of_Delayed_Payment"].sum().reset_index()

In [None]:
# 14. How many delayed payments are there based on credit history age group?

# Calculating total number of Delayed payment for each Credit history age group
data.groupby("Credit_History_Age_Group",observed = True)["Num_of_Delayed_Payment"].sum().reset_index()


In [None]:
# 15. How many credit enquiries are there based on different credit history age groups?


# Calculating total Credit card enquiries for each credit history age group
data.groupby("Credit_History_Age_Group",observed = True)["Num_Credit_Inquiries"].sum().reset_index()

In [None]:
# 16. How many credit card records are there for each credit score?

# Counting the number of records in each credit score category
# (size() counts rows; use sum() instead to total the credit cards)
data.groupby("Credit_Score")["Num_Credit_Card"].size().reset_index()


In [None]:
# 17. How many delayed payment records are there for each credit score?

# chart -16

# Counting the number of records in each credit score category
# (size() counts rows; question 8 above gives the summed delayed payments)
analysis_18 = data.groupby("Credit_Score")["Num_of_Delayed_Payment"].size().reset_index()
analysis_18

In [None]:
# 18. Does a higher income correlate with a higher credit score ?
# chart - 15

# Selecting only Annual income and Credit Score
dff =data[["Annual_Income","Credit_Score"]]
dff



In [None]:
# 19. Is having a higher number of credit cards associated with higher credit scores, assuming the credit utilization ratio is managed well?


# Selecting data with a good Credit Utilization ratio only
# (.copy() avoids pandas' SettingWithCopyWarning when a column is assigned below)
good_util = data[data['Credit_Utilization_Ratio'] < 30].copy()


# Mapping Credit Score category to numerical values
good_util['Score_Code'] = good_util['Credit_Score'].map({'Poor':0,'Standard':1,'Good':2})

# Check the correlation
good_util[['Num_Credit_Card','Score_Code']].corr()


In [None]:
# 20. Is there a negative correlation between the number of delayed payments and credit scores?

# chart - 13

# Check the correlation between Number of delayed payment and Credit score
print(data[['Num_of_Delayed_Payment','Score_Code']].corr())
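
Since `Score_Code` is an ordinal encoding (Poor < Standard < Good), a rank-based Spearman correlation is a reasonable alternative to the default Pearson coefficient. The sketch below uses invented values and is only meant to show the call, not to reproduce the notebook's result:

```python
import pandas as pd

# Invented values for illustration; the real data will give different numbers
toy = pd.DataFrame({
    "Num_of_Delayed_Payment": [2, 5, 9, 14, 20, 25],
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
})
toy["Score_Code"] = toy["Credit_Score"].map({"Poor": 0, "Standard": 1, "Good": 2})

# Spearman works on ranks, so it only assumes the codes are correctly ordered
rho = toy[["Num_of_Delayed_Payment", "Score_Code"]].corr(method="spearman").loc[
    "Num_of_Delayed_Payment", "Score_Code"
]
print(f"Spearman correlation: {rho:.3f}")
```

A negative value indicates that more delayed payments go together with lower score codes.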


In [None]:
# 21. Does the credit utilization ratio have a greater impact on credit scores compared to the total number of loans taken ?

# Check the correlation between Credit Utilization ratio and Credit score
corr1 = data["Credit_Utilization_Ratio"].corr(data["Score_Code"])
print(f"Credit Utilization ratio / Score Code : {corr1}")


# Check the correlation between Number of loans and Credit score
corr2 = data["Num_of_Loan"].corr(data["Score_Code"])
print(f"Number of loans / Score Code : {corr2}")


In [None]:
# 22. Are there noticeable trends or seasonal patterns in credit scores based on the month of the year ?
# Chart - 10

# Count how many of each credit score category occur per month
monthly_counts = data.groupby(['Month', 'Credit_Score']).size().reset_index(name='Count')



### What all manipulations have you done and insights you found?

* What are you able to learn from the analysis?

From the analysis, I learned clear patterns about customer behaviour, risk segments, and credit performance. Most customers fall within the Standard credit score category across age groups and occupations, showing that average credit behaviour is common. Adults dominate the dataset and show more stable financial habits, while Teens and lower-income groups display higher risk due to poor scores, higher delays, and more credit inconsistencies. Income, delayed payments, number of credit cards, and EMI burden strongly influence credit score patterns. Overall, the analysis helps identify which groups are low risk (high-income, good repayment habits) and which groups require closer monitoring (low-income, high delays, Teen/Adult Poor category).



 * Did your assumptions turn out to be right ?



Yes, most assumptions aligned with the findings:

* Higher income → Better credit score (✓ Confirmed)
* More delayed payments → Lower credit score (✓ Confirmed)
* More credit cards → Higher default risk (✓ Confirmed when utilization is considered)
* Adults will have better credit discipline than Teens (✓ Confirmed)
* Low-income groups are more likely to take more loans and have higher default risk (✓ Confirmed)
* Standard credit score will be the most common category (✓ Confirmed)

So, the assumptions about financial behaviour, risk patterns, and demographic trends were mostly correct.




* How would your analysis be helpful to the stakeholders ?

This analysis is extremely valuable for stakeholders because it highlights both opportunities and risks. It helps them:

✔ Improve Credit Risk Assessment

Identify high-risk customers (low income, many credit cards, high delayed payments, Teens/Adults with Poor scores).

✔ Design Better Financial Products

Create targeted plans for Economy segment, low-income users, Teen borrowers, and high-utilization customers.

✔ Enhance Loan Approval Strategy

Decide whom to approve/decline, how much to lend, and what credit limit to assign.

✔ Strengthen Risk Controls

Monitor customers with high EMI burden, high credit utilization, or repeated delays to avoid defaults.

✔ Optimize Marketing & Customer Engagement

Focus more on:

Adults with Standard scores (large market)

Customers moving from Standard → Poor (at-risk group)

Low-income groups needing guidance and repayment plans

Overall, the insights help stakeholders reduce bad debt, improve portfolio quality, enhance customer satisfaction, and increase business growth.




## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Set up figure size
plt.figure(figsize=(8,6))

# Custom colors for each bar
colors = ["#1f77b4", "#2ca02c", "#ff7f0e"]   # Blue, Green, Orange

# Create a count plot
ax = sns.countplot(
    data=data,
    x="Credit_Score",
    palette=colors, hue = "Credit_Score"
)

# Adding titles and labels for better clarity

plt.title("Distribution of Credit Score Categories", fontsize=14, fontweight='bold')
plt.xlabel("Credit Score Category", fontsize=12)
plt.ylabel("Count of Customers", fontsize=12)

# Add value labels on top of bars
for p in ax.patches:
    height = p.get_height()
    ax.annotate(
        f"{height:.0f}",  # show the count as a whole number
        (p.get_x() + p.get_width() / 2, height),
        ha="center",
        va="bottom",
        fontsize=11
    )

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is useful for visualizing how many observations fall into each category. Since Credit Score is a categorical variable, using a count plot helps me easily compare the number of customers in each credit score group.

##### 2. What is/are the insight(s) found from the chart?

The count plot shows that the majority of customers fall under the Standard credit score category, followed by Poor, while the Good credit score group is the smallest. This indicates that although most customers maintain an average credit profile, very few reach the ‘Good’ category, suggesting that consistently managing credit and repayment behavior is challenging for many individuals.
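
The ordering seen in the count plot can also be expressed as percentage shares. A small sketch on an invented `scores` series (the six values are an assumption, chosen only to mirror the Standard > Poor > Good ordering described above):

```python
import pandas as pd

# Invented sample that mirrors the ordering described above
scores = pd.Series(
    ["Standard", "Standard", "Standard", "Poor", "Poor", "Good"],
    name="Credit_Score",
)

# normalize=True returns proportions instead of raw counts
share = scores.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the notebook's data, the equivalent call would be made on `data["Credit_Score"]`.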

##### 3. Will the gained insights help creating a positive business impact?


The insights derived from the credit score distribution can significantly support positive business impact. By recognizing that most customers fall within the Standard and Poor segments, the business can refine risk assessment processes, tailor financial products to customer needs, and implement targeted strategies to reduce default risks. These insights also enable personalized marketing and customer engagement initiatives, ultimately improving loan performance, customer satisfaction, and overall business growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code :

# Set up the figure size
plt.figure(figsize=(10,4))

# Create a boxplot
sns.boxplot(
    data=data,
    x="Monthly_Inhand_Salary",
    color="#1f77b4"
)

# Adding title and labels for better clarity
plt.title("Distribution of Monthly Inhand Salary", fontsize=14, fontweight='bold')
plt.xlabel("Monthly Inhand Salary", fontsize=12)
plt.ylabel("")  # No y-label needed for a single variable plot
plt.tight_layout()

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A box plot lets the reader visualize the distribution of a numerical variable. Since salary is numerical by nature, I chose a box plot, which also highlights the outliers.

##### 2. What is/are the insight(s) found from the chart?

About 75% of the salaries fall below 6000, which indicates that most customers belong to the low-income group. The box (IQR) lies roughly between 2000 and 6000, so the middle half of customers falls within a fairly narrow range.
Additionally, there are some extreme values in the range of 12000 to 16000, which can be considered High Net-Worth individuals.
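
The quartiles and outlier limits that a box plot draws can be computed directly. This sketch applies the standard 1.5 × IQR rule to a made-up `salary` series (the values are assumptions and will not match the real column):

```python
import pandas as pd

# Made-up salaries for illustration only
salary = pd.Series([1500, 2000, 2500, 3000, 4000, 5000, 6000, 15000])

# First and third quartiles, then the interquartile range
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits used by the usual box plot convention
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper limit={upper}")
print(outliers.tolist())
```

Applying the same steps to `data["Monthly_Inhand_Salary"]` would give exact values to replace the approximate readings taken from the chart.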

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, but a predominantly low-income customer base may restrict business profitability and increase operational risk if not managed with proper risk-adjusted lending strategies.

#### Chart - 3

In [None]:
# Chart - 3 visualization code:

# Set up figure size
plt.figure(figsize=(10,4))

# Create a violin plot
sns.violinplot(data=data,x= "Delay_from_due_date")

# Adding title and labels for better clarity
plt.title("Variation in Delay from due date")
plt.xlabel("Delay from due date")
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot shows the spread and density of a numerical variable.

##### 2. What is/are the insight(s) found from the chart?

The distribution shows that most delayed payments occur within 10 to 25 days, indicating that the typical delay is under one month. Additionally, the presence of outliers in the 50–60 day range reflects a small but significant segment of customers with severely overdue payments, which may require targeted intervention due to the increased credit risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the analysis will help stakeholders make better credit decisions, understand customer repayment behavior, reduce financial risk, and design targeted strategies. However, some insights—such as a high number of delayed payments and a large portion of customers with poor credit behavior—highlight potential risk areas that could negatively impact growth if not addressed.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Set up figure size
plt.figure(figsize=(10,4))

# Create a histogram
sns.histplot(data=data,x= "Annual_Income",kde=True,color = "Red")

# Adding title and labels for better clarity
plt.title("Distribution of Annual Income")
plt.xlabel("Annual Income")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of a numerical variable, which is why I chose it.

##### 2. What is/are the insight(s) found from the chart?

The Annual Income distribution is heavily right-skewed, showing that a large majority of customers earn between ₹10,000 and ₹40,000. Higher income brackets (above ₹100,000) form a very small portion.
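
Right skew can be checked numerically as well as visually: the skewness is positive and the mean is pulled above the median. A minimal sketch on invented incomes (the `income` values are assumptions for illustration):

```python
import pandas as pd

# Invented incomes: mostly low values plus one large value in the right tail
income = pd.Series([12000, 15000, 18000, 22000, 30000, 40000, 150000])

skewness = income.skew()   # positive for a right-skewed distribution
mean_income = income.mean()
median_income = income.median()

print(f"skew={skewness:.2f}, mean={mean_income:.0f}, median={median_income:.0f}")
```

The same three numbers computed on `data["Annual_Income"]` would quantify how strong the skew is in the actual dataset.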

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights help businesses tailor financial products, improve credit risk assessment, and plan targeted marketing strategies. However, the dominance of low-income customers may increase the risk of delayed payments and affect revenue growth from premium offerings if not managed with proper segmentation and risk controls.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Set up figure size
plt.figure(figsize=(10,8))

# Create a barplot
sns.barplot(data=a, x = "Age_group",y = "Num_of_Loan",palette = "viridis",hue = "Age_group")

# Adding title and labels for better clarity
plt.title("How many loans are there based on different age group ?")
plt.xlabel("Age Group")
plt.ylabel("Total number of loans")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the most effective way to compare a categorical variable with a numerical measure. Each category (such as different age groups) is represented by a separate bar, making it easy to visually compare the differences in values across categories. Bar charts clearly highlight trends, patterns, and variations, which helps in understanding how the numerical values change for each category.

##### 2. What is/are the insight(s) found from the chart?

The Adult category dominates in terms of the total number of loans sanctioned, followed by Teens, while Senior Citizens took the lowest number of loans.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it helps financial institutions understand the demographics of the majority of their users, and it highlights which customers they should target.

Although Adults took the largest number of loans, lending to them still carries risk, since there is a chance of bad debts and non-recovery. Proper caution is required while sanctioning loans.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Set up figure size
plt.figure(figsize=(10,6))

# Create a barplot
sns.barplot(data= b, x = "Anuual_income_group",y = "Num_of_Loan",hue = "Anuual_income_group")

# Adding title and labels for better clarity
plt.title("Total number of loans sanctioned based on annual income group",pad =15,fontsize=14)
plt.xlabel("Annual Income group",fontsize=12)
plt.ylabel("Total number of loans",fontsize=12)
plt.xticks(rotation=45)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the most effective way to compare a categorical variable with a numerical measure. Each category (such as different annual income groups) is represented by a separate bar, making it easy to visually compare the differences in values across categories. Bar charts clearly highlight trends, patterns, and variations, which helps in understanding how the numerical values change for each category.

##### 2. What is/are the insight(s) found from the chart?

The Economy segment has taken the highest number of loans, followed by the Mass Market segment. There is a sharp decline in the total number of loans for the Affluent and High Net-Worth segments.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Financial institutions like Paisabazaar should provide personalized plans for each annual income group, because customers in the lowest income group have taken the highest number of loans. This concentration raises the chances of bad debt and unrecovered loans.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Set up figure size
plt.figure(figsize=(10,6))

# Create a barplot
sns.barplot(data= c_melt, x = "Age_group", y = "Count",hue = "Account_Type")

# Adding titles and labels for better clarity
plt.title("Distribution of Account type based on different Age group")
plt.xlabel("Age group")
plt.ylabel("Total Count")

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

I chose a clustered bar chart because it is an effective way to compare multiple categorical variables against a numerical measure. Each age group is shown as a cluster of bars, one per account type, making it easy to compare values both within and across groups. Clustered bar charts clearly highlight trends, patterns, and variations across categories.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that adults hold the highest number of bank accounts and credit cards, making them the most financially active segment. Teens and senior citizens show lower engagement, with seniors interestingly having more credit cards than bank accounts. While adults contribute to positive growth, the low engagement of teens and seniors and the unusual credit card usage among senior citizens could pose long-term risks if not addressed.
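The long-format table `c_melt` plotted above is defined outside this section. As a rough sketch only, assuming it was derived from per-age-group totals of `Num_Bank_Accounts` and `Num_Credit_Card`, a wide table could be reshaped with `DataFrame.melt` as shown here. The values and the name `c_melt_sketch` are made up for illustration.

```python
import pandas as pd

# Hypothetical per-age-group totals (illustrative values, not from the dataset)
wide = pd.DataFrame({
    "Age_group": ["Teen", "Adult", "Senior Citizen"],
    "Num_Bank_Accounts": [120, 900, 40],
    "Num_Credit_Card": [110, 950, 55],
})

# Reshape to long format: one row per (age group, account type) pair,
# which is the shape sns.barplot expects when hue="Account_Type"
c_melt_sketch = wide.melt(
    id_vars="Age_group",
    value_vars=["Num_Bank_Accounts", "Num_Credit_Card"],
    var_name="Account_Type",
    value_name="Count",
)
print(c_melt_sketch)
```

Each original column becomes a value of `Account_Type`, so the two bars in every age-group cluster come from the same source row.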

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses should target the adult segment, as it dominates in both the total number of bank accounts and the total number of credit cards.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Set up figure size
plt.figure(figsize=(6,6))

# Create a pie chart
plt.pie(
    c["Num_Credit_Card"],
    labels=c["Monthly _Inhand_Group"],
    autopct="%1.1f%%",
    startangle=90
)

# Adding title for better clarity
plt.title("Distribution of Credit Cards by Monthly Inhand Income Group")

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows the proportion each category contributes to the whole.

##### 2. What is/are the insight(s) found from the chart?

The Economy segment holds the largest number of credit cards, accounting for more than 55%, followed by the Mass Market segment (27%), then the Affluent segment, and finally the High Net-Worth segment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As most of the cards have been issued to the Economy segment, this calls for personalized plans, proper tracking of loan repayment, and different financial products for these consumers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code


d = data.groupby('Anuual_income_group')["Num_Bank_Accounts"].sum().reset_index()

# Set up figure size for better visualization
plt.figure(figsize=(7,7))

# Create a Pie chart
plt.pie(
    d["Num_Bank_Accounts"],
    labels=d["Anuual_income_group"],
    autopct="%1.1f%%",
    startangle=90,
    wedgeprops={'width': 0.4}  # <-- This makes it a DONUT
)

# Adding title for better clarity
plt.title("Distribution of Bank Accounts by Annual Income Group")

# Display the chart
plt.show()


##### 1. Why did you pick the specific chart?

A donut chart, like a pie chart, shows each category as a proportion of the whole.

##### 2. What is/are the insight(s) found from the chart?

The Economy segment dominates again in terms of the total number of bank accounts, accounting for 65%, followed by the Mass Market segment (26%), the Affluent segment (8%), and finally the High Net-Worth segment (2%).
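The percentages drawn by `autopct` are each group's total divided by the grand total. A minimal sketch of that arithmetic follows, using invented totals rather than the dataset's actual values. The frame `d_sketch` and the column `Share_pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical account totals per income group (illustrative values only)
d_sketch = pd.DataFrame({
    "Anuual_income_group": ["Economy", "Mass Market", "Affluent", "High Net-Worth"],
    "Num_Bank_Accounts": [650, 260, 70, 20],
})

# Percentage share of each group in the overall total
total = d_sketch["Num_Bank_Accounts"].sum()
d_sketch["Share_pct"] = (d_sketch["Num_Bank_Accounts"] / total * 100).round(1)
print(d_sketch.sort_values("Share_pct", ascending=False))
```

Printing the shares as a table alongside the donut makes it easier to quote exact figures, since rounded wedge labels may not sum to exactly 100.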

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help the business positively because they guide targeted product design, efficient marketing, better resource allocation, and a clear understanding of customer segmentation. However, they also point to possible negative areas, such as low penetration in premium segments.

#### Chart - 10

In [None]:
# Chart - 10 visualization code


# Pivot to get each category as a separate column
pivot_data = monthly_counts.pivot(index='Month', columns='Credit_Score', values='Count').reset_index()

# Plot three proper lines
plt.figure(figsize=(12, 6))
plt.plot(pivot_data['Month'], pivot_data['Poor'], marker='o', label='Poor', linewidth=2.5)
plt.plot(pivot_data['Month'], pivot_data['Standard'], marker='o', label='Standard', linewidth=2.5)
plt.plot(pivot_data['Month'], pivot_data['Good'], marker='o', label='Good', linewidth=2.5)

plt.title('Monthly Trends in Credit Score Categories')
plt.xlabel('Month')
plt.ylabel('Number of People')
plt.legend(title='Credit Score Type')
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 9))  # Ensure all months 1-8 are shown

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A line chart shows how a value changes over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the Poor credit score count remains constant across all months, indicating no significant improvement or deterioration in that segment. The Good credit score count, however, increases steadily, suggesting that customers in this group are improving their financial habits and repayment behaviour.
Additionally, the Standard credit score count shows a declining trend, which may mean that some customers are slipping downward because of worsening payment behaviour or financial habits.
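The `monthly_counts` table used in the chart is built outside this section. One plausible way to derive such a table and to quantify the trend is sketched below. It assumes one row per customer-month with `Month` and `Credit_Score` columns, and the toy records are invented for illustration.

```python
import pandas as pd

# Toy records: one row per customer-month (illustrative values only)
records = pd.DataFrame({
    "Month": [1, 1, 1, 1, 2, 2, 2, 2],
    "Credit_Score": ["Poor", "Standard", "Standard", "Good",
                     "Poor", "Standard", "Good", "Good"],
})

# Count rows per month and credit score category
monthly_counts_sketch = (
    records.groupby(["Month", "Credit_Score"]).size().reset_index(name="Count")
)

# One column per category, then the change from the first to the last month
pivot_sketch = monthly_counts_sketch.pivot(
    index="Month", columns="Credit_Score", values="Count"
)
change = pivot_sketch.iloc[-1] - pivot_sketch.iloc[0]
print(change)
```

A positive value means the category grew between the first and last month, a negative value means it shrank, and zero means it stayed flat.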

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The increase in the Good credit score count indicates better repayment discipline, reducing risk for lenders. The consistent Poor category indicates that no spike is emerging from that segment. The decrease in the Standard category highlights the default risk associated with such a shift.

#### Chart - 11

In [None]:
# Chart - 11 visualization code - How do annual income vary with credit score ranges ?

# Set up figure size
plt.figure(figsize=(10,8))

# Create a boxplot
sns.boxplot(data = data, y = "Annual_Income", x = "Credit_Score", hue ="Credit_Score",legend = True)

# Adding title and labels for better clarity
plt.title("Distribution of Annual income based on different Credit Scores",pad =15,fontsize=14)
plt.xlabel("Credit Score",fontsize =12)
plt.ylabel("Annual Income",labelpad=14,fontsize = 12)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot highlights the extreme values as well as the interquartile range (IQR).

##### 2. What is/are the insight(s) found from the chart?

Credit score and annual income show a positive correlation. As annual income increases, the credit score tends to improve, indicating lower financial risk among higher-income consumers. The median annual income for the ‘Good’ credit score category is approximately ₹45,000, followed by the ‘Standard’ category at around ₹35,000, and the ‘Poor’ category at about ₹30,000. These median values highlight a clear positive relationship between credit score and annual income.
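The medians quoted above are read off the box plot. As a hedged sketch, the same statistics can be computed directly with a grouped quantile call. The toy values below were chosen only to mirror the reported medians and are not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from the dataset)
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Good",
                     "Standard", "Standard", "Standard",
                     "Poor", "Poor", "Poor"],
    "Annual_Income": [40000, 45000, 90000,
                      30000, 35000, 60000,
                      20000, 30000, 55000],
})

# Quartiles per credit score category: the numbers a box plot draws
summary = (
    toy.groupby("Credit_Score")["Annual_Income"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "Median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]
print(summary.loc[["Poor", "Standard", "Good"]])
```

Reporting the computed medians avoids the imprecision of estimating them visually from the chart.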

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, customers with lower annual incomes are more likely to have a higher default risk. This may be due to their spending behaviour, limited disposable income, or overall financial constraints. Therefore, businesses should focus more on low-income consumers in order to reduce bad debt and ensure better credit risk management.

#### Chart - 12

In [None]:
# Chart - 12 visualization code -  What is the distribution of credit score across different age groups?

# Set up figure size
plt.figure(figsize=(8,6))

# Create a boxplot
sns.boxplot(data = data, x= "Score_Code", y = "Age", hue = "Credit_Score",legend =True)

# Adding title and labels for better clarity
plt.title("Boxplot of Age by Credit Score Category")
plt.xlabel("Credit Score")
plt.ylabel("Age")

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A box plot enables the user to visualize the distribution of numerical values and highlights the extreme values present in the dataset.

##### 2. What is/are the insight(s) found from the chart?

There is a positive relationship between age and credit score. The median age increases with each better credit score category, which may be due to improvements in spending behaviour, financial literacy, and so on.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, but businesses must focus on consumers in the lower age bucket, as there is a higher possibility of default risk.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Set up figure size
plt.figure(figsize=(10,8))

# Create a boxplot
ax =sns.boxplot(data = data, x= "Score_Code", y = "Num_of_Delayed_Payment",hue = "Score_Code",legend =False)

# Adding title and labels for better clarity
plt.title("Distribution of Delayed payment across Credit Score Categories ")
plt.xlabel("Credit Score")
plt.ylabel("Number of Delayed Payment")
ax.set_xticklabels(['Poor', 'Standard', 'Good'])  # Parameters for x axis's labels

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot enables the user to visualize the distribution of numerical values and highlights the extreme values present in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart clearly indicates an inverse relationship between credit score and the number of delayed payments: as the number of delayed payments increases, the credit score decreases. In short, a higher number of delayed payments is associated with a lower credit score.
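The inverse relationship can also be expressed as a single number. The sketch below computes a Spearman rank correlation by ranking both columns and taking the Pearson correlation of the ranks. It assumes `Score_Code` is ordered 0 = Poor, 1 = Standard, 2 = Good, as the tick labels in the chart suggest, and the toy values are invented.

```python
import pandas as pd

# Toy data; Score_Code is assumed to be 0 = Poor, 1 = Standard, 2 = Good
toy = pd.DataFrame({
    "Score_Code": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Num_of_Delayed_Payment": [22, 18, 25, 14, 16, 10, 5, 8, 3],
})

# Spearman correlation equals the Pearson correlation of the ranks
rho = toy["Score_Code"].rank().corr(toy["Num_of_Delayed_Payment"].rank())
print(round(rho, 3))
```

A value close to -1 supports the reading of the box plot, while a value near 0 would mean the visual pattern is weaker than it looks.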

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses must closely monitor customers with a higher number of delayed payments. This also calls for proper background verification, routine pre-screening checks, and some financial literacy support.

#### Chart - 14

In [None]:
# Chart - 14 visualization code

# Setup figure size
plt.figure(figsize=(10,10))

# Create a boxplot
sns.boxplot(data= good_util,x = "Credit_Score",y = "Num_Credit_Card",hue = "Credit_Score",legend = True)

# Adding title and labels for better clarity
plt.title("Credit Score vs Num of Credit Card",pad= 14)
plt.xlabel("Credit Score")
plt.ylabel("Num of Credit Card")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot enables the user to visualize the distribution of numerical values and highlights the extreme values present in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that the number of credit cards and the credit score have a negative relationship when credit utilization is well managed. The median for the Good category is 4, followed by Standard (5) and Poor (7). This suggests that people with good credit scores tend to hold fewer credit cards.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, businesses must focus on consumers who hold numerous credit cards, as there is a high probability of default risk. Businesses must closely monitor such individuals in order to maintain a healthy portfolio. They should also lower credit limits, create customized repayment plans, and implement strict checks.

#### Chart - 15

In [None]:
# Chart - 15 visualization code

# Set up figure size
plt.figure(figsize=(10,6))

# Create a heatmap
sns.heatmap(df, annot=True, cmap="Blues", fmt="d")

# Adding title and labels for better clarity
plt.title("Occupation-wise Distribution of Credit Score Categories",pad =15,fontsize=14)
plt.ylabel("Occupation",fontsize=12)
plt.xlabel("Credit Score Category",fontsize=12)
plt.tight_layout()

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is chosen because it displays the counts for every combination of two categorical variables (here, occupation and credit score category) in a single grid. The colour gradient makes it easy to spot which combinations are most and least common, and the annotations give the exact counts, making the heatmap an efficient tool for exploratory data analysis.



##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that across all occupations, most people fall into the Standard credit score category, which means average credit behaviour is common regardless of profession. Occupations like Architects, Lawyers, Developers, and Writers have a noticeably higher number of people in the Standard category, indicating a stable repayment pattern. The Poor credit score category remains moderate across all jobs, while the Good credit score category is comparatively smaller for every occupation. This suggests that while many customers manage credit adequately, only a few consistently maintain very strong credit behaviour.
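The count table `df` passed to the heatmap is created outside this section. A minimal sketch of how such a table could be built with `pd.crosstab` is shown below. The column name `Occupation` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy records (illustrative values only; "Occupation" is an assumed column name)
toy = pd.DataFrame({
    "Occupation": ["Lawyer", "Lawyer", "Lawyer", "Architect", "Architect", "Writer"],
    "Credit_Score": ["Standard", "Standard", "Poor", "Standard", "Good", "Standard"],
})

# Rows = occupations, columns = credit score categories, cells = customer counts
counts = pd.crosstab(toy["Occupation"], toy["Credit_Score"])

# Fix the column order so the heatmap reads from Poor to Good
counts = counts.reindex(columns=["Poor", "Standard", "Good"], fill_value=0)
print(counts)
```

Because the cells are integer counts, the table works with `fmt="d"` in `sns.heatmap`. Passing `normalize="index"` to `pd.crosstab` would instead give row-wise shares, which are easier to compare across occupations of different sizes, but those are floats and would need a float format such as `fmt=".2f"`.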

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In terms of negative growth, the insight to note is that the Good credit score segment is very small across all occupations, meaning fewer customers fall in the low-risk group. If not addressed, this could limit premium lending opportunities and may increase overall portfolio risk.

#### Chart - 16

In [None]:
# Chart - 16 visualization code

# Set up figure size
plt.figure(figsize=(8,5))

# Create a heatmap
sns.heatmap(pivot_df, annot=True, fmt="d", cmap="Blues")
# Adding title and labels for better clarity
plt.title("Credit Score Patterns Across Different Age Groups",pad =20,fontsize=13)
plt.xlabel("Credit Score")
plt.ylabel("Age group")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is chosen because it displays the counts for every combination of two categorical variables (here, age group and credit score category) in a single grid. The colour gradient makes it easy to spot which combinations are most and least common, and the annotations give the exact counts, making the heatmap an efficient tool for exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows that Standard credit scores are the most common across all age groups, especially among Adults, who form the largest segment with stable credit behaviour. Teens have fewer Good scores and higher Poor scores, suggesting they may be less experienced in managing credit. Senior Citizens show a balanced pattern but still have more Standard scores than Good ones. Overall, Adults appear to be the most financially disciplined group, while Teens may need more guidance in credit handling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A negative insight is that Poor credit scores are relatively high among Teens and Adults, which could increase default risk if not addressed. Also, the number of Good credit scorers is low in every age group, which may limit growth in low-risk lending.

#### Chart - 17 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Selecting only numerical attributes
numeric = data.select_dtypes(include=['int64', 'float64'])

# Set up the figure size
plt.figure(figsize=(15,12))

# Create a heat map
sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap of Numeric Features")

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is chosen because it clearly visualizes the correlation between multiple numerical variables at once. It helps identify strong or weak relationships, detect multicollinearity, and understand which features may be important for analysis or modeling. The color gradients make it easy to quickly interpret patterns across the dataset, making the heatmap an efficient tool for exploratory data analysis.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows clear links between income variables, debt, EMI burden, and credit utilization, helping identify both stable and high-risk customer segments. Customers with high debt, high EMI, and high utilization form a stressed group prone to delayed payments and possible defaults, while those with longer credit histories and higher balances appear low risk. These insights support better credit decisions and targeted strategies, though ignoring the high-risk patterns may lead to increased defaults and negative business impact.
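With many numeric features, the strongest relationships can be hard to pick out of a large annotated grid. The sketch below lists each feature pair once, sorted by absolute correlation. The column names and values are illustrative assumptions, not the dataset's actual figures.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (illustrative values; column names are assumptions)
toy = pd.DataFrame({
    "Annual_Income": [20, 30, 40, 50, 60],
    "Monthly_Inhand_Salary": [2, 3, 4, 5, 6],
    "Num_of_Delayed_Payment": [20, 18, 12, 9, 4],
    "Num_Credit_Card": [7, 3, 6, 4, 5],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.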

#### Chart - 18 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select only numerical attributes
numeric = data.select_dtypes(include=['int64','float64'])

# Create a pairplot
sns.pairplot(numeric)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot because it allows me to visualize multiple numerical features together in one place and understand how they relate to each other. Each pair of variables gets its own scatterplot, which helps me quickly spot patterns, correlations, trends, and outliers. It also shows distributions of individual features on the diagonal. Instead of creating many separate charts, a pair plot gives a complete overview of relationships between variables in a single, easy-to-read visualization, making it very useful during EDA.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?


1. Target high-risk customers early

Identify customers with frequent delays, high utilization, and low incomes for strict monitoring, restructuring, or financial-literacy interventions.

2. Strengthen portfolio quality

Focus on improving the Standard segment and expanding the Good segment through customized repayment plans, credit-limit management, and reward-based repayment programs.

3. Optimize marketing & product strategy

Adults and Economy customers should be the primary target group, while Teens and low-income customers require cautious lending to minimize bad debt.

4. Boost premium product penetration

Low Good-score and high-income representation indicates untapped opportunities for premium lending and wealth-based products.

5. Reduce default risk

Monitoring occupations, segments, and behaviour patterns helps identify stressed pockets where early action can prevent non-recoveries.
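The first recommendation could be turned into a simple screening rule. The sketch below is only one possible formulation. The thresholds are hypothetical, and `Customer_ID` and `Credit_Utilization_Ratio` are assumed column names that would need to be checked against the real dataset.

```python
import pandas as pd

# Toy customers (illustrative values only)
customers = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3", "C4"],
    "Num_of_Delayed_Payment": [25, 5, 22, 3],
    "Credit_Utilization_Ratio": [42.0, 28.0, 30.0, 39.0],
    "Annual_Income": [18000, 75000, 22000, 64000],
})

# Hypothetical thresholds; tune them against the real distributions
MAX_DELAYS = 20
MAX_UTILIZATION = 35.0
MIN_INCOME = 25000

# One boolean column per risk criterion
flags = pd.DataFrame({
    "frequent_delays": customers["Num_of_Delayed_Payment"] > MAX_DELAYS,
    "high_utilization": customers["Credit_Utilization_Ratio"] > MAX_UTILIZATION,
    "low_income": customers["Annual_Income"] < MIN_INCOME,
})

# Count how many criteria each customer meets and flag those meeting two or more
customers["Risk_Flags"] = flags.sum(axis=1)
customers["High_Risk"] = customers["Risk_Flags"] >= 2
print(customers[["Customer_ID", "Risk_Flags", "High_Risk"]])
```

Counting criteria rather than requiring all three keeps the rule tolerant of a single borderline value, and the cut-off of two is itself an assumption to revisit.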

# **Conclusion**

The analysis reveals that most customers fall into the Standard credit score category, with only a small portion maintaining Good scores, indicating limited low-risk lending opportunities. Adults form the strongest and most financially active segment—they hold the highest number of accounts, credit cards, and loans, and show relatively better repayment behaviour than Teens, who display higher poor-score tendencies and weaker credit discipline. Income also plays a major role: a majority of customers belong to low-income brackets, and this group shows higher delayed payments, increased loan dependency, and greater default risk. High EMI burden, high debt, and elevated credit utilization strongly correlate with lower credit scores, forming a stressed customer segment that requires urgent monitoring.

Across occupations and segments, repayment behaviour remains mostly average, with the Good credit score segment consistently small, indicating that strong credit discipline is uncommon. Delayed payments typically fall within 10–25 days, but extreme cases (50+ days) signal potential chronic defaulters. Product penetration shows that Economy customers dominate in accounts, credit cards, and loans, making them the primary revenue source—but also the highest risk if not managed well. Trends over time show stable Poor scores, declining Standard scores, and improving Good scores, indicating a gradual shift in customer behaviour that needs strategic attention.
