# **Project Name**    -  **Paisabazaar Banking Fraud Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Data Preprocessing :

- Importing libraries
- Loading dataset
- Checking dataset shape, info & duplicates
- Handling missing values
- Removing irrelevant columns
- Removing outliers (Annual_Income – IQR method)
- Data type conversion
- Feature engineering: Debt-to-Income Ratio, Income Bracket


Exploratory Data Analysis (EDA) :

- Univariate analysis (credit score, age, income, debt, utilization, etc.)
- Bivariate analysis (credit score vs. income, debt, loan types, etc.)
- Multivariate analysis (combined influence of multiple variables)
- Correlation analysis

Visualisation :

- Histograms, boxplots, countplots, scatterplots, and heatmaps for pattern detection and insights

# **GitHub Link -**

abcsdsgshjsksksks

# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


Understanding and predicting an individual's credit score is critical for financial service providers like Paisabazaar, as it directly impacts loan approvals, product recommendations, and risk management.

In today’s competitive financial ecosystem, institutions are increasingly relying on creditworthiness metrics to offer personalized and responsible lending products. A misjudged credit profile can either lead to risky lending or missed business opportunities. Therefore, building accurate, data-driven insights into customer credit behavior is essential.

For Paisabazaar, the goal is not just to offer financial products — but to match the right product to the right person at the right time. This can only happen if credit scoring is robust, explainable, and rooted in behavioral patterns.

In this project, we will analyze customer-level data to:

- Perform exploratory data analysis (EDA) on key financial features
- Understand patterns in income, debt, credit usage, and payment behavior
- Identify the key drivers of low, average, and good credit scores
- Help Paisabazaar enhance its credit assessment pipeline and improve risk stratification for better lending decisions.

#### **Define Your Business Objective?**

The main business objective is to help Paisabazaar improve how it checks customer credit scores by using data like income, expenses, loans, and payment habits. This will make financial decisions smarter, quicker, and more personalized. The key goals are:

- **Better Credit Risk Management:**

  Predict credit scores more accurately to spot risky customers early. This helps avoid giving loans to people who may not pay them back, reducing financial losses.

- **Personalized Financial Suggestions:**
  
  Use the predicted credit scores to recommend the right products—like loans, credit cards, or insurance—based on each customer’s financial situation.

- **Faster Loan Approvals:**
  
  Make the loan process quicker by using models that automatically predict credit scores. This reduces waiting time and speeds up decisions.

- **Customer Satisfaction:**
  
  Give customers useful advice and suggestions based on their data. This helps build trust and keeps them engaged with Paisabazaar for the long run.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### **Import Libraries**

In [None]:
# Install and Import Required Libraries

!pip install pandas numpy matplotlib seaborn pymysql sqlalchemy --quiet

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set_style("whitegrid")
sns.set_palette("Set2")

# Suppresses warning
import warnings
warnings.filterwarnings('ignore')

### **Dataset Loading**

In [None]:
# Loading Dataset
data = pd.read_csv('/content/paisabazaar_dataset .csv')

### **Dataset First View**

In [None]:
# Display the first 5 records
data.head()

### **Dataset Rows & Columns count**

In [None]:
# Dataset Rows & Columns
# data.shape - gives no. of rows and columns

print(f'Number of Rows: {data.shape[0]}')
print(f'Number of Columns: {data.shape[1]}')

### **Dataset Information**

In [None]:
# Dataset Info
data.info()

#### **Duplicate Values**

In [None]:
# Count of duplicated rows

duplicate_count = data.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

- There are **no duplicate rows** in the dataset. Each record is unique.

- So we don’t need to drop any entries.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count


# Check for missing (NaN) values in each column of the dataset
# .isnull() returns a boolean DataFrame (True for null values)
# .sum() then counts how many nulls are in each column
print("Missing values in each column:\n")
data.isnull().sum()

- No missing values found in any column.
- We don’t need to fill or drop any data due to nulls.

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap (matplotlib is already imported above)

plt.figure(figsize=(14, 6))
sns.heatmap(data.isnull(), cbar=False, cmap="YlOrRd", yticklabels=False)
plt.title("Heatmap of Missing Values", fontsize=14)
plt.show()

**Yellow cells → Non-Missing Values**

**Red cells → Missing values**

- No column has missing values, so the heatmap contains no red cells.

- The heatmap confirms a clean, complete dataset.



### **What did you know about your dataset?**

The dataset belongs to the financial services industry, provided by Paisabazaar, and is used to analyze and classify individuals based on their credit scores.

The goal is to explore key behavioral and financial indicators like income, debt, payment history, and credit card usage, which influence creditworthiness. This classification supports better loan decisions and reduces financial risk.

- The dataset consists of 100,000 rows and 28 columns.

- There are no missing values or duplicates.

---
##### **Customer Information**
Categorical:
- `Name`: Not useful for modeling
- `Occupation`: Important for profiling

Numerical:
- `Age`: Important feature
- `SSN`: Sensitive ID, to be dropped
---
##### **Financial Attributes**
Numerical:

- `Annual_Income`: Key for affordability
- `Monthly_Inhand_Salary`: Derived from income
- `Num_Bank_Accounts`: Indicator of financial behavior
- `Num_Credit_Card`: Reflects credit usage
- `Num_of_Loan`: Loan load
- `Interest_Rate`: Reflects cost of credit
---
##### **Credit-Related Attributes**
Categorical:

- `Credit_Mix`: Good / Standard / Bad

Numerical:

- `Outstanding_Debt`: Key debt signal
- `Credit_Utilization_Ratio`: Core credit behavior metric
- `Credit_History_Age`: More = better
- `Changed_Credit_Limit`: Reflects financial changes
- `Total_EMI_per_month`: Ongoing liabilities
- `Amount_invested_monthly`: Saving habit proxy
- `Monthly_Balance`: Post-expense leftover
---
##### **Behavioral Metrics**
Categorical:

- `Payment_Behaviour`: Spending + payment pattern
- `Payment_of_Min_Amount`: Yes / No / NM (whether the minimum payment was made)
- `Type_of_Loan`: Multi-type; will need to split

Numerical:

- `Num_of_Delayed_Payment`: Bad habit indicator
- `Delay_from_due_date`: Tracks lateness
- `Num_Credit_Inquiries`: Higher = risky behavior
---
##### **Target Variable**

Categorical:

- `Credit_Score`: Final label — Good, Standard, Poor
---

##### **Non-Useful Columns (To Drop or Review):**

- `ID`: Row index (drop)
- `Customer_ID`: Redundant (drop)
- `Month`: Time signal (could be used for trends)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

In [None]:
# Target variable
data['Credit_Score'].value_counts()

### Variables Description

- **ID**: Unique row identifier (not needed for analysis)

- **Customer_ID**: Unique identifier for each customer

- **Month**: Month number of the transaction/data record

- **Name**: Customer name (personal information – should be removed)

- **Age**: Age of the customer in years

- **SSN**: Social Security Number (personal information – should be removed)

- **Occupation**: Customer’s profession or job role

- **Annual_Income**: Yearly income of the customer

- **Monthly_Inhand_Salary**: Take-home salary per month

- **Num_Bank_Accounts**: Number of bank accounts owned

- **Num_Credit_Card**: Number of credit cards owned

- **Interest_Rate**: Average interest rate across all loans/credit products

- **Num_of_Loan**: Total number of loans taken

- **Type_of_Loan**: Types of loans (stored as text — could be multiple types per entry)

- **Delay_from_due_date**: Average number of days a payment is delayed

- **Num_of_Delayed_Payment**: Total number of delayed payments

- **Changed_Credit_Limit**: Change in the customer’s credit limit

- **Num_Credit_Inquiries**: Number of times credit was checked (hard inquiries)

- **Credit_Mix**: Type of credit profile (e.g., Good, Standard, Bad)

- **Outstanding_Debt**: Total amount of unpaid debt

- **Credit_Utilization_Ratio**: Percentage of available credit being used

- **Credit_History_Age**: Duration of customer’s credit history (in months)

- **Payment_of_Min_Amount**: Whether the minimum payment was made (Yes/No)

- **Total_EMI_per_month**: Total monthly EMI payments

- **Amount_invested_monthly**: Amount invested by the customer every month

- **Payment_Behaviour**: Describes customer’s spending/payment pattern

- **Monthly_Balance**: Remaining monthly balance after expenses

- **Credit_Score**: Target variable – customer’s credit rating (Good, Standard, or Poor)

### Check Unique Values for each variable.

In [None]:
# Checking Unique Values for each variable in the credit dataset

for col in data.columns:
    print(f"Unique values in '{col}' : {data[col].nunique()}.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Creating a copy of the current dataset and assigning to df
df=data.copy()

In [None]:
# Check the data types of each column
print(df.dtypes)

In [None]:
# Dropping irrelevant columns

df.drop(['ID', 'Customer_ID', 'Name', 'SSN', 'Month'], axis=1, inplace=True)

In [None]:
# Converting data type of Monthly_Balance to numeric on the working copy (df)
# errors='coerce' turns values that cannot be parsed into NaN

df['Monthly_Balance'] = pd.to_numeric(df['Monthly_Balance'], errors='coerce')

In [None]:
# List of important categorical columns

categorical_cols = [
    'Credit_Mix',
    'Payment_of_Min_Amount',
    'Payment_Behaviour',
    'Credit_Score'
]

for col in categorical_cols:
    print("-------------------------------------")
    print(df[col].value_counts(dropna=False))
    print("-------------------------------------")

In [None]:
# Replacing 'NM' with 'No' in 'Payment_of_Min_Amount'

df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].replace('NM', 'No')

- Payment_of_Min_Amount

  Values: ['No', 'NM', 'Yes']
  * Treating "NM" as "No"

In [None]:
# After replacing 'NM' with 'No', let's check the unique values again

print("Cleaned unique values in 'Payment_of_Min_Amount':")
print(df['Payment_of_Min_Amount'].unique())
df['Payment_of_Min_Amount'].value_counts(dropna=False)

In [None]:
# Stripping extra spaces in categorical columns
# to avoid errors while analysis

cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    df[col] = df[col].str.strip()

In [None]:
# To check different types of loan taken by customers

df['Type_of_Loan'].value_counts().head(10)

In [None]:
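# Creating a new feature that counts how many loan types each customer has.
# Assumes entries are comma-separated and may use "and" before the last item:
# " and " is normalised to a comma and empty tokens left by ", and" are ignored.

df['Num_Loan_Types'] = df['Type_of_Loan'].apply(
    lambda x: len([t for t in str(x).replace(' and ', ', ').split(',') if t.strip()])
)

# Checking how many customers fall into each count of loan types
df['Num_Loan_Types'].value_counts().sort_index()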

- Most people have 1 loan — simple portfolios.

- But a large chunk has 3–5 loans

- Some have 7–10 loans — more credit dependent.

In [None]:
# Check for outliers in key numerical columns:

print(df[['Annual_Income', 'Credit_Utilization_Ratio', 'Monthly_Inhand_Salary', 'Outstanding_Debt']].describe())

In [None]:
# To check outliers in Annual Income

sns.boxplot(x=df['Annual_Income'])
plt.title("Boxplot - Annual Income")
plt.show()

In [None]:
# Use IQR method to remove outliers from the working copy (df)
Q1 = df['Annual_Income'].quantile(0.25)
Q3 = df['Annual_Income'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only rows inside the bounds; .copy() avoids SettingWithCopyWarning later
df = df[(df['Annual_Income'] >= lower_bound) & (df['Annual_Income'] <= upper_bound)].copy()

In [None]:
# After removing outliers
sns.boxplot(x=df['Annual_Income'])
plt.title("Boxplot - Annual Income (after outlier removal)")
plt.show()

Visual Differences:
- Right-side "whisker" shortened → The extreme values (high-income outliers > upper bound) are gone.

- No outlier dots → Indicates all values are now within the acceptable range.

- Box appears tighter → Better focus on the core 50% of data (Q1 to Q3).
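
As a worked illustration of the IQR rule used above, the following sketch applies the same bounds to a small made-up series. The helper name `iqr_bounds` and the sample values are assumptions for illustration and are not taken from the project dataset.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical incomes (in thousands); 100 is an obvious extreme value
sample = pd.Series([10, 12, 14, 16, 18, 100])
lower, upper = iqr_bounds(sample)

# Keep only values inside the fences (inclusive on both ends)
kept = sample[sample.between(lower, upper)]
print(f"bounds: [{lower}, {upper}], kept {len(kept)} of {len(sample)} values")
```

Here Q1 = 12.5 and Q3 = 17.5, so the IQR is 5 and the fences are 5 and 25; only the value 100 is dropped.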

In [None]:
# Creating Debt-to-Income Ratio on the working copy (df)

df['Debt_to_Income_Ratio'] = df['Outstanding_Debt'] / df['Annual_Income']

In [None]:
# Categorize Income into Brackets

df['Income_Bracket'] = pd.cut(df['Annual_Income'],
                                   bins=[0, 20000, 50000, 100000],
                                   labels=['Low', 'Medium', 'High'])
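
One thing worth checking with these bins: the top edge is 100000, so any income above it gets no bracket (`NaN`). The sketch below, on made-up incomes, contrasts the closed top edge with an open-ended one using `np.inf`. Whether the project data actually contains incomes above 100000 at this point is not known here, so treat this as an optional safeguard rather than a required change.

```python
import numpy as np
import pandas as pd

labels = ['Low', 'Medium', 'High']
incomes = pd.Series([15000, 20000, 35000, 75000, 150000])  # made-up values

# Closed top edge, as in the cell above: 150000 falls outside every bin
closed = pd.cut(incomes, bins=[0, 20000, 50000, 100000], labels=labels)

# Open-ended top edge: everything above 50000 is labelled 'High'
open_ended = pd.cut(incomes, bins=[0, 20000, 50000, np.inf], labels=labels)

print(pd.DataFrame({'income': incomes, 'closed': closed, 'open_ended': open_ended}))
```

A quick `df['Income_Bracket'].isna().sum()` after binning would show whether any rows were left unlabelled.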

In [None]:
print("Final Data Shape after improvements:\n", df.shape)
print("\nData Types:\n")
print(df.dtypes)

### **What all manipulations have you done and insights you found?**


### Data Manipulations:

1. **Removed Irrelevant Columns**: Dropped unnecessary columns such as customer identifiers and unrelated attributes to focus on relevant financial and behavioral features.

2. **Outlier Removal**: Filtered out extreme values in the **Annual_Income** column using the Interquartile Range (IQR) method. This step ensures that outliers do not skew the analysis.

3. **Data Type Conversion**: Converted the **Monthly_Balance** column to a numeric type, ensuring that calculations involving this column can be performed accurately.

4. **Calculated Debt-to-Income Ratio**: Introduced a new metric, **Debt_to_Income_Ratio**, computed as `Outstanding_Debt / Annual_Income`. It shows how large a customer's outstanding debt is relative to yearly income, which is an important measure of creditworthiness.

5. **Created Income Bracket**: Categorized customers' **Annual_Income** into three brackets (Low, Medium, High) to facilitate segmentation and targeted analysis.
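
The ratio in step 4 divides `Outstanding_Debt` by `Annual_Income`, so a zero income would produce an infinite value. A minimal sketch of a guarded version follows; the helper name `debt_to_income` and the sample rows are hypothetical, and the notebook itself uses the plain division.

```python
import pandas as pd


def debt_to_income(outstanding_debt, annual_income):
    """Outstanding debt divided by annual income; NaN where income is not positive."""
    safe_income = annual_income.where(annual_income > 0)  # non-positive income -> NaN
    return outstanding_debt / safe_income


# Made-up rows for illustration only
example = pd.DataFrame({
    'Outstanding_Debt': [1000.0, 500.0, 200.0],
    'Annual_Income': [20000.0, 50000.0, 0.0],
})
example['Debt_to_Income_Ratio'] = debt_to_income(
    example['Outstanding_Debt'], example['Annual_Income']
)
print(example)
```

The first two rows give 0.05 and 0.01; the zero-income row becomes `NaN` instead of `inf`.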

### Insights Found:

- **Outlier Management**: Removing outliers allows for a more accurate representation of customer income, leading to more reliable insights.

- **Debt-to-Income Ratio**: This new metric provides a clearer understanding of customers' financial health, indicating potential credit risks.

- **Income Segmentation**: The creation of income brackets enables targeted analysis, making it easier to tailor financial product recommendations based on income levels.

- **Improved Data Quality**: Verifying that there are no missing values and ensuring appropriate data types enhances the reliability of the dataset for predictive modeling based on financial attributes.

## 4. **Data Visualization Storytelling & Experimenting with charts:** **Understand the relationships between variables**

## **Univariate Analysis**

#### Chart 1: **Credit Score Distribution**

In [None]:
# Chart - 1: Credit Score Distribution
fig, ax = plt.subplots(figsize=(10, 6))

sns.countplot(x='Credit_Score', data=df, palette='Set2', ax=ax)

ax.set_title('Credit Score Distribution', fontsize=16, fontweight='bold')
ax.set_ylabel('Count')
ax.set_xlabel('Credit Score')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A countplot shows how many customers fall into each credit score group (Good, Standard, Poor). It reveals which group is the largest and where the company should focus attention, which is essential for designing segment-specific strategies.

##### 2. What is/are the insight(s) found from the chart?

- Most customers have a Standard credit score — they’re neither great nor risky, but in the middle.
- A good number of users have Poor scores — this is a red flag for repayment risk.
- Only a small group has a Good credit score — the ideal customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

- Standard group is a big opportunity — with the right products (like credit-building loans), they could move to the Good category.
- Poor score customers can be offered risk-managed products or guided through financial coaching.

Negative Insight:

- The large share of customers in the Standard and Poor categories suggests potential risk exposure. If left unaddressed, it may lead to higher defaults and lower profitability.
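
The group sizes read off the countplot can also be expressed as percentages. The sketch below uses made-up labels rather than the project data, and `class_shares` is a hypothetical helper; on the notebook's data the equivalent call would be `class_shares(df['Credit_Score'])`.

```python
import pandas as pd


def class_shares(labels):
    """Percentage of rows in each category, largest first."""
    return labels.value_counts(normalize=True).mul(100).round(1)


# Made-up labels, not the project data
toy_scores = pd.Series(['Standard'] * 5 + ['Poor'] * 3 + ['Good'] * 2)
shares = class_shares(toy_scores)
print(shares)
```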

#### Chart 2: **Age distribution**

In [None]:
# Chart - 2: Age Distribution
fig, ax = plt.subplots(figsize=(10, 6))

sns.histplot(x='Age', data=df, kde=True, bins=30, color='#0066FF', ax=ax)

ax.set_title('Age Distribution', fontsize=16, fontweight='bold')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used a histogram with a KDE (density) curve to see which age groups are most common among customers. It helps us identify our core audience and where we may need better engagement.

##### 2. What is/are the insight(s) found from the chart?

- Most customers are between 30 and 40 years old — this is the most active segment.
- Fewer users are under 25 or over 50 — engagement is low in younger and older age groups.
- A small rise around age 50 suggests some renewed interest from older individuals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

- Marketing and product design can focus on the 30–40 age group, as they’re the primary audience.
- Retirement-focused products or wealth-building tools may work well for those around 50.

Negative Insight:

- The low engagement from younger users (<25) might indicate a missed opportunity: they could become long-term customers if onboarded early with beginner-friendly financial tools.

#### Chart - 3 **Income Distribution**

In [None]:
# Chart - 3: Annual Income and Monthly In-hand Salary Distribution
fig, ax = plt.subplots(1, 2, figsize=(18, 6))

variables = ['Annual_Income', 'Monthly_Inhand_Salary']
titles = ['Annual Income Distribution', 'Monthly In-hand Salary Distribution']

for i, var in enumerate(variables):
    sns.histplot(data=df, x=var, bins=30, kde=True, color='#0066FF', ax=ax[i])
    ax[i].set_title(titles[i], fontsize=16, fontweight='bold')
    ax[i].set_xlabel(var.replace('_', ' '))
    ax[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used histograms to explore how income is distributed across the customer base—both annually and monthly. Understanding income levels helps us design appropriate financial products and assess overall affordability.

##### 2. What is/are the insight(s) found from the chart?

- Most people earn ₹20,000–₹40,000 annually, and ₹2,000–₹4,000 monthly, showing a clear clustering in the lower-income range.
- Both charts are right-skewed — fewer people earn high salaries.
- The similar shape of both curves shows monthly and annual incomes are consistent, meaning no major reporting or data inconsistencies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

- Big opportunity to offer budgeting, micro-loans, or savings products tailored to low/mid-income groups.
- Stability between income and salary distributions indicates trustworthy income patterns, useful for credit underwriting.

Negative Insight:

- High concentration of low-income customers may limit eligibility for premium or high-risk financial products. This may require stricter credit policies or tiered offerings to manage risk effectively.

#### Chart - 4 **Credit Utilization**

In [None]:
# Chart - 4: Credit Utilization
fig, ax = plt.subplots(1, 1, figsize=(18, 6))

sns.boxplot(x='Credit_Utilization_Ratio', data=df, palette='Set2', ax=ax)

# Title and axis labels
ax.set_title('Credit Utilization Distribution', fontsize=16, fontweight='bold')
ax.set_xlabel('Credit Utilization Ratio', fontsize=12)
ax.set_ylabel('Value', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was used to understand how customers are using their available credit — whether they are staying within safe limits or overspending. It also helps detect outliers who may be at risk of financial stress.

##### 2. What is/are the insight(s) found from the chart?

- Most users have a credit utilization between ~28% and 35%, which is relatively moderate and stable.
- Very few outliers exceed 45–50%, which can signal credit overuse and increased repayment risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

- The majority of customers show healthy credit behavior, which lowers default risk and improves underwriting confidence.
- Credit monitoring tools or nudges could encourage customers to stay within this safe range.

Negative Insight:

- The presence of outliers suggests a small but high-risk segment. These users may require close monitoring, financial counseling, or capped product limits to avoid future defaults.

#### Chart - 5 **Distribution of Interest Rate**

In [None]:
# Chart - 5: Distribution of Interest Rate
fig, ax = plt.subplots(figsize=(12, 6))

sns.histplot(data=df, x='Interest_Rate', bins=30, kde=True, color='#0066FF', ax=ax)

# Title and axis labels
ax.set_title('Distribution of Interest Rate', fontsize=16, fontweight='bold')
ax.set_xlabel('Interest Rate', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a density curve helps us see how loan interest rates are spread across customers. It shows where most loans are priced and whether certain rate bands are more common than others — useful for pricing strategies.

##### 2. What is/are the insight(s) found from the chart?

- Interest rates are widely spread from 0% to 35%, but not evenly.
- There are clear peaks around 5%, 10%, and 20%, showing clustering — likely due to product tiers or risk-based pricing.
- A significant chunk of loans still fall in the higher rate range (>20%), which can indicate riskier borrowers or subprime segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

- The clustered peaks can help standardize product pricing — knowing what rates dominate can guide competitive positioning.
- Data can be used to analyze customer behavior at different interest tiers (e.g., default rate at 5% vs 20%).

Negative Insight:

- High number of loans at 20%+ rates could signal exposure to high-risk segments. This may increase churn or defaults if not balanced with proper credit checks and risk controls.

## **Bivariate Analysis**

#### Chart - 6 **Impact of Age on Credit Score**

In [None]:
# Chart - 6: Age and Credit Score
fig, ax = plt.subplots(figsize=(12, 6))

sns.boxplot(x='Credit_Score', y='Age', data=df, ax=ax, palette='Set2')

ax.set_title('Age and Its Impact on Credit Score', fontsize=16, fontweight='bold')
ax.set_xlabel('Credit Score', fontsize=12)
ax.set_ylabel('Age', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why this chart?

A boxplot was used to understand if a person’s age affects their credit score. Age is often seen as a proxy for financial maturity or history, so we wanted to test that assumption.

##### 2. Insight from the chart:

- Most customers, regardless of credit score, fall in the age range of 30 to 50 years.
- Good credit scorers tend to be slightly older than Standard or Poor groups.
- But the overall pattern suggests that age alone doesn't have a strong influence on credit score in this dataset.



##### 3. Business Impact:

Positive impact:

- Age can support segmentation, but shouldn’t be a standalone factor in credit decisions.
- Avoids age bias — ensures models stay compliant and fair while targeting the right customer base.

No negative impact observed

- Age distribution is fairly balanced, meaning other features likely drive credit score more significantly (e.g., income, repayment behavior).

#### Chart - 7 **Impact of Income on Credit Score**


In [None]:
# Chart - 7: Income and Credit Score
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

sns.violinplot(x='Credit_Score', y='Annual_Income', data=df, ax=ax1, palette='Set2')
ax1.set_title('Annual Income vs Credit Score', fontsize=16, fontweight='bold')
ax1.set_xlabel('Credit Score')
ax1.set_ylabel('Annual Income')

sns.violinplot(x='Credit_Score', y='Monthly_Inhand_Salary', data=df, ax=ax2, palette='Set2')
ax2.set_title('Monthly In-hand Salary vs Credit Score', fontsize=16, fontweight='bold')
ax2.set_xlabel('Credit Score')
ax2.set_ylabel('Monthly In-hand Salary')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used these violin plots to see how income levels, both annual and monthly in-hand, differ across credit score categories. Income is a strong predictor of financial stability, so we wanted to see how it aligns with creditworthiness.

##### 2. What is/are the insight(s) found from the chart?

- People with a Good credit score generally have higher annual income and higher monthly salary.

- Those in Standard and especially Poor credit groups tend to earn less annually and monthly.

- The spread of income also narrows for lower credit groups, meaning most of them earn below a certain threshold.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positives:

- Income level is a key driver of credit health. Customers with higher income are more likely to be low-risk borrowers.
- Use this to prioritize high-income segments for pre-approved offers, loans, or credit limit enhancements.

Risk Warning:

- Low-income customers in the poor credit group may need tailored financial support, like budgeting tools, credit counseling, or secured cards.
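
The visual comparison in the violin plots can be backed by per-group summary statistics. This is a minimal sketch on made-up rows; the column names match the dataset, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Credit_Score': ['Good', 'Good', 'Standard', 'Standard', 'Poor', 'Poor'],
    'Annual_Income': [90000, 70000, 50000, 40000, 30000, 20000],
})

# Median, minimum and maximum income within each credit score group
summary = toy.groupby('Credit_Score')['Annual_Income'].agg(['median', 'min', 'max'])
print(summary)
```

Running the same `groupby` on `df` would show whether the median income really increases from Poor to Good, as the plots suggest.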

#### Chart - 8 **The Relationship Between Income and Payment Delays**

In [None]:
# Chart 8 - Scatter Plot: Annual Income vs Payment Delays
fig, ax = plt.subplots(figsize=(12, 6))

# No hue is assigned, so a palette would be ignored; it is left out here
sns.scatterplot(x='Annual_Income', y='Delay_from_due_date', data=df, ax=ax, alpha=0.6)

ax.set_title('The Relationship Between Income and Payment Delays', fontsize=16, fontweight='bold')
ax.set_xlabel('Annual Income', fontsize=12)
ax.set_ylabel('Delay from Due Date (Days)', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used this scatter plot to explore whether income levels impact how quickly people pay their dues. Late payments are often red flags in credit behavior, so this chart helps us assess that risk.

##### 2. What is/are the insight(s) found from the chart?

- People with lower annual income tend to have more delays in making payments.
- However, many customers across all income levels manage to pay within 30 days, which is a good sign.
- There are some high-income individuals who also delay payments, showing income isn't the only factor influencing repayment behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Takeaway:

- A large chunk of customers, even in the lower income brackets, pay within 30 days. This shows a potential for credit discipline even in financially tighter groups.

Possible Risk:

- Paying within 30 days does not always mean a customer is creditworthy. Some may still default later or carry revolving debt.

- Banks should not rely on income alone. Combine this with behavior tracking (like repeated delays or increasing debt) to assess true credit risk.
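
The income/delay relationship read off the scatter plot can be quantified with a correlation coefficient. The sketch below uses made-up values and the default Pearson method of `Series.corr`; it only illustrates the call and says nothing about the actual strength of the relationship in the project data.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Annual_Income': [20000, 35000, 50000, 80000, 120000],
    'Delay_from_due_date': [30, 25, 18, 20, 5],
})

# Pearson correlation: values near -1 or +1 indicate a strong linear relationship
r = toy['Annual_Income'].corr(toy['Delay_from_due_date'])
print(f"Pearson r = {r:.2f}")
```

A negative value means delays tend to shrink as income rises; a value close to zero would support the point that income alone is a weak signal.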


#### Chart - 9 **Relationship Between Credit Scores and Bank Account Holdings**

In [None]:
# Chart 9 - Boxplot: Credit Scores vs Number of Bank Accounts
fig, ax = plt.subplots(figsize=(12, 6))

sns.boxplot(x='Credit_Score', y='Num_Bank_Accounts', data=df, ax=ax, palette='Set2')

ax.set_title('Understanding the Relationship Between Credit Scores and Bank Account Holdings', fontsize=16, fontweight='bold')
ax.set_xlabel('Credit Score', fontsize=12)
ax.set_ylabel('Number of Bank Accounts', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We wanted to check if the number of bank accounts held by an individual has any pattern or association with their credit score. Multiple accounts could reflect either financial maturity—or mismanagement.

##### 2. What is/are the insight(s) found from the chart?

- Customers with Good credit scores typically maintain 2 to 6 accounts, with a median of 4.

- Those with Standard scores are slightly more concentrated around 3 to 5 accounts, suggesting stability but less diversification.

- Interestingly, Poor credit holders show the widest spread—some have only 2 accounts, others go beyond 6, with the median nudging closer to 6.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Interpretation:

- A higher number of accounts doesn’t always mean better credit. In fact, those with poor scores seem to hold more accounts, which could hint at financial juggling or over-leveraging.

- The “sweet spot” of 3–5 accounts appears in both good and standard credit groups, suggesting a healthy balance between access and manageability.

Recommendation:

- Use account count as one of the auxiliary indicators in credit profiling—especially when flagged with high utilization or frequent fund transfers.

- Watch for individuals with high account counts but poor scores, as this might reflect account hopping, missed obligations, or credit dependency.

#### Chart - 10 **The Impact of Delayed Payments on Your Credit Score**

In [None]:
# Chart 10 - Boxplot: Delayed Payments vs Credit Score
fig, ax = plt.subplots(figsize=(12, 6))

sns.boxplot(
    x='Credit_Score',
    y='Num_of_Delayed_Payment',
    data=df,
    ax=ax,
    palette='Set2'
)

ax.set_title('The Impact of Delayed Payments on Your Credit Score', fontsize=16, fontweight='bold')
ax.set_xlabel('Credit Score', fontsize=12)
ax.set_ylabel('Number of Delayed Payments', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used this boxplot to compare the distribution of delayed payments across credit score categories. Payment behavior is a direct reflection of a customer’s credit reliability.

##### 2. What is/are the insight(s) found from the chart?

- People with Good credit usually have 5–12 delayed payments, the fewest of the three groups.
- Those with Standard scores show moderate delays, typically around 10–17.
- Individuals with Poor scores have 15–20+ delayed payments, with some extreme cases beyond that.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Clear Link:

- The boxplot shows a clear negative association between the number of delayed payments and credit score quality: typical delay counts rise from Good to Standard to Poor.
- The more someone delays, the more their credit trust drops—a red flag for financial institutions.
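
A boxplot shows the direction of this relationship but not its strength. One way to put a number on it is a Spearman rank correlation between an ordinal encoding of `Credit_Score` and `Num_of_Delayed_Payment`. The sketch below assumes the ordering Poor < Standard < Good; the `SCORE_ORDER` mapping, the helper name, and the toy values are our own assumptions for illustration.

```python
import pandas as pd

# Assumed ordinal encoding of the credit score labels.
SCORE_ORDER = {"Poor": 0, "Standard": 1, "Good": 2}

def rank_correlation_with_score(df, value_col, score_col="Credit_Score"):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    score_rank = df[score_col].map(SCORE_ORDER).rank()
    value_rank = df[value_col].rank()
    return score_rank.corr(value_rank)

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
    "Num_of_Delayed_Payment": [5, 8, 12, 15, 18, 22],
})
rho = rank_correlation_with_score(toy, "Num_of_Delayed_Payment")
print(round(rho, 3))
```

A value close to -1 on the real data would support describing the relationship as strong; a value nearer 0 would call for softer wording.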


Positive Use:

- This insight can power a rule-based risk alert system, for example flagging accounts with more than 12 delayed payments (above the typical range for Good credit) for early intervention.

- It can also help prioritize collection strategies and automated reminders to reduce payment slippage.


Negative Implication:

- If such delay patterns are ignored, businesses risk default buildup, affecting loan books and capital recovery.
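
The rule-based alert described above can be sketched in a few lines. This is an illustration, not a production rule: the threshold of 12 comes from the recommendation above, while the helper `flag_high_delay`, the column name `High_Delay_Flag`, and the toy values are hypothetical.

```python
import pandas as pd

def flag_high_delay(df, threshold=12, delay_col="Num_of_Delayed_Payment"):
    """Add a boolean column marking customers with more than `threshold` delays."""
    out = df.copy()
    out["High_Delay_Flag"] = out[delay_col] > threshold
    return out

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Credit_Score": ["Good"] * 3 + ["Standard"] * 3 + ["Poor"] * 3,
    "Num_of_Delayed_Payment": [5, 12, 13, 10, 14, 17, 15, 20, 11],
})
flagged = flag_high_delay(toy)

# Share of flagged customers within each credit score group.
flag_rate = flagged.groupby("Credit_Score")["High_Delay_Flag"].mean()
print(flag_rate)
```

Checking `flag_rate` on the real data shows how many Good-score customers the rule would catch by mistake, which helps when tuning the threshold.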

#### Chart - 11 **Occupation Distribution Across Credit Scores**

In [None]:
# Pre-aggregate counts by Occupation and Credit_Score
credit_score_counts = df.groupby(['Occupation', 'Credit_Score']).size().reset_index(name='Count')

# Chart - 11
fig, ax = plt.subplots(figsize=(14, 8))

sns.barplot(
    x='Count',
    y='Occupation',
    hue='Credit_Score',
    data=credit_score_counts,
    ax=ax,
    palette='Set2')

ax.set_title('Occupation Distribution Across Credit Scores', fontsize=16, fontweight='bold')
ax.set_xlabel('Number of Individuals', fontsize=12)
ax.set_ylabel('Occupation', fontsize=12)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used a grouped horizontal bar chart to show how different occupational categories are distributed across credit score levels (Good, Standard, Poor).

- A bar chart was chosen because it’s ideal for comparing counts across multiple categories — here, occupation types.

- The horizontal layout improves readability for longer occupation names, while the grouping by hue (Credit Score) helps spot creditworthiness patterns within each job sector.

##### 2. What is/are the insight(s) found from the chart?

- Healthcare Practitioners and Support staff make up a large portion of the dataset, mostly in the Standard and Good categories.

- Legal and Management roles also show strong presence, but with a mix of credit scores.

- Engineering, Education, and Sales show moderate representation with more balanced score distribution.

- Lower counts and a slight skew toward Poor credit are visible in Arts, Agriculture, and Food Services, hinting at financial challenges or limited access.


##### 3. Business Impact


Positive Impact:

- Occupation-specific insights can power risk-tiered credit products (e.g., flexible loans for seasonal sectors like agriculture).

- Supports targeted marketing—such as offering premium credit cards to stable sectors like healthcare or law.


Risk Flag:

- Customers in volatile or underrepresented sectors (like arts, food service) may need custom credit checks, low-limit products, or educational nudges before onboarding.


This chart helps build occupation-based segmentation models, improving both credit decisioning and product personalization.
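
Raw counts mix two effects: how large an occupation group is and how its members are distributed across credit scores. To compare occupations on an equal footing, the counts can be converted to within-occupation shares. A minimal sketch, assuming the `Occupation` and `Credit_Score` columns used in the chart; the helper name and the toy occupations are illustrative only.

```python
import pandas as pd

def score_share_by_group(df, group_col, score_col="Credit_Score"):
    """Share of each credit score within every group (each row sums to 1)."""
    return pd.crosstab(df[group_col], df[score_col], normalize="index")

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Occupation": ["Lawyer"] * 4 + ["Writer"] * 2,
    "Credit_Score": ["Good", "Good", "Standard", "Poor", "Poor", "Standard"],
})
occupation_share = score_share_by_group(toy, "Occupation")
print(occupation_share.round(2))
```

Sorting the result by the `Poor` column on the real data would rank occupations by their share of poor scores, independent of group size.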

#### Chart - 12 **The Relationship Between Credit Inquiries and Credit Score**

In [None]:
# Chart - 12
fig, ax = plt.subplots(figsize=(14, 8))

sns.countplot(
    data=df,
    x='Num_Credit_Inquiries',
    hue='Credit_Score',
    ax=ax,
    palette='Set2'
)

# Custom styling directly here
ax.set_title('The Relationship Between Credit Inquiries and Credit Score', fontsize=18, fontweight='bold')
ax.set_xlabel('Number of Credit Inquiries', fontsize=14, fontweight='bold')
ax.set_ylabel('Count', fontsize=14, fontweight='bold')
ax.legend(title='Credit Score', title_fontsize=12, fontsize=11)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used a grouped count plot (vertical bar chart) to analyze how the number of credit inquiries correlates with credit score categories.

- This chart type is ideal for showing discrete frequency distributions, where we can see how many individuals made 0, 1, 2, ... credit inquiries.

- The hue-based grouping by credit score lets us detect if frequent inquiries are linked to poor credit, which is often assumed but needs validation.

##### 2. What is/are the insight(s) found from the chart?

- Fewer credit inquiries (0–2) are most common and more likely among Good and Standard credit groups.

- As inquiries increase (especially beyond 5), Poor scores make up a larger share of each group of bars, indicating higher risk.

- Interestingly, a small number of Good credit holders still have many inquiries — likely due to responsible rate-shopping or proactive financial planning.
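
A count plot shows absolute frequencies, so the rising share of Poor scores has to be judged by eye. The sketch below computes that share directly for three inquiry bands. The band edges (0-2, 3-5, 6+) mirror the ranges mentioned above; the helper name and the toy values are assumptions for illustration.

```python
import pandas as pd

def poor_share_by_inquiry_band(df, inquiry_col="Num_Credit_Inquiries",
                               score_col="Credit_Score"):
    """Fraction of customers with a Poor score in each band of credit inquiries."""
    bands = pd.cut(
        df[inquiry_col],
        bins=[-1, 2, 5, float("inf")],   # intervals (-1, 2], (2, 5], (5, inf)
        labels=["0-2", "3-5", "6+"],
    )
    is_poor = df[score_col].eq("Poor")
    return is_poor.groupby(bands, observed=False).mean()

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Num_Credit_Inquiries": [0, 1, 2, 3, 4, 5, 6, 8, 10],
    "Credit_Score": ["Good", "Standard", "Good", "Standard", "Poor",
                     "Standard", "Poor", "Poor", "Good"],
})
poor_share = poor_share_by_inquiry_band(toy)
print(poor_share)
```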

## **Multivariate Analysis**

In [None]:
# Get top 10 loan types
top_loan_types = df['Type_of_Loan'].value_counts().nlargest(10).index

# Filter data to only include those
filtered_df = df[df['Type_of_Loan'].isin(top_loan_types)]

# Create a pivot table (heatmap source)
loan_credit_heatmap = filtered_df.groupby(['Type_of_Loan', 'Credit_Score']).size().unstack(fill_value=0)

# Plot
fig, ax = plt.subplots(1, 1, figsize=(18, 10))

sns.heatmap(
    loan_credit_heatmap,
    annot=True,
    fmt='d',
    cmap='Blues',
    ax=ax,
    cbar_kws={'label': 'Count'},
    linewidths=0.5
)

# Basic styling
ax.set_title('Credit Score Distribution by Loan Type', fontsize=18, fontweight='bold')
ax.set_xlabel('Credit Score', fontsize=14, fontweight='bold')
ax.set_ylabel('Type of Loan', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

Why Use a Heatmap?
- It gives a quick visual snapshot of which loan types attract which credit score groups.
- Darker shades = more people in that category.
- Makes it easy to spot where risk (poor credit) or opportunity (good credit) lies.

Key Insights:
- Credit-Builder and Debt Consolidation Loans have many poor credit holders — no surprise, they’re meant to help rebuild credit.
- Auto, Mortgage, and Home Equity Loans have a balanced mix — used by both high and low credit scorers.
- Personal Loans and Credit Cards are popular among those with Standard and Good credit, likely due to lifestyle or urgent needs.
- Student and Payday Loans skew toward lower credit users — reflecting financial strain or early career stages.

Business Impact:
- Product Strategy: Customize products — e.g., promote Credit-Builder Loans to at-risk customers.
- Risk Control: Apply stricter checks or pricing on loans with lots of poor credit users.
- Marketing: Use insights to run targeted campaigns, matching the right loan to the right customer.
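
Two caveats apply to this heatmap. If `Type_of_Loan` stores several loans in one comma-separated string (an assumption about the raw data that should be checked first), the top-10 filter ranks combinations rather than individual loan types. Absolute counts also favour the largest loan types. The sketch below splits such strings into one row per loan and converts the counts to row-wise shares. The helper names, the `"Not Specified"` placeholder, and the toy values are hypothetical.

```python
import pandas as pd

def explode_loan_types(df, col="Type_of_Loan"):
    """Return one row per individual loan, assuming comma/'and'-separated strings."""
    out = df.copy()
    out[col] = (
        out[col]
        .fillna("Not Specified")                    # hypothetical placeholder label
        .str.replace(r"\band\b", ",", regex=True)   # treat a standalone 'and' as a separator
        .str.split(",")
    )
    out = out.explode(col, ignore_index=True)
    out[col] = out[col].str.strip()
    return out[out[col] != ""].reset_index(drop=True)

def score_share_by_loan(df, col="Type_of_Loan", score_col="Credit_Score"):
    """Row-normalised credit score mix for each loan type."""
    counts = df.groupby([col, score_col]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Type_of_Loan": ["Auto Loan, Personal Loan, and Payday Loan",
                     "Auto Loan",
                     "Payday Loan, and Personal Loan",
                     None],
    "Credit_Score": ["Good", "Standard", "Poor", "Good"],
})
exploded = explode_loan_types(toy)
loan_share = score_share_by_loan(exploded)
print(loan_share.round(2))
```

Passing `loan_share` to `sns.heatmap` with `fmt='.2f'` would then colour each cell by the share of a credit score within a loan type instead of by the raw count.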

In [None]:
# Select relevant numeric columns
corr_data = df[['Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
                'Num_Credit_Card', 'Interest_Rate', 'Outstanding_Debt',
                'Credit_Utilization_Ratio', 'Credit_History_Age',
                'Total_EMI_per_month', 'Amount_invested_monthly',
                'Monthly_Balance']]

# Compute correlation matrix
corr_matrix = corr_data.corr()

# Plot heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='Blues',
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    ax=ax,
    annot_kws={"size": 10}
)

# Basic styling
ax.set_title('Correlation Heatmap of Financial Features', fontsize=16, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

plt.tight_layout()
plt.show()

Chart Type: Heatmap

Purpose: To explore pairwise linear relationships among key financial variables.

Insight:
- The strongest positive correlation exists between Monthly Inhand Salary and Annual Income, as expected.
- Outstanding Debt shows a moderate correlation with Interest Rate and Total EMI, highlighting debt burden patterns.
- Meanwhile, Credit Utilization has a negative correlation with Monthly Balance, suggesting overspending reduces end-of-month liquidity.
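
Reading the strongest relationships off a 12 x 12 heatmap is error-prone, so it helps to list them. The following sketch ranks feature pairs by absolute Pearson correlation. The helper name `top_correlated_pairs` and the toy values are illustrative assumptions; only the column names come from the analysis above.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """List the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Annual_Income": [12, 24, 36, 48, 60],
    "Monthly_Inhand_Salary": [1, 2, 3, 4, 5],
    "Credit_Utilization_Ratio": [5, 3, 4, 1, 2],
})
top_pairs = top_correlated_pairs(toy)
print(top_pairs)
```

On the real data, `top_correlated_pairs(corr_data)` would confirm or correct the pairs named in the insights above.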

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?

I recommend that Paisabazaar implement a predictive credit scoring model. This model will use customer financial data to estimate credit scores, helping achieve key business goals:

1. Strengthen Credit Risk Management

    By estimating a customer’s credit score up front, the model helps screen out high-risk applications, reduce loan defaults, and manage financial risk more effectively.

2. Offer Smarter Product Recommendations

    Knowing a customer’s predicted credit score allows Paisabazaar to suggest the financial products that best match their profile, improving engagement and conversions.

3. Speed Up Loan Approvals

    With automated credit assessments, decisions can be made faster, reducing manual work and improving turnaround times.

4. Boost Customer Trust and Retention

    By giving users personalized, data-driven advice and product offers, Paisabazaar can enhance user satisfaction and build long-term loyalty.

The model can be built using machine learning and trained on features like income, debt, payment history, and credit card behavior to deliver reliable and scalable predictions.

# **Conclusion**

Our analysis of customer data has revealed important insights that can help Paisabazaar improve how it assesses creditworthiness and manages financial risk.

- **Customer Demographics**: Most customers fall in the 25–45 age range — typically the working population — which is a prime segment for credit products.

- **Earning Capacity**: A large portion of the customer base earns a moderate to high income, indicating potential for loan offerings and investment products.

- **Credit Score Distribution**: While many customers have good or standard credit scores, there’s a significant number with poor scores, highlighting a need for better risk profiling.

- **Financial Behavior**: Most customers show responsible financial habits like balanced debt, regular payments, and steady credit card usage — though some outliers exist.

- **Key Influencers**: Age, annual income, credit utilization ratio, and delayed payments are the factors most clearly associated with credit score in this analysis.

- **Interest Rates**: There is considerable variation in interest rates across customers, suggesting that lending terms may not always align with creditworthiness.


These findings can help Paisabazaar:

- Refine its credit scoring and risk assessment models.

- Offer more personalized and suitable financial products.

- Focus on specific customer segments for marketing and product development.

- Monitor credit trends regularly to adapt to changing market and customer behavior.

By acting on these insights, Paisabazaar can strengthen its market position, reduce default risks, and improve overall customer satisfaction.
