<a href="https://colab.research.google.com/github/joshisakshi374/eda-/blob/main/capstone_project_module_2_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Paisabazaar Banking Fraud Analysis**    -



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Sakshi Joshi


# **Project Summary -**

This report presents an exploratory data analysis (EDA) project conducted for Paisabazaar to understand the factors influencing customer credit scores. The aim is to uncover meaningful insights from customer financial behavior to improve creditworthiness evaluation, optimize product recommendations, and minimize default risks.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


"To analyze and build an exploratory data analysis (EDA) framework for understanding the key factors that influence a customer's credit score using customer data provided by Paisabazaar. The goal is to extract insights that can help predict creditworthiness and improve decision-making for financial services."

#### **Define Your Business Objective?**

To enable Paisabazaar to make data-driven, accurate, and efficient decisions in evaluating customer creditworthiness by uncovering key patterns and insights from customer financial data.

**Key Goals:**

*   Enhance credit risk assessment
*   Improve matching of customers to suitable financial products
*   Support personalized credit-related recommendations
*   Lay the foundation for predictive credit score modeling
*   Inform internal decision-making processes

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
    Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

### Dataset Loading

In [None]:
uploaded = files.upload()

In [None]:
# Load Dataset
df = pd.read_csv('dataset.csv')
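
The guidelines above ask for exception handling, so here is a minimal sketch of a more defensive loader. The helper name `load_dataset` and its error messages are illustrative assumptions; only the file name `dataset.csv` comes from this notebook.

```python
import pandas as pd


def load_dataset(path="dataset.csv"):
    """Read the CSV at `path`, raising a clearer error if it is missing or has no rows.

    `load_dataset` is a hypothetical helper, not part of the original notebook.
    """
    try:
        data = pd.read_csv(path)
    except FileNotFoundError as exc:
        raise FileNotFoundError(
            f"'{path}' not found. Upload it first (e.g. with files.upload() in Colab)."
        ) from exc
    if data.empty:
        raise ValueError(f"'{path}' was read but contains no rows.")
    return data
```

Calling `df = load_dataset('dataset.csv')` would then behave like the plain `pd.read_csv` call, but fail with a more actionable message if the upload step was skipped.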

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(df)
plt.show()
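
Raw null counts are easier to act on when shown next to percentages. The sketch below assumes a small made-up DataFrame in place of the real data, and `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing_count", ascending=False)


# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Age": [25, None, 40, 31],
    "Annual_Income": [50000, 62000, None, None],
    "Credit_Score": ["Good", "Poor", "Standard", "Good"],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(df)` would list the columns that need imputation first.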

### What did you know about your dataset?

The dataset provided by Paisabazaar contains a comprehensive set of customer features, including:

*   Demographics (Age, Occupation)
*   Financial indicators (Annual Income, Inhand Salary, Outstanding Debt)
*   Credit behavior (Credit Utilization Ratio, Credit History Age, Number of Credit Cards, Loans)
*   Payment behavior (Delayed Payments, Payment of Minimum Amount)
*   Monthly EMI and Investment habits

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The dataset includes a diverse set of variables grouped into Identification, Demographic, Financial, Credit Behavior, and Credit History categories, which together provide a comprehensive view of a customer’s financial standing. Below is a structured explanation:

🔹 1. Identification Variables

ID: A unique identifier for each record.

Customer_ID: A unique identifier assigned to each customer.

Name: The full name of the customer.

SSN: The Social Security Number, used for uniquely identifying individuals.

Month: Denotes the specific month of the record.

🔹 2. Demographic Variables

Age: The age of the customer, useful for age-group analysis.

Occupation: The profession or job category of the customer.

🔹 3. Financial Variables

Annual_Income: Total annual income of the individual.

Monthly_Inhand_Salary: Actual salary received by the individual each month after deductions.

Outstanding_Debt: The amount of unpaid loans or balances held by the individual.

Total_EMI_per_month: Total Equated Monthly Installments paid towards loans each month.

Amount_invested_monthly: The monthly contribution made by the individual toward investments.

🔹 4. Banking and Credit Product Variables

Num_Bank_Accounts: Total number of bank accounts held by the customer.

Num_Credit_Card: Number of credit cards the individual possesses.

Interest_Rate: The rate of interest applicable to the customer's credit account(s).

Num_of_Loan: Total number of loans taken.

Type_of_Loan: Categories or types of loans held by the customer (e.g., home, auto, education).

🔹 5. Credit Behavior Variables

Delay_from_due_date: The average number of days payments were delayed beyond their due dates.

Num_of_Delayed_Payment: Count of how many times payments were delayed.

Payment_of_Min_Amount: Indicates whether the customer pays the minimum required amount on credit dues.

Changed_Credit_Card: Percentage change in credit card limit over time, reflecting financial flexibility.

Num_Credit_Inquiries: Number of recent credit inquiries, signaling demand for new credit.

🔹 6. Credit History and Profile Quality Variables

Credit_Mix: A qualitative description (e.g., Good, Standard, Bad) indicating the variety of credit types held.

Credit_Utilization_Ratio: Ratio of the credit amount used to the total credit limit, a key indicator of responsible usage.

Credit_History_Age: Duration (in years and months) for which the customer has had credit accounts.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# --- Data Cleaning: Handle Missing Values and Outliers ---
# Convert relevant columns to numeric
numeric_cols = ['Annual_Income', 'Credit_Utilization_Ratio', 'Num_of_Loan',
                'Num_of_Delayed_Payment', 'Credit_History_Age', 'Age', 'Month']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

In [None]:
# Handle missing values
df = df.dropna(subset=['Credit_Score']).copy()  # Drop rows with missing Credit_Score; copy so later assignments act on a real DataFrame
df['Month'] = df['Month'].fillna(1)  # Default Month to 1 first; the median fill below would otherwise make this a no-op
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())  # Fill remaining numeric gaps with the column median
df['Credit_Score'] = df['Credit_Score'].str.strip()  # Clean Credit_Score values

In [None]:
# Handle outliers using IQR method
def cap_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return series.clip(lower=lower_bound, upper=upper_bound)

for col in numeric_cols:
    df[col] = cap_outliers(df[col])

# Filter out invalid ages (e.g., Age < 18 or Age > 100)
# .copy() avoids SettingWithCopyWarning when new columns (e.g. Age_Group) are added later
df = df[(df['Age'] >= 18) & (df['Age'] <= 100)].copy()

### What all manipulations have you done and insights you found?

**1. Data Type Correction**

**Objective:** Ensure columns are in the correct format for analysis.


**Steps:**

Converted the columns **Annual_Income, Credit_Utilization_Ratio, Num_of_Loan, Num_of_Delayed_Payment, Credit_History_Age, Age, Month** to numeric using **pd.to_numeric(errors='coerce')**, which converts non-numeric values to NaN.

Stripped whitespace from **Credit_Score** using **str.strip()** to ensure consistency in categorical values.
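
As a small illustration of the coercion step, the sketch below runs `pd.to_numeric` on a few made-up strings. The values are assumptions chosen to show how malformed entries become NaN. They are not taken from the dataset.

```python
import pandas as pd

# Toy values (made up) mixing clean numbers with malformed entries
raw = pd.Series(["34", "28_", "abc", "45.5"])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

# Counting the NaNs shows how many entries were lost in the conversion
n_unparsed = int(converted.isna().sum())
print(f"{n_unparsed} of {len(raw)} values could not be parsed")
```

Checking the number of coerced values per column before imputing helps confirm that a column was mostly numeric to begin with.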

**2. Handling Missing Values**

**Objective:** Address missing data to prevent bias in analysis.


**Steps:**

Dropped rows where Credit_Score was missing, as it’s the target variable and critical for analysis.

Imputed missing values in numeric columns with their respective medians (e.g., Annual_Income, Credit_Utilization_Ratio) to minimize the impact of outliers on the imputation process.

Assigned a default value of 1 to missing Month entries, assuming the first month of the 8-month period.

Verified that no critical missing values remained in key columns after cleaning.
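
The imputation and verification steps can be sketched on a toy DataFrame as follows. The values are made up, and the explicit `assert` is one possible way to perform the check, not necessarily the one used for this report.

```python
import pandas as pd

# Toy data (made up); column names follow the notebook
toy = pd.DataFrame({
    "Annual_Income": [40000.0, None, 55000.0, 1_000_000.0, 48000.0],
    "Age": [23.0, 35.0, None, 41.0, 29.0],
    "Credit_Score": ["Good", "Poor", None, "Standard", "Good"],
})

# Drop rows without a target, then fill numeric gaps with the column median
clean = toy.dropna(subset=["Credit_Score"]).copy()
for col in ["Annual_Income", "Age"]:
    clean[col] = clean[col].fillna(clean[col].median())

# Verification step: fail loudly if any key column still has missing values
remaining = clean[["Annual_Income", "Age", "Credit_Score"]].isnull().sum()
assert remaining.sum() == 0, f"Missing values remain:\n{remaining}"
print(remaining)
```

In this toy example the missing income is filled with the median of the remaining incomes, which the single very large value barely affects; a mean-based fill would be pulled far upward by it.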

**3. Handling Outliers**

**Objective:** Mitigate the impact of extreme values on analysis.

**Steps:**

Used the Interquartile Range (IQR) method to cap outliers in numeric columns:

*   Calculated Q1 (25th percentile), Q3 (75th percentile), and IQR = Q3 - Q1.
*   Capped values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR to their respective bounds.
*   Applied to every column in `numeric_cols`: Annual_Income, Credit_Utilization_Ratio, Num_of_Loan, Num_of_Delayed_Payment, Credit_History_Age, Age, and Month.

Filtered out invalid ages by retaining only rows where Age was between 18 and 100, removing unrealistic entries.
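
A worked example makes the capping rule concrete. The sketch below re-declares `cap_outliers` so that it runs on its own, and applies it to a made-up series with one extreme value.

```python
import pandas as pd


def cap_outliers(series):
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)


# Toy series (made up): Q1 = 2, Q3 = 4, IQR = 2, so the bounds are [-1, 7]
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers(values)
print(capped.tolist())  # the extreme value 100.0 is pulled down to 7.0
```

Capping keeps every row in the dataset, unlike dropping outliers, at the cost of piling extreme values up at the bounds.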

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Credit Score Distribution by Age Group


In [None]:
# Chart - 1 visualization code
# Bin ages into groups
age_bins = [18, 25, 35, 45, 55, 100]
age_labels = ['18-25', '26-35', '36-45', '46-55', '56+']
df['Age_Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, include_lowest=True)

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Age_Group', hue='Credit_Score',
              palette={'Good': '#10B981', 'Standard': '#3B82F6', 'Poor': '#EF4444'})
plt.title('Credit Score Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Credit Score')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

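A grouped count plot places the Good, Standard, and Poor counts side by side within each age group, which makes it easy to compare how the credit score mix changes with age.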

##### 2. What is/are the insight(s) found from the chart?

**Finding:**

 Younger individuals (18-25) have a higher proportion of Poor credit scores compared to older groups. Older age groups (46-55, 56+) show a higher proportion of Good scores.

**Interpretation:**

Younger customers likely have limited credit history or higher credit utilization, leading to poorer scores. Older customers benefit from longer credit histories and financial stability, making them more likely to have Good scores.
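
Because the count plot shows raw counts, statements about proportions are easier to check with a normalized cross-tabulation. The sketch below uses made-up rows; on the real data the same call would take `df['Age_Group']` and `df['Credit_Score']`.

```python
import pandas as pd

# Toy data (made up); in the notebook these columns come from df
toy = pd.DataFrame({
    "Age_Group": ["18-25", "18-25", "18-25", "18-25", "46-55", "46-55"],
    "Credit_Score": ["Poor", "Poor", "Standard", "Good", "Good", "Standard"],
})

# normalize='index' divides each row by its total, giving within-group shares
shares = pd.crosstab(toy["Age_Group"], toy["Credit_Score"], normalize="index")
print((shares * 100).round(1))
```

Each row of the result sums to 100%, so age groups of very different sizes can be compared directly.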

#### Chart - 2  Annual Income vs Credit Score

In [None]:
# Chart - 2 visualization code
# Sample 200 points for scatter plot
sample_df = df.sample(200, random_state=42)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=sample_df, x='Annual_Income', y='Credit_Score', hue='Credit_Score',
                palette={'Good': '#10B981', 'Standard': '#3B82F6', 'Poor': '#EF4444'}, alpha=0.6)
plt.title('Annual Income vs Credit Score')
plt.xlabel('Annual Income')
plt.ylabel('Credit Score')
plt.legend(title='Credit Score')
plt.tight_layout()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

**Finding:**

 Higher annual incomes (above 100,000) are strongly associated with Good credit scores, indicating a positive correlation.

**Interpretation:**

 Customers with higher incomes likely have better debt management capabilities, reducing their risk of poor credit scores. This suggests that income is a useful signal of creditworthiness, although a 200-point sample alone cannot establish how reliable it is as a predictor.
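
With a categorical y-axis, a scatter plot makes it hard to judge how much income actually differs between score groups. A per-group summary puts numbers on the comparison. The sketch below uses made-up incomes; on the real data `toy` would be replaced by `df`, and `sns.boxplot` could display the same quartiles visually.

```python
import pandas as pd

# Toy data (made up); column names follow the notebook
toy = pd.DataFrame({
    "Annual_Income": [30000, 45000, 52000, 61000, 98000, 120000],
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
})

# Count, quartiles, and median of income within each credit score category,
# ordered from worst to best score
income_by_score = (
    toy.groupby("Credit_Score")["Annual_Income"]
       .describe()[["count", "25%", "50%", "75%"]]
       .reindex(["Poor", "Standard", "Good"])
)
print(income_by_score)
```

Overlapping quartile ranges between categories would indicate that income separates the groups less cleanly than the scatter plot may suggest.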

#### Chart - 3  Credit Utilization and Number of Loans by Credit Score (Line Chart)

In [None]:
# Chart - 3 visualization code
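# Average utilization and loan count per credit score, ordered from worst to best
# (groupby alone would sort the categories alphabetically: Good, Poor, Standard)
avg_data = (df.groupby('Credit_Score')[['Credit_Utilization_Ratio', 'Num_of_Loan']]
              .mean()
              .reindex(['Poor', 'Standard', 'Good'])
              .reset_index())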

plt.figure(figsize=(10, 6))
plt.plot(avg_data['Credit_Score'], avg_data['Credit_Utilization_Ratio'], label='Avg Credit Utilization (%)',
         color='#10B981', marker='o')
plt.plot(avg_data['Credit_Score'], avg_data['Num_of_Loan'], label='Avg Number of Loans',
         color='#EF4444', marker='o')
plt.title('Credit Utilization and Number of Loans by Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Average Value')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

**Finding:**

 The chart compares the average credit utilization ratio and the average number of loans across the Poor, Standard, and Good credit score categories. In line with the conclusion of this analysis, higher utilization and a heavier loan burden go with lower credit scores.

**Interpretation:**

 Customers who use more of their available credit and carry more loans are more exposed to repayment stress, so both measures are useful inputs for credit risk assessment.

#### Chart - 4 Delayed Payments vs. Credit Score (Scatter Plot)

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=sample_df, x='Num_of_Delayed_Payment', y='Credit_Score', hue='Credit_Score',
                palette={'Good': '#10B981', 'Standard': '#3B82F6', 'Poor': '#EF4444'}, alpha=0.6)
plt.title('Delayed Payments vs Credit Score')
plt.xlabel('Number of Delayed Payments')
plt.ylabel('Credit Score')
plt.legend(title='Credit Score')
plt.tight_layout()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

**Finding:**

 A higher number of delayed payments (e.g., >10) strongly correlates with Poor credit scores.

**Interpretation:**

 Timely payments are critical for maintaining good credit, and frequent delays significantly increase the likelihood of a Poor score, making payment behavior a key risk factor.

#### Chart - 5 Credit Score Trends by Month

In [None]:
# Chart - 5 visualization code
monthly_data = df.groupby(['Month', 'Credit_Score']).size().unstack(fill_value=0)
monthly_data = monthly_data.div(monthly_data.sum(axis=1), axis=0) * 100  # Convert to percentage

plt.figure(figsize=(10, 6))
for score in ['Good', 'Standard', 'Poor']:
    plt.plot(monthly_data.index, monthly_data[score], label=score, marker='o',
             color={'Good': '#10B981', 'Standard': '#3B82F6', 'Poor': '#EF4444'}[score])
plt.title('Credit Score Trends by Month')
plt.xlabel('Month')
plt.ylabel('Percentage (%)')
plt.legend(title='Credit Score')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

**Finding:**

 Credit score distributions remain stable across the 8-month period, with no significant seasonal patterns.

**Interpretation:**

 Credit scores are likely influenced more by individual financial behavior than temporal factors, allowing Paisabazaar to implement consistent risk management strategies throughout the year.
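
One way to quantify "stable" is to measure how far each credit score's monthly share moves between its lowest and highest month. The sketch below uses two made-up months. What counts as a small spread is a judgment call and is not defined in this report.

```python
import pandas as pd

# Toy data (made up): two months with four customers each
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1, 2, 2, 2, 2],
    "Credit_Score": ["Good", "Standard", "Standard", "Poor",
                     "Good", "Standard", "Poor", "Poor"],
})

# Percentage of each credit score within every month (rows sum to 100)
monthly_pct = pd.crosstab(toy["Month"], toy["Credit_Score"], normalize="index") * 100

# Spread = highest minus lowest monthly share, in percentage points
spread = monthly_pct.max() - monthly_pct.min()
print(spread.round(1))
```

A spread close to zero for every category would support the reading that the score mix does not vary by month.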

#### Chart - 6 Interesting Fact

In [None]:
# Chart - 6: Good scores among customers aged 18-25 with high utilization and no delayed payments
young_high_util = df[(df['Age_Group'] == '18-25') & (df['Credit_Utilization_Ratio'] > 35) & (df['Num_of_Delayed_Payment'] == 0)]
good_count = (young_high_util['Credit_Score'] == 'Good').sum()
total_count = len(young_high_util)
share = good_count / total_count if total_count else float('nan')  # guard against an empty subset
print(f"\nInteresting Fact: {good_count} out of {total_count} ({share:.1%}) individuals aged 18-25 with high credit utilization (>35%) and no delayed payments have Good scores. A high share would suggest that timely payments can offset high utilization for younger demographics.")

#### Chart - 7  Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Encode Credit_Score for correlation (Good=2, Standard=1, Poor=0)
df['Credit_Score_Encoded'] = df['Credit_Score'].map({'Good': 2, 'Standard': 1, 'Poor': 0})
correlation_matrix = df[['Credit_Score_Encoded', 'Annual_Income', 'Credit_Utilization_Ratio',
                         'Num_of_Loan', 'Num_of_Delayed_Payment', 'Age']].corr()
print("\nCorrelation Matrix:\n", correlation_matrix)

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

##### 1. What is/are the insight(s) found from the chart?

**Finding:**

 Credit_Utilization_Ratio and Num_of_Delayed_Payment have stronger negative correlations with Credit_Score compared to Num_of_Loan or Age.

**Interpretation:**

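 Among the variables examined, these two factors show the strongest association with poor credit outcomes and should be prioritized in risk assessment models. Correlation alone does not establish predictive power, so this ranking should be confirmed when a model is built.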
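
Because the encoded credit score is an ordinal variable with only three levels, a rank-based (Spearman) correlation is a reasonable cross-check on the Pearson values in the heatmap. The sketch below computes it as the Pearson correlation of ranks on made-up rows. It is an alternative view, not the method used for the chart above.

```python
import pandas as pd

# Toy data (made up); Credit_Score_Encoded follows the Poor=0, Standard=1, Good=2 mapping
toy = pd.DataFrame({
    "Credit_Score_Encoded": [0, 0, 1, 1, 2, 2],
    "Num_of_Delayed_Payment": [20, 18, 12, 9, 3, 1],
    "Annual_Income": [20000, 35000, 30000, 60000, 55000, 150000],
})

# Spearman correlation = Pearson correlation computed on the ranks of each column
# (ties, such as repeated score levels, receive their average rank)
spearman = toy.rank().corr()
print(spearman["Credit_Score_Encoded"].round(2))
```

If the rank-based and Pearson coefficients disagree noticeably on the real data, the relationship is probably monotonic but not linear.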

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**1. Strengthen Credit Risk Segmentation**

**🔹 Use a scoring model that incorporates:**

*   Credit Utilization Ratio (strongly linked to poor credit scores)
*   Number of Delayed Payments
*   Outstanding Debt
*   Annual Income vs. EMI Burden

**🔹 Segment customers into:**

*   Low Risk: Low utilization, high income, timely payments
*   Moderate Risk: Occasional delays, balanced utilization
*   High Risk: Frequent delays, high credit usage, low income

➡ **Impact:** Allows for better loan decisions and customized risk-based pricing.
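
The segmentation above could be prototyped with simple rules before any model is trained. In the sketch below, the helper `assign_risk_segment` and its thresholds are hypothetical: the 35% utilization cut-off and the 10-delay cut-off reuse values mentioned earlier in this analysis for illustration only. Income is left out to keep the example short.

```python
import numpy as np
import pandas as pd


def assign_risk_segment(frame, util_cut=35.0, delay_cut=10):
    """Label each row Low / Moderate / High risk using simple illustrative rules.

    The thresholds are hypothetical defaults, not values derived from the data.
    """
    high = (frame["Num_of_Delayed_Payment"] > delay_cut) & (
        frame["Credit_Utilization_Ratio"] > util_cut
    )
    low = (frame["Num_of_Delayed_Payment"] == 0) & (
        frame["Credit_Utilization_Ratio"] <= util_cut
    )
    # np.select picks the first matching condition; everything else is Moderate
    return pd.Series(
        np.select([high, low], ["High Risk", "Low Risk"], default="Moderate Risk"),
        index=frame.index,
        name="Risk_Segment",
    )


# Toy customers (made up)
toy = pd.DataFrame({
    "Credit_Utilization_Ratio": [22.0, 41.0, 30.0],
    "Num_of_Delayed_Payment": [0, 15, 4],
})
print(assign_risk_segment(toy).tolist())
```

Rule-based segments like these are easy to explain to business users, and they can later be compared against the segments a trained model would produce.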


**2. Incentivize Better Credit Behavior**


**🔹 Promote financial literacy programs or credit improvement plans for users with:**

*   High delays from due date
*   High EMI burden
*   Low payment of minimum amount

**➡ Impact:** Helps improve user credit scores over time and builds customer trust.

**3. Personalize Financial Product Offers**

**🔹 Use clusters or segments from the data to personalize:**

*   Loan offers (e.g., pre-approved for low-risk)
*   Credit card upgrades (e.g., based on utilization and inquiries)
*   Investment recommendations (based on surplus income)

**➡ Impact:** Higher product relevance and conversion rate.

**4. Use Predictive Modeling for Early Warning**

**🔹 Build a credit score classification model using the cleaned dataset with:**

Features like Num_of_Delayed_Payment, Credit_Utilization_Ratio, Outstanding_Debt, etc.

🔹 Use this model to flag customers likely to shift from “Standard” to “Poor” credit categories in advance.

**➡ Impact:** Reduces NPAs (non-performing assets) and improves credit portfolio health.

**5. Monitor Monthly Trends and Anomalies**

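**🔹 Leverage the Month field to keep monitoring for seasonal dips in credit score:**

The 8 months analyzed showed no significant seasonal pattern, but dips linked to festive overspending or year-end loan accumulation could still emerge over a longer period.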

**➡ Impact:** Enables time-sensitive interventions like reminders or repayment restructuring.

**6. Encourage Responsible Credit Card Usage**

🔹 Insights show that multiple cards don’t always mean poor scores, if utilization is low.

🔹 Offer credit line adjustments or coaching on usage optimization.

**➡ Impact:** Reduces default probability while retaining card customers.



# **Conclusion**

This Exploratory Data Analysis (EDA) project provided critical insights into the financial behavior of customers and the key drivers influencing their credit scores. By cleaning the data, handling missing values and outliers, and analyzing various financial indicators, we identified important patterns such as the impact of delayed payments, credit utilization ratio, and loan burden on creditworthiness.

**The analysis revealed that:**

*   **High credit utilization and frequent delayed payments** significantly correlate with lower credit scores.
*   **Annual income, responsible EMI payments, and age** show a positive influence on creditworthiness.
*   **Multiple credit cards**, when used responsibly, do not negatively impact credit scores, challenging common assumptions.

**These insights enable Paisabazaar to:**

*   Develop a **data-driven credit risk assessment model**
*   **Personalize financial product offerings** based on customer profiles
*   **Proactively identify and manage risk-prone customers**
*   Improve **operational efficiency and decision-making**

Overall, this project lays a solid foundation for implementing predictive analytics models and enhancing customer-centric credit strategies. With further model development and continuous data integration, Paisabazaar can strengthen its market position while minimizing financial risk.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***