PaisaBazaar Project

## 1. Import Libraries

In [None]:

# Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")


## 2. Load Dataset (Manual Upload)

In [None]:

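import io
from google.colab import files

# Opens a file picker in Colab; returns a dict of {filename: file bytes}
uploaded = files.upload()

# Read the uploaded file by name instead of relying on a hard-coded path
filename = next(iter(uploaded))  # e.g. "dataset-2.csv"
df = pd.read_csv(io.BytesIO(uploaded[filename]))
print("Shape of dataset:", df.shape)
df.head()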


## 3. Dataset First Look

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

## 4. Dataset Rows & Columns Count

In [None]:

print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])


## 5. Dataset Information

In [None]:

# Missing values
print("Missing values:\n", df.isnull().sum())

# Duplicates
print("Duplicate rows:", df.duplicated().sum())

# Target distribution
print("Credit Score distribution:\n", df['Credit_Score'].value_counts())


## 6. Understanding Your Variables

In [None]:
df.dtypes

In [None]:

# Unique values per column
for col in df.columns:
    print(col, ":", df[col].nunique())


## 7. Variables Description (Data Dictionary)


| Variable Name         | Description |
|-----------------------|-------------|
| ID                   | Unique customer ID |
| Age                  | Customer's age |
| Annual_Income        | Annual income of the customer |
| Monthly_Inhand_Salary| Monthly in-hand salary |
| Num_of_Loan          | Number of active loans |
| Outstanding_Debt     | Total outstanding debt |
| Payment_Behaviour    | Payment pattern |
| Credit_Score         | Target variable (Poor, Standard, Good) |
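
A quick way to confirm that a loaded frame actually contains the columns in this data dictionary is sketched below. The helper name `missing_columns` and the toy frame are illustrative assumptions, not part of the original notebook; the real dataset may also contain columns that the table does not list.

```python
import pandas as pd

# Column names copied from the data dictionary table.
EXPECTED_COLUMNS = [
    "ID", "Age", "Annual_Income", "Monthly_Inhand_Salary",
    "Num_of_Loan", "Outstanding_Debt", "Payment_Behaviour", "Credit_Score",
]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Toy frame with made-up values that deliberately lacks two columns.
toy = pd.DataFrame({
    "ID": [1],
    "Age": [30],
    "Annual_Income": [50000.0],
    "Monthly_Inhand_Salary": [4000.0],
    "Num_of_Loan": [2],
    "Credit_Score": ["Good"],
})
absent = missing_columns(toy)
print(absent)
```

In the notebook, calling `missing_columns(df)` right after loading would be expected to return an empty list if the file matches the dictionary.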


## 8. Data Wrangling

In [None]:

# Drop duplicate rows
df = df.drop_duplicates()

# Check missing values again (only duplicates have been removed so far)
print("Missing values after dropping duplicates:\n", df.isnull().sum())

# Convert Credit_Score to categorical type
df['Credit_Score'] = df['Credit_Score'].astype('category')

# Verify final structure
df.info()
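
A plain `category` dtype keeps the labels in no particular order. If the three classes from the data dictionary are meant to be ranked, one option is an ordered categorical. The sketch below assumes the ranking Poor < Standard < Good; the helper name `to_ordered_score` and the toy labels are hypothetical.

```python
import pandas as pd

# Assumed ranking of the three labels listed in the data dictionary.
SCORE_ORDER = ["Poor", "Standard", "Good"]

def to_ordered_score(scores: pd.Series) -> pd.Series:
    """Convert score labels to an ordered categorical (Poor < Standard < Good)."""
    dtype = pd.CategoricalDtype(categories=SCORE_ORDER, ordered=True)
    return scores.astype(dtype)

toy_scores = pd.Series(["Good", "Poor", "Standard", "Good"])  # made-up labels
ordered = to_ordered_score(toy_scores)

# Ordered categoricals support comparisons against a label.
at_least_standard = (ordered >= "Standard").tolist()
print(ordered.cat.categories.tolist())
print(at_least_standard)
```

With an ordered dtype, seaborn and pandas would also tend to lay out the classes in this order on plot axes rather than in order of appearance.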


## 9. Univariate Analysis

In [None]:
# Chart 1: Distribution of Credit Score (Target Variable)
plt.figure(figsize=(6,4))
sns.countplot(x='Credit_Score', data=df, palette="Set2")
plt.title("Distribution of Credit Score")
plt.show()

Why this chart?
Credit_Score is the target variable and is categorical, so a countplot shows the class balance directly.

Insights:
Most customers fall into the Standard and Good categories, with fewer in Poor.

Business Impact:
Positive: most customers appear creditworthy. Negative: customers in the Poor category may carry higher default risk.
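
The class balance can also be read off as percentages rather than bar heights. This is a minimal sketch using `value_counts(normalize=True)`; the helper name `class_shares` and the toy labels are made up for illustration and do not reflect the real dataset.

```python
import pandas as pd

def class_shares(scores: pd.Series) -> pd.Series:
    """Percentage of rows in each credit score class, rounded to 1 decimal."""
    return (scores.value_counts(normalize=True) * 100).round(1)

# Made-up labels: 5 Standard, 3 Good, 2 Poor.
toy_scores = pd.Series(["Standard"] * 5 + ["Good"] * 3 + ["Poor"] * 2)
shares = class_shares(toy_scores)
print(shares)
```

In the notebook this would be called as `class_shares(df['Credit_Score'])`.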

In [None]:
# Chart 2: Distribution of Age
plt.figure(figsize=(6,4))
sns.histplot(df['Age'], kde=True, bins=30, color="skyblue")
plt.title("Age Distribution")
plt.show()

Why this chart?
Age is numeric, so a histogram with a KDE curve shows how it is spread.

Insights:
The majority of customers are 25–40 years old.

Business Impact:
Good for targeting young working professionals. Reach among senior citizens is limited.
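
To put a number on how many customers fall inside an age band, the share can be computed directly. The helper `share_in_range` and the toy ages below are illustrative assumptions; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

def share_in_range(values: pd.Series, low: float, high: float) -> float:
    """Fraction of non-missing values with low <= value <= high."""
    valid = values.dropna()
    return float(valid.between(low, high).mean())

# Made-up ages, including one missing value that is ignored.
toy_age = pd.Series([22, 27, 31, 38, 40, 55, None])
age_share = round(share_in_range(toy_age, 25, 40), 2)
print(age_share)
```

In the notebook the equivalent call would be `share_in_range(df['Age'], 25, 40)`.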

In [None]:
# Chart 3: Distribution of Annual Income
plt.figure(figsize=(6,4))
sns.histplot(df['Annual_Income'], kde=True, bins=30, color="orange")
plt.title("Annual Income Distribution")
plt.show()

Why this chart?
Income is continuous, so a histogram reveals its skewness.

Insights:
Income is right-skewed: most customers earn at the lower end and few have very high incomes.

Business Impact:
Positive: a large low/mid-income customer base suggests higher loan demand. Risk: high earners, who tend to be more stable, are fewer.
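
Skewness can be quantified with `Series.skew()`, and a log transform is a common way to reduce right skew before modeling. The sketch below is not part of the original analysis; the helper name and the toy incomes are assumptions, and `np.log1p` is only valid here because the values are non-negative.

```python
import numpy as np
import pandas as pd

def skew_before_after_log(values: pd.Series) -> tuple:
    """Return (skew of the raw values, skew after a log1p transform)."""
    clean = values.dropna()
    return float(clean.skew()), float(np.log1p(clean).skew())

# Made-up incomes that double at each step, so the raw values are right-skewed.
toy_income = pd.Series([10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000],
                       dtype=float)
raw_skew, log_skew = skew_before_after_log(toy_income)
print(raw_skew, log_skew)
```

A positive raw skew with a log skew near zero would suggest that the log scale is a reasonable choice for plotting or modeling `Annual_Income`.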

In [None]:
# Chart 4: Number of Loans
plt.figure(figsize=(6,4))
sns.countplot(x='Num_of_Loan', data=df, palette="coolwarm")
plt.title("Number of Active Loans per Customer")
plt.show()

Why this chart?
Loan counts are small discrete integers, so a countplot works well.

Insights:
Most customers have 1–3 active loans; very few have more than 6.

Business Impact:
Positive: a small number of loans is easier to manage and safer. High loan counts are a red flag.

In [None]:
# Chart 5: Outstanding Debt
plt.figure(figsize=(6,4))
sns.histplot(df['Outstanding_Debt'], kde=True, bins=30, color="green")
plt.title("Outstanding Debt Distribution")
plt.show()

Why this chart?
Debt is numeric, so a histogram shows how much debt customers carry.

Insights:
Debt is right-skewed: many customers carry moderate debt and fewer carry extremely high debt.

Business Impact:
Important for risk modeling. Higher debt may signal a higher probability of default.

## 10. Bivariate Analysis

In [None]:
# Chart 6: Age vs Credit Score
plt.figure(figsize=(6,4))
sns.boxplot(x='Credit_Score', y='Age', data=df, palette="Set2")
plt.title("Age vs Credit Score")
plt.show()

Why this chart?
Boxplots are best for comparing distributions of numeric values across categories.

Insights:

Poor credit customers tend to be younger.

Good credit scores are more common among slightly older customers.

Business Impact:
Younger customers may lack credit history → higher risk. Targeting stable middle-aged customers could reduce defaults.
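
What a boxplot shows visually can be cross-checked with a grouped summary. The sketch below computes per-class medians; the helper `median_by_score` and the toy rows are made up for illustration. `observed=True` is passed so that the call behaves the same whether `Credit_Score` is a string or a categorical column.

```python
import pandas as pd

def median_by_score(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Median of each numeric column within each Credit_Score class."""
    return frame.groupby("Credit_Score", observed=True)[columns].median()

# Made-up rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Poor", "Good", "Good", "Standard"],
    "Age": [22, 26, 38, 42, 33],
    "Outstanding_Debt": [3000.0, 2600.0, 500.0, 700.0, 1200.0],
})
summary = median_by_score(toy, ["Age", "Outstanding_Debt"])
print(summary)
```

The same helper would apply to the other bivariate comparisons in this section, for example `median_by_score(df, ['Annual_Income', 'Num_of_Loan'])`.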


In [None]:
# Chart 7: Annual Income vs Credit Score
plt.figure(figsize=(6,4))
sns.boxplot(x='Credit_Score', y='Annual_Income', data=df, palette="muted")
plt.title("Annual Income vs Credit Score")
plt.show()

Why this chart?
Compares income distribution across credit score groups.

Insights:

Good credit scores align with higher annual incomes.

Poor credit scores appear more often among lower incomes.

Business Impact:
Supports offering premium products to high earners while managing loan limits for low earners.

In [None]:
# Chart 8: Number of Loans vs Credit Score
plt.figure(figsize=(6,4))
sns.boxplot(x='Credit_Score', y='Num_of_Loan', data=df, palette="coolwarm")
plt.title("Number of Loans vs Credit Score")
plt.show()

Why this chart?
Shows how loan counts differ by credit score category.

Insights:

Customers with Poor credit scores tend to have higher loan counts.

Good credit customers usually manage fewer loans.

Business Impact:
Loan count is a red flag variable. Fewer loans → healthier financial profile.

In [None]:
# Chart 9: Outstanding Debt vs Credit Score
plt.figure(figsize=(6,4))
sns.boxplot(x='Credit_Score', y='Outstanding_Debt', data=df, palette="Blues")
plt.title("Outstanding Debt vs Credit Score")
plt.show()

Why this chart?
Debt levels are key to assessing repayment capacity across categories.

Insights:

Poor credit scores are strongly associated with higher debt levels.

Good credit scores align with lower outstanding debt.

Business Impact:
Outstanding debt is a critical predictor of credit risk. Helps in designing stricter loan approval policies.

## 11. Multivariate Analysis

In [None]:
# Chart 10: Correlation heatmap
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=[np.number])

plt.figure(figsize=(10,6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Numerical Variables")
plt.show()

Why this chart?
Heatmaps summarize correlations between all numeric variables in one view.

Insights:

Annual Income and Monthly Salary are strongly correlated.

Outstanding Debt and Number of Loans are moderately correlated.

Business Impact:
Identifying correlated features helps avoid redundant model inputs and pick the strongest predictors.
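
Reading strongly correlated pairs off a heatmap gets harder as the number of columns grows, so they can be listed programmatically instead. In the sketch below, the helper `correlated_pairs`, the 0.8 threshold, and the toy values are all illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    pairs = []
    for col_a, col_b in combinations(corr.columns, 2):
        r = corr.loc[col_a, col_b]
        if abs(r) >= threshold:
            pairs.append((col_a, col_b, round(float(r), 2)))
    return pairs

# Made-up values: monthly salary is exactly annual income divided by 12.
toy = pd.DataFrame({
    "Age": [25, 40, 30, 35],
    "Annual_Income": [12000.0, 24000.0, 36000.0, 48000.0],
    "Monthly_Inhand_Salary": [1000.0, 2000.0, 3000.0, 4000.0],
})
strong = correlated_pairs(toy)
print(strong)
```

Pairs returned by `correlated_pairs(df)` are candidates for dropping one of the two columns before modeling.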

In [None]:
# Chart 11: Pairplot of key numeric variables by Credit Score
sns.pairplot(df[['Age','Annual_Income','Outstanding_Debt','Num_of_Loan','Credit_Score']],
             hue="Credit_Score", diag_kind="kde", palette="Set2")
plt.show()

Why this chart?
Pairplots show pairwise scatterplots of the variables, colored by the target variable.

Insights:

Clear separation of Good vs Poor scores in Income vs Debt.

Overlaps indicate areas where models may struggle.

Business Impact:
Helps visualize multidimensional patterns for credit risk.

In [None]:
# Chart 12: Payment Behaviour vs Credit Score (stacked counts)
pd.crosstab(df['Payment_Behaviour'], df['Credit_Score']).plot(
    kind="bar", stacked=True, figsize=(8,5), colormap="Set2")

plt.title("Payment Behaviour vs Credit Score")
plt.ylabel("Count")
plt.show()

Why this chart?
Shows how Credit Score classes are distributed within each Payment Behaviour category.

Insights:

Poor repayment behaviour → more Poor credit scores.

Responsible repayment → more Good scores.

Business Impact:
Suggests Payment Behaviour is one of the most powerful predictors of creditworthiness.
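
Stacked counts are hard to compare when the behaviour categories differ in size. Normalizing each row of the crosstab gives the share of each credit score class within a category. The sketch below uses placeholder behaviour labels ("Late", "OnTime") that are hypothetical and are not the actual values of `Payment_Behaviour`.

```python
import pandas as pd

def score_share_by_behaviour(frame: pd.DataFrame) -> pd.DataFrame:
    """Row-normalized crosstab: each Payment_Behaviour row sums to 1."""
    return pd.crosstab(
        frame["Payment_Behaviour"], frame["Credit_Score"], normalize="index"
    )

# Made-up rows with placeholder behaviour labels.
toy = pd.DataFrame({
    "Payment_Behaviour": ["Late", "Late", "Late",
                          "OnTime", "OnTime", "OnTime", "OnTime"],
    "Credit_Score": ["Poor", "Poor", "Good", "Good", "Good", "Good", "Poor"],
})
row_shares = score_share_by_behaviour(toy)
print(row_shares.round(2))
```

Plotting `score_share_by_behaviour(df)` with `kind="bar", stacked=True` would give bars of equal height, which makes the class mix directly comparable across categories.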

## 12. Conclusion & Business Insights

### 🔹 Univariate Analysis
- Majority of customers fall under **Standard** and **Good** credit scores; fewer in **Poor**.
- Most customers are **25–40 years old** → young working professionals dominate the dataset.
- Income distribution is **right-skewed** → large portion earns low/mid-range income, fewer very high earners.
- Most customers have **1–3 loans**, very few exceed 6.
- Outstanding debt is skewed → many moderate debts, few very high debts.

### 🔹 Bivariate Analysis
- **Age vs Credit Score**: Poor scores more common in younger customers; Good scores more common in slightly older customers.
- **Income vs Credit Score**: Higher incomes strongly align with Good credit scores.
- **Number of Loans vs Credit Score**: More loans usually → Poor scores. Fewer loans → healthier profiles.
- **Outstanding Debt vs Credit Score**: High debt strongly associated with Poor scores.

### 🔹 Multivariate Analysis
- Strong correlation between **Annual Income and Monthly Salary**.
- Number of Loans and Outstanding Debt moderately correlated.
- **Pairplot** shows clear separation of Good vs Poor credit scores when considering Income and Debt together.
- **Payment Behaviour** is a strong predictor → bad behaviour aligns heavily with Poor credit scores.

---

## 💡 Final Business Insights
1. **Young, low-income customers with multiple loans and high debt are the riskiest group**.  
   - Action: Apply stricter approval criteria or offer smaller loan amounts to manage risk.  

2. **High-income, middle-aged customers with fewer loans and good payment behaviour are the safest group**.  
   - Action: Prioritize these for premium credit offers and higher loan limits.  

3. **Payment Behaviour is one of the most powerful predictors of creditworthiness**.  
   - Action: Focus on monitoring and rewarding good payment patterns to encourage responsible repayment.  
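
The riskiest segment described in the first insight could be flagged with a simple boolean mask. This is only a sketch: the helper `flag_high_risk` and every threshold in it are hypothetical placeholders, not values derived from the dataset, and they would need to be chosen from the actual distributions.

```python
import pandas as pd

def flag_high_risk(frame: pd.DataFrame, age_max: int = 30,
                   income_max: float = 30000, min_loans: int = 4,
                   debt_min: float = 2000) -> pd.Series:
    """Boolean mask for young, low-income rows with many loans and high debt.

    All default thresholds are illustrative placeholders.
    """
    return (
        (frame["Age"] <= age_max)
        & (frame["Annual_Income"] <= income_max)
        & (frame["Num_of_Loan"] >= min_loans)
        & (frame["Outstanding_Debt"] >= debt_min)
    )

# Made-up customers: only the first meets all four conditions.
toy = pd.DataFrame({
    "Age": [24, 45, 28],
    "Annual_Income": [18000, 90000, 25000],
    "Num_of_Loan": [5, 1, 2],
    "Outstanding_Debt": [3500, 400, 2500],
})
risk_mask = flag_high_risk(toy)
print(risk_mask.tolist())
```

Comparing the `Credit_Score` mix of `df[flag_high_risk(df)]` against the rest of the data would be one way to check whether the segment really skews toward Poor scores.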