# Exploratory Data Analysis (EDA) - Credit Risk Model

## 1. Introduction
This notebook performs the initial exploratory data analysis for the Bati Bank Credit Risk Model. 
**Goal**: Understand the dataset structure, detect data quality issues (missing values, outliers), and identify candidate features for defining a proxy default variable.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

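# Set plot style (set_theme is the current name; sns.set is kept as an alias)
sns.set_theme(style="whitegrid")
# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)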

## 2. Load Data
Loading `data.csv` and `Xente_Variable_Definitions.csv` from the `data/raw/` directory.

In [None]:
# Define file paths (Adjust if data is in a different location)
DATA_PATH = "../data/raw/data.csv"
DEFS_PATH = "../data/raw/Xente_Variable_Definitions.csv"

try:
    df = pd.read_csv(DATA_PATH)
    defs = pd.read_csv(DEFS_PATH)
    print("Data loaded successfully.")
except FileNotFoundError as exc:
    # Report the file that actually failed to load rather than both paths
    print(f"ERROR: {exc}. Please ensure the data files are placed in data/raw/.")

## 3. Dataset Overview
Checking the shape, data types, and first few rows.

In [None]:
if 'df' in locals():
    print(f"Dataset Shape: {df.shape}")
    print("\nColumn Data Types:")
    print(df.dtypes)
    display(df.head())

## 4. Summary Statistics
Summarize the numerical and categorical columns.

In [None]:
if 'df' in locals():
    print("Numerical Summary:")
    display(df.describe())
    
    print("\nCategorical Summary:")
    display(df.describe(include=['object']))

## 5. Missing Value Analysis
Identify columns with missing data to determine imputation strategies.

In [None]:
if 'df' in locals():
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    missing_percentage = (missing / len(df)) * 100
    
    missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_percentage})
    missing_df = missing_df.sort_values(by='Percentage', ascending=False)
    
    if not missing_df.empty:
        plt.figure(figsize=(10, 6))
        sns.barplot(x=missing_df.index, y=missing_df['Percentage'])
        plt.title("Percentage of Missing Values by Column")
        plt.ylabel("Missing values (%)")
        plt.xticks(rotation=45)
        plt.show()
        display(missing_df)
    else:
        print("No missing values found.")

## 6. Numerical Feature Distribution
Visualizing the distribution of key numerical columns: `Amount`, `Value`.

In [None]:
numerical_cols = ['Amount', 'Value'] # Add others if applicable based on dtypes

if 'df' in locals():
    for col in numerical_cols:
        if col in df.columns:
            plt.figure(figsize=(12, 5))
            
            plt.subplot(1, 2, 1)
            sns.histplot(df[col], kde=True, bins=50)
            plt.title(f'Distribution of {col}')
            
            plt.subplot(1, 2, 2)
            sns.boxplot(x=df[col])
            plt.title(f'Boxplot of {col}')
            
            plt.show()
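
The boxplots above show outliers visually; a count per column makes them easier to compare. The following is a minimal sketch using the common 1.5 × IQR rule, which is an assumption here and not a method prescribed by this notebook. The helper name `iqr_outlier_summary` and the toy values are hypothetical and are not taken from `data.csv`.

```python
import pandas as pd


def iqr_outlier_summary(frame: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    rows = []
    for col in columns:
        series = frame[col].dropna()
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((series < lower) | (series > upper)).sum())
        pct = 100 * n_out / len(series) if len(series) else 0.0
        rows.append({"column": col, "lower": lower, "upper": upper,
                     "outliers": n_out, "pct": pct})
    return pd.DataFrame(rows).set_index("column")


# Toy data for illustration only (made-up values, not from data.csv).
toy = pd.DataFrame({
    "Amount": [10, 12, 11, 13, 12, 500],
    "Value": [10, 12, 11, 13, 12, 14],
})
outlier_summary = iqr_outlier_summary(toy, ["Amount", "Value"])
print(outlier_summary)
```

On the real data this could be called as `iqr_outlier_summary(df, numerical_cols)`. For heavily skewed transaction amounts the rule may flag a large share of rows, so the counts are best read as a prompt for closer inspection, not as a removal list.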

## 7. Categorical Feature Distribution
Top categories for features like `ProductCategory`, `ChannelId`, `ProviderId`.

In [None]:
categorical_cols = ['ProductCategory', 'ChannelId', 'ProviderId', 'PricingStrategy']

if 'df' in locals():
    for col in categorical_cols:
        if col in df.columns:
            plt.figure(figsize=(10, 5))
            counts = df[col].value_counts().nlargest(10)
            sns.barplot(x=counts.index, y=counts.values)
            plt.title(f'Top 10 Categories in {col}')
            plt.xticks(rotation=45)
            plt.show()
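
Raw counts hide how concentrated a feature is. As a rough sketch, the helper below converts counts to percentage shares and groups everything outside the top N into a single "Other" bucket. The function name `top_category_shares` and the toy labels are hypothetical, chosen only for illustration.

```python
import pandas as pd


def top_category_shares(series: pd.Series, top_n: int = 10) -> pd.Series:
    """Percentage share of the top_n categories; the remainder is grouped as 'Other'."""
    # value_counts sorts by frequency, most common first
    shares = series.value_counts(normalize=True, dropna=False) * 100
    top = shares.head(top_n)
    remainder = shares.iloc[top_n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({"Other": remainder})])
    return top.round(2)


# Toy data for illustration only (made-up labels, not from data.csv).
toy_categories = pd.Series(["airtime"] * 6 + ["financial_services"] * 3 + ["tv"])
category_shares = top_category_shares(toy_categories, top_n=2)
print(category_shares)
```

On the real data this could be applied per column, for example `top_category_shares(df['ProductCategory'])`. A dominant category or a long tail of rare ones may influence how the feature is encoded later.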

## 8. Correlation Analysis
Checking relationships between numerical variables.

In [None]:
if 'df' in locals():
    corr_matrix = df.select_dtypes(include=[np.number]).corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title("Correlation Matrix")
    plt.show()
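
A heatmap becomes hard to read as the number of columns grows. One possible complement, sketched below, lists only the column pairs whose absolute correlation exceeds a threshold. The helper name `strongest_correlations`, the 0.8 threshold, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def strongest_correlations(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs with |Pearson correlation| >= threshold, strongest first."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only (made-up values, not from data.csv).
toy = pd.DataFrame({
    "Amount": [1, 2, 3, 4, 5],
    "Value": [2, 4, 6, 8, 10],
    "Noise": [5, 1, 4, 2, 3],
})
strong_pairs = strongest_correlations(toy)
print(strong_pairs)
```

On the real data, `strongest_correlations(df)` would flag near-duplicate numerical columns, which are candidates for dropping one of the pair before modeling.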

## 9. Top 3-5 Key Insights
*Note: This section will be populated after data analysis is executed.*

### Conclusion Placeholder
1. **Insight 1**: [Pending Data]
2. **Insight 2**: [Pending Data]
3. **Insight 3**: [Pending Data]