# Exploratory Data Analysis (EDA) for Credit Risk Model

This notebook explores the Xente transaction data to understand its patterns, detect anomalies, and identify candidate features for credit risk modeling.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set plot style
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load Data
df = pd.read_csv('../data/raw/data.csv')
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])
print(f"Dataset Shape: {df.shape}")
df.head()

## 1. Summary Statistics

In [None]:
# Numerical Summary
print("Numerical Summary:")
display(df[['Amount', 'Value', 'PricingStrategy']].describe())

In [None]:
# Categorical Summary
print("Categorical Summary:")
categorical_cols = ['ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'FraudResult']
for col in categorical_cols:
    print(f"\n{col} Value Counts:")
    print(df[col].value_counts().head(10))

## 2. Missing Value Analysis

In [None]:
missing = df.isnull().sum()
percentage = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': percentage})
missing_df[missing_df['Missing Count'] > 0]

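*Observation: The handling strategy depends on how much data is missing. If only a small share of rows is affected, dropping those rows is acceptable. Otherwise, impute: for numerical columns, the median (or the mean, when the column is roughly symmetric) is a standard choice.*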
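
The two options in the observation above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual preprocessing: the `toy` DataFrame and its values are made up, and the most-frequent-value fallback for the categorical column is an added assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transaction data (all values are made up).
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0, 50.0, 10_000.0],
    "ProductCategory": ["airtime", "utility_bill", None, "airtime", "airtime"],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Option 1: drop incomplete rows (reasonable only when few rows are affected).
dropped = toy.dropna()

# Option 2: impute. The median is less sensitive to extreme amounts than the mean.
imputed = toy.copy()
imputed["Amount"] = imputed["Amount"].fillna(imputed["Amount"].median())

# Assumption: for categorical columns, fall back to the most frequent value.
imputed["ProductCategory"] = imputed["ProductCategory"].fillna(
    imputed["ProductCategory"].mode().iloc[0]
)

print(missing_pct)
print(f"Rows kept after dropna: {len(dropped)} of {len(toy)}")
print(imputed)
```

With skewed amounts like these, the median fill stays close to typical transactions, whereas a mean fill would be pulled toward the single large value.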

## 3. Distribution Analysis

In [None]:
# Distribution of Transaction Values (log-scaled x-axis)
plt.figure(figsize=(12, 6))
sns.histplot(df['Value'], bins=50, kde=True, log_scale=True)
plt.title('Log-Distribution of Transaction Values')
plt.xlabel('Value (Log Scale)')
plt.show()

In [None]:
# Fraud Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='FraudResult', data=df)
plt.title('Fraud Class Imbalance')
# Log-scale the y-axis so the rare fraud class remains visible
plt.yscale('log')
plt.show()
print(df['FraudResult'].value_counts(normalize=True))

## 4. Correlation Analysis

In [None]:
# Correlation Matrix of Numerical Features
numerical_cols = ['Amount', 'Value', 'PricingStrategy', 'FraudResult']
corr = df[numerical_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
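
Pearson correlation, the default of `DataFrame.corr`, is sensitive to extreme values, and transaction values are heavy-tailed. As a rough cross-check, a rank-based (Spearman) correlation can be computed as Pearson on ranks. The sketch below uses a made-up toy frame rather than the real data, so the numbers only illustrate the effect.

```python
import pandas as pd

# Toy data (made up): one very large fraudulent transaction among small ones.
toy = pd.DataFrame({
    "Value": [10, 20, 30, 40, 50_000],
    "FraudResult": [0, 0, 0, 0, 1],
})

# Pearson works on raw values, so the single extreme Value dominates it.
pearson = toy.corr(method="pearson").loc["Value", "FraudResult"]

# Spearman is Pearson computed on ranks (tied values get their average rank).
spearman = toy.rank().corr(method="pearson").loc["Value", "FraudResult"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients suggests that a few extreme transactions, not a general trend, drive the Pearson value.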

## 5. Outlier Detection
A boxplot of `Amount` shows outliers as individual points beyond the whiskers.

In [None]:
plt.figure(figsize=(12, 4))
sns.boxplot(x=df['Amount'])
plt.title('Boxplot of Transaction Amounts')
plt.show()
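
To count outliers rather than only view them, the 1.5 × IQR rule that boxplot whiskers use by default can be applied directly. This is a minimal sketch: the helper `iqr_outlier_mask` and the sample amounts are hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up amounts: mostly small transactions, one large credit, one large debit.
amounts = pd.Series([-200, 50, 100, 120, 150, 180, 200, 50_000], name="Amount")

mask = iqr_outlier_mask(amounts)
print(f"Flagged {mask.sum()} of {len(amounts)} values")
print(amounts[mask].tolist())
```

On the real data the same call would be `iqr_outlier_mask(df['Amount'])`. In heavily skewed data this rule tends to flag many points, so flagged rows are candidates for review, not automatic removals.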

## 6. Key Insights

1. **Class Imbalance**: The dataset is highly imbalanced, with very few fraud cases (<0.5%). Models need to account for this (e.g., with SMOTE or class weights).
2. **Skewed Amounts**: Transaction values are right-skewed; a log transformation or scaling is advisable for models that are sensitive to feature scale, such as linear or logistic regression.
3. **High Cardinality**: Features like `ProviderId` and `ProductId` may carry predictive signal but have many distinct categories. Target encoding or frequency encoding might be useful.
4. **Correlation**: `Value` and `FraudResult` show some positive correlation, suggesting that higher-value transactions might be riskier. Because fraud cases are so rare, this is a lead to verify, not a firm conclusion.
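
The first three insights can be sketched as simple preprocessing steps. This is an illustration under assumptions, not the final feature pipeline: the toy rows are made up, `LogValue` and `ProviderId_freq` are hypothetical column names, and the class weights follow the common "balanced" heuristic `n_samples / (n_classes * class_count)`.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the transaction data (all values are made up).
toy = pd.DataFrame({
    "ProviderId": ["ProviderId_1", "ProviderId_1", "ProviderId_2",
                   "ProviderId_1", "ProviderId_3"],
    "Value": [500, 1_000, 20_000, 50, 2_000_000],
    "FraudResult": [0, 0, 0, 0, 1],
})

# Skewed amounts: log1p compresses the right tail and is defined at zero.
toy["LogValue"] = np.log1p(toy["Value"])

# High cardinality: frequency encoding replaces each category
# with the share of rows in which it occurs.
freq = toy["ProviderId"].value_counts(normalize=True)
toy["ProviderId_freq"] = toy["ProviderId"].map(freq)

# Class imbalance: "balanced" weights give the rare class a larger weight.
counts = toy["FraudResult"].value_counts()
class_weights = (len(toy) / (len(counts) * counts)).to_dict()

print(toy[["Value", "LogValue", "ProviderId", "ProviderId_freq"]])
print(class_weights)
```

In a real pipeline the frequency table and class weights would be computed on the training split only and then applied to validation and test data, to avoid leaking information.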