# Week 4 Credit Risk EDA

This notebook explores the Xente transactions dataset to surface early hypotheses for proxy target design and feature engineering.

In [None]:
# Core analysis stack
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
DATA_PATH = '../data/raw/data.csv'
df = pd.read_csv(DATA_PATH, parse_dates=['TransactionStartTime'])
df.head()

In [None]:
shape = df.shape
dtypes = df.dtypes
shape, dtypes

In [None]:
# Summary statistics for key numerical fields
numeric_summary = df[['Amount', 'Value']].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
numeric_summary

In [None]:
# Missing values overview
df.isna().sum().loc[lambda s: s > 0].sort_values(ascending=False)

In [None]:
# Distribution of monetary values
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(df['Amount'], bins=100, ax=ax[0], kde=False)
ax[0].set_title('Amount distribution (raw)')
sns.histplot(df['Value'], bins=100, ax=ax[1], kde=False)
ax[1].set_title('Value distribution (absolute)')
plt.tight_layout()
plt.show()
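
When raw histograms of monetary fields are dominated by a long right tail, a capped or log-scaled copy of the column is often easier to read. The following is a minimal sketch rather than a step from this notebook's pipeline: the helper name `cap_and_log`, the 0.99 cap, and the toy numbers are all assumptions for illustration, and it presumes a non-negative column such as `Value`.

```python
import numpy as np
import pandas as pd


def cap_and_log(values: pd.Series, upper_q: float = 0.99) -> pd.DataFrame:
    """Return raw, right-tail-capped, and log1p copies of a non-negative series.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    cap = values.quantile(upper_q)       # cap threshold taken from the data itself
    capped = values.clip(upper=cap)      # only the right tail is clipped
    return pd.DataFrame({
        "raw": values,
        "capped": capped,
        "log1p": np.log1p(values),       # log(1 + x), so zero stays defined
    })


# Toy stand-in for df['Value']; these numbers are illustrative, not from the dataset.
toy_value = pd.Series(
    [50, 100, 500, 1000, 1000, 2000, 5000, 10000, 90000, 2400000], dtype=float
)
transformed = cap_and_log(toy_value)
print(transformed.describe().loc[["min", "50%", "max"]])
```

On the real frame, `cap_and_log(df['Value'])['log1p']` could be passed to `sns.histplot` in place of the raw column. A signed field such as `Amount` would need separate handling, since `log1p` is undefined below -1.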

In [None]:
# Categorical feature snapshots
product_counts = df['ProductCategory'].value_counts().head(10)
channel_share = (df['ChannelId'].value_counts(normalize=True) * 100).round(2)
product_counts, channel_share

In [None]:
# Correlation between monetary fields
sns.heatmap(df[['Amount', 'Value']].corr(), annot=True, cmap='Blues')
plt.title('Correlation matrix')
plt.show()

In [None]:
# Box plot to highlight outliers; the x-axis is clipped for readability, the data is not modified
sns.boxplot(x=df['Value'])
plt.xlim(-5000, df['Value'].quantile(0.99))
plt.title('Value boxplot (x-axis clipped at 99th percentile)')
plt.show()
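
The class-imbalance figure quoted in the insights below is not produced by any of the cells above, so a quick check of the target distribution may be worth adding. This is a hedged sketch: it assumes `FraudResult` is a 0/1 column, the helper name `class_balance` is hypothetical, and the toy frame is made up for illustration.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "FraudResult") -> pd.DataFrame:
    """Count and percentage share of each class in a target column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = frame[target].value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "share_pct": share})


# Toy frame with 2 positives in 1,000 rows; swap in the real df to check the actual rate.
toy = pd.DataFrame({"FraudResult": [0] * 998 + [1] * 2})
balance = class_balance(toy)
print(balance)
```

Running `class_balance(df)` on the loaded data would give the counts and percentage to compare against the figures reported in the summary.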

## Key Interim Insights
- The dataset spans **95,662 transactions** with only **193 positive FraudResult flags (0.2%)**, signalling heavy class imbalance that will call for resampling or imbalance-aware evaluation metrics rather than plain accuracy.
- Monetary fields are extremely skewed: the **99th percentile of Value sits at 90k**, so a small number of very large transactions stretches the scale well beyond the bulk of observations. We will likely cap or log-transform these fields for stability.
- Product mix is concentrated: **financial_services and airtime make up ~95% of transactions**, so segment-specific behaviours must be derived from more granular features (ProviderId, ChannelId).
- Channel usage is dominated by **ChannelId_3 (59.5%) and ChannelId_2 (38.8%)**, hinting at device/channel preferences we can encode for proxy risk.
- Transaction timestamps run from mid-November 2018 to mid-February 2019 with noticeable weekly spikes, which will help define recency windows for the RFM proxy.
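
The RFM proxy mentioned in the last point could start from a per-customer aggregation like the one sketched here. Treat it as an illustration under stated assumptions: the `CustomerId` column name, the helper `rfm_table`, the choice of snapshot date (one day after the latest transaction), and the toy rows are not taken from this notebook.

```python
import pandas as pd


def rfm_table(frame: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Recency (days), frequency (count), and monetary (sum of Value) per customer.

    Hypothetical helper; assumes columns CustomerId, TransactionStartTime, Value.
    """
    rfm = frame.groupby("CustomerId").agg(
        last_seen=("TransactionStartTime", "max"),
        frequency=("TransactionStartTime", "count"),
        monetary=("Value", "sum"),
    )
    # Recency is measured back from the snapshot date to the latest transaction.
    rfm["recency_days"] = (snapshot - rfm["last_seen"]).dt.days
    return rfm[["recency_days", "frequency", "monetary"]]


# Toy transactions; IDs, dates, and values are illustrative only.
toy = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "TransactionStartTime": pd.to_datetime([
        "2018-11-15", "2019-02-10", "2018-12-01",
        "2019-01-05", "2019-01-12", "2019-02-13",
    ]),
    "Value": [1000, 500, 2000, 100, 200, 300],
})
snapshot = toy["TransactionStartTime"].max() + pd.Timedelta(days=1)
rfm = rfm_table(toy, snapshot)
print(rfm)
```

If the real timestamps are parsed as timezone-aware, the snapshot derived from the column's maximum inherits the same timezone, so the subtraction still works. The recency window itself (for example, days versus weeks) remains a design choice to settle once the weekly pattern has been examined.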