Import Libraries

In [4]:
# Cell 1 — Setup / Installs (run once in your environment)
# If you haven't installed these yet, run these lines (uncomment to run).
# !pip install category-encoders xverse joblib

# Standard imports
import pandas as pd
import numpy as np
from datetime import datetime
import joblib

# sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# WOE encoder
from category_encoders.woe import WOEEncoder

# Optional: IV (information value) calculation with xverse (uncomment if installed)
# from xverse.ensemble import IvSelection


In [6]:
# Cell 2 — Load data
# Adjust path to your CSV file
DATA_PATH = r"C:\Users\kalki\OneDrive\Desktop\week4\credit-risk-model-week4\data\raw\data.csv"
df = pd.read_csv(DATA_PATH)

# Quick sanity checks
print("Rows, cols:", df.shape)
print(df.columns.tolist())
df.head(3)


Rows, cols: (95662, 16)
['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId', 'CurrencyCode', 'CountryCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId', 'Amount', 'Value', 'TransactionStartTime', 'PricingStrategy', 'FraudResult']


,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CurrencyCode,CountryCode,ProviderId,ProductId,ProductCategory,ChannelId,Amount,Value,TransactionStartTime,PricingStrategy,FraudResult
0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,UGX,256,ProviderId_6,ProductId_10,airtime,ChannelId_3,1000.0,1000,2018-11-15T02:18:49Z,2,0
1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,UGX,256,ProviderId_4,ProductId_6,financial_services,ChannelId_2,-20.0,20,2018-11-15T02:19:08Z,2,0
2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,UGX,256,ProviderId_6,ProductId_1,airtime,ChannelId_3,500.0,500,2018-11-15T02:44:21Z,2,0


## Step 1 — Preprocessing notes & ID columns

Identifier columns are excluded from the model features but stay in `df` for joins and debugging:
- Columns dropped before modeling: `TransactionId, BatchId, AccountId, SubscriptionId, CustomerId, ProductId, TransactionStartTime` (the raw timestamp is replaced by the time features in Step 3)
- Target column: `FraudResult` (change this if you later use `is_high_risk`)


In [7]:
# Cell 3 — Convert datetime
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'], errors='coerce')


## Step 2 — Aggregate (customer-level) features

Create these aggregates per `CustomerId` and merge them back:
- TotalTransactionAmount (sum of Amount)
- AvgTransactionAmount (mean)
- TransactionCount (count)
- TransactionAmountStd (std)
- UniqueProductCategories (nunique of ProductCategory)
- UniqueChannels (nunique of ChannelId)
- FraudRate (mean of FraudResult): derived from the target, so it leaks label information. Keep it for EDA and exclude it from the training features.


In [8]:
# Cell 4 — Aggregate features
agg = df.groupby('CustomerId').agg(
    TotalTransactionAmount = ('Amount', 'sum'),
    AvgTransactionAmount   = ('Amount', 'mean'),
    TransactionCount       = ('Amount', 'count'),
    TransactionAmountStd   = ('Amount', 'std'),
    UniqueProductCategories= ('ProductCategory', 'nunique'),
    UniqueChannels         = ('ChannelId', 'nunique'),
    FraudRate              = ('FraudResult', 'mean')
).reset_index()

# merge back
df = df.merge(agg, on='CustomerId', how='left')

# if std is NaN (single txn), replace with 0
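# (assign the result back; chained inplace fillna on a column is deprecated in recent pandas)
df['TransactionAmountStd'] = df['TransactionAmountStd'].fillna(0)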

df[['CustomerId','TotalTransactionAmount','AvgTransactionAmount','TransactionCount','TransactionAmountStd']].head()


,CustomerId,TotalTransactionAmount,AvgTransactionAmount,TransactionCount,TransactionAmountStd
0,CustomerId_4406,109921.75,923.712185,119,3042.294251
1,CustomerId_4406,109921.75,923.712185,119,3042.294251
2,CustomerId_4683,1000.0,500.0,2,0.0
3,CustomerId_988,228727.2,6019.136842,38,17169.24161
4,CustomerId_988,228727.2,6019.136842,38,17169.24161
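
The aggregates above are computed over every row before the train/test split, and `FraudRate` is built from the target itself. One way to limit leakage is to compute the customer-level statistics from training rows only and then map them onto held-out rows. The sketch below uses a tiny made-up frame; the helper name `customer_aggregates` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def customer_aggregates(train: pd.DataFrame) -> pd.DataFrame:
    """Per-customer statistics computed from training rows only (hypothetical helper)."""
    agg = train.groupby("CustomerId").agg(
        TotalTransactionAmount=("Amount", "sum"),
        AvgTransactionAmount=("Amount", "mean"),
        TransactionCount=("Amount", "count"),
        TransactionAmountStd=("Amount", "std"),
    ).reset_index()
    # A customer with a single transaction has an undefined std; treat it as 0.
    agg["TransactionAmountStd"] = agg["TransactionAmountStd"].fillna(0)
    return agg

# Toy data, made up for illustration
train = pd.DataFrame({
    "CustomerId": ["c1", "c1", "c2"],
    "Amount": [100.0, 300.0, 50.0],
})
test = pd.DataFrame({
    "CustomerId": ["c1", "c3"],   # c3 never appears in the training rows
    "Amount": [20.0, 70.0],
})

agg = customer_aggregates(train)
train_feat = train.merge(agg, on="CustomerId", how="left")
# Customers unseen in training get NaN here; a median imputer can fill them later.
test_feat = test.merge(agg, on="CustomerId", how="left")
```

Because no target-derived column is produced, nothing in `agg` depends on `FraudResult`.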


## Step 3 — Time-based features

From `TransactionStartTime` extract:
- Hour, Day, Month, Year
- DayOfWeek (0=Mon .. 6=Sun)
- IsWeekend (binary)


In [9]:
# Cell 5 — Time features
df['TransactionHour']  = df['TransactionStartTime'].dt.hour.fillna(-1).astype(int)
df['TransactionDay']   = df['TransactionStartTime'].dt.day.fillna(-1).astype(int)
df['TransactionMonth'] = df['TransactionStartTime'].dt.month.fillna(-1).astype(int)
df['TransactionYear']  = df['TransactionStartTime'].dt.year.fillna(-1).astype(int)
df['DayOfWeek']        = df['TransactionStartTime'].dt.dayofweek.fillna(-1).astype(int)
df['IsWeekend']        = df['DayOfWeek'].isin([5,6]).astype(int)

df[['TransactionStartTime','TransactionHour','TransactionDay','TransactionMonth','TransactionYear','DayOfWeek','IsWeekend']].head()


,TransactionStartTime,TransactionHour,TransactionDay,TransactionMonth,TransactionYear,DayOfWeek,IsWeekend
0,2018-11-15 02:18:49+00:00,2,15,11,2018,3,0
1,2018-11-15 02:19:08+00:00,2,15,11,2018,3,0
2,2018-11-15 02:44:21+00:00,2,15,11,2018,3,0
3,2018-11-15 03:32:55+00:00,3,15,11,2018,3,0
4,2018-11-15 03:34:21+00:00,3,15,11,2018,3,0
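
To see how the `-1` placeholder behaves, here is a small sketch on three made-up timestamps, one of which cannot be parsed. It follows the same steps as Cell 5; the variable names and sample strings are illustrative only.

```python
import pandas as pd

raw = pd.Series(["2018-11-15T02:18:49Z", "2018-11-17T10:00:00Z", "not a date"])
ts = pd.to_datetime(raw, errors="coerce")   # unparseable values become NaT

# NaT yields NaN in the .dt accessors, which is then replaced by -1
hour = ts.dt.hour.fillna(-1).astype(int)
dow = ts.dt.dayofweek.fillna(-1).astype(int)
# -1 is not in [5, 6], so rows with a missing timestamp are flagged as non-weekend
is_weekend = dow.isin([5, 6]).astype(int)

n_bad = int(ts.isna().sum())   # how many rows carry the placeholder
print(hour.tolist(), dow.tolist(), is_weekend.tolist(), n_bad)
```

Checking `n_bad` on the real data tells you whether the placeholder is ever used; if it is, the scaler will treat `-1` as an ordinary numeric value.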


## Step 4 — Feature selection: which columns to use for the pipeline

Numeric features (candidates):
- Amount, Value
- TotalTransactionAmount, AvgTransactionAmount, TransactionCount, TransactionAmountStd
- TransactionHour, TransactionDay, TransactionMonth, TransactionYear, DayOfWeek, IsWeekend

Categorical features (candidates):
- ProductCategory, ChannelId, ProviderId, CurrencyCode, CountryCode, PricingStrategy

ID / non-feature columns to drop:
- TransactionId, BatchId, AccountId, SubscriptionId, CustomerId, ProductId, TransactionStartTime


In [10]:
# Cell 6 — Define column lists
TARGET = 'FraudResult'   # change to 'is_high_risk' later if using proxy label
DROP_COLS = ['TransactionId','BatchId','AccountId','SubscriptionId','CustomerId','ProductId','TransactionStartTime']

numeric_cols = [
    'Amount','Value',
    'TotalTransactionAmount','AvgTransactionAmount','TransactionCount','TransactionAmountStd',
    'TransactionHour','TransactionDay','TransactionMonth','TransactionYear','DayOfWeek','IsWeekend'
]

categorical_cols = [
    'ProductCategory','ChannelId','ProviderId','CurrencyCode','CountryCode','PricingStrategy'
]

# Ensure columns exist (defensive)
numeric_cols = [c for c in numeric_cols if c in df.columns]
categorical_cols = [c for c in categorical_cols if c in df.columns]

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)


Numeric: ['Amount', 'Value', 'TotalTransactionAmount', 'AvgTransactionAmount', 'TransactionCount', 'TransactionAmountStd', 'TransactionHour', 'TransactionDay', 'TransactionMonth', 'TransactionYear', 'DayOfWeek', 'IsWeekend']
Categorical: ['ProductCategory', 'ChannelId', 'ProviderId', 'CurrencyCode', 'CountryCode', 'PricingStrategy']


## Step 5 — Build sklearn preprocessing Pipeline

Pipeline design:
- Numeric pipeline: `SimpleImputer(strategy='median')` -> `StandardScaler()`
- Categorical pipeline: `SimpleImputer(strategy='most_frequent')` -> `WOEEncoder()`  
  *Note:* `WOEEncoder` requires the target during fit; putting it in ColumnTransformer + Pipeline will work if we call `pipeline.fit(X, y)`.
- ColumnTransformer combining both pipelines
- Full pipeline persistable with joblib
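
For intuition about why the encoder needs `y`, the sketch below computes a simplified Weight of Evidence (and the Information Value mentioned in Cell 1) for one categorical column by hand. The function name `woe_table`, the `smoothing` parameter, and the toy data are assumptions made for illustration; `WOEEncoder` applies its own regularization, so its numbers will not match these exactly.

```python
import numpy as np
import pandas as pd

def woe_table(x: pd.Series, y: pd.Series, smoothing: float = 0.5) -> pd.DataFrame:
    """Simplified per-category WOE and IV contributions (illustrative, not the library formula)."""
    stats = (
        pd.DataFrame({"x": x, "y": y})
        .groupby("x")["y"]
        .agg(events="sum", total="count")
    )
    stats["non_events"] = stats["total"] - stats["events"]
    n_cat = len(stats)
    # Smoothed share of all events / non-events that fall into each category
    event_share = (stats["events"] + smoothing) / (stats["events"].sum() + smoothing * n_cat)
    non_event_share = (stats["non_events"] + smoothing) / (
        stats["non_events"].sum() + smoothing * n_cat
    )
    stats["woe"] = np.log(event_share / non_event_share)
    stats["iv_part"] = (event_share - non_event_share) * stats["woe"]
    return stats

# Toy column: category "a" contains all the fraud cases
x = pd.Series(["a", "a", "a", "b", "b", "b"], name="ChannelId")
y = pd.Series([1, 1, 0, 0, 0, 0], name="FraudResult")

table = woe_table(x, y)
information_value = table["iv_part"].sum()
print(table[["events", "non_events", "woe"]])
print("IV:", round(information_value, 3))
```

A positive WOE marks a category where fraud is over-represented, a negative one where it is under-represented. Every term depends on the target, which is why the encoder must be fitted with `y` and only on training data.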


In [11]:
# Cell 7 — Build pipelines
num_pipeline = Pipeline([
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

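# WOEEncoder expects y in fit; ColumnTransformer and Pipeline forward it.
# SimpleImputer returns a NumPy array by default, which loses the column names that
# WOEEncoder(cols=...) looks up, so request pandas output (needs scikit-learn >= 1.2).
cat_pipeline = Pipeline([
    ('cat_imputer', SimpleImputer(strategy='most_frequent').set_output(transform='pandas')),
    ('woe', WOEEncoder(cols=categorical_cols))
])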

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_cols),
        # We pass categorical columns to WOE through the pipeline; ColumnTransformer will pass the DataFrame subset.
        ('cat', cat_pipeline, categorical_cols)
    ],
    remainder='drop'  # drop any other columns
)

full_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])


## Step 6 — Prepare X and y, fit the pipeline and transform

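Important: `WOEEncoder` computes its statistics from the target, so call `full_pipeline.fit(X_train, y_train)`; `fit_transform(X)` without `y` does not give the encoder what it needs. Fit on the training split only and reuse the fitted pipeline to transform the test split.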


In [12]:
# Cell 8 — Prepare X, y and split for fitting (we fit pipeline on train only)
X = df.drop(columns = DROP_COLS + [TARGET])
y = df[TARGET].astype(int)  # ensure integer labels

# train/test split (stratify if binary target unbalanced)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y if y.nunique()>1 else None
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


Train shape: (76529, 21) Test shape: (19133, 21)
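
With a rare positive class it is worth confirming that `stratify=y` kept the class ratio the same in both splits. The sketch below does this on synthetic labels; the 2% positive rate and the array shapes are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y = np.zeros(1000, dtype=int)
y[:20] = 1                       # 2% positives, an assumed imbalance
X = rng.normal(size=(1000, 3))   # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate, test_rate = y_tr.mean(), y_te.mean()
print(f"positive rate: train={train_rate:.4f}, test={test_rate:.4f}")
```

On the notebook's data the same check is `y_train.mean()` versus `y_test.mean()`.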


In [15]:
woe_cols = ['CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId']

# Keep only columns that exist in X_train
woe_cols = [c for c in woe_cols if c in X_train.columns]

print("Columns for WOE:", woe_cols)

Columns for WOE: ['CurrencyCode', 'ProviderId', 'ProductCategory', 'ChannelId']
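
The filter above quietly removes `ProductId`: it is listed in `DROP_COLS`, so it never reaches `X_train`. A small helper can make such omissions visible instead of silent. The helper name `split_present` and the stand-in column list are hypothetical.

```python
def split_present(requested, available):
    """Return (present, missing) column names, preserving the requested order."""
    available = set(available)
    present = [c for c in requested if c in available]
    missing = [c for c in requested if c not in available]
    return present, missing

requested = ['CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId']
# Stand-in for X_train.columns after DROP_COLS has been removed
available = ['CurrencyCode', 'ProviderId', 'ProductCategory', 'ChannelId', 'Amount', 'Value']

woe_cols, missing = split_present(requested, available)
if missing:
    print("Requested but not in X_train:", missing)
```

If `ProductId` should be WOE-encoded, it has to come out of `DROP_COLS` first.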


In [16]:
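# NOTE: 'FraudRate' is the per-customer mean of the target (see Step 2), so it leaks
# label information; remove it from this list before training a real model.
num_cols = [
    'Amount', 'Value', 'TotalTransactionAmount', 'AvgTransactionAmount',
    'TransactionCount', 'TransactionAmountStd', 'FraudRate', 'UniqueProductCategories',
    'UniqueChannels'
]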

# Keep only columns that exist in your dataframe
num_cols = [c for c in num_cols if c in X_train.columns]

print("Numerical columns:", num_cols)

Numerical columns: ['Amount', 'Value', 'TotalTransactionAmount', 'AvgTransactionAmount', 'TransactionCount', 'TransactionAmountStd', 'FraudRate', 'UniqueProductCategories', 'UniqueChannels']


In [17]:
# 1️⃣ Numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# 2️⃣ WOE columns (categorical)
woe_cols = ['CurrencyCode', 'ProviderId', 'ProductId', 'ProductCategory', 'ChannelId']
woe_cols = [c for c in woe_cols if c in X_train.columns]

# 3️⃣ Full pipeline
full_pipeline = ColumnTransformer(transformers=[
    ('num', num_pipeline, num_cols),
    ('cat', WOEEncoder(cols=woe_cols), woe_cols)
])

# 4️⃣ Fit and transform
full_pipeline.fit(X_train, y_train)
X_train_transformed = full_pipeline.transform(X_train)
X_test_transformed  = full_pipeline.transform(X_test)

print("Transformed shapes:", X_train_transformed.shape, X_test_transformed.shape)

Transformed shapes: (76529, 13) (19133, 13)
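
Step 5 notes that the fitted pipeline can be persisted with joblib, although the notebook does not do so. A minimal sketch of the round trip follows. To stay runnable without `category_encoders`, it uses a numeric-only stand-in preprocessor on made-up rows; the artifact path is a temporary, hypothetical location.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in training frame (values made up for illustration)
train = pd.DataFrame({
    "Amount": [1000.0, -20.0, 500.0, np.nan],
    "Value": [1000, 20, 500, 40],
})

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["Amount", "Value"]),
])
preprocessor.fit(train)

# Hypothetical artifact path; the notebook does not say where to store the pipeline.
artifact_path = os.path.join(tempfile.mkdtemp(), "preprocessor.joblib")
joblib.dump(preprocessor, artifact_path)

# Later, e.g. at scoring time: reload and apply the same fitted transformations.
restored = joblib.load(artifact_path)
same_output = np.allclose(preprocessor.transform(train), restored.transform(train))
print("Round trip reproduces the transform:", same_output)
```

In the notebook the object to dump would be the fitted `full_pipeline`. Loading it later generally requires compatible versions of scikit-learn and category_encoders.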


In [20]:
# Cell 10 — Save processed dataset to existing folder
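# Note: this writes the feature-engineered DataFrame `df` (aggregates + time features),
# not the scaled / WOE-encoded arrays produced by the pipeline above.
processed_file_path = "../data/processed/processed_data.csv"  # adjust path if needed
df.to_csv(processed_file_path, index=False)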

print(f"Saved processed data to {processed_file_path}")

Saved processed data to ../data/processed/processed_data.csv


# Task 3 — Feature Engineering

## Objective
Build a robust, automated, and reproducible data processing pipeline that transforms raw transaction data into a model-ready format for risk modeling.

## Steps Performed

### 1. Aggregate Features
- Total Transaction Amount: Sum of all transaction amounts per customer.
- Average Transaction Amount: Average transaction amount per customer.
- Transaction Count: Number of transactions per customer.
- Standard Deviation of Transaction Amounts: Measures variability of transaction amounts per customer.
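- Fraud Rate: Average FraudResult per customer. It is derived from the target, so it is useful for EDA but leaks label information if used as a training feature.
- Unique Product Categories: Count of distinct product categories per customer.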
- Unique Channels: Count of distinct channels used per customer.

### 2. Time-Based Features
- TransactionHour, TransactionDay, TransactionMonth, TransactionYear: Extracted from TransactionStartTime.
- DayOfWeek: Day of the week (0 = Monday, 6 = Sunday).
- IsWeekend: Flag for weekend transactions.

### 3. Categorical Encoding
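- WOE Encoding: Applied to the categorical columns available after the ID columns were dropped (CurrencyCode, ProviderId, ProductCategory, ChannelId), encoding each category by its relationship with FraudResult. ProductId was requested but is removed together with the ID columns, and CountryCode was not part of the final WOE list.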

### 4. Missing Value Handling
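- Imputed missing numerical values with the median (`SimpleImputer(strategy='median')`) and replaced the undefined standard deviation of single-transaction customers with 0.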
- Filled NaN time-related features with -1 as a placeholder.

### 5. Normalization / Scaling
- Standardized numerical features in the pipeline using StandardScaler.

### 6. Pipeline Construction
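- Built a reproducible scikit-learn preprocessing object (a `ColumnTransformer` wrapping a numerical `Pipeline`) combining:
  - Median imputation and standardization for numerical columns
  - WOE encoding for categorical columns
- Fitted on the training split only and applied unchanged to the test split, so both receive consistent preprocessing.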

