## Objective

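This notebook demonstrates the feature engineering pipeline that turns raw transaction-level data into a model-ready feature matrix using reproducible scikit-learn Pipelines. It loads the raw data, extracts temporal features, computes per-customer aggregates, applies the preprocessor, and saves the result.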

## Load data

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/data.csv")
df.head()

|  | TransactionId | BatchId | AccountId | SubscriptionId | CustomerId | CurrencyCode | CountryCode | ProviderId | ProductId | ProductCategory | ChannelId | Amount | Value | TransactionStartTime | PricingStrategy | FraudResult |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TransactionId_76871 | BatchId_36123 | AccountId_3957 | SubscriptionId_887 | CustomerId_4406 | UGX | 256 | ProviderId_6 | ProductId_10 | airtime | ChannelId_3 | 1000.0 | 1000 | 2018-11-15T02:18:49Z | 2 | 0 |
| 1 | TransactionId_73770 | BatchId_15642 | AccountId_4841 | SubscriptionId_3829 | CustomerId_4406 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -20.0 | 20 | 2018-11-15T02:19:08Z | 2 | 0 |
| 2 | TransactionId_26203 | BatchId_53941 | AccountId_4229 | SubscriptionId_222 | CustomerId_4683 | UGX | 256 | ProviderId_6 | ProductId_1 | airtime | ChannelId_3 | 500.0 | 500 | 2018-11-15T02:44:21Z | 2 | 0 |
| 3 | TransactionId_380 | BatchId_102363 | AccountId_648 | SubscriptionId_2185 | CustomerId_988 | UGX | 256 | ProviderId_1 | ProductId_21 | utility_bill | ChannelId_3 | 20000.0 | 21800 | 2018-11-15T03:32:55Z | 2 | 0 |
| 4 | TransactionId_28195 | BatchId_38780 | AccountId_4841 | SubscriptionId_3829 | CustomerId_988 | UGX | 256 | ProviderId_4 | ProductId_6 | financial_services | ChannelId_2 | -644.0 | 644 | 2018-11-15T03:34:21Z | 2 | 0 |


## Temporal feature extraction (evidence)

In [6]:
from src.feature_engineering import prepare_model_dataset

processed_df, preprocessor = prepare_model_dataset(df)

processed_df[
    [
        "TransactionStartTime",
        "transaction_hour",
        "transaction_day",
        "transaction_month",
    ]
].head()


|  | TransactionStartTime | transaction_hour | transaction_day | transaction_month |
| --- | --- | --- | --- | --- |
| 0 | 2018-11-15 02:18:49+00:00 | 2 | 15 | 11 |
| 1 | 2018-11-15 02:19:08+00:00 | 2 | 15 | 11 |
| 2 | 2018-11-15 02:44:21+00:00 | 2 | 15 | 11 |
| 3 | 2018-11-15 03:32:55+00:00 | 3 | 15 | 11 |
| 4 | 2018-11-15 03:34:21+00:00 | 3 | 15 | 11 |
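
The internals of `prepare_model_dataset` are not shown in this notebook, so the following is only a minimal sketch of how the three temporal columns could be derived with the pandas `.dt` accessor. The helper name `add_temporal_features` is hypothetical, and the timestamps are the five values from the raw data preview.

```python
import pandas as pd

# Timestamps copied from the raw data preview in this notebook.
sample = pd.DataFrame(
    {
        "TransactionStartTime": [
            "2018-11-15T02:18:49Z",
            "2018-11-15T02:19:08Z",
            "2018-11-15T02:44:21Z",
            "2018-11-15T03:32:55Z",
            "2018-11-15T03:34:21Z",
        ]
    }
)


def add_temporal_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse the timestamp, then derive hour, day, month."""
    out = frame.copy()
    # utc=True yields timezone-aware values, matching the +00:00 suffix in the output.
    out["TransactionStartTime"] = pd.to_datetime(out["TransactionStartTime"], utc=True)
    ts = out["TransactionStartTime"].dt
    out["transaction_hour"] = ts.hour
    out["transaction_day"] = ts.day
    out["transaction_month"] = ts.month
    return out


temporal = add_temporal_features(sample)
```

Under these assumptions, `temporal` contains the same hour, day, and month values as the rows shown in the cell output.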


## Aggregate features (tabular evidence)

In [7]:
processed_df[
    [
        "CustomerId",
        "total_transaction_amount",
        "avg_transaction_amount",
        "transaction_count",
        "std_transaction_amount",
    ]
].drop_duplicates().head()


|  | CustomerId | total_transaction_amount | avg_transaction_amount | transaction_count | std_transaction_amount |
| --- | --- | --- | --- | --- | --- |
| 0 | CustomerId_4406 | 109921.75 | 923.712185 | 119 | 3042.294251 |
| 2 | CustomerId_4683 | 1000.0 | 500.0 | 2 | 0.0 |
| 3 | CustomerId_988 | 228727.2 | 6019.136842 | 38 | 17169.24161 |
| 5 | CustomerId_1432 | 2000.0 | 2000.0 | 1 | NaN |
| 6 | CustomerId_2858 | 93400.0 | 3220.689655 | 29 | 5493.966126 |
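
Because the aggregates are repeated on every transaction row, a `groupby(...).transform(...)` is one plausible way to compute them. This is a sketch only: the helper `add_customer_aggregates` is hypothetical, the toy amounts are invented for illustration, and the notebook does not say which source column (`Amount` or `Value`) is aggregated, so `Amount` is an assumption here.

```python
import pandas as pd

# Toy data with made-up amounts; only the column names follow the notebook.
toy = pd.DataFrame(
    {
        "CustomerId": ["C1", "C1", "C2", "C3", "C3", "C3"],
        "Amount": [100.0, 300.0, 2000.0, 10.0, 20.0, 30.0],
    }
)


def add_customer_aggregates(frame: pd.DataFrame, amount_col: str = "Amount") -> pd.DataFrame:
    """Hypothetical helper: broadcast per-customer statistics back onto each row."""
    out = frame.copy()
    grouped = out.groupby("CustomerId")[amount_col]
    out["total_transaction_amount"] = grouped.transform("sum")
    out["avg_transaction_amount"] = grouped.transform("mean")
    out["transaction_count"] = grouped.transform("count")
    # pandas uses the sample standard deviation (ddof=1), which is NaN
    # for a customer with a single transaction.
    out["std_transaction_amount"] = grouped.transform("std")
    return out


agg = add_customer_aggregates(toy)
```

The NaN produced for single-transaction customers is consistent with the missing standard deviation for `CustomerId_1432` in the output. Such values need to be imputed or otherwise handled before most estimators will accept the matrix.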


## Pipeline transformation

In [8]:
X = preprocessor.fit_transform(processed_df)

X.shape


(95662, 26)
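
The configuration of `preprocessor` is defined in `src.feature_engineering` and is not shown here, so which columns account for the 26 output features is unknown. As an illustration of the general pattern, the sketch below builds a small scikit-learn `ColumnTransformer` that imputes and scales numeric columns and one-hot encodes a categorical column. The column selection, imputation strategy, and toy rows are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; not the real dataset.
toy = pd.DataFrame(
    {
        "Amount": [1000.0, -20.0, 500.0, 20000.0],
        "std_transaction_amount": [3000.0, 3000.0, 0.0, np.nan],
        "ProductCategory": ["airtime", "financial_services", "airtime", "utility_bill"],
    }
)

numeric_cols = ["Amount", "std_transaction_amount"]  # assumed selection
categorical_cols = ["ProductCategory"]  # assumed selection

# Numeric branch: fill missing values (e.g. NaN std), then standardize.
numeric_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

sketch_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        # Categories unseen at fit time are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

X_toy = sketch_preprocessor.fit_transform(toy)
```

Here `X_toy` has 4 rows and 5 columns: two scaled numeric features plus one indicator column for each of the three product categories. When the matrix feeds a supervised model, it is common to call `fit` on the training split only and `transform` on the rest, so that scaling statistics do not leak from held-out data.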

## Save processed output

In [9]:
import numpy as np

np.save("../data/processed/feature_engineering.npy", X)
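
Whether `X` is a dense NumPy array or a SciPy sparse matrix depends on how the preprocessor is configured, which this notebook does not show. `np.save` stores a sparse matrix as a pickled object array rather than a numeric array, so a defensive variant could densify first. The helper `save_feature_matrix` below is a hypothetical sketch, demonstrated on a temporary file instead of the project path.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_feature_matrix(X, path):
    """Hypothetical helper: densify sparse input, then save as a plain .npy array."""
    dense = X.toarray() if sparse.issparse(X) else np.asarray(X)
    np.save(path, dense)
    return dense.shape


# Small sparse matrix standing in for the real feature matrix.
X_demo = sparse.csr_matrix(np.eye(3))

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "feature_engineering.npy")
    saved_shape = save_feature_matrix(X_demo, out_path)
    # Loads without allow_pickle because the file holds a numeric array.
    reloaded = np.load(out_path)
```

For a matrix that is large and mostly zeros, `scipy.sparse.save_npz` may be a better fit than densifying, at the cost of a different file format.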
