## Context 

Raw features rarely capture the full predictive signal in the data. Feature engineering (FE) transforms domain knowledge into model inputs, creating new perspectives on the data that can significantly improve model performance.

## Objectives

- Create time-derived features from TransactionDT
- Build card-level and address-level aggregation features
- Engineer email domain and interaction features
- Document transformation logic for production reproducibility

In [2]:
# import and load data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

train = pd.read_parquet(Path('../data/interim/train_merged.parquet'))
print(f'Data loaded: {train.shape}')

Data loaded: (590540, 434)


In [3]:
# time features 

START_DATE = pd.Timestamp('2017-11-30')
train['TransactionDate'] = START_DATE + pd.to_timedelta(train['TransactionDT'], unit='s')

train['hour'] = train['TransactionDate'].dt.hour
train['day_of_week'] = train['TransactionDate'].dt.dayofweek
train['day_of_month'] = train['TransactionDate'].dt.day
# vectorized flags: night covers hours 22-23 and 0-5, weekend covers Saturday (5) and Sunday (6)
train['is_night'] = ((train['hour'] >= 22) | (train['hour'] <= 5)).astype(int)
train['is_weekend'] = (train['day_of_week'] >= 5).astype(int)

print('Time features created : ')
print(train[['hour', 'day_of_week', 'is_night', 'is_weekend']].describe().transpose())

Time features created : 
                count       mean       std  min  25%   50%   75%   max
hour         590540.0  13.861923  7.607152  0.0  6.0  16.0  20.0  23.0
day_of_week  590540.0   2.928123  1.947733  0.0  1.0   3.0   5.0   6.0
is_night     590540.0   0.378897  0.485113  0.0  0.0   0.0   1.0   1.0
is_weekend   590540.0   0.254101  0.435355  0.0  0.0   0.0   1.0   1.0


## Insight: time-based feature engineering

The cell above creates these essential time features:
- hour: hour of day (0-23). Fraud rates peak during night hours.
- day_of_week: 0 = Monday to 6 = Sunday. Weekend patterns differ from weekdays.
- is_night: binary flag for transactions between 10 P.M. and 5:59 A.M. (hour >= 22 or hour <= 5), when fraud is elevated.
- is_weekend: binary flag for Saturday and Sunday transactions.

These simple features consistently rank in the top 50 by importance in tree models, providing high value for low engineering cost.
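
The claim that fraud is elevated at night can be checked directly by grouping on a flag and averaging the target. The sketch below is illustrative only: it assumes a binary target column named `isFraud` (not shown in this notebook), the helper `fraud_rate_by` is hypothetical, and the toy rows are made up rather than drawn from the dataset.

```python
import pandas as pd


def fraud_rate_by(df: pd.DataFrame, feature: str, target: str = "isFraud") -> pd.DataFrame:
    """Return the transaction count and fraud rate for each value of `feature`."""
    return (
        df.groupby(feature)[target]
        .agg(n="count", fraud_rate="mean")  # mean of a 0/1 target is the fraud rate
        .sort_index()
    )


# Toy data for illustration only; the column name isFraud is an assumption.
toy = pd.DataFrame({
    "is_night": [0, 0, 0, 0, 1, 1, 1, 1],
    "isFraud":  [0, 0, 0, 1, 0, 1, 1, 0],
})

rates = fraud_rate_by(toy, "is_night")
print(rates)
```

On the real data the same call would be `fraud_rate_by(train, 'is_night')` or `fraud_rate_by(train, 'hour')`, provided the target column exists under that name.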

In [4]:
# AMOUNT FEATURES 

train['TransactionAmt_log'] = np.log1p(train['TransactionAmt'])
train['TransactionAmt_decimal'] = train['TransactionAmt'] - train['TransactionAmt'].astype(int)
train['is_round_amount'] = (train['TransactionAmt'] % 1 == 0).astype(int)

print('Amount Features :')
print(train[['TransactionAmt', 'TransactionAmt_log', 'TransactionAmt_decimal', 'is_round_amount']].describe())

Amount Features :
       TransactionAmt  TransactionAmt_log  TransactionAmt_decimal  \
count   590540.000000       590540.000000           590540.000000   
mean       135.027161            4.382960                0.379452   
std        239.162521            0.937183                0.434118   
min          0.251000            0.223943                0.000000   
25%         43.320999            3.791459                0.000000   
50%         68.769001            4.245190                0.000000   
75%        125.000000            4.836282                0.949997   
max      31937.390625           10.371564                0.999001   

       is_round_amount  
count    590540.000000  
mean          0.516498  
std           0.499728  
min           0.000000  
25%           0.000000  
50%           1.000000  
75%           1.000000  
max           1.000000  


## Insight: amount-based feature engineering

The cell above creates these amount-based signals:

- TransactionAmt_log: handles the extreme right skew in the amount distribution
- TransactionAmt_decimal: extracts the decimal portion to detect .00 and .99 patterns
- is_round_amount: binary flag for whole-dollar amounts, which may indicate bot behavior

Fraud patterns: very small amounts (card testing) and round amounts (automated transactions) correlate with higher fraud rates.
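
The describe() output above shows decimal values such as 0.949997, which suggests the amounts are stored as 32-bit floats and the subtraction picks up floating-point noise. A minimal sketch of the same three features with the decimal part rounded is shown below. The function name `add_amount_features` and the three-decimal precision are assumptions made for illustration, and the toy amounts are invented.

```python
import numpy as np
import pandas as pd


def add_amount_features(df: pd.DataFrame, col: str = "TransactionAmt") -> pd.DataFrame:
    """Add log, decimal-part, and whole-amount features for an amount column."""
    out = df.copy()
    amt = out[col].astype("float64")
    # log1p compresses the long right tail and is defined at zero
    out[f"{col}_log"] = np.log1p(amt)
    # fractional part, rounded to 3 decimals (assumed precision) to suppress float noise
    out[f"{col}_decimal"] = (amt % 1).round(3)
    # 1 when the amount has no fractional part
    out["is_round_amount"] = (amt % 1 == 0).astype(int)
    return out


# Toy amounts for illustration only.
toy = pd.DataFrame({"TransactionAmt": [10.0, 19.99, 0.251]})
feats = add_amount_features(toy)
print(feats)
```

Rounding makes exact comparisons such as `TransactionAmt_decimal == 0.99` usable, which the unrounded difference does not guarantee.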

In [5]:
# card aggregations 

card_agg = train.groupby('card1')['TransactionAmt'].agg(['mean', 'std', 'count'])
card_agg.columns = ['card1_amt_mean', 'card1_amt_std', 'card1_amt_count']

# Drop existing columns if they exist to avoid merge conflicts
cols_to_drop = [col for col in card_agg.columns if col in train.columns]
train = train.drop(columns=cols_to_drop, errors='ignore')

train = train.merge(card_agg.reset_index(), on='card1', how='left')
train['amt_vs_card_mean'] = train['TransactionAmt'] - train['card1_amt_mean']

print('card aggregation features : ')
print(train[['card1_amt_mean', 'card1_amt_std', 'card1_amt_count', 'amt_vs_card_mean']].describe())

card aggregation features : 
       card1_amt_mean  card1_amt_std  card1_amt_count  amt_vs_card_mean
count   590540.000000  587096.000000    590540.000000     590540.000000
mean       135.027161     186.717651      2528.815464         -0.000001
std         79.283325     130.683121      3702.655513        225.638794
min          0.615000       0.000000         1.000000      -2297.212402
25%         97.863770     100.616867       132.000000        -76.698120
50%        120.209267     173.632751       919.000000        -33.817558
75%        157.049408     235.915588      3152.000000          9.572350
max       3454.949951    2648.284668     14932.000000      31657.542969
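
Two points about the output above are worth noting. First, `card1_amt_std` has 587,096 non-null values against 590,540 rows: the sample standard deviation is undefined for a card with a single transaction, so those rows are NaN. Second, for production reproducibility the aggregates could be fitted once on training data and then reused on new data instead of being recomputed. The sketch below shows one possible way to split the step; the function names `fit_card_aggregates` and `apply_card_aggregates` are hypothetical and the toy frames are invented.

```python
import pandas as pd


def fit_card_aggregates(train_df: pd.DataFrame, key: str = "card1",
                        amt: str = "TransactionAmt") -> pd.DataFrame:
    """Compute per-card amount statistics from training data only."""
    agg = train_df.groupby(key)[amt].agg(["mean", "std", "count"])
    agg.columns = [f"{key}_amt_mean", f"{key}_amt_std", f"{key}_amt_count"]
    return agg


def apply_card_aggregates(df: pd.DataFrame, agg: pd.DataFrame, key: str = "card1",
                          amt: str = "TransactionAmt") -> pd.DataFrame:
    """Attach previously fitted statistics; cards unseen at fit time get NaN."""
    out = df.merge(agg.reset_index(), on=key, how="left")
    out["amt_vs_card_mean"] = out[amt] - out[f"{key}_amt_mean"]
    return out


# Toy frames for illustration only.
train_toy = pd.DataFrame({"card1": [1, 1, 2], "TransactionAmt": [10.0, 30.0, 50.0]})
valid_toy = pd.DataFrame({"card1": [1, 3], "TransactionAmt": [25.0, 40.0]})

agg = fit_card_aggregates(train_toy)
valid_feats = apply_card_aggregates(valid_toy, agg)
print(agg)
print(valid_feats)
```

Card 2 appears once in the toy training frame, so its std is NaN, and card 3 is absent from it, so all of its aggregate columns are NaN after the merge. Whether to impute those values is left open here.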


In [6]:
# email features

free_emails = ['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 'aol.com', 'icloud.com']
train['P_email_is_free'] = train['P_emaildomain'].isin(free_emails).astype(int)
train['R_email_is_free'] = train['R_emaildomain'].isin(free_emails).astype(int)

# 1 when purchaser and recipient domains are identical; a missing domain never matches (NaN != NaN)
train['email_match'] = (train['P_emaildomain'] == train['R_emaildomain']).astype(int)

print('email features : ')
print(f"email match rate : {train['email_match'].mean():.4f}")

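
The objectives also list interaction features, which this notebook does not yet build. One common approach, sketched here as an assumption rather than as the notebook's method, is to concatenate two categorical columns into a combined key and frequency-encode it. The helper `add_interaction`, the `__` separator, and the `missing` placeholder are all hypothetical choices, and the toy rows are invented.

```python
import pandas as pd


def add_interaction(df: pd.DataFrame, left: str, right: str, sep: str = "__") -> pd.DataFrame:
    """Add a combined key for two columns plus its frequency in `df`."""
    out = df.copy()
    name = f"{left}{sep}{right}"

    def as_text(value) -> str:
        # Missing values get an explicit placeholder so they form their own category
        return "missing" if pd.isna(value) else str(value)

    out[name] = out[left].map(as_text) + sep + out[right].map(as_text)
    # Frequency encoding: how often each combined key occurs
    out[f"{name}_count"] = out[name].map(out[name].value_counts())
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "card1": [1001, 1001, 2002, 2002],
    "P_emaildomain": ["gmail.com", "gmail.com", "yahoo.com", None],
})

feats = add_interaction(toy, "card1", "P_emaildomain")
print(feats)
```

As with the card aggregates, the counts would need to be fitted on training data and reused at inference time to stay reproducible.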