# Feature Engineering Pipeline – E-Commerce Fraud Data (Fraud_Data.csv)

**Objective**:  
Create a complete, reproducible preprocessing and feature engineering pipeline for the e-commerce dataset, including:
- Data cleaning (duplicates, missing values)
- Geolocation integration (IP → country)
- Required feature engineering:
  - `hour_of_day`, `day_of_week`
  - `time_since_signup` (duration between signup and purchase)
  - Transaction frequency/velocity (per user/device)
- Final save to `data/processed/`

This notebook imports modular functions from `scripts/` (see `scripts/preprocess.py`) to keep the code clean and reusable.
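The time-based and per-user features listed above can be sketched in a few lines of pandas. This is a minimal illustration, not the code in `scripts/preprocess.py`: the helper name `add_time_and_velocity_features` is hypothetical, the input columns `user_id`, `signup_time`, and `purchase_time` are assumed, and `time_since_signup` is assumed to be measured in hours.

```python
import pandas as pd


def add_time_and_velocity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive time and per-user features.

    Assumes columns user_id, signup_time, purchase_time, purchase_value.
    """
    out = df.copy()
    out["signup_time"] = pd.to_datetime(out["signup_time"])
    out["purchase_time"] = pd.to_datetime(out["purchase_time"])

    # Calendar features taken from the purchase timestamp (Monday = 0).
    out["hour_of_day"] = out["purchase_time"].dt.hour
    out["day_of_week"] = out["purchase_time"].dt.dayofweek

    # Elapsed time between signup and purchase; the unit (hours) is an assumption.
    delta = out["purchase_time"] - out["signup_time"]
    out["time_since_signup"] = delta.dt.total_seconds() / 3600.0

    # Per-user aggregates broadcast back onto every transaction row.
    per_user = out.groupby("user_id")["purchase_value"]
    out["user_txn_count"] = per_user.transform("count")
    out["user_total_spent"] = per_user.transform("sum")
    out["user_avg_purchase"] = per_user.transform("mean")
    return out


# Tiny made-up sample, not taken from Fraud_Data.csv.
sample = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "signup_time": ["2015-01-01 00:00:00", "2015-01-01 00:00:00", "2015-02-01 12:00:00"],
        "purchase_time": ["2015-01-01 06:00:00", "2015-01-03 18:30:00", "2015-02-02 12:00:00"],
        "purchase_value": [20, 40, 33],
    }
)
features = add_time_and_velocity_features(sample)
print(features[["hour_of_day", "day_of_week", "time_since_signup", "user_txn_count"]])
```

Using `groupby(...).transform(...)` keeps one row per transaction, so the aggregates line up with the original frame without a separate merge.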
___

In [2]:
import sys
import os
# Add project root (one directory above "notebooks")
sys.path.append(os.path.abspath(".."))

In [3]:
# Imports
import pandas as pd
# Custom modules
from scripts.preprocess import load_and_clean_fraud_data
# Optional: display floats with two decimal places
pd.set_option('display.float_format', '{:.2f}'.format)


In [5]:
fraud_df = load_and_clean_fraud_data(
    fraud_path='../data/raw/Fraud_Data.csv',
    ip_path='../data/raw/IpAddress_to_Country.csv',
)

Starting Fraud_Data preprocessing...
Initial Row Count: 151,112
Duplicate Rows Found: 0
Rows After Cleaning: 151,112
✅ No duplicates found.
✅ No missing values detected. Dataset is clean.
Converting IP addresses to int64...
Merging IP data (this may take a moment)...
✅ Mapping complete. Found 21,966 IPs with unknown countries.
✅ Feature engineering completed.
Sample of new features:
   hour_of_day  day_of_week  time_since_signup  user_txn_count  \
0           10            6             489.73               1   
1           17            4             301.34               1   
2            8            1             208.14               1   
3           21            3            2065.18               1   
4            7            6             391.01               1   

   user_total_spent  user_avg_purchase  
0                46              46.00  
1                33              33.00  
2                33              33.00  
3                33              33.00  
4           
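
The log lines "Converting IP addresses to int64" and "Merging IP data" suggest a range lookup from IP to country. One way this could work is sketched below with `pandas.merge_asof`; it is an illustration under assumptions, not the implementation in `scripts/preprocess.py`. The helper name `map_ip_to_country` is hypothetical, and the column names `ip_address`, `lower_bound_ip_address`, `upper_bound_ip_address`, and `country` are assumed.

```python
import pandas as pd


def map_ip_to_country(fraud: pd.DataFrame, ip_ranges: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: attach a country to each row via an IP range lookup."""
    left = fraud.copy()
    left["ip_int"] = left["ip_address"].astype("int64")  # truncate float IPs
    left["_row"] = range(len(left))  # remember the original row order
    left = left.sort_values("ip_int")  # merge_asof needs sorted keys

    ranges = ip_ranges.copy()
    for col in ("lower_bound_ip_address", "upper_bound_ip_address"):
        ranges[col] = ranges[col].astype("int64")
    ranges = ranges.sort_values("lower_bound_ip_address")

    # For each IP, pick the range with the largest lower bound <= the IP.
    merged = pd.merge_asof(
        left,
        ranges,
        left_on="ip_int",
        right_on="lower_bound_ip_address",
        direction="backward",
    )

    # Keep the match only if the IP also falls below the range's upper bound.
    in_range = merged["ip_int"] <= merged["upper_bound_ip_address"]
    merged["country"] = merged["country"].where(in_range, "Unknown")

    helper_cols = ["_row", "ip_int", "lower_bound_ip_address", "upper_bound_ip_address"]
    return merged.sort_values("_row").drop(columns=helper_cols).reset_index(drop=True)


# Made-up ranges and addresses, chosen to cover hit, gap, and below-range cases.
ip_ranges = pd.DataFrame(
    {
        "lower_bound_ip_address": [100, 300],
        "upper_bound_ip_address": [199, 399],
        "country": ["CountryA", "CountryB"],
    }
)
fraud = pd.DataFrame({"ip_address": [350.7, 50.0, 150.2, 250.9]})
mapped = map_ip_to_country(fraud, ip_ranges)
print(mapped)
```

In this sketch, rows whose IP falls below every range or in a gap between ranges end up as `Unknown`, which is one plausible source of the unknown-country count reported in the log.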

In [6]:
fraud_df

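| | purchase_value | source | browser | sex | age | class | country | hour_of_day | day_of_week | time_since_signup | user_txn_count | user_total_spent | user_avg_purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | Direct | Safari | M | 36 | 0 | Unknown | 10 | 6 | 489.73 | 1 | 46 | 46.00 |
| 1 | 33 | Ads | IE | F | 30 | 0 | Unknown | 17 | 4 | 301.34 | 1 | 33 | 33.00 |
| 2 | 33 | Direct | FireFox | F | 32 | 0 | Unknown | 8 | 1 | 208.14 | 1 | 33 | 33.00 |
| 3 | 33 | Ads | IE | M | 40 | 0 | Unknown | 21 | 3 | 2065.18 | 1 | 33 | 33.00 |
| 4 | 55 | Ads | Safari | M | 38 | 0 | Unknown | 7 | 6 | 391.01 | 1 | 55 | 55.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 151107 | 39 | Direct | FireFox | F | 36 | 0 | Unknown | 21 | 4 | 2560.36 | 1 | 39 | 39.00 |
| 151108 | 62 | SEO | IE | M | 22 | 0 | Unknown | 4 | 4 | 477.59 | 1 | 62 | 62.00 |
| 151109 | 17 | SEO | FireFox | M | 19 | 0 | Unknown | 6 | 5 | 1569.22 | 1 | 17 | 17.00 |
| 151110 | 9 | Ads | IE | F | 35 | 0 | Unknown | 9 | 4 | 1818.11 | 1 | 9 | 9.00 |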
