# Feature Engineering & Data Preprocessing

## Task 1: Data Analysis and Preprocessing (Continuation)

**Objective:**
- Engineer meaningful fraud-related features
- Transform data for machine learning
- Handle severe class imbalance
- Produce a final, model-ready dataset


In [80]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set_theme()


In [91]:
fraud = pd.read_csv("../data/processed/fraud_cleaned.csv")

# Ensure datetime columns are properly typed
fraud['signup_time'] = pd.to_datetime(fraud['signup_time'])
fraud['purchase_time'] = pd.to_datetime(fraud['purchase_time'])


# Inspect
fraud.head()
fraud.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129146 entries, 0 to 129145
Data columns (total 15 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   user_id                 129146 non-null  int64         
 1   signup_time             129146 non-null  datetime64[ns]
 2   purchase_time           129146 non-null  datetime64[ns]
 3   purchase_value          129146 non-null  int64         
 4   device_id               129146 non-null  object        
 5   source                  129146 non-null  object        
 6   browser                 129146 non-null  object        
 7   sex                     129146 non-null  object        
 8   age                     129146 non-null  int64         
 9   ip_address              129146 non-null  float64       
 10  class                   129146 non-null  int64         
 11  ip_int                  129146 non-null  int64         
 12  lower_bound_ip_address  129146

In [92]:
# Time since signup (in hours)
fraud['time_since_signup'] = (fraud['purchase_time'] - fraud['signup_time']).dt.total_seconds() / 3600

# Hour of day and day of week
fraud['hour_of_day'] = fraud['purchase_time'].dt.hour
fraud['day_of_week'] = fraud['purchase_time'].dt.dayofweek

# Short account flag (<24h old)
fraud['short_account'] = (fraud['time_since_signup'] < 24).astype(int)
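
As a quick sanity check of the time-based features, the same transformations can be run on a tiny hand-made frame. This is a minimal sketch: the two rows in `toy_time` are hypothetical values chosen for illustration, not records from the dataset.

```python
import pandas as pd

# Hypothetical rows: one purchase one second after signup, one 60 hours after
toy_time = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 08:00:00", "2015-01-01 08:00:00"]),
    "purchase_time": pd.to_datetime(["2015-01-01 08:00:01", "2015-01-03 20:00:00"]),
})

# Same feature definitions as in the cell above
toy_time["time_since_signup"] = (
    toy_time["purchase_time"] - toy_time["signup_time"]
).dt.total_seconds() / 3600
toy_time["hour_of_day"] = toy_time["purchase_time"].dt.hour
toy_time["day_of_week"] = toy_time["purchase_time"].dt.dayofweek  # Monday=0 ... Sunday=6
toy_time["short_account"] = (toy_time["time_since_signup"] < 24).astype(int)

print(toy_time)
```

The first row is flagged as a short account (purchase within 24 hours of signup) and the second is not.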


In [93]:
# Total transactions per user
fraud['user_txn_count'] = fraud.groupby('user_id')['purchase_time'].transform('count')

# Sort for rolling calculations
fraud = fraud.sort_values(['user_id', 'purchase_time'])

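# Transactions in last 24 hours
def txn_last_24h(group):
    # Trailing 24-hour window per user; lowercase 'h' because the 'H' alias is deprecated in pandas >= 2.2
    return group.set_index('purchase_time').rolling('24h').count()['user_id']

# .values assigns positionally, so this relies on fraud being sorted by user_id and purchase_time (done above)
fraud['txn_in_24h'] = fraud.groupby('user_id', group_keys=False).apply(txn_last_24h).values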

# Preview
fraud[['user_id', 'purchase_time', 'user_txn_count', 'txn_in_24h']].head(10)


| | user_id | purchase_time | user_txn_count | txn_in_24h |
|---|---|---|---|---|
| 30049 | 2 | 2015-02-21 10:03:37 | 1 | 1.0 |
| 95244 | 4 | 2015-09-26 21:32:16 | 1 | 1.0 |
| 11606 | 8 | 2015-08-13 11:53:07 | 1 | 1.0 |
| 101959 | 12 | 2015-03-04 20:56:37 | 1 | 1.0 |
| 19600 | 16 | 2015-03-12 12:46:23 | 1 | 1.0 |
| 125118 | 18 | 2015-10-23 00:18:57 | 1 | 1.0 |
| 40586 | 33 | 2015-10-28 18:12:41 | 1 | 1.0 |
| 107086 | 39 | 2015-01-08 18:13:26 | 1 | 1.0 |
| 57028 | 41 | 2015-03-23 10:10:08 | 1 | 1.0 |
| 121547 | 47 | 2015-04-04 09:08:26 | 1 | 1.0 |
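
Because every user in the preview has a single transaction, the velocity features are hard to eyeball there. The sketch below reproduces the idea on a hypothetical frame (`toy_txn`) in which one user buys several times. It uses `groupby(...).rolling("24h")` rather than the `apply` helper above, which is an alternative formulation and not the notebook's own code.

```python
import pandas as pd

# Hypothetical transactions: user 1 buys three times, user 2 once
toy_txn = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "purchase_time": pd.to_datetime([
        "2015-03-01 09:00:00",
        "2015-03-01 15:00:00",
        "2015-03-03 10:00:00",
        "2015-03-05 12:00:00",
    ]),
    "purchase_value": [20, 35, 15, 50],
}).sort_values(["user_id", "purchase_time"]).reset_index(drop=True)

# Total transactions per user
toy_txn["user_txn_count"] = toy_txn.groupby("user_id")["purchase_time"].transform("count")

# Count of each user's purchases in the trailing 24-hour window (current row included)
rolling_counts = (
    toy_txn.set_index("purchase_time")
           .groupby("user_id")["purchase_value"]
           .rolling("24h")
           .count()
)

# Rows are already sorted by user and time, so positional assignment lines up
toy_txn["txn_in_24h"] = rolling_counts.to_numpy()

print(toy_txn)
```

Only the second purchase of user 1 falls within 24 hours of an earlier one, so it is the only row with a count above 1.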


In [94]:
categorical_cols = ['source', 'browser', 'sex', 'country']

fraud_encoded = pd.get_dummies(fraud, columns=categorical_cols, drop_first=True)

# Verify
print(fraud_encoded.shape)
fraud_encoded.head()


(129146, 204)


Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,age,ip_address,class,ip_int,lower_bound_ip_address,...,country_United States,country_Uruguay,country_Uzbekistan,country_Vanuatu,country_Venezuela,country_Viet Nam,country_Virgin Islands (U.S.),country_Yemen,country_Zambia,country_Zimbabwe
30049,2,2015-01-11 03:47:13,2015-02-21 10:03:37,54,FGBQNDNBETFJJ,25,880217500.0,0,880217484,872415200.0,...,True,False,False,False,False,False,False,False,False,False
95244,4,2015-06-02 16:40:57,2015-09-26 21:32:16,41,MKFUIVOHLJBYN,38,2785906000.0,0,2785906106,2785542000.0,...,False,False,False,False,False,False,False,False,False,False
11606,8,2015-05-28 07:53:06,2015-08-13 11:53:07,47,SCQGQALXBUQZJ,25,356056700.0,0,356056736,352321500.0,...,True,False,False,False,False,False,False,False,False,False
101959,12,2015-01-10 06:25:12,2015-03-04 20:56:37,35,MSNWCFEHKTIOY,19,2985180000.0,0,2985180352,2985034000.0,...,False,False,False,False,False,False,False,False,False,False
19600,16,2015-02-03 13:48:23,2015-03-12 12:46:23,9,FROZWSSWOHZBE,32,578312500.0,0,578312545,570425300.0,...,True,False,False,False,False,False,False,False,False,False
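
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (`toy_cat`); the category values are illustrative and not taken from the dataset. For each encoded column, the alphabetically first category is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical categorical values, for illustration only
toy_cat = pd.DataFrame({
    "purchase_value": [54, 41, 47],
    "source": ["SEO", "Ads", "Direct"],
    "sex": ["M", "F", "M"],
})

# One-hot encode; the first level of each column ("Ads", "F") is dropped
toy_encoded = pd.get_dummies(toy_cat, columns=["source", "sex"], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

A row whose dummy columns are all zero (the second row here) belongs to the dropped baseline categories. The same mechanism explains the wide output above: most of the 204 columns come from the `country` dummies.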


In [95]:
numeric_cols = ['purchase_value', 'age', 'time_since_signup', 'user_txn_count', 'txn_in_24h']

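# Note: the scaler is fit on the full dataset here. To avoid leaking test-set statistics
# into evaluation, consider fitting it on the training split only during modeling.
scaler = StandardScaler()
fraud_encoded[numeric_cols] = scaler.fit_transform(fraud_encoded[numeric_cols])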

fraud_encoded[numeric_cols].head()


| | purchase_value | age | time_since_signup | user_txn_count | txn_in_24h |
|---|---|---|---|---|---|
| 30049 | 0.93175 | -0.94349 | -0.435282 | 0.0 | 0.0 |
| 95244 | 0.222055 | 0.56546 | 1.633628 | 0.0 | 0.0 |
| 11606 | 0.549607 | -0.94349 | 0.555963 | 0.0 | 0.0 |
| 101959 | -0.105497 | -1.639928 | -0.094505 | 0.0 | 0.0 |
| 19600 | -1.524887 | -0.130978 | -0.554116 | 0.0 | 0.0 |


In [96]:
# Drop the target, identifiers, raw timestamps, and IP columns from the feature matrix
drop_cols = [
    'class', 'user_id', 'device_id', 'signup_time', 'purchase_time',
    'ip_address', 'ip_int', 'lower_bound_ip_address', 'upper_bound_ip_address',
]
X = fraud_encoded.drop(columns=drop_cols)
y = fraud_encoded['class']

# Check
print("Feature shape:", X.shape)
print("Target distribution:\n", y.value_counts())


Feature shape: (129146, 195)
Target distribution:
 class
0    116878
1     12268
Name: count, dtype: int64


In [97]:
# Original distribution
print("Original class distribution:")
print(y.value_counts())

# SMOTE preview (apply only on training set during modeling)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("Resampled class distribution:")
print(y_res.value_counts())


Original class distribution:
class
0    116878
1     12268
Name: count, dtype: int64


Resampled class distribution:
class
0    116878
1    116878
Name: count, dtype: int64


In [98]:
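# Save the encoded, scaled dataset (identifier and timestamp columns included;
# the SMOTE-resampled preview is not saved)
fraud_encoded.to_csv("../data/processed/fraud_features.csv", index=False)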


## Feature Engineering Summary

**Features Created & Transformed:**

1. **Time-Based Features**
   - hour_of_day, day_of_week, time_since_signup, short_account

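2. **Transaction Frequency / Velocity**
   - user_txn_count (total transactions per user)
   - txn_in_24h (transactions in the trailing 24 hours, including the current one)
   - Both equal 1 for every row in the preview (0.0 after scaling), so they may add little signal if each user has only one transaction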

3. **Categorical Features**
   - One-hot encoded: source, browser, sex, country

4. **Numerical Transformation**
   - StandardScaler applied to: purchase_value, age, time_since_signup, user_txn_count, txn_in_24h

5. **Class Imbalance**
   - SMOTE previewed on the full dataset (116,878 vs. 12,268 before resampling; 116,878 per class after); during modeling, resample the training set only

**Next Step:**  
Proceed to **Task 2 — Model Building and Training** using the prepared `fraud_features.csv`.
