# Feature Engineering & Data Preprocessing

## Task 1: Data Analysis and Preprocessing (Continuation)

**Objective:**
- Engineer meaningful fraud-related features
- Transform data for machine learning
- Handle severe class imbalance
- Produce a final, model-ready dataset


In [100]:
# Cell 1 — Imports and Setup
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set_theme()


In [101]:
# Cell 2 — Helper Functions
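def scale_numeric(df, numeric_cols):
    """Standardize numeric_cols to zero mean and unit variance; returns a copy."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    scaler = StandardScaler()
    # Note: the scaler is fit on every row passed in, before any train/test split
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df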

def encode_categorical(df, categorical_cols):
    return pd.get_dummies(df, columns=categorical_cols, drop_first=True)

def apply_smote(X, y):
    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    return X_res, y_res
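
Because `scale_numeric` fits the scaler on the whole dataset, test-set statistics can leak into training once the data is split in Task 2. A minimal sketch of a split-aware alternative follows; the helper name `scale_train_test` and the toy data are hypothetical, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def scale_train_test(train_df, test_df, numeric_cols):
    """Hypothetical helper: fit the scaler on training rows only."""
    train_df, test_df = train_df.copy(), test_df.copy()
    scaler = StandardScaler()
    # Learn mean and standard deviation from the training rows only
    train_df[numeric_cols] = scaler.fit_transform(train_df[numeric_cols])
    # Reuse those training statistics for the held-out rows
    test_df[numeric_cols] = scaler.transform(test_df[numeric_cols])
    return train_df, test_df, scaler


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "purchase_value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "label": [0, 0, 0, 1, 0, 1],
})
train, test = train_test_split(toy, test_size=2, random_state=0)
train_scaled, test_scaled, fitted_scaler = scale_train_test(
    train, test, ["purchase_value"]
)
```

Returning the fitted scaler lets the same transformation be applied to new data at prediction time.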


In [102]:
# Cell 3 — Load Fraud Data
fraud = pd.read_csv("../data/processed/fraud_cleaned.csv")
fraud['signup_time'] = pd.to_datetime(fraud['signup_time'])
fraud['purchase_time'] = pd.to_datetime(fraud['purchase_time'])


In [103]:
# Cell 4 — Time-based Features
fraud['time_since_signup'] = (fraud['purchase_time'] - fraud['signup_time']).dt.total_seconds() / 3600
fraud['hour_of_day'] = fraud['purchase_time'].dt.hour
fraud['day_of_week'] = fraud['purchase_time'].dt.dayofweek
fraud['short_account'] = (fraud['time_since_signup'] < 24).astype(int)
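
To make the units concrete, here is a small worked example of the same four features on two invented rows. The timestamps are made up; only the column names come from the cell above.

```python
import pandas as pd

# Two invented accounts that both signed up at midnight on 2015-01-01
toy = pd.DataFrame({
    "signup_time": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 00:00"]),
    "purchase_time": pd.to_datetime(["2015-01-01 06:30", "2015-01-03 12:00"]),
})

# Hours between signup and purchase
toy["time_since_signup"] = (
    (toy["purchase_time"] - toy["signup_time"]).dt.total_seconds() / 3600
)
toy["hour_of_day"] = toy["purchase_time"].dt.hour
toy["day_of_week"] = toy["purchase_time"].dt.dayofweek  # Monday=0 ... Sunday=6
# Flag purchases made within 24 hours of signup
toy["short_account"] = (toy["time_since_signup"] < 24).astype(int)

print(toy[["time_since_signup", "hour_of_day", "day_of_week", "short_account"]])
```

The first row is flagged as a short-lived account (6.5 hours), while the second (60 hours) is not.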


In [104]:
# Cell 5 — Transaction Frequency
fraud['user_txn_count'] = fraud.groupby('user_id')['purchase_time'].transform('count')

# Rolling transactions in last 24h (each row counts itself)
fraud = fraud.sort_values(['user_id', 'purchase_time'])
# groupby().rolling() returns rows ordered by user_id, then purchase_time,
# matching the sort above, so positional assignment is safe.
fraud['txn_in_24h'] = (
    fraud.set_index('purchase_time').groupby('user_id')['purchase_value']
         .rolling('24h').count().to_numpy()  # '24H' is deprecated in recent pandas
)
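
One way to sanity-check a per-user 24-hour rolling count is to run it on a few hand-made rows. The sketch below uses pandas' `groupby(...).rolling(...)` on invented data and assumes a non-null `purchase_value` column to count.

```python
import pandas as pd

# Invented transactions: three for user 1, one for user 2
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "purchase_time": pd.to_datetime([
        "2015-01-01 08:00",
        "2015-01-01 20:00",
        "2015-01-02 09:00",
        "2015-01-01 10:00",
    ]),
    "purchase_value": [10, 20, 30, 40],
}).sort_values(["user_id", "purchase_time"])

# For each row, count that user's transactions in the window (t - 24h, t]
counts = (
    toy.set_index("purchase_time")
       .groupby("user_id")["purchase_value"]
       .rolling("24h")
       .count()
)
# The result is ordered by user_id, then time, the same order as the sorted frame
toy["txn_in_24h"] = counts.to_numpy()
print(toy)
```

For user 1 the third purchase counts only two transactions, because the 08:00 purchase on the previous day falls just outside its 24-hour window.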


In [105]:
# Cell 6 — Encode categorical & Scale numeric
categorical_cols = ['source', 'browser', 'sex', 'country']
fraud_encoded = encode_categorical(fraud, categorical_cols)

numeric_cols = ['purchase_value', 'age', 'time_since_signup', 'user_txn_count', 'txn_in_24h']
fraud_encoded = scale_numeric(fraud_encoded, numeric_cols)


In [106]:
# Cell 7 — Split features and target
drop_cols = ['class', 'user_id', 'device_id', 'signup_time', 'purchase_time',
             'ip_address', 'ip_int', 'lower_bound_ip_address', 'upper_bound_ip_address']
X_fraud = fraud_encoded.drop(columns=drop_cols)
y_fraud = fraud_encoded['class']

# Original distribution
print("Original class distribution:\n", y_fraud.value_counts())


Original class distribution:
 class
0    116878
1     12268
Name: count, dtype: int64
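
The severity of the imbalance can be read directly off these counts. A short worked calculation, using only the numbers printed above:

```python
# Class counts copied from the printed distribution above
counts = {0: 116878, 1: 12268}

total = sum(counts.values())
fraud_rate = counts[1] / total            # share of rows labeled fraud
imbalance_ratio = counts[0] / counts[1]   # legitimate rows per fraud row

print(f"total rows: {total}")
print(f"fraud rate: {fraud_rate:.2%}")
print(f"legitimate-to-fraud ratio: {imbalance_ratio:.1f}:1")
```

Roughly one row in ten is fraud, so a model that always predicts the majority class would already be correct on about 90% of rows, which is why accuracy alone is a poor metric here.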


In [107]:
# Cell 8 — Apply SMOTE (for training later)
# Note: this resamples the full dataset to show the balanced counts. For modeling,
# resample only the training split so synthetic rows do not leak into evaluation.
X_fraud_res, y_fraud_res = apply_smote(X_fraud, y_fraud)
print("Resampled class distribution:\n", y_fraud_res.value_counts())


Resampled class distribution:
 class
0    116878
1    116878
Name: count, dtype: int64


In [108]:
# Cell 9 — Save Fraud Features
fraud_encoded.to_csv("../data/processed/fraud_features.csv", index=False)


In [109]:
# Cell 10 — Load CreditCard Data
credit = pd.read_csv("../data/processed/creditcard_cleaned.csv")


In [110]:
# Cell 11 — Numeric & Time-based Features
numeric_cols_cc = ['Amount']
credit = scale_numeric(credit, numeric_cols_cc)
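# Assumes 'Time' is seconds elapsed since the first transaction in the file, so this
# is an hour offset within a 24-hour cycle rather than a verified clock hour.
credit['hour_of_day'] = (credit['Time'] // 3600) % 24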
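
The arithmetic behind `hour_of_day` can be checked on a few hand-picked values. This sketch assumes `Time` holds elapsed seconds; the sample values are invented.

```python
import pandas as pd

# Invented elapsed-seconds values spanning two 24-hour cycles
toy = pd.DataFrame({"Time": [0.0, 3599.0, 3600.0, 90000.0, 172799.0]})

# Whole hours elapsed, wrapped onto a 0-23 cycle
toy["hour_of_day"] = ((toy["Time"] // 3600) % 24).astype(int)
print(toy)
```

For example, 90000 seconds is 25 whole hours, which wraps to hour 1 of the second cycle, and 172799 seconds is one second short of 48 hours, which gives hour 23.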


In [111]:
# Cell 12 — Encode categorical (if any)
categorical_cols_cc = []
credit_encoded = encode_categorical(credit, categorical_cols_cc)


In [112]:
# Cell 13 — Split features and target
X_cc = credit_encoded.drop(columns=['Class', 'Time'])
y_cc = credit_encoded['Class']

# Original distribution
print("CreditCard original class distribution:\n", y_cc.value_counts())


CreditCard original class distribution:
 Class
0    283253
1       473
Name: count, dtype: int64


In [113]:
# Cell 14 — Apply SMOTE
X_cc_res, y_cc_res = apply_smote(X_cc, y_cc)
print("CreditCard resampled class distribution:\n", y_cc_res.value_counts())


CreditCard resampled class distribution:
 Class
0    283253
1    283253
Name: count, dtype: int64


In [114]:
# Cell 15 — Save CreditCard Features
credit_encoded.to_csv("../data/processed/creditcard_features.csv", index=False)


## ✅ Summary
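- Both datasets are transformed and feature-engineered: time-based and transaction-frequency features, one-hot encoding, and standard scaling.
- Class imbalance is documented (about 9.5% fraud in the fraud data, about 0.17% in the credit card data), and SMOTE balances both datasets to 1:1.
- The saved feature files contain the original, un-resampled rows; the SMOTE output is held in memory only.
- Ready for modeling in Task 2.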
