# **Problem Statement**  
## **5. Design a Feature Engineering Pipeline for E-commerce Data.**

### Problem Statement

The goal is to design a robust feature engineering pipeline for e-commerce transaction data that transforms raw customer, product, and transaction logs into model-ready numerical features suitable for machine learning tasks such as churn prediction, fraud detection, or recommendation systems.

### Constraints & Example Inputs/Outputs

### Constraints
- Mixed data types (numerical, categorical, datetime)
- Missing values present
- High cardinality categorical features
- Pipeline must be reusable and scalable
- Should support both training & inference

### Example Raw Input:
| user_id | product | category    | price | quantity | purchase_time |
| ------- | ------- | ----------- | ----- | -------- | ------------- |
| U1      | iPhone  | Electronics | 80000 | 1        | 2024-01-10    |
| U2      | Shoes   | Fashion     | 3000  | 2        | 2024-01-11    |

### Expected Output (Model-Ready):
| total_spent | avg_order_value | recency_days | category_encoded |
| ----------- | --------------- | ------------ | ---------------- |
| 80000       | 80000           | 5            | 2                |

### Solution Approach

### Step 1: Understand Business Signals
From e-commerce data, useful signals include:
- Customer value (total spend, frequency)
- Purchase behavior (recency, basket size)
- Product preference (category affinity)
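The signals listed above could be derived with pandas along these lines. This is a minimal sketch that treats each row as one order; the column names `order_count`, `avg_basket_size`, and the `affinity_` prefix are hypothetical.

```python
import pandas as pd

# Small illustrative order log; each row is assumed to be one order.
orders = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "category": ["Electronics", "Electronics", "Fashion"],
    "price": [80000, 2000, 3000],
    "quantity": [1, 1, 2],
})
orders["amount"] = orders["price"] * orders["quantity"]

# Customer value and purchase behavior: one row per user.
behavior = orders.groupby("user_id").agg(
    order_count=("amount", "size"),        # purchase frequency
    avg_basket_size=("quantity", "mean"),  # items per order
)

# Category affinity: each category's share of a user's total spend.
spend = orders.pivot_table(
    index="user_id", columns="category", values="amount",
    aggfunc="sum", fill_value=0,
)
affinity = spend.div(spend.sum(axis=1), axis=0).add_prefix("affinity_")

signals = behavior.join(affinity)
```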

### Step 2: Feature Types
| Feature Type | Examples               |
| ------------ | ---------------------- |
| Numerical    | price, quantity        |
| Categorical  | category, product      |
| Temporal     | recency, frequency     |
| Aggregated   | total_spent, avg_order |

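### Step 3: Why a Pipeline?
- Prevents data leakage: imputation, scaling, and encoding statistics are learned from the training data only
- Applies the same transformations in training and inference
- Simplifies deployment: the fitted pipeline is a single object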
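The leakage point can be illustrated with a scaler that is fitted on training rows and only applied to held-out rows. The split below is a hypothetical one chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

prices = pd.DataFrame({"price": [80000, 2000, 3000, 1500, 5000, 10000]})

# Assumed split: the first four rows are training data, the last two are held out.
train, test = prices.iloc[:4], prices.iloc[4:]

scaler = StandardScaler()
scaler.fit(train)                     # mean and std come from training rows only
test_scaled = scaler.transform(test)  # held-out rows reuse the training statistics

# Fitting on `prices` as a whole would let held-out values shift the mean and std,
# which is the leakage a pipeline's fit/transform separation avoids.
train_mean = scaler.mean_[0]
```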

### Solution Code

```python
# Approach 1: Brute-Force Feature Engineering (Manual)
import pandas as pd
from datetime import datetime

data = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "category": ["Electronics", "Electronics", "Fashion"],
    "price": [80000, 2000, 3000],
    "quantity": [1, 1, 2],
    "purchase_time": pd.to_datetime(
        ["2024-01-10", "2024-01-15", "2024-01-11"]
    )
})

# Amount per transaction row, then aggregate per user
data["total_amount"] = data["price"] * data["quantity"]
user_features = data.groupby("user_id").agg(
    total_spent=("total_amount", "sum"),
    avg_order_value=("total_amount", "mean"),
    last_purchase=("purchase_time", "max")
)

# Days since the last purchase, measured from a fixed reference date
user_features["recency_days"] = (
    datetime(2024, 1, 20) - user_features["last_purchase"]
).dt.days

user_features
```


| user_id | total_spent | avg_order_value | last_purchase | recency_days |
| ------- | ----------- | --------------- | ------------- | ------------ |
| U1      | 82000       | 41000.0         | 2024-01-15    | 5            |
| U2      | 6000        | 6000.0          | 2024-01-11    | 9            |


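#### Limitations of the Brute-Force Approach
- Hard to reuse: the logic lives in one-off notebook code
- No standardization: nothing is imputed, scaled, or encoded
- Breaks in production: the reference date is hard-coded, and missing values or new categories are not handled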
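One way to address the reuse and hard-coded date limitations is to wrap the aggregation in a function that takes the reference date as a parameter. The function name `build_user_features` and the `as_of` argument are hypothetical; this is a sketch, not the original author's design.

```python
import pandas as pd


def build_user_features(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate transaction rows into one feature row per user."""
    tx = transactions.assign(
        total_amount=transactions["price"] * transactions["quantity"]
    )
    features = tx.groupby("user_id").agg(
        total_spent=("total_amount", "sum"),
        avg_order_value=("total_amount", "mean"),
        last_purchase=("purchase_time", "max"),
    )
    # Recency is measured from the caller-supplied date, not a hard-coded one.
    features["recency_days"] = (as_of - features["last_purchase"]).dt.days
    return features.drop(columns="last_purchase")


transactions = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "price": [80000, 2000, 3000],
    "quantity": [1, 1, 2],
    "purchase_time": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-01-11"]),
})
user_features = build_user_features(transactions, as_of=pd.Timestamp("2024-01-20"))
```

Calling the same function with the training cutoff date during training and with the current date at inference keeps `recency_days` consistent between the two.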

### Alternative Solution

```python
# Approach 2: Optimized (Production-Ready Feature Pipeline)

# Step 1: Create Raw Dataset

data = pd.DataFrame({
    "price": [80000, 2000, 3000, 1500],
    "quantity": [1, 1, 2, 3],
    "category": ["Electronics", "Electronics", "Fashion", "Fashion"]
})
```


```python
# Step 2: Build Feature Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_features = ["price", "quantity"]
cat_features = ["category"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

feature_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("cat", cat_pipeline, cat_features)
])
```


```python
# Step 3: Transform Data

X_transformed = feature_pipeline.fit_transform(data)
X_transformed
```

```
array([[ 1.73182848, -0.90453403,  1.        ,  0.        ],
       [-0.58222071, -0.90453403,  1.        ,  0.        ],
       [-0.55255341,  0.30151134,  0.        ,  1.        ],
       [-0.59705436,  1.50755672,  0.        ,  1.        ]])
```

### Alternative Approaches

**Brute Force**
- Pandas aggregation
- Manual encoding

**Optimized**
- sklearn Pipelines
- Feature Store (Feast)
- Spark Feature Pipelines

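**Why Does the Pipeline Win?**
- Reproducible: fitted statistics and categories are stored in one object
- Deployable: the same fitted object serves training and inference
- Scales well: new columns or steps are added without rewriting existing logic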
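To make the deployability point concrete, a fitted pipeline can be saved and reloaded with `joblib`. This is a minimal sketch with a simplified pipeline; the file name `feature_pipeline.joblib` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "price": [80000, 2000, 3000, 1500],
    "quantity": [1, 1, 2, 3],
    "category": ["Electronics", "Electronics", "Fashion", "Fashion"],
})

# Simplified pipeline (no imputers) to keep the sketch short.
pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["price", "quantity"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])
pipeline.fit(train)

# Training side: persist the fitted pipeline. Serving side: load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

new_rows = pd.DataFrame({"price": [5000], "quantity": [2], "category": ["Fashion"]})
original_out = np.asarray(pipeline.transform(new_rows))
restored_out = np.asarray(restored.transform(new_rows))
outputs_match = np.allclose(original_out, restored_out)
```

Because the restored object carries the training mean, standard deviation, and category list, inference code only needs to call `transform`.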

### Test Case

```python
# Test Case 1: Normal Input
test_data = pd.DataFrame({
    "price": [5000],
    "quantity": [2],
    "category": ["Fashion"]
})

feature_pipeline.transform(test_data)
```

```
array([[-0.49321882,  0.30151134,  0.        ,  1.        ]])
```

**Expected**
- Scaled numeric values
- One-hot encoded category

```python
# Test Case 2: Unseen Category (Production Case)
test_data = pd.DataFrame({
    "price": [10000],
    "quantity": [1],
    "category": ["Groceries"]
})

feature_pipeline.transform(test_data)
```

```
array([[-0.34488233, -0.90453403,  0.        ,  0.        ]])
```

**Expected**
- No error
- Unseen category encoded as all zeros in the one-hot columns (`handle_unknown="ignore"`)

```python
# Test Case 3: Missing Values
test_data = pd.DataFrame({
    "price": [None],
    "quantity": [2],
    "category": [None]
})

feature_pipeline.transform(test_data)
```

```
array([[-0.56738706,  0.30151134,  0.        ,  0.        ]])
```

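**Expected**
- `price` is imputed with the training median (2500) and then scaled, giving -0.567
- `category` is not imputed with the mode in this output: both one-hot columns are 0, so the `None` was encoded like an unseen category. `SimpleImputer` matches `np.nan` by default, so a Python `None` in a string column appears to slip past it; representing missing categories as `np.nan` is the likely fix if mode imputation is intended
- Pipeline still runs without error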

### Business Use Case
**Applications**
- Churn prediction
- Fraud detection
- Recommendation systems
- Customer lifetime value modeling

**Business Value**
- Consistent features
- Faster model iteration
- Fewer production bugs

## Complexity Analysis

| Step        | Complexity        |
| ----------- | ----------------- |
| Aggregation | O(n)              |
| Scaling     | O(n × f)          |
| Encoding    | O(n × categories) |

- Space: O(n × features) 
