# 02 - Feature Engineering

**Day 2, Part 1: Feature Engineering**

## Objectives
1. Load processed data from Day 1
2. Create RFM features (customer-level)
3. Create temporal features (order-level)
4. Create geographic features (order-level)
5. Create product features (order-level)
6. Create NLP features (order-level)
7. Create payment features (order-level)
8. Build sklearn preprocessing pipeline
9. Save all artifacts

## Key Principles
- **Fit on training data ONLY** - Fit every transformer on the training split, then apply the fitted transformer unchanged to val/test
- **No data leakage** - Compute customer aggregations from the training set only
- **Reproducibility** - Save all fitted artifacts so production can reuse them
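
To make the first principle concrete, here is a minimal sketch using scikit-learn's `StandardScaler` on toy arrays. The data and the variable names (`toy_train`, `toy_val`) are illustrative assumptions, not part of this project's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for one numerical feature in two splits (assumed data)
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_val = np.array([[40.0], [50.0]])

scaler = StandardScaler()
scaler.fit(toy_train)  # mean and std are learned from the training split only

toy_train_scaled = scaler.transform(toy_train)
toy_val_scaled = scaler.transform(toy_val)  # reuse training statistics; never refit on val/test

print("Mean learned from train:", scaler.mean_)
print("Scaled validation values:", toy_val_scaled.ravel())
```

The validation values land well above zero because they are scaled with the training mean. That is expected: refitting on validation would hide the shift between splits.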

## Setup

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import our feature engineering module
from src.feature_engineering import (
    load_geolocation_data,
    engineer_features,
    save_feature_artifacts,
    create_preprocessing_pipeline,
    NUMERICAL_FEATURES,
    CATEGORICAL_FEATURES,
    BINARY_FEATURES,
    RFM_FEATURES,
)

# Settings
pd.set_option('display.max_columns', 50)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete!")

## Step 1: Load Processed Data from Day 1

In [None]:
# Load processed data from Day 1
data_dir = Path("../data/processed")

train_df = pd.read_parquet(data_dir / "train_processed.parquet")
val_df = pd.read_parquet(data_dir / "val_processed.parquet")
test_df = pd.read_parquet(data_dir / "test_processed.parquet")

print(f"Train: {train_df.shape[0]:,} rows × {train_df.shape[1]} columns")
print(f"Val:   {val_df.shape[0]:,} rows × {val_df.shape[1]} columns")
print(f"Test:  {test_df.shape[0]:,} rows × {test_df.shape[1]} columns")

# Load geolocation data
geo_df = load_geolocation_data("../data/raw/olist_geolocation_dataset.csv")

In [None]:
# Check date range (reference date for RFM)
print("Date ranges:")
print(f"Train: {train_df['order_purchase_timestamp'].min().date()} to {train_df['order_purchase_timestamp'].max().date()}")
print(f"Val:   {val_df['order_purchase_timestamp'].min().date()} to {val_df['order_purchase_timestamp'].max().date()}")
print(f"Test:  {test_df['order_purchase_timestamp'].min().date()} to {test_df['order_purchase_timestamp'].max().date()}")

reference_date = train_df['order_purchase_timestamp'].max()
print(f"\nReference date for RFM: {reference_date}")

In [None]:
# Original columns
print(f"Original columns ({len(train_df.columns)}):")
print(train_df.columns.tolist())

## Step 2: Run Feature Engineering Pipeline

A single call to `engineer_features` creates the following feature groups:
- **RFM**: recency, frequency, monetary, avg_order_value, monetary_per_day
- **Temporal**: hour, dayofweek, month + cyclical encoding + binary flags
- **Geographic**: distance_km, is_same_state, customer/seller regions
- **Product**: volume, density, price ratios
- **NLP**: text stats, sentiment
- **Payment**: installment features
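
As a rough illustration of the RFM group, the sketch below derives recency, frequency, monetary, and avg_order_value from a toy orders table. The `customer_id` key, the toy values, and the exact definitions are assumptions; the real logic lives in `src.feature_engineering` and may differ. `monetary_per_day` is left out because its definition is not given in this notebook.

```python
import pandas as pd

# Hypothetical toy orders; `customer_id` is an assumed key column
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-01-21", "2018-01-11"]
    ),
    "payment_value": [100.0, 50.0, 80.0],
})

# Reference date taken from the training data, as in Step 1
rfm_reference_date = orders["order_purchase_timestamp"].max()

rfm = orders.groupby("customer_id").agg(
    last_purchase=("order_purchase_timestamp", "max"),
    frequency=("order_purchase_timestamp", "count"),
    monetary=("payment_value", "sum"),
)
# Recency: days between the reference date and the customer's last purchase
rfm["recency"] = (rfm_reference_date - rfm["last_purchase"]).dt.days
rfm["avg_order_value"] = rfm["monetary"] / rfm["frequency"]
rfm = rfm.drop(columns="last_purchase")

print(rfm)
```

Because the aggregation runs on training orders only, customers who first appear in val/test have no RFM row and need a default value when the features are merged back.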

In [None]:
# Run the complete feature engineering pipeline
result = engineer_features(
    train_df=train_df,
    val_df=val_df,
    test_df=test_df,
    geo_df=geo_df,
    reference_date=reference_date,
)

In [None]:
# Extract results
train_featured = result['train']
val_featured = result['val']
test_featured = result['test']
customer_features = result['customer_features']
category_stats = result['category_stats']

print(f"\nFeatured data shapes:")
print(f"Train: {train_featured.shape}")
print(f"Val:   {val_featured.shape}")
print(f"Test:  {test_featured.shape}")
print(f"Customer features: {customer_features.shape}")

## Step 3: Analyze New Features
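
Before inspecting the distributions, it may help to recall why columns such as `order_hour_sin` and `order_hour_cos` exist. The sketch below uses the common sin/cos convention for a 24-hour cycle. The exact formula used in `src.feature_engineering` is not shown here, so treat the helper names as hypothetical.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return float(np.hypot(s1 - s2, c1 - c2))

late_vs_midnight = encoded_distance(23, 0)  # adjacent hours across midnight
noon_vs_midnight = encoded_distance(12, 0)  # opposite sides of the clock

print(f"23h vs 0h: {late_vs_midnight:.3f}")
print(f"12h vs 0h: {noon_vs_midnight:.3f}")
```

In raw form, hours 23 and 0 are 23 units apart. In the encoded space they are neighbors, which is the property the cyclical features are meant to provide.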

In [None]:
# New columns created
original_cols = set(train_df.columns)
new_cols = set(train_featured.columns) - original_cols

print(f"New features created: {len(new_cols)}")
print("\nNew columns:")
for col in sorted(new_cols):
    print(f"  - {col}")

In [None]:
# RFM Features Summary
print("=" * 50)
print("RFM FEATURES (Customer-Level)")
print("=" * 50)
print(customer_features[RFM_FEATURES].describe())

In [None]:
# Visualize RFM distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for idx, col in enumerate(RFM_FEATURES):
    ax = axes[idx // 3, idx % 3]
    customer_features[col].hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
    ax.set_title(f'{col}', fontsize=12)
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')

# Hide any unused subplots (the 2x3 grid has 6 slots)
for unused_ax in axes.ravel()[len(RFM_FEATURES):]:
    unused_ax.axis('off')

plt.suptitle('RFM Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Temporal Features
print("=" * 50)
print("TEMPORAL FEATURES")
print("=" * 50)

temporal_cols = ['order_hour', 'order_dayofweek', 'order_month', 'is_weekend', 
                 'is_month_start', 'is_month_end', 'order_hour_sin', 'order_hour_cos']
print(train_featured[temporal_cols].describe())

In [None]:
# Visualize temporal patterns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Hour distribution
train_featured['order_hour'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Orders by Hour', fontsize=12)
axes[0].set_xlabel('Hour')
axes[0].set_ylabel('Count')

# Day of week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Reindex to all 7 days (0=Mon ... 6=Sun) so the bars always match the labels
day_counts = train_featured['order_dayofweek'].value_counts().reindex(range(7), fill_value=0)
axes[1].bar(days, day_counts.values, color='coral')
axes[1].set_title('Orders by Day of Week', fontsize=12)
axes[1].set_xlabel('Day')
axes[1].set_ylabel('Count')

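# Weekend vs Weekday
# value_counts() orders by frequency, so sort by index to keep labels aligned
# (0/False = weekday first, 1/True = weekend second)
weekend_counts = train_featured['is_weekend'].value_counts().sort_index()
axes[2].pie(weekend_counts.values, labels=['Weekday', 'Weekend'], autopct='%1.1f%%',
            colors=['#66b3ff', '#ff9999'])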
axes[2].set_title('Weekend vs Weekday', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Geographic Features
print("=" * 50)
print("GEOGRAPHIC FEATURES")
print("=" * 50)

geo_cols = ['seller_customer_distance_km', 'is_same_state', 'customer_region', 'seller_region']
print(f"Average distance: {train_featured['seller_customer_distance_km'].mean():.1f} km")
print(f"Same state orders: {train_featured['is_same_state'].mean():.1%}")
print(f"\nCustomer regions:\n{train_featured['customer_region'].value_counts()}")

In [None]:
# Distance distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Distance histogram
train_featured['seller_customer_distance_km'].hist(bins=50, ax=axes[0], color='green', edgecolor='black', alpha=0.7)
axes[0].set_title('Seller-Customer Distance Distribution', fontsize=12)
axes[0].set_xlabel('Distance (km)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(train_featured['seller_customer_distance_km'].median(), color='red', linestyle='--', label='Median')
axes[0].legend()

# Region distribution
train_featured['customer_region'].value_counts().plot(kind='bar', ax=axes[1], color='purple', edgecolor='black')
axes[1].set_title('Orders by Customer Region', fontsize=12)
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# Product Features
print("=" * 50)
print("PRODUCT FEATURES")
print("=" * 50)

product_cols = ['product_volume_cm3', 'product_density', 'price_per_kg', 'freight_ratio', 
                'price_vs_category_mean', 'price_vs_category_zscore']
print(train_featured[product_cols].describe())

In [None]:
# NLP Features
print("=" * 50)
print("NLP FEATURES")
print("=" * 50)

nlp_cols = ['has_review_comment', 'review_text_length', 'review_word_count', 
            'review_exclamation_count', 'review_caps_ratio',
            'review_sentiment_polarity', 'review_sentiment_subjectivity']
print(train_featured[nlp_cols].describe())

In [None]:
# Mean sentiment polarity and word count by review score
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sentiment by review score
sentiment_by_score = train_featured.groupby('review_score')['review_sentiment_polarity'].mean()
sentiment_by_score.plot(kind='bar', ax=axes[0], color='orange', edgecolor='black')
axes[0].set_title('Average Sentiment Polarity by Review Score', fontsize=12)
axes[0].set_xlabel('Review Score')
axes[0].set_ylabel('Avg Sentiment Polarity')
axes[0].tick_params(axis='x', rotation=0)

# Word count by review score
wordcount_by_score = train_featured.groupby('review_score')['review_word_count'].mean()
wordcount_by_score.plot(kind='bar', ax=axes[1], color='teal', edgecolor='black')
axes[1].set_title('Average Word Count by Review Score', fontsize=12)
axes[1].set_xlabel('Review Score')
axes[1].set_ylabel('Avg Word Count')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# Payment Features
print("=" * 50)
print("PAYMENT FEATURES")
print("=" * 50)

payment_cols = ['payment_per_installment', 'is_full_payment', 'is_high_installment']
print(train_featured[payment_cols].describe())

print(f"\nFull payment rate: {train_featured['is_full_payment'].mean():.1%}")
print(f"High installment rate (>=6): {train_featured['is_high_installment'].mean():.1%}")

## Step 4: Feature Correlation Analysis
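
One thing to keep in mind when ranking features against a target is that a descending sort pushes strong negative correlations to the bottom. The toy sketch below uses assumed data and hypothetical column names. It ranks by absolute correlation and keeps the target out of its own ranking.

```python
import pandas as pd

# Toy frame: one strongly negative and one moderately positive feature (assumed data)
toy = pd.DataFrame({
    "strong_neg": [5, 4, 3, 2, 1],
    "weak_pos": [1, 3, 2, 5, 4],
    "target": [1, 2, 3, 4, 5],
})

# Exclude the target column so its self-correlation (always 1.0) is not ranked
toy_features = toy.drop(columns="target")
toy_corr = toy_features.corrwith(toy["target"])

# Rank by magnitude while keeping the sign visible
ranked = toy_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Here `strong_neg` ranks first even though its correlation is negative. A plain descending sort would have listed it last.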

In [None]:
# Select numerical features for correlation
numerical_for_corr = [
    'price', 'freight_value', 'payment_value',
    'product_weight_g', 'product_volume_cm3', 'seller_customer_distance_km',
    'freight_ratio', 'review_sentiment_polarity', 'review_word_count',
    'recency', 'frequency', 'monetary', 'delivery_days', 'is_satisfied'
]

# Filter to columns that exist
numerical_for_corr = [c for c in numerical_for_corr if c in train_featured.columns]

# Compute correlation matrix
corr_matrix = train_featured[numerical_for_corr].corr()

# Plot
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlations with target variables
print("=" * 50)
print("CORRELATIONS WITH TARGETS")
print("=" * 50)

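# Classification target: is_satisfied
# The target's own column is dropped (self-correlation is always 1.0), and rows are
# ranked by absolute value so strong negative correlations are not cut off by head(10).
print("\nCorrelations with is_satisfied (classification target):")
corr_with_satisfied = (train_featured[numerical_for_corr].drop(columns='is_satisfied')
                       .corrwith(train_featured['is_satisfied'])
                       .sort_values(key=lambda s: s.abs(), ascending=False))
print(corr_with_satisfied.head(10))

# Regression target: delivery_days
print("\nCorrelations with delivery_days (regression target):")
corr_with_delivery = (train_featured[numerical_for_corr].drop(columns='delivery_days')
                      .corrwith(train_featured['delivery_days'])
                      .sort_values(key=lambda s: s.abs(), ascending=False))
print(corr_with_delivery.head(10))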

## Step 5: Customer-Level Features Analysis (for Clustering)

In [None]:
# Customer features summary
print("=" * 50)
print("CUSTOMER-LEVEL FEATURES (for Clustering)")
print("=" * 50)
print(f"\nTotal customers: {len(customer_features):,}")
print("\nFeature statistics:")
print(customer_features.describe())

In [None]:
# Customer feature correlations
cluster_features = ['recency', 'frequency', 'monetary', 'avg_order_value', 
                   'avg_review_score', 'avg_delivery_days', 'late_delivery_rate']

plt.figure(figsize=(8, 6))
corr = customer_features[cluster_features].corr()
sns.heatmap(corr, annot=True, cmap='RdBu_r', center=0, fmt='.2f', square=True)
plt.title('Customer Features Correlation (for Clustering)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 6: Build and Test Preprocessing Pipeline
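
The internals of `create_preprocessing_pipeline` live in `src.feature_engineering` and are not shown in this notebook. As a rough sketch only, the block below assumes it follows the common scikit-learn pattern of a `ColumnTransformer` with one branch per feature type. The imputation strategy, the toy columns, and the `sketch_preprocessor` name are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

sketch_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical: fill gaps with the training median, then standardize
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["price"]),
        # Categorical: one-hot encode; unseen categories become all-zero rows
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["customer_region"]),
        # Binary flags are passed through unchanged
        ("bin", "passthrough", ["is_weekend"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_train_df = pd.DataFrame({
    "price": [10.0, None, 30.0],
    "customer_region": ["Southeast", "South", "Southeast"],
    "is_weekend": [0, 1, 0],
})
toy_val_df = pd.DataFrame({
    "price": [20.0],
    "customer_region": ["North"],  # not present in the training data
    "is_weekend": [1],
})

toy_train_out = sketch_preprocessor.fit_transform(toy_train_df)
toy_val_out = sketch_preprocessor.transform(toy_val_df)

print(toy_train_out.shape, toy_val_out.shape)
```

With `handle_unknown="ignore"`, the unseen region `North` is encoded as zeros in both one-hot columns instead of raising an error at transform time.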

In [None]:
# Create preprocessing pipeline
preprocessor = create_preprocessing_pipeline(
    numerical_features=NUMERICAL_FEATURES,
    categorical_features=CATEGORICAL_FEATURES,
    binary_features=BINARY_FEATURES
)

print(f"\nNumerical features: {len(NUMERICAL_FEATURES)}")
print(f"Categorical features: {len(CATEGORICAL_FEATURES)}")
print(f"Binary features: {len(BINARY_FEATURES)}")

In [None]:
# Fit on training data
X_train = train_featured[NUMERICAL_FEATURES + CATEGORICAL_FEATURES + BINARY_FEATURES].copy()

# Report remaining NaN values (this cell only counts them; it does not impute)
print(f"NaN values before fit: {X_train.isnull().sum().sum()}")

# Fit and transform
X_train_transformed = preprocessor.fit_transform(X_train)

print(f"\nTransformed shape: {X_train_transformed.shape}")
print(f"Original features: {len(NUMERICAL_FEATURES) + len(CATEGORICAL_FEATURES) + len(BINARY_FEATURES)}")
print(f"After one-hot encoding: {X_train_transformed.shape[1]}")

In [None]:
# Get feature names after transformation
from src.feature_engineering import get_feature_names_from_pipeline

feature_names = get_feature_names_from_pipeline(
    preprocessor, NUMERICAL_FEATURES, CATEGORICAL_FEATURES, BINARY_FEATURES
)

print(f"Total feature names: {len(feature_names)}")
print(f"\nSample feature names:")
print(feature_names[:10])

## Step 7: Save All Artifacts
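
A saved artifact is only useful if it reloads to an equivalent object. The sketch below shows a minimal round-trip check with `joblib`, using a toy `StandardScaler` and a temporary directory. The toy data and variable names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[1.0], [2.0], [3.0]])
fitted = StandardScaler().fit(toy_X)

# Write to a temporary directory so the check leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "feature_pipeline.joblib"
    joblib.dump(fitted, artifact_path)
    reloaded = joblib.load(artifact_path)

# The reloaded transformer should produce the same output as the original
roundtrip_ok = np.allclose(fitted.transform(toy_X), reloaded.transform(toy_X))
print("Round-trip matches:", roundtrip_ok)
```

The same comparison could be run on a few training rows after saving the real pipeline. Note that joblib files are generally only safe to load with library versions compatible with those used to save them.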

In [None]:
# Save all artifacts
save_feature_artifacts(
    result=result,
    output_dir="../data/processed",
    models_dir="../models"
)

In [None]:
# Save preprocessing pipeline
import joblib

joblib.dump(preprocessor, '../models/feature_pipeline.joblib')
print("✓ Saved preprocessing pipeline to models/feature_pipeline.joblib")

In [None]:
# Verify saved files (Path is already imported in Setup)

print("\n" + "=" * 50)
print("SAVED FILES")
print("=" * 50)

# Data files
data_dir = Path("../data/processed")
for f in sorted(data_dir.glob("*.parquet")):
    size_mb = f.stat().st_size / 1024 / 1024
    print(f"  {f.name}: {size_mb:.2f} MB")

# Model files
models_dir = Path("../models")
for f in sorted(models_dir.glob("*")):
    if f.is_file():
        size_kb = f.stat().st_size / 1024
        print(f"  {f.name}: {size_kb:.2f} KB")

## Summary

### Features Created

| Category | Count | Examples |
|----------|-------|----------|
| RFM | 5 | recency, frequency, monetary, avg_order_value, monetary_per_day |
| Temporal | 12 | order_hour, is_weekend, order_hour_sin, order_hour_cos |
| Geographic | 6 | seller_customer_distance_km, is_same_state, customer_region |
| Product | 6 | product_volume_cm3, freight_ratio, price_vs_category_zscore |
| NLP | 8 | review_sentiment_polarity, review_word_count, review_caps_ratio |
| Payment | 3 | payment_per_installment, is_full_payment, is_high_installment |

### Artifacts Saved

- `data/processed/train_featured.parquet` - Training data with all features
- `data/processed/val_featured.parquet` - Validation data with all features  
- `data/processed/test_featured.parquet` - Test data with all features
- `data/processed/customer_segments.parquet` - Customer-level features for clustering
- `models/feature_pipeline.joblib` - sklearn preprocessing pipeline
- `models/category_stats.csv` - Category price statistics
- `models/feature_names.json` - Feature column names

### Next Steps

Continue to **Day 2, Part 2: Unsupervised Learning (Clustering)** in notebook `03_unsupervised_learning.ipynb`

In [None]:
print("\n" + "=" * 60)
print("DAY 2, PART 1: FEATURE ENGINEERING COMPLETE!")
print("=" * 60)
print(f"""
✓ Created {len(new_cols)} new features
✓ Saved train/val/test featured data
✓ Saved customer-level features for clustering
✓ Saved preprocessing pipeline

Ready for Part 2: Clustering!
""")