# py-recipes Demo: Feature Engineering Pipelines

This notebook demonstrates the **py-recipes** package for creating reproducible feature engineering pipelines.

## What is py-recipes?

**py-recipes** provides a consistent interface for:
- **Preprocessing**: Normalization, imputation, encoding
- **Feature engineering**: Custom transformations
- **Pipeline composition**: Chain multiple steps
- **Train/test consistency**: No data leakage

## Key Concepts

1. **Recipe**: Specification of preprocessing steps (immutable)
2. **prep()**: Fit recipe to training data
3. **bake()**: Apply fitted recipe to new data
4. **PreparedRecipe**: Fitted recipe ready to transform
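
To make these four concepts concrete, here is a minimal sketch of the prep/bake pattern in plain pandas. It is not how py-recipes is implemented: the class names `SketchRecipe` and `SketchPrepared` are hypothetical, and the sketch only supports z-score scaling.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class SketchPrepared:
    """Hypothetical fitted recipe: holds statistics learned from training data."""
    means: pd.Series
    stds: pd.Series

    def bake(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        cols = list(self.means.index)
        # Reuse the stored training statistics; never recompute them here.
        out[cols] = (df[cols] - self.means) / self.stds
        return out


@dataclass(frozen=True)
class SketchRecipe:
    """Hypothetical unfitted recipe: only a specification, no learned state."""
    columns: tuple

    def prep(self, train: pd.DataFrame) -> SketchPrepared:
        cols = list(self.columns)
        return SketchPrepared(means=train[cols].mean(), stds=train[cols].std())


toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
toy_new = pd.DataFrame({"x": [4.0]})

prepared = SketchRecipe(columns=("x",)).prep(toy_train)
baked_new = prepared.bake(toy_new)  # scaled with the training mean and std
```

The specification stays unchanged after `prep()`, which returns a separate fitted object. That separation is what lets one specification be fitted to different training sets.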

---

## 1. Setup

First, let's import the necessary packages and create sample data.

In [None]:
import pandas as pd
import numpy as np
from py_recipes import recipe
from py_workflows import workflow
from py_parsnip import linear_reg

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Create sample data with numeric and categorical columns (missing values are added below)
n = 200

data = pd.DataFrame({
    "sales": np.random.randn(n) * 1000 + 5000,
    "price": np.random.randn(n) * 10 + 50,
    "advertising": np.random.randn(n) * 500 + 1000,
    "temperature": np.random.randn(n) * 15 + 20,
    "store_type": np.random.choice(["Mall", "Street", "Online"], n),
    "region": np.random.choice(["North", "South", "East", "West"], n)
})

# Add some missing values
missing_idx = np.random.choice(n, size=20, replace=False)
data.loc[missing_idx, "advertising"] = np.nan

print(f"Data shape: {data.shape}")
print(f"\nMissing values:\n{data.isna().sum()}")
print(f"\nFirst few rows:")
data.head()

In [None]:
# Split into train/test by position (first 80% of rows for training)
train_idx = int(0.8 * n)
train = data.iloc[:train_idx].copy()
test = data.iloc[train_idx:].copy()

print(f"Train: {len(train)} | Test: {len(test)}")

---

## 2. Basic Recipe Usage

A recipe follows the **prep/bake** pattern:
1. **Create** a recipe specification
2. **prep()** fits the recipe to training data
3. **bake()** applies the fitted recipe to new data

In [None]:
# Create a simple recipe
rec = recipe()
print(f"Recipe: {rec}")
print(f"Steps: {rec.steps}")

In [None]:
# Add a normalization step
rec = recipe().step_normalize()
print(f"Number of steps in recipe: {len(rec.steps)}")

In [None]:
# Fit (prep) the recipe to training data
rec_fit = rec.prep(train)
print(f"Prepared recipe: {rec_fit}")
print(f"Prepared steps: {len(rec_fit.prepared_steps)}")

In [None]:
# Apply (bake) to new data
train_transformed = rec_fit.bake(train)

print("\nOriginal train data (first 3 rows):")
print(train[["sales", "price", "advertising"]].head(3))

print("\nTransformed train data (first 3 rows):")
print(train_transformed[["sales", "price", "advertising"]].head(3))

print("\nMeans after normalization:")
print(train_transformed[["sales", "price", "advertising", "temperature"]].mean())

print("\nStd devs after normalization:")
print(train_transformed[["sales", "price", "advertising", "temperature"]].std())

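**Key Point**: `prep()` learns its parameters (here, each column's mean and standard deviation) from the **training data** only, and `bake()` applies those same parameters to the **test data**.

Because no test-set statistics enter the transformation, this prevents **data leakage**.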
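
The effect is easy to see numerically. The sketch below uses plain NumPy rather than py-recipes, with made-up numbers: rescaling the test set with its own statistics hides a real shift between train and test, while reusing the training statistics preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(loc=50.0, scale=10.0, size=1000)
test_x = rng.normal(loc=60.0, scale=10.0, size=1000)  # shifted distribution

# Consistent: statistics come from the training data only.
mu, sigma = train_x.mean(), train_x.std()
test_scaled = (test_x - mu) / sigma

# Inconsistent: statistics are recomputed on the test data.
test_refit = (test_x - test_x.mean()) / test_x.std()

print(f"train-fitted scaling, test mean: {test_scaled.mean():.2f}")
print(f"test-fitted scaling, test mean:  {test_refit.mean():.2f}")
```

With train-fitted scaling the test mean lands near 1 (the shift measured in training standard deviations), whereas refitting on the test set forces it to 0 and erases that signal.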

In [None]:
# Apply the SAME fitted recipe to test data
test_transformed = rec_fit.bake(test)

print("Test data means (close to, but not exactly, zero: statistics come from train):")
print(test_transformed[["sales", "price", "advertising", "temperature"]].mean())

---

## 3. Normalization (step_normalize)

Normalize numeric columns using:
- **zscore** (default): standardization (mean=0, std=1 on the training data)
- **minmax**: scaling so that the training data spans the [0, 1] range
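
As a rough illustration of the two methods, the sketch below writes both formulas out in plain pandas on made-up values. It assumes the sample standard deviation (pandas' default, `ddof=1`); the notebook does not state which convention py-recipes uses.

```python
import pandas as pd

train_col = pd.Series([10.0, 20.0, 30.0, 40.0])
new_col = pd.Series([25.0, 50.0])

# z-score: subtract the training mean, divide by the training std.
# pandas uses the sample std (ddof=1); py-recipes may use a different convention.
mean, std = train_col.mean(), train_col.std()
new_zscore = (new_col - mean) / std

# min-max: map the training minimum to 0 and the training maximum to 1.
lo, hi = train_col.min(), train_col.max()
new_minmax = (new_col - lo) / (hi - lo)

print(new_zscore.tolist())
print(new_minmax.tolist())
```

The second new value (50.0) lies above the training maximum, so its min-max score exceeds 1. The [0, 1] bound is only guaranteed for the data the scaler was fitted on.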

### Z-score Normalization

In [None]:
# Normalize specific columns
rec_zscore = (
    recipe()
    .step_normalize(columns=["price", "advertising"], method="zscore")
)

rec_zscore_fit = rec_zscore.prep(train)
train_zscore = rec_zscore_fit.bake(train)

print("Z-score normalized columns:")
print(train_zscore[["price", "advertising"]].describe())

### MinMax Normalization

In [None]:
# MinMax scaling [0, 1]
rec_minmax = (
    recipe()
    .step_normalize(columns=["price", "advertising"], method="minmax")
)

rec_minmax_fit = rec_minmax.prep(train)
train_minmax = rec_minmax_fit.bake(train)

print("MinMax normalized columns:")
print(train_minmax[["price", "advertising"]].describe())

---

## 4. Dummy Variables (step_dummy)

Convert categorical variables to one-hot encoded dummy variables.
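
One-hot encoding also has a "fit" component: the set of category levels. The sketch below shows one way to keep dummy columns consistent between train and test using plain pandas. The handling of the unseen level `"Kiosk"` (encoded as all zeros) is an assumption made for illustration; the notebook does not say how py-recipes treats unseen levels.

```python
import pandas as pd

toy_train = pd.DataFrame({"store_type": ["Mall", "Street", "Online", "Mall"]})
toy_test = pd.DataFrame({"store_type": ["Street", "Kiosk"]})  # "Kiosk" is unseen

# "Fit": remember the levels observed in training.
levels = sorted(toy_train["store_type"].unique())


def encode(df: pd.DataFrame) -> pd.DataFrame:
    # A fixed categorical dtype keeps the dummy columns identical across
    # datasets; unseen levels become NaN and encode as all zeros.
    cat = pd.Series(
        pd.Categorical(df["store_type"], categories=levels), index=df.index
    )
    return pd.get_dummies(cat, prefix="store_type", dtype=int)


train_dummies = encode(toy_train)
test_dummies = encode(toy_test)

print(list(test_dummies.columns))
```

Calling `pd.get_dummies` directly on each dataset would instead produce different columns for train and test whenever their observed levels differ.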

In [None]:
# Original categorical columns
print("Original data:")
print(train[["store_type", "region"]].head())
print(f"\nUnique store types: {train['store_type'].unique()}")
print(f"Unique regions: {train['region'].unique()}")

In [None]:
# One-hot encode categorical columns
rec_dummy = (
    recipe()
    .step_dummy(["store_type", "region"])
)

rec_dummy_fit = rec_dummy.prep(train)
train_dummy = rec_dummy_fit.bake(train)

print("After dummy encoding:")
print("\nNew columns:")
dummy_cols = [col for col in train_dummy.columns if "store_type" in col or "region" in col]
print(dummy_cols)

print("\nFirst few rows of dummy variables:")
print(train_dummy[dummy_cols].head())

**Note**: Original categorical columns are removed after encoding.

In [None]:
# Check that original columns are gone
print("'store_type' in columns:", "store_type" in train_dummy.columns)
print("'region' in columns:", "region" in train_dummy.columns)

---

## 5. Missing Value Imputation

Handle missing values with:
- **step_impute_mean()**: Replace missing values with the column mean learned from the training data
- **step_impute_median()**: Replace missing values with the column median learned from the training data
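
The choice between the two matters most when a column has outliers. The following plain-pandas sketch, on made-up values, computes both fill values on training data and reuses them on new data; it illustrates the idea and is not the py-recipes implementation.

```python
import numpy as np
import pandas as pd

toy_train = pd.Series([100.0, 110.0, 120.0, 5000.0, np.nan])  # one outlier, one gap
toy_test = pd.Series([np.nan, 105.0])

# "Fit": compute the fill values on training data (NaN is skipped by default).
fill_mean = toy_train.mean()
fill_median = toy_train.median()

# "Apply": reuse the training fill values on new data.
test_mean_imputed = toy_test.fillna(fill_mean)
test_median_imputed = toy_test.fillna(fill_median)

print(f"mean fill: {fill_mean:.1f}, median fill: {fill_median:.1f}")
```

Here the single outlier pulls the mean fill value to 1332.5, while the median fill value stays at 115.0, much closer to the typical observation.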

In [None]:
# Check missing values
print("Missing values in training data:")
print(train.isna().sum())

print(f"\nMissing in 'advertising': {train['advertising'].isna().sum()} out of {len(train)}")

### Mean Imputation

In [None]:
# Impute with mean
rec_impute_mean = (
    recipe()
    .step_impute_mean(columns=["advertising"])
)

rec_impute_mean_fit = rec_impute_mean.prep(train)
train_imputed = rec_impute_mean_fit.bake(train)

print("After mean imputation:")
print(f"Missing values: {train_imputed['advertising'].isna().sum()}")

print(f"\nOriginal mean: {train['advertising'].mean():.2f}")
print(f"After imputation mean: {train_imputed['advertising'].mean():.2f}")

### Median Imputation

In [None]:
# Impute with median
rec_impute_median = (
    recipe()
    .step_impute_median(columns=["advertising"])
)

rec_impute_median_fit = rec_impute_median.prep(train)
train_imputed_median = rec_impute_median_fit.bake(train)

print("After median imputation:")
print(f"Missing values: {train_imputed_median['advertising'].isna().sum()}")

print(f"\nOriginal median: {train['advertising'].median():.2f}")
print(f"After imputation median: {train_imputed_median['advertising'].median():.2f}")

---

## 6. Custom Transformations (step_mutate)

Create new features using custom functions.
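
Each function receives the DataFrame and returns a new column, much like pandas' own `DataFrame.assign`. The sketch below uses `assign` on made-up values as a stand-in; it also shows a guard worth considering for log features, since the logarithm of a negative input is NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [40.0, 55.0], "advertising": [900.0, -20.0]})

engineered = toy.assign(
    price_squared=lambda df: df["price"] ** 2,
    # Clip before taking the log: log1p of a value below -1 would be NaN.
    log_advertising=lambda df: np.log1p(df["advertising"].clip(lower=0)),
)

print(engineered)
```

Transformations like these are stateless: nothing is learned from the training data, so the same function gives the same result for a given row regardless of which dataset it appears in.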

In [None]:
# Create engineered features
rec_mutate = (
    recipe()
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        "price_x_advertising": lambda df: df["price"] * df["advertising"],
        "log_sales": lambda df: np.log(df["sales"])
    })
)

rec_mutate_fit = rec_mutate.prep(train)
train_mutated = rec_mutate_fit.bake(train)

print("New columns created:")
new_cols = ["price_squared", "price_x_advertising", "log_sales"]
print(train_mutated[new_cols].head())

print("\nOriginal columns preserved:")
print(train_mutated[["price", "advertising", "sales"]].head())

---

## 7. Multi-Step Pipelines

Chain multiple preprocessing steps together.

**Order matters!** Steps are applied sequentially.
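
A small plain-pandas sketch (made-up values, no py-recipes calls) shows why. Squaring a column and then standardizing it gives a different feature than standardizing first and squaring afterwards.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])


def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()


# Mutate, then normalize: the squared feature ends up standardized.
square_then_scale = zscore(x ** 2)

# Normalize, then mutate: the squared feature is built from scaled values
# and is no longer standardized (it is also never negative).
scale_then_square = zscore(x) ** 2

print(square_then_scale.round(3).tolist())
print(scale_then_square.round(3).tolist())
```

The same reasoning applies to imputation: a step that cannot handle missing values has to come after the step that fills them.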

In [None]:
# Create a comprehensive preprocessing pipeline
rec_pipeline = (
    recipe()
    # 1. Handle missing values first
    .step_impute_mean()
    # 2. Create new features
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        # Simulated advertising can be negative; clip at 0 so the log never yields NaN
        "log_advertising": lambda df: np.log1p(df["advertising"].clip(lower=0))
    })
    # 3. Normalize numeric columns
    .step_normalize()
    # 4. Encode categorical variables
    .step_dummy(["store_type", "region"])
)

print(f"Pipeline has {len(rec_pipeline.steps)} steps:")
for i, step in enumerate(rec_pipeline.steps, 1):
    print(f"  {i}. {type(step).__name__}")

In [None]:
# Fit the entire pipeline
rec_pipeline_fit = rec_pipeline.prep(train)

# Apply to train
train_final = rec_pipeline_fit.bake(train)

# Apply to test
test_final = rec_pipeline_fit.bake(test)

print(f"Original train shape: {train.shape}")
print(f"Final train shape: {train_final.shape}")
print(f"\nFinal columns: {list(train_final.columns)}")

In [None]:
# Verify that no values are missing and that numeric columns are standardized
print("Missing values after pipeline:")
print(train_final.isna().sum().sum())

print("\nNumeric column statistics:")
numeric_cols = ["sales", "price", "advertising", "temperature", "price_squared", "log_advertising"]
print(train_final[numeric_cols].describe().loc[["mean", "std"]])

---

## 8. Integration with Workflows

A recipe can be attached to a **py-workflows** workflow, which preps it during `fit()` and bakes it during `predict()`, giving a complete modeling pipeline.

In [None]:
# Create recipe
rec_for_model = (
    recipe()
    .step_impute_mean()
    .step_normalize(columns=["price", "advertising", "temperature"])
    .step_dummy(["store_type", "region"])
)

# Create workflow
wf = (
    workflow()
    .add_recipe(rec_for_model)
    .add_model(linear_reg().set_engine("sklearn"))
)

print("Workflow created with recipe preprocessing")
print(f"Preprocessor: {type(wf.preprocessor).__name__}")
print(f"Model: {wf.spec.model_type}")

In [None]:
# Fit workflow (recipe is automatically prepped)
wf_fit = wf.fit(train)

print("Workflow fitted successfully")
print(f"Preprocessor type: {type(wf_fit.pre).__name__}")

In [None]:
# Predict on test data (recipe is automatically applied)
predictions = wf_fit.predict(test)

print("Predictions:")
print(predictions.head())

print(f"\nPrediction shape: {predictions.shape}")
print(f"Test data shape: {test.shape}")

In [None]:
# Extract comprehensive outputs
outputs, coefficients, stats = wf_fit.extract_outputs()

print("Outputs DataFrame (observation-level):")
print(outputs.head())

print("\nCoefficients DataFrame (variable-level):")
print(coefficients[["variable", "coefficient"]].head())

print("\nStats DataFrame (model-level):")
print(stats[stats["metric"].isin(["rmse", "mae", "r_squared"])])

---

## 9. Complete Example: Train/Test Evaluation

A full example showing recipe → workflow → evaluation.

In [None]:
# Create comprehensive recipe
final_recipe = (
    recipe()
    # 1. Impute missing values
    .step_impute_mean()
    # 2. Engineer features
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        "temp_x_price": lambda df: df["temperature"] * df["price"]
    })
    # 3. Normalize all numeric columns
    .step_normalize()
    # 4. Encode categories
    .step_dummy(["store_type", "region"])
)

# Create workflow
final_wf = (
    workflow()
    .add_recipe(final_recipe)
    .add_model(linear_reg().set_engine("sklearn"))
)

# Fit on train
final_wf_fit = final_wf.fit(train)

# Evaluate on test
final_wf_fit = final_wf_fit.evaluate(test)

print("Complete workflow fitted and evaluated!")

In [None]:
# Extract all outputs
outputs, coefficients, stats = final_wf_fit.extract_outputs()

print("="*60)
print("OUTPUTS DATAFRAME (observation-level)")
print("="*60)
print(f"Total observations: {len(outputs)}")
print(f"Train: {len(outputs[outputs['split']=='train'])}")
print(f"Test: {len(outputs[outputs['split']=='test'])}")
print("\nFirst few rows:")
print(outputs[["actuals", "forecast", "residuals", "split"]].head())

In [None]:
print("="*60)
print("COEFFICIENTS DATAFRAME (variable-level)")
print("="*60)
print(coefficients[["variable", "coefficient"]])

In [None]:
print("="*60)
print("STATS DATAFRAME (model-level metrics)")
print("="*60)

# Performance metrics
perf_metrics = stats[stats["metric"].isin(["rmse", "mae", "r_squared"])]
print("\nPerformance Metrics:")
for split in ["train", "test"]:
    print(f"\n{split.upper()}:")
    split_metrics = perf_metrics[perf_metrics["split"] == split]
    for _, row in split_metrics.iterrows():
        print(f"  {row['metric']}: {row['value']:.4f}")

In [None]:
# Extract the fitted recipe for inspection
fitted_recipe = final_wf_fit.extract_preprocessor()

print("Fitted Recipe:")
print(f"Type: {type(fitted_recipe).__name__}")
print(f"Number of prepared steps: {len(fitted_recipe.prepared_steps)}")

# Can use the fitted recipe independently
new_data_transformed = fitted_recipe.bake(test)
print(f"\nTransformed shape: {new_data_transformed.shape}")

---

## Summary

**py-recipes** provides a powerful, composable system for feature engineering:

### Key Benefits

1. **No Data Leakage**: Preprocessing fitted on train, applied to test
2. **Reproducible**: Same transformations applied consistently
3. **Composable**: Chain multiple steps together
4. **Integrated**: Works seamlessly with py-workflows
5. **Method Chaining**: Clean, readable syntax

### Available Steps

- `step_normalize()`: zscore or minmax normalization
- `step_dummy()`: one-hot encoding
- `step_impute_mean()`: mean imputation
- `step_impute_median()`: median imputation
- `step_mutate()`: custom transformations

### Pattern

```python
# 1. Create recipe
rec = recipe().step_normalize().step_dummy(["category"])

# 2. Fit to training data
rec_fit = rec.prep(train)

# 3. Apply to any data
train_transformed = rec_fit.bake(train)
test_transformed = rec_fit.bake(test)
```

### Integration with Workflows

```python
wf = (
    workflow()
    .add_recipe(rec)
    .add_model(linear_reg().set_engine("sklearn"))
)
wf_fit = wf.fit(train).evaluate(test)
outputs, coefficients, stats = wf_fit.extract_outputs()
```

---

## Next Steps

- Explore `08_workflows_demo.ipynb` for more workflow examples
- Check `02_parsnip_demo.ipynb` for model specifications
- See `07_rsample_demo.ipynb` for time series cross-validation

---