# py-recipes Demo: Feature Engineering Pipelines

This notebook demonstrates the **py-recipes** package for creating reproducible feature engineering pipelines.

## What is py-recipes?

**py-recipes** provides a consistent interface for:
- **Preprocessing**: Normalization, imputation, encoding
- **Feature engineering**: Custom transformations
- **Pipeline composition**: Chain multiple steps
- **Train/test consistency**: No data leakage

## Key Concepts

1. **Recipe**: An immutable specification of preprocessing steps
2. **prep()**: Fits the recipe to training data and returns a `PreparedRecipe`
3. **bake()**: Applies the fitted recipe to new data
4. **PreparedRecipe**: A fitted recipe, ready to transform data
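
The split between an unfitted specification and a fitted object can be pictured with a small, library-free sketch. The names `CenterSpec` and `FittedCenter` are hypothetical and are not part of py-recipes; the sketch only illustrates that the specification carries no data-derived state, while the fitted object stores what was learned from the training rows.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class CenterSpec:
    """Hypothetical specification: which column to center (no learned state)."""
    column: str

    def prep(self, rows):
        # Learn the column mean from the training rows only.
        return FittedCenter(self.column, fmean(r[self.column] for r in rows))


@dataclass(frozen=True)
class FittedCenter:
    """Hypothetical fitted object: carries the mean learned during prep()."""
    column: str
    mean: float

    def bake(self, rows):
        # Apply the stored mean to any rows, returning new dicts.
        return [{**r, self.column: r[self.column] - self.mean} for r in rows]


train_rows = [{"price": 40.0}, {"price": 50.0}, {"price": 60.0}]
new_rows = [{"price": 70.0}]

fitted = CenterSpec("price").prep(train_rows)
print(fitted.mean)            # mean learned from train_rows
print(fitted.bake(new_rows))  # new rows centered with the training mean
```

Freezing both dataclasses mirrors the "immutable" idea: preparing a specification produces a new object instead of modifying the specification in place.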

## Contents

1. [Setup](#setup)
2. [Basic Recipe Usage](#basic-recipe)
3. [Normalization](#normalization)
4. [Dummy Variables (One-Hot Encoding)](#dummy)
5. [Missing Value Imputation](#imputation)
6. [Custom Transformations](#mutate)
7. [Multi-Step Pipelines](#pipelines)
8. [Integration with Workflows](#workflows)
9. [Complete Example](#complete)

---

## 1. Setup <a id="setup"></a>

First, let's import the necessary packages and create sample data.

In [30]:
import pandas as pd
import numpy as np
from py_recipes import recipe
from py_workflows import workflow
from py_parsnip import linear_reg

# Set random seed for reproducibility
np.random.seed(42)

In [31]:
# Create sample data with various issues
n = 200

data = pd.DataFrame({
    "sales": np.random.randn(n) * 1000 + 5000,
    "price": np.random.randn(n) * 10 + 50,
    "advertising": np.random.randn(n) * 500 + 1000,
    "temperature": np.random.randn(n) * 15 + 20,
    "store_type": np.random.choice(["Mall", "Street", "Online"], n),
    "region": np.random.choice(["North", "South", "East", "West"], n)
})

# Add some missing values
missing_idx = np.random.choice(n, size=20, replace=False)
data.loc[missing_idx, "advertising"] = np.nan

print(f"Data shape: {data.shape}")
print(f"\nMissing values:\n{data.isna().sum()}")
print(f"\nFirst few rows:")
data.head()

Data shape: (200, 6)

Missing values:
sales           0
price           0
advertising    20
temperature     0
store_type      0
region          0
dtype: int64

First few rows:


         sales      price  advertising  temperature store_type region
0  5496.714153  53.577874          NaN    31.354829     Street  North
1  4861.735699  55.607845   700.312489     6.167520     Online  South
2  5647.688538  60.830512  1002.621850    33.044089     Street  North
3  6523.029856  60.538021  1023.490297    40.334568     Online   East
4  4765.846625  36.223306   774.967264    26.201524       Mall   East


In [32]:
# Split into train/test
train_idx = int(0.8 * n)
train = data[:train_idx].copy()
test = data[train_idx:].copy()

print(f"Train: {len(train)} | Test: {len(test)}")

Train: 160 | Test: 40


---

## 2. Basic Recipe Usage <a id="basic-recipe"></a>

A recipe follows the **prep/bake** pattern:
1. **Create** a recipe specification
2. **prep()** fits the recipe to training data
3. **bake()** applies the fitted recipe to new data

In [33]:
# Create a simple recipe
rec = recipe()
print(f"Recipe: {rec}")
print(f"Steps: {rec.steps}")

Recipe: Recipe(steps=[], template=None, roles={})
Steps: []


In [34]:
# Add a normalization step
rec = recipe().step_normalize()
print(f"Recipe with 1 step: {len(rec.steps)} steps")

Recipe with 1 step: 1 steps


In [35]:
# Fit (prep) the recipe to training data
rec_fit = rec.prep(train)
print(f"Prepared recipe: {rec_fit}")
print(f"Prepared steps: {len(rec_fit.prepared_steps)}")

Prepared recipe: PreparedRecipe(recipe=Recipe(steps=[StepNormalize(columns=None, method='zscore')], template=None, roles={}), prepared_steps=[PreparedStepNormalize(columns=['sales', 'price', 'advertising', 'temperature'], scaler=StandardScaler(), method='zscore')], template=           sales      price  advertising  temperature store_type region
0    5496.714153  53.577874          NaN    31.354829     Street  North
1    4861.735699  55.607845   700.312489     6.167520     Online  South
2    5647.688538  60.830512  1002.621850    33.044089     Street  North
3    6523.029856  60.538021  1023.490297    40.334568     Online   East
4    4765.846625  36.223306   774.967264    26.201524       Mall   East
..           ...        ...          ...          ...        ...    ...
155  4285.648582  39.974706   794.061517    59.485731     Online   West
156  6865.774511  49.814869          NaN    27.399769     Online  South
157  5473.832921  47.113414   783.720906    22.772542     Street   East
158  

In [36]:
# Apply (bake) to new data
train_transformed = rec_fit.bake(train)

print("\nOriginal train data (first 3 rows):")
print(train[["sales", "price", "advertising"]].head(3))

print("\nTransformed train data (first 3 rows):")
print(train_transformed[["sales", "price", "advertising"]].head(3))

print("\nMeans after normalization:")
print(train_transformed[["sales", "price", "advertising", "temperature"]].mean())

print("\nStd devs after normalization:")
print(train_transformed[["sales", "price", "advertising", "temperature"]].std())


Original train data (first 3 rows):
         sales      price  advertising
0  5496.714153  53.577874          NaN
1  4861.735699  55.607845   700.312489
2  5647.688538  60.830512  1002.621850

Transformed train data (first 3 rows):
      sales     price  advertising
0  0.604468  0.312385          NaN
1 -0.075690  0.521598    -0.487344
2  0.766184  1.059858     0.114993

Means after normalization:
sales          1.665335e-16
price         -6.938894e-17
advertising   -1.973730e-16
temperature   -3.330669e-17
dtype: float64

Std devs after normalization:
sales          1.00314
price          1.00314
advertising    1.00349
temperature    1.00314
dtype: float64
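
The standard deviations print as 1.00314, not exactly 1. The prepared step's repr shows a `StandardScaler()`, which divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The ratio between the two is `sqrt(n / (n - 1))`: about 1.00314 for the 160 training rows and about 1.00349 for the 144 non-missing `advertising` values. A minimal standard-library sketch on synthetic values (not the notebook's data) reproduces the effect:

```python
import math
import random
import statistics

random.seed(0)
n = 160
values = [random.gauss(50, 10) for _ in range(n)]

# Standardize with the population standard deviation (ddof=0).
mean = statistics.fmean(values)
pop_sd = statistics.pstdev(values)
z = [(v - mean) / pop_sd for v in values]

# Measure the spread with the sample standard deviation (ddof=1).
sample_sd_of_z = statistics.stdev(z)

print(round(sample_sd_of_z, 5))
print(round(math.sqrt(n / (n - 1)), 5))
```

Both printed values agree, so the small excess over 1 is a reporting convention and not a flaw in the normalization.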


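**Key Point**: `prep()` learns its parameters (here, each column's mean and standard deviation) from the **training data** only, and `bake()` applies those same parameters to the **test data**.

Because the test set never influences the fitted statistics, this prevents **data leakage**.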

In [37]:
# Apply the SAME fitted recipe to test data
test_transformed = rec_fit.bake(test)

print("Test data means (NOT zero because fitted on train):")
print(test_transformed[["sales", "price", "advertising", "temperature"]].mean())

Test data means (NOT zero because fitted on train):
sales          0.143699
price          0.160695
advertising    0.165710
temperature    0.202324
dtype: float64
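
To make the leakage point concrete, here is a small sketch that contrasts statistics fitted on the training values alone with statistics fitted on training and test values together. The helper names `fit` and `apply` and the toy numbers are illustrative assumptions and do not reflect how py-recipes is implemented.

```python
from statistics import fmean, pstdev

train_vals = [10.0, 12.0, 14.0, 16.0]
test_vals = [30.0, 34.0]


def fit(values):
    # Learn the normalization parameters from the given values.
    return fmean(values), pstdev(values)


def apply(values, params):
    # Apply previously learned parameters to any values.
    mean, sd = params
    return [(v - mean) / sd for v in values]


train_only = fit(train_vals)           # correct: test rows play no part
leaky = fit(train_vals + test_vals)    # leaky: test rows shift mean and sd

print(apply(test_vals, train_only))
print(apply(test_vals, leaky))
```

With the train-only parameters the transformed test values keep their true distance from the training distribution, which is why the test means above are not zero. The leaky version pulls them toward zero because the test rows helped define the scale.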


---

## 3. Normalization (step_normalize) <a id="normalization"></a>

Normalize numeric columns using:
- **zscore** (default): standardization (mean=0, std=1)
- **minmax**: scaling to [0, 1] range

### Z-score Normalization

In [38]:
# Normalize specific columns
rec_zscore = (
    recipe()
    .step_normalize(columns=["price", "advertising"], method="zscore")
)

rec_zscore_fit = rec_zscore.prep(train)
train_zscore = rec_zscore_fit.bake(train)

print("Z-score normalized columns:")
print(train_zscore[["price", "advertising"]].describe())

Z-score normalized columns:
              price   advertising
count  1.600000e+02  1.440000e+02
mean  -6.938894e-17 -1.973730e-16
std    1.003140e+00  1.003490e+00
min   -3.396881e+00 -2.352548e+00
25%   -7.055730e-01 -6.926009e-01
50%    1.273539e-02  3.246156e-02
75%    6.305484e-01  6.796979e-01
max    3.914352e+00  3.177030e+00


### MinMax Normalization

In [39]:
# MinMax scaling [0, 1]
rec_minmax = (
    recipe()
    .step_normalize(columns=["price", "advertising"], method="minmax")
)

rec_minmax_fit = rec_minmax.prep(train)
train_minmax = rec_minmax_fit.bake(train)

print("MinMax normalized columns:")
print(train_minmax[["price", "advertising"]].describe())

MinMax normalized columns:
            price  advertising
count  160.000000   144.000000
mean     0.464611     0.425448
std      0.137205     0.181477
min      0.000000     0.000000
25%      0.368106     0.300194
50%      0.466353     0.431319
75%      0.550855     0.548368
max      1.000000     1.000000
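
The training minimum and maximum map to exactly 0 and 1 because they are the values the scaler was fitted on. New data is not guaranteed to stay inside that range. The sketch below assumes the minimum and maximum are learned once and reused without clipping, which is the usual min-max formula; the notebook does not show whether py-recipes clips, and the helper names are hypothetical.

```python
def fit_minmax(values):
    # Learn the range from the training values.
    return min(values), max(values)


def apply_minmax(values, lo, hi):
    # Rescale using the previously learned range.
    return [(v - lo) / (hi - lo) for v in values]


train_price = [36.0, 50.0, 61.0]
test_price = [30.0, 55.0, 70.0]

lo, hi = fit_minmax(train_price)
scaled_test = apply_minmax(test_price, lo, hi)
print(scaled_test)  # values below 0 or above 1 lie outside the training range
```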


---

## 4. Dummy Variables (step_dummy) <a id="dummy"></a>

Convert categorical variables to one-hot encoded dummy variables.

In [40]:
# Original categorical columns
print("Original data:")
print(train[["store_type", "region"]].head())
print(f"\nUnique store types: {train['store_type'].unique()}")
print(f"Unique regions: {train['region'].unique()}")

Original data:
  store_type region
0     Street  North
1     Online  South
2     Street  North
3     Online   East
4       Mall   East

Unique store types: ['Street' 'Online' 'Mall']
Unique regions: ['North' 'South' 'East' 'West']


In [41]:
# One-hot encode categorical columns
rec_dummy = (
    recipe()
    .step_dummy(["store_type", "region"])
)

rec_dummy_fit = rec_dummy.prep(train)
train_dummy = rec_dummy_fit.bake(train)

print("After dummy encoding:")
print("\nNew columns:")
dummy_cols = [col for col in train_dummy.columns if "store_type" in col or "region" in col]
print(dummy_cols)

print("\nFirst few rows of dummy variables:")
print(train_dummy[dummy_cols].head())

After dummy encoding:

New columns:
['store_type_Mall', 'store_type_Online', 'store_type_Street', 'region_East', 'region_North', 'region_South', 'region_West']

First few rows of dummy variables:
   store_type_Mall  store_type_Online  store_type_Street  region_East  \
0              0.0                0.0                1.0          0.0   
1              0.0                1.0                0.0          0.0   
2              0.0                0.0                1.0          0.0   
3              0.0                1.0                0.0          1.0   
4              1.0                0.0                0.0          1.0   

   region_North  region_South  region_West  
0           1.0           0.0          0.0  
1           0.0           1.0          0.0  
2           1.0           0.0          0.0  
3           0.0           0.0          0.0  
4           0.0           0.0          0.0  


**Note**: Original categorical columns are removed after encoding, and every level gets its own column (no reference level is dropped).

In [42]:
# Check that original columns are gone
print("'store_type' in columns:", "store_type" in train_dummy.columns)
print("'region' in columns:", "region" in train_dummy.columns)

'store_type' in columns: False
'region' in columns: False
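
A practical reason to fit the encoding on training data is that new data must end up with the same columns, even when a level is missing or a previously unseen level appears. The notebook does not show how py-recipes treats unseen levels, so the following is only a pandas sketch of one common policy (unseen levels become all zeros), with hypothetical variable names:

```python
import pandas as pd

train_df = pd.DataFrame({"store_type": ["Mall", "Street", "Online", "Mall"]})
new_df = pd.DataFrame({"store_type": ["Street", "Kiosk"]})  # "Kiosk" is unseen

# "Fit": remember the dummy columns produced on the training data.
train_dummies = pd.get_dummies(train_df, columns=["store_type"], dtype=float)
fitted_columns = list(train_dummies.columns)

# "Apply": encode the new data, then align it to the training columns.
new_dummies = pd.get_dummies(new_df, columns=["store_type"], dtype=float)
new_aligned = new_dummies.reindex(columns=fitted_columns, fill_value=0.0)

print(fitted_columns)
print(new_aligned)
```

Aligning with `reindex` drops the column for the unseen level and fills in the columns that the new data did not produce, so a downstream model always sees the layout it was trained on.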


---

## 5. Missing Value Imputation <a id="imputation"></a>

Handle missing values with:
- **step_impute_mean()**: Replace NA with column mean
- **step_impute_median()**: Replace NA with column median

In [43]:
# Check missing values
print("Missing values in training data:")
print(train.isna().sum())

print(f"\nMissing in 'advertising': {train['advertising'].isna().sum()} out of {len(train)}")

Missing values in training data:
sales           0
price           0
advertising    16
temperature     0
store_type      0
region          0
dtype: int64

Missing in 'advertising': 16 out of 160


### Mean Imputation

In [44]:
# Impute with mean
rec_impute_mean = (
    recipe()
    .step_impute_mean(columns=["advertising"])
)

rec_impute_mean_fit = rec_impute_mean.prep(train)
train_imputed = rec_impute_mean_fit.bake(train)

print("After mean imputation:")
print(f"Missing values: {train_imputed['advertising'].isna().sum()}")

print(f"\nOriginal mean: {train['advertising'].mean():.2f}")
print(f"After imputation mean: {train_imputed['advertising'].mean():.2f}")

After mean imputation:
Missing values: 0

Original mean: 944.91
After imputation mean: 944.91


### Median Imputation

In [45]:
# Impute with median
rec_impute_median = (
    recipe()
    .step_impute_median(columns=["advertising"])
)

rec_impute_median_fit = rec_impute_median.prep(train)
train_imputed_median = rec_impute_median_fit.bake(train)

print("After median imputation:")
print(f"Missing values: {train_imputed_median['advertising'].isna().sum()}")

print(f"\nOriginal median: {train['advertising'].median():.2f}")
print(f"After imputation median: {train_imputed_median['advertising'].median():.2f}")

After median imputation:
Missing values: 0

Original median: 961.20
After imputation median: 961.20
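
The mean is identical before and after mean imputation because filling gaps with the mean cannot move the mean (the same holds for the median under median imputation). What does change is the spread: every imputed row sits exactly at the center, so the standard deviation shrinks. A standard-library sketch with made-up values:

```python
from statistics import fmean, stdev

observed = [700.0, 1000.0, 1020.0, 775.0]  # non-missing training values
n_missing = 2

fill_value = fmean(observed)               # learned from the training data
imputed = observed + [fill_value] * n_missing

print(fmean(observed), fmean(imputed))     # the mean is unchanged
print(stdev(observed), stdev(imputed))     # the spread is smaller after imputation
```

This is worth keeping in mind when a normalization step follows imputation, since the scale it learns is computed on the imputed column.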


---

## 6. Custom Transformations (step_mutate) <a id="mutate"></a>

Create new features using custom functions.

In [46]:
# Create engineered features
rec_mutate = (
    recipe()
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        "price_x_advertising": lambda df: df["price"] * df["advertising"],
        "log_sales": lambda df: np.log(df["sales"])
    })
)

rec_mutate_fit = rec_mutate.prep(train)
train_mutated = rec_mutate_fit.bake(train)

print("New columns created:")
new_cols = ["price_squared", "price_x_advertising", "log_sales"]
print(train_mutated[new_cols].head())

print("\nOriginal columns preserved:")
print(train_mutated[["price", "advertising", "sales"]].head())

New columns created:
   price_squared  price_x_advertising  log_sales
0    2870.588540                  NaN   8.611906
1    3092.232455         38942.868498   8.489151
2    3700.351243         60990.000902   8.639002
3    3664.851929         61960.076595   8.783094
4    1312.127921         28071.876602   8.469230

Original columns preserved:
       price  advertising        sales
0  53.577874          NaN  5496.714153
1  55.607845   700.312489  4861.735699
2  60.830512  1002.621850  5647.688538
3  60.538021  1023.490297  6523.029856
4  36.223306   774.967264  4765.846625
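
Row 0 of `price_x_advertising` is NaN because `advertising` is missing there, and arithmetic on NaN propagates. A rough pandas analogue of the dictionary-of-functions idea is `DataFrame.assign`, which also accepts callables and returns a new frame. This is only an analogy on toy data and says nothing about how `step_mutate` is implemented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [50.0, 60.0],
    "advertising": [np.nan, 1000.0],
    "sales": [5000.0, 6000.0],
})

new_features = {
    "price_squared": lambda d: d["price"] ** 2,
    "price_x_advertising": lambda d: d["price"] * d["advertising"],
}

# assign() evaluates each callable on the frame and returns a new DataFrame,
# leaving the original columns (and the original frame) untouched.
out = df.assign(**new_features)
print(out)
```

If the derived feature should never be missing, the imputation step needs to run before the step that computes it.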


---

## 7. Multi-Step Pipelines <a id="pipelines"></a>

Chain multiple preprocessing steps together.

**Order matters!** Steps are applied sequentially.
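
A small sketch shows why the order is not interchangeable: squaring a column and then normalizing it gives a different feature from normalizing first and squaring afterwards. The `zscore` helper is a hypothetical stand-in for a normalization step and uses only the standard library.

```python
from statistics import fmean, pstdev


def zscore(values):
    # Standardize using the mean and population standard deviation.
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


price = [40.0, 50.0, 60.0]

# Order A: create the squared feature, then normalize it.
square_then_norm = zscore([p ** 2 for p in price])

# Order B: normalize first, then square the normalized values.
norm_then_square = [z ** 2 for z in zscore(price)]

print(square_then_norm)  # still increasing with price
print(norm_then_square)  # symmetric around the mean price
```

Order A keeps the ranking of the original prices, while order B measures squared distance from the mean, so low and high prices become indistinguishable.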

In [47]:
# Create a comprehensive preprocessing pipeline
rec_pipeline = (
    recipe()
    # 1. Handle missing values first
    .step_impute_mean()
    # 2. Create new features
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        "log_advertising": lambda df: np.log(df["advertising"] + 1)
    })
    # 3. Normalize numeric columns
    .step_normalize()
    # 4. Encode categorical variables
    .step_dummy(["store_type", "region"])
)

print(f"Pipeline has {len(rec_pipeline.steps)} steps:")
for i, step in enumerate(rec_pipeline.steps, 1):
    print(f"  {i}. {type(step).__name__}")

Pipeline has 4 steps:
  1. StepImputeMean
  2. StepMutate
  3. StepNormalize
  4. StepDummy


In [48]:
# Fit the entire pipeline
rec_pipeline_fit = rec_pipeline.prep(train)

# Apply to train
train_final = rec_pipeline_fit.bake(train)

# Apply to test
test_final = rec_pipeline_fit.bake(test)

print(f"Original train shape: {train.shape}")
print(f"Final train shape: {train_final.shape}")
print(f"\nFinal columns: {list(train_final.columns)}")

Original train shape: (160, 6)
Final train shape: (160, 13)

Final columns: ['sales', 'price', 'advertising', 'temperature', 'price_squared', 'log_advertising', 'store_type_Mall', 'store_type_Online', 'store_type_Street', 'region_East', 'region_North', 'region_South', 'region_West']


**Note**: This cell emits NumPy runtime warnings from `np.log(df["advertising"] + 1)`. The expression is undefined wherever `advertising` is below -1, and those rows become NaN in `log_advertising`.


In [49]:
# Check for remaining missing values and verify standardization
print("Missing values after pipeline:")
print(train_final.isna().sum().sum())

print("\nNumeric column statistics:")
numeric_cols = ["sales", "price", "advertising", "temperature", "price_squared", "log_advertising"]
print(train_final[numeric_cols].describe().loc[["mean", "std"]])

Missing values after pipeline:
4

Numeric column statistics:
             sales         price   advertising   temperature  price_squared  \
mean  1.665335e-16 -6.938894e-17 -4.218847e-16 -3.330669e-17  -3.719247e-16   
std   1.003140e+00  1.003140e+00  1.003140e+00  1.003140e+00   1.003140e+00   

      log_advertising  
mean     9.792736e-16  
std      1.003221e+00  
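
The pipeline still reports 4 missing values even though imputation ran first. They are introduced later by the `log_advertising` feature: `advertising` was simulated with mean 1000 and standard deviation 500, so a few values fall below -1, where `log(x + 1)` is undefined. The `log_advertising` standard deviation of 1.003221 matches `sqrt(156 / 155)`, which is consistent with 156 non-missing rows. The sketch below reproduces the effect on toy values and shows one possible guard; clipping negative spend to zero is an assumption about what is sensible for this feature, not something the notebook prescribes.

```python
import numpy as np

advertising = np.array([700.0, 1000.0, -50.0, 0.0])

# log(x + 1) is undefined for x < -1, so NumPy yields NaN (with a RuntimeWarning).
with np.errstate(invalid="ignore"):
    raw = np.log(advertising + 1)

# One possible guard: floor negative spend at zero, then use log1p.
guarded = np.log1p(np.clip(advertising, 0.0, None))

print(np.isnan(raw).sum())      # number of NaN values without the guard
print(np.isnan(guarded).sum())  # number of NaN values with the guard
```

Inside a recipe, the guarded expression could replace the lambda body for `log_advertising`, or a second imputation step could follow the mutate step.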


---

## 8. Integration with Workflows <a id="workflows"></a>

A recipe can be attached to a **py-workflows** workflow, which preps the recipe when the workflow is fitted and bakes new data automatically at prediction time.

In [50]:
# Create recipe
rec_for_model = (
    recipe()
    .step_impute_mean()
    .step_normalize(columns=["price", "advertising", "temperature"])
    .step_dummy(["store_type", "region"])
)

# Create workflow
wf = (
    workflow()
    .add_recipe(rec_for_model)
    .add_model(linear_reg().set_engine("sklearn"))
)

print("Workflow created with recipe preprocessing")
print(f"Preprocessor: {type(wf.preprocessor).__name__}")
print(f"Model: {wf.spec.model_type}")

Workflow created with recipe preprocessing
Preprocessor: Recipe
Model: linear_reg


In [51]:
# Fit workflow (recipe is automatically prepped)
wf_fit = wf.fit(train)

print("Workflow fitted successfully")
print(f"Preprocessor type: {type(wf_fit.pre).__name__}")

Workflow fitted successfully
Preprocessor type: PreparedRecipe


In [52]:
# Predict on test data (recipe is automatically applied)
predictions = wf_fit.predict(test)

print("Predictions:")
print(predictions.head())

print(f"\nPrediction shape: {predictions.shape}")
print(f"Test data shape: {test.shape}")

Predictions:
         .pred
0  5097.223413
1  4473.059352
2  4710.019172
3  5189.366138
4  4753.120466

Prediction shape: (40, 1)
Test data shape: (40, 6)


In [53]:
# Extract comprehensive outputs
outputs, coefficients, stats = wf_fit.extract_outputs()

print("Outputs DataFrame (observation-level):")
print(outputs.head())

print("\nCoefficients DataFrame (variable-level):")
print(coefficients[["variable", "coefficient"]].head())

print("\nStats DataFrame (model-level):")
print(stats[stats["metric"].isin(["rmse", "mae", "r_squared"])])

Outputs DataFrame (observation-level):
       actuals       fitted     forecast    residuals  split       model  \
0  5496.714153  4925.901109  5496.714153  5496.714153  train  linear_reg   
1  4861.735699  5250.933675  4861.735699  4861.735699  train  linear_reg   
2  5647.688538  4977.788992  5647.688538  5647.688538  train  linear_reg   
3  6523.029856  5191.706343  6523.029856  6523.029856  train  linear_reg   
4  4765.846625  4828.566105  4765.846625  4765.846625  train  linear_reg   

  model_group_name   group  
0                   global  
1                   global  
2                   global  
3                   global  
4                   global  

Coefficients DataFrame (variable-level):
          variable  coefficient
0        Intercept     0.000000
1            price    93.253177
2      advertising  -149.401651
3      temperature     2.705815
4  store_type_Mall  -134.808164

Stats DataFrame (model-level):
      metric       value  split       model model_group_name   g

---

## 9. Complete Example: Train/Test Evaluation <a id="complete"></a>

A full example showing recipe → workflow → evaluation.

In [54]:
# Create comprehensive recipe
final_recipe = (
    recipe()
    # 1. Impute missing values
    .step_impute_mean()
    # 2. Engineer features
    .step_mutate({
        "price_squared": lambda df: df["price"] ** 2,
        "temp_x_price": lambda df: df["temperature"] * df["price"]
    })
    # 3. Normalize all numeric columns (including the outcome, sales)
    .step_normalize()
    # 4. Encode categories
    .step_dummy(["store_type", "region"])
)

# Create workflow
final_wf = (
    workflow()
    .add_recipe(final_recipe)
    .add_model(linear_reg().set_engine("sklearn"))
)

# Fit on train
final_wf_fit = final_wf.fit(train)

# Evaluate on test
final_wf_fit = final_wf_fit.evaluate(test)

print("Complete workflow fitted and evaluated!")

Complete workflow fitted and evaluated!


In [55]:
# Extract all outputs
outputs, coefficients, stats = final_wf_fit.extract_outputs()

print("="*60)
print("OUTPUTS DATAFRAME (observation-level)")
print("="*60)
print(f"Total observations: {len(outputs)}")
print(f"Train: {len(outputs[outputs['split']=='train'])}")
print(f"Test: {len(outputs[outputs['split']=='test'])}")
print("\nFirst few rows:")
print(outputs[["actuals", "forecast", "residuals", "split"]].head())

OUTPUTS DATAFRAME (observation-level)
Total observations: 200
Train: 160
Test: 40

First few rows:
    actuals  forecast  residuals  split
0  0.604468  0.604468   0.604468  train
1 -0.075690 -0.075690  -0.075690  train
2  0.766184  0.766184   0.766184  train
3  1.703807  1.703807   1.703807  train
4 -0.178402 -0.178402  -0.178402  train




In [56]:
print("="*60)
print("COEFFICIENTS DATAFRAME (variable-level)")
print("="*60)
print(coefficients[["variable", "coefficient"]])

COEFFICIENTS DATAFRAME (variable-level)
             variable  coefficient
0           Intercept     0.000000
1               price     0.074952
2         advertising    -0.167890
3         temperature     0.296053
4       price_squared     0.107954
5        temp_x_price    -0.302534
6     store_type_Mall    -0.139166
7   store_type_Online     0.076976
8   store_type_Street     0.062190
9         region_East     0.130784
10       region_North    -0.108576
11       region_South     0.132182
12        region_West    -0.154390


In [57]:
print("="*60)
print("STATS DATAFRAME (model-level metrics)")
print("="*60)

# Performance metrics
perf_metrics = stats[stats["metric"].isin(["rmse", "mae", "r_squared"])]
print("\nPerformance Metrics:")
for split in ["train", "test"]:
    print(f"\n{split.upper()}:")
    split_metrics = perf_metrics[perf_metrics["split"] == split]
    for _, row in split_metrics.iterrows():
        print(f"  {row['metric']}: {row['value']:.4f}")

STATS DATAFRAME (model-level metrics)

Performance Metrics:

TRAIN:
  rmse: 0.9635
  mae: 0.7816
  r_squared: 0.0716

TEST:
  rmse: 1.1116
  mae: 0.8833
  r_squared: -0.3272
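
Two points help when reading these numbers. First, `step_normalize()` was called without `columns`, so the outcome `sales` was standardized as well (the `actuals` shown earlier are z-scores), which puts RMSE and MAE in standard-deviation units and not in sales units. Second, the low and even negative R² is expected here, because every column was simulated independently and the predictors carry no real signal. The sketch below computes the three metrics under their usual definitions; that py-workflows uses exactly these definitions is an assumption.

```python
import math
from statistics import fmean


def regression_metrics(actual, predicted):
    # Usual definitions of RMSE, MAE and R-squared.
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(fmean(e ** 2 for e in errors))
    mae = fmean(abs(e) for e in errors)
    ss_res = sum(e ** 2 for e in errors)
    mean_actual = fmean(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"rmse": rmse, "mae": mae, "r_squared": 1 - ss_res / ss_tot}


actual = [1.0, 2.0, 3.0]
mean_baseline = regression_metrics(actual, [2.0, 2.0, 2.0])  # predict the mean
anti_correlated = regression_metrics(actual, [3.0, 2.0, 1.0])

print(mean_baseline)
print(anti_correlated)
```

Under this definition, always predicting the mean of the evaluated data gives an R² of 0, so a negative value means the model did worse than that baseline on the test split.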


In [58]:
# Extract the fitted recipe for inspection
fitted_recipe = final_wf_fit.extract_preprocessor()

print("Fitted Recipe:")
print(f"Type: {type(fitted_recipe).__name__}")
print(f"Number of prepared steps: {len(fitted_recipe.prepared_steps)}")

# Can use the fitted recipe independently
new_data_transformed = fitted_recipe.bake(test)
print(f"\nTransformed shape: {new_data_transformed.shape}")

Fitted Recipe:
Type: PreparedRecipe
Number of prepared steps: 4

Transformed shape: (40, 13)


---

## Summary

**py-recipes** provides a powerful, composable system for feature engineering:

### Key Benefits

1. **No Data Leakage**: Preprocessing fitted on train, applied to test
2. **Reproducible**: Same transformations applied consistently
3. **Composable**: Chain multiple steps together
4. **Integrated**: Works seamlessly with py-workflows
5. **Method Chaining**: Clean, readable syntax

### Available Steps

- `step_normalize()`: zscore or minmax normalization
- `step_dummy()`: one-hot encoding
- `step_impute_mean()`: mean imputation
- `step_impute_median()`: median imputation
- `step_mutate()`: custom transformations

### Pattern

```python
# 1. Create recipe
rec = recipe().step_normalize().step_dummy(["category"])

# 2. Fit to training data
rec_fit = rec.prep(train)

# 3. Apply to any data
train_transformed = rec_fit.bake(train)
test_transformed = rec_fit.bake(test)
```

### Integration with Workflows

```python
wf = (
    workflow()
    .add_recipe(rec)
    .add_model(linear_reg().set_engine("sklearn"))
)
wf_fit = wf.fit(train).evaluate(test)
outputs, coefficients, stats = wf_fit.extract_outputs()
```

---

## Next Steps

- Explore `08_workflows_demo.ipynb` for more workflow examples
- Check `02_parsnip_demo.ipynb` for model specifications
- See `07_rsample_demo.ipynb` for time series cross-validation

---