# py-hardhat Demo: Formula-Based Data Preprocessing

This notebook demonstrates the core functionality of py-hardhat:
- `mold()`: Convert a formula plus training data into a model-ready format
- `forge()`: Apply the resulting blueprint to new data for prediction
- Consistent handling of categorical variables
- Prevention of common train/test mismatches, such as missing columns or unseen factor levels

In [1]:
import pandas as pd
import numpy as np
import sys

# Add parent directory to path
sys.path.insert(0, '..')

from py_hardhat import mold, forge

## Example 1: Simple Numeric Regression

Let's start with a simple example using only numeric predictors.

In [2]:
# Create training data
train = pd.DataFrame({
    'sales': [100, 200, 150, 300, 250],
    'price': [10, 20, 15, 30, 25],
    'advertising': [5, 10, 7, 15, 12]
})

print("Training data:")
print(train)

Training data:
   sales  price  advertising
0    100     10            5
1    200     20           10
2    150     15            7
3    300     30           15
4    250     25           12


In [3]:
# Mold the training data
molded = mold("sales ~ price + advertising", train)

print("\nOutcomes (y):")
print(molded.outcomes)

print("\nPredictors (X):")
print(molded.predictors)

print("\nBlueprint formula:", molded.blueprint.formula)
print("Blueprint roles:", molded.blueprint.roles)


Outcomes (y):
   sales
0  100.0
1  200.0
2  150.0
3  300.0
4  250.0

Predictors (X):
   Intercept  price  advertising
0        1.0   10.0          5.0
1        1.0   20.0         10.0
2        1.0   15.0          7.0
3        1.0   30.0         15.0
4        1.0   25.0         12.0

Blueprint formula: sales ~ price + advertising
Blueprint roles: {'outcome': ['sales'], 'predictor': ['Intercept', 'price', 'advertising']}


In [4]:
# Create test data (no outcome variable)
test = pd.DataFrame({
    'price': [12, 18, 28],
    'advertising': [6, 9, 14]
})

print("Test data:")
print(test)

# Forge the test data using the blueprint
forged = forge(test, molded.blueprint)

print("\nForged predictors (same structure as training):")
print(forged.predictors)

Test data:
   price  advertising
0     12            6
1     18            9
2     28           14

Forged predictors (same structure as training):
   Intercept  price  advertising
0        1.0   12.0          6.0
1        1.0   18.0          9.0
2        1.0   28.0         14.0
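
To make the mold/forge pattern concrete, here is a minimal pandas-only sketch of the idea for numeric predictors. It is not py-hardhat's implementation: `sketch_mold`, `sketch_forge`, and the dictionary used as a blueprint are hypothetical names introduced only for illustration. The point is that the predictor list is recorded once at training time and then reused verbatim on new data.

```python
import pandas as pd


def sketch_mold(data, outcome, predictors):
    """Hypothetical stand-in for mold(): build X and y, and record a blueprint."""
    X = data[predictors].astype(float)
    X.insert(0, "Intercept", 1.0)  # constant column, as in the output above
    y = data[[outcome]].astype(float)
    blueprint = {"outcome": outcome, "predictors": list(predictors)}
    return X, y, blueprint


def sketch_forge(new_data, blueprint):
    """Hypothetical stand-in for forge(): rebuild X from the blueprint alone."""
    missing = [c for c in blueprint["predictors"] if c not in new_data.columns]
    if missing:
        raise ValueError(f"Missing predictor columns: {missing}")
    # Select in blueprint order; columns not named in the blueprint are dropped.
    X = new_data[blueprint["predictors"]].astype(float)
    X.insert(0, "Intercept", 1.0)
    return X


train = pd.DataFrame({
    "sales": [100, 200, 150, 300, 250],
    "price": [10, 20, 15, 30, 25],
    "advertising": [5, 10, 7, 15, 12],
})
# New data with a different column order and an extra, unused column.
test = pd.DataFrame({
    "advertising": [6, 9, 14],
    "price": [12, 18, 28],
    "extra": [0, 0, 0],
})

X_train, y_train, bp = sketch_mold(train, "sales", ["price", "advertising"])
X_test = sketch_forge(test, bp)
print(list(X_test.columns) == list(X_train.columns))
```

Because the column selection comes from the blueprint rather than from the new data, reordered or extra columns in the test frame do not change the resulting layout.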


## Example 2: Categorical Variables

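py-hardhat expands categorical variables into indicator (dummy) columns and ensures consistent factor levels between train and test. As the output below shows, the first level (`A`) is dropped and serves as the baseline, so this is treatment coding rather than full one-hot encoding.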

In [5]:
# Create training data with categorical variable
train_cat = pd.DataFrame({
    'sales': [100, 200, 150, 300, 250, 180],
    'price': [10, 20, 15, 30, 25, 18],
    'store': ['A', 'B', 'A', 'C', 'B', 'C']
})

print("Training data with categorical:")
print(train_cat)

Training data with categorical:
   sales  price store
0    100     10     A
1    200     20     B
2    150     15     A
3    300     30     C
4    250     25     B
5    180     18     C


In [6]:
# Mold with categorical variable
molded_cat = mold("sales ~ price + store", train_cat)

print("\nPredictors with dummy-coded 'store' (baseline level 'A' dropped):")
print(molded_cat.predictors)

print("\nFactor levels captured:")
print(molded_cat.blueprint.factor_levels)


Predictors with dummy-coded 'store' (baseline level 'A' dropped):
   Intercept  store[T.B]  store[T.C]  price
0        1.0         0.0         0.0   10.0
1        1.0         1.0         0.0   20.0
2        1.0         0.0         0.0   15.0
3        1.0         0.0         1.0   30.0
4        1.0         1.0         0.0   25.0
5        1.0         0.0         1.0   18.0

Factor levels captured:
{'store': ['A', 'B', 'C']}
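
The relationship between the captured levels and the indicator columns can be reproduced with plain pandas. The sketch below is an illustration, not py-hardhat's internals: it fixes the level list on the column and drops the first level as the baseline. Note that pandas names the columns `store_B` and `store_C`, whereas the output above uses `store[T.B]` and `store[T.C]`.

```python
import pandas as pd

store = pd.Series(["A", "B", "A", "C", "B", "C"], name="store")

# Capture the levels once, at training time (sorted, so 'A' comes first).
levels = sorted(store.unique())

# Attach the fixed level list to the column, then expand into indicators.
# drop_first=True removes the first level ('A'), which becomes the baseline:
# a row with all indicators equal to 0 is a store 'A' row.
store_cat = store.astype(pd.CategoricalDtype(categories=levels))
coded = pd.get_dummies(store_cat, prefix="store", drop_first=True, dtype=float)

print(levels)
print(coded)
```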


In [7]:
# Test data with valid categories (A and B)
test_valid = pd.DataFrame({
    'price': [12, 22],
    'store': ['A', 'B']  # Both seen in training
})

forged_valid = forge(test_valid, molded_cat.blueprint)

print("\nTest data (valid categories):")
print(test_valid)
print("\nForged predictors (note: store[T.C] column exists but is 0):")
print(forged_valid.predictors)


Test data (valid categories):
   price store
0     12     A
1     22     B

Forged predictors (note: store[T.C] column exists but is 0):
   Intercept  store[T.B]  store[T.C]  price
0        1.0         0.0         0.0   12.0
1        1.0         1.0         0.0   22.0
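
For contrast, the sketch below shows what happens when train and test frames are encoded independently with `pd.get_dummies`, without any shared record of the levels. The test frame loses the column for store `C`, so its layout no longer matches the training layout. One manual repair is to reindex against the training columns. The variable names are illustrative only.

```python
import pandas as pd

train_store = pd.DataFrame({"store": ["A", "B", "A", "C", "B", "C"]})
test_store = pd.DataFrame({"store": ["A", "B"]})

# Naive approach: encode each frame on its own.
train_naive = pd.get_dummies(train_store, columns=["store"], dtype=float)
test_naive = pd.get_dummies(test_store, columns=["store"], dtype=float)
print(list(train_naive.columns))
print(list(test_naive.columns))  # no column for store 'C'

# Manual repair: align the test columns to the training columns and
# fill indicators that never appeared in the test data with 0.
test_aligned = test_naive.reindex(columns=train_naive.columns, fill_value=0.0)
print(test_aligned)
```

Reindexing fixes the shape, but it has to be remembered at every prediction site. Storing the levels in a blueprint moves that bookkeeping into one place.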


### Handling New Factor Levels

A key benefit of py-hardhat is that a categorical level that never appeared in training makes `forge()` raise an error, rather than letting the mismatch pass silently into the model input.

In [8]:
# Test data with NEW category 'D' not seen in training
test_invalid = pd.DataFrame({
    'price': [25],
    'store': ['D']  # NEW level - should error
})

try:
    forged_invalid = forge(test_invalid, molded_cat.blueprint)
    print("ERROR: Should have raised ValueError!")
except ValueError as e:
    print(f"✓ Correctly caught new factor level: {e}")

✓ Correctly caught new factor level: New factor levels found in 'store': {'D'}. Expected levels: {'A', 'B', 'C'}
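
The check behind this kind of error can be sketched in a few lines of plain Python. `check_levels` is a hypothetical helper written for this illustration, not a py-hardhat function: it compares the levels in new data with the levels recorded at training time and raises if anything is unseen.

```python
def check_levels(values, expected_levels, column):
    """Raise ValueError if `values` contains a level not seen in training."""
    unseen = set(values) - set(expected_levels)
    if unseen:
        raise ValueError(
            f"New factor levels found in '{column}': {sorted(unseen)}. "
            f"Expected levels: {sorted(expected_levels)}"
        )


# Levels as recorded at training time (same shape as the dict printed earlier).
factor_levels = {"store": ["A", "B", "C"]}

# Known levels pass silently.
check_levels(["A", "B"], factor_levels["store"], "store")

# An unseen level is reported before any encoding is attempted.
try:
    check_levels(["D"], factor_levels["store"], "store")
    message = None
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early is a design choice. An alternative would be to map unseen levels to the baseline or to an all-zero row, but that would hide the fact that the model is being asked about a category it has never seen.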


## Example 3: Integration with scikit-learn

A complete workflow looks like this: mold → fit model → forge → predict.

In [9]:
from sklearn.linear_model import LinearRegression

# Training
train_full = pd.DataFrame({
    'sales': [100, 200, 150, 300, 250, 180, 220],
    'price': [10, 20, 15, 30, 25, 18, 22],
    'advertising': [5, 10, 7, 15, 12, 9, 11],
    'region': ['North', 'South', 'North', 'West', 'South', 'West', 'North']
})

# Mold training data
molded_full = mold("sales ~ price + advertising + region", train_full)

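# Fit linear regression
# Note: the predictors already include an Intercept column, and LinearRegression
# fits its own intercept by default, so that column's coefficient comes out as 0.
model = LinearRegression()
model.fit(molded_full.predictors, molded_full.outcomes)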

print("Model trained!")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Model trained!
Coefficients: [[ 0.00000000e+00 -3.86342202e-14  1.65343510e-14  1.00000000e+01
  -1.08970282e-13]]
Intercept: [5.68434189e-14]
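
The coefficients above are easier to read once paired with column names: every value is numerically zero except one, which is 10. That follows from the toy data, where `sales` equals exactly 10 × `price` in every row. The sketch below reproduces this with a hand-built design matrix and `np.linalg.lstsq` instead of `LinearRegression`. The column names and their order are assumptions inferred from the pattern in Example 2, not taken from py-hardhat output.

```python
import numpy as np

# Assumed column layout, following the Example 2 naming pattern.
columns = ["Intercept", "region[T.South]", "region[T.West]", "price", "advertising"]

# Rows follow train_full: North, South, North, West, South, West, North.
X = np.array([
    [1, 0, 0, 10, 5],
    [1, 1, 0, 20, 10],
    [1, 0, 0, 15, 7],
    [1, 0, 1, 30, 15],
    [1, 1, 0, 25, 12],
    [1, 0, 1, 18, 9],
    [1, 0, 0, 22, 11],
], dtype=float)
y = np.array([100, 200, 150, 300, 250, 180, 220], dtype=float)

# Ordinary least squares on the full design matrix (Intercept column included).
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
for name, value in zip(columns, coef):
    print(f"{name:>16}: {value: .6f}")

# New rows: (North, 12, 6), (South, 28, 14), (West, 16, 8).
X_new = np.array([
    [1, 0, 0, 12, 6],
    [1, 1, 0, 28, 14],
    [1, 0, 1, 16, 8],
], dtype=float)
pred = X_new @ coef
print(pred)
```

Since the fit is exact, the predictions are simply 10 × `price`. The region and advertising columns carry no information in this toy data set.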


In [10]:
# Prediction on new data
test_full = pd.DataFrame({
    'price': [12, 28, 16],
    'advertising': [6, 14, 8],
    'region': ['North', 'South', 'West']
})

# Forge test data with same blueprint
forged_full = forge(test_full, molded_full.blueprint)

# Predict
predictions = model.predict(forged_full.predictors)

print("\nTest data:")
print(test_full)
print("\nPredictions:")
print(predictions)


Test data:
   price  advertising region
0     12            6  North
1     28           14  South
2     16            8   West

Predictions:
[[120.]
 [280.]
 [160.]]


## Summary

**py-hardhat provides:**

1. **Consistent data structure** - Train and test data are guaranteed to have identical column structure
2. **Categorical safety** - New factor levels in test data are caught early with clear error messages
3. **Formula interface** - R-style formulas for intuitive model specification
4. **Blueprint pattern** - Capture preprocessing decisions once, apply them everywhere

**This is the foundation layer** that will be used by:
- `py-parsnip` for model fitting
- `py-workflows` for pipeline composition
- `py-recipes` for feature engineering (coming next!)