# py-parsnip Demo: Unified Model Specification

This notebook demonstrates py-parsnip, which provides a unified interface for model specification across different computational engines.

**Key concepts:**
- `ModelSpec`: Immutable model specification (type + engine + args)
- `ModelFit`: Fitted model container
- Engine registry: Pluggable backends (sklearn, statsmodels, etc.)
- Integration with py-hardhat for preprocessing (`mold` / `forge`)
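
As a rough illustration of the engine-registry idea, the sketch below maps `(model_type, engine)` pairs to fit functions through a decorator. The names `ENGINES`, `register_engine`, `get_engine`, and the toy `"mean"` engine are hypothetical and are not taken from py-parsnip's source.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: (model_type, engine) -> fit function.
ENGINES: Dict[Tuple[str, str], Callable[[List[float]], float]] = {}


def register_engine(model_type: str, engine: str):
    """Decorator that stores a fit function under (model_type, engine)."""
    def decorator(fit_fn):
        ENGINES[(model_type, engine)] = fit_fn
        return fit_fn
    return decorator


def get_engine(model_type: str, engine: str):
    """Look up a registered fit function, failing loudly if it is missing."""
    try:
        return ENGINES[(model_type, engine)]
    except KeyError:
        raise ValueError(
            f"No engine {engine!r} registered for {model_type!r}"
        ) from None


@register_engine("linear_reg", "mean")
def fit_mean(y: List[float]) -> float:
    """Toy backend: 'fits' by returning the mean of the outcome."""
    return sum(y) / len(y)


fitted_value = get_engine("linear_reg", "mean")([100, 200, 150])
print(fitted_value)
```

Keeping the lookup behind one function means a new backend only has to register itself; callers never need to import it directly.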

In [1]:
import pandas as pd
import numpy as np
import sys

sys.path.insert(0, '..')

from py_parsnip import linear_reg
from py_hardhat import mold, forge

## Example 1: Basic Linear Regression

The simplest case: OLS regression with no regularization.

In [2]:
# Create training data
train = pd.DataFrame({
    'sales': [100, 200, 150, 300, 250, 180, 220, 280],
    'price': [10, 20, 15, 30, 25, 18, 22, 28],
    'advertising': [5, 10, 7, 15, 12, 9, 11, 14]
})

print("Training data:")
print(train.head())

Training data:
   sales  price  advertising
0    100     10            5
1    200     20           10
2    150     15            7
3    300     30           15
4    250     25           12


In [3]:
# Create model specification
spec = linear_reg()

print("Model spec:")
print(f"  Type: {spec.model_type}")
print(f"  Engine: {spec.engine}")
print(f"  Mode: {spec.mode}")
print(f"  Args: {spec.args}")

Model spec:
  Type: linear_reg
  Engine: sklearn
  Mode: regression
  Args: {}


In [4]:
# Fit model with formula
# This automatically calls mold() internally
fit = spec.fit(train, "sales ~ price + advertising")

print("\nModel fitted!")
print(f"Model class: {fit.fit_data['model_class']}")
print(f"N features: {fit.fit_data['n_features']}")
print(f"N obs: {fit.fit_data['n_obs']}")


Model fitted!
Model class: LinearRegression
N features: 3
N obs: 8


In [5]:
# Make predictions
test = pd.DataFrame({
    'price': [12, 18, 24],
    'advertising': [6, 9, 12]
})

predictions = fit.predict(test)

print("\nPredictions:")
print(predictions)


Predictions:
   .pred
0  120.0
1  180.0
2  240.0


## Example 2: Regularized Regression

Changing `penalty` and `mixture` switches the type of regularization: `mixture=0.0` gives Ridge (L2), `mixture=1.0` gives Lasso (L1), and values in between give ElasticNet.

In [6]:
# Ridge regression (L2 penalty)
ridge_spec = linear_reg(penalty=0.1, mixture=0.0)
ridge_fit = ridge_spec.fit(train, "sales ~ price + advertising")

print("Ridge model:")
print(f"  Model class: {ridge_fit.fit_data['model_class']}")

# Lasso regression (L1 penalty)
lasso_spec = linear_reg(penalty=0.1, mixture=1.0)
lasso_fit = lasso_spec.fit(train, "sales ~ price + advertising")

print("\nLasso model:")
print(f"  Model class: {lasso_fit.fit_data['model_class']}")

# ElasticNet (mix of L1 and L2)
elastic_spec = linear_reg(penalty=0.1, mixture=0.5)
elastic_fit = elastic_spec.fit(train, "sales ~ price + advertising")

print("\nElasticNet model:")
print(f"  Model class: {elastic_fit.fit_data['model_class']}")

Ridge model:
  Model class: Ridge

Lasso model:
  Model class: Lasso

ElasticNet model:
  Model class: ElasticNet
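
The class names printed above suggest a simple dispatch rule on `penalty` and `mixture`. A minimal sketch of such a rule follows; `choose_estimator_name` is a hypothetical helper, and treating a missing or zero penalty as plain OLS (and a missing mixture as pure L2) is an assumption rather than documented py-parsnip behaviour.

```python
from typing import Optional


def choose_estimator_name(penalty: Optional[float] = None,
                          mixture: Optional[float] = None) -> str:
    """Return the sklearn class name implied by penalty/mixture (hypothetical rule)."""
    if penalty is None or penalty == 0:
        return "LinearRegression"       # assumption: no penalty means OLS
    if mixture is None:
        mixture = 0.0                   # assumption: default to pure L2
    if not 0.0 <= mixture <= 1.0:
        raise ValueError("mixture must be between 0 and 1")
    if mixture == 0.0:
        return "Ridge"                  # pure L2 penalty
    if mixture == 1.0:
        return "Lasso"                  # pure L1 penalty
    return "ElasticNet"                 # blend of L1 and L2


examples = [
    {},
    {"penalty": 0.1, "mixture": 0.0},
    {"penalty": 0.1, "mixture": 1.0},
    {"penalty": 0.1, "mixture": 0.5},
]
for args in examples:
    print(args, "->", choose_estimator_name(**args))
```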


## Example 3: Extracting Model Outputs

py-parsnip provides standardized output extraction across all engines.

In [7]:
# Extract the underlying sklearn model
sklearn_model = fit.extract_fit_engine()

print("Underlying sklearn model:")
print(f"  Type: {type(sklearn_model).__name__}")
print(f"  Coefficients: {sklearn_model.coef_}")
print(f"  Intercept: {sklearn_model.intercept_}")

Underlying sklearn model:
  Type: LinearRegression
  Coefficients: [ 0.00000000e+00  1.00000000e+01 -3.04011234e-14]
  Intercept: -1.4210854715202004e-13
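
The coefficients above follow from the toy data itself: every `sales` value is exactly ten times `price`, so OLS recovers a slope of 10 on `price` and leaves essentially nothing for `advertising` or the intercept. A quick check using only the values from Example 1:

```python
sales = [100, 200, 150, 300, 250, 180, 220, 280]
price = [10, 20, 15, 30, 25, 18, 22, 28]

# True when the outcome is an exact multiple of price on every row.
exact_multiple = all(s == 10 * p for s, p in zip(sales, price))
print(exact_multiple)

# The same rule applied to the test prices from Example 1.
implied_predictions = [10 * p for p in [12, 18, 24]]
print(implied_predictions)
```

The implied values (120, 180, 240) agree with the `.pred` column printed earlier.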


In [8]:
# Extract standardized three-DataFrame output
# NOTE: With the new comprehensive output structure, extract_outputs() now returns:
# 1. Outputs: Observation-level results (actuals, fitted, residuals, split)
# 2. Coefficients: Enhanced with p-values, confidence intervals, VIF
# 3. Stats: Comprehensive metrics by split (train/test) + residual diagnostics

outputs, coefficients, stats = fit.extract_outputs()

print("\n1. OUTPUTS DataFrame (observation-level results):")
print(outputs.head())
print(f"\nShape: {outputs.shape}")
print(f"Columns: {list(outputs.columns)}")

print("\n2. COEFFICIENTS DataFrame (enhanced with statistical inference):")
print(coefficients[['variable', 'coefficient', 'std_error', 'p_value', 'vif']])

print("\n3. STATS DataFrame (comprehensive metrics):")
# Show key metrics
key_metrics = stats[stats['metric'].isin(['rmse', 'mae', 'r_squared', 'adj_r_squared'])]
print(key_metrics[['metric', 'value', 'split']])


1. OUTPUTS DataFrame (observation-level results):
   actuals  fitted  forecast  residuals  split       model model_group_name  \
0    100.0   100.0     100.0      100.0  train  linear_reg                    
1    200.0   200.0     200.0      200.0  train  linear_reg                    
2    150.0   150.0     150.0      150.0  train  linear_reg                    
3    300.0   300.0     300.0      300.0  train  linear_reg                    
4    250.0   250.0     250.0      250.0  train  linear_reg                    

    group  
0  global  
1  global  
2  global  
3  global  
4  global  

Shape: (8, 8)
Columns: ['actuals', 'fitted', 'forecast', 'residuals', 'split', 'model', 'model_group_name', 'group']

2. COEFFICIENTS DataFrame (enhanced with statistical inference):
      variable   coefficient   std_error  p_value         vif
0    Intercept  0.000000e+00  353.740550   1.0000         NaN
1        price  1.000000e+01  229.282374   0.9669  214.824411
2  advertising -3.040112e-14  45
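
The large VIF reported for `price` reflects how closely `price` and `advertising` move together in the toy data. With only two predictors, the VIF of either one reduces to 1 / (1 - r^2), where r is their Pearson correlation. The sketch below computes it with the standard library; `vif_two_predictors` is a hypothetical helper, not part of py-parsnip.

```python
def vif_two_predictors(x, z):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    szz = sum((b - mean_z) ** 2 for b in z)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    r_squared = sxz ** 2 / (sxx * szz)   # squared Pearson correlation
    return 1.0 / (1.0 - r_squared)


price = [10, 20, 15, 30, 25, 18, 22, 28]
advertising = [5, 10, 7, 15, 12, 9, 11, 14]

vif = vif_two_predictors(price, advertising)
print(round(vif, 3))
```

The result is about 214.82, in line with the `vif` value shown for `price` in the coefficients table.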

## Example 4: Categorical Variables

py-parsnip (via py-hardhat) encodes categorical predictors as dummy variables when the formula is molded, so they can be used directly in a formula.

In [9]:
# Data with categorical variable
train_cat = pd.DataFrame({
    'sales': [100, 200, 150, 300, 250, 180, 220, 280, 190, 260],
    'price': [10, 20, 15, 30, 25, 18, 22, 28, 19, 26],
    'region': ['North', 'South', 'North', 'West', 'South', 'West', 'North', 'South', 'West', 'North']
})

print("Training data with categorical:")
print(train_cat.head())

Training data with categorical:
   sales  price region
0    100     10  North
1    200     20  South
2    150     15  North
3    300     30   West
4    250     25  South


In [10]:
# Fit model with categorical variable
spec_cat = linear_reg()
fit_cat = spec_cat.fit(train_cat, "sales ~ price + region")

# Extract coefficients to see the dummy (treatment) coding of region
_, coefs, _ = fit_cat.extract_outputs()

print("\nCoefficients (note dummy-coded regions; North is the reference level):")
print(coefs)


Coefficients (note dummy-coded regions; North is the reference level):
          variable   coefficient   std_error        t_stat   p_value  \
0        Intercept  0.000000e+00  345.172597  0.000000e+00  1.000000   
1  region[T.South] -3.369550e-14  241.690218 -1.394161e-16  1.000000   
2   region[T.West]  5.311586e-16  228.870155  2.320786e-18  1.000000   
3            price  1.000000e+01   17.224573  5.805659e-01  0.582665   

     ci_0.025    ci_0.975       vif       model model_group_name   group  
0 -844.606919  844.606919       NaN  linear_reg                   global  
1 -591.394660  591.394660  1.508544  linear_reg                   global  
2 -560.025095  560.025095  1.352752  linear_reg                   global  
3  -32.147012   52.147012  1.248161  linear_reg                   global  
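
The column names `region[T.South]` and `region[T.West]` indicate treatment coding: the first level (`North`) acts as the reference and is absorbed by the intercept, and each remaining level gets a 0/1 indicator. A minimal standard-library sketch of that encoding follows; `treatment_code` is a hypothetical helper, and picking the reference by sorting the levels alphabetically is an assumption.

```python
def treatment_code(values, name):
    """Encode a categorical column as 0/1 indicators, dropping the first level."""
    levels = sorted(set(values))
    reference, rest = levels[0], levels[1:]
    columns = {
        f"{name}[T.{level}]": [1.0 if v == level else 0.0 for v in values]
        for level in rest
    }
    return reference, columns


region = ["North", "South", "North", "West", "South"]
reference, encoded = treatment_code(region, "region")

print("reference level:", reference)
for column, indicator in encoded.items():
    print(column, indicator)
```

Dropping the reference level avoids perfect collinearity with the intercept, which is why only two of the three regions appear as coefficients.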


In [11]:
# Predict on new data
test_cat = pd.DataFrame({
    'price': [15, 25],
    'region': ['North', 'South']
})

predictions_cat = fit_cat.predict(test_cat)

print("\nPredictions:")
print(predictions_cat)


Predictions:
   .pred
0  150.0
1  250.0


## Example 5: Model Spec Immutability

`ModelSpec` is immutable: `set_args()` returns a new spec and leaves the original unchanged, so a base spec can be reused safely.

In [12]:
# Create base spec
base_spec = linear_reg()

# Create variations without modifying original
ridge_spec = base_spec.set_args(penalty=0.1, mixture=0.0)
lasso_spec = base_spec.set_args(penalty=0.1, mixture=1.0)

print("Base spec args:", base_spec.args)
print("Ridge spec args:", ridge_spec.args)
print("Lasso spec args:", lasso_spec.args)

# Original spec unchanged
print("\nBase spec still unchanged:", base_spec.args)

Base spec args: {}
Ridge spec args: {'penalty': 0.1, 'mixture': 0.0}
Lasso spec args: {'penalty': 0.1, 'mixture': 1.0}

Base spec still unchanged: {}
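
One common way to get this behaviour in Python is a frozen dataclass whose `set_args()` builds a new instance. The sketch below shows that general pattern under that assumption; it is not py-parsnip's implementation, and `SketchSpec` is a hypothetical stand-in for `ModelSpec`.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class SketchSpec:
    """Hypothetical stand-in for ModelSpec: fields cannot be reassigned."""
    model_type: str
    engine: str = "sklearn"
    mode: str = "regression"
    args: Dict[str, Any] = field(default_factory=dict)

    def set_args(self, **kwargs) -> "SketchSpec":
        """Return a new spec with merged args; self is left untouched."""
        return replace(self, args={**self.args, **kwargs})


base = SketchSpec("linear_reg")
ridge = base.set_args(penalty=0.1, mixture=0.0)

print("base args:", base.args)
print("ridge args:", ridge.args)

try:
    base.engine = "statsmodels"   # reassignment is blocked by frozen=True
except FrozenInstanceError:
    print("cannot reassign fields on a frozen spec")
```

Note that `frozen=True` only blocks attribute reassignment; the `args` dict itself is still mutable, so a stricter version would store it in a read-only mapping.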


## Example 6: Integration with hardhat

You can also use hardhat's mold/forge explicitly for more control.

In [13]:
# Mold training data explicitly
molded = mold("sales ~ price + advertising", train)

print("Molded data:")
print(f"  Outcomes shape: {molded.outcomes.shape}")
print(f"  Predictors shape: {molded.predictors.shape}")
print(f"  Blueprint formula: {molded.blueprint.formula}")

Molded data:
  Outcomes shape: (8, 1)
  Predictors shape: (8, 3)
  Blueprint formula: sales ~ price + advertising


In [14]:
# Fit model using molded data
spec = linear_reg()
fit = spec.fit(molded)  # Pass MoldedData instead of formula

print("\nModel fitted from molded data!")


Model fitted from molded data!


In [15]:
# Forge test data
test = pd.DataFrame({
    'price': [12, 18],
    'advertising': [6, 9]
})

forged = forge(test, molded.blueprint)

print("\nForged test data:")
print(forged.predictors)


Forged test data:
   Intercept  price  advertising
0        1.0   12.0          6.0
1        1.0   18.0          9.0
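
Conceptually, `mold()` records how the training predictors were built and `forge()` replays that recipe on new data so the columns line up. The sketch below imitates the idea with plain dictionaries; `sketch_mold`, `sketch_forge`, and the blueprint layout are hypothetical and far simpler than py-hardhat's.

```python
def sketch_forge(new_data, blueprint):
    """Build predictor rows (intercept first) in the order stored in the blueprint."""
    missing = [t for t in blueprint["terms"] if t not in new_data]
    if missing:
        raise KeyError(f"new data is missing columns: {missing}")
    n_rows = len(new_data[blueprint["terms"][0]])
    return [
        {"Intercept": 1.0, **{t: float(new_data[t][i]) for t in blueprint["terms"]}}
        for i in range(n_rows)
    ]


def sketch_mold(formula, data):
    """Split 'y ~ a + b' into outcome and terms, and remember the term order."""
    outcome, rhs = [part.strip() for part in formula.split("~")]
    terms = [t.strip() for t in rhs.split("+")]
    blueprint = {"formula": formula, "outcome": outcome, "terms": terms}
    return blueprint, sketch_forge(data, blueprint), list(data[outcome])


train = {"sales": [100, 200], "price": [10, 20], "advertising": [5, 10]}
test = {"advertising": [6, 9], "price": [12, 18]}  # column order differs on purpose

blueprint, train_rows, outcomes = sketch_mold("sales ~ price + advertising", train)
forged_rows = sketch_forge(test, blueprint)
print(forged_rows[0])
```

Because the blueprint fixes the column order, the forged rows match the training layout even though the test dictionary lists its columns differently.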


## Summary

**py-parsnip provides:**

1. **Unified interface** - Same API across different backends (sklearn, statsmodels, etc.)
2. **Immutable specs** - Safe to reuse and modify without side effects
3. **Easy regularization** - Simple parameter changes switch between OLS, Ridge, Lasso, ElasticNet
4. **Standardized outputs** - Three-DataFrame output format across all engines
5. **py-hardhat integration** - Fits from a formula (molding happens internally) or from explicitly molded data

**What's next:**
- Add more models (rand_forest, boost_tree, prophet, etc.)
- Add more engines (statsmodels, prophet, skforecast)
- Implement py-workflows for pipeline composition

## Example 7: Train/Test Evaluation with evaluate()

The `evaluate()` method stores predictions for a held-out test set on the fitted model, so that `extract_outputs()` reports results for both the train and test splits.

## Key Takeaways from New Output Structure

**Comprehensive Train/Test Evaluation:**
1. **Outputs DataFrame**: Observation-level results with actuals, fitted, forecast, residuals
   - Includes both training and test observations
   - Identified by `split` column ('train' or 'test')
   - Model metadata for grouping (model, model_group_name, group)

2. **Coefficients DataFrame**: Enhanced statistical inference
   - Full OLS statistics: std_error, t_stat, p_value, confidence intervals
   - VIF for multicollinearity detection
   - Works for OLS models (regularized models show NaN for inference)

3. **Stats DataFrame**: Comprehensive metrics by split
   - Performance metrics calculated separately for train and test
   - Residual diagnostics (Durbin-Watson, Shapiro-Wilk) on training data
   - Model information (formula, AIC, BIC, etc.)

**Workflow:**
```python
# 1. Fit on training data
fit = spec.fit(train, formula)

# 2. Evaluate on test data
fit = fit.evaluate(test)

# 3. Extract comprehensive results
outputs, coefficients, stats = fit.extract_outputs()
```
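
To make the "metrics by split" idea concrete, the sketch below computes RMSE, MAE, and R² separately for each split label from observation-level rows. The function name `metrics_by_split` and the small input rows are made up for illustration; the library's own `stats` DataFrame contains more metrics than this.

```python
import math
from collections import defaultdict


def metrics_by_split(rows):
    """rows: iterable of (split, actual, predicted). Returns {split: {metric: value}}."""
    grouped = defaultdict(list)
    for split, actual, predicted in rows:
        grouped[split].append((actual, predicted))

    results = {}
    for split, pairs in grouped.items():
        n = len(pairs)
        errors = [a - p for a, p in pairs]
        mean_actual = sum(a for a, _ in pairs) / n
        ss_res = sum(e ** 2 for e in errors)                     # residual sum of squares
        ss_tot = sum((a - mean_actual) ** 2 for a, _ in pairs)   # total sum of squares
        results[split] = {
            "rmse": math.sqrt(ss_res / n),
            "mae": sum(abs(e) for e in errors) / n,
            "r_squared": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        }
    return results


# Illustrative rows only, not taken from the notebook's data.
rows = [
    ("train", 10.0, 11.0), ("train", 20.0, 19.0), ("train", 30.0, 30.0),
    ("test", 40.0, 44.0), ("test", 50.0, 48.0),
]
metrics = metrics_by_split(rows)
for split, values in metrics.items():
    print(split, values)
```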

In [16]:
# Create larger dataset with train/test split
np.random.seed(42)
n = 100
data = pd.DataFrame({
    'sales': 100 + 10 * np.arange(n) + 50 * np.random.randn(n),
    'price': 10 + 0.5 * np.arange(n) + 5 * np.random.randn(n),
    'advertising': 5 + 0.3 * np.arange(n) + 3 * np.random.randn(n)
})

# Split into train/test
train = data.iloc[:80].copy()
test = data.iloc[80:].copy()

print(f"Training data: {len(train)} observations")
print(f"Test data: {len(test)} observations")

Training data: 80 observations
Test data: 20 observations


In [17]:
# Fit model on training data
spec = linear_reg()
fit = spec.fit(train, "sales ~ price + advertising")

# Evaluate on test data - this stores test predictions for comprehensive metrics
fit = fit.evaluate(test)

print("Model fitted and evaluated!")
print(f"Training observations: {fit.fit_data['n_obs']}")
print(f"Test predictions stored: {'test_predictions' in fit.evaluation_data}")

Model fitted and evaluated!
Training observations: 80
Test predictions stored: True


In [18]:
# Extract comprehensive outputs with train/test splits
outputs, coefficients, stats = fit.extract_outputs()

print("=" * 60)
print("1. OUTPUTS DataFrame - Observation-level results")
print("=" * 60)
print(f"Total observations: {len(outputs)}")
print(f"Training observations: {len(outputs[outputs['split'] == 'train'])}")
print(f"Test observations: {len(outputs[outputs['split'] == 'test'])}")
print(f"\nColumns: {list(outputs.columns)}")

print("\nSample training observations:")
print(outputs[outputs['split'] == 'train'][['actuals', 'fitted', 'residuals', 'split']].head())

print("\nSample test observations:")
print(outputs[outputs['split'] == 'test'][['actuals', 'forecast', 'residuals', 'split']].head())

1. OUTPUTS DataFrame - Observation-level results
Total observations: 100
Training observations: 80
Test observations: 20

Columns: ['actuals', 'fitted', 'forecast', 'residuals', 'split', 'model', 'model_group_name', 'group']

Sample training observations:
      actuals      fitted   residuals  split
0  124.835708   68.487902  124.835708  train
1  103.086785  133.851081  103.086785  train
2  152.384427  170.140700  152.384427  train
3  206.151493  156.298651  206.151493  train
4  128.292331   86.927308  128.292331  train

Sample test observations:
       actuals    forecast  residuals split
80  889.016406  890.915620  -1.899214  test
81  927.855629  854.545520  73.310109  test
82  993.894702  895.025101  98.869601  test
83  904.086489  851.244346  52.842143  test
84  899.575320  978.038576 -78.463256  test
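
In the test rows above, `residuals` equals `actuals - forecast` (for example, 889.016406 - 890.915620 = -1.899214), whereas the training rows show `residuals` equal to `actuals`. A quick consistency check such as the sketch below can flag that kind of mismatch; `residual_mismatches` is a hypothetical helper, and the rows are copied from the printed output.

```python
def residual_mismatches(rows, tol=1e-4):
    """Return rows whose residual differs from actual - predicted by more than tol."""
    return [
        r for r in rows
        if abs((r["actual"] - r["predicted"]) - r["residual"]) > tol
    ]


test_rows = [
    {"actual": 889.016406, "predicted": 890.915620, "residual": -1.899214},
    {"actual": 927.855629, "predicted": 854.545520, "residual": 73.310109},
]
train_rows = [
    {"actual": 124.835708, "predicted": 68.487902, "residual": 124.835708},
    {"actual": 103.086785, "predicted": 133.851081, "residual": 103.086785},
]

n_test_mismatches = len(residual_mismatches(test_rows))
n_train_mismatches = len(residual_mismatches(train_rows))
print(n_test_mismatches, n_train_mismatches)
```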

