# Statsmodels OLS Engine Demo

This notebook demonstrates the statsmodels engine for `linear_reg()`, which provides full statistical inference for classical linear regression.

## Key Differences: sklearn vs statsmodels

| Feature | sklearn | statsmodels |
|---------|---------|-------------|
| **Use Case** | Production ML, regularization | Classical statistics, hypothesis testing |
| **Regularization** | Ridge, Lasso, ElasticNet | Not supported (OLS only) |
| **P-values** | Manually calculated | Native from model |
| **Diagnostics** | Basic (Durbin-Watson, Shapiro-Wilk) | Enhanced (Ljung-Box, Breusch-Pagan) |
| **Model Stats** | RMSE, MAE, R² | + AIC, BIC, F-stat, log-likelihood |
| **Prediction Intervals** | Not available | Confidence intervals via `type="conf_int"` |

**When to use statsmodels:**
- Research and academic work requiring p-values
- Hypothesis testing for coefficients
- Need for prediction intervals
- Classical econometric analysis

**When to use sklearn:**
- Regularized regression (Ridge/Lasso/ElasticNet)
- Production ML pipelines
- Prediction-focused (vs inference-focused)

In [1]:
import pandas as pd
import numpy as np
from py_parsnip import linear_reg

## 1. Create Sample Data

Sales data with multiple predictors for linear regression.

In [2]:
np.random.seed(42)

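# Create synthetic sales data
n = 100
data = pd.DataFrame({
    # Placeholder; overwritten below. Kept so the random draws (and the outputs shown) stay the same.
    'sales': np.random.normal(200, 50, n) + np.random.normal(0, 20, n),
    'advertising': np.random.normal(50, 15, n),
    'price': np.random.normal(100, 20, n),
    'competitors': np.random.randint(1, 6, n),
})

# Define sales as a linear function of the predictors plus Gaussian noise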
data['sales'] = (
    100 + 
    1.5 * data['advertising'] + 
    -0.5 * data['price'] + 
    -10 * data['competitors'] + 
    np.random.normal(0, 20, n)
)

# Train/test split
train = data.iloc[:70]
test = data.iloc[70:]

print(f"Training set: {len(train)} observations")
print(f"Test set: {len(test)} observations")
print("\nFirst few rows:")
train.head()

Training set: 70 observations
Test set: 30 observations

First few rows:


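| | sales | advertising | price | competitors |
|---|-------|-------------|-------|-------------|
| 0 | 68.423031 | 55.36681 | 83.4201 | 3 |
| 1 | 143.478287 | 58.411768 | 88.796379 | 1 |
| 2 | 67.525924 | 66.245769 | 114.945872 | 5 |
| 3 | 79.400343 | 65.807031 | 112.207405 | 4 |
| 4 | 83.378934 | 29.334959 | 99.581968 | 1 |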


## 2. Fit Model with Statsmodels Engine

Use `.set_engine("statsmodels")` to get OLS with full statistical inference.

In [3]:
# Create specification with statsmodels engine
spec_sm = linear_reg().set_engine("statsmodels")

# Fit model
fit_sm = spec_sm.fit(train, "sales ~ advertising + price + competitors")

# Evaluate on test set
fit_sm = fit_sm.evaluate(test)

print("Model fitted successfully!")

Model fitted successfully!
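
For orientation, here is a minimal sketch of a comparable fit made by calling statsmodels directly. It assumes the engine wraps ordinary statsmodels OLS, which this notebook does not confirm; the toy data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative toy data (not the notebook's dataset)
rng = np.random.default_rng(0)
n = 80
toy = pd.DataFrame({
    "advertising": rng.normal(50, 15, n),
    "price": rng.normal(100, 20, n),
})
toy["sales"] = 100 + 1.5 * toy["advertising"] - 0.5 * toy["price"] + rng.normal(0, 20, n)

# Fit OLS from a formula; the intercept is added automatically
result = smf.ols("sales ~ advertising + price", data=toy).fit()

# Coefficient-level inference comes straight from the fitted results object
summary = pd.DataFrame({
    "coefficient": result.params,
    "std_error": result.bse,
    "t_stat": result.tvalues,
    "p_value": result.pvalues,
})
print(summary)
print("AIC:", result.aic, "BIC:", result.bic)
```

The results object also exposes `llf`, `fvalue`, `f_pvalue`, and `condition_number`, which correspond to the model-level statistics discussed later in this notebook.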


## 3. Compare with sklearn Engine

Fit the same model with sklearn for comparison.

In [4]:
# sklearn engine (default)
spec_sk = linear_reg()
fit_sk = spec_sk.fit(train, "sales ~ advertising + price + competitors")
fit_sk = fit_sk.evaluate(test)

print("sklearn model fitted successfully!")

sklearn model fitted successfully!


## 4. Extract Outputs and Compare

The key difference is in the coefficients DataFrame: statsmodels provides all statistical inference natively.

In [5]:
# Extract outputs, coefficients, and statistics from the statsmodels fit
outputs_sm, coefs_sm, stats_sm = fit_sm.extract_outputs()

In [6]:
outputs_sm

| | actuals | fitted | forecast | residuals | split | model | model_group_name | group |
|---|---------|--------|----------|-----------|-------|-------|------------------|-------|
| 0 | 68.423031 | 111.609183 | 68.423031 | -43.186152 | train | linear_reg | | global |
| 1 | 143.478287 | 131.554192 | 143.478287 | 11.924095 | train | linear_reg | | global |
| 2 | 67.525924 | 89.396111 | 67.525924 | -21.870187 | train | linear_reg | | global |
| 3 | 79.400343 | 99.801810 | 79.400343 | -20.401467 | train | linear_reg | | global |
| 4 | 83.378934 | 92.635688 | 83.378934 | -9.256754 | train | linear_reg | | global |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 83.611163 | 80.292029 | 83.611163 | 3.319134 | test | linear_reg | | global |
| 96 | 130.897266 | 139.327623 | 130.897266 | -8.430357 | test | linear_reg | | global |
| 97 | 54.760630 | 69.937532 | 54.760630 | -15.176902 | test | linear_reg | | global |
| 98 | 109.282094 | 93.284207 | 109.282094 | 15.997887 | test | linear_reg | | global |


In [7]:
coefs_sm

| | variable | coefficient | std_error | t_stat | p_value | ci_0.025 | ci_0.975 | vif | model | model_group_name | group |
|---|----------|-------------|-----------|--------|---------|----------|----------|-----|-------|------------------|-------|
| 0 | Intercept | 117.816781 | 17.887268 | 6.586628 | 8.742086e-09 | 82.103698 | 153.529864 | | linear_reg | | global |
| 1 | advertising | 1.154139 | 0.138422 | 8.337809 | 6.584731e-12 | 0.87777 | 1.430508 | 1.00171 | linear_reg | | global |
| 2 | price | -0.496943 | 0.162316 | -3.061577 | 0.00318343 | -0.821017 | -0.172868 | 1.016152 | linear_reg | | global |
| 3 | competitors | -9.551203 | 1.630606 | -5.857456 | 1.63142e-07 | -12.806813 | -6.295593 | 1.015138 | linear_reg | | global |


In [8]:
stats_sm

| | metric | value | split | model | model_group_name | group |
|---|--------|-------|-------|-------|------------------|-------|
| 0 | rmse | 19.30247 | train | linear_reg | | global |
| 1 | mae | 15.3534 | train | linear_reg | | global |
| 2 | mape | 17.71033 | train | linear_reg | | global |
| 3 | smape | 16.64836 | train | linear_reg | | global |
| 4 | r_squared | 0.6341171 | train | linear_reg | | global |
| 5 | mda | 88.4058 | train | linear_reg | | global |
| 6 | adj_r_squared | 0.6116012 | train | linear_reg | | global |
| 7 | rmse | 20.9696 | test | linear_reg | | global |
| 8 | mae | 16.14813 | test | linear_reg | | global |
| 9 | mape | 30.4395 | test | linear_reg | | global |


In [9]:

# Extract the sklearn outputs for a side-by-side comparison
outputs_sk, coefs_sk, stats_sk = fit_sk.extract_outputs()

print("="*80)
print("STATSMODELS COEFFICIENTS (with p-values from model)")
print("="*80)
print(coefs_sm[["variable", "coefficient", "std_error", "t_stat", "p_value"]].to_string(index=False))

print("\n" + "="*80)
print("SKLEARN COEFFICIENTS (manually calculated inference for OLS only)")
print("="*80)
print(coefs_sk[["variable", "coefficient", "std_error", "t_stat", "p_value"]].to_string(index=False))

STATSMODELS COEFFICIENTS (with p-values from model)
   variable  coefficient  std_error    t_stat      p_value
  Intercept   117.816781  17.887268  6.586628 8.742086e-09
advertising     1.154139   0.138422  8.337809 6.584731e-12
      price    -0.496943   0.162316 -3.061577 3.183430e-03
competitors    -9.551203   1.630606 -5.857456 1.631420e-07

SKLEARN COEFFICIENTS (manually calculated inference for OLS only)
   variable  coefficient  std_error    t_stat  p_value
  Intercept     0.000000  95.299013  0.000000 1.000000
advertising     1.154139   0.737481  1.564975 0.122371
      price    -0.496943   0.864780 -0.574646 0.567484
competitors    -9.551203   8.687472 -1.099423 0.275578


### Interpretation:

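Looking at p-values (with the conventional threshold α = 0.05):
- **p < 0.05**: Statistically significant (reject the null hypothesis that the coefficient equals 0)
- **p ≥ 0.05**: Not statistically significant (fail to reject the null hypothesis)

For plain OLS (no regularization), both engines should give identical coefficients and p-values. In the output above, however, only the slope coefficients match: the sklearn engine reports an intercept of 0 and standard errors roughly 5.3 times larger than those from statsmodels, so none of its p-values fall below 0.05. The statsmodels values are the ones to rely on for inference here.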
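
To make "manually calculated" inference concrete, the sketch below computes OLS coefficients, standard errors, t-statistics, and two-sided p-values from the textbook formulas with NumPy and SciPy. It is an illustration on toy data, not the code that either engine actually runs.

```python
import numpy as np
from scipy import stats

# Illustrative toy data (not the notebook's dataset)
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(50, 15, n)
x2 = rng.normal(100, 20, n)
y = 100 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 20, n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), x1, x2])

# OLS estimate: beta = (X'X)^-1 X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# Residual variance uses n - k degrees of freedom (k counts the intercept)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1
std_err = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stat = beta / std_err
p_value = 2 * stats.t.sf(np.abs(t_stat), dof)

for name, b, se, t, p in zip(["Intercept", "x1", "x2"], beta, std_err, t_stat, p_value):
    print(f"{name:>9}  coef={b:9.4f}  se={se:8.4f}  t={t:7.3f}  p={p:.3g}")
```

Note that the intercept column is part of `X` throughout; leaving it out of the variance calculation is one way manual standard errors can drift away from the model-based ones.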

## 5. Enhanced Model Statistics (statsmodels only)

Statsmodels provides additional model-level statistics like AIC, BIC, F-statistic.

In [10]:
# Filter for model-level stats (split = '')
model_stats_sm = stats_sm[stats_sm['split'] == ''][["metric", "value"]]
model_stats_sk = stats_sk[stats_sk['split'] == ''][["metric", "value"]]

print("Statsmodels Model Statistics:")
print(model_stats_sm.to_string(index=False))

print("\nsklearn Model Statistics:")
print(model_stats_sk.to_string(index=False))

Statsmodels Model Statistics:
          metric         value
             aic  6.210840e+02
             bic  6.300780e+02
  log_likelihood -3.065420e+02
     f_statistic  3.812852e+01
        f_pvalue  2.049340e-14
condition_number  8.610427e+02
      n_features  4.000000e+00

sklearn Model Statistics:
     metric                                     value
    formula sales ~ advertising + price + competitors
 model_type                                linear_reg
model_class                          LinearRegression


### What these mean:

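- **AIC/BIC**: Model selection criteria (lower is better when comparing models fitted to the same data)
- **F-statistic**: Tests the null hypothesis that all slope coefficients are zero
- **F p-value**: p-value for the F-test (< 0.05 means the predictors are jointly significant)
- **Log-likelihood**: Maximized log-likelihood from estimation
- **Condition number**: Sensitivity of the design matrix; large values can indicate multicollinearity or predictors on very different scales. The common "> 30" rule of thumb assumes scaled columns. Here the value is about 861 while every VIF is close to 1, so it reflects the scale of the predictors rather than collinearity.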
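
As a sanity check, the reported AIC, BIC, and F-statistic can be reproduced from the standard formulas using only numbers printed in this notebook: the log-likelihood, the training R², n = 70 training rows, and k = 4 estimated coefficients (intercept plus three slopes). This is a worked arithmetic sketch; that the engine uses exactly these formulas is an assumption, though the matching results support it.

```python
import math

# Values copied from the outputs above
log_likelihood = -306.5420
r_squared = 0.6341171
n_obs = 70                    # training rows
n_params = 4                  # intercept + 3 slopes
n_slopes = n_params - 1

# AIC = 2k - 2*llf, BIC = k*ln(n) - 2*llf
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

# Overall F-statistic expressed through R^2
f_stat = (r_squared / n_slopes) / ((1 - r_squared) / (n_obs - n_params))

print(f"AIC = {aic:.3f}")
print(f"BIC = {bic:.3f}")
print(f"F   = {f_stat:.3f}")
```

The three results agree with the reported `aic`, `bic`, and `f_statistic` to about three decimal places.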

## 6. Enhanced Residual Diagnostics (statsmodels)

Statsmodels provides additional tests using `statsmodels.stats.diagnostic`.

In [11]:
# Filter for diagnostic tests on training data
diag_stats = stats_sm[
    (stats_sm['split'] == 'train') & 
    (stats_sm['metric'].str.contains('_stat|_p'))
][["metric", "value"]]

print("Residual Diagnostic Tests:")
print(diag_stats.to_string(index=False))

Residual Diagnostic Tests:
            metric    value
 shapiro_wilk_stat 0.988978
    shapiro_wilk_p 0.800572
    ljung_box_stat      NaN
       ljung_box_p      NaN
breusch_pagan_stat 3.256444
   breusch_pagan_p 0.353751


### Diagnostic Test Interpretation:

1. **Durbin-Watson** (1.5-2.5 ideal)
   - Tests for first-order autocorrelation in residuals
   - ~2.0 = no autocorrelation, <1.5 or >2.5 may indicate problems
   - Not included in the filtered output above

2. **Shapiro-Wilk** (p > 0.05: no evidence against normality)
   - Tests if residuals are normally distributed
   - Here p ≈ 0.80, so we fail to reject normality (good!)

3. **Ljung-Box** (p > 0.05: no evidence of autocorrelation)
   - Tests for autocorrelation at multiple lags
   - Available only from statsmodels engine
   - Returned `NaN` in this run, so no conclusion can be drawn from it here

4. **Breusch-Pagan** (p > 0.05 means homoskedastic)
   - Tests for heteroskedasticity (non-constant variance)
   - Available only from statsmodels engine
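
The Durbin-Watson statistic from item 1 is simple enough to compute by hand from a residual vector. A minimal NumPy sketch on simulated residuals follows; the helper name `durbin_watson_stat` is made up for this example.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)      # independent residuals
random_walk = np.cumsum(white_noise)    # strongly positively autocorrelated

dw_independent = durbin_watson_stat(white_noise)
dw_autocorrelated = durbin_watson_stat(random_walk)

print(f"independent residuals:    DW = {dw_independent:.2f}")
print(f"autocorrelated residuals: DW = {dw_autocorrelated:.2f}")
```

Independent residuals land near 2, while positively autocorrelated residuals push the statistic toward 0. The same function could be applied to the `residuals` column of the training rows in `outputs_sm`.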

## 7. Prediction Intervals (statsmodels only)

With `type="conf_int"`, the statsmodels engine returns lower and upper 95% confidence bounds alongside each point prediction.

In [12]:
# Create new data for prediction
new_data = pd.DataFrame({
    'advertising': [45, 60, 55],
    'price': [95, 110, 100],
    'competitors': [2, 3, 2]
})

# Point predictions
preds_point = fit_sm.predict(new_data, type="numeric")

# Predictions with confidence intervals
preds_ci = fit_sm.predict(new_data, type="conf_int")

print("Point Predictions:")
print(preds_point)

print("\nPredictions with 95% Confidence Intervals:")
print(preds_ci)

Point Predictions:
        .pred
0  103.441098
1  103.747845
2  112.497778

Predictions with 95% Confidence Intervals:
        .pred  .pred_lower  .pred_upper
0  103.441098    97.365380   109.516816
1  103.747845    97.706255   109.789435
2  112.497778   106.697576   118.297979


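### Interpreting the Intervals:

- `.pred`: Point estimate of sales
- `.pred_lower`: Lower bound of the 95% confidence interval
- `.pred_upper`: Upper bound of the 95% confidence interval

These bounds sit about ±6 around each prediction, far narrower than the residual RMSE of roughly 19, so they describe uncertainty in the *expected* (mean) sales at those predictor values. We're 95% confident the mean falls within [.pred_lower, .pred_upper]. An individual sales observation can land well outside that range, and a prediction interval for a single new observation would be considerably wider.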
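
The difference between a confidence interval for the mean response and a prediction interval for a single new observation can be sketched with the textbook OLS formulas. This is an illustration on toy data with made-up variable names, not the engine's own code.

```python
import numpy as np
from scipy import stats

# Illustrative toy data (not the notebook's dataset)
rng = np.random.default_rng(2)
n = 70
x = rng.normal(50, 15, n)
y = 100 + 1.5 * x + rng.normal(0, 20, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

resid = y - X @ beta
dof = n - X.shape[1]
sigma = np.sqrt(resid @ resid / dof)    # residual standard error
t_crit = stats.t.ppf(0.975, dof)        # two-sided 95% critical value

x0 = np.array([1.0, 55.0])              # new point: intercept term and x = 55
pred = x0 @ beta
leverage = x0 @ XtX_inv @ x0

# Confidence interval: uncertainty in the estimated mean response only
ci_half = t_crit * sigma * np.sqrt(leverage)
# Prediction interval: adds the noise of one new observation
pi_half = t_crit * sigma * np.sqrt(1.0 + leverage)

print(f"prediction:          {pred:.2f}")
print(f"95% confidence int.: [{pred - ci_half:.2f}, {pred + ci_half:.2f}]")
print(f"95% prediction int.: [{pred - pi_half:.2f}, {pred + pi_half:.2f}]")
```

The only difference between the two half-widths is the extra `1.0` under the square root, which is why the prediction interval can never be narrower than about two residual standard errors on each side.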

## 8. Train/Test Performance Comparison

Compare prediction accuracy on training and test sets.

In [13]:
# Filter for key metrics by split
key_metrics = ['rmse', 'mae', 'r_squared']
perf_stats = stats_sm[
    stats_sm['metric'].isin(key_metrics)
][["metric", "value", "split"]].pivot(index='metric', columns='split', values='value')

print("Performance Metrics by Split:")
print(perf_stats)

print("\nInterpretation:")
print("- RMSE: Root Mean Squared Error (lower is better)")
print("- MAE: Mean Absolute Error (lower is better)")
print("- R²: Coefficient of determination (higher is better, max 1.0)")
print("\nIf test performance is much worse than train, the model may be overfitting.")

Performance Metrics by Split:
split           test      train
metric                         
mae        16.148125  15.353397
r_squared   0.711300   0.634117
rmse       20.969600  19.302469

Interpretation:
- RMSE: Root Mean Squared Error (lower is better)
- MAE: Mean Absolute Error (lower is better)
- R²: Coefficient of determination (higher is better, max 1.0)

If test performance is much worse than train, the model may be overfitting.
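
For reference, the three metrics can be written out in a few lines of plain Python. The numbers below are toy values rather than the notebook's data, and the function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(actuals, predictions):
    """Return RMSE, MAE, and R-squared for two equal-length sequences."""
    errors = [a - p for a, p in zip(actuals, predictions)]
    n = len(errors)
    ss_res = sum(e * e for e in errors)                       # residual sum of squares
    mean_actual = sum(actuals) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actuals)     # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r_squared": 1 - ss_res / ss_tot,
    }

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Because R² is measured against the mean of whichever split is being scored, it can come out higher on the test split than on the training split, as it does above, and it can even be negative on held-out data.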


## 9. Regularization Not Supported

The statsmodels engine only supports OLS (no regularization). For Ridge/Lasso/ElasticNet, use sklearn.

In [14]:
# This will raise an error
try:
    spec_ridge_sm = linear_reg(penalty=0.1).set_engine("statsmodels")
    fit_ridge_sm = spec_ridge_sm.fit(train, "sales ~ advertising + price + competitors")
except ValueError as e:
    print(f"Error: {e}")
    
print("\n✓ For regularized regression, use sklearn engine:")
spec_ridge_sk = linear_reg(penalty=0.1)  # Uses sklearn by default
fit_ridge_sk = spec_ridge_sk.fit(train, "sales ~ advertising + price + competitors")
print("Ridge regression fitted successfully with sklearn!")

Error: statsmodels engine does not support regularization (penalty parameter). Use sklearn engine for Ridge/Lasso/ElasticNet.

✓ For regularized regression, use sklearn engine:
Ridge regression fitted successfully with sklearn!
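
To illustrate what a penalty does, and why it falls outside plain OLS, here is a NumPy sketch of the closed-form ridge estimator on centered toy data. How `penalty` maps onto the sklearn engine's parameters is not shown in this notebook, so `lam` below is a generic ridge strength rather than that argument.

```python
import numpy as np

# Illustrative toy data (not the notebook's dataset)
rng = np.random.default_rng(3)
n = 70
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -0.5, -1.0]) + rng.normal(0, 1.0, n)

# Center X and y so the intercept drops out and is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y; lam = 0 reduces to OLS."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)

beta_ols = ridge_coefficients(Xc, yc, lam=0.0)
beta_ridge = ridge_coefficients(Xc, yc, lam=10.0)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
print("Norm shrinks:", np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))
```

The penalty biases the coefficients toward zero, so the usual OLS standard errors and p-values no longer apply, which is consistent with the statsmodels engine rejecting `penalty`.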


## Summary

The statsmodels engine provides:

✅ **Full statistical inference** (p-values, confidence intervals)  
✅ **Enhanced diagnostics** (Ljung-Box, Breusch-Pagan tests)  
✅ **Model selection criteria** (AIC, BIC)  
✅ **Confidence intervals for predictions** (`type="conf_int"`)  
✅ **F-statistics** for overall model significance  

**Best for:**
- Academic research
- Hypothesis testing
- Classical econometric analysis
- When p-values are required

**Use sklearn engine when:**
- You need regularization (Ridge/Lasso/ElasticNet)
- Building production ML pipelines
- Prediction accuracy is more important than inference