# py-yardstick Demo: Performance Metrics

This notebook demonstrates the py-yardstick package for model evaluation metrics.

py-yardstick provides:
- Time series metrics (RMSE, MAE, MAPE, SMAPE, MASE, R², MDA)
- Residual diagnostic tests (Durbin-Watson, Ljung-Box, Shapiro-Wilk, ADF)
- Classification metrics (Accuracy, Precision, Recall, F-measure, ROC AUC)
- Metric set composition for batch evaluation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from py_yardstick import (
    # Time Series Metrics
    rmse, mae, mape, smape, mase, r_squared, rsq_trad, mda,
    # Residual Diagnostics
    durbin_watson, ljung_box, shapiro_wilk, adf_test,
    # Classification Metrics
    accuracy, precision, recall, f_meas, roc_auc,
    # Metric Set
    metric_set
)

# Set random seed for reproducibility
np.random.seed(42)

## 1. Time Series Regression Metrics

Let's create some example predictions and calculate various metrics.

In [None]:
# Create example data
n = 100
truth = pd.Series(np.linspace(10, 100, n) + np.random.normal(0, 5, n))
estimate = truth + np.random.normal(0, 3, n)

# Visualize
plt.figure(figsize=(10, 5))
plt.plot(truth.values, label='Truth', marker='o', markersize=3)
plt.plot(estimate.values, label='Estimate', marker='x', markersize=3, alpha=0.7)
plt.legend()
plt.title('Truth vs Estimate')
plt.xlabel('Index')
plt.ylabel('Value')
plt.grid(True, alpha=0.3)
plt.show()

### Individual Metrics

In [None]:
# RMSE - Root Mean Squared Error
print("RMSE:")
print(rmse(truth, estimate))
print()

In [None]:
# MAE - Mean Absolute Error
print("MAE:")
print(mae(truth, estimate))
print()

In [None]:
# MAPE - Mean Absolute Percentage Error
print("MAPE (percentage):")
print(mape(truth, estimate))
print()

In [None]:
# SMAPE - Symmetric Mean Absolute Percentage Error
print("SMAPE (percentage, bounded 0-200):")
print(smape(truth, estimate))
print()
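
For reference, the four point-error metrics above can be written in a few lines of NumPy. This is a minimal sketch of the textbook formulas, not py-yardstick's implementation; the `*_ref` function names and the `demo_*` data are illustrative only, and the SMAPE variant shown is the one bounded by 0 and 200.

```python
import numpy as np

def rmse_ref(truth, estimate):
    """Root mean squared error: sqrt(mean((truth - estimate)^2))."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean((t - e) ** 2)))

def mae_ref(truth, estimate):
    """Mean absolute error: mean(|truth - estimate|)."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    return float(np.mean(np.abs(t - e)))

def mape_ref(truth, estimate):
    """Mean absolute percentage error, in percent (undefined if truth has zeros)."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    return float(100.0 * np.mean(np.abs((t - e) / t)))

def smape_ref(truth, estimate):
    """Symmetric MAPE, in percent: 100 * mean(2|e - t| / (|t| + |e|))."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    return float(100.0 * np.mean(2.0 * np.abs(e - t) / (np.abs(t) + np.abs(e))))

# Illustrative data: absolute errors are 10, 20 and 0
demo_t = [100.0, 200.0, 400.0]
demo_e = [110.0, 180.0, 400.0]

print("RMSE :", rmse_ref(demo_t, demo_e))
print("MAE  :", mae_ref(demo_t, demo_e))
print("MAPE :", mape_ref(demo_t, demo_e))
print("SMAPE:", smape_ref(demo_t, demo_e))
```

RMSE penalizes the single larger error more heavily than MAE does, which is why it comes out higher on the same data.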

In [None]:
# R-squared
print("R-squared:")
print(r_squared(truth, estimate))
print()

print("R-squared Traditional (squared correlation):")
print(rsq_trad(truth, estimate))
print()
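
Two definitions of R² are in common use, and they can disagree sharply when predictions are biased. The sketch below shows both in plain NumPy; it makes no claim about which definition each py-yardstick function uses, and the function names and `demo_*` data are hypothetical.

```python
import numpy as np

def rsq_from_sums_of_squares(truth, estimate):
    """Coefficient of determination: 1 - SS_res / SS_tot. Can be negative."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    ss_res = np.sum((t - e) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rsq_from_correlation(truth, estimate):
    """Squared Pearson correlation between truth and estimate. Always in [0, 1]."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    return float(np.corrcoef(t, e)[0, 1] ** 2)

# Predictions that track the truth perfectly but are offset by a constant
demo_y = np.array([1.0, 2.0, 3.0, 4.0])
demo_biased = demo_y + 10.0

print("1 - SS_res/SS_tot  :", rsq_from_sums_of_squares(demo_y, demo_biased))
print("squared correlation:", rsq_from_correlation(demo_y, demo_biased))
```

The squared correlation ignores the constant offset, while the sums-of-squares form penalizes it, so comparing the two is a quick check for systematic bias.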

### MASE - Mean Absolute Scaled Error

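MASE divides the forecast's mean absolute error by the mean absolute error of a naive forecast on the training data, so it requires the training series to compute the scaling factor. Values below 1 mean the forecast beats the naive baseline.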
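
The scaling can be sketched in NumPy as follows. This is a reference version of the standard MASE formula under the assumption that `m` is the seasonal period of the naive forecast (`m=1` for the one-step naive forecast); `mase_ref` and the `demo_*` values are illustrative and not part of py-yardstick.

```python
import numpy as np

def mase_ref(truth, estimate, train, m=1):
    """MAE on the test set divided by the in-sample MAE of the naive forecast y[t] = y[t-m]."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    tr = np.asarray(train, dtype=float)
    # Scaling factor: average absolute m-step change in the training series
    scale = np.mean(np.abs(tr[m:] - tr[:-m]))
    return float(np.mean(np.abs(t - e)) / scale)

demo_train = [10.0, 12.0, 11.0, 14.0]   # one-step changes: 2, 1, 3 -> scale 2.0
demo_test_truth = [15.0, 13.0]
demo_test_estimate = [14.0, 16.0]       # absolute errors: 1, 3 -> MAE 2.0

print(mase_ref(demo_test_truth, demo_test_estimate, demo_train, m=1))  # 1.0
```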

In [None]:
# Split data into train and test (positional slices)
train_truth = truth.iloc[:70]
test_truth = truth.iloc[70:]
test_estimate = estimate.iloc[70:]

print("MASE (scaled by naive forecast on training data):")
print(mase(test_truth, test_estimate, train=train_truth, m=1))
print()

### MDA - Mean Directional Accuracy

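MDA measures how often the predicted direction of change between consecutive observations matches the actual direction, which makes it useful for time series where getting the sign of the move right matters.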
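
Several MDA variants exist. The sketch below uses one common definition, which compares the sign of the actual change with the sign of the predicted change relative to the previous actual value. Whether py-yardstick's `mda` uses this exact variant is an assumption; `mda_ref` and the `demo_*` data are illustrative.

```python
import numpy as np

def mda_ref(truth, estimate):
    """Share of steps where sign(y[t] - y[t-1]) equals sign(yhat[t] - y[t-1])."""
    t = np.asarray(truth, dtype=float)
    e = np.asarray(estimate, dtype=float)
    actual_move = np.sign(t[1:] - t[:-1])
    predicted_move = np.sign(e[1:] - t[:-1])
    return float(np.mean(actual_move == predicted_move))

demo_truth = [10.0, 12.0, 11.0, 13.0]     # actual moves: up, down, up
demo_estimate = [10.0, 11.0, 12.5, 12.0]  # predicted moves: up, up, up

print(mda_ref(demo_truth, demo_estimate))  # two of three directions match
```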

In [None]:
print("MDA (proportion of correct directional predictions):")
print(mda(truth, estimate))
print()

## 2. Metric Set - Compute Multiple Metrics at Once

Use `metric_set()` to create a custom collection of metrics that can be computed together.
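
Conceptually, a metric set is a function that calls each metric and stacks the results. The sketch below illustrates that idea with pandas; it is not py-yardstick's implementation, and `compose_metrics`, `rmse_metric`, and `mae_metric` are hypothetical names. It assumes each metric returns a one-row DataFrame with `metric` and `value` columns.

```python
import numpy as np
import pandas as pd

def rmse_metric(truth, estimate):
    """Hypothetical metric returning a one-row DataFrame."""
    err = np.asarray(truth, dtype=float) - np.asarray(estimate, dtype=float)
    return pd.DataFrame({"metric": ["rmse"], "value": [float(np.sqrt(np.mean(err ** 2)))]})

def mae_metric(truth, estimate):
    """Hypothetical metric returning a one-row DataFrame."""
    err = np.asarray(truth, dtype=float) - np.asarray(estimate, dtype=float)
    return pd.DataFrame({"metric": ["mae"], "value": [float(np.mean(np.abs(err)))]})

def compose_metrics(*metric_fns):
    """Return a callable that evaluates every metric and stacks the rows."""
    def evaluate(truth, estimate):
        frames = [fn(truth, estimate) for fn in metric_fns]
        return pd.concat(frames, ignore_index=True)
    return evaluate

demo_metrics = compose_metrics(rmse_metric, mae_metric)
demo_results = demo_metrics(pd.Series([1.0, 2.0, 3.0]), pd.Series([1.0, 2.0, 5.0]))
print(demo_results)
```

Because every metric shares the same long format, results from different splits or models can be tagged with an extra column and concatenated, as later sections of this notebook do.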

In [None]:
# Create a metric set
my_metrics = metric_set(rmse, mae, mape, r_squared, mda)

# Compute all metrics at once
results = my_metrics(truth, estimate)
print("All metrics computed together:")
print(results)
print()

## 3. Residual Diagnostic Tests

These tests help evaluate model assumptions and residual properties.

In [None]:
# Compute residuals
residuals = truth - estimate

# Visualize residuals
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Residuals plot
axes[0].scatter(estimate, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)

# Histogram of residuals
axes[1].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Durbin-Watson Test

Tests for first-order (lag-1) autocorrelation in residuals. The statistic ranges from 0 to 4:
- 2 = no autocorrelation
- < 2 = positive autocorrelation
- > 2 = negative autocorrelation
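
The statistic itself is the sum of squared successive differences divided by the sum of squared residuals, which is approximately 2(1 - r₁) where r₁ is the lag-1 autocorrelation. A minimal NumPy sketch of that formula follows; `durbin_watson_ref` and the sample series are illustrative, not py-yardstick code.

```python
import numpy as np

def durbin_watson_ref(residuals):
    """DW = sum((e[t] - e[t-1])^2) / sum(e[t]^2)."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Sign flips at every step: strong negative autocorrelation, DW well above 2
demo_alternating = [1.0, -1.0, 1.0, -1.0, 1.0]
# No change between steps: strong positive autocorrelation, DW at 0
demo_constant = [1.0, 1.0, 1.0, 1.0, 1.0]

print(durbin_watson_ref(demo_alternating))  # 3.2
print(durbin_watson_ref(demo_constant))     # 0.0
```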

In [None]:
print("Durbin-Watson Test:")
dw_result = durbin_watson(residuals)
print(dw_result)
dw_value = dw_result['value'].iloc[0]
print(f"\nInterpretation: DW = {dw_value:.3f}")
# 1.5 and 2.5 are rule-of-thumb cutoffs, not formal critical values
if dw_value < 1.5:
    print("→ Positive autocorrelation detected")
elif dw_value > 2.5:
    print("→ Negative autocorrelation detected")
else:
    print("→ Little or no autocorrelation")
print()

### Ljung-Box Test

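Tests for autocorrelation jointly across all lags up to a specified maximum. The null hypothesis is that there is no autocorrelation at any of those lags. Returns both the test statistic and the p-value.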
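
The Ljung-Box statistic is Q = n(n+2) Σ r_k² / (n-k), summed over lags k = 1..h, where r_k is the sample autocorrelation at lag k. Under the null hypothesis Q is approximately chi-square distributed with h degrees of freedom (fewer when the series is the residual of a fitted model). The sketch below computes only Q, in plain NumPy; `ljung_box_q` and the sample series are illustrative, not py-yardstick code.

```python
import numpy as np

def ljung_box_q(series, lags=10):
    """Ljung-Box Q statistic over lags 1..lags (requires lags < len(series))."""
    x = np.asarray(series, dtype=float)
    n = x.size
    d = x - x.mean()
    denom = np.sum(d ** 2)
    total = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k
        r_k = np.sum(d[k:] * d[:-k]) / denom
        total += r_k ** 2 / (n - k)
    return float(n * (n + 2) * total)

# A strictly alternating series has a large lag-1 autocorrelation (r_1 = -5/6)
demo_series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(ljung_box_q(demo_series, lags=1))
```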

In [None]:
print("Ljung-Box Test (10 lags):")
lb_result = ljung_box(residuals, lags=10)
print(lb_result)
p_value = lb_result[lb_result['metric'] == 'ljung_box_p']['value'].iloc[0]
print(f"\nInterpretation: p-value = {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject null hypothesis: autocorrelation present")
else:
    print("→ Fail to reject null hypothesis: no significant autocorrelation")
print()

### Shapiro-Wilk Test

Tests whether the residuals are normally distributed. The null hypothesis is normality, so a small p-value is evidence against it.

In [None]:
print("Shapiro-Wilk Normality Test:")
sw_result = shapiro_wilk(residuals)
print(sw_result)
p_value = sw_result[sw_result['metric'] == 'shapiro_wilk_p']['value'].iloc[0]
print(f"\nInterpretation: p-value = {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject null hypothesis: residuals are not normally distributed")
else:
    print("→ Fail to reject null hypothesis: residuals appear normally distributed")
print()

### Augmented Dickey-Fuller Test

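Tests for a unit root in a time series. The null hypothesis is that the series has a unit root (is non-stationary), so a small p-value is evidence of stationarity.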

In [None]:
print("Augmented Dickey-Fuller Test (testing stationarity of truth series):")
adf_result = adf_test(truth)
print(adf_result)
p_value = adf_result[adf_result['metric'] == 'adf_p']['value'].iloc[0]
print(f"\nInterpretation: p-value = {p_value:.4f}")
if p_value < 0.05:
    print("→ Reject null hypothesis: series is stationary")
else:
    print("→ Fail to reject null hypothesis: unit root present (non-stationary)")
print()

## 4. Classification Metrics

Let's create binary classification data and evaluate performance.

In [None]:
# Create classification data
n_samples = 200
truth_class = pd.Series(np.random.binomial(1, 0.5, n_samples))

# Generate predictions (probabilities)
# Good classifier: higher prob for class 1, lower for class 0
probs = truth_class * np.random.beta(8, 2, n_samples) + (1 - truth_class) * np.random.beta(2, 8, n_samples)
estimate_probs = pd.Series(probs)

# Convert to class labels
estimate_class = pd.Series((estimate_probs > 0.5).astype(int))

print(f"True class distribution: {truth_class.value_counts().to_dict()}")
print(f"Predicted class distribution: {estimate_class.value_counts().to_dict()}")

### Confusion Matrix Visualization

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(truth_class, estimate_class)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")

### Classification Metrics

In [None]:
# Accuracy
print("Accuracy:")
print(accuracy(truth_class, estimate_class))
print()

# Precision
print("Precision:")
print(precision(truth_class, estimate_class))
print()

# Recall
print("Recall:")
print(recall(truth_class, estimate_class))
print()

# F1-Score
print("F1-Score:")
print(f_meas(truth_class, estimate_class))
print()
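
All four metrics above follow directly from the confusion-matrix counts. The sketch below uses the standard formulas and treats class 1 as the positive class, which is an assumption about how the labels are interpreted; `classification_summary` and the counts are illustrative.

```python
def classification_summary(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

demo_summary = classification_summary(tp=40, fp=10, fn=20, tn=30)
for name, value in demo_summary.items():
    print(f"{name}: {value:.4f}")
```

F1 is the harmonic mean of precision and recall, so it sits closer to the lower of the two.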

### ROC AUC (requires probabilities)

In [None]:
print("ROC AUC:")
print(roc_auc(truth_class, estimate_probs))
print()
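
ROC AUC needs scores rather than class labels because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that probability directly by comparing every positive/negative pair; it is a reference illustration rather than py-yardstick's method, and `roc_auc_pairwise` and the sample data are hypothetical.

```python
import numpy as np

def roc_auc_pairwise(truth, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y = np.asarray(truth)
    s = np.asarray(scores, dtype=float)
    pos = s[y == 1]
    neg = s[y == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(demo_labels, demo_scores))  # 0.75: three of four pairs ranked correctly
```

The pairwise matrix uses memory proportional to the number of positives times the number of negatives, so this form is suited to small examples only.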

### ROC Curve Visualization

In [None]:
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(truth_class, estimate_probs)
roc_auc_val = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc_val:.3f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

### Classification Metric Set

In [None]:
# Create classification metric set
clf_metrics = metric_set(accuracy, precision, recall, f_meas)

# Compute all at once
clf_results = clf_metrics(truth_class, estimate_class)
print("Classification Metrics Summary:")
print(clf_results)

## 5. Integration with py-parsnip Models

Let's use py-yardstick metrics with actual model predictions from py-parsnip.

In [None]:
from py_parsnip import linear_reg
from py_rsample import initial_time_split, training, testing

# Create sample time series data
np.random.seed(42)
n = 200
dates = pd.date_range('2020-01-01', periods=n, freq='D')
sales_data = pd.DataFrame({
    'date': dates,
    'price': np.random.uniform(10, 50, n),
    'promotion': np.random.binomial(1, 0.3, n),
    'sales': 100 + np.random.uniform(0, 20, n) * np.sin(np.linspace(0, 4*np.pi, n)) + np.random.normal(0, 5, n)
})

# Add effects
sales_data['sales'] = sales_data['sales'] - 0.5 * sales_data['price'] + 10 * sales_data['promotion']

print("Sample data:")
print(sales_data.head())
print(f"\nShape: {sales_data.shape}")

In [None]:
# Train/test split
split = initial_time_split(sales_data, prop=0.75)
train_data = training(split)
test_data = testing(split)

print(f"Training samples: {len(train_data)}")
print(f"Testing samples: {len(test_data)}")

In [None]:
# Fit model
model = linear_reg().set_engine('sklearn')
fit = model.fit(train_data, 'sales ~ price + promotion')

# Get predictions
train_preds = fit.predict(train_data)
test_preds = fit.predict(test_data)

print("Training predictions:")
print(train_preds.head())
print("\nTest predictions:")
print(test_preds.head())

In [None]:
# Evaluate with yardstick metrics
train_truth = train_data['sales']
train_estimate = train_preds['.pred']

test_truth = test_data['sales']
test_estimate = test_preds['.pred']

# Define metric set
regression_metrics = metric_set(rmse, mae, mape, r_squared)

# Compute metrics for both train and test
train_metrics = regression_metrics(train_truth, train_estimate)
train_metrics['split'] = 'train'

test_metrics = regression_metrics(test_truth, test_estimate)
test_metrics['split'] = 'test'

# Combine
all_metrics = pd.concat([train_metrics, test_metrics], ignore_index=True)

print("\nModel Performance Metrics:")
print(all_metrics.pivot(index='metric', columns='split', values='value'))

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training
axes[0].scatter(train_truth, train_estimate, alpha=0.5)
axes[0].plot([train_truth.min(), train_truth.max()], 
             [train_truth.min(), train_truth.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Sales')
axes[0].set_ylabel('Predicted Sales')
axes[0].set_title('Training Set Predictions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Testing
axes[1].scatter(test_truth, test_estimate, alpha=0.5, color='orange')
axes[1].plot([test_truth.min(), test_truth.max()], 
             [test_truth.min(), test_truth.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Sales')
axes[1].set_ylabel('Predicted Sales')
axes[1].set_title('Test Set Predictions')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Comparing Multiple Models

Use py-yardstick to compare different model configurations on the same test set.

In [None]:
# Fit multiple models
model1 = linear_reg().set_engine('sklearn')  # OLS
model2 = linear_reg(penalty=0.1).set_engine('sklearn')  # Ridge
model3 = linear_reg(penalty=1.0).set_engine('sklearn')  # Ridge with more regularization

fit1 = model1.fit(train_data, 'sales ~ price + promotion')
fit2 = model2.fit(train_data, 'sales ~ price + promotion')
fit3 = model3.fit(train_data, 'sales ~ price + promotion')

# Get test predictions
pred1 = fit1.predict(test_data)['.pred']
pred2 = fit2.predict(test_data)['.pred']
pred3 = fit3.predict(test_data)['.pred']

# Compute metrics
metrics1 = regression_metrics(test_truth, pred1)
metrics1['model'] = 'OLS'

metrics2 = regression_metrics(test_truth, pred2)
metrics2['model'] = 'Ridge (0.1)'

metrics3 = regression_metrics(test_truth, pred3)
metrics3['model'] = 'Ridge (1.0)'

# Combine and pivot
comparison = pd.concat([metrics1, metrics2, metrics3], ignore_index=True)
comparison_wide = comparison.pivot(index='metric', columns='model', values='value')

print("\nModel Comparison:")
print(comparison_wide)

In [None]:
# Visualize comparison
comparison_plot = comparison[comparison['metric'].isin(['rmse', 'mae', 'r_squared'])]

fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(comparison_plot['metric'].unique()))
width = 0.25
models = comparison_plot['model'].unique()

for i, model_name in enumerate(models):  # avoid shadowing the earlier `model` spec
    model_data = comparison_plot[comparison_plot['model'] == model_name]
    ax.bar(x + i*width, model_data['value'], width, label=model_name)

ax.set_xlabel('Metric')
ax.set_ylabel('Value')
ax.set_title('Model Performance Comparison')
ax.set_xticks(x + width)
ax.set_xticklabels(comparison_plot['metric'].unique())
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated:

1. **Time Series Metrics**: RMSE, MAE, MAPE, SMAPE, MASE, R², MDA
2. **Metric Sets**: Composing multiple metrics for batch evaluation
3. **Residual Diagnostics**: Durbin-Watson, Ljung-Box, Shapiro-Wilk, ADF tests
4. **Classification Metrics**: Accuracy, Precision, Recall, F-measure, ROC AUC
5. **Integration**: Using py-yardstick with py-parsnip models
6. **Model Comparison**: Evaluating multiple models systematically

All metrics follow a consistent API:
- Accept truth and estimate as primary arguments
- Return standardized DataFrames with `metric` and `value` columns
- Can be composed using `metric_set()` for batch evaluation
- Handle edge cases gracefully (NaN values, empty data, etc.)