ML Prediction Tutorial

Complete guide to forecasting epidemic cases using machine learning.

📋 Table of Contents

What is ML Prediction?
How It Works
Feature Engineering
Supported Models
Code Examples
Evaluating Predictions
Real-World Applications
Advanced Usage
Common Pitfalls

🧠 What is ML Prediction?

Unlike mechanistic models (SIR/SEIR), which describe transmission with differential equations, machine learning models learn patterns from historical case data and use those patterns to predict future cases.

Comparison with Mechanistic Models

Feature	SIR/SEIR Models	ML Models
Knowledge needed	Parameters (β, γ)	Historical data
Interpretability	High (mathematical)	Low (black box)
Data required	Small	Large
Adaptability	Rigid	Flexible
Best for	Understanding dynamics	Short-term forecasting
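For reference, the parameters β and γ in the table come from the standard SIR equations, written here in their usual textbook form. The notation is an assumption (S, I, R are the susceptible, infected, and recovered counts and N = S + I + R is the total population); the simulator's own pages may write it differently.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
```

An ML model has no equivalent of these equations: it only sees the case counts the system produced.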

When to Use ML

Scenario	Use ML?
Have historical case data	✅ Yes
Short-term forecasting (30 days)	✅ Yes
Need to understand mechanisms	❌ Use SIR/SEIR
Very little data (< 50 points)	❌ Use SIR/SEIR
Detecting anomalies	✅ Yes

🔧 How It Works

Pipeline Overview

Historical Data → Feature Engineering → Train Model → Evaluate → Predict Future
      ↓                  ↓                  ↓           ↓           ↓
   (cases/day)    (lag features,      (Random Forest   (R², RMSE)  (next 30 days)
                   rolling means)       or XGBoost)

Step-by-Step Process

Data Collection: Historical case counts per day
Feature Engineering: Create lag features, rolling averages
Training: Model learns patterns from past
Evaluation: Check accuracy on test data
Prediction: Forecast future cases
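The five steps above can be sketched end to end with pandas and scikit-learn alone. This is a minimal illustration under stated assumptions, not the code inside `EpidemicPredictor`: the synthetic data, the two features, and the 80/20 chronological split are all choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Data collection: synthetic daily counts (illustrative only)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "day": np.arange(1, 121),
    "cases": 10 + np.cumsum(rng.poisson(0.5, 120)),
})

# 2. Feature engineering: yesterday's cases and the mean of the previous 7 days
data["lag_1"] = data["cases"].shift(1)
data["rolling_mean_7"] = data["cases"].shift(1).rolling(7).mean()
data = data.dropna().reset_index(drop=True)

# 3. Training: split chronologically, never shuffle a time series
features = ["lag_1", "rolling_mean_7"]
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["cases"])

# 4. Evaluation on the held-out tail of the series
test_pred = model.predict(test[features])
r2 = r2_score(test["cases"], test_pred)
rmse = float(np.sqrt(mean_squared_error(test["cases"], test_pred)))
print(f"R²: {r2:.3f}, RMSE: {rmse:.3f}")

# 5. Prediction: one step ahead from the most recent observations
next_features = pd.DataFrame({
    "lag_1": [data["cases"].iloc[-1]],
    "rolling_mean_7": [data["cases"].iloc[-7:].mean()],
})
next_day = float(model.predict(next_features)[0])
print(f"Next-day forecast: {next_day:.1f}")
```

One thing this sketch makes visible: a random forest averages training targets, so it cannot forecast a value above the largest count it saw during training. On a steadily rising series like this one, the test score may therefore be poor.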

📊 Feature Engineering

What are Features?

Features are the inputs the model uses to make predictions.

Automatic Features Created

Feature	Description	Example
`lag_1`	Cases from yesterday	If today is day 10, lag_1 = day 9 cases
`lag_2`	Cases from 2 days ago	day 8 cases
`lag_3...lag_7`	Cases from 3-7 days ago	day 7-3 cases
`rolling_mean_3`	Average of last 3 days	(day 9+8+7)/3
`rolling_mean_7`	Average of last 7 days	(day 9+...+3)/7
`day_of_week`	Day of week (0-6)	Monday=0, Tuesday=1
`week`	Week number	Week 1, 2, 3...
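The feature table can be reproduced with pandas `shift` and `rolling`. The function below is a hypothetical re-implementation for illustration, not the library's own code; `make_features` is an assumed name, and it assumes day 1 falls on a Monday.

```python
import pandas as pd

def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the feature table (not the library's code)."""
    out = df.copy()
    # lag_k = cases observed k days earlier
    for k in range(1, 8):
        out[f"lag_{k}"] = out["cases"].shift(k)
    # Rolling means use only past days, so today's value never leaks in
    past = out["cases"].shift(1)
    out["rolling_mean_3"] = past.rolling(3).mean()
    out["rolling_mean_7"] = past.rolling(7).mean()
    # Assumption: day 1 is a Monday (Monday=0)
    out["day_of_week"] = (out["day"] - 1) % 7
    out["week"] = (out["day"] - 1) // 7 + 1
    # The first 7 days have incomplete lags and are dropped
    return out.dropna().reset_index(drop=True)

# Toy series where day d has 10*d cases, so the values are easy to check
demo = pd.DataFrame({"day": range(1, 15), "cases": range(10, 150, 10)})
features = make_features(demo)
row = features[features["day"] == 10].iloc[0]
print(row[["lag_1", "lag_2", "rolling_mean_3", "rolling_mean_7"]])
```

For day 10 this gives `lag_1` equal to the day 9 count and `rolling_mean_3` equal to the mean of days 9, 8, and 7, matching the examples in the table.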

Why These Features?

Pattern	Feature
Yesterday's cases predict today	`lag_1`
Weekly pattern (weekend effect)	`day_of_week`
Smoothing out noise	`rolling_mean`
Long-term trends	`lag_7`, `week`

🧠 Supported Models

Random Forest

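How it works: An ensemble of decision trees, each trained on a random bootstrap sample of the data; the forecast is the average of the individual trees' predictions.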

predictor = EpidemicPredictor(model_type='random_forest')

Pros	Cons
Handles non-linear patterns	Can overfit
No feature scaling needed	Slower on large data
Good with small datasets	Less interpretable
Handles missing values	Memory intensive

Best for: Smaller datasets (50-500 days)
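To make the "ensemble" idea concrete, the sketch below uses scikit-learn's `RandomForestRegressor` directly (an assumption: this page does not say which implementation `EpidemicPredictor` wraps) and checks that the forest's prediction is the average of its individual trees. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest prediction is the mean over trees
matches = np.allclose(manual_average, forest.predict(X[:5]))
print(f"Average of trees equals forest prediction: {matches}")
```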

XGBoost (Extreme Gradient Boosting)

How it works: Decision trees built one after another, each fitted to correct the errors left by the trees before it.

predictor = EpidemicPredictor(model_type='xgboost')

Pros	Cons
Very high accuracy	Requires more tuning
Handles missing values	Slower to train
Feature importance	More complex
Often wins competitions	Requires more data

Best for: Larger datasets (500+ days), competitions
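Boosting can be illustrated without installing XGBoost. The sketch below swaps in scikit-learn's `GradientBoostingRegressor`, which follows the same general idea but is a different implementation, and tracks the training error as trees are added. The data is synthetic and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(0, 0.2, 300)

booster = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
booster.fit(X, y)

# staged_predict yields the ensemble's prediction after 1, 2, ..., 100 trees
train_mse = [mean_squared_error(y, pred) for pred in booster.staged_predict(X)]
print(f"Training MSE after 1 tree:    {train_mse[0]:.3f}")
print(f"Training MSE after 100 trees: {train_mse[-1]:.3f}")
```

The training error keeps falling as trees are added, which is also why boosted models need tuning: the same mechanism that improves the fit can overfit if run for too long.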

💻 Code Examples

Basic ML Prediction

import pandas as pd
import numpy as np
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

# Load or create historical data
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

# Create predictor
predictor = EpidemicPredictor(model_type='random_forest')

# Train model
metrics, predictions, model = predictor.train(historical)

print(f"R²: {metrics['r2']:.3f}")
print(f"RMSE: {metrics['rmse']:.3f}")

# Predict future
future = predictor.predict_future(historical, days=30)
print(f"Future predictions: {future.values[:10]}")

Visualize Predictions

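import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor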

historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

predictor = EpidemicPredictor('random_forest')
metrics, predictions, model = predictor.train(historical, test_size=0.2)
future = predictor.predict_future(historical, days=30)

# Split historical into train/test
train_size = int(len(historical) * 0.8)
train = historical[:train_size]
test = historical[train_size:]

# Actual predictions on test set
X_test = predictor.create_features(test)
y_pred = model.predict(X_test[[col for col in X_test.columns if col != 'cases']])

# Plot
plt.figure(figsize=(12, 6))

# Historical data
plt.plot(historical['day'], historical['cases'], 'b-', label='Historical Data', linewidth=2)

# Training data
plt.plot(train['day'], train['cases'], 'g-', alpha=0.5, label='Training Data')

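# Test predictions. If create_features drops the leading rows whose lag
# values are undefined, y_pred covers the last len(y_pred) test days.
test_days = test['day'].values[-len(y_pred):]
plt.scatter(test_days, y_pred, color='red', s=50, zorder=5, label='Model Predictions')
plt.plot(test_days, y_pred, 'r--', alpha=0.7)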

# Future predictions
future_days = range(historical['day'].max() + 1, historical['day'].max() + 31)
plt.plot(future_days, future.values, 'orange', linestyle='-.', linewidth=2, label='Future Forecast')

plt.xlabel('Day', fontsize=12)
plt.ylabel('Cases', fontsize=12)
plt.title('ML Prediction - Historical vs Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Compare Models

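import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

historical = pd.DataFrame({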
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

models = ['random_forest', 'xgboost']
colors = ['blue', 'red']

plt.figure(figsize=(12, 6))

for model_type, color in zip(models, colors):
    predictor = EpidemicPredictor(model_type)
    metrics, pred, model = predictor.train(historical)
    future = predictor.predict_future(historical, days=30)
    
    plt.plot(range(101, 131), future.values, 
             label=f'{model_type} (R²={metrics["r2"]:.3f})', 
             color=color, linewidth=2)

plt.xlabel('Day', fontsize=12)
plt.ylabel('Predicted Cases', fontsize=12)
plt.title('ML Model Comparison - 30-day Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

📊 Evaluating Predictions

Key Metrics

Metric	Formula	Interpretation	Good Value
R²	1 - (SS_res/SS_tot)	Proportion of variance explained	> 0.8
RMSE	√(MSE)	Typical prediction error, in the same units as the case counts	Lower is better

R² Interpretation

R² Value	Meaning
0.9 - 1.0	Excellent fit
0.7 - 0.9	Good fit
0.5 - 0.7	Moderate fit
0.0 - 0.5	Poor fit
< 0	Worse than always predicting the mean
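Both metrics are short enough to compute by hand, which helps when reading the table above. The sketch below implements the two formulas with NumPy on made-up numbers; `r2_and_rmse` is a name chosen for this example, not part of the library.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_and_rmse(y_true, y_pred):
    """Compute R² = 1 - SS_res/SS_tot and RMSE = sqrt(MSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(r2), float(rmse)

actual = [10, 12, 15, 19, 24]             # made-up case counts
close_forecast = [11, 12, 14, 20, 23]     # off by at most one case
mean_forecast = [16, 16, 16, 16, 16]      # always predicts the mean of actual
reversed_forecast = actual[::-1]          # trend in the wrong direction

for name, forecast in [("close", close_forecast),
                       ("mean only", mean_forecast),
                       ("reversed", reversed_forecast)]:
    r2, rmse = r2_and_rmse(actual, forecast)
    print(f"{name:10s} R²={r2:7.3f}  RMSE={rmse:6.3f}")
```

Predicting the mean every day gives an R² of exactly zero, and a forecast that gets the trend backwards gives a negative R², which is the last row of the table.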

Cross-Validation

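from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

# `historical` is the DataFrame built in the Basic ML Prediction example
predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Case counts are ordered in time, so use splits that always train on the
# past and test on the future; plain k-fold (cv=5) would also train on
# days that come after the test period
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(predictor.model, X, y, cv=cv, scoring='r2')
print(f"Cross-validation R² scores: {scores}")
print(f"Mean R²: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")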
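Because daily case counts are ordered in time, it matters which days end up in each fold. The sketch below prints the folds produced by scikit-learn's `TimeSeriesSplit` on 20 consecutive days, to show that every training fold ends before its test fold begins. The 20-day length and 4 splits are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 21)  # 20 consecutive days
splitter = TimeSeriesSplit(n_splits=4)

folds = []
for train_idx, test_idx in splitter.split(days):
    train_days, test_days = days[train_idx], days[test_idx]
    folds.append((train_days, test_days))
    print(f"train: days {train_days[0]}-{train_days[-1]}, "
          f"test: days {test_days[0]}-{test_days[-1]}")
```

Each fold trains on a longer prefix of the series than the one before, so the scores also hint at how the model behaves as more history becomes available.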

🌍 Real-World Applications

COVID-19 Forecasting

ML models were widely used during COVID-19:

Short-term hospital admissions - Plan bed capacity
Regional outbreaks - Identify emerging hotspots
Vaccination impact - Predict effect of vaccine campaigns

Seasonal Flu Prediction

Application	Description
Peak timing	When will flu season peak?
Severity	How many cases expected?
Resource allocation	Where to send vaccines?

Public Health Planning

Use Case	Benefit
Staff scheduling	Prepare for peak demand
Supply chain	Order enough tests/vaccines
Lockdown timing	Optimal intervention timing

🎯 Advanced Usage

Feature Importance

import matplotlib.pyplot as plt
import numpy as np

# `historical` is the DataFrame built in the Basic ML Prediction example
predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)
predictor.train(historical)

# Get feature importance from the fitted model
importances = predictor.model.feature_importances_
feature_names = list(X.columns)

# Sort by importance
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature', fontsize=12)
plt.ylabel('Importance', fontsize=12)
plt.title('ML Model - Feature Importance', fontsize=14)
plt.tight_layout()
plt.show()

Rolling Forecast

def rolling_forecast(historical, days_to_forecast=30, window_size=60):
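    """Recursive forecast: retrain on the last `window_size` days at each step
    and feed each prediction back in as if it were an observation, so errors
    can compound over the horizon."""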
    forecasts = []
    
    for i in range(days_to_forecast):
        # Use last 'window_size' days for training
        train_data = historical.iloc[-window_size:]
        
        predictor = EpidemicPredictor('random_forest')
        metrics, _, _ = predictor.train(train_data)
        
        # Predict next day
        next_day = predictor.predict_future(train_data, days=1)
        forecasts.append(next_day.iloc[0])
        
        # Append prediction to historical (for next iteration)
        new_row = pd.DataFrame({
            'day': [historical['day'].max() + 1],
            'cases': [next_day.iloc[0]]
        })
        historical = pd.concat([historical, new_row], ignore_index=True)
    
    return forecasts

# Example usage
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

forecasts = rolling_forecast(historical, days_to_forecast=30)
print(f"Rolling forecasts: {forecasts[:10]}")

Hyperparameter Tuning

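from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# `historical` is the DataFrame built in the Basic ML Prediction example
predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search with time-ordered folds, so each candidate is scored on
# days that come after the days it was trained on
cv = TimeSeriesSplit(n_splits=5)
grid_search = GridSearchCV(predictor.model, param_grid, cv=cv, scoring='r2')
grid_search.fit(X, y)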

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best R²: {grid_search.best_score_:.3f}")

⚠️ Common Pitfalls

1. Not Enough Data

Data Points	Viability
< 30	❌ Not enough for ML
30-100	⚠️ Risky, use simple model
100-500	✅ Acceptable
500+	✅ Good

2. Overfitting

Signs of overfitting:

Training R² high (>0.95), testing R² low (<0.7)
Model performs poorly on new data

Solutions:

Reduce model complexity
Add more data
Use cross-validation
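The train/test gap described above is easy to reproduce. The sketch below fits an unrestricted decision tree and a depth-limited one to the same noisy synthetic data with scikit-learn and prints both scores for each; the data, noise level, and depth limit are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data: a sine curve plus Gaussian noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, 200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

results = {}
for name, depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (
        r2_score(y_train, tree.predict(X_train)),  # training R²
        r2_score(y_test, tree.predict(X_test)),    # testing R²
    )
    print(f"{name:13s} train R²={results[name][0]:.3f}  "
          f"test R²={results[name][1]:.3f}")
```

The unrestricted tree memorizes the training points, noise included, so its training score is close to perfect while its test score falls well short. Limiting the depth is one form of "reduce model complexity".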

3. Ignoring Seasonality

If your data has weekly patterns (for example, fewer cases reported at weekends), make sure the `day_of_week` feature is being used.
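One way to check for a weekly pattern before relying on `day_of_week` is to compare per-weekday averages and the autocorrelation at a 7-day lag. The sketch below does both with pandas on synthetic data that has a built-in weekend dip; the dip size, the noise, and the assumption that day 1 is a Monday are all made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
days = np.arange(1, 141)                 # 20 full weeks
day_of_week = (days - 1) % 7             # assumes day 1 is a Monday
weekend_dip = np.where(day_of_week >= 5, -30, 0)
cases = pd.Series(100 + weekend_dip + rng.normal(0, 3, len(days)))

# Average cases per weekday: weekends should stand out
by_weekday = cases.groupby(day_of_week).mean()
print(by_weekday.round(1))

# A weekly pattern shows up as high autocorrelation at lag 7
lag7 = cases.autocorr(lag=7)
lag3 = cases.autocorr(lag=3)
print(f"autocorrelation at lag 7: {lag7:.2f}, at lag 3: {lag3:.2f}")
```

If the lag 7 autocorrelation is clearly higher than at neighbouring lags, the series has a weekly rhythm that lag features alone may only partly capture.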

4. Predictions Going Negative

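Some regression models (gradient boosting, for example) can produce negative forecasts, because nothing constrains their output to valid case counts. Clip the result at zero: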

future = predictor.predict_future(historical, days=30)
future = future.clip(lower=0)  # Ensure non-negative

📚 Key Takeaways

Concept	Summary
ML Purpose	Short-term forecasting from historical data
Key Features	Lags, rolling means, day of week
Supported Models	Random Forest, XGBoost
Evaluation	R² (goodness), RMSE (error magnitude)
Best for	100+ days of historical data
Warning	Not mechanistic, black box
