ML Prediction Tutorial

milad edited this page May 10, 2026 · 1 revision

Complete guide to forecasting epidemic cases using machine learning.



🧠 What is ML Prediction?

Unlike mechanistic models (SIR/SEIR), which describe transmission with differential equations, machine learning (ML) models learn patterns from historical data and use them to predict future cases.

Comparison with Mechanistic Models

| Feature | SIR/SEIR Models | ML Models |
|---|---|---|
| Knowledge needed | Parameters (β, γ) | Historical data |
| Interpretability | High (mathematical) | Low (black box) |
| Data required | Small | Large |
| Adaptability | Rigid | Flexible |
| Best for | Understanding dynamics | Short-term forecasting |
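
For reference, the mechanistic side of this comparison is the standard SIR system, where β is the transmission rate, γ is the recovery rate, and N = S + I + R is the total population:

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \gamma I, \\
\frac{dR}{dt} &= \gamma I.
\end{aligned}
```

An ML model does not estimate β or γ. It maps recent case counts directly to the next value, which is why it needs more data and is harder to interpret.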

When to Use ML

| Scenario | Use ML? |
|---|---|
| Have historical case data | ✅ Yes |
| Short-term forecasting (30 days) | ✅ Yes |
| Need to understand mechanisms | ❌ Use SIR/SEIR |
| Very little data (< 50 points) | ❌ Use SIR/SEIR |
| Detecting anomalies | ✅ Yes |

🔧 How It Works

Pipeline Overview

Historical Data → Feature Engineering → Train Model → Evaluate → Predict Future
      ↓                  ↓                  ↓           ↓           ↓
   (cases/day)    (lag features,      (Random Forest   (R², RMSE)   (next 30 days)
                   rolling means)       or XGBoost)

Step-by-Step Process

  1. Data Collection: Historical case counts per day
  2. Feature Engineering: Create lag features, rolling averages
  3. Training: The model learns patterns from past data
  4. Evaluation: Check accuracy on held-out test data
  5. Prediction: Forecast future cases
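
The five steps above can be sketched end to end with pandas and scikit-learn. This is a minimal illustration on synthetic data, not the internals of EpidemicPredictor: the two lag features, the 80/20 chronological split, and the wave-shaped case curve are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Data collection: synthetic daily case counts (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(1, 121)
data = pd.DataFrame({
    "day": t,
    "cases": rng.poisson(50 + 30 * np.sin(2 * np.pi * t / 30)),
})

# 2. Feature engineering: yesterday's cases and the cases one week ago
data["lag_1"] = data["cases"].shift(1)
data["lag_7"] = data["cases"].shift(7)
data = data.dropna().reset_index(drop=True)

# 3. Training: fit on the earliest 80% of days (chronological split)
features = ["lag_1", "lag_7"]
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["cases"])

# 4. Evaluation: score on the most recent 20% of days
pred = model.predict(test[features])
r2 = r2_score(test["cases"], pred)
rmse = float(np.sqrt(mean_squared_error(test["cases"], pred)))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")

# 5. Prediction: one-step-ahead forecast for the day after the last observation
next_features = pd.DataFrame({
    "lag_1": [data["cases"].iloc[-1]],
    "lag_7": [data["cases"].iloc[-7]],
})
next_day = float(model.predict(next_features)[0])
print(f"Forecast for day {data['day'].iloc[-1] + 1}: {next_day:.1f}")
```

Forecasting further ahead than one day means feeding each prediction back in as the next lag value, so errors can accumulate over the horizon.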

📊 Feature Engineering

What are Features?

Features are the inputs the model uses to make predictions.

Automatic Features Created

| Feature | Description | Example |
|---|---|---|
| lag_1 | Cases from yesterday | If today is day 10, lag_1 = day 9 cases |
| lag_2 | Cases from 2 days ago | Day 8 cases |
| lag_3...lag_7 | Cases from 3-7 days ago | Day 7 to day 3 cases |
| rolling_mean_3 | Average of last 3 days | (day 9 + day 8 + day 7) / 3 |
| rolling_mean_7 | Average of last 7 days | (day 9 + ... + day 3) / 7 |
| day_of_week | Day of week (0-6) | Monday=0, Tuesday=1 |
| week | Week number | Week 1, 2, 3... |
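
As a rough sketch of how features like these could be built with pandas: the function name make_features is hypothetical, and the rule that day 1 falls on a Monday is an assumption, since the package's own feature code is not shown here. The rolling means are computed on the series shifted by one day so that they only use past values, matching the examples in the table.

```python
import pandas as pd

def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, rolling-mean and calendar features to a frame with 'day' and 'cases'."""
    out = df.copy()
    for k in range(1, 8):
        out[f"lag_{k}"] = out["cases"].shift(k)
    # Shift by one day first so the averages use only past days
    past = out["cases"].shift(1)
    out["rolling_mean_3"] = past.rolling(3).mean()
    out["rolling_mean_7"] = past.rolling(7).mean()
    # Assumption: day 1 falls on a Monday, so day_of_week is (day - 1) mod 7
    out["day_of_week"] = (out["day"] - 1) % 7
    out["week"] = (out["day"] - 1) // 7 + 1
    # The first 7 rows have incomplete history and are dropped
    return out.dropna().reset_index(drop=True)

# Toy series: 10 cases on day 1, 20 on day 2, ..., 140 on day 14
toy = pd.DataFrame({"day": range(1, 15), "cases": range(10, 150, 10)})
features = make_features(toy)
print(features.loc[features["day"] == 10,
                   ["lag_1", "lag_2", "rolling_mean_3", "rolling_mean_7"]])
```

For day 10 in this toy series, lag_1 is the day 9 count (90) and rolling_mean_3 is the mean of days 7 to 9 (80).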

Why These Features?

| Pattern | Feature |
|---|---|
| Yesterday's cases predict today | lag_1 |
| Weekly pattern (weekend effect) | day_of_week |
| Smoothing out noise | rolling_mean |
| Long-term trends | lag_7, week |

🧠 Supported Models

Random Forest

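How it works: An ensemble of decision trees, each trained on a random resample of the data. The forecast is the average of the individual trees' predictions.

predictor = EpidemicPredictor(model_type='random_forest')

| Pros | Cons |
|---|---|
| Handles non-linear patterns | Can overfit |
| No feature scaling needed | Slower on large data |
| Good with small datasets | Less interpretable |
| Handles missing values | Memory intensive |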

Best for: Smaller datasets (50-500 days)
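
To make the "ensemble" idea concrete, the sketch below uses scikit-learn's RandomForestRegressor directly (an assumption: whether EpidemicPredictor wraps this class is not stated here) and checks that the forest's output is the mean of its trees' outputs. The three input columns are random stand-ins for lag features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))           # stand-ins for lag features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Each tree makes its own prediction; the forest reports their mean
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
tree_average = per_tree.mean(axis=0)
forest_output = forest.predict(X[:5])

print(tree_average)
print(forest_output)
```

Averaging many trees smooths out the errors of any single tree, which is why a forest is usually more stable than one deep decision tree.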

XGBoost (Extreme Gradient Boosting)

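How it works: Gradient-boosted decision trees, built one after another so that each new tree corrects the errors of the trees before it.

predictor = EpidemicPredictor(model_type='xgboost')

| Pros | Cons |
|---|---|
| Often very accurate | Requires more tuning |
| Handles missing values | Slower to train |
| Feature importance | More complex |
| Often wins competitions | Requires more data |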

Best for: Larger datasets (500+ days), competitions
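
The core idea of boosting can be shown in a few lines. The sketch below is a simplified gradient-boosting loop for squared error built from scikit-learn decision trees. It is not XGBoost's actual algorithm, which adds regularization and other refinements, and the learning rate, tree depth, and round count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 20 + 50 + rng.normal(0, 2, size=300)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())      # start from a constant guess
errors = []

for _ in range(50):
    residual = y - prediction               # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # small corrective step
    errors.append(float(np.sqrt(np.mean((y - prediction) ** 2))))

print(f"Training RMSE after 1 round: {errors[0]:.2f}, after 50 rounds: {errors[-1]:.2f}")
```

Each round fits a small tree to the current residuals, so the training error keeps shrinking. That is also why boosted models need tuning: with too many rounds they start fitting noise.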


💻 Code Examples

Basic ML Prediction

import pandas as pd
import numpy as np
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

# Load or create historical data
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

# Create predictor
predictor = EpidemicPredictor(model_type='random_forest')

# Train model
metrics, predictions, model = predictor.train(historical)

print(f"R²: {metrics['r2']:.3f}")
print(f"RMSE: {metrics['rmse']:.3f}")

# Predict future
future = predictor.predict_future(historical, days=30)
print(f"Future predictions: {future.values[:10]}")

Visualize Predictions

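import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor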

historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

predictor = EpidemicPredictor('random_forest')
metrics, predictions, model = predictor.train(historical, test_size=0.2)
future = predictor.predict_future(historical, days=30)

# Split historical into train/test
train_size = int(len(historical) * 0.8)
train = historical[:train_size]
test = historical[train_size:]

# Actual predictions on test set
X_test = predictor.create_features(test)
y_pred = model.predict(X_test[[col for col in X_test.columns if col != 'cases']])

# Plot
plt.figure(figsize=(12, 6))

# Historical data
plt.plot(historical['day'], historical['cases'], 'b-', label='Historical Data', linewidth=2)

# Training data
plt.plot(train['day'], train['cases'], 'g-', alpha=0.5, label='Training Data')

# Test predictions
test_days = test['day'].values
plt.scatter(test_days[:len(y_pred)], y_pred, color='red', s=50, zorder=5, label='Model Predictions')
plt.plot(test_days[:len(y_pred)], y_pred, 'r--', alpha=0.7)

# Future predictions
future_days = range(historical['day'].max() + 1, historical['day'].max() + 31)
plt.plot(future_days, future.values, 'orange', linestyle='-.', linewidth=2, label='Future Forecast')

plt.xlabel('Day', fontsize=12)
plt.ylabel('Cases', fontsize=12)
plt.title('ML Prediction - Historical vs Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Compare Models

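import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

historical = pd.DataFrame({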
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

models = ['random_forest', 'xgboost']
colors = ['blue', 'red']

plt.figure(figsize=(12, 6))

for model_type, color in zip(models, colors):
    predictor = EpidemicPredictor(model_type)
    metrics, pred, model = predictor.train(historical)
    future = predictor.predict_future(historical, days=30)
    
    plt.plot(range(101, 131), future.values, 
             label=f'{model_type} (R²={metrics["r2"]:.3f})', 
             color=color, linewidth=2)

plt.xlabel('Day', fontsize=12)
plt.ylabel('Predicted Cases', fontsize=12)
plt.title('ML Model Comparison - 30-day Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

📊 Evaluating Predictions

Key Metrics

| Metric | Formula | Interpretation | Good Value |
|---|---|---|---|
| R² | 1 - (SS_res/SS_tot) | Proportion of variance explained | > 0.8 |
| RMSE | √(MSE) | Typical prediction error, in the same units as the case counts | Lower is better |

R² Interpretation

| R² Value | Meaning |
|---|---|
| 0.9 - 1.0 | Excellent fit |
| 0.7 - 0.9 | Good fit |
| 0.5 - 0.7 | Moderate fit |
| 0.0 - 0.5 | Poor fit |
| < 0 | Worse than always predicting the mean |
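
Both metrics are easy to compute by hand, which helps when reading the values a model reports. The sketch below implements the two formulas with NumPy on a made-up five-day series; the helper name r2_and_rmse and the three sets of predictions are hypothetical.

```python
import numpy as np

def r2_and_rmse(actual, predicted):
    """Return (R², RMSE) computed directly from their definitions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
    rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
    return float(1 - ss_res / ss_tot), rmse

actual = [10, 12, 15, 19, 24]
close = [11, 12, 14, 20, 23]        # off by at most one case
flat = [16, 16, 16, 16, 16]         # always predicts the mean of `actual`
reversed_trend = actual[::-1]       # gets the trend backwards

results = {}
for name, pred in [("close", close), ("mean", flat), ("reversed", reversed_trend)]:
    results[name] = r2_and_rmse(actual, pred)
    print(f"{name}: R2={results[name][0]:.3f}, RMSE={results[name][1]:.3f}")
```

Predicting the mean every day gives an R² of exactly 0, and the reversed forecast gives a negative R², which is the case covered by the last row of the table.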

Cross-Validation

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Case counts are ordered in time, so use forward-chaining folds:
# each fold trains on earlier days and validates on later ones.
# This assumes the rows of X are in chronological order.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(predictor.model, X, y, cv=cv, scoring='r2')
print(f"Cross-validation R² scores: {scores}")
print(f"Mean R²: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
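
Because daily case counts are ordered in time, cross-validation folds are usually built so that every validation day comes after the days used for training. A minimal sketch of such folds with scikit-learn's TimeSeriesSplit, using an illustrative array of 100 day numbers:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 101)            # 100 days in chronological order
splitter = TimeSeriesSplit(n_splits=5)
folds = list(splitter.split(days))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold}: train days {days[train_idx][0]}-{days[train_idx][-1]}, "
        f"validate days {days[val_idx][0]}-{days[val_idx][-1]}"
    )
```

Shuffled or plain K-fold splits would let the model train on days that come after the ones it is scored on, which tends to make the scores look better than real forecasting performance.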

🌍 Real-World Applications

COVID-19 Forecasting

ML models were widely used during COVID-19:

  • Short-term hospital admissions - Plan bed capacity
  • Regional outbreaks - Identify emerging hotspots
  • Vaccination impact - Predict effect of vaccine campaigns

Seasonal Flu Prediction

| Application | Description |
|---|---|
| Peak timing | When will flu season peak? |
| Severity | How many cases expected? |
| Resource allocation | Where to send vaccines? |

Public Health Planning

| Use Case | Benefit |
|---|---|
| Staff scheduling | Prepare for peak demand |
| Supply chain | Order enough tests/vaccines |
| Lockdown timing | Optimal intervention timing |

🎯 Advanced Usage

Feature Importance

import matplotlib.pyplot as plt
import numpy as np

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)
predictor.train(historical)

# Get feature importance
importances = predictor.model.feature_importances_
feature_names = [col for col in X.columns]

# Sort by importance
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature', fontsize=12)
plt.ylabel('Importance', fontsize=12)
plt.title('ML Model - Feature Importance', fontsize=14)
plt.tight_layout()
plt.show()

Rolling Forecast

def rolling_forecast(historical, days_to_forecast=30, window_size=60):
    """Recursive forecast: retrain on a sliding window and feed each prediction back in as the next day's data."""
    forecasts = []
    
    for i in range(days_to_forecast):
        # Use last 'window_size' days for training
        train_data = historical.iloc[-window_size:]
        
        predictor = EpidemicPredictor('random_forest')
        metrics, _, _ = predictor.train(train_data)
        
        # Predict next day
        next_day = predictor.predict_future(train_data, days=1)
        forecasts.append(next_day.iloc[0])
        
        # Append prediction to historical (for next iteration)
        new_row = pd.DataFrame({
            'day': [historical['day'].max() + 1],
            'cases': [next_day.iloc[0]]
        })
        historical = pd.concat([historical, new_row], ignore_index=True)
    
    return forecasts

# Example usage
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

forecasts = rolling_forecast(historical, days_to_forecast=30)
print(f"Rolling forecasts: {forecasts[:10]}")

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search with forward-chaining folds, so each candidate is
# validated on days that come after its training days
# (assumes the rows of X are in chronological order)
cv = TimeSeriesSplit(n_splits=5)
grid_search = GridSearchCV(predictor.model, param_grid, cv=cv, scoring='r2')
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best R²: {grid_search.best_score_:.3f}")

⚠️ Common Pitfalls

1. Not Enough Data

| Data Points | Viability |
|---|---|
| < 30 | ❌ Not enough for ML |
| 30-100 | ⚠️ Risky, use simple model |
| 100-500 | ✅ Acceptable |
| 500+ | ✅ Good |

2. Overfitting

Signs of overfitting:

  • Training R² high (>0.95), testing R² low (<0.7)
  • Model performs poorly on new data

Solutions:

  • Reduce model complexity
  • Add more data
  • Use cross-validation
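
The train/test gap described above is easy to reproduce. The sketch below fits two scikit-learn decision trees to noisy synthetic data: one allowed to grow without limit and one capped at depth 3. The data, the 150/50 split, and the depth values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 5 * X[:, 0] + rng.normal(0, 8, size=200)     # linear signal plus heavy noise

# Hold out the last 50 rows for testing
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

results = {}
for depth in (None, 3):     # None lets the tree grow until it memorizes the training set
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_r2 = r2_score(y_train, tree.predict(X_train))
    test_r2 = r2_score(y_test, tree.predict(X_test))
    results[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

The unrestricted tree reproduces its training data almost perfectly but scores lower on the held-out rows. Limiting depth is one way to reduce model complexity, and a simpler model typically shows a smaller gap between the two scores.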

3. Ignoring Seasonality

If your data has weekly patterns, make sure the day_of_week feature is included in the model's inputs.
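
A quick way to check for a weekly pattern before training is to compare average cases by day of week. The sketch below does this with pandas on synthetic data that has a built-in weekend dip; the 100 vs. 60 case levels and the rule that day 1 is a Monday are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = np.arange(1, 85)                       # 12 full weeks
day_of_week = (days - 1) % 7                  # assumption: day 1 is a Monday
weekend = day_of_week >= 5
# Synthetic data: about 100 cases on weekdays, about 60 on weekends
cases = rng.poisson(np.where(weekend, 60, 100))

df = pd.DataFrame({"day": days, "day_of_week": day_of_week, "cases": cases})
by_weekday = df.groupby("day_of_week")["cases"].mean()
print(by_weekday.round(1))

# A ratio far from 1 suggests a weekly reporting pattern worth modelling
ratio = by_weekday.loc[5:6].mean() / by_weekday.loc[0:4].mean()
print(f"Weekend / weekday ratio: {ratio:.2f}")
```

If the ratio is close to 1, the series has little weekly structure and day_of_week is unlikely to add much.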

4. Predictions Going Negative

ML models do not know that case counts cannot be negative, so clip the forecast at zero:

future = predictor.predict_future(historical, days=30)
future = future.clip(lower=0)  # Ensure non-negative

📚 Key Takeaways

| Concept | Summary |
|---|---|
| ML Purpose | Short-term forecasting from historical data |
| Key Features | Lags, rolling means, day of week |
| Supported Models | Random Forest, XGBoost |
| Evaluation | R² (goodness of fit), RMSE (error magnitude) |
| Best for | 100+ days of historical data |
| Warning | Not mechanistic, black box |
