ML Prediction Tutorial
milad edited this page May 10, 2026 · 1 revision
Complete guide to forecasting epidemic cases using machine learning.
- What is ML Prediction?
- How It Works
- Feature Engineering
- Supported Models
- Code Examples
- Evaluating Predictions
- Real-World Applications
- Advanced Usage
- Common Pitfalls
Unlike mechanistic models (SIR/SEIR) that use differential equations, Machine Learning learns patterns from historical data to predict future cases.
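For reference, the differential equations behind the standard SIR model are shown below, where β is the transmission rate, γ is the recovery rate, and N = S + I + R is the (constant) population size. This is the textbook formulation and is not necessarily the exact variant implemented in the simulator.

```math
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N} \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
```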
| Feature | SIR/SEIR Models | ML Models |
|---|---|---|
| Knowledge needed | Parameters (β, γ) | Historical data |
| Interpretability | High (mathematical) | Low (black box) |
| Data required | Small | Large |
| Adaptability | Rigid | Flexible |
| Best for | Understanding dynamics | Short-term forecasting |

| Scenario | Use ML? |
|---|---|
| Have historical case data | Yes |
| Short-term forecasting (30 days) | Yes |
| Need to understand mechanisms | No, use SIR/SEIR |
| Very little data (< 50 points) | No, use SIR/SEIR |
| Detecting anomalies | Yes |
Historical Data (cases/day) → Feature Engineering (lag features, rolling means) → Train Model (Random Forest or XGBoost) → Evaluate (R², RMSE) → Predict Future (next 30 days)
- Data Collection: Historical case counts per day
- Feature Engineering: Create lag features, rolling averages
- Training: Model learns patterns from past
- Evaluation: Check accuracy on test data
- Prediction: Forecast future cases
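The five steps above can be sketched end to end with scikit-learn directly. This is a minimal illustration, not the internals of `EpidemicPredictor`: the helper `make_lag_matrix`, the choice of seven lags, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

N_LAGS = 7  # assumption: use the previous 7 days as features


def make_lag_matrix(cases, n_lags=N_LAGS):
    """Row i holds the n_lags values preceding the target y[i]."""
    X = np.array([cases[t - n_lags:t] for t in range(n_lags, len(cases))])
    y = cases[n_lags:]
    return X, y


# 1. Data collection: synthetic daily counts standing in for real data
rng = np.random.default_rng(0)
cases = 10.0 + np.cumsum(rng.poisson(0.5, 100))

# 2. Feature engineering: lag features only, to keep the sketch short
X, y = make_lag_matrix(cases)

# 3. Training on the first 80% of rows (chronological split, no shuffling)
split = int(len(X) * 0.8)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])

# 4. Evaluation on the held-out last 20%
y_pred = model.predict(X[split:])
r2 = r2_score(y[split:], y_pred)
rmse = float(np.sqrt(mean_squared_error(y[split:], y_pred)))
print(f"R²: {r2:.3f}, RMSE: {rmse:.3f}")

# 5. Prediction: forecast recursively, feeding each prediction back in as a lag
window = list(cases[-N_LAGS:])
forecast = []
for _ in range(30):
    next_value = float(model.predict(np.array([window[-N_LAGS:]]))[0])
    forecast.append(next_value)
    window.append(next_value)
print(forecast[:5])
```

On a steadily growing series like this synthetic one, expect a poor test R²: a tree ensemble averages training targets, so it cannot predict a value above the largest target it saw during training.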
Features are the inputs the model uses to make predictions.
| Feature | Description | Example |
|---|---|---|
| `lag_1` | Cases from yesterday | If today is day 10, lag_1 = day 9 cases |
| `lag_2` | Cases from 2 days ago | day 8 cases |
| `lag_3`...`lag_7` | Cases from 3-7 days ago | day 7-3 cases |
| `rolling_mean_3` | Average of last 3 days | (day 9+8+7)/3 |
| `rolling_mean_7` | Average of last 7 days | (day 9+...+3)/7 |
| `day_of_week` | Day of week (0-6) | Monday=0, Tuesday=1 |
| `week` | Week number | Week 1, 2, 3... |

| Pattern | Feature |
|---|---|
| Yesterday's cases predict today | `lag_1` |
| Weekly pattern (weekend effect) | `day_of_week` |
| Smoothing out noise | `rolling_mean` |
| Long-term trends | `lag_7`, `week` |
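The features in the tables above could be built with pandas roughly as follows. The function `build_features` is a hypothetical helper written for this example, not part of the library, and it assumes that day 1 falls on a Monday because the tutorial's data carries only a day index and no dates.

```python
import pandas as pd


def build_features(df):
    """Return a copy of df with lag, rolling-mean, and calendar features.

    Assumes df has an integer 'day' column (1-based) and a 'cases' column,
    with one row per day in chronological order.
    """
    out = df.copy()
    # lag_k: cases observed k days before the current row
    for k in range(1, 8):
        out[f"lag_{k}"] = out["cases"].shift(k)
    # Shift before rolling so the mean uses past days only (no leakage of today)
    past = out["cases"].shift(1)
    out["rolling_mean_3"] = past.rolling(3).mean()
    out["rolling_mean_7"] = past.rolling(7).mean()
    # Assumption: day 1 is a Monday, so Monday=0 ... Sunday=6
    out["day_of_week"] = (out["day"] - 1) % 7
    out["week"] = (out["day"] - 1) // 7 + 1
    # The first 7 rows lack a full history and contain NaN
    return out.dropna().reset_index(drop=True)


df = pd.DataFrame({"day": range(1, 15), "cases": range(10, 150, 10)})
features = build_features(df)
print(features[["day", "cases", "lag_1", "rolling_mean_3", "day_of_week"]])
```

With this toy series (10 cases on day 1, 20 on day 2, and so on), the row for day 10 has `lag_1` equal to the day 9 count and `rolling_mean_3` equal to the average of days 9, 8, and 7, matching the examples in the table.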
How it works: Ensemble of decision trees.
```python
predictor = EpidemicPredictor(model_type='random_forest')
```

| Pros | Cons |
|---|---|
| Handles non-linear patterns | Can overfit |
| No feature scaling needed | Slower on large data |
| Good with small datasets | Less interpretable |
| Handles missing values | Memory intensive |
Best for: Smaller datasets (50-500 days)
How it works: Boosted decision trees.
```python
predictor = EpidemicPredictor(model_type='xgboost')
```

| Pros | Cons |
|---|---|
| Very high accuracy | Requires more tuning |
| Handles missing values | Slower to train |
| Feature importance | More complex |
| Often wins competitions | Requires more data |
Best for: Larger datasets (500+ days), competitions
```python
import pandas as pd
import numpy as np
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

# Load or create historical data (synthetic cumulative counts here)
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})

# Create predictor
predictor = EpidemicPredictor(model_type='random_forest')

# Train model
metrics, predictions, model = predictor.train(historical)
print(f"R²: {metrics['r2']:.3f}")
print(f"RMSE: {metrics['rmse']:.3f}")

# Predict future
future = predictor.predict_future(historical, days=30)
print(f"Future predictions: {future.values[:10]}")
```

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})
predictor = EpidemicPredictor('random_forest')
metrics, predictions, model = predictor.train(historical, test_size=0.2)
future = predictor.predict_future(historical, days=30)

# Split historical into train/test (chronological 80/20 split)
train_size = int(len(historical) * 0.8)
train = historical[:train_size]
test = historical[train_size:]

# Actual predictions on test set (every column except the target is a feature)
X_test = predictor.create_features(test)
y_pred = model.predict(X_test[[col for col in X_test.columns if col != 'cases']])

# Plot
plt.figure(figsize=(12, 6))

# Historical data
plt.plot(historical['day'], historical['cases'], 'b-', label='Historical Data', linewidth=2)

# Training data
plt.plot(train['day'], train['cases'], 'g-', alpha=0.5, label='Training Data')

# Test predictions
test_days = test['day'].values
plt.scatter(test_days[:len(y_pred)], y_pred, color='red', s=50, zorder=5, label='Model Predictions')
plt.plot(test_days[:len(y_pred)], y_pred, 'r--', alpha=0.7)

# Future predictions
future_days = range(historical['day'].max() + 1, historical['day'].max() + 31)
plt.plot(future_days, future.values, 'orange', linestyle='-.', linewidth=2, label='Future Forecast')

plt.xlabel('Day', fontsize=12)
plt.ylabel('Cases', fontsize=12)
plt.title('ML Prediction - Historical vs Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})
models = ['random_forest', 'xgboost']
colors = ['blue', 'red']

plt.figure(figsize=(12, 6))
# Train each model on the same data and overlay its 30-day forecast
for model_type, color in zip(models, colors):
    predictor = EpidemicPredictor(model_type)
    metrics, pred, model = predictor.train(historical)
    future = predictor.predict_future(historical, days=30)
    plt.plot(range(101, 131), future.values,
             label=f'{model_type} (R²={metrics["r2"]:.3f})',
             color=color, linewidth=2)

plt.xlabel('Day', fontsize=12)
plt.ylabel('Predicted Cases', fontsize=12)
plt.title('ML Model Comparison - 30-day Forecast', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

| Metric | Formula | Interpretation | Good Value |
|---|---|---|---|
| R² | 1 - (SS_res/SS_tot) | Proportion of variance explained | > 0.8 |
| RMSE | √(MSE) | Typical prediction error, in the same units as cases | Lower is better |

| R² Value | Meaning |
|---|---|
| 0.9 - 1.0 | Excellent fit |
| 0.7 - 0.9 | Good fit |
| 0.5 - 0.7 | Moderate fit |
| 0.0 - 0.5 | Poor fit |
| < 0 | Worse than always predicting the mean |
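To make the two formulas concrete, here is a small NumPy sketch that computes both metrics by hand. The function names and the toy numbers are made up for illustration. The example also shows why R² = 0 corresponds to always predicting the mean, and how a model can score below zero.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


def rmse_manual(y_true, y_pred):
    """RMSE = sqrt(mean squared error), in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = [10, 12, 15, 19, 24]        # mean is 16
close = [11, 12, 14, 20, 23]         # off by at most one case
mean_only = [16, 16, 16, 16, 16]     # always predicts the mean
reversed_trend = [24, 19, 15, 12, 10]

print(f"{r2_manual(y_true, close):.3f}")           # 0.968
print(f"{r2_manual(y_true, mean_only):.3f}")       # 0.000
print(f"{r2_manual(y_true, reversed_trend):.3f}")  # -2.889
print(f"{rmse_manual(y_true, close):.3f}")         # 0.894
```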
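```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sir_simulator.advanced_features.ml_prediction import EpidemicPredictor

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Perform cross-validation. TimeSeriesSplit keeps each validation fold later in
# time than its training data (assuming X is in chronological order); a plain
# cv=5 would also train on days that come after the validation fold.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(predictor.model, X, y, cv=cv, scoring='r2')
print(f"Cross-validation R² scores: {scores}")
print(f"Mean R²: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

ML models were widely used during COVID-19: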
- Short-term hospital admissions - Plan bed capacity
- Regional outbreaks - Identify emerging hotspots
- Vaccination impact - Predict effect of vaccine campaigns
| Application | Description |
|---|---|
| Peak timing | When will flu season peak? |
| Severity | How many cases expected? |
| Resource allocation | Where to send vaccines? |

| Use Case | Benefit |
|---|---|
| Staff scheduling | Prepare for peak demand |
| Supply chain | Order enough tests/vaccines |
| Lockdown timing | Optimal intervention timing |
```python
import matplotlib.pyplot as plt
import numpy as np

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)
predictor.train(historical)

# Get feature importance from the fitted model
importances = predictor.model.feature_importances_
feature_names = list(X.columns)

# Sort by importance, most important first
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature', fontsize=12)
plt.ylabel('Importance', fontsize=12)
plt.title('ML Model - Feature Importance', fontsize=14)
plt.tight_layout()
plt.show()
```

```python
import numpy as np
import pandas as pd


def rolling_forecast(historical, days_to_forecast=30, window_size=60):
    """Rolling forecast: retrain on the latest window before each one-day prediction.

    Each prediction is appended to the series as if it were an observation,
    so forecast errors can compound over the horizon.
    """
    forecasts = []
    for _ in range(days_to_forecast):
        # Use last 'window_size' days for training
        train_data = historical.iloc[-window_size:]
        predictor = EpidemicPredictor('random_forest')
        predictor.train(train_data)

        # Predict next day
        next_day = predictor.predict_future(train_data, days=1)
        forecasts.append(next_day.iloc[0])

        # Append prediction to historical (for next iteration)
        new_row = pd.DataFrame({
            'day': [historical['day'].max() + 1],
            'cases': [next_day.iloc[0]]
        })
        historical = pd.concat([historical, new_row], ignore_index=True)
    return forecasts


# Example usage
historical = pd.DataFrame({
    'day': range(1, 101),
    'cases': 10 + np.cumsum(np.random.poisson(0.5, 100))
})
forecasts = rolling_forecast(historical, days_to_forecast=30)
print(f"Rolling forecasts: {forecasts[:10]}")
```

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

predictor = EpidemicPredictor('random_forest')
X, y, _ = predictor.prepare_data(historical)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search; TimeSeriesSplit avoids validating on days that precede the
# training data (assuming X is in chronological order)
grid_search = GridSearchCV(predictor.model, param_grid,
                           cv=TimeSeriesSplit(n_splits=5), scoring='r2')
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best R²: {grid_search.best_score_:.3f}")
```

| Data Points | Viability |
|---|---|
| < 30 | Not enough for ML |
| 30-100 | |
| 100-500 | Acceptable |
| 500+ | Good |
Signs of overfitting:
- Training R² high (>0.95), testing R² low (<0.7)
- Model performs poorly on new data
Solutions:
- Reduce model complexity
- Add more data
- Use cross-validation
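The train/test gap described above can be checked with a few lines of scikit-learn. In this sketch, `looks_overfit` is a hypothetical helper that encodes the thresholds from the list (0.95 and 0.7), and the data is synthetic noise-heavy data rather than epidemic counts. Constraining tree depth and leaf size is one way to reduce model complexity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def looks_overfit(train_r2, test_r2, train_min=0.95, test_max=0.7):
    """Flag the pattern described above: high training R², low testing R²."""
    return train_r2 > train_min and test_r2 < test_max


# Synthetic data: only the first of five features carries signal, the rest is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

configs = {
    "unconstrained": {},
    "constrained": {"max_depth": 3, "min_samples_leaf": 10},
}
gaps = {}
for name, params in configs.items():
    model = RandomForestRegressor(n_estimators=100, random_state=0, **params)
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    gaps[name] = train_r2 - test_r2  # a large gap suggests memorised noise
    print(f"{name}: train R²={train_r2:.3f}, test R²={test_r2:.3f}, "
          f"overfit={looks_overfit(train_r2, test_r2)}")
```

The fixed thresholds are only a rule of thumb; the size of the gap between training and testing R² is usually the more informative signal.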
If your data has weekly patterns, make sure the `day_of_week` feature is included in the model's inputs.
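One quick way to see whether a weekly pattern exists is to average cases by weekday. The sketch below assumes a `date` column is available, which the tutorial's example data does not have, and simulates weekend under-reporting on synthetic counts. pandas numbers weekdays with Monday=0, matching the feature table.

```python
import numpy as np
import pandas as pd

# Assumption: real data carries calendar dates; 2024-01-01 is a Monday
dates = pd.date_range("2024-01-01", periods=28, freq="D")
rng = np.random.default_rng(1)
cases = 100 + rng.poisson(5, size=28)

df = pd.DataFrame({"date": dates, "cases": cases})
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Simulate weekend under-reporting: halve the counts on Saturday and Sunday
weekend = df["day_of_week"] >= 5
df["cases"] = np.where(weekend, df["cases"] // 2, df["cases"])

# Mean cases per weekday; a clear dip or spike indicates a weekly pattern
weekly_profile = df.groupby("day_of_week")["cases"].mean()
print(weekly_profile)
```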
ML models don't know that case counts can't be negative, so clip the forecast at zero:

```python
future = predictor.predict_future(historical, days=30)
future = future.clip(lower=0)  # Ensure non-negative
```

| Concept | Summary |
|---|---|
| ML Purpose | Short-term forecasting from historical data |
| Key Features | Lags, rolling means, day of week |
| Supported Models | Random Forest, XGBoost |
| Evaluation | R² (goodness of fit), RMSE (error magnitude) |
| Best for | 100+ days of historical data |
| Warning | Not mechanistic, black box |
- Scenario Comparison Tutorial - Compare interventions
- API Reference - Complete function documentation