# EPL Match Outcome Prediction
## Notebook 02 — Model Training & Evaluation (ML vs Bookmaker)

This notebook focuses on the **core supervised-learning step** of the project:

- Train two models on engineered features:
  - **Multinomial Logistic Regression** (interpretable linear baseline)
  - **Random Forest** (non-linear baseline)
- Evaluate both on a **time-based holdout test set**
- Compare to a **bookmaker implied-probability baseline** (no training)

### Why this notebook exists (academic motivation)
Your pipeline builds a chronological dataset of Premier League matches.  
To avoid **data leakage** (training on “future” information), we evaluate with a **temporal split**:

- **Train:** first 80% of matches (older)
- **Test:** last 20% of matches (more recent)

This reflects the real use-case: predicting upcoming matches using only past data.

---

**Expected input file**
- `../data/processed/model_data.csv`

**Outputs produced by your code (`src/models.py`)**, with paths relative to this notebook
- `../results/*_report.txt` (classification reports + log-loss)
- `../results/visualisation/*.png` (confusion matrices)
- `../results/logistic_regression_coefficients.txt`
- `../results/random_forest_feature_importance.txt`
- `../models/logistic_regression.pkl`, `../models/random_forest.pkl`
- `../models/feature_list.txt`

### 0) Imports and setup

In [None]:
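import sys
from pathlib import Path
import pandas as pd

# Assumption: the notebook runs from a project subfolder (it reads ../data), so add the root to the import path
sys.path.insert(0, str(Path("..").resolve()))

# Project functions (these implement your official evaluation + artifact saving)
from src.models import evaluate_bookmaker, train_models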

In [None]:
# Ensure output folders exist (your functions also create them, but we keep it explicit)
Path("../results").mkdir(parents=True, exist_ok=True)
Path("../results/visualisation").mkdir(parents=True, exist_ok=True)
Path("../models").mkdir(parents=True, exist_ok=True)

print("✅ Output folders ready.")

### 1) Load the final ML dataset (`model_data.csv`)

In [None]:
DATA_PATH = Path("../data/processed/model_data.csv")
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Cannot find {DATA_PATH}. "
        "Run your pipeline (main.py / notebook 03) to generate it, "
        "or update DATA_PATH to the correct location."
    )

df_model = pd.read_csv(DATA_PATH)
df_model["match_date"] = pd.to_datetime(df_model["match_date"], errors="coerce")
df_model = df_model.sort_values("match_date").reset_index(drop=True)

print("Shape:", df_model.shape)
print("Date range:", df_model["match_date"].min(), "→", df_model["match_date"].max())
df_model.head()

### 2) Sanity checks (what we verify and why)

We verify that:
- `target` exists and follows your encoding: **Home win = 1, Draw = 0, Away win = -1**
- feature columns start with `diff_...` (home minus away rolling averages)

These checks help ensure we are training on the intended design:
> **match-level differences** that summarize recent team performance.

In [None]:
# Check target
if "target" not in df_model.columns:
    raise ValueError("Missing required column: target")

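print("Target distribution (proportion):")
print(df_model["target"].value_counts(normalize=True).sort_index())

# Verify the encoding stated above: Home win = 1, Draw = 0, Away win = -1
unexpected = set(df_model["target"].dropna().unique()) - {-1, 0, 1}
if unexpected:
    raise ValueError(f"Unexpected target values: {sorted(unexpected)}")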

# Check diff_ features
diff_cols = [c for c in df_model.columns if c.startswith("diff_")]
print("\nNumber of engineered diff_ features:", len(diff_cols))
print("Example features:", diff_cols[:8])

if len(diff_cols) == 0:
    raise ValueError("No diff_ features found. Expected columns starting with 'diff_'.")

### 3) Define the test set (time-based split)

**Why time-based split?**  
Because in football prediction we must not learn from matches that occur after the ones we want to predict.

So we split chronologically:
- first 80% → training (past)
- last 20% → test (future relative to train)

In [None]:
# Drop rows with missing values (your main.py does this before training)
df_model = df_model.dropna().reset_index(drop=True)

split_idx = int(len(df_model) * 0.8)
df_train = df_model.iloc[:split_idx].reset_index(drop=True)
df_test  = df_model.iloc[split_idx:].reset_index(drop=True)

print(f"Train size: {len(df_train)} | Test size: {len(df_test)}")
print("Train date range:", df_train["match_date"].min(), "→", df_train["match_date"].max())
print("Test  date range:", df_test["match_date"].min(), "→", df_test["match_date"].max())

### 4) Bookmaker baseline (no model training)

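This baseline converts **decimal odds** into **implied probabilities**:
- \( p_i = 1 / \text{odds}_i \) for each outcome \( i \) (home win, draw, away win)
- the three values are then normalized to sum to 1, which removes the overround (the bookmaker's margin, which makes the raw \( p_i \) sum to more than 1).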
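
The conversion can be sketched in a few lines of plain Python. This is only an illustration: the function name and the example odds are hypothetical, proportional rescaling is assumed as the normalization method, and the project's `evaluate_bookmaker` may differ in detail.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal odds to implied probabilities that sum to 1."""
    raw = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    overround = sum(raw)  # greater than 1 when the bookmaker takes a margin
    # Assumed normalization: divide each raw probability by the overround
    return [p / overround for p in raw], overround


# Hypothetical odds for one match: home 2.10, draw 3.40, away 3.60
probs, overround = implied_probabilities(2.10, 3.40, 3.60)
print("Overround:", round(overround, 4))
print("Home/Draw/Away:", [round(p, 4) for p in probs])
```

Shorter odds map to higher probabilities, so the home side is the favourite in this made-up example.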

We then predict the class with the highest implied probability and evaluate:
- **Accuracy**
- **Log loss** (probabilistic quality)
- **Confusion matrix** (saved as PNG)

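Because bookmaker odds already aggregate a large amount of market information, this is a strong, realistic benchmark for the ML models to beat.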
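
To make the two headline metrics concrete, here is a small self-contained sketch of accuracy and multiclass log loss computed from probability triples. The function name and the three example matches are invented for illustration, and the column order (home, draw, away) is an assumption, not necessarily the order used in `src/models.py`.

```python
import math

CLASSES = [1, 0, -1]  # Home win, Draw, Away win (the notebook's target encoding)


def accuracy_and_log_loss(y_true, probs):
    """y_true: list of labels; probs: list of [p_home, p_draw, p_away]."""
    eps = 1e-15  # clip to avoid log(0)
    hits, total_loss = 0, 0.0
    for label, p in zip(y_true, probs):
        predicted = CLASSES[p.index(max(p))]  # class with the highest probability
        hits += predicted == label
        p_true = min(max(p[CLASSES.index(label)], eps), 1.0)
        total_loss += -math.log(p_true)  # penalty grows as p_true shrinks
    n = len(y_true)
    return hits / n, total_loss / n


# Made-up example: three matches with normalized probabilities
y_true = [1, 0, -1]
probs = [[0.60, 0.25, 0.15], [0.45, 0.30, 0.25], [0.20, 0.30, 0.50]]
acc, ll = accuracy_and_log_loss(y_true, probs)
print(f"Accuracy: {acc:.3f} | Log loss: {ll:.3f}")
```

Accuracy only looks at the top pick, while log loss also rewards putting a sensible probability on the outcome that actually happened, which is why both are reported.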

In [None]:
book_metrics = evaluate_bookmaker(df_test)
book_metrics

### 5) Train ML models and evaluate on the holdout test set

We train:
- **Logistic Regression** with standardization  
  Why? Interpretable coefficients and a strong linear baseline.
- **Random Forest**  
  Why? Captures non-linear effects and interactions without heavy preprocessing.

Your `train_models()` function:
- uses all `diff_...` features
- applies the same chronological split internally (80/20)
- saves:
  - reports (`.txt`)
  - confusion matrices (`.png`)
  - model files (`.pkl`)
  - feature list used for training

In [None]:
log_model, rf_model, metrics = train_models(df_model, book_metrics)
metrics

### 6) Where to find the outputs

After running the notebook, check:

**Models**
- `../models/logistic_regression.pkl`
- `../models/random_forest.pkl`
- `../models/feature_list.txt`

**Reports**
- `../results/bookmaker_baseline_report.txt`
- `../results/logistic_regression_report.txt`
- `../results/random_forest_report.txt`
- `../results/final_results_summary.txt`

**Interpretability**
- `../results/logistic_regression_coefficients.txt`
- `../results/random_forest_feature_importance.txt`

**Figures**
- `../results/visualisation/confusion_matrix_bookmaker.png`
- `../results/visualisation/confusion_matrix_logistic_regression.png`
- `../results/visualisation/confusion_matrix_random_forest.png`

### 7) Notes (optional improvements)

If you want more robust evaluation for the dissertation/report:
- **Walk-forward validation** (rolling-origin evaluation)
- Probability calibration (isotonic / Platt scaling)
- Confidence-based analysis: performance vs predicted probability bins
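
As a rough idea of what walk-forward validation looks like, the sketch below generates expanding-window train/test index ranges for chronologically sorted rows. The function name, the number of folds, and the minimum training fraction are all illustrative assumptions; scikit-learn's `TimeSeriesSplit` offers a similar ready-made splitter.

```python
def walk_forward_splits(n_samples, n_folds=5, min_train_frac=0.5):
    """Yield (train_indices, test_indices) with an expanding training window.

    Rows are assumed to be sorted chronologically, as in this notebook.
    """
    first_test = int(n_samples * min_train_frac)
    fold_size = (n_samples - first_test) // n_folds
    if fold_size < 1:
        raise ValueError("Not enough samples for the requested number of folds")
    for k in range(n_folds):
        start = first_test + k * fold_size
        # The last fold absorbs any remainder rows
        stop = n_samples if k == n_folds - 1 else start + fold_size
        yield range(0, start), range(start, stop)


# Illustrative run on 1000 rows: each fold trains on everything before its test block
for train_idx, test_idx in walk_forward_splits(n_samples=1000, n_folds=5):
    print(f"train: 0..{train_idx[-1]} | test: {test_idx[0]}..{test_idx[-1]}")
```

Each fold would be trained and scored separately, and the per-fold metrics averaged, so the result depends less on one particular 80/20 cut.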