# Week 2 — Part 01: ML Training Loop Lab

**Estimated time:** 120–150 minutes

---

## Pre-study (Self-learn)

The Foundations Course assumes you have completed Self-learn. If you need a refresher on the evaluation mindset and metrics:

- [Foundations Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Evaluation metrics (accuracy/precision/recall/F1)](../self_learn/Chapters/4/02_core_concepts.md)

---

## What success looks like (end of Part 01)

- You can run a full loop:
  - split → train → evaluate
- You save artifacts under `output/`:
  - one timestamped run folder per run (e.g. `output/run_.../`) containing `config.json`, `metrics.json`, `val_report.txt`, and `model.joblib`
  - one summary file (e.g. `output/training_loop_summary.json`)

### Checkpoint

After running this notebook, you should be able to point to:

- the exact `metrics.json` (and its sibling `config.json`) that produced one metric
- the `training_loop_summary.json` that ranks multiple configs

## Learning Objectives

- Implement a complete ML training loop (split → train → evaluate → save)
- Understand train/validation splits
- Practice model evaluation metrics
- Save and reload artifacts for reproducibility
- Compare model configurations

### What this part covers
This notebook implements the **ML training loop** — the core engineering pattern for running, evaluating, and saving machine learning experiments.

The loop has 5 steps: **Load → Split → Train → Evaluate → Save artifacts**

Each step produces something concrete you can inspect. By the end you will have a timestamped folder under `output/` containing the config, metrics, and model for every run — so you can always trace back "what produced this result?"

## Overview

This lab is a minimal end-to-end baseline:

1. Load data
2. Split train/validation (fixed seed)
3. Train a baseline model
4. Evaluate on validation
5. Save artifacts to `output/`

If you need a refresher on why we split data and how to interpret metrics, use the Self-learn links at the top of the notebook.

### What this cell does
Imports the libraries used throughout the lab, creates the `output/` directory, and defines `TrainConfig`, a typed dataclass holding all hyperparameters. `run_once()`, which executes one full training loop iteration, is defined later in Task 1.6.

**Key design decisions (applied across the notebook):**
- **`TrainConfig` dataclass:** All parameters in one place. When you save `config.json`, you save the exact settings that produced the result. No guessing later.
- **`stratify=y` in `train_test_split`:** Ensures each class appears proportionally in both train and validation sets. Without this, a small dataset might have all examples of one class in train and none in validation.
- **`StandardScaler`:** Fit on train data only, then applied to validation. Fitting on the full dataset would "leak" validation statistics into training, a subtle but common mistake.
- **Artifacts saved per run:** `config.json`, `metrics.json`, `val_report.txt`, and `model.joblib` in a timestamped run folder, plus `training_loop_summary.json` (all candidates ranked). This is your audit trail.

**What to check:** After running the notebook, open `output/run_.../metrics.json` and verify the metrics match what's printed.

In [1]:
from dataclasses import dataclass
from pathlib import Path
import json
import joblib
from datetime import datetime
import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.preprocessing import StandardScaler


OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)


@dataclass
class TrainConfig:
    seed: int = 7
    test_size: float = 0.25
    max_iter: int = 250


cfg = TrainConfig()
print(cfg)

TrainConfig(seed=7, test_size=0.25, max_iter=250)
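
One way to keep a saved config in sync with `TrainConfig` is to serialize the dataclass directly instead of copying fields by hand. The sketch below uses the standard-library `dataclasses.asdict`; it redefines `TrainConfig` so it runs on its own, and the JSON round trip is an illustration only, not one of the lab's cells.

```python
from dataclasses import asdict, dataclass
import json


@dataclass
class TrainConfig:
    seed: int = 7
    test_size: float = 0.25
    max_iter: int = 250


cfg = TrainConfig()

# asdict() picks up every field, so a newly added hyperparameter is saved automatically.
config_json = json.dumps(asdict(cfg), indent=2)
print(config_json)

# Round trip: rebuild the config from the JSON text and confirm nothing changed.
restored = TrainConfig(**json.loads(config_json))
print("Round trip OK:", restored == cfg)
```

Dataclasses compare field by field, so `restored == cfg` is a cheap check that the saved file really describes the settings that were used.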


### Task 1.1: Load Data

Load a dataset and inspect basic shapes/labels. We'll use Iris for a small reproducible example.

In [2]:
# Load Iris dataset
data = load_iris(as_frame=True)
X = data.data
y = data.target

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print("\nFirst 3 rows of features:")
print(X.head(3))
print("\nClass distribution:")
print(y.value_counts())

Features shape: (150, 4)
Labels shape: (150,)

First 3 rows of features:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2

Class distribution:
target
0    50
1    50
2    50
Name: count, dtype: int64


### Task 1.2: Train/Validation Split & Preprocessing

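We need to hold out some data to evaluate the model honestly. We also scale the features.
**Crucial**: Fit the scaler only on the training data, then transform the validation data to prevent data leakage. Because the scaler uses training statistics, the validation feature means come out close to zero but not exactly zero.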

In [3]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=cfg.test_size, random_state=cfg.seed, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

print("\nFeature means after scaling (Train):", X_train_s.mean(axis=0))
print("Feature means after scaling (Val):", X_val_s.mean(axis=0))

Training set size: 112
Validation set size: 38

Feature means after scaling (Train): [ 5.70971841e-16  1.01506105e-15 -1.26882631e-16  2.29974769e-16]
Feature means after scaling (Val): [-0.16690571 -0.10741899 -0.07138625 -0.07201851]
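
To see what `stratify=y` buys you, it can help to count the classes in each split. The sketch below repeats the lab's split settings (25% validation, seed 7) on Iris and prints per-class counts. It loads the data as plain arrays with `return_X_y=True`, which is a convenience assumed here rather than what the cells above do.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split settings as the lab: 25% validation, seed 7, stratified on y.
_, _, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# Count the examples of each class in both splits.
train_counts = np.bincount(y_train)
val_counts = np.bincount(y_val)
print("Train class counts:", train_counts)
print("Val class counts:  ", val_counts)
```

With three balanced classes, the per-class counts in each split should differ by at most one. Dropping `stratify=y` removes that guarantee.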


### Task 1.3: Train the Model

We initialize and train a `LogisticRegression` baseline model using parameters from `TrainConfig`.

In [4]:
model = LogisticRegression(max_iter=cfg.max_iter, random_state=cfg.seed)

start_time = time.time()
model.fit(X_train_s, y_train)
train_time = time.time() - start_time

print(f"Model trained in {train_time:.4f} seconds")
print(f"Model classes: {model.classes_}")

Model trained in 0.0050 seconds
Model classes: [0 1 2]
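
A quick sanity check after training is whether the solver stopped before reaching `max_iter`. The sketch below refits the same baseline on its own and reads scikit-learn's fitted `n_iter_` attribute. Treating "fewer iterations than the cap" as convergence is a heuristic assumed here; scikit-learn also emits a `ConvergenceWarning` when the cap is hit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)
X_train_s = StandardScaler().fit_transform(X_train)

max_iter = 250
model = LogisticRegression(max_iter=max_iter, random_state=7)
model.fit(X_train_s, y_train)

# n_iter_ is an array holding the number of iterations the solver actually ran.
iterations_used = int(model.n_iter_.max())
stopped_early = iterations_used < max_iter
print(f"Iterations used: {iterations_used} of {max_iter} (stopped early: {stopped_early})")
```

If the solver uses every allowed iteration, raising `max_iter` in `TrainConfig` is the first thing to try.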


### Task 1.4: Evaluate the Model

We compute metrics on the validation set, including a full classification report.

In [5]:
pred = model.predict(X_val_s)

acc = float(accuracy_score(y_val, pred))
f1 = float(f1_score(y_val, pred, average="macro"))
report = classification_report(y_val, pred)

metrics = {
    "accuracy": acc,
    "f1_macro": f1,
    "train_time_seconds": train_time
}

print(f"Accuracy: {acc:.4f}")
print(f"F1 (macro): {f1:.4f}")
print("\nClassification Report:\n")
print(report)

Accuracy: 0.9737
F1 (macro): 0.9733

Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.92      0.96        13
           2       0.92      1.00      0.96        12

    accuracy                           0.97        38
   macro avg       0.97      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38
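
The report above is consistent with a single mistake: one class-1 example predicted as class 2. The sketch below rebuilds labels with that pattern (hypothetical arrays, not the notebook's actual `y_val` and `pred`) to show how the headline numbers follow from the counts.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels matching the report: 13/13/12 validation examples per
# class, with one class-1 example predicted as class 2.
y_true = [0] * 13 + [1] * 13 + [2] * 12
y_pred = [0] * 13 + [1] * 12 + [2] * 1 + [2] * 12

acc = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"Accuracy:   {acc:.4f}")                 # 37/38 -> 0.9737
print("Per-class F1:", per_class_f1.round(2))   # 1.0, 0.96, 0.96
print(f"F1 (macro): {macro_f1:.4f}")            # (1.00 + 0.96 + 0.96) / 3 -> 0.9733
```

Class 1 has precision 1.0 and recall 12/13, and class 2 has the reverse, so both reach F1 = 24/25 = 0.96. Macro F1 is the unweighted mean of the three per-class scores.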



### Task 1.5: Save Artifacts

This is the core of ML discipline. Every run produces a traceable folder containing the configuration, metrics, reports, and the model itself.

In [6]:
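# Create a unique run ID from the current UTC timestamp.
# datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware call instead.
from datetime import timezone

run_id = datetime.now(timezone.utc).strftime("run_%Y%m%d_%H%M%S")
run_dir = OUTPUT_DIR / run_id
run_dir.mkdir(exist_ok=True)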

# 1. Save config
config_dict = {"seed": cfg.seed, "test_size": cfg.test_size, "max_iter": cfg.max_iter}
(run_dir / "config.json").write_text(json.dumps(config_dict, indent=2), encoding="utf-8")

# 2. Save metrics
(run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2), encoding="utf-8")

# 3. Save validation report
(run_dir / "val_report.txt").write_text(report, encoding="utf-8")

# 4. Save model
joblib.dump(model, run_dir / "model.joblib")

print(f"Saved artifacts to {run_dir}:")
for f in run_dir.iterdir():
    print(f" - {f.name}")

Saved artifacts to output/run_20260220_010418:
 - model.joblib
 - val_report.txt
 - config.json
 - metrics.json
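
Saving is only half of reproducibility; the other half is reloading the artifacts and getting the same predictions. Note that the cell above saves the classifier but not the fitted `scaler`, so `model.joblib` alone cannot score raw features. The sketch below is one possible approach, not the lab's method: it bundles scaler and classifier in a scikit-learn `Pipeline`, saves it with `joblib`, and reloads it. The temporary directory and the `run_example` folder name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
config = {"seed": 7, "test_size": 0.25, "max_iter": 250}
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=config["test_size"], random_state=config["seed"], stratify=y
)

# Assumption: bundle scaler + classifier so a single file is enough to predict later.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=config["max_iter"], random_state=config["seed"]),
)
pipe.fit(X_train, y_train)  # the scaler is fit on the training split only

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_example"  # hypothetical run folder
    run_dir.mkdir()
    (run_dir / "config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    joblib.dump(pipe, run_dir / "model.joblib")

    # Reload both artifacts the way a later session would.
    loaded_config = json.loads((run_dir / "config.json").read_text(encoding="utf-8"))
    loaded_pipe = joblib.load(run_dir / "model.joblib")

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(pipe.predict(X_val), loaded_pipe.predict(X_val)))
print("Config round trip OK:", loaded_config == config)
print("Predictions match:   ", same_predictions)
```

If you keep the scaler and model as separate objects instead, save both (for example as two `joblib` files) so that a run folder is self-sufficient.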



### Task 1.6: The Full Loop (Putting it together)

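Let's wrap these steps into a function, run multiple configurations, and save a summary. This shows that we can automate experimentation and compare runs. Keep in mind that candidates with a different `seed` or `test_size` are scored on different validation sets, so their metrics are not strictly comparable.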

In [7]:
def run_once(seed: int, test_size: float, max_iter: int) -> dict:
    """Executes a full training loop and saves artifacts."""
    # Split
    X_tr, X_v, y_tr, y_v = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    
    # Scale
    scl = StandardScaler()
    X_tr_s = scl.fit_transform(X_tr)
    X_v_s = scl.transform(X_v)
    
    # Train
    t0 = time.time()
    clf = LogisticRegression(max_iter=max_iter, random_state=seed)
    clf.fit(X_tr_s, y_tr)
    t_diff = time.time() - t0
    
    # Evaluate
    p = clf.predict(X_v_s)
    m = {
        "accuracy": float(accuracy_score(y_v, p)),
        "f1_macro": float(f1_score(y_v, p, average="macro")),
        "train_time_seconds": t_diff
    }
    rep = classification_report(y_v, p)
    
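    # Save Artifacts
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    from datetime import timezone  # local import so this cell runs on its own

    # %f adds microseconds so rapid back-to-back runs get distinct folders.
    rid = datetime.now(timezone.utc).strftime("run_%Y%m%d_%H%M%S_%f")
    rdir = OUTPUT_DIR / rid
    rdir.mkdir(exist_ok=True)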
    
    cfg_dict = {"seed": seed, "test_size": test_size, "max_iter": max_iter}
    (rdir / "config.json").write_text(json.dumps(cfg_dict, indent=2), encoding="utf-8")
    (rdir / "metrics.json").write_text(json.dumps(m, indent=2), encoding="utf-8")
    (rdir / "val_report.txt").write_text(rep, encoding="utf-8")
    joblib.dump(clf, rdir / "model.joblib")
    
    return {
        "run_id": rid,
        "config": cfg_dict,
        "metrics": m
    }

# Run multiple candidates
candidates = [
    {"seed": 7, "test_size": 0.25, "max_iter": 100},
    {"seed": 7, "test_size": 0.25, "max_iter": 400},
    {"seed": 13, "test_size": 0.20, "max_iter": 250},
]

print("Running candidates...")
results = [run_once(**c) for c in candidates]
results_sorted = sorted(results, key=lambda r: r["metrics"]["accuracy"], reverse=True)

summary = {
    "best": results_sorted[0],
    "all": results_sorted,
}

summary_path = OUTPUT_DIR / "training_loop_summary.json"
summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")

print(f"\nWrote summary to: {summary_path}")
print("Best config:", summary["best"]["config"])
print("Best metrics:", summary["best"]["metrics"])

Running candidates...

Wrote summary to: output/training_loop_summary.json
Best config: {'seed': 13, 'test_size': 0.2, 'max_iter': 250}
Best metrics: {'accuracy': 1.0, 'f1_macro': 1.0, 'train_time_seconds': 0.002576589584350586}
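
Sorting by accuracy alone leaves tied candidates in their input order, because Python's `sorted` is stable. If you want a deterministic tie-break, one option is a tuple sort key. The sketch below assumes a tie-break order (accuracy, then macro F1, then training time) and uses made-up results in the same shape `run_once()` returns; the `rank_key` helper and the numbers are hypothetical.

```python
# Hypothetical results shaped like run_once() output; the numbers are made up
# for illustration and are not outputs of this lab.
results = [
    {"run_id": "run_a", "metrics": {"accuracy": 0.95, "f1_macro": 0.94, "train_time_seconds": 0.004}},
    {"run_id": "run_b", "metrics": {"accuracy": 0.95, "f1_macro": 0.95, "train_time_seconds": 0.009}},
    {"run_id": "run_c", "metrics": {"accuracy": 0.90, "f1_macro": 0.90, "train_time_seconds": 0.002}},
]


def rank_key(result: dict) -> tuple:
    """Higher accuracy first, then higher macro F1, then shorter training time."""
    m = result["metrics"]
    return (-m["accuracy"], -m["f1_macro"], m["train_time_seconds"])


ranked = sorted(results, key=rank_key)
for position, r in enumerate(ranked, start=1):
    m = r["metrics"]
    print(f"{position}. {r['run_id']}  acc={m['accuracy']:.2f}  f1={m['f1_macro']:.2f}")
```

Here `run_b` ranks ahead of `run_a` because the two tie on accuracy and `run_b` has the higher macro F1.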


