# Week 2 — Part 02: Reproducibility Package Lab

**Estimated time:** 90–120 minutes

---

## Pre-study (Self-learn)

The Foundamental Course assumes you have completed Self-learn. If you need a refresher on environments, dependencies, and reproducibility:

- [Foundamental Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Chapter 2: Python and Environment Management](../../self_learn/Chapters/2/Chapter2.md)

---

## What success looks like (end of Part 02)

- You save a reproducibility package under `output/` that includes:
  - `config.json` — the exact settings used
  - `metrics.json` — the results
  - `requirements.txt` — the dependencies
- Another person can recreate your environment and get similar results.

### Checkpoint

After running this notebook:
- You can point to `output/reproducibility_package/config.json`
- You can point to `output/reproducibility_package/requirements.txt`
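
One way to make this checkpoint mechanical is a small helper that reports which expected files are missing. The following is a minimal sketch: `missing_artifacts` is a hypothetical helper, and the directory and file names are assumptions taken from the lists above, so adjust them if your notebook writes somewhere else.

```python
from pathlib import Path
from typing import Iterable, List


def missing_artifacts(package_dir: Path, names: Iterable[str]) -> List[str]:
    """Return the expected file names that are not present in package_dir."""
    return [name for name in names if not (package_dir / name).is_file()]


# Assumed location and file names, based on the checkpoint above.
package_dir = Path("output") / "reproducibility_package"
expected = ["config.json", "metrics.json", "requirements.txt"]

missing = missing_artifacts(package_dir, expected)
print("missing:", missing if missing else "nothing")
```

An empty result means every listed artifact exists; it says nothing about whether the contents are correct.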

## Learning Objectives

- Create a complete reproducibility package
- Capture configuration, metrics, and dependencies
- Understand what makes an experiment reproducible

## Overview

This notebook builds a minimal, runnable **reproducibility package**.

In this lab you will:

- create a tiny package skeleton under `ml_reproducibility_package/`
- capture run inputs (seed/model) in a metadata file
- write metrics as a JSON artifact
- add a pinned `requirements.txt` (exercise)

If you need background on environments and dependency recording, use the Self-learn links at the top of the notebook.

## Exercise 1: Create a minimal package skeleton

We’ll build a tiny package structure with a trainer module and a CLI entry point.

> Note: this is a *minimal* scaffold designed for Foundamental Course reproducibility practice.

In [None]:
from pathlib import Path

project_root = Path("ml_reproducibility_package")
(project_root / "src" / "ml_trainer").mkdir(parents=True, exist_ok=True)
(project_root / "tests").mkdir(exist_ok=True)

(project_root / "src" / "ml_trainer" / "__init__.py").write_text("")

print("created:", project_root)

In [None]:
trainer_py = '''\
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


@dataclass(frozen=True)
class TrainResult:
    metrics: Dict[str, float]


def load_data(csv_path: str, label_col: str):
    df = pd.read_csv(csv_path)
    if label_col not in df.columns:
        raise ValueError(f"Missing label column: {label_col}")
    return df.drop(columns=[label_col]), df[label_col]


def train(csv_path: str, label_col: str, *, seed: int = 42, max_iter: int = 200) -> TrainResult:
    X, y = load_data(csv_path, label_col)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)

    model = LogisticRegression(max_iter=max_iter)
    model.fit(X_train, y_train)

    pred = model.predict(X_val)
    metrics = {
        "accuracy": float(accuracy_score(y_val, pred)),
        "f1_macro": float(f1_score(y_val, pred, average="macro")),
    }
    return TrainResult(metrics=metrics)


def write_metrics_json(out_path: Path, result: TrainResult) -> None:
    out_path.write_text(json.dumps({"metrics": result.metrics}, indent=2), encoding="utf-8")
'''

cli_py = '''\
from __future__ import annotations

import argparse
from pathlib import Path

from ml_trainer.trainer import train, write_metrics_json


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Train ML model")
    p.add_argument("--input", required=True)
    p.add_argument("--label_col", required=True)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--out", default="output")
    return p


def main() -> int:
    args = build_parser().parse_args()
    out_dir = Path(args.out)
    out_dir.mkdir(parents=True, exist_ok=True)

    result = train(args.input, args.label_col, seed=args.seed)
    write_metrics_json(out_dir / "metrics.json", result)

    print("training completed")
    print("wrote", out_dir / "metrics.json")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
'''

(project_root / "src" / "ml_trainer" / "trainer.py").write_text(trainer_py, encoding="utf-8")
(project_root / "src" / "ml_trainer" / "cli.py").write_text(cli_py, encoding="utf-8")

print("wrote trainer.py and cli.py")
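
To try the trainer you need a CSV with numeric feature columns and a label column. The sketch below writes a small synthetic file using only the standard library. The helper `write_toy_csv`, the file name `toy.csv`, and the column names `x1`, `x2`, and `label` are all assumptions made for illustration, not part of the lab's data.

```python
import csv
import random
from pathlib import Path


def write_toy_csv(path: Path, *, n_rows: int = 40, seed: int = 42) -> Path:
    """Write a small two-feature binary classification CSV (synthetic data)."""
    rng = random.Random(seed)  # local RNG: the same seed gives the same file
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["x1", "x2", "label"])
        for i in range(n_rows):
            label = i % 2  # alternate classes so both are well represented
            # Shift the feature means by class so the problem is learnable.
            x1 = rng.gauss(2.0 * label, 1.0)
            x2 = rng.gauss(-1.0 * label, 1.0)
            writer.writerow([f"{x1:.6f}", f"{x2:.6f}", label])
    return path


toy_path = write_toy_csv(Path("toy.csv"))
print("wrote", toy_path)
```

With `pandas` and `scikit-learn` installed, running `python -m ml_trainer.cli --input ../toy.csv --label_col label` from inside `ml_reproducibility_package/` with `PYTHONPATH=src` should write `output/metrics.json` there.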

In [None]:
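pyproject = """[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "ml-trainer"
version = "0.1.0"
description = "A reproducible ML training package"
requires-python = ">=3.8"
dependencies = ["pandas", "scikit-learn"]
"""

# setuptools reads the [project] table from version 61 onward, so the build
# requirement is raised from 45. The dependencies are the ones trainer.py imports.
(project_root / "pyproject.toml").write_text(pyproject, encoding="utf-8")
print("wrote pyproject.toml")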

In [None]:
import json


def write_run_metadata(path: Path, *, seed: int, model: str) -> None:
    payload = {"seed": int(seed), "model": str(model)}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


run_dir = project_root / "output"
run_dir.mkdir(exist_ok=True)
write_run_metadata(run_dir / "run_metadata.json", seed=42, model="logreg")
print("wrote run metadata")
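
The run metadata above stores only the seed and the model name. A `config.json` that records every setting a run depends on can be written the same way. This is a minimal sketch: `write_config_json` is a hypothetical helper, the numeric values mirror the defaults in `trainer.py`, and the `input` and `label_col` values are placeholders.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_config_json(path: Path, config: Dict[str, Any]) -> None:
    """Write settings as JSON with sorted keys so equal configs give equal files."""
    text = json.dumps(config, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")


# seed, max_iter and test_size mirror trainer.py; the last two are placeholders.
config = {
    "seed": 42,
    "max_iter": 200,
    "test_size": 0.2,
    "model": "logreg",
    "input": "data.csv",
    "label_col": "label",
}

out_dir = Path("ml_reproducibility_package") / "output"
out_dir.mkdir(parents=True, exist_ok=True)
write_config_json(out_dir / "config.json", config)
print("wrote", out_dir / "config.json")
```

Sorting the keys means two runs with the same settings produce byte-identical files, which makes them easy to compare with a plain diff.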

## Exercise 2: Reproducibility extras (TODO)

Implement the TODO function below.

Goal:

- Create a pinned `requirements.txt` inside `ml_reproducibility_package/`.
- This file is part of your reproducibility package: it records which dependencies, and which versions, were used.

Constraint:

- Keep it deterministic (same inputs → same file format).
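
To illustrate the constraint only: a deterministic file format usually means a fixed ordering, one entry per line, and a single trailing newline. The sketch below assumes a hypothetical helper `render_requirements` that takes a name-to-version mapping. The versions shown are placeholders, not the versions you should pin.

```python
from typing import Dict


def render_requirements(pins: Dict[str, str]) -> str:
    """Render name -> version pins as sorted `name==version` lines."""
    lines = [f"{name}=={version}" for name, version in sorted(pins.items())]
    return "\n".join(lines) + "\n"


# Placeholder versions for illustration only; use the ones in your environment.
example = {"scikit-learn": "0.0.0", "pandas": "0.0.0"}
print(render_requirements(example), end="")
```

Because the entries are sorted before rendering, the order in which you list the packages does not change the output.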

In [None]:
def write_requirements_txt_todo(project_root: Path) -> Path:
    # TODO: implement.
    # Suggested approach:
    # - List the minimal set of dependencies you used (pandas, scikit-learn)
    # - Pin each one as name==version; importlib.metadata.version() reports
    #   the version installed in the current environment
    # - Return the path you wrote
    out_path = project_root / "requirements.txt"
    out_path.write_text("TODO\n", encoding="utf-8")
    return out_path


print("Implement write_requirements_txt_todo().")

## Self-check

- Is the random seed you used stored alongside the artifacts?
- Is your package installable from scratch?
- Can you re-run with the same config and get the same metrics?
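
The last question can be checked in code by comparing the metrics from two runs. This is a minimal sketch: `load_metrics` and `metrics_match` are hypothetical helpers, the tolerance is an arbitrary choice, and the file layout assumed is the `{"metrics": {...}}` structure that `write_metrics_json` produces.

```python
import json
import math
from pathlib import Path
from typing import Dict


def load_metrics(path: Path) -> Dict[str, float]:
    """Read the {"metrics": {...}} layout written by write_metrics_json."""
    return json.loads(path.read_text(encoding="utf-8"))["metrics"]


def metrics_match(a: Dict[str, float], b: Dict[str, float], *, tol: float = 1e-9) -> bool:
    """True when both runs report the same metric names with values within tol."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=0.0, abs_tol=tol) for k in a)


# Illustrative values only, not results from a real run.
run_a = {"accuracy": 0.9, "f1_macro": 0.89}
run_b = {"accuracy": 0.9, "f1_macro": 0.89}
print("same metrics:", metrics_match(run_a, run_b))
```

A small absolute tolerance allows for floating-point noise between machines; set `tol=0.0` if you expect bit-identical results.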

## Appendix: Solutions (peek only after trying)

Reference implementation for `write_requirements_txt_todo`.

In [None]:
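from importlib.metadata import PackageNotFoundError, version


def write_requirements_txt_todo(project_root: Path) -> Path:
    out_path = project_root / "requirements.txt"
    # Minimal reproducibility package: pin the major deps used by trainer.py
    # to the versions installed in the current environment.
    lines = []
    for name in sorted(["pandas", "scikit-learn"]):
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here: fall back to an unpinned entry.
            lines.append(name)
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path


p = write_requirements_txt_todo(project_root)
print("wrote", p)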