# Week 2 — Part 02: Reproducibility Package Lab

**Estimated time:** 90–120 minutes

## Learning Objectives

- Explain reproducibility packages and why they matter
- Create a minimal Python package structure for ML code
- Add metadata and dependency management
- Practice lightweight CLI wiring for training


## Overview

A reproducibility package bundles code, configs, and artifacts so that a run can be repeated later and produce the same results.

Key ideas:

- deterministic inputs and seeds
- pinned dependencies
- saved artifacts and run metadata
- minimal CLI for reproducible execution
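
The first idea, deterministic inputs and seeds, can be sketched with the standard library alone. The helper names below (`file_sha256`, `seeded_shuffle`) are illustrative and not part of the lab package. The sketch assumes you want to fingerprint an input file and to draw from a dedicated seeded generator rather than the global one.

```python
import hashlib
import random
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seeded_shuffle(items: list, seed: int) -> list:
    """Shuffle a copy of items with a dedicated RNG seeded by `seed`."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Storing the digest next to the seed lets a later run confirm that it read the same bytes before comparing results.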

## Exercise 1: Create a minimal package skeleton

We’ll build a tiny package structure with a trainer module and a CLI entry point.

> Note: this is a *minimal* scaffold designed for Level 1 reproducibility practice.

In [None]:
from pathlib import Path

project_root = Path("ml_reproducibility_package")
(project_root / "src" / "ml_trainer").mkdir(parents=True, exist_ok=True)
(project_root / "tests").mkdir(exist_ok=True)

(project_root / "src" / "ml_trainer" / "__init__.py").write_text("")

print("created:", project_root)

In [None]:
# The outer strings use ''' so the generated modules can keep their """docstrings""".
trainer_py = '''"""Minimal trainer module (skeleton)."""

from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


def load_data(csv_path: str, label_col: str):
    """Read a CSV and split it into features and the label column."""
    df = pd.read_csv(csv_path)
    if label_col not in df.columns:
        raise ValueError(f"Missing label column: {label_col}")
    return df.drop(columns=[label_col]), df[label_col]


def train(csv_path: str, label_col: str, seed: int = 42):
    """Fit a logistic regression on a seeded 80/20 train/validation split."""
    X, y = load_data(csv_path, label_col)
    # X_val and y_val are held out here but not evaluated yet in this skeleton.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    return model
'''

cli_py = '''"""Minimal CLI stub."""

import argparse
from ml_trainer.trainer import train


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Train ML model")
    p.add_argument("--input", required=True)
    p.add_argument("--label_col", required=True)
    return p


def main() -> None:
    args = build_parser().parse_args()
    _ = train(args.input, args.label_col)
    print("training completed")


if __name__ == "__main__":
    main()
'''

(project_root / "src" / "ml_trainer" / "trainer.py").write_text(trainer_py)
(project_root / "src" / "ml_trainer" / "cli.py").write_text(cli_py)

print("wrote trainer.py and cli.py")

In [None]:
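# [project] metadata in pyproject.toml needs setuptools 61 or newer.
# The dependencies list matches the imports in trainer.py.
pyproject = """[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "ml-trainer"
version = "0.1.0"
description = "A reproducible ML training package"
requires-python = ">=3.8"
dependencies = ["pandas", "scikit-learn"]
"""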

(project_root / "pyproject.toml").write_text(pyproject)
print("wrote pyproject.toml")
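
`pyproject.toml` declares what the package needs, while pinning records the exact versions a run actually used. As a starting point for the first practice exercise, the sketch below builds `name==version` lines from the current environment with `importlib.metadata`. The function name `pinned_requirements` is hypothetical, and marking a missing package with a comment line is an assumption about how you would want gaps reported.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return requirements.txt-style lines for the installed versions."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker instead of silently dropping the package.
            lines.append(f"# {name}: not installed in this environment")
    return lines
```

Calling `pinned_requirements(["pandas", "scikit-learn"])` and joining the result with newlines gives text you could write to `requirements.txt`.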

In [None]:
import json


def write_run_metadata(path: Path, *, seed: int, model: str) -> None:
    # Write JSON rather than str(dict) so other tools can parse the file.
    payload = {"seed": seed, "model": model}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


run_dir = project_root / "output"
run_dir.mkdir(exist_ok=True)
write_run_metadata(run_dir / "run_metadata.json", seed=42, model="logreg")
print("wrote run metadata")
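
Run metadata is more useful for later comparison when it also records the environment. The sketch below is one possible extension, not part of the lab package. The helper names (`build_run_metadata`, `save_and_reload`) and the extra fields are assumptions chosen for illustration.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def build_run_metadata(seed: int, model: str) -> dict:
    """Collect the seed, model name, and basic environment details."""
    return {
        "seed": seed,
        "model": model,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


def save_and_reload(path: Path, payload: dict) -> dict:
    """Write the payload as JSON, then read it back to confirm it parses."""
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))
```

Sorting the keys keeps the file stable across runs, which makes diffs between two metadata files easier to read.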

## Practice exercises

- Add a `requirements.txt` writer that pins versions.
- Add a `--seed` flag to the CLI and save it to run metadata.
- Add a JSON metrics writer in `trainer.py`.

## References

- Packaging guide: https://packaging.python.org/

## Self-check

- Is the random seed stored alongside the run artifacts?
- Can the package be installed from scratch in a fresh environment?
- Can you re-run with the same config and get the same metrics?
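
The last check can be automated by comparing the metrics from two runs. This is a minimal sketch that assumes each run yields a flat dictionary of numeric metrics. The function name `metrics_match` and the tolerance defaults are illustrative choices.

```python
import math


def metrics_match(first: dict, second: dict,
                  rel_tol: float = 1e-9, abs_tol: float = 0.0) -> bool:
    """Return True when both runs report the same metric names and values."""
    if first.keys() != second.keys():
        return False
    return all(
        math.isclose(first[name], second[name], rel_tol=rel_tol, abs_tol=abs_tol)
        for name in first
    )
```

A tolerance is used instead of `==` because floating-point results can differ in the last digits across platforms or library versions. Set `abs_tol` above zero if a metric can legitimately be 0.0.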