# Week 2 — Part 02: Reproducibility Package Lab

**Estimated time:** 90–120 minutes

---

## Pre-study (Self-learn)

The Foundations Course assumes you have completed Self-learn. If you need a refresher on environments, dependencies, and reproducibility:

- [Foundations Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Chapter 2: Python and Environment Management](../self_learn/Chapters/2/Chapter2.md)

---

## What success looks like (end of Part 02)

- You save a reproducibility package under `output/` that includes:
  - `config.json` — the exact settings used
  - `metrics.json` — the results
  - `requirements.txt` — the dependencies
- Another person can recreate your environment and get similar results.

### Checkpoint

After running this notebook:
- You can point to `output/reproducibility_package/config.json`
- You can point to `output/reproducibility_package/requirements.txt`

## Learning Objectives

- Create a complete reproducibility package
- Capture configuration, metrics, and dependencies
- Understand what makes an experiment reproducible

### What this part covers
This notebook focuses on **reproducibility** — the ability to re-run an experiment and get the same (or explainably similar) results.

A reproducibility package bundles together:
- **`config.json`** — exact parameters used (seed, model, hyperparameters)
- **`metrics.json`** — what happened (accuracy, F1, latency)
- **`requirements.txt`** — exact library versions

**Why this matters for LLM work:** When you later compare prompt strategies or model versions, reproducibility lets you isolate what changed. Was it the data? The prompt? The model version? Without saved configs and artifacts, you're guessing.
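
As a rough illustration of how saved configs let you isolate what changed, the sketch below writes `config.json` and `metrics.json` for two runs and then diffs the configs. The helpers `save_run_artifacts` and `diff_configs` are hypothetical and not part of `ml_package`, and every setting and metric value is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def save_run_artifacts(run_dir: Path, config: dict, metrics: dict) -> None:
    """Write config.json and metrics.json into run_dir (hypothetical helper)."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the files stable for identical inputs, so diffs stay meaningful
    (run_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))


def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose value differs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Placeholder settings and metrics, not results from a real run
    save_run_artifacts(root / "run_a", {"seed": 42, "model": "logreg", "max_iter": 200}, {"accuracy": 0.0})
    save_run_artifacts(root / "run_b", {"seed": 42, "model": "logreg", "max_iter": 500}, {"accuracy": 0.0})
    loaded_a = json.loads((root / "run_a" / "config.json").read_text())
    loaded_b = json.loads((root / "run_b" / "config.json").read_text())

changed = diff_configs(loaded_a, loaded_b)
print(changed)  # {'max_iter': (200, 500)}
```

With both configs on disk, the question "what changed between these runs?" becomes a dictionary comparison instead of a guess.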

## Overview

This notebook explores **reproducibility packages** using our unified `ml_package`.

In this lab you will:

- explore the existing `ml_package` structure 
- use the package to create reproducible runs
- capture run inputs and metadata
- generate complete reproducibility packages

> **Note**: We now have a complete, production-ready `ml_package` that handles training, comparison, reporting, and reproducibility. This notebook will reference and use this package instead of creating duplicate code.

### What this cell does
Imports the `ml_package` modules and shows the package structure. The package is split into four focused modules:

- **`trainer.py`** — the 6-stage training pipeline (load → split → scale → train → evaluate → save)
- **`reproducibility.py`** — captures dependencies, validates the environment, creates run metadata
- **`comparison.py`** — loads multiple run folders, selects the best, summarizes across runs
- **`reporting.py`** — generates markdown reports and dashboards

**Why a package instead of one big script?** Each module has a single responsibility. You can test, reuse, and update each part independently. This is the same principle as pipeline stages in Week 6.

## Exercise 1: Explore the ML Package Structure

Let's examine our unified `ml_package` that handles all ML training needs:

```python
from pathlib import Path

# Look at our package structure
package_root = Path("ml_package")
print("Package structure:")
for path in sorted(package_root.rglob("*")):
    if path.is_file():
        print(f"  {path.relative_to(package_root)}")
```

This package provides:
- **trainer.py** - Core training pipeline with 6 modular stages
- **comparison.py** - Run comparison and analysis utilities  
- **reporting.py** - Report generation and dashboards
- **reproducibility.py** - Dependency capture and environment validation

Let's import and explore the key components:

```python
# Import key components
from ml_package import trainer, reproducibility, comparison, reporting

print("\n✅ Successfully imported all package modules!")
print("Available functions:")
print("- trainer: train(), create_sample_dataset(), TrainConfig")
print("- reproducibility: capture_dependencies(), validate_environment()")
print("- comparison: load_runs(), select_best_run(), summarize_runs()")
print("- reporting: generate_experiment_summary(), write_comparison_report()")
```

### What this cell does
Creates a sample dataset, configures a training run with explicit parameters, and runs `trainer.train()`.

**Key reproducibility habits shown here:**
- `random_state=42`: a fixed seed, so train/val splits are identical across runs
- `max_iter=500`: an explicit hyperparameter, not a hidden default
- All parameters go into `TrainConfig`, so they are automatically saved to `config.json`

**What to check:** After running, find the new timestamped folder under `reproducibility_artifacts/`. Open `config.json`; it should contain every parameter you set here.
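
The fixed-seed habit (`random_state=42`) can be illustrated without `ml_package`. The sketch below uses only the standard library; `split_indices` is a hypothetical helper and is not how `trainer.train()` splits data. It only shows why the same seed yields the same split.

```python
import random


def split_indices(n_rows: int, test_size: float, random_state: int):
    """Shuffle row indices with a private, seeded RNG and split them (hypothetical helper)."""
    rng = random.Random(random_state)  # private RNG: leaves the global random state untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)


train_a, test_a = split_indices(150, test_size=0.2, random_state=42)
train_b, test_b = split_indices(150, test_size=0.2, random_state=42)
train_c, test_c = split_indices(150, test_size=0.2, random_state=7)

print("same seed, same test rows:", test_a == test_b)
print("different seed, same test rows:", test_a == test_c)
```

Because the seed is an input like any other, it belongs in the saved config next to `test_size` and `max_iter`.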

```python
# Create sample data
sample_csv = "reproducibility_sample.csv"
trainer.create_sample_dataset(sample_csv, "iris")

# Configure training
config = trainer.TrainConfig(
    input_csv=sample_csv,
    label_col="label",
    test_size=0.2,
    random_state=42,
    max_iter=500,
)

# Run training
result = trainer.train(config, "reproducibility_artifacts")
print(f"Training completed! Accuracy: {result.metrics['accuracy']:.4f}")
```

### What this cell does
Captures the full dependency environment (`requirements.txt`), validates that key packages are importable, and saves run metadata (config plus environment info) to a JSON file.

**Why capture dependencies at run time?** Library versions change. If you re-run an experiment three months later with a newer `scikit-learn`, results may differ. Saving `requirements.txt` alongside your metrics lets you recreate the environment that produced a result.

**What `validate_environment()` checks:** It confirms that each required package can be imported and records its version. This catches silent failures where a package is listed in `requirements.txt` but is not installed correctly.
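
To make the environment check concrete, here is a minimal sketch built on the standard library's `importlib.metadata`. It is an assumption about one way such a check could work, not the implementation of `validate_environment()`. Note that it looks up installed distribution metadata, which is related to, but not the same as, actually importing the package.

```python
from importlib import metadata


def check_packages(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = check_packages(["pandas", "scikit-learn", "joblib", "numpy"])
missing = [name for name, version in report.items() if version is None]

print(report)
print("Missing:", missing or "none")
```

Recording `None` instead of raising lets the caller decide whether a missing package should stop the run or only be logged.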

```python
# Create a complete reproducibility package
import time
from pathlib import Path

# Build a timestamped run directory and create it before writing into it.
# Note: this name is generated here, so it only matches the folder written by
# trainer.train() if both use the same naming scheme and timestamp.
run_id = time.strftime("run_%Y%m%d_%H%M%S")
run_dir = Path("reproducibility_artifacts") / run_id
run_dir.mkdir(parents=True, exist_ok=True)

# Capture dependencies
reproducibility.capture_dependencies(run_dir / "requirements.txt")

# Validate environment
env_info = reproducibility.validate_environment(["pandas", "scikit-learn", "joblib", "numpy"])

# Create comprehensive metadata
metadata = reproducibility.create_run_metadata(
    config=config.__dict__,
    environment_info=env_info
)
reproducibility.save_run_metadata(metadata, run_dir / "run_metadata.json")

print(f"Reproducibility package created in: {run_dir}")
print("Files created:")
for file in sorted(run_dir.iterdir()):
    print(f"  - {file.name}")
```
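
For intuition about what dependency capture involves, the sketch below pins every installed distribution as `name==version` using `importlib.metadata`. It is a simplified stand-in, not the code behind `capture_dependencies()`, and unlike `pip freeze` it does not treat editable or VCS installs specially. The function name `freeze_lines` is hypothetical.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def freeze_lines():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that have no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


lines = freeze_lines()

with tempfile.TemporaryDirectory() as tmp:
    req_path = Path(tmp) / "requirements.txt"
    req_path.write_text("\n".join(lines) + "\n")
    print(f"Pinned {len(lines)} packages to {req_path.name}")
```

Pinning exact versions (`==`) rather than ranges is what allows someone else to rebuild a matching environment later.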

## Exercise 3: Validate Reproducibility

Let's check if our run can be reproduced:

```python
# Check reproducibility score
repro_score = reproducibility.check_reproducibility(run_dir)
print(f"Reproducibility score: {repro_score['overall_score']}/100")

# Create a standalone reproducibility package
package_dir = Path("reproducibility_package")
reproducibility.create_reproducibility_package(run_dir, package_dir)

print(f"\nStandalone package created: {package_dir}")
print("This package can be shared and reproduced by others!")
```
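
How `check_reproducibility()` computes its score is not shown in this lab. One plausible ingredient is a completeness check over the files the package is supposed to contain. The sketch below assumes exactly that; `score_run_dir` and its scoring rule are hypothetical.

```python
import tempfile
from pathlib import Path

# File names taken from this lab's definition of a reproducibility package
EXPECTED_FILES = ("config.json", "metrics.json", "requirements.txt")


def score_run_dir(run_dir: Path) -> dict:
    """Score a run folder by the share of expected files present (hypothetical rule)."""
    present = {name: (run_dir / name).is_file() for name in EXPECTED_FILES}
    overall = round(100 * sum(present.values()) / len(EXPECTED_FILES))
    return {"files": present, "overall_score": overall}


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "config.json").write_text("{}")
    (demo_dir / "requirements.txt").write_text("")
    demo_result = score_run_dir(demo_dir)  # metrics.json is deliberately missing

print(demo_result["overall_score"])  # 67
print(demo_result["files"])
```

A file-presence check says nothing about whether re-running reproduces the metrics; a stricter check would retrain from the saved config and compare results.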

## Exercise 4: Compare Multiple Runs

Let's run multiple experiments and compare them:

```python
# Run multiple experiments; the seed stays fixed while max_iter and test_size vary
base = dict(input_csv="reproducibility_sample.csv", label_col="label", random_state=42)
configs = [
    trainer.TrainConfig(**base, test_size=0.2, max_iter=200),
    trainer.TrainConfig(**base, test_size=0.2, max_iter=500),
    trainer.TrainConfig(**base, test_size=0.3, max_iter=500),
]

results = []
for i, cfg in enumerate(configs, start=1):
    result = trainer.train(cfg, "comparison_artifacts")
    results.append(result)
    print(f"Run {i}: accuracy={result.metrics['accuracy']:.4f}, "
          f"test_size={cfg.test_size}, max_iter={cfg.max_iter}")

# Load and compare runs
runs = comparison.load_runs("comparison_artifacts")
summary = comparison.summarize_runs(runs)
best = comparison.select_best_run(runs, "accuracy")

print(f"\nBest run: {best.run_id} with accuracy {best.metrics['accuracy']:.4f}")
print(f"Average accuracy across {len(runs)} runs: {summary['avg_accuracy']:.4f}")
```
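
The load-and-compare step can be pictured as reading one `metrics.json` per run folder and taking the maximum. The sketch below assumes that folder layout; `load_metrics` and `best_run` are hypothetical helpers rather than the `comparison` module's API, and the accuracy values are placeholders, not results.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def load_metrics(artifacts_dir: Path) -> dict:
    """Map run folder name -> parsed metrics.json for each run folder that has one."""
    return {
        path.parent.name: json.loads(path.read_text())
        for path in sorted(artifacts_dir.glob("*/metrics.json"))
    }


def best_run(runs: dict, metric: str) -> str:
    """Return the run id with the highest value for the given metric."""
    return max(runs, key=lambda run_id: runs[run_id][metric])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Placeholder accuracies, used only to exercise the helpers
    for run_id, acc in [("run_a", 0.90), ("run_b", 0.95), ("run_c", 0.93)]:
        (root / run_id).mkdir()
        (root / run_id / "metrics.json").write_text(json.dumps({"accuracy": acc}))
    demo_runs = load_metrics(root)

demo_best = best_run(demo_runs, "accuracy")
demo_avg = mean(m["accuracy"] for m in demo_runs.values())
print(f"Best run: {demo_best}")
print(f"Average accuracy across {len(demo_runs)} runs: {demo_avg:.4f}")
```

Keeping one folder per run means comparison code never needs to know how a run was produced, only where its artifacts live.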

## Exercise 5: Generate Reports

Let's create comprehensive reports for our experiments:

```python
# Generate comparison report
report_dir = Path("experiment_reports")
reports = reporting.generate_experiment_summary(runs, report_dir)

print("Generated reports:")
for name, path in reports.items():
    print(f"  {name}: {path}")

# Create a dashboard summary
dashboard_path = report_dir / "dashboard.md"
reporting.write_quick_summary(dashboard_path, runs)
print(f"  dashboard: {dashboard_path}")

# Show a snippet of the comparison report
comparison_content = (report_dir / "comparison_report.md").read_text()
print("\nComparison report preview:")
print("=" * 50)
print(comparison_content[:500] + "...")
```
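
A markdown comparison report is, at its core, a table rendered from per-run metrics. The sketch below shows one minimal way to build such a table. It is not the output format of `reporting.generate_experiment_summary()`; the function name `runs_to_markdown` is hypothetical and the metric values are placeholders.

```python
def runs_to_markdown(runs: dict, metric: str = "accuracy") -> str:
    """Render runs as a markdown table, best run first (hypothetical helper)."""
    lines = [f"| run | {metric} |", "| --- | --- |"]
    ranked = sorted(runs.items(), key=lambda item: item[1][metric], reverse=True)
    for run_id, metrics in ranked:
        lines.append(f"| {run_id} | {metrics[metric]:.4f} |")
    return "\n".join(lines)


# Placeholder metrics, not results from a real experiment
sample_runs = {
    "run_a": {"accuracy": 0.90},
    "run_b": {"accuracy": 0.95},
    "run_c": {"accuracy": 0.93},
}
table = runs_to_markdown(sample_runs)
print(table)
```

Sorting before rendering puts the best run on the first row, which is usually what a reader of the report looks for.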

## Self-check

✅ **Reproducibility Achieved:**
- Complete package structure with modular components
- Deterministic seeds stored with artifacts  
- Dependency capture and environment validation
- Run comparison and reporting capabilities

✅ **Package Features:**
- **trainer.py**: 6-stage modular training pipeline
- **comparison.py**: Load, compare, and analyze multiple runs
- **reporting.py**: Generate markdown reports and dashboards
- **reproducibility.py**: Capture dependencies and validate environments

✅ **CLI Tools Available:**
- `python train.py` - Run training with reproducibility
- `python compare_runs.py` - Compare and analyze experiments

The package removes duplicated code between notebooks and keeps training, comparison, reporting, and reproducibility in one place.
