# Week 6 — Part 04: End-to-end capstone runner (one command)

**Estimated time:** 60–90 minutes

## What success looks like (end of Part 04)

- You can describe a stable CLI interface for the capstone runner.
- Running the runner produces `output/report.json` and `output/report.md` deterministically.
- When something fails, you still have intermediate artifacts in `output/` to debug.

### Checkpoint

After reading/running the skeleton, you should be able to point to:

- the CLI flags (`--input`, `--output_dir`, `--model`)
- the output contract (`report.json`, `report.md`)

## Learning Objectives

- Design a stable CLI interface for the capstone
- Define a clear output contract (report + intermediate artifacts)
- Build a runner skeleton with argparse
- Capture failure evidence for debugging

### What this part covers
This notebook defines the **end-to-end capstone runner** — a single command that orchestrates all pipeline stages from CSV input to final report.

**The goal:** `python run_capstone.py --input data.csv --output_dir output --model llama3.1`

One command. Predictable outputs. Debuggable failures.

**Why a runner matters:** Without a runner, you have to manually execute each notebook in order, passing outputs between them. A runner makes the pipeline reproducible, testable, and demo-ready — anyone can clone your repo and run it without asking you questions.

## Overview

Your capstone should run with **one command**. That means:

- clear CLI flags
- predictable outputs
- stable artifact locations

---

## Underlying theory: the runner is your system’s public interface

Week 1 framed reproducibility as an interface. The runner is the concrete version of that idea: a function $r$ that maps an input and a configuration to a set of outputs.

$$
\text{outputs} = r(\text{input},\ \text{config})
$$

Practical implication:

- if the runner is stable, testing and demos become easy
- if the runner requires manual steps, failures become non-reproducible
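
One way to make the equation above testable is to fingerprint the output directory and compare two runs that received the same input and config. The following is a minimal sketch, not part of the capstone skeleton: `fingerprint` and `toy_runner` are hypothetical names, and `toy_runner` is a stand-in whose outputs depend only on its arguments.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict


def fingerprint(output_dir: Path) -> Dict[str, str]:
    """Map each file name in output_dir to the SHA-256 of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(output_dir.iterdir())
        if p.is_file()
    }


def toy_runner(input_path: Path, output_dir: Path, model: str) -> None:
    """Hypothetical stand-in for the real runner: no clock, no randomness."""
    output_dir.mkdir(parents=True, exist_ok=True)
    report = {"model": model, "n_bytes": len(input_path.read_bytes())}
    (output_dir / "report.json").write_text(
        json.dumps(report, indent=2, sort_keys=True), encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    csv_path = root / "data.csv"
    csv_path.write_text("a,b\n1,2\n", encoding="utf-8")

    # Same input and same config, written to two separate output directories.
    toy_runner(csv_path, root / "run1", "llama3.1")
    toy_runner(csv_path, root / "run2", "llama3.1")

    same_outputs = fingerprint(root / "run1") == fingerprint(root / "run2")
    print("reproducible:", same_outputs)
```

With a real LLM stage, byte-identical reports are not guaranteed, so a check like this may be more useful on the artifacts produced before the model call.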

### What this cell does
Defines `run_capstone()` — the main pipeline function — and `build_parser()` — the CLI argument parser.

**Walk through `run_capstone()`:**
1. Create `output_dir` (with `mkdir(parents=True, exist_ok=True)` — safe to call even if it exists)
2. TODO: implement the 5 pipeline stages (load → profile → compress → llm → report)
3. Write `report.json` and `report.md` — the two required output artifacts

**Walk through `build_parser()`:**
- `--input` (required) — the CSV file to analyze
- `--output_dir` (default: `"output"`) — where to write all artifacts
- `--model` (required) — which LLM model to use

**Your task:** Replace the `TODO` comment in `run_capstone()` with real stage implementations from Parts 01–03. Each stage should save an intermediate artifact before calling the next stage — so if the LLM call fails, you still have `profile.json` and `compressed_input.json` for debugging.
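
The "save before the next stage" pattern from the task above can be sketched as follows. This is an illustration under stated assumptions, not the reference solution: the stage bodies are placeholders, and `save_json`, `run_stages`, `failing_llm`, and the `error.json` file name are hypothetical.

```python
import json
import tempfile
import traceback
from pathlib import Path
from typing import Any, Callable, Dict


def save_json(path: Path, payload: Dict[str, Any]) -> None:
    """Write one artifact as pretty-printed JSON."""
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def run_stages(output_dir: Path, call_llm: Callable[[Dict[str, Any]], Dict[str, Any]]) -> bool:
    """Run profile -> compress -> llm, saving each artifact before the next stage."""
    output_dir.mkdir(parents=True, exist_ok=True)
    try:
        profile = {"n_rows": 2, "columns": ["a", "b"]}  # placeholder profiling result
        save_json(output_dir / "profile.json", profile)

        compressed = {"columns": profile["columns"]}  # placeholder compression result
        save_json(output_dir / "compressed_input.json", compressed)

        llm_output = call_llm(compressed)
        save_json(output_dir / "report.json", llm_output)
        return True
    except Exception as exc:
        # Failure evidence: record what went wrong next to the artifacts already written.
        save_json(
            output_dir / "error.json",  # hypothetical file name
            {
                "error_type": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            },
        )
        return False


def failing_llm(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Simulate a model call that never returns a result."""
    raise TimeoutError("model did not respond")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output"
    ok = run_stages(out, failing_llm)
    written = sorted(p.name for p in out.iterdir())
    print(ok, written)
```

Because each artifact is written as soon as its stage finishes, the simulated LLM failure still leaves `profile.json` and `compressed_input.json` on disk, plus a record of the exception.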

```python
import argparse
import json
from pathlib import Path
from typing import Any, Dict


def run_capstone(input_path: Path, output_dir: Path, model: str) -> Dict[str, Any]:
    """Run the pipeline and write report.json and report.md into output_dir."""
    # Safe to call even if the directory already exists.
    output_dir.mkdir(parents=True, exist_ok=True)

    # TODO: implement pipeline stages (load -> profile -> compress -> llm -> report)
    report: Dict[str, Any] = {
        "model": model,
        "input": str(input_path),
        "summary": "placeholder",
    }

    # The two required output artifacts.
    (output_dir / "report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    (output_dir / "report.md").write_text("# Report\n\nPlaceholder report", encoding="utf-8")
    return report


def build_parser() -> argparse.ArgumentParser:
    """Define the runner's CLI flags."""
    parser = argparse.ArgumentParser(description="Run the capstone pipeline end to end.")
    parser.add_argument("--input", required=True, help="CSV file to analyze")
    parser.add_argument("--output_dir", default="output", help="directory for all artifacts")
    parser.add_argument("--model", required=True, help="LLM model to use")
    return parser


# Example CLI usage:
# python run_capstone.py --input data.csv --output_dir output --model llama3.1
```
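
The skeleton defines the parser and the pipeline function but not the entry point that connects them. One possible wiring is sketched below. `main` and its exit codes are assumptions rather than part of the skeleton, and the pipeline call is replaced by a `print` so the block runs on its own.

```python
import argparse
import sys
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Same flags as the skeleton above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output_dir", default="output")
    parser.add_argument("--model", required=True)
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse flags, check the input file, and return a process exit code."""
    args = build_parser().parse_args(argv)
    input_path = Path(args.input)
    if not input_path.is_file():
        print(f"input not found: {input_path}", file=sys.stderr)
        return 2  # assumed convention: non-zero exit code for a bad input path
    # In run_capstone.py, the real pipeline call would go here:
    # run_capstone(input_path, Path(args.output_dir), args.model)
    print(f"would run: input={input_path} output_dir={args.output_dir} model={args.model}")
    return 0


# Passing argv explicitly keeps the example runnable inside a notebook,
# where sys.argv belongs to the kernel rather than to your script.
exit_code = main(["--input", "no_such_file.csv", "--model", "llama3.1"])
print("exit code:", exit_code)
```

In a standalone `run_capstone.py`, the last two lines would typically become `if __name__ == "__main__": raise SystemExit(main())`, so the shell sees a non-zero status when the run fails.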

### What this cell does
Defines `validate_outputs()` — a post-run check that verifies the required output files actually exist.

**Why validate outputs explicitly?** A pipeline can "succeed" (no exceptions raised) but still produce incomplete outputs if a stage silently skips writing a file. This validator catches that case and raises a `FileNotFoundError` that lists the missing paths, for example `missing outputs: [PosixPath('output/report.json')]` on Linux or macOS.

**Your task:** Extend `validate_outputs()` with schema checks — not just "file exists" but "file contains valid JSON with the expected keys." For example, `report.json` should have at least `model` and `summary` fields.
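
A schema check along the lines of the task above might look like the sketch below. It is one possible approach, not the expected solution: `check_report_json` and `REQUIRED_REPORT_KEYS` are hypothetical names, and the required keys are only the two mentioned in the task.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Keys named in the task text; extend this as your report grows.
REQUIRED_REPORT_KEYS = ("model", "summary")


def check_report_json(path: Path) -> List[str]:
    """Return a list of problems found in report.json (an empty list means OK)."""
    if not path.exists():
        return [f"missing file: {path.name}"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path.name} is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return [f"{path.name} should contain a JSON object"]
    return [
        f"{path.name} is missing key: {key}"
        for key in REQUIRED_REPORT_KEYS
        if key not in data
    ]


with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "report.json"
    # A report that exists and parses, but lacks the "summary" field.
    report_path.write_text(json.dumps({"model": "llama3.1"}), encoding="utf-8")
    problems = check_report_json(report_path)
    print(problems)
```

Returning a list of problems instead of raising on the first one lets a single run report every issue at once. `validate_outputs()` could raise if the list is non-empty.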

## Suggested CLI



```bash
python run_capstone.py --input data.csv --output_dir output --model <MODEL_NAME>
```

Replace `<MODEL_NAME>` with the model you want to use, for example `llama3.1`.




## Output contract

The command should write:

- `output/report.json`
- `output/report.md`

Optionally:

- `output/profile.json`
- `output/compressed_input.json`

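Failure-mode design tip: write each intermediate artifact before starting the next stage, so that a failed LLM call still leaves `profile.json` and `compressed_input.json` in `output/` for debugging.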



```python
from pathlib import Path


def validate_outputs(output_dir: Path) -> None:
    """Raise FileNotFoundError if a required output file is missing."""
    required = [output_dir / "report.json", output_dir / "report.md"]
    missing = [p for p in required if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing outputs: {missing}")


print("Implement validate_outputs() with extra checks if needed.")
```

## Self-check

- Can you run the command from a fresh folder after following the README steps?
- If the model call fails, do you still get intermediate outputs?

## References

- Python `argparse`: https://docs.python.org/3/library/argparse.html