# Week 3 — Part 03: Structured outputs (JSON) — parse + validate + retry/repair

**Estimated time:** 90–120 minutes

---

## Pre-study (Level 0)

Level 1 assumes Level 0 is complete. If you need a refresher on structured outputs, schemas, and validation mindset:

- [Level 1 Pre-study index](../PRESTUDY.md)
- [Level 0 — Structured outputs and schemas](../../level_0/Chapters/3/01_function_calling_structured_outputs.md)

---

## What success looks like (end of Part 03)

- You can take raw model text and deterministically produce either:
  - a validated dict that matches your schema, or
  - a clear error with the raw output saved for debugging.
- You can run a capped retry/repair loop.

### Checkpoint

- You can demonstrate at least one failure case (bad JSON or wrong schema) and have saved its raw output under `output/`.

## Learning Objectives

- Parse model text into JSON safely
- Validate schema and separate parse vs schema failures
- Implement a capped retry/repair loop
- Save raw outputs for debugging

## Overview

Models can return valid JSON, but they also return almost-JSON: an object wrapped in extra prose, single-quoted strings, or trailing commas. `json.loads` rejects all three.

This lab builds a deterministic wrapper:

1. ask for strict JSON
2. parse it
3. validate schema
4. retry/repair on failure (capped)

Key habit: save raw output when parsing/validation fails so debugging is inspection, not guesswork.

If you need more background on schemas/validation, use the Level 0 links at the top of the notebook.
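
The save-raw-output habit described above can be sketched as a small helper. This is a minimal sketch, not part of the lab's reference code: the function name `save_raw_output`, the `stage` label, and the timestamped filename pattern are all assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def save_raw_output(raw: str, stage: str, out_dir: Path = Path("output")) -> Path:
    """Write a failed model output to disk and return the file path.

    `stage` is a hypothetical label such as "parse" or "schema" that records
    which step rejected the output.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp with microseconds keeps repeated failures from
    # overwriting each other.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = out_dir / f"llm_raw_{stage}_{stamp}.txt"
    path.write_text(raw, encoding="utf-8")
    return path
```

Calling `save_raw_output(raw, "parse")` inside an `except` block leaves one file per failure, so you can open the exact text the model returned instead of reconstructing it from an error message.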

In [None]:
import json
from pathlib import Path


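def parse_json_strict(text: str) -> dict:
    """Parse text as JSON and require the top-level value to be an object."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on bad JSON.
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data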


def validate_shape(data: dict) -> None:
    """Raise ValueError unless data has exactly the keys person and company, each a string or null."""
    allowed = {"person", "company"}
    extra = set(data.keys()) - allowed
    missing = allowed - set(data.keys())
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if extra:
        raise ValueError(f"extra keys: {sorted(extra)}")

    for k in ["person", "company"]:
        v = data[k]
        if v is not None and not isinstance(v, str):
            raise ValueError(f"{k} must be string or null")


# validate_shape returns None on success, so print the parsed data instead.
data = parse_json_strict('{"person": "Ada", "company": null}')
validate_shape(data)
print(data)

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

In [None]:
def parse_and_validate(text: str) -> dict:
    data = parse_json_strict(text)
    validate_shape(data)
    return data


bad_outputs = [
    "Here is the JSON: {\"person\": \"Ada\", \"company\": null}",
    "{'person': 'Ada', 'company': null}",
    '{"person": "Ada"}',
]

for raw in bad_outputs:
    try:
        parse_and_validate(raw)
        print("OK", raw)
    except Exception as e:
        print("FAIL", type(e).__name__, "->", str(e))
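
The first failure above (valid JSON wrapped in prose) can sometimes be repaired locally, without another model call. The sketch below is one possible approach, assuming you are willing to accept the first parseable object in the text; the helper name `extract_first_json_object` is hypothetical. It uses `json.JSONDecoder.raw_decode`, which parses a JSON value starting at a given index and ignores trailing text.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in `text`.

    Raises ValueError if no "{" in the text starts a valid JSON object.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and returns (value, end_index).
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # this "{" did not start valid JSON; try the next one
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")


wrapped = 'Here is the JSON: {"person": "Ada", "company": null}'
print(extract_first_json_object(wrapped))  # {'person': 'Ada', 'company': None}
```

This only handles the extra-prose case. Single-quoted output such as `{'person': 'Ada'}` still fails, so the model-side retry remains necessary, and the extracted object still needs schema validation.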

In [None]:
def call_llm_stub(prompt: str) -> str:
    # Simulate a model that sometimes returns almost-JSON.
    if "REPAIR" in prompt:
        return '{"person": "Ada Lovelace", "company": null}'
    return "Here is the JSON: {\"person\": \"Ada Lovelace\", \"company\": null}"

In [None]:
def extract_with_repair(text: str, call_llm, *, max_retries: int = 2) -> dict:
    base_prompt = (
        "Return ONLY JSON with keys person, company (null when unknown).\n"
        f"INPUT:\n{text}\n"
    )

    prompt = base_prompt
    last_err = None
    for attempt in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            return parse_and_validate(raw)
        except Exception as e:
            last_err = str(e)
            prompt = (
                "REPAIR: Your previous output was invalid.\n"
                "Return ONLY JSON with keys person, company.\n"
                f"Invalid output:\n{raw}\n\n"
                f"Error:\n{last_err}\n"
            )

    raise ValueError(f"Failed after retries. Last error: {last_err}")


print(extract_with_repair("Ada Lovelace", call_llm_stub, max_retries=2))
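
To check that the cap actually holds, it helps to run the loop against a model that never recovers. The following is a minimal, self-contained sketch: `make_counting_stub` and `run_capped` are hypothetical names, and `run_capped` is a condensed stand-in for the loop above (parse only, no repair prompt), not a replacement for it.

```python
import json


def make_counting_stub(reply: str):
    """Build a fake model call that always returns `reply` and counts its calls."""
    calls = {"n": 0}

    def stub(prompt: str) -> str:
        calls["n"] += 1
        return reply

    return stub, calls


def run_capped(call_llm, prompt: str, *, max_retries: int = 2) -> dict:
    """One initial attempt plus at most `max_retries` retries."""
    last_err = None
    for _attempt in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            return data
        except ValueError as e:  # json.JSONDecodeError subclasses ValueError
            last_err = str(e)
    raise ValueError(f"Failed after {max_retries + 1} attempts. Last error: {last_err}")


stub, calls = make_counting_stub("not json at all")
try:
    run_capped(stub, "Return ONLY JSON", max_retries=2)
except ValueError as e:
    print("gave up:", e)
print("model calls made:", calls["n"])  # 3: one initial attempt + two retries
```

Counting calls in a stub is a cheap way to confirm that a failing model costs a bounded number of requests.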

In [None]:
def extract_with_repair_todo(text: str, call_llm, *, max_retries: int = 2) -> dict:
    # TODO:
    # - Add raw-output persistence to output/llm_raw.txt on failure.
    # - Separate parse failures from schema failures in the error message.
    # - Keep retries capped (max_retries).
    return extract_with_repair(text, call_llm, max_retries=max_retries)


print("Implement extract_with_repair_todo().")
print(extract_with_repair_todo("Ada Lovelace", call_llm_stub, max_retries=2))

## Common pitfalls

- Asking for JSON but not banning extra text
- Not separating parse failure vs schema failure
- No retry cap
- Mixing business logic with parsing/validation
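
One way to avoid the second pitfall is to give each failure stage its own exception type, so callers can branch on the type instead of matching message strings. This is a sketch under assumptions: the class names `ParseError` and `SchemaError` and the function `parse_then_validate` are hypothetical, and the schema is the same two-key shape used in this lab.

```python
import json


class ParseError(ValueError):
    """The text is not a JSON object at all."""


class SchemaError(ValueError):
    """The text is a JSON object, but its keys or value types are wrong."""


def parse_then_validate(text: str) -> dict:
    # Stage 1: parsing. Any failure here means the text was not usable JSON.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        raise ParseError(f"invalid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ParseError("expected a JSON object")

    # Stage 2: schema. The JSON parsed, but it may not match the expected shape.
    expected = {"person", "company"}
    if set(data) != expected:
        raise SchemaError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    for key in sorted(expected):
        if data[key] is not None and not isinstance(data[key], str):
            raise SchemaError(f"{key} must be string or null")
    return data


for raw in ["{'person': 'Ada', 'company': null}", '{"person": "Ada"}']:
    try:
        parse_then_validate(raw)
    except ParseError as e:
        print("PARSE", e)
    except SchemaError as e:
        print("SCHEMA", e)
```

Because both classes subclass `ValueError`, existing `except ValueError` handlers keep working while new code can tell the two stages apart.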

## References

- Python `json`: https://docs.python.org/3/library/json.html
- Pydantic (optional): https://docs.pydantic.dev/latest/
- JSON Schema: https://json-schema.org/
- Tenacity: https://tenacity.readthedocs.io/

## Appendix: Solutions (peek only after trying)

Reference implementation for `extract_with_repair_todo` that persists raw outputs and clarifies error stages.

In [None]:
def extract_with_repair_todo(text: str, call_llm, *, max_retries: int = 2) -> dict:
    base_prompt = (
        "Return ONLY JSON with keys person, company (null when unknown).\n"
        f"INPUT:\n{text}\n"
    )

    prompt = base_prompt
    last_err = None
    last_raw = None

    for attempt in range(max_retries + 1):
        raw = call_llm(prompt)
        last_raw = raw
        try:
            data = parse_json_strict(raw)
        except Exception as e:
            last_err = f"PARSE_ERROR: {e}"
            prompt = (
                "REPAIR: Your previous output was invalid JSON.\n"
                "Return ONLY JSON with keys person, company.\n"
                f"Invalid output:\n{raw}\n\n"
                f"Error:\n{last_err}\n"
            )
            continue

        try:
            validate_shape(data)
            return data
        except Exception as e:
            last_err = f"SCHEMA_ERROR: {e}"
            prompt = (
                "REPAIR: Your previous output failed schema validation.\n"
                "Return ONLY JSON with keys person, company.\n"
                f"Invalid output:\n{raw}\n\n"
                f"Error:\n{last_err}\n"
            )

    raw_path = OUTPUT_DIR / "llm_raw.txt"
    raw_path.write_text(last_raw or "", encoding="utf-8")
    raise ValueError(f"Failed after retries. {last_err}. Raw saved to {raw_path}")


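# With max_retries=0 the stub's first (prose-wrapped) reply is the only attempt,
# so the failure path runs and the raw output is written to output/llm_raw.txt.
# With max_retries=1 the stub's REPAIR reply would succeed and nothing would fail.
try:
    extract_with_repair_todo("Ada Lovelace", call_llm_stub, max_retries=0)
except Exception as e:
    print("expected failure path exercised:", str(e))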