AI Engineering for Real Apps — A YouTube series building production AI patterns from scratch.
Each episode adds one focused module to this repo. By episode 4 you have a working multi-component AI system, not four disconnected demos.
Branch: ep1-langgraph-pipeline
A `PipelineGraph` with four specialized agents wired together via LangGraph's `StateGraph`:

    PlannerAgent → ResearchAgent → WriterAgent → QAAgent
                       ↑                            │
                       └──── retry if score < 0.7 ──┘

| Agent | Responsibility |
|---|---|
| `PlannerAgent` | Turns a user question into a structured `ResearchPlan` (Pydantic) |
| `ResearchAgent` | Executes search queries against a document store |
| `WriterAgent` | Synthesizes research into a cited `Report` (Pydantic) |
| `QAAgent` | Scores the report; triggers a retry loop if confidence is too low |

- TypedDict state: all agents share one typed `PipelineState`; nothing is hidden in local variables
- Conditional edges: the retry loop is a graph edge, not an `if/else` inside an agent
- Structured output: every LLM call uses Claude's `tool_use` to return a typed Pydantic model, not a string
- Eval harness: shape, grounding, and threshold tests that work without mocking LLM content
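
The conditional-edge idea can be sketched without any framework: the retry decision is a small function that reads only the shared state, so no agent needs to know the loop exists. This is a minimal sketch, not the repo's code. The state fields, `route_after_qa`, and the `MAX_ATTEMPTS` cap are hypothetical; only `PipelineState` and the 0.7 threshold come from the text above.

```python
from typing import TypedDict

RETRY_THRESHOLD = 0.7  # from the diagram above: retry if score < 0.7
MAX_ATTEMPTS = 3       # assumption: the source does not state a retry cap


class PipelineState(TypedDict):
    """Shared state; every field name here is illustrative."""
    question: str
    report: str
    score: float
    attempts: int


def route_after_qa(state: PipelineState) -> str:
    """Pick the next step from shared state alone, outside any agent."""
    if state["score"] < RETRY_THRESHOLD and state["attempts"] < MAX_ATTEMPTS:
        return "retry"
    return "finish"


# Three example states: low score, high score, low score with retries used up.
low = PipelineState(question="q", report="draft", score=0.55, attempts=1)
high = PipelineState(question="q", report="final", score=0.91, attempts=2)
exhausted = PipelineState(question="q", report="draft", score=0.40, attempts=3)
decisions = [route_after_qa(s) for s in (low, high, exhausted)]
```

In LangGraph, a router like this is typically registered with `add_conditional_edges` on the `StateGraph`, with the returned labels mapped to node names.
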

Setup:

    cd ai-engineering-lab
    python -m venv .venv
    .venv\Scripts\activate          # Windows
    # source .venv/bin/activate     # macOS/Linux
    pip install -r requirements.txt
    cp .env.example .env
    # edit .env and add your ANTHROPIC_API_KEY

Run:

    python -m ep1_langgraph.run "Which product categories had the highest return rates last quarter?"

Tests:

    # Offline: mocks all LLM calls
    pytest ep1_langgraph/eval/ -v
    # Integration: calls the real Claude API (requires ANTHROPIC_API_KEY in .env)
    pytest ep1_langgraph/eval/ -v -m integration

A production-style REST service that accepts raw text and returns a validated, structured enrichment object, with validation enforced at every layer.

    Client → EnrichmentRequest (Pydantic, extra=forbid)
           → Claude tool_use (forces structured JSON)
           → EnrichmentResponse (Pydantic, field + model validators)
           → Client

| Component | Responsibility |
|---|---|
| `EnrichmentRequest` | Validates the inbound payload; rejects unexpected fields (422) |
| `ClaudeClient` | Calls Claude with `tool_choice: {type: tool}` to guarantee typed output |
| `EnrichmentResponse` | Validates the genres list, strips unknown LLM fields, appends a warning when confidence < 0.5 |
| `routes.py` | Maps upstream errors to 502 (bad LLM output) vs 503 (API unreachable) |
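
The 502 versus 503 split in the last row can be illustrated with a small mapping from exception type to status code. This is a framework-free sketch under stated assumptions. The exception classes and `status_for` are hypothetical names, and the 500 fallback is not mentioned in the source.

```python
class LLMOutputError(Exception):
    """Claude answered, but the payload failed validation."""


class UpstreamUnavailableError(Exception):
    """The Claude API could not be reached."""


def status_for(exc: Exception) -> int:
    """Translate an upstream failure into the HTTP status the client sees."""
    if isinstance(exc, LLMOutputError):
        return 502  # bad LLM output
    if isinstance(exc, UpstreamUnavailableError):
        return 503  # API unreachable
    return 500  # assumption: anything unclassified is an internal error


codes = [
    status_for(LLMOutputError("missing genres")),
    status_for(UpstreamUnavailableError("connection timed out")),
    status_for(RuntimeError("unexpected bug")),
]
```

Keeping the mapping in one place means a route handler only has to catch, translate, and respond.
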

- Strict input validation: `extra="forbid"` rejects any field the caller shouldn't send
- Output envelope hardening: `extra="ignore"` silently drops unexpected LLM fields instead of crashing
- Multi-gate validation: Pydantic fires three times before a response leaves the service
- Dependency injection: `Depends(get_claude_client)` makes the LLM client swappable in tests
- Error taxonomy: 422 (client mistake), 502 (LLM output invalid), 503 (upstream unreachable)
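
The asymmetry between strict input and lenient output can be shown with plain dictionaries. The sketch below imitates what `extra="forbid"` and `extra="ignore"` do, without using Pydantic, so it is an analogy and not the service's implementation. The function names and the `warnings` field are hypothetical. The other field names are taken from the example request and response.

```python
ALLOWED_REQUEST_FIELDS = {"title", "context"}
RESPONSE_FIELDS = {"title", "summary", "genres", "confidence"}
LOW_CONFIDENCE = 0.5  # threshold from the component table


def validate_request(payload: dict) -> dict:
    """Strict gate: unknown caller fields are an error (like extra='forbid')."""
    extra = set(payload) - ALLOWED_REQUEST_FIELDS
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return dict(payload)


def harden_response(raw: dict) -> dict:
    """Lenient gate: unknown LLM fields are dropped (like extra='ignore')."""
    cleaned = {k: v for k, v in raw.items() if k in RESPONSE_FIELDS}
    warnings = []
    if cleaned.get("confidence", 0.0) < LOW_CONFIDENCE:
        warnings.append("low confidence")
    cleaned["warnings"] = warnings  # hypothetical field name
    return cleaned


try:
    validate_request({"title": "Inception", "context": "A thief.", "admin": True})
    rejected = False
except ValueError:
    rejected = True

hardened = harden_response(
    {"title": "Inception", "summary": "A heist in dreams.", "genres": ["sci-fi"],
     "confidence": 0.42, "mood": "tense"}
)
```

The caller is held to the contract, while the model's output is cleaned up instead of rejected over harmless extras.
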

Run:

    uvicorn ep2_structured_api.main:app --reload
    # POST http://localhost:8000/api/enrich
    # GET  http://localhost:8000/health

Example request:

    {
      "title": "Inception",
      "context": "A thief who steals corporate secrets through dream-sharing technology."
    }

Example response:

    {
      "title": "Inception",
      "summary": "A skilled thief uses experimental dream-sharing technology to steal corporate secrets.",
      "genres": ["sci-fi", "thriller"],
      "confidence": 0.92
    }

Tests:

    # Offline: mocks all Claude calls
    pytest ep2_structured_api/tests/ -v
    # Integration: calls the real Claude API (requires ANTHROPIC_API_KEY in .env)
    pytest ep2_structured_api/tests/ -v -m integration

An agent that navigates a real web UI and completes multi-step goals using Claude for decision-making and Playwright for execution, with no hardcoded selectors.

    UserGoal → [observe page state] → [Claude decides next action] → [Playwright executes]
                        ↑                                                         │
                        └────────────────── loop until done ──────────────────────┘

| Module | Responsibility |
|---|---|
| `browser.py` | `get_page_state()`: extracts visible text + element descriptors (~1–3 KB, not full HTML) |
| `llm.py` | `decide()`: Claude `tool_use` forced, returns a structured `Action` |
| `loop.py` | `act()` with semantic locator fallbacks; `run_loop()` (injectable); `run_agent()` (browser lifecycle) |
| `actions.py` | `Action` Pydantic model with action-dependent validation |
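
"Action-dependent validation" means the required fields depend on which action was chosen. The sketch below shows the idea with a standard-library dataclass in place of the Pydantic model the repo uses. The field names `target`, `text`, and `reason`, and the specific rules, are assumptions for illustration. Only the action vocabulary comes from the source.

```python
from dataclasses import dataclass
from typing import Literal, get_args

ActionType = Literal["click", "type", "scroll", "scroll_up", "done"]


@dataclass(frozen=True)
class Action:
    action: str
    target: str = ""   # hypothetical: label of the element to act on
    text: str = ""     # hypothetical: text to enter for "type"
    reason: str = ""   # hypothetical: the model's stated rationale

    def __post_init__(self) -> None:
        # Gate 1: the action must be in the closed vocabulary.
        if self.action not in get_args(ActionType):
            raise ValueError(f"unknown action: {self.action!r}")
        # Gate 2: required fields depend on the action (assumed rules).
        if self.action in ("click", "type") and not self.target:
            raise ValueError(f"{self.action!r} needs a target")
        if self.action == "type" and not self.text:
            raise ValueError("'type' needs text to enter")


def is_valid(**fields: str) -> bool:
    """Return True if the fields form a valid Action."""
    try:
        Action(**fields)
        return True
    except ValueError:
        return False
```

For example, `is_valid(action="done")` passes with no target, while `is_valid(action="type", target="Search")` fails because the text is missing.
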

- Semantic locators: `get_by_role`, `get_by_label`, `get_by_placeholder` with `or_()` fallback chains; survive UI refactors
- Constrained vocabulary: `Literal["click", "type", "scroll", "scroll_up", "done"]` means Claude can only return actions Playwright can execute
- Cheap observation: visible text + element labels, not full DOM; ~50× cheaper per loop iteration
- Testable loop: `run_loop(page)` is separated from browser lifecycle so tests inject a mock page; Playwright is a lazy import in `run_agent`
- Three test signals: outcome (did it finish?), trace structure (valid types + non-empty reasons), cap-never-hit (earliest stuck detector)
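
The testable-loop and test-signal points can be sketched together: if the loop takes the page and the decision function as arguments, a test can drive it with a fake page and a scripted decider, then inspect the trace. This is a minimal sketch with hypothetical names and signatures (`run_loop_sketch`, `FakePage`, `MAX_STEPS`). The repo's `run_loop(page)` may differ.

```python
from typing import Callable

MAX_STEPS = 10  # assumption: the source mentions a cap but not its value


class FakePage:
    """Stand-in for a Playwright page; records executed actions."""

    def __init__(self) -> None:
        self.executed: list[dict] = []

    def describe(self) -> str:
        return f"{len(self.executed)} actions executed"


def act(page: FakePage, action: dict) -> None:
    """Pretend to execute the action in the browser."""
    page.executed.append(action)


def run_loop_sketch(
    page: FakePage,
    decide: Callable[[str, str], dict],
    goal: str,
    max_steps: int = MAX_STEPS,
) -> list[dict]:
    """Observe, decide, act; stop on 'done' or when the cap is reached."""
    trace: list[dict] = []
    for _ in range(max_steps):
        action = decide(goal, page.describe())
        trace.append(action)
        if action["action"] == "done":
            break
        act(page, action)
    return trace


# A scripted decider replaces the LLM in an offline test.
scripted = iter([
    {"action": "click", "target": "More information", "reason": "matches the goal"},
    {"action": "done", "reason": "Clicked the link successfully"},
])
page = FakePage()
trace = run_loop_sketch(page, lambda goal, state: next(scripted),
                        "Click the More information link")
```

The three signals map to simple checks on `trace`: the last action is `done`, every step has a non-empty reason, and the trace is shorter than the cap.
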

Setup:

    pip install -r requirements.txt
    python -m playwright install chromium   # one-time: download browser binary

Run:

    python -m ep3_playwright_agent.run https://example.com "Click the More information link"
    python -m ep3_playwright_agent.run https://example.com "Click the More information link" --verbose

Verbose output shows the full action trace:

    [1] click → More information
        reason: The goal is to click the More information link
    [2] done →
        reason: Clicked the link successfully
    Result: Clicked the link successfully

Tests:

    # Offline: no API key, no browser binary required
    pytest ep3_playwright_agent/tests/ -v
    # Integration: real Claude + real Playwright (requires ANTHROPIC_API_KEY in .env)
    pytest ep3_playwright_agent/tests/ -v -m integration

A chat API that converts natural language questions about a movie performance catalog into SQL, executes them, charts the results, and returns a plain-English summary, all in one POST request.

    ChatRequest(question)
      → Vanna.generate_sql()   # NL-to-SQL via ChromaDB vector + Claude
      → run_query()            # DuckDB read-only execution
      → build_chart()          # Plotly JSON (bar / scatter / "")
      → summarize()            # Claude tool_use → summary + follow-up
      → ChatResponse

| Module | Responsibility |
|---|---|
| `data/schema.py` | DDL definitions + 10 ground-truth Q&A training pairs |
| `data/create_db.py` | Generates `catalog.duckdb`: 20 titles × 3 regions × 4 quarters |
| `pipeline/vanna_setup.py` | `CatalogVanna(ChromaDB_VectorStore, Anthropic_Chat)` singleton via `@lru_cache` |
| `pipeline/query_engine.py` | `run_query()`: SELECT/WITH guard, `FileNotFoundError`, 500-row cap |
| `pipeline/chart_builder.py` | `build_chart()`: categorical + numeric → bar; two numeric → scatter |
| `pipeline/summarizer.py` | `summarize()`: async Claude `tool_use`; short-circuits on empty df |
| `api/routes.py` | `POST /chat`: 4-step pipeline; 503/502/error-field taxonomy per step |
| `ui/index.html` | Dark-theme single-page chat UI; renders table + Plotly chart inline |
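
The chart rule in the `build_chart()` row (categorical + numeric gives a bar chart, two numeric columns give a scatter plot, anything else gives an empty string) can be sketched over plain rows. This is only the selection step, written against a list of dicts. `pick_chart_type` is a hypothetical name, the two-column restriction is an assumption, and the real module works from a DataFrame and returns Plotly JSON.

```python
from numbers import Number


def is_numeric(values: list) -> bool:
    """True if every value is a number (bools are excluded on purpose)."""
    return all(isinstance(v, Number) and not isinstance(v, bool) for v in values)


def pick_chart_type(rows: list[dict]) -> str:
    """Return 'bar', 'scatter', or '' for result rows with two columns."""
    if not rows:
        return ""
    columns = list(rows[0])
    if len(columns) != 2:  # assumption: only two-column results are charted
        return ""
    first = [row[columns[0]] for row in rows]
    second = [row[columns[1]] for row in rows]
    if is_numeric(first) and is_numeric(second):
        return "scatter"
    if not is_numeric(first) and is_numeric(second):
        return "bar"
    return ""
```

With the example response from this episode, a row such as `{"genre": "Horror", "avg_return_rate": 0.142}` would select a bar chart under this rule.
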

- Vanna multi-inheritance pattern: `CatalogVanna(ChromaDB_VectorStore, Anthropic_Chat)` is the standard Vanna composition; explicit `__init__` calls to both parents
- SELECT guard: `WITH` (CTEs) as well as `SELECT` accepted; everything else rejected before the DB is touched
- Singleton via `@lru_cache`: one Vanna instance for the app lifetime; no per-request ChromaDB startup cost
- Sync Vanna in async route: `await loop.run_in_executor(None, vn.generate_sql, question)` keeps the event loop unblocked
- Error taxonomy: 503 (Vanna), 502 (DuckDB / summarizer), `ChatResponse(error=...)` (expected domain errors)
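
Three of these patterns fit in one short standard-library sketch: a prefix guard that only admits `SELECT` and `WITH`, a cached factory that builds its object once, and a synchronous call pushed onto the default executor from async code. `SlowSqlGenerator` is a hypothetical stand-in for the Vanna instance, and the regex is one possible way to write the guard, not necessarily the repo's.

```python
import asyncio
import re
from functools import lru_cache

# Accept statements that start with SELECT or WITH, case-insensitively.
_READ_ONLY = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """Guard evaluated before the database is touched."""
    return bool(_READ_ONLY.match(sql))


class SlowSqlGenerator:
    """Hypothetical stand-in for a synchronous Vanna instance."""

    def generate_sql(self, question: str) -> str:
        return "SELECT genre FROM titles"


@lru_cache(maxsize=1)
def get_generator() -> SlowSqlGenerator:
    """Built on first call, then reused for the process lifetime."""
    return SlowSqlGenerator()


async def generate_sql_async(question: str) -> str:
    """Run the blocking call in the default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, get_generator().generate_sql, question)


sql = asyncio.run(generate_sql_async("Which genres exist?"))
```

A prefix check alone does not stop a second statement appended after a valid `SELECT`, which is one reason to pair it with read-only execution, as the pipeline diagram notes for DuckDB.
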

Setup:

    pip install -r requirements.txt
    # 1. Create the database
    python -m ep4_nlsql.data.create_db
    # 2. Train Vanna (writes ChromaDB index)
    python -m ep4_nlsql.pipeline.vanna_setup

Run:

    uvicorn ep4_nlsql.main:app --reload
    # Chat UI: http://localhost:8000/ui
    # Swagger: http://localhost:8000/docs
    # POST /api/v1/chat

Example request:

    { "question": "Which genre has the highest average return rate?" }

Example response:

    {
      "question": "Which genre has the highest average return rate?",
      "sql": "SELECT t.genre, AVG(p.return_rate) AS avg_return_rate FROM performance p ...",
      "result": [{"genre": "Horror", "avg_return_rate": 0.142}, ...],
      "chart_json": "{\"data\":[{\"type\":\"bar\",...}]}",
      "summary": "Horror titles have by far the highest return rate at 14.2%, ...",
      "follow_up": "Which Horror titles specifically have the highest return rates?"
    }

Tests:

    # Offline: no API key required
    pytest ep4_nlsql/tests/ -v
    # Integration: real Claude (requires ANTHROPIC_API_KEY + bootstrap steps above)
    pytest ep4_nlsql/tests/ -v -m integration

| Episode | Module | Topic |
|---|---|---|
| 1 | `ep1_langgraph` | LangGraph multi-agent pipeline |
| 2 | `ep2_structured_api` | FastAPI + Pydantic + Claude structured API |
| 3 | `ep3_playwright_agent` | Playwright browser automation agent |
| 4 | `ep4_nlsql` | NL-to-SQL with Vanna + DuckDB |