A multi-agent text-to-SQL analyst that plans, writes, validates, executes, and explains SQL with self-correction and safety guardrails. Built from scratch in plain Python around Anthropic's Claude — no LangChain, no LangGraph, ~1900 lines of typed code.
The repo is intentionally a portfolio piece: production-shaped agent loop, real safety layer, JSONL traces you can grep, and a test suite that runs end-to-end without an API key.
You ask a question in English. Six specialized agents take turns: one picks the relevant tables, one plans the steps, one writes SQL, one validates it against a live EXPLAIN QUERY PLAN, one executes it under a read-only connection with a timeout, and one explains the result. If validation fails (hallucinated column, broken join, unsafe statement) the writer gets the error message back and retries — up to a configurable budget — before giving up.
+--------------------+
question ----------------> | Schema Explorer | Haiku (picks tables, joins)
+---------+----------+
|
v
+--------------------+
| Planner | Sonnet (numbered plan in English)
+---------+----------+
|
v
+--------------------+
+-----------> | SQL Writer | Sonnet (emits one SELECT)
| +---------+----------+
| retry |
| with v
| error +--------------------+
+---------------+ Validator | sqlglot + EXPLAIN QUERY PLAN
+---------+----------+
| OK
v
+--------------------+
| Executor | Read-only conn, row cap, timeout
+---------+----------+
|
v
+--------------------+
| Summarizer | Haiku (plain-English answer)
+--------------------+
Naive text-to-SQL — paste schema, ask model, run output — falls over the moment you hit a real warehouse. The failure modes are well known:
- The model hallucinates a column name or invents a join.
- It writes a query that runs but answers the wrong question.
- It produces something subtly destructive that the wrapper happily executes.
- A single retry attempt with no error context can't recover from a typo.
This project is the answer to "can we let business users query the warehouse with natural language?" — at least in skeleton form. The agent loop, the validator-driven retry, and the layered safety model are the bits that actually move "demo on a slide" closer to "ship it to the business team."
Each agent does one job. Splitting them is what makes self-correction tractable: when the validator catches a bad column the writer can re-read the plan (still correct) instead of regenerating a new approach from scratch.
| Agent | Model | Job |
|---|---|---|
| Schema Explorer | Haiku 4.5 | Reads the introspected schema, names the relevant tables + joins, flags gotchas. |
| Planner | Sonnet 4.6 | Produces a 4-6 step numbered plan in plain English. Never emits SQL. |
| Writer | Sonnet 4.6 | Emits one SQLite SELECT, fenced in a code block. On retry it sees the prior SQL + validator error. |
| Validator | (no LLM) | Parses with sqlglot to enforce the safety policy, then runs EXPLAIN QUERY PLAN to catch unknown tables/columns. |
| Executor | (no LLM) | Runs the validated SQL against a read-only SQLite connection with a wall-clock timeout and a row cap. |
| Summarizer | Haiku 4.5 | Writes a short English answer, citing real numbers and calling out caveats (NULLs, truncation). |
The orchestrator (src/sql_analyst/orchestrator.py) is the demo. It's deliberately hand-written and short (~200 LOC) so you can read it top to bottom and see exactly what's happening — no framework dependency, no DSL.
Question: "What were the top 5 tracks by revenue in 2009?"
[schema_explorer]
RELEVANT_TABLES: tracks, invoice_items, invoices
JOINS:
- invoice_items.track_id = tracks.track_id
- invoice_items.invoice_id = invoices.invoice_id
NOTES:
- Revenue per line item is invoice_items.unit_price * invoice_items.quantity.
- Filter on invoices.invoice_date using LIKE '2009%'.
[planner]
1. Read tracks, invoice_items, and invoices.
2. Join invoice_items to invoices on invoice_id and to tracks on track_id.
3. Filter to invoices where invoice_date is in 2009.
4. Group by track to aggregate SUM(unit_price * quantity) as revenue.
5. Order by revenue descending and limit to 5.
[writer]
SELECT t.title, ROUND(SUM(ii.unit_price * ii.quantity), 2) AS revenue
FROM tracks t
JOIN invoice_items ii ON ii.track_id = t.track_id
JOIN invoices i ON i.invoice_id = ii.invoice_id
WHERE i.invoice_date LIKE '2009%'
GROUP BY t.track_id
ORDER BY revenue DESC
LIMIT 5
[validator]
FAILED: EXPLAIN failed: no such column: t.title
[writer (retry 1)] <- gets the error back, fixes it
SELECT t.name AS track_name, ROUND(SUM(ii.unit_price * ii.quantity), 2) AS revenue
FROM tracks t
JOIN invoice_items ii ON ii.track_id = t.track_id
JOIN invoices i ON i.invoice_id = ii.invoice_id
WHERE i.invoice_date LIKE '2009%'
GROUP BY t.track_id
ORDER BY revenue DESC
LIMIT 5
[validator]
OK
[executor]
5 rows in 3 ms
[summarizer]
In 2009 the top 5 tracks by revenue were Stairway to Heaven ($9.03), Bohemian
Rhapsody ($3.87), Comfortably Numb ($1.29), Acknowledgement ($1.29), and So What
($1.29). Stairway to Heaven dominates because its $1.29 unit price plus heavy
purchase volume from a single big spender pushed it well above the rest.
The full JSONL trace for this run is checked in at examples/sample_traces/trace-self-correction.jsonl — that's exactly what gets written to disk every run.
Defense in depth, because any single layer will eventually be defeated:
- OS-level read-only: SQLite is opened with
mode=rovia URI, so even if every Python check fails the kernel refuses the write. - AST safety check:
sqlglotparses every candidate statement. The top-level node must beSELECT/UNION/INTERSECT/EXCEPT. We walk the entire tree and reject anything that containsINSERT,UPDATE,DELETE,DROP,CREATE,ALTER,TRUNCATE,MERGE, orPRAGMAnodes — including DML wrapped in CTEs likeWITH d AS (DELETE FROM t RETURNING *) SELECT * FROM d. - No stacked statements:
sqlglot.parsereturns a list; we require exactly one statement, blockingSELECT 1; DROP TABLE xinjection patterns. EXPLAIN QUERY PLANnotSELECT: validation runsEXPLAIN QUERY PLAN <sql>, not the SQL itself. This catches unknown tables/columns at near-zero cost and zero risk.- Mandatory row cap: if the writer omits a
LIMIT, we inject one (defaults to 100). Cross-joins can't run away with memory. - Wall-clock timeout: the executor arms a
threading.Timerthat callsconnection.interrupt(). SQLite raisesOperationalErrorand we surface it cleanly.
These guardrails block: stacked-query injection, accidental DML from a confused model, fork-bomb-by-cartesian-join, and basic schema hallucination. They don't replace proper RBAC at the database level — that's the whole point of layer 1.
Requirements: Python 3.11+.
git clone <this repo>
cd sql-analyst-agent
# Build a fresh venv with uv (or just use plain pip).
uv sync # or: python -m venv .venv && .venv/bin/pip install -e .[dev]
# Build the sample database (Chinook subset, ~50 rows, no external dependencies).
python scripts/build_chinook.py
# Run the tests. No API key needed.
pytest
# To actually query: set your key.
cp .env.example .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
python -m sql_analyst "Show me the top 5 customers by total spend."[orchestrator] run id: 1716745201-2c9d3f1a
[schema] loaded 8 tables
[schema_explorer] RELEVANT_TABLES: customers, invoices
JOINS:
- customers.customer_id = invoices.customer_id
[planner] 1. Read from customers and invoices.
2. Inner join on customer_id.
...
[writer] SELECT c.first_name, c.last_name, ROUND(SUM(i.total), 2) AS total_spend
FROM customers c JOIN invoices i ON c.customer_id = i.customer_id
GROUP BY c.customer_id ORDER BY total_spend DESC LIMIT 5
[validator] OK
[executor] 5 rows in 2 ms
[summarizer] The top 5 customers by total spend are Helena Holy ($25.50),
Jennifer Peterson ($6.93), ...
+----------------------------+
| Answer |
| The top 5 customers ... |
+----------------------------+
Trace saved to traces/trace-1716745201-2c9d3f1a.jsonl
| Question | Generated SQL (abridged) | Result |
|---|---|---|
| Top 5 customers by total spend | SELECT first_name, last_name, SUM(total)... ORDER BY 3 DESC LIMIT 5 |
Helena Holy ($25.50) leads by 4x |
| Top 5 tracks by revenue in 2009 | ... WHERE invoice_date LIKE '2009%' GROUP BY track_id ...(self-corrected from t.title -> t.name) |
Stairway to Heaven ($9.03) |
| Which country generated the most revenue? | SELECT billing_country, SUM(total) ... LIMIT 1 |
Czech Republic ($28.71) |
| Customers with no purchases | LEFT JOIN ... WHERE invoice_id IS NULL |
(empty result; summarizer says so) |
| Average track length per genre | SELECT g.name, AVG(milliseconds)/1000.0 FROM tracks ... |
Classical longest avg |
The self-correction case is the interesting one — the writer guessed tracks.title (an Album column name) instead of tracks.name. The validator caught it in 5 ms via EXPLAIN, fed the error back, and the second attempt was correct.
Honest engineering notes — these are deliberately not glossed over.
- Schema fits in context here. Chinook is 8 tables. For a warehouse with thousands of tables, the Schema Explorer would need to be a real retrieval step (embed table descriptions + columns, top-k retrieve given the question). I left a stub here intentionally.
sqlglotAST checks are not a substitute for RBAC. A determined attacker can probably find a parse edge case. The read-only SQLite mode is the real safety net; the AST check is there to give the LLM fast, structured error messages when it goes off the rails.- No re-ranking or multiple candidates. A more robust system would generate N candidate SQLs and pick the one whose
EXPLAINreturned the lowest cost, or whose result shape matched what the plan asked for. - No conversation memory. Each question is independent. A real product needs the writer to remember "we already aliased
invoice_itemsasii" within a session. - Summarizer has no access to historical context. It only sees the current rows. For "is this number normal?" you'd want to feed it a baseline.
- Trace files are per-run. For production you'd ship them to an OTLP sink or s3.
- No streaming to stdout for token-by-token. The labels stream, but each agent's output is rendered after
complete()returns. The infrastructure to switch to true token streaming is inllm.py::stream.
| Choice | Why |
|---|---|
| Python 3.11 + dataclasses | Plain, typed, no framework lock-in. |
| Anthropic Claude (Sonnet + Haiku) | Sonnet for the hard reasoning steps (planner, writer), Haiku for the cheap ones (schema brief, summary). Mixing models is what production agents actually do for cost. |
| SQLite | Smallest dependency that has real semantics: foreign keys, EXPLAIN QUERY PLAN, read-only URI mode. Easy to ship. |
| sqlglot | Best-in-class SQL AST for Python; dialect-aware, lets the safety policy be one walk over the tree. |
| Rich | Nice CLI rendering, single dependency, zero configuration. |
| pytest | The bar. Tests pass without an API key by injecting a MockLLM that returns canned responses in order. |
| Hand-rolled orchestration | LangGraph / LangChain would obscure what the loop actually does. ~200 LOC of orchestrator.py makes the mechanics visible. |
sql-analyst-agent/
README.md
pyproject.toml
.env.example
src/sql_analyst/
agents/
schema_explorer.py planner.py writer.py
validator.py executor.py summarizer.py
orchestrator.py # the loop
safety.py # sqlglot-based AST policy
db.py # read-only connection, introspection, EXPLAIN
trace.py # JSONL append-only logger
llm.py # provider abstraction
config.py # env-driven config
cli.py
tests/
test_safety.py test_validator.py
test_executor.py test_orchestrator.py
conftest.py # MockLLM fixture, in-memory Chinook fixture
data/
chinook.db # generated by scripts/build_chinook.py
examples/
questions.txt
sample_traces/ # three real JSONL traces, one with self-correction
scripts/
build_chinook.py demo.sh