Skip to content

olsongl/sql-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sql-analyst-agent

A multi-agent text-to-SQL analyst that plans, writes, validates, executes, and explains SQL with self-correction and safety guardrails. Built from scratch in plain Python around Anthropic's Claude — no LangChain, no LangGraph, ~1900 lines of typed code.

The repo is intentionally a portfolio piece: production-shaped agent loop, real safety layer, JSONL traces you can grep, and a test suite that runs end-to-end without an API key.

What it does

You ask a question in English. Six specialized agents take turns: one picks the relevant tables, one plans the steps, one writes SQL, one validates it against a live EXPLAIN QUERY PLAN, one executes it under a read-only connection with a timeout, and one explains the result. If validation fails (hallucinated column, broken join, unsafe statement) the writer gets the error message back and retries — up to a configurable budget — before giving up.

                            +--------------------+
question  ----------------> |  Schema Explorer   |  Haiku   (picks tables, joins)
                            +---------+----------+
                                      |
                                      v
                            +--------------------+
                            |      Planner       |  Sonnet  (numbered plan in English)
                            +---------+----------+
                                      |
                                      v
                            +--------------------+
              +-----------> |     SQL Writer     |  Sonnet  (emits one SELECT)
              |             +---------+----------+
              | retry                 |
              | with                  v
              | error         +--------------------+
              +---------------+     Validator      |  sqlglot + EXPLAIN QUERY PLAN
                              +---------+----------+
                                        | OK
                                        v
                              +--------------------+
                              |     Executor       |  Read-only conn, row cap, timeout
                              +---------+----------+
                                        |
                                        v
                              +--------------------+
                              |    Summarizer      |  Haiku   (plain-English answer)
                              +--------------------+

Why this matters

Naive text-to-SQL — paste schema, ask model, run output — falls over the moment you hit a real warehouse. The failure modes are well known:

  • The model hallucinates a column name or invents a join.
  • It writes a query that runs but answers the wrong question.
  • It produces something subtly destructive that the wrapper happily executes.
  • A single retry attempt with no error context can't recover from a typo.

This project is the answer to "can we let business users query the warehouse with natural language?" — at least in skeleton form. The agent loop, the validator-driven retry, and the layered safety model are the bits that actually move "demo on a slide" closer to "ship it to the business team."

Architecture

Each agent does one job. Splitting them is what makes self-correction tractable: when the validator catches a bad column the writer can re-read the plan (still correct) instead of regenerating a new approach from scratch.

Agent Model Job
Schema Explorer Haiku 4.5 Reads the introspected schema, names the relevant tables + joins, flags gotchas.
Planner Sonnet 4.6 Produces a 4-6 step numbered plan in plain English. Never emits SQL.
Writer Sonnet 4.6 Emits one SQLite SELECT, fenced in a code block. On retry it sees the prior SQL + validator error.
Validator (no LLM) Parses with sqlglot to enforce the safety policy, then runs EXPLAIN QUERY PLAN to catch unknown tables/columns.
Executor (no LLM) Runs the validated SQL against a read-only SQLite connection with a wall-clock timeout and a row cap.
Summarizer Haiku 4.5 Writes a short English answer, citing real numbers and calling out caveats (NULLs, truncation).

The orchestrator (src/sql_analyst/orchestrator.py) is the demo. It's deliberately hand-written and short (~200 LOC) so you can read it top to bottom and see exactly what's happening — no framework dependency, no DSL.

A concrete trace

Question: "What were the top 5 tracks by revenue in 2009?"

[schema_explorer]
RELEVANT_TABLES: tracks, invoice_items, invoices
JOINS:
  - invoice_items.track_id = tracks.track_id
  - invoice_items.invoice_id = invoices.invoice_id
NOTES:
  - Revenue per line item is invoice_items.unit_price * invoice_items.quantity.
  - Filter on invoices.invoice_date using LIKE '2009%'.

[planner]
1. Read tracks, invoice_items, and invoices.
2. Join invoice_items to invoices on invoice_id and to tracks on track_id.
3. Filter to invoices where invoice_date is in 2009.
4. Group by track to aggregate SUM(unit_price * quantity) as revenue.
5. Order by revenue descending and limit to 5.

[writer]
SELECT t.title, ROUND(SUM(ii.unit_price * ii.quantity), 2) AS revenue
FROM tracks t
JOIN invoice_items ii ON ii.track_id = t.track_id
JOIN invoices i ON i.invoice_id = ii.invoice_id
WHERE i.invoice_date LIKE '2009%'
GROUP BY t.track_id
ORDER BY revenue DESC
LIMIT 5

[validator]
FAILED: EXPLAIN failed: no such column: t.title

[writer (retry 1)]      <- gets the error back, fixes it
SELECT t.name AS track_name, ROUND(SUM(ii.unit_price * ii.quantity), 2) AS revenue
FROM tracks t
JOIN invoice_items ii ON ii.track_id = t.track_id
JOIN invoices i ON i.invoice_id = ii.invoice_id
WHERE i.invoice_date LIKE '2009%'
GROUP BY t.track_id
ORDER BY revenue DESC
LIMIT 5

[validator]
OK

[executor]
5 rows in 3 ms

[summarizer]
In 2009 the top 5 tracks by revenue were Stairway to Heaven ($9.03), Bohemian
Rhapsody ($3.87), Comfortably Numb ($1.29), Acknowledgement ($1.29), and So What
($1.29). Stairway to Heaven dominates because its $1.29 unit price plus heavy
purchase volume from a single big spender pushed it well above the rest.

The full JSONL trace for this run is checked in at examples/sample_traces/trace-self-correction.jsonl — that's exactly what gets written to disk every run.

Safety model

Defense in depth, because any single layer will eventually be defeated:

  1. OS-level read-only: SQLite is opened with mode=ro via URI, so even if every Python check fails the kernel refuses the write.
  2. AST safety check: sqlglot parses every candidate statement. The top-level node must be SELECT / UNION / INTERSECT / EXCEPT. We walk the entire tree and reject anything that contains INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE, MERGE, or PRAGMA nodes — including DML wrapped in CTEs like WITH d AS (DELETE FROM t RETURNING *) SELECT * FROM d.
  3. No stacked statements: sqlglot.parse returns a list; we require exactly one statement, blocking SELECT 1; DROP TABLE x injection patterns.
  4. EXPLAIN QUERY PLAN not SELECT: validation runs EXPLAIN QUERY PLAN <sql>, not the SQL itself. This catches unknown tables/columns at near-zero cost and zero risk.
  5. Mandatory row cap: if the writer omits a LIMIT, we inject one (defaults to 100). Cross-joins can't run away with memory.
  6. Wall-clock timeout: the executor arms a threading.Timer that calls connection.interrupt(). SQLite raises OperationalError and we surface it cleanly.

These guardrails block: stacked-query injection, accidental DML from a confused model, fork-bomb-by-cartesian-join, and basic schema hallucination. They don't replace proper RBAC at the database level — that's the whole point of layer 1.

Quickstart

Requirements: Python 3.11+.

git clone <this repo>
cd sql-analyst-agent

# Build a fresh venv with uv (or just use plain pip).
uv sync                      # or:  python -m venv .venv && .venv/bin/pip install -e .[dev]

# Build the sample database (Chinook subset, ~50 rows, no external dependencies).
python scripts/build_chinook.py

# Run the tests. No API key needed.
pytest

# To actually query: set your key.
cp .env.example .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

python -m sql_analyst "Show me the top 5 customers by total spend."

Sample expected output

[orchestrator] run id: 1716745201-2c9d3f1a
[schema] loaded 8 tables
[schema_explorer] RELEVANT_TABLES: customers, invoices
                  JOINS:
                    - customers.customer_id = invoices.customer_id
[planner] 1. Read from customers and invoices.
          2. Inner join on customer_id.
          ...
[writer]  SELECT c.first_name, c.last_name, ROUND(SUM(i.total), 2) AS total_spend
          FROM customers c JOIN invoices i ON c.customer_id = i.customer_id
          GROUP BY c.customer_id ORDER BY total_spend DESC LIMIT 5
[validator] OK
[executor] 5 rows in 2 ms
[summarizer] The top 5 customers by total spend are Helena Holy ($25.50),
             Jennifer Peterson ($6.93), ...

+----------------------------+
| Answer                     |
| The top 5 customers ...    |
+----------------------------+
Trace saved to traces/trace-1716745201-2c9d3f1a.jsonl

Example queries and outputs

Question Generated SQL (abridged) Result
Top 5 customers by total spend SELECT first_name, last_name, SUM(total)... ORDER BY 3 DESC LIMIT 5 Helena Holy ($25.50) leads by 4x
Top 5 tracks by revenue in 2009 ... WHERE invoice_date LIKE '2009%' GROUP BY track_id ...
(self-corrected from t.title -> t.name)
Stairway to Heaven ($9.03)
Which country generated the most revenue? SELECT billing_country, SUM(total) ... LIMIT 1 Czech Republic ($28.71)
Customers with no purchases LEFT JOIN ... WHERE invoice_id IS NULL (empty result; summarizer says so)
Average track length per genre SELECT g.name, AVG(milliseconds)/1000.0 FROM tracks ... Classical longest avg

The self-correction case is the interesting one — the writer guessed tracks.title (an Album column name) instead of tracks.name. The validator caught it in 5 ms via EXPLAIN, fed the error back, and the second attempt was correct.

Tradeoffs and what I'd extend

Honest engineering notes — these are deliberately not glossed over.

  • Schema fits in context here. Chinook is 8 tables. For a warehouse with thousands of tables, the Schema Explorer would need to be a real retrieval step (embed table descriptions + columns, top-k retrieve given the question). I left a stub here intentionally.
  • sqlglot AST checks are not a substitute for RBAC. A determined attacker can probably find a parse edge case. The read-only SQLite mode is the real safety net; the AST check is there to give the LLM fast, structured error messages when it goes off the rails.
  • No re-ranking or multiple candidates. A more robust system would generate N candidate SQLs and pick the one whose EXPLAIN returned the lowest cost, or whose result shape matched what the plan asked for.
  • No conversation memory. Each question is independent. A real product needs the writer to remember "we already aliased invoice_items as ii" within a session.
  • Summarizer has no access to historical context. It only sees the current rows. For "is this number normal?" you'd want to feed it a baseline.
  • Trace files are per-run. For production you'd ship them to an OTLP sink or s3.
  • No streaming to stdout for token-by-token. The labels stream, but each agent's output is rendered after complete() returns. The infrastructure to switch to true token streaming is in llm.py::stream.

Tech stack

Choice Why
Python 3.11 + dataclasses Plain, typed, no framework lock-in.
Anthropic Claude (Sonnet + Haiku) Sonnet for the hard reasoning steps (planner, writer), Haiku for the cheap ones (schema brief, summary). Mixing models is what production agents actually do for cost.
SQLite Smallest dependency that has real semantics: foreign keys, EXPLAIN QUERY PLAN, read-only URI mode. Easy to ship.
sqlglot Best-in-class SQL AST for Python; dialect-aware, lets the safety policy be one walk over the tree.
Rich Nice CLI rendering, single dependency, zero configuration.
pytest The bar. Tests pass without an API key by injecting a MockLLM that returns canned responses in order.
Hand-rolled orchestration LangGraph / LangChain would obscure what the loop actually does. ~200 LOC of orchestrator.py makes the mechanics visible.

Repo layout

sql-analyst-agent/
  README.md
  pyproject.toml
  .env.example
  src/sql_analyst/
    agents/
      schema_explorer.py    planner.py    writer.py
      validator.py          executor.py   summarizer.py
    orchestrator.py         # the loop
    safety.py               # sqlglot-based AST policy
    db.py                   # read-only connection, introspection, EXPLAIN
    trace.py                # JSONL append-only logger
    llm.py                  # provider abstraction
    config.py               # env-driven config
    cli.py
  tests/
    test_safety.py          test_validator.py
    test_executor.py        test_orchestrator.py
    conftest.py             # MockLLM fixture, in-memory Chinook fixture
  data/
    chinook.db              # generated by scripts/build_chinook.py
  examples/
    questions.txt
    sample_traces/          # three real JSONL traces, one with self-correction
  scripts/
    build_chinook.py        demo.sh

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors