A small open-source project for benchmarking two LLM-based SQL agent configurations on the Chinook database, with reproducible run artifacts, custom dataset support, richer evaluation reporting, and a minimal Django API backend.
The project lets you:

- benchmark two SQL agent strategies on the same evaluation dataset
- support custom datasets and optional dataset-specific SQLite databases
- inspect per-run artifacts, summaries, error types, and failure examples
- use a Django API plus a React dashboard for dataset management and run analysis
- support lightweight asynchronous benchmark jobs with live polling in the UI
Two agent configurations are compared:

- `baseline` - a direct text-to-SQL pipeline that generates a SQL query, executes it, and synthesizes the answer
- `structured` - a LangGraph-based SQL agent that lists tables, retrieves relevant schema, generates a query, checks it, executes it, and answers the question
The goal is to evaluate whether a more structured agent pipeline improves reliability and answer quality over a simpler baseline.
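For intuition, the baseline flow boils down to three steps: generate SQL, execute it, answer from the result. The sketch below is only illustrative; the model name, prompts, and function name are assumptions, not the actual code in `src/agents.py`.

```python
# Illustrative baseline text-to-SQL flow (not the actual src/agents.py implementation).
import sqlite3

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is an assumption


def baseline_answer(question: str, db_path: str = "Chinook.db") -> str:
    # 1. Generate a single SQLite query for the question.
    sql = llm.invoke(
        f"Write one SQLite query for the Chinook database that answers: {question}\n"
        "Return only the SQL."
    ).content

    # 2. Execute the query (real code would also strip markdown fences and handle errors).
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()

    # 3. Synthesize a natural-language answer from the raw result.
    return llm.invoke(
        f"Question: {question}\nSQL result: {rows}\nAnswer concisely."
    ).content
```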
Repository layout:

    backend/
        manage.py
        config/
        api/
    frontend/
        src/
        package.json
    src/
        config.py
        db.py
        datasets.py
        prompts.py
        schemas.py
        agents.py
        evaluators.py
        benchmark.py
        report.py
    data/
        chinook_eval_dataset.json
    results/
- `src/` - benchmark core, agent orchestration, dataset loading, evaluation, and summary reporting
- `backend/` - Django + DRF API that exposes datasets, runs, and lightweight async run-job endpoints
- `frontend/` - Vite + React + TypeScript dashboard for uploads, run execution, live job tracking, and detailed analysis
The initial benchmark dataset contains 15 question-answer pairs across several categories:
`count`, `lookup`, `filter`, `aggregation`, `join`, and `ranking`.
The dataset also supports optional evaluation metadata fields such as:
`difficulty`, `expected_sql`, and `expected_result`.
It can be provided in two JSON formats:
- a plain array of examples compatible with the original benchmark
- an object with dataset metadata and an `examples` array, optionally including a dataset-specific SQLite database path or URL
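As an illustration, the two shapes could look roughly like this; the per-example field names and the database key are assumptions based on the fields described above, not the exact schema in `src/datasets.py`.

```python
# Hypothetical dataset shapes; field names are illustrative assumptions.

# Format 1: a plain array of examples (compatible with the original benchmark).
plain_examples = [
    {
        "question": "How many customers are there?",
        "answer": "59",
        "category": "count",
        "difficulty": "easy",                              # optional
        "expected_sql": "SELECT COUNT(*) FROM Customer;",  # optional
        "expected_result": [[59]],                         # optional raw SQL result
    },
]

# Format 2: an object with dataset metadata and an "examples" array,
# optionally pointing at a dataset-specific SQLite database.
dataset_with_metadata = {
    "name": "my_custom_dataset",
    "database": "path/or/url/to/custom.sqlite",  # key name is an assumption
    "examples": plain_examples,
}
```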
The benchmark stores and summarizes these metrics:
- correctness
- success rate
- latency
- intermediate step count
- evaluation mode
- error type
- failure examples
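Concretely, one per-example result row can be pictured roughly as the record below; the column names and values are illustrative placeholders, not the real CSV header.

```python
# Illustrative shape of a single per-example result row (names and values are placeholders).
example_row = {
    "agent": "structured",                 # baseline or structured
    "question": "How many customers are there?",
    "correct": True,                       # correctness verdict
    "success": True,                       # ran without an unhandled error
    "latency_s": 4.2,                      # wall-clock latency in seconds
    "steps": 6,                            # intermediate step count
    "evaluation_mode": "expected_result",  # llm_judge or expected_result
    "error_type": None,                    # populated only on failures
}
```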
Setup:

- Create a virtual environment.
- Install dependencies: `pip install -r requirements.txt`
- Copy `.env.example` to `.env` and fill in your keys.

Then run the benchmark:

    python -m src.benchmark

You can switch to a custom dataset by changing `DATASET_PATH` in `.env`.
This will:
- download `Chinook.db` if it is missing
- run both agents on every dataset example
- evaluate predictions with an LLM judge
- save top-level results to `results/`
- create a dedicated run folder in `results/runs/`
- snapshot benchmark configuration for reproducibility
- generate markdown and JSON summaries
Key output artifacts:

- latest detailed benchmark CSV in `results/`
- latest markdown summary in `results/summary.md`
- per-run artifact directory in `results/runs/<run_name>/` containing `benchmark_results.csv`, `summary.md`, `summary.json`, and `config.json`
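A quick way to poke at a finished run from Python, assuming the artifact file names above (the run folder name is a placeholder):

```python
# Load a single run's artifacts for ad-hoc analysis; the run folder name is a placeholder.
import json
from pathlib import Path

import pandas as pd

run_dir = Path("results/runs") / "<run_name>"  # replace with an actual run folder

results = pd.read_csv(run_dir / "benchmark_results.csv")      # per-example rows
summary = json.loads((run_dir / "summary.json").read_text())  # aggregated metrics
config = json.loads((run_dir / "config.json").read_text())    # snapshot of the benchmark config

print(config)
print(results.head())
```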
The benchmark supports two evaluation paths:
- `llm_judge` - semantic correctness based on the expected answer
- `expected_result` - exact comparison against an expected raw SQL result when it is available in the dataset
This makes it possible to gradually move from a pure LLM-as-a-judge setup to a more grounded benchmark.
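A minimal sketch of that fallback logic, assuming hypothetical helper names (the real logic lives in `src/evaluators.py`):

```python
# Sketch of selecting the evaluation path per example; helper names are hypothetical.
def llm_judge(question: str, expected_answer: str, predicted_answer: str) -> bool:
    """Hypothetical LLM-as-a-judge call; stands in for the real evaluator."""
    raise NotImplementedError


def evaluate(example: dict, predicted_answer: str, predicted_result) -> dict:
    if example.get("expected_result") is not None:
        # Grounded path: exact comparison against the expected raw SQL result.
        correct = predicted_result == example["expected_result"]
        return {"evaluation_mode": "expected_result", "correct": correct}

    # Fallback path: semantic correctness judged by an LLM against the expected answer.
    correct = llm_judge(example["question"], example["answer"], predicted_answer)
    return {"evaluation_mode": "llm_judge", "correct": correct}
```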
The generated summary includes:
- overall metrics by agent
- accuracy by category
- evaluation mode counts
- error type breakdown
- example failures
- direct agent comparison on accuracy, latency, and step count
The repository now includes a small Django + Django REST Framework backend that reuses the same benchmark core.
Run it with:

    python -m pip install -r requirements.txt
    python backend/manage.py migrate
    python backend/manage.py runserver

The API will be available at http://127.0.0.1:8000/api/.
- `GET /api/health/` - simple health check
- `GET /api/datasets/` - list builtin and uploaded datasets
- `POST /api/datasets/upload/` - upload a custom dataset JSON file
- `GET /api/datasets/<dataset_id>/` - retrieve dataset metadata and preview rows
- `GET /api/runs/` - list benchmark runs discovered in `results/runs/`
- `POST /api/runs/start/` - start a benchmark run for a selected dataset; accepts optional `async_mode=true` for background execution
- `GET /api/runs/<run_name>/` - retrieve config and summary for a specific run
- `GET /api/run-jobs/` - list tracked async benchmark jobs
- `GET /api/run-jobs/<job_id>/` - retrieve live status and artifact metadata for a specific async job
Storage and execution notes:

- uploaded dataset files are stored locally in `backend_data/datasets/`
- benchmark runs remain stored in `results/runs/`
- async jobs are tracked in `backend_data/run_jobs/` using lightweight file-based state
- synchronous execution is still available, but the React UI uses async job polling by default
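For example, a small script could start an async run and poll its job like the sketch below; the payload and response field names (`dataset_id`, `job_id`, `status`) are assumptions about the API, not documented contracts.

```python
# Rough client-side sketch of the async run flow; payload/response fields are assumptions.
import time

import requests

API = "http://127.0.0.1:8000/api"

datasets = requests.get(f"{API}/datasets/").json()

resp = requests.post(
    f"{API}/runs/start/",
    json={"dataset_id": datasets[0]["id"], "async_mode": True},  # field names assumed
)
job_id = resp.json()["job_id"]  # assumed response field

while True:
    job = requests.get(f"{API}/run-jobs/{job_id}/").json()
    if job.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(2)

print(job)
```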
The repository also includes a Vite + React + TypeScript UI that consumes the Django API.
Install dependencies and start the dev server:

    cd frontend
    npm install
    npm run dev

The UI will be available at http://127.0.0.1:5173/.
- copy `frontend/.env.example` to `frontend/.env` if you want to override the API base URL
- by default the frontend targets `http://127.0.0.1:8000/api`

From the UI you can:
- upload dataset JSON files
- inspect builtin and uploaded datasets
- start benchmark runs from the browser
- navigate across dedicated dataset, runs, live-job, and run-detail pages
- poll asynchronous benchmark jobs with live status updates
- inspect completed run summaries and artifact paths
- compare agents on accuracy, latency, and step deltas
- review evaluation-mode and error-type breakdowns
- inspect failure examples captured in the run summary
To run basic checks:

    python backend/manage.py check

    cd frontend
    npm run typecheck
    npm run build

Notes:

- this repository is designed as a portfolio-quality MVP rather than a production job platform
- async execution is intentionally lightweight and uses local threads plus file-backed job status (see the sketch after this list)
- no authentication, distributed queue, or persistent database models are used for benchmark artifacts
- `OPENAI_API_KEY` is required. `LANGCHAIN_API_KEY` is optional and only used for tracing.
- the project is intentionally small so it can serve as a portfolio starter repository.
- the frontend now depends on `react-router-dom`, so run `npm install` again after pulling the latest changes.
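The sketch below shows the general pattern (a background thread writes job status to a JSON file that the API reads back on each poll); it is illustrative only and not the actual backend code, though the `backend_data/run_jobs/` path matches the notes above.

```python
# Illustrative thread-plus-JSON-file job tracking; not the actual backend implementation.
import json
import threading
import uuid
from pathlib import Path

JOBS_DIR = Path("backend_data/run_jobs")


def _write_status(job_id: str, **fields) -> None:
    JOBS_DIR.mkdir(parents=True, exist_ok=True)
    (JOBS_DIR / f"{job_id}.json").write_text(json.dumps(fields))


def start_job(run_benchmark) -> str:
    """Kick off a benchmark callable in a daemon thread and return a job id."""
    job_id = uuid.uuid4().hex
    _write_status(job_id, status="running")

    def worker() -> None:
        try:
            result = run_benchmark()
            _write_status(job_id, status="completed", result=result)
        except Exception as exc:  # keep the sketch simple
            _write_status(job_id, status="failed", error=str(exc))

    threading.Thread(target=worker, daemon=True).start()
    return job_id


def get_status(job_id: str) -> dict:
    return json.loads((JOBS_DIR / f"{job_id}.json").read_text())
```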
Possible next steps:

- compare multiple LLMs
- expand the dataset
- add error analysis by category
- add regression tests for benchmark consistency