A small open-source project for benchmarking two LLM-based SQL agent configurations on the Chinook database, with reproducible run artifacts, custom dataset support, richer evaluation reporting, and a minimal Django API backend.
The project lets you:

- benchmark two SQL agent strategies on the same evaluation dataset
- support custom datasets and optional dataset-specific SQLite databases
- inspect per-run artifacts, summaries, error types, and failure examples
- use a Django API plus a React dashboard for dataset management and run analysis
- support lightweight asynchronous benchmark jobs with live polling in the UI
Two agent configurations are compared:

- `baseline` - a direct text-to-SQL pipeline that generates a SQL query, executes it, and synthesizes the answer
- `structured` - a LangGraph-based SQL agent that lists tables, retrieves relevant schema, generates a query, checks it, executes it, and answers the question
The goal is to evaluate whether a more structured agent pipeline improves reliability and answer quality over a simpler baseline.
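For intuition, the baseline flow boils down to three steps: generate SQL, execute it, answer from the result. The sketch below is only illustrative; the model name, prompts, and function name are assumptions, not the actual code in `src/agents.py`.

```python
# Illustrative baseline text-to-SQL flow (not the actual src/agents.py implementation).
import sqlite3

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is an assumption


def baseline_answer(question: str, db_path: str = "Chinook.db") -> str:
    # 1. Generate a single SQLite query for the question.
    sql = llm.invoke(
        f"Write one SQLite query for the Chinook database that answers: {question}\n"
        "Return only the SQL."
    ).content

    # 2. Execute the query (real code would also strip markdown fences and handle errors).
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()

    # 3. Synthesize a natural-language answer from the raw result.
    return llm.invoke(
        f"Question: {question}\nSQL result: {rows}\nAnswer concisely."
    ).content
```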
Repository layout:

    backend/
        manage.py
        config/
        api/
    frontend/
        src/
        package.json
    src/
        config.py
        db.py
        datasets.py
        prompts.py
        schemas.py
        agents.py
        evaluators.py
        benchmark.py
        report.py
    data/
        chinook_eval_dataset.json
    results/
- `src/` - benchmark core, agent orchestration, dataset loading, evaluation, and summary reporting
- `backend/` - Django + DRF API that exposes datasets, runs, and lightweight async run-job endpoints
- `frontend/` - Vite + React + TypeScript dashboard for uploads, run execution, live job tracking, and detailed analysis
The initial benchmark dataset contains 15 question-answer pairs across several categories:
`count`, `lookup`, `filter`, `aggregation`, `join`, and `ranking`.
The dataset also supports optional evaluation metadata fields such as:
`difficulty`, `expected_sql`, and `expected_result`.
It can be provided in two JSON formats:
- a plain array of examples compatible with the original benchmark
- an object with dataset metadata and an `examples` array, optionally including a dataset-specific SQLite database path or URL
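As an illustration, the two shapes could look roughly like this; the per-example field names and the database key are assumptions based on the fields described above, not the exact schema in `src/datasets.py`.

```python
# Hypothetical dataset shapes; field names are illustrative assumptions.

# Format 1: a plain array of examples (compatible with the original benchmark).
plain_examples = [
    {
        "question": "How many customers are there?",
        "answer": "59",
        "category": "count",
        "difficulty": "easy",                              # optional
        "expected_sql": "SELECT COUNT(*) FROM Customer;",  # optional
        "expected_result": [[59]],                         # optional raw SQL result
    },
]

# Format 2: an object with dataset metadata and an "examples" array,
# optionally pointing at a dataset-specific SQLite database.
dataset_with_metadata = {
    "name": "my_custom_dataset",
    "database": "path/or/url/to/custom.sqlite",  # key name is an assumption
    "examples": plain_examples,
}
```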
The benchmark stores and summarizes these metrics:
- correctness
- success rate
- latency
- intermediate step count
- evaluation mode
- error type
- failure examples
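Concretely, one per-example result row can be pictured roughly as the record below; the column names and values are illustrative placeholders, not the real CSV header.

```python
# Illustrative shape of a single per-example result row (names and values are placeholders).
example_row = {
    "agent": "structured",                 # baseline or structured
    "question": "How many customers are there?",
    "correct": True,                       # correctness verdict
    "success": True,                       # ran without an unhandled error
    "latency_s": 4.2,                      # wall-clock latency in seconds
    "steps": 6,                            # intermediate step count
    "evaluation_mode": "expected_result",  # llm_judge or expected_result
    "error_type": None,                    # populated only on failures
}
```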
Setup:

- Create a virtual environment.
- Install dependencies: `pip install -r requirements.txt`
- Copy `.env.example` to `.env` and fill in your keys.

Then run the benchmark:

    python -m src.benchmark

You can switch to a custom dataset by changing `DATASET_PATH` in `.env`.
This will:
- download `Chinook.db` if it is missing
- run both agents on every dataset example
- evaluate predictions with an LLM judge
- save top-level results to `results/`
- create a dedicated run folder in `results/runs/`
- snapshot benchmark configuration for reproducibility
- generate markdown and JSON summaries
Key output artifacts:

- latest detailed benchmark CSV in `results/`
- latest markdown summary in `results/summary.md`
- per-run artifact directory in `results/runs/<run_name>/` containing `benchmark_results.csv`, `summary.md`, `summary.json`, and `config.json`
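A quick way to poke at a finished run from Python, assuming the artifact file names above (the run folder name is a placeholder):

```python
# Load a single run's artifacts for ad-hoc analysis; the run folder name is a placeholder.
import json
from pathlib import Path

import pandas as pd

run_dir = Path("results/runs") / "<run_name>"  # replace with an actual run folder

results = pd.read_csv(run_dir / "benchmark_results.csv")      # per-example rows
summary = json.loads((run_dir / "summary.json").read_text())  # aggregated metrics
config = json.loads((run_dir / "config.json").read_text())    # snapshot of the benchmark config

print(config)
print(results.head())
```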
The benchmark supports two evaluation paths:
- `llm_judge` - semantic correctness based on the expected answer
- `expected_result` - exact comparison against an expected raw SQL result when it is available in the dataset
This makes it possible to gradually move from a pure LLM-as-a-judge setup to a more grounded benchmark.
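A minimal sketch of that fallback logic, assuming hypothetical helper names (the real logic lives in `src/evaluators.py`):

```python
# Sketch of selecting the evaluation path per example; helper names are hypothetical.
def llm_judge(question: str, expected_answer: str, predicted_answer: str) -> bool:
    """Hypothetical LLM-as-a-judge call; stands in for the real evaluator."""
    raise NotImplementedError


def evaluate(example: dict, predicted_answer: str, predicted_result) -> dict:
    if example.get("expected_result") is not None:
        # Grounded path: exact comparison against the expected raw SQL result.
        correct = predicted_result == example["expected_result"]
        return {"evaluation_mode": "expected_result", "correct": correct}

    # Fallback path: semantic correctness judged by an LLM against the expected answer.
    correct = llm_judge(example["question"], example["answer"], predicted_answer)
    return {"evaluation_mode": "llm_judge", "correct": correct}
```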
The generated summary includes:
- overall metrics by agent
- accuracy by category
- evaluation mode counts
- error type breakdown
- example failures
- direct agent comparison on accuracy, latency, and step count
The repository now includes a small Django + Django REST Framework backend that reuses the same benchmark core.
Run it with:

    python -m pip install -r requirements.txt
    python backend/manage.py migrate
    python backend/manage.py runserver

The API will be available at http://127.0.0.1:8000/api/.
- `GET /api/health/` - simple health check
- `GET /api/datasets/` - list builtin and uploaded datasets
- `POST /api/datasets/upload/` - upload a custom dataset JSON file
- `GET /api/datasets/<dataset_id>/` - retrieve dataset metadata and preview rows
- `GET /api/runs/` - list benchmark runs discovered in `results/runs/`
- `POST /api/runs/start/` - start a benchmark run for a selected dataset; accepts optional `async_mode=true` for background execution
- `GET /api/runs/<run_name>/` - retrieve config and summary for a specific run
- `GET /api/run-jobs/` - list tracked async benchmark jobs
- `GET /api/run-jobs/<job_id>/` - retrieve live status and artifact metadata for a specific async job
Storage and execution notes:

- uploaded dataset files are stored locally in `backend_data/datasets/`
- benchmark runs remain stored in `results/runs/`
- async jobs are tracked in `backend_data/run_jobs/` using lightweight file-based state
- synchronous execution is still available, but the React UI uses async job polling by default
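For example, a small script could start an async run and poll its job like the sketch below; the payload and response field names (`dataset_id`, `job_id`, `status`) are assumptions about the API, not documented contracts.

```python
# Rough client-side sketch of the async run flow; payload/response fields are assumptions.
import time

import requests

API = "http://127.0.0.1:8000/api"

datasets = requests.get(f"{API}/datasets/").json()

resp = requests.post(
    f"{API}/runs/start/",
    json={"dataset_id": datasets[0]["id"], "async_mode": True},  # field names assumed
)
job_id = resp.json()["job_id"]  # assumed response field

while True:
    job = requests.get(f"{API}/run-jobs/{job_id}/").json()
    if job.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(2)

print(job)
```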
The repository also includes a Vite + React + TypeScript UI that consumes the Django API.
Install dependencies and start the dev server:

    cd frontend
    npm install
    npm run dev

The UI will be available at http://127.0.0.1:5173/.
- copy `frontend/.env.example` to `frontend/.env` if you want to override the API base URL
- by default the frontend targets `http://127.0.0.1:8000/api`

From the UI you can:
- upload dataset JSON files
- inspect builtin and uploaded datasets
- start benchmark runs from the browser
- navigate across dedicated dataset, runs, live-job, and run-detail pages
- poll asynchronous benchmark jobs with live status updates
- inspect completed run summaries and artifact paths
- compare agents on accuracy, latency, and step deltas
- review evaluation-mode and error-type breakdowns
- inspect failure examples captured in the run summary
To run basic checks:

    python backend/manage.py check

    cd frontend
    npm run typecheck
    npm run build

Notes:

- this repository is designed as a portfolio-quality MVP rather than a production job platform
- async execution is intentionally lightweight and uses local threads plus file-backed job status (see the sketch after this list)
- no authentication, distributed queue, or persistent database models are used for benchmark artifacts
- `OPENAI_API_KEY` is required. `LANGCHAIN_API_KEY` is optional and only used for tracing.
- the project is intentionally small so it can serve as a portfolio starter repository.
- the frontend now depends on `react-router-dom`, so run `npm install` again after pulling the latest changes.
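The sketch below shows the general pattern (a background thread writes job status to a JSON file that the API reads back on each poll); it is illustrative only and not the actual backend code, though the `backend_data/run_jobs/` path matches the notes above.

```python
# Illustrative thread-plus-JSON-file job tracking; not the actual backend implementation.
import json
import threading
import uuid
from pathlib import Path

JOBS_DIR = Path("backend_data/run_jobs")


def _write_status(job_id: str, **fields) -> None:
    JOBS_DIR.mkdir(parents=True, exist_ok=True)
    (JOBS_DIR / f"{job_id}.json").write_text(json.dumps(fields))


def start_job(run_benchmark) -> str:
    """Kick off a benchmark callable in a daemon thread and return a job id."""
    job_id = uuid.uuid4().hex
    _write_status(job_id, status="running")

    def worker() -> None:
        try:
            result = run_benchmark()
            _write_status(job_id, status="completed", result=result)
        except Exception as exc:  # keep the sketch simple
            _write_status(job_id, status="failed", error=str(exc))

    threading.Thread(target=worker, daemon=True).start()
    return job_id


def get_status(job_id: str) -> dict:
    return json.loads((JOBS_DIR / f"{job_id}.json").read_text())
```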
Possible next steps:

- compare multiple LLMs
- expand the dataset
- add error analysis by category
- add regression tests for benchmark consistency