SQL Agent Benchmark

A small open-source project for benchmarking two LLM-based SQL agent configurations on the Chinook database, with reproducible run artifacts, custom dataset support, detailed evaluation reporting, a minimal Django API backend, and a React dashboard.

Portfolio highlights

  • benchmark two SQL agent strategies on the same evaluation dataset
  • support custom datasets and optional dataset-specific SQLite databases
  • inspect per-run artifacts, summaries, error types, and failure examples
  • use a Django API plus a React dashboard for dataset management and run analysis
  • support lightweight asynchronous benchmark jobs with live polling in the UI

What this project compares

  • baseline
    • a direct text-to-SQL pipeline that generates a SQL query, executes it, and synthesizes the answer
  • structured
    • a LangGraph-based SQL agent that lists tables, retrieves relevant schema, generates a query, checks it, executes it, and answers the question

Project goal

The goal is to evaluate whether a more structured agent pipeline improves reliability and answer quality over a simpler baseline.

Repository structure

backend/
  manage.py
  config/
  api/

frontend/
  src/
  package.json

src/
  config.py
  db.py
  datasets.py
  prompts.py
  schemas.py
  agents.py
  evaluators.py
  benchmark.py
  report.py

data/
  chinook_eval_dataset.json

results/

Architecture overview

  • src/
    • benchmark core, agent orchestration, dataset loading, evaluation, and summary reporting
  • backend/
    • Django + DRF API that exposes datasets, runs, and lightweight async run-job endpoints
  • frontend/
    • Vite + React + TypeScript dashboard for uploads, run execution, live job tracking, and detailed analysis

Dataset

The initial benchmark dataset contains 15 question-answer pairs across six categories:

  • count
  • lookup
  • filter
  • aggregation
  • join
  • ranking

The dataset also supports optional evaluation metadata fields such as:

  • difficulty
  • expected_sql
  • expected_result

It can be provided in two JSON formats:

  • a plain array of examples compatible with the original benchmark
  • an object with dataset metadata and an examples array, optionally including a dataset-specific SQLite database path or URL
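
For illustration, the two formats look roughly like the snippets below. The exact field names for questions and answers are defined by the loader in src/datasets.py, so the keys shown here (question, answer, category, name, database) and the values are assumptions.

A plain array of examples:

[
  {
    "question": "How many customers are there?",
    "answer": "59",
    "category": "count"
  }
]

An object with dataset metadata, an examples array, and an optional SQLite database path:

{
  "name": "my_dataset",
  "database": "data/MyDatabase.sqlite",
  "examples": [
    {
      "question": "Which artist has the most albums?",
      "answer": "Iron Maiden",
      "category": "ranking",
      "difficulty": "medium"
    }
  ]
}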

Metrics

The benchmark stores and summarizes these metrics:

  • correctness
  • success rate
  • latency
  • intermediate step count
  • evaluation mode
  • error type
  • failure examples
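
For a rough sense of what gets stored, a per-agent entry in summary.json might look like the sketch below; the field names are illustrative, not the actual schema produced by src/report.py.

{
  "agent": "structured",
  "accuracy": 0.87,
  "success_rate": 0.93,
  "avg_latency_seconds": 6.2,
  "avg_intermediate_steps": 5.1,
  "evaluation_modes": {"llm_judge": 12, "expected_result": 3},
  "error_types": {"sql_error": 1, "wrong_answer": 1}
}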

Setup

  1. Create a virtual environment.
  2. Install dependencies:
pip install -r requirements.txt
  3. Copy .env.example to .env and fill in your keys (a minimal sketch follows below).
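
A minimal .env sketch; OPENAI_API_KEY, LANGCHAIN_API_KEY, and DATASET_PATH are the variables mentioned in this README, and .env.example is the authoritative reference for anything else:

OPENAI_API_KEY=sk-...
# optional, only used for tracing
LANGCHAIN_API_KEY=...
# optional, point at a custom dataset JSON
DATASET_PATH=data/chinook_eval_dataset.json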

Run the benchmark

python -m src.benchmark

You can switch to a custom dataset by changing DATASET_PATH in .env.

This will:

  • download Chinook.db if it is missing
  • run both agents on every dataset example
  • evaluate predictions with an LLM judge
  • save top-level results to results/
  • create a dedicated run folder in results/runs/
  • snapshot benchmark configuration for reproducibility
  • generate markdown and JSON summaries

Output files

  • latest detailed benchmark CSV in results/
  • latest markdown summary in results/summary.md
  • per-run artifact directory in results/runs/<run_name>/
    • benchmark_results.csv
    • summary.md
    • summary.json
    • config.json

Evaluation modes

The benchmark supports two evaluation paths:

  • llm_judge
    • semantic correctness based on the expected answer
  • expected_result
    • exact comparison against an expected raw SQL result when it is available in the dataset

This makes it possible to gradually move from a pure LLM-as-a-judge setup to a more grounded benchmark.
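
As a sketch, an example entry that carries expected_result is scored by exact comparison against the raw SQL result, while the same entry without it falls back to the LLM judge. The expected_sql and expected_result field names come from the dataset metadata above; the other keys and the exact result representation are illustrative.

{
  "question": "How many employees are there?",
  "answer": "8",
  "category": "count",
  "expected_sql": "SELECT COUNT(*) FROM Employee;",
  "expected_result": "[(8,)]"
}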

Reporting

The generated summary includes:

  • overall metrics by agent
  • accuracy by category
  • evaluation mode counts
  • error type breakdown
  • example failures
  • direct agent comparison on accuracy, latency, and step count

Minimal Django backend

The repository now includes a small Django + Django REST Framework backend that reuses the same benchmark core.

Start the backend

python -m pip install -r requirements.txt
python backend/manage.py migrate
python backend/manage.py runserver

The API will be available at http://127.0.0.1:8000/api/.
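
A quick health check to confirm the backend is running (assuming curl is available):

curl http://127.0.0.1:8000/api/health/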

Available endpoints

  • GET /api/health/
    • simple health check
  • GET /api/datasets/
    • list builtin and uploaded datasets
  • POST /api/datasets/upload/
    • upload a custom dataset JSON file
  • GET /api/datasets/<dataset_id>/
    • retrieve dataset metadata and preview rows
  • GET /api/runs/
    • list benchmark runs discovered in results/runs/
  • POST /api/runs/start/
    • start a benchmark run for a selected dataset
    • accepts optional async_mode=true for background execution
  • GET /api/runs/<run_name>/
    • retrieve config and summary for a specific run
  • GET /api/run-jobs/
    • list tracked async benchmark jobs
  • GET /api/run-jobs/<job_id>/
    • retrieve live status and artifact metadata for a specific async job
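
As an illustration, a background run can be started and polled from the command line roughly like this; the request payload shape (in particular the dataset identifier field) is an assumption, so check the API code for the exact contract:

# start a benchmark run in the background
curl -X POST http://127.0.0.1:8000/api/runs/start/ \
  -H "Content-Type: application/json" \
  -d '{"dataset_id": "chinook_eval_dataset", "async_mode": true}'

# poll the job returned by the previous call
curl http://127.0.0.1:8000/api/run-jobs/<job_id>/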

Backend behavior

  • uploaded dataset files are stored locally in backend_data/datasets/
  • benchmark runs remain stored in results/runs/
  • async jobs are tracked in backend_data/run_jobs/ using lightweight file-based state
  • synchronous execution is still available, but the React UI uses async job polling by default

Minimal React frontend

The repository also includes a Vite + React + TypeScript UI that consumes the Django API.

Start the frontend

cd frontend
npm install
npm run dev

The UI will be available at http://127.0.0.1:5173/.

Frontend configuration

  • copy frontend/.env.example to frontend/.env if you want to override the API base URL
  • by default the frontend targets http://127.0.0.1:8000/api
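
A minimal frontend/.env override could look like the line below; the actual variable name is defined in frontend/.env.example, so VITE_API_BASE_URL is only an assumption (Vite exposes variables prefixed with VITE_ to the client):

VITE_API_BASE_URL=http://127.0.0.1:8000/api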

Frontend features

  • upload dataset JSON files
  • inspect builtin and uploaded datasets
  • start benchmark runs from the browser
  • navigate across dedicated dataset, runs, live-job, and run-detail pages
  • poll asynchronous benchmark jobs with live status updates
  • inspect completed run summaries and artifact paths
  • compare agents on accuracy, latency, and step deltas
  • review evaluation-mode and error-type breakdowns
  • inspect failure examples captured in the run summary

Quality checks

Backend

python backend/manage.py check

Frontend

cd frontend
npm run typecheck
npm run build

Notes for reviewers

  • this repository is designed as a portfolio-quality MVP rather than a production job platform
  • async execution is intentionally lightweight and uses local threads plus file-backed job status
  • no authentication, distributed queue, or persistent database models are used for benchmark artifacts

Notes

  • OPENAI_API_KEY is required.
  • LANGCHAIN_API_KEY is optional and only used for tracing.
  • the project is intentionally small so it can serve as a portfolio starter repository.
  • the frontend now depends on react-router-dom, so run npm install again after pulling the latest changes.

Next improvements

  • compare multiple LLMs
  • expand the dataset
  • add error analysis by category
  • add regression tests for benchmark consistency
