[IUI 2025] Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
This is the repo for the IUI 2025 paper *Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation*.
- Note: This repo serves as the latest version and a backup of the official repo at Adobe.
SQLsynth is not only an interactive data annotation tool but also an automated data synthesis tool, designed for quickly creating highly customized text-to-SQL datasets (e.g., custom schema, DB records, and query distribution).
🌐 Live Demo: Try me 👈🏻
- Database Schema Customization
- Freely create, edit, and annotate schemas in the canvas (use NL to label the semantics of database fields, which is useful for LLMs).
- 📦 --> A highly customized database schema, with meaningful descriptions
- Database Records Population
- Given a database schema, populate it with concrete records
- Rule-based method (No LLM calling)
- Recognizes different data types
- Distribution is configurable
- 📦 --> A complete, customized database full of records
- SQL Query Sampling
- Given a database, randomly sample SQL queries.
- Based on a PCFG (Probabilistic Context-Free Grammar) and other rules to extract records from a specified database (see the sketch after this feature list).
- The probability distribution is configurable (e.g., increase the number of queries with WHERE clauses or those involving a specific column).
- Syntax is customizable (e.g., support for user-defined SQL dialect).
- Filter by executing the SQL to ensure it is associated with meaningful execution results
- 📦 --> A large number of SQL queries (with a customized distribution) under the provided database
- SQL-to-Text Generation
- Convert SQL queries into NL questions
- Three stages:
- Convert the SQL query into step-by-step NL explanations by a grammar-based method.
- Conduct in-context learning on specified real-world data for style adaptation.
- Generate the NL question with LLMs.
- 📦 --> A large number of (NL, SQL) pairs under the customized database, where NL questions may be imperfect (ambiguous, lacking details, etc.)
- Text-SQL Alignment
- Mapping NL components (substrings) to SQL components (clauses)
- Error checking for the generated NL (note that the SQL itself is guaranteed to be correct)
- Used to analyze (1) what information may be missing (a SQL component fails to map to any NL component), and (2) what information may be redundant (an NL component doesn't map to any SQL component)
- Interactively highlights the correspondences visually in the UI
- Scoring: Given an NL-SQL pair, evaluate its quality on a scale from 0 to 100, with an analysis in NL
- 📦 --> A large number of refined (NL, SQL) pairs under the customized database
- Dataset Statistics & Visualization
- Upload and analyze existing SQL query datasets
- Assist users in tracking their datasets from a dataset-level perspective
- Comprehensive statistics dashboard with summary metrics (total queries, unique keywords, average complexity), including:
- SQL structure distribution
- Keyword frequency distribution
- Clause number distribution
- Column and table usage patterns
- Query complexity distribution
- Reference value distribution
- 📦 --> Insights into dataset characteristics and qualities
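The SQL query sampling above is driven by a probabilistic grammar. As a rough, self-contained sketch of the idea (this is a toy grammar, not the one implemented in `SQL_synthesizer.py`; the table, columns, and probabilities are placeholders), rule weights play the role of the configurable distribution:

```python
import random

# Toy PCFG: each non-terminal maps to weighted expansions.
# Tokens not in the grammar are treated as terminals.
PCFG = {
    "QUERY": [(["SELECT", "COLS", "FROM", "TABLE", "MAYBE_WHERE"], 1.0)],
    "COLS": [(["*"], 0.3), (["name"], 0.4), (["name", ",", "age"], 0.3)],
    "TABLE": [(["users"], 1.0)],
    "MAYBE_WHERE": [([], 0.7), (["WHERE", "age", ">", "18"], 0.3)],
}

def expand(symbol):
    """Recursively expand a symbol by sampling one of its weighted expansions."""
    if symbol not in PCFG:
        return [symbol]  # terminal token
    expansions, weights = zip(*PCFG[symbol])
    chosen = random.choices(expansions, weights=weights, k=1)[0]
    tokens = []
    for sym in chosen:
        tokens.extend(expand(sym))
    return tokens

print(" ".join(expand("QUERY")))  # e.g. "SELECT name FROM users WHERE age > 18"
```

Raising or lowering a rule's weight (e.g., the `WHERE` branch of `MAYBE_WHERE`) is what "configurable distribution" means in practice.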
- Clone the repository:
```bash
git clone https://github.com/yourusername/SQLsynth.git
cd SQLsynth
```
- Install Python dependencies:
```bash
cd backend
pip install flask
pip install flask_cors
pip install sql-metadata
pip install openai
pip install nltk
pip install spacy
pip install sqlparse
python -m spacy download en_core_web_sm
```
- Configure LLM API:
  - Open `backend/openai_api.py`
  - Implement your own `get_openai_response()` function
  - The function should take a string prompt as input and return a string response (an example sketch follows below)
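A minimal sketch of such a function, assuming the `openai>=1.0` Python client (already in the pip list above), an `OPENAI_API_KEY` environment variable, and an example model name; adapt it to whatever LLM provider you actually use:

```python
# backend/openai_api.py -- example sketch only; adapt to your provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_openai_response(prompt: str) -> str:
    """Take a string prompt and return the model's string response."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name, not a project requirement
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```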
- Install Node.js dependencies:
```bash
cd frontend
npm install
```
- If you encounter missing dependencies, please run `npm install` for the necessary packages based on the pop-up instructions.
- Start the Backend Server:
```bash
cd backend
python server.py
```
The backend will run on http://localhost:5001 by default.
- Start the Frontend:
```bash
cd frontend
npm start
```
The frontend will run on http://localhost:3000 by default.
- Open your browser and navigate to http://localhost:3000
- Enjoy! 🎉
```
SQLsynth_repo/
├── backend/
│ ├── server.py # Main Flask server
│ ├── SQL_synthesizer.py # Query synthesis engine
│ ├── SQL2NL_clean.py # Rule-based SQL parser
│ ├── llm_analysis.py # LLM prompts and analysis
│ ├── records_synthesizer.py # Record generation
│ ├── ICL_retriever.py # Example retrieval
│ ├── db_handling.py # Database utilities
│ ├── openai_api.py # LLM API interface
│ ├── evaluation_steps.py # Evaluation tools
│ ├── *_config.json # Configuration files
│ ├── output_data/ # Generated datasets
│ └── temp_db/ # Temporary databases
├── frontend/
│ ├── src/
│ │ ├── App.jsx # Main app component
│ │ ├── SchemaTab.jsx # Schema designer
│ │ ├── DatabaseTab.jsx # Record management
│ │ ├── DatasetTab.jsx # Dataset synthesis
│ │ └── AnalysisTab.jsx # Analysis interface
│ ├── public/ # Static assets
│ └── package.json # Dependencies
├── user_study/
│ └── spider_schemas/ # 166 Spider schemas
└── README.md
```
- `server.py`: Flask server handling all API endpoints
- `SQL_synthesizer.py`: PCFG-based SQL query generation
- `SQL2NL_clean.py`: Rule-based SQL decomposition and explanation
- `llm_analysis.py`: LLM prompts and analysis functions
- `records_synthesizer.py`: Database record generation with constraint satisfaction
- `ICL_retriever.py`: In-context learning example retrieval
- `db_handling.py`: Database operations and utilities
- `openai_api.py`: LLM API interface (user-implemented)
- `evaluation_steps.py`: Evaluation utilities
- `App.jsx`: Main application component with global state management
- `SchemaTab.jsx`: Interactive schema designer
- `DatabaseTab.jsx`: Database record management interface
- `DatasetTab.jsx`: Dataset synthesis and download
- `AnalysisTab.jsx`: SQL-NL alignment analysis
- `SQLSubexpressionCorrespondence.jsx`: Visual representation of SQL components
- `manual_config.json`: Manual probability configuration for SQL synthesis
- `learned_config.json`: Learned probability distribution from existing datasets
- `spider_example_pool.json`: Example pool for in-context learning
- Import Schema: Drag and drop a JSON schema file
- Edit Schema: Add/remove tables and columns
- Define Relationships: Specify primary and foreign keys
- Add Descriptions: Document tables and columns for better NL generation
Schema format example:
```json
{
"users": {
"comment": "User information table",
"columns": [
{
"field": "user_id",
"type": "text",
"isPrimary": true,
"comment": "Unique user identifier"
},
{
"field": "username",
"type": "text",
"comment": "User's login name"
}
]
}
}
```
- Click "Generate Records" to create synthetic data
- Specify the number of records to generate
- Records respect foreign key constraints and data types
- Export records to JSON
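As a rough illustration of the rule-based idea behind record population (this is not `records_synthesizer.py`; the table layout and helper below are made up for the example), foreign-key columns can be filled by sampling from the already-generated primary keys of the parent table so that referential constraints hold:

```python
import random
import string

def random_text(prefix, i):
    """Hypothetical helper producing a unique-ish text value for a text column."""
    return f"{prefix}_{i}_" + "".join(random.choices(string.ascii_lowercase, k=4))

def generate_records(num):
    """Generate a parent table first, then a child table whose FK samples parent PKs."""
    users = [{"user_id": random_text("u", i), "username": random_text("name", i)}
             for i in range(num)]
    orders = [{"order_id": random_text("o", i),
               "user_id": random.choice(users)["user_id"]}  # respects the FK constraint
              for i in range(num)]
    return {"users": users, "orders": orders}

print(generate_records(3))
```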
- Configure query distribution (number of tables, columns, clauses)
- Generate individual queries or batch synthesis
- View step-by-step SQL decomposition
- Get suggested natural language descriptions
- Check alignment between SQL and NL
- Evaluate the quality and equivalence between natural language questions and SQL queries
- Score with explanations
- Upload existing SQL query datasets
- View comprehensive statistics:
- Keyword distribution
- Query structure patterns
- Clause complexity
- Column and table usage
- Query complexity metrics
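If you want to reproduce a statistic such as the keyword frequency distribution outside the UI, a minimal sketch with `sqlparse` (already among the backend dependencies) could look like this; it is not the dashboard's exact implementation:

```python
from collections import Counter

import sqlparse
from sqlparse.tokens import Keyword

def keyword_distribution(queries):
    """Count SQL keywords (SELECT, WHERE, GROUP BY, ...) across a list of queries."""
    counts = Counter()
    for sql in queries:
        for statement in sqlparse.parse(sql):
            for token in statement.flatten():
                if token.ttype in Keyword:  # also matches Keyword.DML, Keyword.DDL, ...
                    counts[token.normalized] += 1
    return counts

print(keyword_distribution(["SELECT name FROM users WHERE age > 18"]))
```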
While the human-in-the-loop workflow guarantees data quality, you can also opt for large-scale dataset generation without the UI:
```python
from server import auto_synthetic_data
synthetic_data = auto_synthetic_data(
schema_path="backend/saved_frontend_schema.json",
save_path="backend/output_data/synthetic_data.jsonl",
config_path="backend/learned_config.json",
synthesized_DB_records_path="backend/output_data/DB_records.json",
example_path="backend/spider_example_pool.json",
data_num=2000
)
```
Parameters:
- `schema_path`: Path to the database schema JSON file
- `save_path`: Output file path for synthetic data
- `config_path`: Configuration file for query distribution
- `synthesized_DB_records_path`: Path to save generated database records
- `example_path`: Path to example pool for in-context learning
- `data_num`: Number of SQL-NL pairs to generate
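Assuming `save_path` points to a JSON Lines file (one generated example per line), the resulting dataset can be loaded as follows; the exact fields of each record depend on the synthesis pipeline, so inspect one example first:

```python
import json

# Load the synthesized dataset; each non-empty line is one JSON record.
with open("backend/output_data/synthetic_data.jsonl") as f:
    dataset = [json.loads(line) for line in f if line.strip()]

print(len(dataset), "examples")
print(dataset[0])  # inspect the fields of one generated (NL, SQL) pair
```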
Configuration details:
Adjust probabilities in `learned_config.json` or `manual_config.json`:
```json
{
"sample_table_probs": [0.5, 0.3, 0.2],
"sample_column_probs": [0.4, 0.3, 0.2, 0.1],
"select_star_prob": 0.2,
"where_clause_prob": 0.3,
"group_by_clause_prob": 0.2,
"order_by_clause_prob": 0.3,
"having_clause_prob": 0.3,
"limit_clause_count": 0.1
}
```
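Since `learned_config.json` is described as a distribution learned from existing datasets, here is a simplified sketch of how such values could be estimated (not the repo's actual learning code, and using naive substring matching): take each clause probability as the fraction of queries containing that clause.

```python
import json

def estimate_clause_probs(queries):
    """Estimate clause probabilities as the fraction of queries using each clause."""
    n = len(queries)
    upper = [q.upper() for q in queries]
    return {
        "where_clause_prob": sum("WHERE" in q for q in upper) / n,
        "group_by_clause_prob": sum("GROUP BY" in q for q in upper) / n,
        "order_by_clause_prob": sum("ORDER BY" in q for q in upper) / n,
        "having_clause_prob": sum("HAVING" in q for q in upper) / n,
    }

probs = estimate_clause_probs([
    "SELECT name FROM users WHERE age > 18",
    "SELECT COUNT(*) FROM users GROUP BY username",
])
print(json.dumps(probs, indent=2))
```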
To change the backend port, edit `backend/server.py`:
```python
app.run(debug=True, host="0.0.0.0", port=YOUR_PORT)
```
To change the frontend port:
```bash
# macOS/Linux
PORT=4000 npm start
# Windows
set PORT=4000 && npm start
```
Replace localhost with your server IP in `frontend/src/App.jsx`:
```js
const ip = 'your.server.ip'; // or domain name
const port = 5001;
```

API reference:
Generate step-by-step explanation for a SQL query.
Request:
```json
{
"sql": "SELECT name FROM users WHERE age > 18",
"schema": {...}
}
```
Response:
```json
{
"explanation_data": [...]
}
```

Get suggested natural language description for SQL.
Request:
```json
{
"sql": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
```
Response:
```json
{
"nl_query": "What are the names of users older than 18?",
"examples": [...]
}
```

Check alignment between NL and SQL components.
Request:
```json
{
"sql": "...",
"nl": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
```
Response:
```json
{
"alignment_data": [...],
"uncovered_substrings": [...]
}
```

Generate synthetic database records.
Request:
```json
{
"schema": {...},
"num": 100
}
```
Response:
```json
{
"synthetic_records": {...}
}
```

Generate a random SQL query.
Request:
```json
{
"schema": {...},
"records": {...}
}
```
Response:
```json
{
"synthetic_sql": "SELECT ...",
"config": {...}
}
```

Analyze an uploaded SQL query dataset.
Request: Multipart form data with file upload
Response:
```json
{
"totalQueries": 1000,
"averageComplexity": 12.5,
"keywordDistribution": {...},
"structureDistribution": {...},
...
}
```

If you use SQLsynth in your research, please cite:
```bibtex
@inproceedings{Tian_2025,
  series    = {IUI '25},
  title     = {Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation},
  url       = {http://dx.doi.org/10.1145/3708359.3712083},
  doi       = {10.1145/3708359.3712083},
  booktitle = {Proceedings of the 30th International Conference on Intelligent User Interfaces},
  publisher = {ACM},
  author    = {Tian, Yuan and Lee, Daniel and Wu, Fei and Mai, Tung and Qian, Kun and Sahai, Siddhartha and Zhang, Tianyi and Li, Yunyao},
  year      = {2025},
  month     = mar,
  pages     = {1398--1425},
  collection = {IUI '25}
}
```
For questions or feedback, please open an issue on GitHub or contact me at tian211@purdue.edu.






