[IUI 2025] Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
This is the repo for the IUI 2025 paper *Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation*.
- Note: This repo serves as the latest version and a backup of the official repo at Adobe.
SQLsynth is not only an interactive data annotation tool but also an automated data synthesis tool, designed for quickly creating highly customized text-to-SQL datasets (e.g., custom schema, DB records, and query distribution).
🌐 Live Demo: Try me 👈🏻
- Database Schema Customization
- Freely create, edit, and annotate schemas in the canvas (use NL to label the semantics of database fields, which is useful for LLMs).
- 📦 --> A highly customized database schema, with meaningful descriptions
- Database Records Population
- Given a database schema, populate it with concrete records
- Rule-based method (No LLM calling)
- Recognizes different data types
- Distribution is configurable
- 📦 --> A complete, customized database full of records
- SQL Query Sampling
- Given a database, randomly sample SQL queries.
- Based on a PCFG (Probabilistic Context-Free Grammar) and other rules to extract records from a specified database (see the sketch after this feature list).
- The probability distribution is configurable (e.g., increase the number of queries with WHERE clauses or those involving a specific column).
- Syntax is customizable (e.g., support for user-defined SQL dialect).
- Filter by executing the SQL to ensure it is associated with meaningful execution results
- 📦 --> A large number of SQL queries (with a customized distribution) under the provided database
- SQL-to-Text Generation
- Convert SQL queries into NL questions
- Three stages:
- Convert the SQL query into step-by-step NL explanations by a grammar-based method.
- Conduct in-context learning on specified real-world data for style adaptation.
- Generate the NL question with LLMs.
- 📦 --> A large number of (NL, SQL) pairs under the customized database, where NL questions may be imperfect (ambiguous, lacking details, etc.)
- Text-SQL Alignment
- Mapping NL components (substrings) to SQL components (clauses)
- Error checking for the generated NL (note that the SQL itself is guaranteed to be correct)
- Used to analyze (1) what information may be missing (a SQL component fails to map to any NL component), and (2) what information may be redundant (an NL component doesn't map to any SQL component)
- Interactively highlights the correspondences visually in the UI
- Scoring: Given an NL-SQL pair, evaluate its quality on a scale from 0 to 100, with an analysis in NL
- 📦 --> A large number of refined (NL, SQL) pairs under the customized database
- Dataset Statistics & Visualization
- Upload and analyze existing SQL query datasets
- Assist users in tracking their datasets from a dataset-level perspective
- Comprehensive statistics dashboard with summary metrics (total queries, unique keywords, average complexity), including:
- SQL structure distribution
- Keyword frequency distribution
- Clause number distribution
- Column and table usage patterns
- Query complexity distribution
- Reference value distribution
- 📦 --> Insights into dataset characteristics and qualities
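The SQL query sampling above is driven by a probabilistic grammar. As a rough, self-contained sketch of the idea (this is a toy grammar, not the one implemented in `SQL_synthesizer.py`; the table, columns, and probabilities are placeholders), rule weights play the role of the configurable distribution:

```python
import random

# Toy PCFG: each non-terminal maps to weighted expansions.
# Tokens not in the grammar are treated as terminals.
PCFG = {
    "QUERY": [(["SELECT", "COLS", "FROM", "TABLE", "MAYBE_WHERE"], 1.0)],
    "COLS": [(["*"], 0.3), (["name"], 0.4), (["name", ",", "age"], 0.3)],
    "TABLE": [(["users"], 1.0)],
    "MAYBE_WHERE": [([], 0.7), (["WHERE", "age", ">", "18"], 0.3)],
}

def expand(symbol):
    """Recursively expand a symbol by sampling one of its weighted expansions."""
    if symbol not in PCFG:
        return [symbol]  # terminal token
    expansions, weights = zip(*PCFG[symbol])
    chosen = random.choices(expansions, weights=weights, k=1)[0]
    tokens = []
    for sym in chosen:
        tokens.extend(expand(sym))
    return tokens

print(" ".join(expand("QUERY")))  # e.g. "SELECT name FROM users WHERE age > 18"
```

Raising or lowering a rule's weight (e.g., the `WHERE` branch of `MAYBE_WHERE`) is what "configurable distribution" means in practice.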
- Clone the repository:
```bash
git clone https://github.com/yourusername/SQLsynth.git
cd SQLsynth
```
- Install Python dependencies:
```bash
cd backend
pip install flask
pip install flask_cors
pip install sql-metadata
pip install openai
pip install nltk
pip install spacy
pip install sqlparse
python -m spacy download en_core_web_sm
```
- Configure LLM API:
  - Open `backend/openai_api.py`
  - Implement your own `get_openai_response()` function
  - The function should take a string prompt as input and return a string response (an example sketch follows below)
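A minimal sketch of such a function, assuming the `openai>=1.0` Python client (already in the pip list above), an `OPENAI_API_KEY` environment variable, and an example model name; adapt it to whatever LLM provider you actually use:

```python
# backend/openai_api.py -- example sketch only; adapt to your provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_openai_response(prompt: str) -> str:
    """Take a string prompt and return the model's string response."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name, not a project requirement
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```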
- Install Node.js dependencies:
```bash
cd frontend
npm install
```
- If you encounter missing dependencies, please run `npm install` for the necessary packages based on the pop-up instructions.
- Start the Backend Server:
```bash
cd backend
python server.py
```
The backend will run on http://localhost:5001 by default.
- Start the Frontend:
```bash
cd frontend
npm start
```
The frontend will run on http://localhost:3000 by default.
- Open your browser and navigate to http://localhost:3000
- Enjoy! 🎉
```
SQLsynth_repo/
├── backend/
│ ├── server.py # Main Flask server
│ ├── SQL_synthesizer.py # Query synthesis engine
│ ├── SQL2NL_clean.py # Rule-based SQL parser
│ ├── llm_analysis.py # LLM prompts and analysis
│ ├── records_synthesizer.py # Record generation
│ ├── ICL_retriever.py # Example retrieval
│ ├── db_handling.py # Database utilities
│ ├── openai_api.py # LLM API interface
│ ├── evaluation_steps.py # Evaluation tools
│ ├── *_config.json # Configuration files
│ ├── output_data/ # Generated datasets
│ └── temp_db/ # Temporary databases
├── frontend/
│ ├── src/
│ │ ├── App.jsx # Main app component
│ │ ├── SchemaTab.jsx # Schema designer
│ │ ├── DatabaseTab.jsx # Record management
│ │ ├── DatasetTab.jsx # Dataset synthesis
│ │ └── AnalysisTab.jsx # Analysis interface
│ ├── public/ # Static assets
│ └── package.json # Dependencies
├── user_study/
│ └── spider_schemas/ # 166 Spider schemas
└── README.md
```
- `server.py`: Flask server handling all API endpoints
- `SQL_synthesizer.py`: PCFG-based SQL query generation
- `SQL2NL_clean.py`: Rule-based SQL decomposition and explanation
- `llm_analysis.py`: LLM prompts and analysis functions
- `records_synthesizer.py`: Database record generation with constraint satisfaction
- `ICL_retriever.py`: In-context learning example retrieval
- `db_handling.py`: Database operations and utilities
- `openai_api.py`: LLM API interface (user-implemented)
- `evaluation_steps.py`: Evaluation utilities
- `App.jsx`: Main application component with global state management
- `SchemaTab.jsx`: Interactive schema designer
- `DatabaseTab.jsx`: Database record management interface
- `DatasetTab.jsx`: Dataset synthesis and download
- `AnalysisTab.jsx`: SQL-NL alignment analysis
- `SQLSubexpressionCorrespondence.jsx`: Visual representation of SQL components
- `manual_config.json`: Manual probability configuration for SQL synthesis
- `learned_config.json`: Learned probability distribution from existing datasets
- `spider_example_pool.json`: Example pool for in-context learning
- Import Schema: Drag and drop a JSON schema file
- Edit Schema: Add/remove tables and columns
- Define Relationships: Specify primary and foreign keys
- Add Descriptions: Document tables and columns for better NL generation
Schema format example:
```json
{
"users": {
"comment": "User information table",
"columns": [
{
"field": "user_id",
"type": "text",
"isPrimary": true,
"comment": "Unique user identifier"
},
{
"field": "username",
"type": "text",
"comment": "User's login name"
}
]
}
}
```
- Click "Generate Records" to create synthetic data
- Specify the number of records to generate
- Records respect foreign key constraints and data types
- Export records to JSON
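As a rough illustration of the rule-based idea behind record population (this is not `records_synthesizer.py`; the table layout and helper below are made up for the example), foreign-key columns can be filled by sampling from the already-generated primary keys of the parent table so that referential constraints hold:

```python
import random
import string

def random_text(prefix, i):
    """Hypothetical helper producing a unique-ish text value for a text column."""
    return f"{prefix}_{i}_" + "".join(random.choices(string.ascii_lowercase, k=4))

def generate_records(num):
    """Generate a parent table first, then a child table whose FK samples parent PKs."""
    users = [{"user_id": random_text("u", i), "username": random_text("name", i)}
             for i in range(num)]
    orders = [{"order_id": random_text("o", i),
               "user_id": random.choice(users)["user_id"]}  # respects the FK constraint
              for i in range(num)]
    return {"users": users, "orders": orders}

print(generate_records(3))
```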
- Configure query distribution (number of tables, columns, clauses)
- Generate individual queries or batch synthesis
- View step-by-step SQL decomposition
- Get suggested natural language descriptions
- Check alignment between SQL and NL
- Evaluate the quality and equivalence between natural language questions and SQL queries
- Score with explanations
- Upload existing SQL query datasets
- View comprehensive statistics:
- Keyword distribution
- Query structure patterns
- Clause complexity
- Column and table usage
- Query complexity metrics
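If you want to reproduce a statistic such as the keyword frequency distribution outside the UI, a minimal sketch with `sqlparse` (already among the backend dependencies) could look like this; it is not the dashboard's exact implementation:

```python
from collections import Counter

import sqlparse
from sqlparse.tokens import Keyword

def keyword_distribution(queries):
    """Count SQL keywords (SELECT, WHERE, GROUP BY, ...) across a list of queries."""
    counts = Counter()
    for sql in queries:
        for statement in sqlparse.parse(sql):
            for token in statement.flatten():
                if token.ttype in Keyword:  # also matches Keyword.DML, Keyword.DDL, ...
                    counts[token.normalized] += 1
    return counts

print(keyword_distribution(["SELECT name FROM users WHERE age > 18"]))
```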
While the human-in-the-loop workflow guarantees data quality, you can also opt for large-scale dataset generation without the UI:
```python
from server import auto_synthetic_data
synthetic_data = auto_synthetic_data(
schema_path="backend/saved_frontend_schema.json",
save_path="backend/output_data/synthetic_data.jsonl",
config_path="backend/learned_config.json",
synthesized_DB_records_path="backend/output_data/DB_records.json",
example_path="backend/spider_example_pool.json",
data_num=2000
)
```
Parameters:
- `schema_path`: Path to the database schema JSON file
- `save_path`: Output file path for synthetic data
- `config_path`: Configuration file for query distribution
- `synthesized_DB_records_path`: Path to save generated database records
- `example_path`: Path to example pool for in-context learning
- `data_num`: Number of SQL-NL pairs to generate
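Assuming `save_path` points to a JSON Lines file (one generated example per line), the resulting dataset can be loaded as follows; the exact fields of each record depend on the synthesis pipeline, so inspect one example first:

```python
import json

# Load the synthesized dataset; each non-empty line is one JSON record.
with open("backend/output_data/synthetic_data.jsonl") as f:
    dataset = [json.loads(line) for line in f if line.strip()]

print(len(dataset), "examples")
print(dataset[0])  # inspect the fields of one generated (NL, SQL) pair
```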
Configuration details:
Adjust probabilities in `learned_config.json` or `manual_config.json`:
```json
{
"sample_table_probs": [0.5, 0.3, 0.2],
"sample_column_probs": [0.4, 0.3, 0.2, 0.1],
"select_star_prob": 0.2,
"where_clause_prob": 0.3,
"group_by_clause_prob": 0.2,
"order_by_clause_prob": 0.3,
"having_clause_prob": 0.3,
"limit_clause_count": 0.1
}
```
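Since `learned_config.json` is described as a distribution learned from existing datasets, here is a simplified sketch of how such values could be estimated (not the repo's actual learning code, and using naive substring matching): take each clause probability as the fraction of queries containing that clause.

```python
import json

def estimate_clause_probs(queries):
    """Estimate clause probabilities as the fraction of queries using each clause."""
    n = len(queries)
    upper = [q.upper() for q in queries]
    return {
        "where_clause_prob": sum("WHERE" in q for q in upper) / n,
        "group_by_clause_prob": sum("GROUP BY" in q for q in upper) / n,
        "order_by_clause_prob": sum("ORDER BY" in q for q in upper) / n,
        "having_clause_prob": sum("HAVING" in q for q in upper) / n,
    }

probs = estimate_clause_probs([
    "SELECT name FROM users WHERE age > 18",
    "SELECT COUNT(*) FROM users GROUP BY username",
])
print(json.dumps(probs, indent=2))
```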
To change the backend port, edit `backend/server.py`:
```python
app.run(debug=True, host="0.0.0.0", port=YOUR_PORT)
```
To change the frontend port:
```bash
# macOS/Linux
PORT=4000 npm start
# Windows
set PORT=4000 && npm start
```
Replace localhost with your server IP in `frontend/src/App.jsx`:
```js
const ip = 'your.server.ip'; // or domain name
const port = 5001;
```

API reference:
Generate step-by-step explanation for a SQL query.
Request:
```json
{
"sql": "SELECT name FROM users WHERE age > 18",
"schema": {...}
}
```
Response:
```json
{
"explanation_data": [...]
}
```

Get suggested natural language description for SQL.
Request:
```json
{
"sql": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
```
Response:
```json
{
"nl_query": "What are the names of users older than 18?",
"examples": [...]
}
```

Check alignment between NL and SQL components.
Request:
```json
{
"sql": "...",
"nl": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
```
Response:
```json
{
"alignment_data": [...],
"uncovered_substrings": [...]
}
```

Generate synthetic database records.
Request:
```json
{
"schema": {...},
"num": 100
}
```
Response:
```json
{
"synthetic_records": {...}
}
```

Generate a random SQL query.
Request:
```json
{
"schema": {...},
"records": {...}
}
```
Response:
```json
{
"synthetic_sql": "SELECT ...",
"config": {...}
}
```

Analyze an uploaded SQL query dataset.
Request: Multipart form data with file upload
Response:
```json
{
"totalQueries": 1000,
"averageComplexity": 12.5,
"keywordDistribution": {...},
"structureDistribution": {...},
...
}
```

If you use SQLsynth in your research, please cite:
```bibtex
@inproceedings{Tian_2025,
  series    = {IUI '25},
  title     = {Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation},
  url       = {http://dx.doi.org/10.1145/3708359.3712083},
  doi       = {10.1145/3708359.3712083},
  booktitle = {Proceedings of the 30th International Conference on Intelligent User Interfaces},
  publisher = {ACM},
  author    = {Tian, Yuan and Lee, Daniel and Wu, Fei and Mai, Tung and Qian, Kun and Sahai, Siddhartha and Zhang, Tianyi and Li, Yunyao},
  year      = {2025},
  month     = mar,
  pages     = {1398--1425},
  collection = {IUI '25}
}
```
For questions or feedback, please open an issue on GitHub or contact me at tian211@purdue.edu.






