SQL 🤝 LLMs

Check out our online documentation for a more comprehensive overview.

Results from the paper are available here

pip install blendsql

BlendSQL is a superset of SQLite for problem decomposition and hybrid question-answering with LLMs.

As a result, we can Blend together...

🥤 ...operations over heterogeneous data sources (e.g. tables, text, images)
🥤 ...the structured & interpretable reasoning of SQL with the generalizable reasoning of LLMs

It can be viewed as an inversion of the typical text-to-SQL paradigm, where a user calls an LLM, and the LLM calls a SQL program.

Now, the user is given the control to oversee all calls (LLM + SQL) within a unified query language.

For example, imagine we have the following table titled parks, containing info on national parks in the United States.

We can use BlendSQL to build a travel planning LLM chatbot to help us navigate the options below.

Name	Location	Area	Recreation Visitors (2022)	Description
Death Valley	California, Nevada	3,408,395.63 acres (13,793.3 km2)	1,128,862	Death Valley is the hottest, lowest, and driest place in the United States, with daytime temperatures that have exceeded 130 °F (54 °C).
Gates of the Arctic	Alaska	7,523,897.45 acres (30,448.1 km2)	9,457	The country's northernmost park protects an expanse of pure wilderness in Alaska's Brooks Range and has no park facilities.
New River Gorge	West Virginia	7,021 acres (28.4 km2)	1,593,523	The New River Gorge is the deepest river gorge east of the Mississippi River.
Katmai	Alaska	3,674,529.33 acres (14,870.3 km2)	33,908	This park on the Alaska Peninsula protects the Valley of Ten Thousand Smokes, an ash flow formed by the 1912 eruption of Novarupta.

BlendSQL allows us to ask the following questions by injecting "ingredients", which are callable functions denoted by double curly brackets ({{, }}).

Which parks don't have park facilities?

SELECT * FROM parks
    WHERE NOT {{
        LLMValidate(
            'Does this location have park facilities?',
            context=(SELECT "Name" AS "Park", "Description" FROM parks)
        )
    }}

What does the largest park in Alaska look like?

SELECT {{VQA('Describe this image.', 'parks::Image')}} FROM parks
    WHERE "Location" = 'Alaska'
    ORDER BY {{
        LLMMap(
            'Size in km2?',
            'parks::Area'
        )
    }} LIMIT 1

Which state is the park in that protects an ash flow?

SELECT "Location" FROM parks WHERE "Name" = {{
    LLMQA(
      'Which park protects an ash flow?',
      context=(SELECT "Name", "Description" FROM parks),
      options='parks::Name'
    )
}}

How many parks are located in more than 1 state?

SELECT COUNT(*) FROM parks
    WHERE {{LLMMap('How many states?', 'parks::Location')}} > 1
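To make the row-wise behavior of the last query concrete, here is a minimal sketch that mimics it with Python's built-in sqlite3 module. It does not use BlendSQL or an LLM: the function `count_states` and the SQL name `how_many_states` are hypothetical stand-ins, and the "model" is a simple rule that counts comma-separated entries in the Location column.

```python
import sqlite3

# Hypothetical stand-in for the LLM: counts comma-separated states.
# A real LLMMap call would send the column values to a language model instead.
def count_states(location: str) -> int:
    return len([part for part in location.split(",") if part.strip()])

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE parks ("Location" TEXT)')
conn.executemany(
    "INSERT INTO parks VALUES (?)",
    [("California, Nevada",), ("Alaska",), ("West Virginia",), ("Alaska",)],
)

# Register the Python function so that SQL can call it on each row's value.
conn.create_function("how_many_states", 1, count_states)

multi_state_count = conn.execute(
    'SELECT COUNT(*) FROM parks WHERE how_many_states("Location") > 1'
).fetchone()[0]
print(multi_state_count)  # 1
```

The SQL engine still does the counting and filtering; only the per-value judgment is delegated to the external function, which is the role the ingredient plays in the query above.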

Now, we have an intermediate representation for our LLM to use that is explainable, debuggable, and very effective at hybrid question-answering tasks.

For in-depth descriptions of the above queries, check out our documentation.

Features

Supports many DBMS 💾
- Currently, SQLite and PostgreSQL are functional - more to come!
Supports many models ✨
- Transformers, Llama.cpp, OpenAI, Ollama
Easily extendable to multi-modal use cases 🖼️
Smart parsing optimizes what is passed to external functions 🧠
- Traverses abstract syntax tree with sqlglot to minimize LLM function calls 🌳
Constrained decoding with outlines 🚀
LLM function caching, built on diskcache 🔑
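The two ideas of minimizing external calls and caching them can be sketched with the standard library alone. This is an illustration under stated assumptions, not BlendSQL's implementation: the split between the relational filter and the expensive function is written by hand rather than derived from a sqlglot syntax tree, `functools.lru_cache` (in memory) stands in for diskcache (on disk), and `size_in_km2` is a hypothetical regex-based stub in place of a model call.

```python
import re
import sqlite3
from functools import lru_cache

calls = {"n": 0}  # counts how often the stubbed "model" is actually invoked

@lru_cache(maxsize=None)
def size_in_km2(area: str) -> float:
    """Hypothetical stub for an LLM call: extract the km2 figure from the text."""
    calls["n"] += 1
    match = re.search(r"\(([\d,.]+) km2\)", area)
    return float(match.group(1).replace(",", ""))

rows = [
    ("California, Nevada", "3,408,395.63 acres (13,793.3 km2)"),
    ("Alaska", "7,523,897.45 acres (30,448.1 km2)"),
    ("West Virginia", "7,021 acres (28.4 km2)"),
    ("Alaska", "3,674,529.33 acres (14,870.3 km2)"),
]
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE parks ("Location" TEXT, "Area" TEXT)')
conn.executemany("INSERT INTO parks VALUES (?, ?)", rows)

# Step 1: run the cheap, purely relational filter first.
candidates = [
    row[0]
    for row in conn.execute('SELECT "Area" FROM parks WHERE "Location" = ?', ("Alaska",))
]

# Step 2: call the expensive function only on the values that survived the filter.
largest = max(candidates, key=size_in_km2)
calls_after_first_run = calls["n"]

# Step 3: asking the same question again is answered from the cache.
largest_again = max(candidates, key=size_in_km2)
calls_after_second_run = calls["n"]

print(largest)
print(calls_after_first_run, calls_after_second_run)
```

With four rows in the table, the stub runs only twice, once per Alaska row, and the repeated query triggers no further calls.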

Quickstart

from blendsql import blend, LLMQA
from blendsql.db import SQLite
from blendsql.models import OpenaiLLM, TransformersLLM
from blendsql.utils import fetch_from_hub

blendsql = """
SELECT * FROM w
WHERE city = {{
    LLMQA(
        'Which city is located 120 miles west of Sydney?',
        (SELECT * FROM documents WHERE documents MATCH 'sydney OR 120'),
        options='w::city'
    )
}}
"""
# Make our smoothie - the executed BlendSQL script
smoothie = blend(
    query=blendsql,
    db=SQLite(fetch_from_hub("1884_New_Zealand_rugby_union_tour_of_New_South_Wales_1.db")),
    blender=OpenaiLLM("gpt-3.5-turbo"),
    # If you don't have OpenAI set up, you can use the small Transformers model below instead
    # blender=TransformersLLM("Qwen/Qwen1.5-0.5B"),
    ingredients={LLMQA},
    verbose=True
)
print(smoothie.df)
print(smoothie.meta.prompts)

Citation

@article{glenn2024blendsql,
      title={BlendSQL: A Scalable Dialect for Unifying Hybrid Question Answering in Relational Algebra},
      author={Parker Glenn and Parag Pravin Dakle and Liang Wang and Preethi Raghavan},
      year={2024},
      eprint={2402.17882},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

Special thanks to those below for inspiring this project. Definitely recommend checking out the linked work below, and citing when applicable!

The authors of Binding Language Models in Symbolic Languages
- This paper was the primary inspiration for BlendSQL.
The authors of EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
- As far as I can tell, the first publication to propose unifying model calls within SQL
- Served as the inspiration for the vqa-ingredient.ipynb example
The authors of Grammar Prompting for Domain-Specific Language Generation with Large Language Models
The maintainers of the Outlines library for powering the constrained decoding capabilities of BlendSQL
- Paper at https://arxiv.org/abs/2307.09702
