# Semantic Search using DuckDB SQL and OpenAI Embeddings


DuckDB is an increasingly popular analytical database, known for its speed, simplicity, and ability to handle large-scale data analysis directly from your laptop or server. Its lightweight design and SQL compatibility make it a great choice for modern data science workflows.

In this cookbook, we integrate DuckDB with the OpenAI API to run semantic search over the arXiv paper abstracts dataset: loading the data, generating embeddings, and running similarity queries in SQL.

This notebook demonstrates how to:

- Load the [arXiv](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts) paper abstracts dataset into DuckDB
- Generate and store OpenAI embeddings into DuckDB
- Embed a search query with the OpenAI embeddings endpoint
- Perform semantic search in DuckDB using the embedded query

## Install dependencies

In [1]:
!pip install numpy kagglehub duckdb pandas openai


## Extract the dataset and load into DuckDB
In this example, we'll use the arXiv paper abstracts dataset from Kaggle. It's a simple CSV file with titles, summaries, and category terms. Let's download it and load it into DuckDB.

In [2]:
import os

import duckdb
import kagglehub
import pandas as pd

# Download the dataset and build the path to the CSV file
path = kagglehub.dataset_download("spsayakpaul/arxiv-paper-abstracts")
path = os.path.join(path, "arxiv_data.csv")
print(path)

# Create a connection to the database
# Note: the duckdb.sql(...) calls below run on DuckDB's default in-memory
# connection, not on `conn`, so the table is not written to arxiv_data.db.
conn = duckdb.connect('arxiv_data.db')

# Load the dataset into DuckDB, limiting to 400 rows for testing
duckdb.sql(f"""
    CREATE OR REPLACE TABLE papers AS
        SELECT * FROM read_csv('{path}', header=true, parallel=false)
        LIMIT 400
""")

# Inspect the first 5 rows of the dataset
result = duckdb.sql("SELECT * FROM papers LIMIT 5").df()

result.head()





/Users/ayman/.cache/kagglehub/datasets/spsayakpaul/arxiv-paper-abstracts/versions/2/arxiv_data.csv


| | titles | summaries | terms |
|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] |


### Add an embeddings column to the schema

In [3]:
duckdb.sql("ALTER TABLE papers ADD COLUMN IF NOT EXISTS embeddings FLOAT[1024]")

# Verify the new column has been added by inspecting the schema
duckdb.sql("PRAGMA table_info(papers)")

┌───────┬────────────┬─────────────┬─────────┬────────────┬─────────┐
│  cid  │    name    │    type     │ notnull │ dflt_value │   pk    │
│ int32 │  varchar   │   varchar   │ boolean │  varchar   │ boolean │
├───────┼────────────┼─────────────┼─────────┼────────────┼─────────┤
│     0 │ titles     │ VARCHAR     │ false   │ NULL       │ false   │
│     1 │ summaries  │ VARCHAR     │ false   │ NULL       │ false   │
│     2 │ terms      │ VARCHAR     │ false   │ NULL       │ false   │
│     3 │ embeddings │ FLOAT[1024] │ false   │ NULL       │ false   │
└───────┴────────────┴─────────────┴─────────┴────────────┴─────────┘

## Generate embeddings for the dataset

There are multiple ways to create embeddings for data stored in DuckDB. We could either:

1. Loop through batches of inputs in Python, call the embedding model, and store each batch in the database.

2. Create a custom DuckDB function (UDF) that calls the model, and write the embeddings in a single SQL statement.

In this notebook, I'll go with option 2 to keep the experience "SQL first": it defines a reusable SQL embedding function that can serve other use cases as well.

### Defining an OpenAI embeddings UDF for DuckDB

The function below requests the "float" encoding format and sets the embedding dimensions to 1024, matching the FLOAT[1024] size of the embeddings column we added to the table.

In [None]:
from duckdb.typing import VARCHAR
import openai

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()

# Define the UDF for embedding a text input using the OpenAI API.
def embed_openai(text: str) -> list[float]:
    """
    DuckDB UDF for embedding a text input using the OpenAI API.
    Returns the embedding as a list of 1024 floats.
    """
    model = "text-embedding-3-small"
    response = client.embeddings.create(
        model=model,
        input=text,
        encoding_format="float",
        dimensions=1024
    )

    # One input string yields one embedding, at position 0.
    return response.data[0].embedding

# Register the UDF with DuckDB: one VARCHAR argument, FLOAT[1024] result.
duckdb.create_function("embed_openai", embed_openai, [VARCHAR], "FLOAT[1024]")

<duckdb.duckdb.DuckDBPyConnection at 0x117d2ce30>

*Note on performance:* The above function calls OpenAI's embeddings API once for every row. Depending on your dataset size, this can be quite slow. For larger datasets, consider [upgrading this function](https://lukaszrogalski.substack.com/p/python-udfs-in-duckdb) to work with aggregated data and pass multiple texts (batches) to each OpenAI embeddings call.

Now that we’ve registered the function with DuckDB, we can use it like any native function in a SQL query:
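
One way to reduce the number of API calls is to send several texts per request. The sketch below is a minimal illustration, not this notebook's method: `chunked` and `embed_in_batches` are hypothetical helper names, the batch size of 100 is an arbitrary assumption, and the client is passed in as an argument so the same code works with `openai.OpenAI()` or with a stub.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(client, texts: Sequence[str], batch_size: int = 100,
                     model: str = "text-embedding-3-small",
                     dimensions: int = 1024) -> list[list[float]]:
    """Embed `texts` with one embeddings request per batch.

    `client` is expected to expose `embeddings.create(...)` like the
    `openai.OpenAI()` client used above. Hypothetical helper, for illustration.
    """
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(
            model=model,
            input=list(batch),
            encoding_format="float",
            dimensions=dimensions,
        )
        # Sort by `index` so each vector lines up with its input text.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

With 400 rows and a batch size of 100, this would issue 4 requests instead of 400. The returned vectors would still need to be written back to the table, which this sketch leaves out.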

In [6]:
duckdb.sql("SELECT embed_openai('Which papers are related to quantum computing?') AS query_embedding;")


### Generating Embeddings

With the embedding function in place, we can now use it to generate embeddings and write them into our table via SQL. The query below runs on every row that does not yet have an embedding, calling the OpenAI embedding UDF we defined earlier. On 400 rows, it should take around 2 minutes to complete.

In [7]:
duckdb.query("""
UPDATE papers
SET embeddings = embed_openai(
  COALESCE(titles, '') || ' ' || COALESCE(summaries, '')
)
WHERE embeddings IS NULL
""")
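
As a rough back-of-the-envelope check on the two-minute estimate for 400 rows, and assuming run time scales linearly with row count (a simplification that ignores rate limits and network variance), the per-row cost and a projection for a hypothetical 100,000-row table work out as:

$$
\frac{120\ \text{s}}{400\ \text{rows}} = 0.3\ \text{s/row}, \qquad 100{,}000\ \text{rows} \times 0.3\ \text{s/row} = 30{,}000\ \text{s} \approx 8.3\ \text{hours}
$$

This is why batching the embedding calls matters once the table grows well beyond a few hundred rows.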

Inspecting the first 5 rows of the dataset we can see that the embeddings have been created for every row.

In [None]:
result = duckdb.sql("SELECT * FROM papers LIMIT 5").df()
result.head()

| | titles | summaries | terms | embeddings |
|---|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] | [-0.018463377, -0.03012074, 0.010921418, -0.04... |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] | [-0.015125522, -0.020882344, 0.042208467, 0.04... |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] | [0.00833142, -0.021476267, 0.037161183, 0.0197... |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] | [0.014294317, -0.020803811, 0.03544353, 0.0138... |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] | [-0.009169946, 0.0074990084, 0.011346209, -0.0... |


## Running a Similarity Search with SQL

Now that we have embeddings for each paper, we can use them to perform a semantic similarity search.

To do this, we can use one of DuckDB's native array functions, such as array_cosine_similarity, which computes the cosine similarity between two vectors.

Below we define a function that uses our embed_openai UDF to embed the query text, then uses array_cosine_similarity to score the query embedding against each paper embedding and returns the k best matches.
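
To make the scoring concrete, here is a plain-Python sketch of what cosine similarity computes and how a brute-force top-k ranking follows from it. It is illustrative only: `cosine_similarity` and `top_k` are hypothetical helpers, not DuckDB APIs, and in the notebook DuckDB's array_cosine_similarity does this work natively inside the query.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms.

    Assumes both vectors have the same length and neither is all zeros.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: dict, k: int = 5) -> list:
    """Score every document against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Scores range from -1 (opposite direction) to 1 (same direction), so sorting in descending order puts the closest matches first, which is what the ORDER BY score DESC clause does in the SQL version.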



In [9]:
def search_papers(query_text: str, k: int = 5):
    return duckdb.execute("""
        WITH q AS (
            SELECT embed_openai(?) AS qe
        )
        SELECT
            titles,
            summaries,
            array_cosine_similarity(embeddings, q.qe) AS score
        FROM papers, q
        WHERE embeddings IS NOT NULL
        ORDER BY score DESC
        LIMIT ?
    """, [query_text, k]).fetchdf()

# Test the function
search_papers("What are the research papers on image segmentation for the medical field?")

| | titles | summaries | score |
|---|---|---|---|
| 0 | Medical Matting: A New Perspective on Medical ... | In medical image segmentation, it is difficult... | 0.579598 |
| 1 | Self-Supervision with Superpixels: Training Fe... | Few-shot semantic segmentation (FSS) has great... | 0.570959 |
| 2 | A Spatial Guided Self-supervised Clustering Ne... | The segmentation of medical images is a fundam... | 0.56201 |
| 3 | Superpixel-Guided Label Softening for Medical ... | Segmentation of objects of interest is one of ... | 0.561668 |
| 4 | Efficient and Generic Interactive Segmentation... | Semantic segmentation of medical images is an ... | 0.560177 |


### Optimizing queries with an index

While the above search query works fine on 400 rows, it will get much slower as the dataset grows to hundreds of thousands of rows. Without an index, DuckDB compares the query embedding with every document embedding to find the most similar ones.

In order to speed up the vector search, we can use ANN (Approximate Nearest Neighbor) with [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world), supported via DuckDB's vector [similarity search extension](https://duckdb.org/2024/05/03/vector-similarity-search-vss.html).

Let's try that out.

In [None]:
# Install the extension
duckdb.sql("INSTALL vss;")
duckdb.sql("LOAD vss;")
duckdb.sql("SET GLOBAL hnsw_enable_experimental_persistence = true;")

# Create an index on the embeddings column
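
The index-creation statement itself does not appear in the cell above. Here is a minimal sketch of what it could look like, based on the VSS extension's `CREATE INDEX ... USING HNSW` syntax. The index name `papers_embeddings_hnsw` is a placeholder of mine, and the cosine metric is an assumption chosen to match the cosine-based ranking used earlier.

```python
# Hypothetical index name; the name used in the original notebook is not shown.
INDEX_NAME = "papers_embeddings_hnsw"

# Assumption: a cosine metric, to match the cosine-based ranking used earlier.
create_index_sql = f"""
CREATE INDEX {INDEX_NAME}
ON papers USING HNSW (embeddings)
WITH (metric = 'cosine')
"""

print(create_index_sql)
# In the notebook, after `LOAD vss`, this would be executed with:
#   duckdb.sql(create_index_sql)
```

As I understand the VSS documentation, the HNSW index is only picked up for queries that order by a supported distance function (for the cosine metric, `array_cosine_distance`) against a constant vector, followed by a `LIMIT`. If that holds, `search_papers` as written above, which orders by similarity against a CTE value, may still perform a full scan. Running `EXPLAIN` on the query should show whether an HNSW index scan is used.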

Now we can verify that the index has been created and run a quick test.

In [None]:
# Verify the index has been created
duckdb.sql("SELECT * FROM duckdb_indexes();")

(Output truncated in the source. duckdb_indexes() returns the columns database_name, database_oid, schema_name, schema_oid, index_name, index_oid, table_name, table_oid, comment, tags, is_unique, is_primary, expressions, and sql.)

In [20]:
# Test the function
search_papers("What are the research papers on image segmentation for the medical field?")

| | titles | summaries | score |
|---|---|---|---|
| 0 | Medical Matting: A New Perspective on Medical ... | In medical image segmentation, it is difficult... | 0.579598 |
| 1 | Self-Supervision with Superpixels: Training Fe... | Few-shot semantic segmentation (FSS) has great... | 0.570959 |
| 2 | A Spatial Guided Self-supervised Clustering Ne... | The segmentation of medical images is a fundam... | 0.56201 |
| 3 | Superpixel-Guided Label Softening for Medical ... | Segmentation of objects of interest is one of ... | 0.561668 |
| 4 | Efficient and Generic Interactive Segmentation... | Semantic segmentation of medical images is an ... | 0.560177 |


## Conclusion

In this cookbook, we explored how to integrate OpenAI’s embedding calls as a reusable UDF in DuckDB. This approach proves especially powerful when you want to store and query embeddings directly alongside your data. By doing so, you unlock new opportunities for combining advanced data analysis with retrieval tasks, all through DuckDB’s simple and familiar SQL interface.