# Vector Search on HuggingFace with the Hub as Backend

Datasets on the HuggingFace Hub are stored as **parquet** files. We can interact with these files using [DuckDB](https://huggingface.co/docs/hub/en/datasets-duckdb) as a fast in-memory database system. One of DuckDB's features is **vector similarity search**, which can be used with or without an index.

## Setup

In [None]:
!pip install -qU datasets duckdb sentence-transformers model2vec

## Create embeddings for the dataset

To search over the dataset, we first need an embedding for each row. We will create these with the `sentence-transformers` library, using a static embedding model loaded through `model2vec`.

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec('minishlab/potion-base-8M')
model = SentenceTransformer(modules=[static_embedding])

Next we will load the [`ai-blueprint/fineweb-bbc-news`](https://huggingface.co/datasets/ai-blueprint/fineweb-bbc-news) dataset.

In [None]:
from datasets import load_dataset

dataset = load_dataset('ai-blueprint/fineweb-bbc-news')

In [None]:
dataset

Normally, we would split long documents into smaller chunks before embedding them, because a single vector for a long text loses detail. To keep this example simple, we will just create one embedding for the full text of each row.
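
For reference, the chunking step that this notebook skips could look roughly like the sketch below. It splits on whitespace and uses overlapping windows of words; the function name `chunk_words` and the `chunk_size` and `overlap` defaults are hypothetical choices for illustration, not something the rest of this notebook depends on.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded as its own row, typically alongside a reference to the document it came from.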

In [None]:
def create_embeddings(batch):
    embeddings = model.encode(batch['text'], convert_to_numpy=True)
    batch['embeddings'] = embeddings.tolist()
    return batch


dataset = dataset.map(create_embeddings, batched=True)

In [None]:
# Upload the dataset, now including the embeddings column, back to the Hub
dataset.push_to_hub('ai-blueprint/fineweb-bbc-news-embeddings')

## Vector Search the HuggingFace Hub

We can now perform vector search on the dataset using `duckdb`. We can decide whether to use an index. Searching without an index compares the query against every row, so it is slower but exact. Searching with an index is faster but approximate, so it can miss some of the true nearest neighbours.
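
To make the exact, index-free case concrete, here is a minimal pure-Python sketch of a brute-force search. It assumes cosine distance is defined as one minus cosine similarity, and the helper names `cosine_distance` and `exact_top_k` are hypothetical. DuckDB performs the equivalent work far more efficiently; this only illustrates the logic.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=5):
    """Score every vector against the query and return the k closest row indices."""
    scored = [(cosine_distance(query, vector), index)
              for index, vector in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [index for _, index in scored[:k]]
```

Because every row is scored, the cost grows linearly with the size of the dataset, which is the price paid for exact results.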

### Without an index

In [None]:
import duckdb

def similarity_search_without_duckdb_index(
        query: str,
        k: int = 5,
        dataset_name: str = 'ai-blueprint/fineweb-bbc-news-embeddings',
        embedding_column: str = 'embeddings'
):
    # use the same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    sql = f"""
        SELECT
            *,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}],
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """

    return duckdb.sql(sql).to_df()

In [None]:
similarity_search_without_duckdb_index("What is the future of AI?")

### With an index

In [None]:
import duckdb

def _setup_vss():
    duckdb.sql(
        query="""
        INSTALL vss;
        LOAD vss;
        """
    )

def _drop_table(table_name):
    duckdb.sql(
        query=f"""
        DROP TABLE IF EXISTS {table_name};
        """
    )

def _create_table(dataset_name, table_name, embedding_column):
    duckdb.sql(
        query=f"""
        CREATE TABLE {table_name} AS
        SELECT *, {embedding_column}::float[{model.get_sentence_embedding_dimension()}] as {embedding_column}_float
        FROM 'hf://datasets/{dataset_name}/**/*.parquet';
        """
    )

def _create_index(table_name, embedding_column):
    duckdb.sql(
        query=f"""
        CREATE INDEX my_hnsw_index ON {table_name} USING HNSW ({embedding_column}_float) WITH (metric = 'cosine');
        """
    )

def create_index(dataset_name, table_name, embedding_column):
    _setup_vss()
    _drop_table(table_name)
    _create_table(dataset_name, table_name, embedding_column)
    _create_index(table_name, embedding_column)

In [None]:
create_index(
    dataset_name='ai-blueprint/fineweb-bbc-news-embeddings',
    table_name='fineweb_bbc_news_embeddings',
    embedding_column='embeddings'
)

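Now we can perform a vector search against the indexed table. Because the data is already loaded locally and indexed, this should return results much faster than scanning the remote parquet files.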

In [None]:
def similarity_search_with_duckdb_index(
        query: str,
        k: int = 5,
        table_name: str = 'fineweb_bbc_news_embeddings',
        embedding_column: str = 'embeddings'
):
    embedding = model.encode(query).tolist()

    return duckdb.sql(
        query=f"""
        SELECT *, array_cosine_distance({embedding_column}_float, {embedding}::FLOAT[{model.get_sentence_embedding_dimension()}]) as distance
        FROM {table_name}
        ORDER BY distance
        LIMIT {k}
        """
    ).to_df()

In [None]:
similarity_search_with_duckdb_index("What is the future of AI?")
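
The precision trade-off mentioned earlier can be checked by comparing the rows returned by the two search functions for the same query. The sketch below computes recall, the fraction of exact results that the indexed search also found. The helper name `recall_at_k` is hypothetical, and which column uniquely identifies a row in this dataset is an assumption you would need to verify.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that also appear in the approximate results."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)
```

For example, you could pass the values of an identifier column from both returned DataFrames. A recall of 1.0 means the index returned the same rows as the exact search for that query.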