# Index and retrieve documents for vector search using Sentence Transformers and DuckDB

We will use the [ai-blueprint/fineweb-bbc-news](https://huggingface.co/datasets/ai-blueprint/fineweb-bbc-news) dataset, a sample of FineWeb data that was sourced from the BBC News website. We assume these documents stand in for relevant company documents. At the end, we deploy a microservice that performs vector search on our dataset.

## Dependencies and imports

Let's install the necessary dependencies.

In [None]:
!pip install datasets duckdb sentence-transformers model2vec vicinity gradio gradio-client -q

Now let's import the necessary libraries.

In [26]:
import duckdb
import gradio as gr
import numpy as np
import pandas as pd

from datasets import load_dataset
from gradio_client import Client
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from vicinity import Vicinity, Backend, Metric

## Load the dataset

In [3]:
dataset = load_dataset("ai-blueprint/fineweb-bbc-news")
dataset["train"]

Dataset({
    features: ['url', 'text'],
    num_rows: 352549
})

## Chunking the documents

To decide how to chunk the documents, we first need to understand what our `text` column looks like. Depending on the format of our data and the goal of our retrieval, we can use different chunking strategies. In our case, we will not chunk the documents and will instead embed the full `text` column directly. Below you can find recommended strategies for chunking the documents.

<details>
<summary>BeautifulSoup for HTML/Markdown</summary>
When working with HTML or Markdown, you can use a library like <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a> to parse and extract elements like paragraphs, headers, images, etc. We can use this to extract the text from the HTML/Markdown and then split it into chunks.
</details>

<details>
<summary>Chonkie for basic chunking</summary>
When chunking the documents, you can consider different strategies, such as splitting based on tokens, words, sentences or semantic units. There are a lot of libraries out there, but a nice lightweight option is <a href="https://github.com/chonkie-ai/chonkie">chonkie</a>, which supports a lot of different <a href="https://docs.chonkie.ai/chunkers/overview">strategies and examples</a>.
</details>
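
To make the word-based strategy concrete, here is a minimal sketch of a sliding-window chunker in plain Python. It is not how chonkie works internally; the function name `chunk_words` and the default sizes are assumptions chosen for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear in full in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be embedded as its own row, ideally with the source `url` kept alongside it so that results can be traced back to the original document.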

## Creating embeddings

Depending on the format of our data and the intentions of our retrieval, we can use different strategies to create embeddings for our documents. In our case, we will be working with basic text data, so we will be using the `text` column to embed the documents. Underneath you can find recommended strategies for creating embeddings for other approaches.

<details>
<summary>RAGatouille and ColBERT for improved accuracy</summary>
A more complex but also more accurate approach is using contextual late interaction with <a href="https://github.com/stanford-futuredata/ColBERT">ColBERT</a>. ColBERT encodes each passage and query into a matrix of token-level embeddings, which ensures better semantic preservation and matching. The <a href="https://github.com/AnswerDotAI/RAGatouille">RAGatouille</a> library provides a simple interface for using ColBERT in a pipeline.
</details>

<details>
<summary>Byaldi and ColPali for multi-modal document retrieval</summary>
<a href="https://github.com/illuin-tech/colpali">ColPali</a> is an approach that was inspired by ColBERT but uses documents and images as input rather than text. The <a href="https://github.com/AnswerDotAI/byaldi">Byaldi</a> library provides a simple interface for using ColPali in a pipeline.
</details>

<details>
<summary>CLIP for images or image-text pairs</summary>
We can use a similar approach to create embeddings for our images and texts. The <a href="https://huggingface.co/sentence-transformers/clip-ViT-B-32">sentence-transformers/clip-ViT-B-32</a> model embeds both images and texts into a single vector space. You can then use these embeddings to perform a multi-modal search.
</details>

We will use the [minishlab/potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) model to create the embeddings for our text, which we chose because of the speed at which it can create embeddings. It takes mere minutes to embed hundreds of thousands of documents on consumer hardware. In other scenarios, the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) can help with choosing the best model for your specific task.

In [None]:
# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])


def create_embeddings(batch):
    """Create embeddings for a batch of text chunks."""
    # Store the vectors under "embeddings", the column name the search functions expect
    batch["embeddings"] = model.encode(batch["text"])
    return batch


# Generate embeddings for every document and push the result to the Hub
embeddings_dataset = dataset.map(create_embeddings, batched=True)
embeddings_dataset.push_to_hub("ai-blueprint/fineweb-bbc-news-text-embeddings")

## Vector search Hub datasets

For the similarity search, we can simply execute queries on top of the Hugging Face Hub using the [DuckDB integration for vector search](https://huggingface.co/docs/hub/en/datasets-duckdb). This also works with [private datasets](https://huggingface.co/docs/hub/en/datasets-duckdb-auth). When doing so, we can either use an index or not. Searching **without an index** is slower but more precise, whereas searching **with an index** is faster but less precise.
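
To make the trade-off concrete, here is a minimal sketch of what a search without an index amounts to: compute the distance from the query to every stored vector, sort, and keep the top `k`. It uses plain Python and assumes cosine distance is defined as one minus cosine similarity and that no vector is all zeros. The names `cosine_distance` and `exact_search` are hypothetical and not part of DuckDB.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_search(query_vector, vectors, k=5):
    """Scan every vector and return the k closest as (row index, distance)."""
    scored = [
        (index, cosine_distance(query_vector, vector))
        for index, vector in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

Because every row is visited, the cost grows linearly with the dataset size. An approximate index avoids the full scan at the price of occasionally missing a true nearest neighbor.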

### Use the Hub directly

To search without an index, we can use the `duckdb` library to connect to the dataset and perform a vector search. This is a slow operation, but it is normally quick enough for small datasets of up to roughly 100k rows. Our dataset has more than 350k rows, so querying it will be somewhat slower.

In [24]:
def similarity_search_without_duckdb_index(
    query: str,
    k: int = 5,
    dataset_name: str = "ai-blueprint/fineweb-bbc-news-embeddings",
    embedding_column: str = "embeddings",
):
    # Use same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    sql = f"""
        SELECT 
            *,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}], 
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """
    df = duckdb.sql(sql).to_df()
    df = df.drop(columns=[embedding_column])
    return df

similarity_search_without_duckdb_index("What is the future of AI?")

| | url | text | distance |
|---|---|---|---|
| 0 | https://www.bbc.com/news/technology-51064369 | The last decade was a big one for artificial i... | 0.2812 |
| 1 | http://www.bbc.com/news/technology-25000756 | Singularity: The robots are coming to steal ou... | 0.365842 |
| 2 | http://www.bbc.co.uk/news/technology-25000756 | Singularity: The robots are coming to steal ou... | 0.365842 |
| 3 | https://www.bbc.co.uk/news/technology-37494863 | Google, Facebook, Amazon join forces on future... | 0.38082 |
| 4 | https://www.bbc.co.uk/news/technology-37494863 | Google, Facebook, Amazon join forces on future... | 0.38082 |


Because of the dataset size, this approach is not very efficient and takes 30 seconds to run, but we can improve it by creating an approximate nearest neighbor index.

### Using a DuckDB vector search index

This approach works for huge datasets and relies on the [DuckDB vector search extension](https://duckdb.org/docs/extensions/vss.html). We will copy the dataset from the Hub to a local DuckDB database and create a vector search index. Creating the local index has some minor overhead but it will significantly speed up the search once you've created it.

In [None]:
def _setup_vss():
    return """
        INSTALL vss;
        LOAD vss;
        """


def _drop_table(table_name):
    return f"""
        DROP TABLE IF EXISTS {table_name};
        """


def _create_table(dataset_name, table_name, embedding_column):
    return f"""
        CREATE TABLE {table_name} AS 
        SELECT *, {embedding_column}::float[{model.get_sentence_embedding_dimension()}] as {embedding_column}_float 
        FROM 'hf://datasets/{dataset_name}/**/*.parquet';
        """


def _create_index(table_name, embedding_column):
    return f"""
        CREATE INDEX my_hnsw_index ON {table_name} USING HNSW ({embedding_column}_float) WITH (metric = 'cosine');
        """


def create_index(dataset_name, table_name, embedding_column):
    duckdb.sql(_setup_vss())
    duckdb.sql(_drop_table(table_name))
    duckdb.sql(_create_table(dataset_name, table_name, embedding_column))
    duckdb.sql(_create_index(table_name, embedding_column))


create_index(
    dataset_name="ai-blueprint/fineweb-bbc-news-embeddings",
    table_name="fineweb_bbc_news_embeddings",
    embedding_column="embeddings",
)

After this, we can simply execute queries on the local DuckDB database, which is much faster than the previous approach and produces similar results.

In [23]:
def similarity_search_with_duckdb_index(
    query: str,
    k: int = 5,
    table_name: str = "fineweb_bbc_news_embeddings",
    embedding_column: str = "embeddings"
):
    embedding = model.encode(query).tolist()
    df = duckdb.sql(
        query=f"""
        SELECT *, array_cosine_distance({embedding_column}_float, {embedding}::FLOAT[{model.get_sentence_embedding_dimension()}]) as distance 
        FROM {table_name}
        ORDER BY distance 
        LIMIT {k};
    """
    ).to_df()
    df = df.drop(columns=[embedding_column, embedding_column + "_float"])
    return df

similarity_search_with_duckdb_index("What is the future of AI?")

| | url | text | distance |
|---|---|---|---|
| 0 | https://www.bbc.com/news/technology-51064369 | The last decade was a big one for artificial i... | 0.2812 |
| 1 | http://www.bbc.co.uk/news/technology-25000756 | Singularity: The robots are coming to steal ou... | 0.365842 |
| 2 | http://www.bbc.com/news/technology-25000756 | Singularity: The robots are coming to steal ou... | 0.365842 |
| 3 | https://www.bbc.co.uk/news/technology-37494863 | Google, Facebook, Amazon join forces on future... | 0.38082 |
| 4 | https://www.bbc.co.uk/news/technology-37494863 | Google, Facebook, Amazon join forces on future... | 0.38082 |


Query time drops from 30 seconds to under a second, and we do not need to deploy a heavyweight vector search engine, while data storage is still handled by the Hub.
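
Since an approximate index can miss true neighbors, it can be useful to quantify how closely its results match the exact search. A minimal sketch, assuming each row has some stable identifier; the helper name `recall_at_k` is hypothetical:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        # Nothing to find, so nothing was missed.
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

The two lists would come from running both search functions with the same query and `k`. Note that the identifier should be unique per row; the repeated URLs in the output above suggest that `url` alone may not be, so a row number is a safer choice.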

## Using vicinity as vector search backend

Lastly, we can also take a more Pythonic approach and use the [vicinity library](https://github.com/MinishLab/vicinity) to create a vector search index. We simply load the dataset from the Hub and create a vector search index by passing our vectors and items to the `Vicinity` class.

In [20]:
vicinity = Vicinity.from_vectors_and_items(
    # `embeddings_dataset` is a DatasetDict, so select the "train" split first
    vectors=np.array(embeddings_dataset["train"]["embeddings"]),
    items=embeddings_dataset["train"]["text"],
    backend_type=Backend.HNSW,
    metric=Metric.COSINE,
)

After this, we can execute queries on the local vicinity vector search index. Note that the retrieved results are similar to the ones we got from the other methods.

In [21]:
def similarity_search_with_vicinity(query: str, k: int = 10):
    return vicinity.query(vectors=model.encode(query), k=k)


similarity_search_with_vicinity(query="How should companies prepare for AI?", k=5)

[[('Artificial intelligence (AI) is one of the most exciting technologies today, and Africa doesn\'t want to be left behind.\nToday a majority of the AI industry is in North America, Europe and Asia.\nEfforts are being made to train computer scientists from African nations, as AI can be used to solve many complex challenges.\nIn a bid to improve diversity, tech giants are providing investment to develop new talent.\nIn April, Google opened its first African AI research centre in Ghana.\nThe AI laboratory, based in Accra, will be used to develop solutions to help improve healthcare, agriculture and education.\nGoogle\'s head of AI Accra Moustapha Cisse is from Senegal.\nAfter completing an undergraduate degree in maths and physics in Senegal, he taught himself AI and then went to study in Paris, before joining Facebook.\nThere are very few AI researchers from Africa, and Mr Cisse has faced great obstacles in achieving his ambitions.\n"Despite the support, many of us still have trouble m

## Creating a web app and microservice for retrieval

We will use [Gradio](https://github.com/gradio-app/gradio) as a web application tool to create a demo interface for our vector search index. We can develop this locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to interact directly with our vector search index.

### Creating the web app

In [21]:
def search(query, k):
    # The slider may pass k as a float, so cast it before it is used in LIMIT
    return similarity_search_with_duckdb_index(query, int(k))


with gr.Blocks() as demo:
    gr.Markdown("""# RAG - retrieve
                
                Part of [AI blueprint](https://github.com/huggingface/ai-blueprint) - a blueprint for AI development, focusing on practical examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs. """)
    query = gr.Textbox(label="Query")
    k = gr.Slider(1, 10, value=5, step=1, label="Number of results")
    btn = gr.Button("Search")
    results = gr.Dataframe(headers=["url", "text", "distance"])
    btn.click(fn=search, inputs=[query, k], outputs=[results])

demo.launch(share=False)  # set to True to share as a public website directly

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




<iframe
	src="https://ai-blueprint-rag-retrieve.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

## Deploying the web app on Hugging Face 

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=rag-retrieve).

-  Click on the "Create Space" button.
-  Copy the code from the Gradio interface and paste it into an `app.py` file. Don't forget to copy the `similarity_search_*` function, along with the code to create the index.
-  Create a `requirements.txt` file with `duckdb`, `sentence-transformers` and `model2vec`.

We wait a couple of minutes for the application to deploy, et voilà, we have [a public vector search interface](https://huggingface.co/spaces/ai-blueprint/rag-retrieve)!

## Using the web app as a microservice

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to interact directly with our vector search index. Each Gradio app has API documentation that describes the available endpoints and their parameters, which you can access from the button at the bottom of the Gradio app's Space page.

In [27]:
client = Client("https://ai-blueprint-rag-retrieve.hf.space/")
results = client.predict(
    api_name="/similarity_search", query="How should companies prepare for AI?", k=5
)
pd.DataFrame(data=results["data"], columns=results["headers"])

Loaded as API: https://ai-blueprint-rag-retrieve.hf.space/ ✔


| | chunk | url | distance |
|---|---|---|---|
| 0 | "We have to prepare for a different future. ". | http://news.bbc.co.uk/2/hi/europe/3602209.stm | 0.444404 |
| 1 | UK spies will need to use artificial intellige... | https://www.bbc.com/news/technology-52415775 | 0.446492 |
| 2 | Google developing kill switch for AI\n- 8 June... | http://www.bbc.com/news/technology-36472140 | 0.471058 |
| 3 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.471088 |
| 4 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.472657 |
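
If the calling service does not need pandas, the response can also be turned into plain records. A minimal sketch, assuming the endpoint returns a dict with `headers` and `data` keys as used in the call above; `to_records` is a hypothetical helper name.

```python
def to_records(payload):
    """Convert a {"headers": [...], "data": [[...], ...]} payload into a list of dicts."""
    headers = payload["headers"]
    return [dict(zip(headers, row)) for row in payload["data"]]
```

Each record can then be passed on directly, for example as context for a downstream reranking or generation step.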


## Conclusion

We have shown a basic approach to indexing documents and performing vector search on the Hugging Face Hub. Next, we will build a reranker that improves the quality of the retrieved documents by reranking the vector search output based on relevance to the query.

## Next Steps

- Continue - with [Augmenting retrieval results by reranking using Sentence Transformers](./augment.ipynb).
- Contribute - code to show how to use chunking, ColBERT, multi-modal, RAGatouille, Byaldi, or CLIP.
- Learn - theories behind the approaches in [Hugging Face courses](https://huggingface.co/learn) or [smol-course](https://github.com/huggingface/smol-course?tab=readme-ov-file).
- Explore - notebooks with similar techniques on [the Hugging Face Cookbook](https://huggingface.co/learn/cookbook/index).