# Index and retrieve documents for vector search using Sentence Transformers and DuckDB

We will use the [ai-blueprint/fineweb-bbc-news](https://huggingface.co/datasets/ai-blueprint/fineweb-bbc-news) dataset, a sample of FineWeb data that was sourced from the BBC News website. We assume these documents stand in for relevant company documents. By the end, we will have deployed a microservice that can be used to perform vector search on our dataset.

In [3]:
from datasets import load_dataset

dataset = load_dataset("ai-blueprint/fineweb-bbc-news")
dataset["train"]

Dataset({
    features: ['url', 'text'],
    num_rows: 352549
})

## Chunking the documents

To understand how to chunk the documents, we will first need to understand what our `text` column looks like. When working with HTML or Markdown, you can use a library like `BeautifulSoup` to parse and extract elements like paragraphs, headers, images, etc. In our example, the data is already in a structured format, so we can directly consider splitting the text into chunks. 
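As a rough illustration of that extraction step, here is a minimal sketch that pulls header and paragraph text out of an HTML string. It uses Python's standard-library `html.parser` as a stand-in rather than `BeautifulSoup`, and the `TextBlockExtractor` class and the sample HTML are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collect the text of <p> and <h1>-<h3> elements (hypothetical helper)."""

    BLOCK_TAGS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished (tag, text) pairs, in document order
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a block-level tag of interest opens
        if tag in self.BLOCK_TAGS:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is collected as well
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Normalize whitespace before storing the block
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current_tag = None


html = "<h1>AI at work</h1><p>Companies are <b>testing</b> new tools.</p><img src='x.png'>"
extractor = TextBlockExtractor()
extractor.feed(html)
extractor.close()
print(extractor.blocks)
```

The image tag is ignored here; a real pipeline would decide per element type what to keep.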

When chunking the documents, you can consider different strategies, such as splitting based on tokens, words, sentences or semantic units. There are many libraries available, but a nice lightweight option is [chonkie](https://github.com/chonkie-ai/chonkie), which supports many different [strategies and examples](https://docs.chonkie.ai/chunkers/overview).
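To make the word-based strategy concrete, here is a minimal sketch in plain Python. It does not use chonkie's API; the `chunk_words` function and its `chunk_size` and `overlap` parameters are assumptions chosen for illustration.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most `chunk_size` words, where
    consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps some context across chunk boundaries, at the cost of embedding a few words twice.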

In our case, we will not be chunking the documents, but we will be using the `text` column to embed the documents.

## Creating embeddings

In order to create a vector search index, we will need to create embeddings for each of our chunks. We will use the [sentence-transformers library](https://huggingface.co/sentence-transformers) to create these embeddings.

### Creating text embeddings

We will use the [minishlab/potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) model to create the embeddings for our text, which we chose because of the speed at which it can create embeddings. It takes mere minutes to embed hundreds of thousands of documents on consumer hardware. In other scenarios, the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) can help with choosing the best model for your specific task.

In [5]:
from datasets import Dataset

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])

def create_text_embeddings(batch):
    """Create embeddings for a batch of text chunks."""
    batch["embedding"] = model.encode(batch["text"])
    return batch


# Create dataset with chunks and generate embeddings
embeddings_dataset = dataset.map(create_text_embeddings, batched=True)
embeddings_dataset.push_to_hub("ai-blueprint/fineweb-bbc-news-text-embeddings")



### Creating multi-modal embeddings

We can use a similar approach to create embeddings for images and texts. The [sentence-transformers/clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model embeds both images and texts into a single vector space. You can then use these embeddings to perform a multi-modal search.
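A minimal sketch of how this could look is shown below. It assumes that `SentenceTransformer.encode` accepts PIL images for CLIP models; the helper names `load_clip_model` and `embed_texts_and_images` are hypothetical and not part of any library.

```python
from typing import Any, Sequence


def load_clip_model(model_name: str = "sentence-transformers/clip-ViT-B-32"):
    """Load the CLIP model named in the text (downloads weights on first use)."""
    # Imported lazily so the helpers below can be defined without the library
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed_texts_and_images(model: Any, texts: Sequence[str], images: Sequence[Any]):
    """Embed texts and images with the same model so they share one vector space."""
    text_embeddings = model.encode(list(texts))
    image_embeddings = model.encode(list(images))
    return text_embeddings, image_embeddings
```

With Pillow installed, usage could look like `model = load_clip_model()` followed by `embed_texts_and_images(model, ["a red bus"], [Image.open("bus.jpg")])`, where `bus.jpg` is a placeholder path.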

## Vector search Hub datasets

For the similarity search, we can simply execute queries on top of the Hugging Face Hub using the [DuckDB integration for vector search](https://huggingface.co/docs/hub/en/datasets-duckdb). This also works with [private datasets](https://huggingface.co/docs/hub/en/datasets-duckdb-auth). Note that we need to use the same model for embedding the query as we used for indexing.

### Use the Hub directly

This approach works quickly enough for datasets up to roughly 100K records.

In [36]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
import duckdb


# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])


def similarity_search_without_duckdb_index(
    query: str,
    k: int = 5,
    dataset_name: str = "smol-blueprint/fineweb-bbc-news-text-embeddings",
    embedding_column: str = "embedding",
):
    # Use same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()

    sql = f"""
        SELECT 
            url,
            chunk,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}], 
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """

    return duckdb.sql(sql).to_df()


similarity_search_without_duckdb_index(
    "How should companies prepare for AI?",
)


| | url | chunk | distance |
|---|---|---|---|
| 0 | http://news.bbc.co.uk/2/hi/europe/3602209.stm | "We have to prepare for a different future. ". | 0.444404 |
| 1 | https://www.bbc.com/news/technology-52415775 | UK spies will need to use artificial intellige... | 0.446492 |
| 2 | http://www.bbc.com/news/technology-36472140 | Google developing kill switch for AI\n- 8 June... | 0.471058 |
| 3 | https://www.bbc.co.uk/news/business-48139212 | Artificial intelligence (AI) is one of the mos... | 0.471088 |
| 4 | https://www.bbc.com/news/technology-51064369 | The last decade was a big one for artificial i... | 0.472657 |


Because of the dataset size, this approach is not very efficient, but we can improve it by creating an approximate nearest neighbor index.
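To see where the cost comes from: without an index, every stored vector is compared with the query vector before the closest `k` are kept. The sketch below shows that computation in plain Python, assuming cosine distance is defined as one minus cosine similarity. The `brute_force_search` helper and the tiny two-dimensional vectors are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query_vector, rows, k=5):
    """Score every row against the query, then keep the k closest.

    `rows` is a list of (item, vector) pairs; the work grows linearly
    with the number of rows.
    """
    scored = [(item, cosine_distance(vector, query_vector)) for item, vector in rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]


rows = [
    ("doc-a", [1.0, 0.0]),
    ("doc-b", [0.0, 1.0]),
    ("doc-c", [1.0, 1.0]),
]
for item, distance in brute_force_search([1.0, 0.1], rows, k=2):
    print(item, round(distance, 3))
```

An approximate nearest neighbor index avoids scoring every row, trading a small amount of accuracy for speed.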

### Using a DuckDB vector search index

This approach works for huge datasets and relies on the [DuckDB vector search extension](https://duckdb.org/docs/extensions/vss.html). We will copy the dataset from the Hub to a local DuckDB database and create a vector search index. Creating the index takes a while, but afterwards the queries will run much faster.

In [19]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
import duckdb

# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])

dataset_name = "smol-blueprint/fineweb-bbc-news-text-embeddings"
embedding_column = "embedding"
table_name = "fineweb"

duckdb.sql(query=f"""
    INSTALL vss;
    LOAD vss;
    CREATE TABLE {table_name} AS 
    SELECT *, {embedding_column}::float[{model.get_sentence_embedding_dimension()}] as embedding_float 
    FROM 'hf://datasets/{dataset_name}/**/*.parquet';
    CREATE INDEX my_hnsw_index ON {table_name} USING HNSW (embedding_float) WITH (metric = 'cosine');
""")


After this, we can execute queries on the local DuckDB database, which is much faster than the previous approach and produces similar results.

In [20]:
def similarity_search_with_duckdb_index(query: str, k: int = 5):
    embedding = model.encode(query).tolist()
    return duckdb.sql(
        query=f"""
        SELECT chunk, url, array_cosine_distance({embedding_column}_float, {embedding}::FLOAT[{model.get_sentence_embedding_dimension()}]) as distance 
        FROM {table_name}
        ORDER BY distance 
        LIMIT {k};
    """
    ).to_df()

similarity_search_with_duckdb_index("How should companies prepare for AI?", k=5)

| | chunk | url | distance |
|---|---|---|---|
| 0 | UK spies will need to use artificial intellige... | https://www.bbc.com/news/technology-52415775 | 0.446492 |
| 1 | Google developing kill switch for AI\n- 8 June... | http://www.bbc.com/news/technology-36472140 | 0.471058 |
| 2 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.471088 |
| 3 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.472657 |
| 4 | Singularity: The robots are coming to steal ou... | http://www.bbc.co.uk/news/technology-25000756 | 0.501493 |
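One way to put a number on "similar results" is to measure how many of the exact top-k results the index also returns. The sketch below does this with the URLs from the two result tables shown above; `overlap_at_k` is a hypothetical helper written for this example.

```python
def overlap_at_k(exact_results, approximate_results):
    """Fraction of the exact results that also appear in the approximate results."""
    if not exact_results:
        return 0.0
    exact = set(exact_results)
    return len(exact & set(approximate_results)) / len(exact)


# URLs returned by the query without an index (full scan)
exact_urls = [
    "http://news.bbc.co.uk/2/hi/europe/3602209.stm",
    "https://www.bbc.com/news/technology-52415775",
    "http://www.bbc.com/news/technology-36472140",
    "https://www.bbc.co.uk/news/business-48139212",
    "https://www.bbc.com/news/technology-51064369",
]
# URLs returned by the query against the HNSW index
indexed_urls = [
    "https://www.bbc.com/news/technology-52415775",
    "http://www.bbc.com/news/technology-36472140",
    "https://www.bbc.co.uk/news/business-48139212",
    "https://www.bbc.com/news/technology-51064369",
    "http://www.bbc.co.uk/news/technology-25000756",
]
print(overlap_at_k(exact_urls, indexed_urls))  # 0.8
```

Four of the five results are shared, and the shared rows have identical distances in both tables.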


## Using vicinity as vector search backend

Lastly, we can also take a more Pythonic approach and use the [vicinity library](https://github.com/MinishLab/vicinity) to create a vector search index. We simply load the dataset from the Hub and create a vector search index by passing our vectors and items to the `Vicinity` class.

In [1]:
import numpy as np
from vicinity import Vicinity, Backend, Metric
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Initialize a StaticEmbedding module
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
model = SentenceTransformer(modules=[static_embedding])

embeddings_dataset = load_dataset(
    "smol-blueprint/fineweb-bbc-news-text-embeddings", split="train"
)

vicinity = Vicinity.from_vectors_and_items(
    vectors=np.array(embeddings_dataset["embedding"]),
    items=embeddings_dataset["chunk"],
    backend_type=Backend.HNSW,
    metric=Metric.COSINE,
)

After this, we can execute fast queries on the local vicinity vector search index. Note that the retrieved results are similar to the ones we got from the other methods.

In [3]:
def similarity_search_with_vicinity(query: str, k: int = 10):
    return vicinity.query(vectors=model.encode(query), k=k)


similarity_search_with_vicinity(query="How should companies prepare for AI?", k=5)

[[('UK spies will need to use artificial intelligence (AI) to counter a range of threats, an intelligence report says. Adversaries are likely to use the technology for attacks in cyberspace and on the political system, and AI will be needed to detect and stop them. But AI is unlikely to predict who might be about to be involved in serious crimes, such as terrorism - and will not replace human judgement, it says. The report is based on unprecedented access to British intelligence. The Royal United Services Institute (Rusi) think tank also argues that the use of AI could give rise to new privacy and human-rights considerations, which will require new guidance. The UK\'s adversaries "will undoubtedly seek to use AI to attack the UK", Rusi says in the report - and this may include not just states, but also criminals. Fire with fire\nThe future threats could include using AI to develop deep fakes - where a computer can learn to generate convincing faked video of a real person - in order to 

## Creating a web app and microservice for retrieval

We will use [Gradio](https://github.com/gradio-app/gradio) as a web application tool to create a demo interface for our vector search index. We can develop this locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to directly interact with our vector search index.

### Creating the web app

In [21]:
import gradio as gr


def search(query, k):
    return similarity_search_with_duckdb_index(query, k)


with gr.Blocks() as demo:
    gr.Markdown("""# RAG - retrieve
                
                Part of [smol blueprint](https://github.com/davidberenstein1957/smol-blueprint) - a smol blueprint for AI development, focusing on practical examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs. """)
    query = gr.Textbox(label="Query")
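    # step=1 keeps k an integer so it can be used in the SQL LIMIT clause
    k = gr.Slider(1, 10, value=5, step=1, label="Number of results")
    btn = gr.Button("Search")
    # Headers match the columns returned by similarity_search_with_duckdb_index
    results = gr.Dataframe(headers=["chunk", "url", "distance"])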
    btn.click(fn=search, inputs=[query, k], outputs=[results])

demo.launch(share=False) # set to True to share as a public website directly

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




<iframe
	src="https://smol-blueprint-rag-retrieval.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

## Deploying the web app on Hugging Face 

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=rag-retrieve).

-  Click on the "Create Space" button.
-  Copy the code from the Gradio interface and paste it into an `app.py` file. Don't forget to copy the `similarity_search_*` function, along with the code to create the index.
-  Create a `requirements.txt` file with `duckdb`, `sentence-transformers` and `model2vec`.

After a couple of minutes the application is deployed and, et voilà, we have [a public vector search interface](https://huggingface.co/spaces/smol-blueprint/rag-retrieve)!

## Using the web app as a microservice

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to directly interact with our vector search index. Each Gradio app has API documentation that describes the available endpoints and their parameters, which you can access from the button at the bottom of the Gradio app's Space page.

In [23]:
from gradio_client import Client
import pandas as pd

client = Client("https://smol-blueprint-rag-retrieve.hf.space/")
results = client.predict(
    api_name="/similarity_search", query="How should companies prepare for AI?", k=5
)
pd.DataFrame(data=results["data"], columns=results["headers"])

| | chunk | url | distance |
|---|---|---|---|
| 0 | UK spies will need to use artificial intellige... | https://www.bbc.com/news/technology-52415775 | 0.446492 |
| 1 | Google developing kill switch for AI\n- 8 June... | http://www.bbc.com/news/technology-36472140 | 0.471058 |
| 2 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.471088 |
| 3 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.472657 |
| 4 | Singularity: The robots are coming to steal ou... | http://www.bbc.co.uk/news/technology-25000756 | 0.501493 |


## Next steps

We have shown a basic approach to indexing documents and performing vector search on the Hugging Face Hub. Next, we will build a reranker that takes the output of the vector search and improves the quality of the retrieved documents by reranking them based on their relevance to the query.

[Augmenting retrieval results by reranking using Sentence Transformers](./augment.ipynb)
