# Hugging Face Hub as a vector search backend

We will use the [smol-blueprint/hf-blogs](https://huggingface.co/datasets/smol-blueprint/hf-blogs) dataset, which contains the blog posts from the Hugging Face website.

In [162]:
from datasets import load_dataset

dataset = load_dataset("smol-blueprint/hf-blogs")
dataset["train"][0]

Generating train split: 100%|██████████| 312/312 [00:00<00:00, 9165.69 examples/s]


{'title': 'How to train a new language model from scratch using Transformers and Tokenizers',
 'author': 'julien-c',
 'date': 'February 14, 2020',
 'local': 'how-to-train',
 'tags': 'guide, nlp',
 'URL': 'https://huggingface.co/blog/how-to-train',
 'content': ' # How to train a new language model from scratch using Transformers and Tokenizers   <a target="_blank" href="https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb">     <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> </a>  Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.  In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same 

## Chunking the documents

To decide how to chunk the documents, we first need to look at the `content` column. As the output above shows, the content is formatted as Markdown. The text is therefore organized into sections by header markers such as `#` and `##`, and we can split on these markers to get semantically meaningful chunks. Other elements, such as images, could be extracted in a similar way with something like `BeautifulSoup.find_all(name="img")`. Let's write a function that does the splitting.

In [163]:
import re
import markdown
from bs4 import BeautifulSoup

def structure_content(row):
    # Render the markdown to HTML and keep only the visible text
    soup = BeautifulSoup(markdown.markdown(row["content"]), "html.parser")
    # Split on 2 or more # followed by whitespace
    chunks = re.split(r"#{2,}\s", string=soup.text)
    # Drop empty chunks and de-duplicate while preserving document order
    row["chunked_content"] = list(dict.fromkeys(chunk for chunk in chunks if chunk.strip()))
    return row

chunked_dataset = dataset.map(structure_content)
chunked_dataset["train"][0]

Map: 100%|██████████| 312/312 [00:07<00:00, 42.93 examples/s]


{'title': 'How to train a new language model from scratch using Transformers and Tokenizers',
 'author': 'julien-c',
 'date': 'February 14, 2020',
 'local': 'how-to-train',
 'tags': 'guide, nlp',
 'URL': 'https://huggingface.co/blog/how-to-train',
 'content': ' # How to train a new language model from scratch using Transformers and Tokenizers   <a target="_blank" href="https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb">     <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> </a>  Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.  In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same 

The chunked content seems reasonable, so we can now continue to the next step, which is creating embeddings for each of our content items.
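
As a quick sanity check of the splitting pattern, here is a minimal standalone sketch that applies the same regular expression to a small made-up string. The helper name `split_on_headers` and the sample text are assumptions for illustration. Unlike `structure_content`, it also strips surrounding whitespace from each chunk.

```python
import re

# Same pattern as in structure_content: two or more '#' followed by whitespace.
HEADER_PATTERN = re.compile(r"#{2,}\s")


def split_on_headers(text: str) -> list[str]:
    """Split text on level-2 (or deeper) markdown headers and drop empty chunks."""
    return [chunk.strip() for chunk in HEADER_PATTERN.split(text) if chunk.strip()]


# Hypothetical sample text, flattened onto one line like the dataset content.
sample = "# Title Intro text. ## First section Some details. ### Subsection More details."

for chunk in split_on_headers(sample):
    print(repr(chunk))
```

Note that a single `#` is not matched by the pattern, so the title stays attached to the first chunk, while the `##` and `###` markers themselves are consumed by the split.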

## Creating embeddings

To create a vector search index, we need embeddings for each of our chunks. We will use the [Hugging Face `sentence-transformers` library](https://huggingface.co/sentence-transformers) to create them.

### Creating text embeddings

You can find strong candidate models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), but always check how a model performs on your specific task. We will use the [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) model to create the embeddings for our text, which we chose because it performs well on benchmarks for asymmetric search, i.e., query-answer pairs.
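
One simple way to check a model on your own task is to measure how often the known relevant chunk shows up in the top results. The following is a minimal sketch of such a check, not part of the original notebook: the function name `hit_rate_at_k` and the example ids are hypothetical, and it assumes you have a small set of queries with one known relevant chunk each.

```python
def hit_rate_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant id appears among the top-k ranked ids."""
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_ids)


# Hypothetical retrieval results for three queries (ids are made up).
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = ["b", "x", "g"]

print(hit_rate_at_k(ranked, relevant, k=2))
```

Running the same check for two candidate models on the same queries gives a rough, task-specific comparison to complement the leaderboard numbers.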

In [164]:
chunked_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'author', 'date', 'local', 'tags', 'URL', 'content', 'chunked_content'],
        num_rows: 312
    })
})

In [166]:
from sentence_transformers import SentenceTransformer
from datasets import Dataset

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

def extract_chunks(dataset):
    """Extract chunks from dataset while removing unnecessary fields."""
    data = []
    excluded_fields = {"chunked_content", "images", "content", "code", "image"}
    
    for row in dataset["train"]:
        for chunk in row["chunked_content"]:
            # Create new dict with only desired fields rather than copying
            item = {}
            for k,v in row.items():
                if k not in excluded_fields:
                    item[k] = v
            item["chunk"] = chunk
            data.append(item)
    return data

def create_text_embeddings(batch):
    """Create embeddings for a batch of text chunks."""
    batch["embedding"] = model.encode(batch["chunk"])
    return batch

# Create dataset with chunks and generate embeddings
chunks = extract_chunks(chunked_dataset)
embeddings_dataset = Dataset.from_list(chunks)
embeddings_dataset = embeddings_dataset.map(create_text_embeddings, batched=True)
embeddings_dataset.push_to_hub("smol-blueprint/hf-blogs-text-embeddings")

Map: 100%|██████████| 2927/2927 [00:41<00:00, 70.29 examples/s]
Creating parquet from Arrow format: 100%|██████████| 3/3 [00:00<00:00, 30.48ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.63s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/smol-blueprint-project/hf-blogs-text-embeddings/commit/b58425de991d0a84daca707495cbef7f19acaf52', commit_message='Upload dataset', commit_description='', oid='b58425de991d0a84daca707495cbef7f19acaf52', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/smol-blueprint-project/hf-blogs-text-embeddings', endpoint='https://huggingface.co', repo_type='dataset', repo_id='smol-blueprint-project/hf-blogs-text-embeddings'), pr_revision=None, pr_num=None)

### Creating multi-modal embeddings

We can use a similar approach to create embeddings for our images and texts. We will use the [sentence-transformers/clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model, which embeds both images and texts into a single vector space. This is a larger model, which means it will take longer to embed the content.
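
The notebook does not include code for this step, so the following is only a minimal sketch of what it could look like. It assumes the CLIP checkpoint is loaded through `SentenceTransformer` and that `encode` accepts PIL images as well as strings. The function name `embed_images_and_texts` and the file path in the usage comment are hypothetical.

```python
def embed_images_and_texts(
    image_paths,
    texts,
    model_name="sentence-transformers/clip-ViT-B-32",
):
    """Embed images and texts into the same CLIP vector space."""
    # Imported lazily so the function can be defined without loading the heavy dependencies.
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Open each image file; the model applies its own preprocessing.
    images = [Image.open(path) for path in image_paths]
    image_embeddings = model.encode(images)
    text_embeddings = model.encode(texts)
    return image_embeddings, text_embeddings


# Hypothetical usage (requires the model download and a local image file):
# image_embeddings, text_embeddings = embed_images_and_texts(
#     ["example.png"], ["a diagram of a transformer model"]
# )
```

Because both kinds of embeddings live in one vector space, a text query can be compared against image embeddings in the same way as against text embeddings.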

## Vector search on Hub datasets

For the similarity search, we can simply execute queries on top of the Hugging Face Hub using the [DuckDB integration for vector search](https://huggingface.co/docs/hub/en/datasets-duckdb). Note that we need to embed the query with the same model that we used for indexing.
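
The search ranks chunks by the cosine distance between the query embedding and each chunk embedding. For intuition, here is a minimal pure-Python sketch, assuming DuckDB's `array_cosine_distance` follows the usual definition of one minus cosine similarity. The function name `cosine_distance` and the toy vectors are illustrative only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Smaller distances mean more similar vectors, which is why the results are sorted by distance in ascending order.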

In [160]:
from sentence_transformers import SentenceTransformer
import duckdb

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")

def similarity_search(
    query: str, 
    k: int = 5, 
    dataset_name: str = "smol-blueprint/hf-blogs-text-embeddings", 
    embedding_column: str = "embedding",
):
    # Use same model as used for indexing
    query_vector = model.encode(query)
    embedding_dim = model.get_sentence_embedding_dimension()
    
    sql = f"""
        SELECT 
            title,
            author,
            date,
            local,
            tags,
            URL,
            chunk,
            array_cosine_distance(
                {embedding_column}::float[{embedding_dim}], 
                {query_vector.tolist()}::float[{embedding_dim}]
            ) as distance
        FROM 'hf://datasets/{dataset_name}/**/*.parquet'
        ORDER BY distance
        LIMIT {k}
    """
    
    return duckdb.sql(sql).to_df()

similarity_search("What is the best way to learn Hugging Face?")

| | title | author | date | local | tags | URL | chunk | distance |
|---|---|---|---|---|---|---|---|---|
| 0 | Federated Learning using Hugging Face and Flower | charlesbvll | March 27, 2023 | fl-with-flower | nlp, transformers, guide, flower, federated-le... | https://huggingface.co/blog/fl-with-flower | Standard Hugging Face workflow | 0.10972 |
| 1 | The Hugging Face Hub for Galleries, Libraries,... | davanstrien | June 12, 2023 | hf-hub-glam-guide | community, guide | https://huggingface.co/blog/hf-hub-glam-guide | What can you find on the Hugging Face Hub? | 0.12229 |
| 2 | How Hugging Face Accelerated Development of Wi... | Violette | March 1, 2023 | classification-use-cases | nlp, case-studies | https://huggingface.co/blog/classification-use... | Solutions provided by the Hugging Face Experts... | 0.132556 |
| 3 | Hugging Face's TensorFlow Philosophy | rocketknight1 | August 12, 2022 | tensorflow-philosophy | nlp, cv, guide | https://huggingface.co/blog/tensorflow-philosophy | # Hugging Face's TensorFlow Philosophy | 0.141068 |
| 4 | Creating open machine learning datasets? Share... | davanstrien | October 30, 2023 | researcher-dataset-sharing | community, research, datasets, guide | https://huggingface.co/blog/researcher-dataset... | # Creating open machine learning datasets? Sha... | 0.155733 |


## Gradio as vector search interface

We will use Gradio as a web application tool to create a demo interface for our vector search index. We can develop this locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to interact directly with our vector search index.

### Gradio demo UI

In [161]:
import gradio as gr

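def search(query, k):
    # Slider values may arrive as floats, so cast before building the SQL LIMIT
    return similarity_search(query, int(k))

with gr.Blocks() as demo:
    query = gr.Textbox(label="Query")
    k = gr.Slider(1, 10, value=5, step=1, label="Number of results")
    btn = gr.Button("Search")
    # Headers match the columns returned by similarity_search
    results = gr.Dataframe(
        headers=["title", "author", "date", "local", "tags", "URL", "chunk", "distance"]
    )
    # Expose the endpoint under the name used by the API client later on
    btn.click(fn=search, inputs=[query, k], outputs=[results], api_name="similarity_search")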
    

demo.launch()

* Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.




## Deploying Gradio to Hugging Face Spaces

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=vector-search-hub). Follow the link and click the "Create Space" button. Then copy the Gradio interface code into an `app.py` file, together with the `similarity_search` function from the notebook. Lastly, create a `requirements.txt` file with the following content:

```text
duckdb
sentence-transformers
```

We wait a couple of minutes for the application to deploy and, et voilà, we have [a public vector search interface](https://huggingface.co/spaces/smol-blueprint/vector-search-hub)!

## Gradio as REST API

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to interact directly with our vector search index. Each Gradio app has API documentation that describes the available endpoints and their parameters, which you can access from the button at the bottom of the Gradio app's Space page.

In [1]:
from gradio_client import Client
import pandas as pd

client = Client("https://smol-blueprint-vector-search-hub.hf.space/")
results = client.predict(
    api_name="/similarity_search",
    query="Optimizing LLM inference", 
    k=5
)
pd.DataFrame(data=results["data"], columns=results["headers"])


Loaded as API: https://smol-blueprint-vector-search-hub.hf.space/ ✔


| | title | author | date | local | tags | URL | chunk | distance |
|---|---|---|---|---|---|---|---|---|
| 0 | Introducing the Private Hub: A New Way to Buil... | FedericoPascual | August 3, 2022 | introducing-private-hub | announcement, enterprise, hub | https://huggingface.co/blog/introducing-privat... | Training accurate models faster | 0.192108 |
| 1 | Fine-tuning Llama 2 70B using PyTorch FSDP | smangrul | September 13, 2023 | ram-efficient-pytorch-fsdp | llm, guide, nlp | https://huggingface.co/blog/ram-efficient-pyto... | Fine-Tuning | 0.193254 |
| 2 | Making ML-powered web games with Transformers.js | Xenova | July 5, 2023 | ml-web-games | game-dev, guide, web, javascript, transformers.js | https://huggingface.co/blog/ml-web-games | 1. Training the neural network | 0.196486 |
| 3 | Open-Source Text Generation & LLM Ecosystem at... | merve | July 17, 2023 | os-llms | LLM, inference, nlp | https://huggingface.co/blog/os-llms | Tools in the Hugging Face Ecosystem for LLM Se... | 0.197265 |
| 4 | Comparing the Performance of LLMs: A Deep Dive... | mehdiiraqui | November 7, 2023 | Lora-for-sequence-classification-with-Roberta-... | nlp, guide, llm, peft | https://huggingface.co/blog/Lora-for-sequence-... | Pre-trained Models | 0.198704 |
