# Augment retrieval results by reranking using Sentence Transformers

Retrieval gives a quick estimate of the documents most relevant to a query. This works well as a first pass over millions of documents, but we can improve relevance by reranking the retrieved documents. We will use the retrieval microservice from the [Index and retrieve documents for vector search using Sentence Transformers and DuckDB](./retrieve.ipynb) notebook to build a reranker. There, we saw how to index the Hugging Face Hub and perform vector search on it. Now, we will build a reranker that takes a query and a set of documents and returns the documents sorted by relevance to the query.

## Hugging Face as a vector search backend

As a brief recap of the previous notebook: we use Hugging Face as a vector search backend and call it as a REST API through the Gradio Python Client.

In [10]:
from gradio_client import Client
import pandas as pd

gradio_client = Client("https://smol-blueprint-rag-retrieval.hf.space/")


def similarity_search(query: str, k: int = 5) -> pd.DataFrame:
    results = gradio_client.predict(api_name="/similarity_search", query=query, k=k)
    return pd.DataFrame(data=results["data"], columns=results["headers"])


similarity_search("What is the future of AI?", k=5)

Loaded as API: https://smol-blueprint-rag-retrieval.hf.space/ ✔


| | chunk | url | distance |
|---|---|---|---|
| 0 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.260197 |
| 1 | "The manifold of things which were lumped into... | https://www.bbc.com/news/technology-51064369 | 0.358736 |
| 2 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.364774 |
| 3 | "\nSeveral others started to talk about AGI be... | https://www.bbc.com/news/technology-51064369 | 0.370624 |
| 4 | Can artificial intelligence predict the future... | https://www.bbc.com/news/av/technology-46104433 | 0.376021 |
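The client call above returns a dictionary with `headers` and `data` keys, which `similarity_search` turns into a DataFrame. As a minimal sketch of that payload shape without pandas, assuming only those two keys: the helper name `payload_to_records` and the sample payload are made up for illustration and are not real output from the Space.

```python
from typing import Any


def payload_to_records(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn a {"headers": [...], "data": [[...], ...]} payload into row dicts."""
    headers = payload["headers"]
    return [dict(zip(headers, row)) for row in payload["data"]]


# Hypothetical payload, shaped like the result of the `/similarity_search` call.
example_payload = {
    "headers": ["chunk", "url", "distance"],
    "data": [
        ["first chunk", "https://example.com/a", 0.26],
        ["second chunk", "https://example.com/b", 0.36],
    ],
}

records = payload_to_records(example_payload)
print(records[0]["distance"])  # 0.26
```

Each record maps a column name to its value, which is the same information the DataFrame holds row by row.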


## Reranking retrieved documents

Whenever we retrieve documents from the vector search backend, we can improve the quality of the documents that we pass to the LLM by ranking them by relevance to the query. We will use the [sentence-transformers library](https://huggingface.co/sentence-transformers). You can find strong candidate models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

We will first retrieve 50 documents and then use [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) to rerank the documents and return the top 5.

In [19]:
from sentence_transformers import CrossEncoder
import pandas as pd

reranker = CrossEncoder("sentence-transformers/all-MiniLM-L12-v2")


def rerank_documents(query: str, documents: pd.DataFrame) -> pd.DataFrame:
    documents = documents.copy()
    documents = documents.drop_duplicates("chunk")
    documents["rank"] = reranker.predict([[query, hit] for hit in documents["chunk"]])
    documents = documents.sort_values(by="rank", ascending=False)
    return documents


query = "What is the future of AI?"
documents = similarity_search(query, k=50)
reranked_documents = rerank_documents(query=query, documents=documents)
reranked_documents[:5]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| | chunk | url | distance | rank |
|---|---|---|---|---|
| 47 | Google, Facebook, Amazon join forces on future... | https://www.bbc.co.uk/news/technology-37494863 | 0.494021 | 0.503516 |
| 6 | Google has its own AI labs and has been invest... | http://www.bbc.com/news/technology-30432493 | 0.39155 | 0.502463 |
| 29 | IBM says it can "analyze and interpret all of ... | http://www.bbc.com/news/world-asia-38521403 | 0.443237 | 0.502219 |
| 8 | Singularity: The robots are coming to steal ou... | http://www.bbc.com/news/technology-25000756 | 0.395984 | 0.501567 |
| 11 | Google developing kill switch for AI\n- 8 June... | http://www.bbc.com/news/technology-36472140 | 0.397516 | 0.501371 |


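We can see that the order of the returned documents has changed, so the reranking step runs end to end. Note the warning in the output, though: `all-MiniLM-L12-v2` is an embedding model without a trained classification head, so `CrossEncoder` initializes `classifier.weight` and `classifier.bias` from scratch. That is why all `rank` scores sit close to 0.50, and this particular order should not be read as a meaningful relevance ranking. For real use, load a checkpoint that was trained as a cross-encoder, for example `cross-encoder/ms-marco-MiniLM-L-6-v2`.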
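The reranking step above boils down to four moves: deduplicate the chunks, score each (query, chunk) pair, sort by score, and keep the top results. Here is a minimal sketch of that flow in plain Python, with the scorer passed in as a function. The names `rerank`, `score_pairs` and `word_overlap_scores` are hypothetical, and the word-overlap scorer is only a toy stand-in so the example runs without downloading a model; it is not a real relevance model.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Deduplicate chunks, score each (query, chunk) pair, return the top_k."""
    unique_chunks = list(dict.fromkeys(chunks))  # drops duplicates, keeps order
    scores = score_pairs([(query, chunk) for chunk in unique_chunks])
    ranked = sorted(zip(unique_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def word_overlap_scores(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: fraction of query words that appear in the chunk."""
    scores = []
    for query, chunk in pairs:
        query_words = set(query.lower().split())
        chunk_words = set(chunk.lower().split())
        scores.append(len(query_words & chunk_words) / max(len(query_words), 1))
    return scores


chunks = [
    "the weather is nice today",
    "the future of ai is uncertain",
    "the weather is nice today",  # duplicate, removed before scoring
    "ai research has a long history",
]
top = rerank("future of ai", chunks, word_overlap_scores, top_k=2)
print(top[0][0])  # the future of ai is uncertain
```

Keeping the scorer as a parameter means a trained model's scoring function could be swapped in for the toy scorer, assuming it accepts a list of (query, chunk) pairs and returns one score per pair.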

## Creating a web app and microservice for reranking

We will be using [Gradio](https://github.com/gradio-app/gradio) as a web application tool to create a demo interface for our RAG pipeline. We can develop this locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to interact directly with our RAG pipeline.

### Creating the web app

In [25]:
import gradio as gr


def rag_interface(query: str, documents: pd.DataFrame):
    documents = rerank_documents(query=query, documents=documents)
    return documents


with gr.Blocks() as demo:
    gr.Markdown("""# RAG Hub Datasets 
                
                Part of [smol blueprint](https://github.com/davidberenstein1957/smol-blueprint) - a smol blueprint for AI development, focusing on practical examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs.""")

    query_input = gr.Textbox(
        label="Query", placeholder="Enter your question here...", lines=3
    )
    documents_input = gr.Dataframe(
        label="Documents", headers=["chunk"], wrap=True, interactive=True
    )

    submit_btn = gr.Button("Submit")
    documents_output = gr.Dataframe(
        label="Documents", headers=["chunk", "rank"], wrap=True
    )

    submit_btn.click(
        fn=rerank_documents,
        inputs=[query_input, documents_input],
        outputs=[documents_output],
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7871

To create a public link, set `share=True` in `launch()`.




<iframe
	src="https://smol-blueprint-rag-hub-datasets.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

### Deploying the web app to Hugging Face

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=rag-augment).

-  Click on the "Create Space" button.
-  Copy the code from the Gradio interface and paste it into an `app.py` file. Don't forget to copy the `rerank_documents` function, along with the code that loads the reranker model.
-  Create a `requirements.txt` file with `gradio-client` and `sentence-transformers`.
-  Set a Hugging Face API token as the `HF_TOKEN` secret variable in the Space settings, if you are using the Inference API.

We wait a couple of minutes for the application to deploy and, et voilà, we have [a public RAG interface](https://huggingface.co/spaces/smol-blueprint/rag-augment)!

### Using the web app as a microservice

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to interact directly with our RAG pipeline. Each Gradio app has API documentation that describes the available endpoints and their parameters, which you can access from the button at the bottom of the Gradio app's Space page.

In [27]:
from gradio_client import Client

client = Client("https://smol-blueprint-rag-augment.hf.space/")
result = client.predict(
    query="What is the future of AI?",
    documents=similarity_search("What is the future of AI?", k=50),
    api_name="/rag_pipeline",
)
result

Loaded as API: https://smol-blueprint-rag-augment.hf.space/ ✔


ValueError: Could not fetch config for https://smol-blueprint-rag-augment.hf.space/
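The `ValueError` above says the client could not fetch the Space configuration. One possible cause, which is an assumption here and not something the output confirms, is that the Space was asleep or unavailable at call time. Here is a minimal sketch of wrapping a call in a retry loop. The helper `call_with_retry` and the `flaky_call` stand-in are hypothetical names for illustration; `flaky_call` only simulates a service that fails twice before succeeding.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    delay_seconds: float = 0.0,
    retry_on: tuple[type[BaseException], ...] = (ValueError,),
) -> T:
    """Call fn(); on a listed exception, wait and retry, then re-raise."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(delay_seconds)
    raise RuntimeError("retries must be at least 1")


attempts = {"count": 0}


def flaky_call() -> str:
    """Simulated service call that fails on the first two attempts."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("Could not fetch config")
    return "ok"


result = call_with_retry(flaky_call, retries=3)
print(result, attempts["count"])  # ok 3
```

In the notebook, the client construction or the `predict` call could be passed in as a `lambda`, with a non-zero `delay_seconds` to give the Space time to start.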

## Next steps

We have seen how to build a reranker on top of the retrieval microservice and expose it as a web app and microservice. Next steps would be to monitor the performance of the RAG pipeline and improve it.
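One simple signal to start monitoring with is how much the reranker changes the top results compared with plain retrieval. The sketch below computes the overlap between two top-k lists, using the row indices shown in the two result tables earlier in this notebook. The function name `overlap_at_k` is hypothetical, and this measures only how much the order changed, not whether the new order is better.

```python
from typing import Hashable, Sequence


def overlap_at_k(before: Sequence[Hashable], after: Sequence[Hashable], k: int = 5) -> float:
    """Fraction of the top-k items in `before` that are also in the top-k of `after`."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(before[:k]) & set(after[:k])) / k


# Row indices of the top 5 results before and after reranking, as shown above.
retrieved_top5 = [0, 1, 2, 3, 4]
reranked_top5 = [47, 6, 29, 8, 11]

print(overlap_at_k(retrieved_top5, reranked_top5, k=5))  # 0.0
```

An overlap of 0.0 means none of the original top 5 survived reranking. Judging whether that change is an improvement would need labeled relevance data, which this notebook does not include.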