# Building a RAG pipeline with a SmolLM and some rerankers

We will use the output of the [indexing](./indexing.ipynb) notebook to build a RAG pipeline. In that notebook, we saw how to create an index on the Hugging Face Hub and perform vector search on it. Now, we will build a RAG pipeline that uses this vector search index to retrieve relevant information from the indexed documents and uses a SmolLM model to answer questions. We will also show how to use a reranker to improve the quality of the RAG pipeline.

## Hugging Face as a vector search backend

As a brief recap of the previous notebook: we use Hugging Face as the vector search backend and call it as a REST API through the Gradio Python Client.

In [18]:
from gradio_client import Client
import pandas as pd

gradio_client = Client("https://smol-blueprint-vector-search-hub.hf.space/")


def similarity_search(query: str, k: int = 5):
    results = gradio_client.predict(api_name="/similarity_search", query=query, k=k)
    return pd.DataFrame(data=results["data"], columns=results["headers"])


similarity_search("What is the future of AI?", k=5)

Loaded as API: https://smol-blueprint-vector-search-hub.hf.space/ ✔


| Unnamed: 0 | chunk | url | distance |
|---|---|---|---|
| 0 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.260197 |
| 1 | "The manifold of things which were lumped into... | https://www.bbc.com/news/technology-51064369 | 0.358736 |
| 2 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.364774 |
| 3 | "\nSeveral others started to talk about AGI be... | https://www.bbc.com/news/technology-51064369 | 0.370624 |
| 4 | Can artificial intelligence predict the future... | https://www.bbc.com/news/av/technology-46104433 | 0.376021 |
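
To make the `distance` column easier to read, here is a minimal sketch of what a top-k vector search does conceptually. It assumes cosine distance and uses hand-written 2-dimensional toy embeddings; the metric and embedding model actually used by the backend are defined in the indexing notebook, and the names `cosine_distance`, `top_k` and `toy_index` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5):
    """Return (distance, chunk) pairs for the k chunks closest to the query."""
    scored = [
        (cosine_distance(query_vec, vec), chunk) for chunk, vec in chunk_vecs.items()
    ]
    return sorted(scored)[:k]


# Toy embeddings with made-up values, for illustration only.
toy_index = {
    "AI article": [0.9, 0.1],
    "Sports article": [0.1, 0.9],
    "Tech article": [0.7, 0.3],
}
toy_hits = top_k([1.0, 0.0], toy_index, k=2)
```

As in the table above, results come back ordered by increasing distance, so the first row is the closest match.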


## Reranking retrieved documents

Whenever we retrieve documents from the vector search backend, we can use a reranker to reorder them by relevance to the query before passing them to the LLM. We will use the [sentence-transformers library](https://huggingface.co/sentence-transformers). You can find the best models using the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

We will first retrieve 10 documents and then use [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) to rerank them. The top 5 are selected later, when we assemble the RAG pipeline.

In [31]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("sentence-transformers/all-MiniLM-L12-v2")


def query_and_rerank_documents(query: str, k_retrieved: int = 10):
    documents = similarity_search(query, k_retrieved)
    documents = documents.drop_duplicates("chunk")
    documents["rank"] = reranker.predict([[query, hit] for hit in documents["chunk"]])
    reranked_documents = documents.sort_values(by="rank", ascending=False)
    return reranked_documents


query_and_rerank_documents("What is the future of AI?", k_retrieved=10)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Unnamed: 0 | chunk | url | distance | rank |
|---|---|---|---|---|
| 7 | - Video: Exactly what is AI?\n- Which jobs wil... | https://www.bbc.com/news/technology-34066941 | 0.392402 | 0.49818 |
| 6 | Google has its own AI labs and has been invest... | http://www.bbc.com/news/technology-30432493 | 0.39155 | 0.497452 |
| 5 | "This is why AI is a long-term scientific rese... | https://www.bbc.com/news/technology-51064369 | 0.384782 | 0.496838 |
| 2 | Artificial intelligence (AI) is one of the mos... | https://www.bbc.co.uk/news/business-48139212 | 0.364774 | 0.496125 |
| 8 | Singularity: The robots are coming to steal ou... | http://www.bbc.co.uk/news/technology-25000756 | 0.395984 | 0.495883 |
| 3 | "\nSeveral others started to talk about AGI be... | https://www.bbc.com/news/technology-51064369 | 0.370624 | 0.495376 |
| 0 | The last decade was a big one for artificial i... | https://www.bbc.com/news/technology-51064369 | 0.260197 | 0.494808 |
| 1 | "The manifold of things which were lumped into... | https://www.bbc.com/news/technology-51064369 | 0.358736 | 0.494693 |
| 4 | Can artificial intelligence predict the future... | https://www.bbc.com/news/av/technology-46104433 | 0.376021 | 0.48929 |


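We can see that the order of the returned documents has changed, so the reranking step runs end to end. Note, however, the warning printed above: `sentence-transformers/all-MiniLM-L12-v2` is an embedding model rather than a trained cross-encoder, so `CrossEncoder` adds a newly initialized classification head, which explains why all scores sit close to 0.49. For meaningful relevance scores, swap in a checkpoint trained as a cross-encoder, for example `cross-encoder/ms-marco-MiniLM-L-6-v2`.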
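
The retrieve-then-rerank pattern itself does not depend on the scoring model. The sketch below illustrates it with a toy word-overlap scorer standing in for the cross-encoder, so it runs without downloading anything; `rerank` and `word_overlap` are hypothetical helpers, not part of sentence-transformers.

```python
def rerank(query, chunks, score_fn, k=5):
    """Score every (query, chunk) pair and return the k best, highest first."""
    unique_chunks = list(dict.fromkeys(chunks))  # drop duplicates, keep order
    scored = [(score_fn(query, chunk), chunk) for chunk in unique_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def word_overlap(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)


candidates = [
    "Football results from the weekend",
    "The future of AI research",
    "AI is changing the future of work",
    "The future of AI research",  # duplicate, removed before scoring
]
reranked = rerank("future of AI research", candidates, word_overlap, k=2)
```

In the notebook, the role of `score_fn` is played by `reranker.predict`, which scores all pairs in one batch instead of one call per pair.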

## Generating responses with retrieved documents

We will now use the retrieved documents as context to generate a response with a SmolLM model. We will use a local SmolLM with transformers and another one hosted through the Hugging Face Inference API.

### Inference with LLMs

When running a SmolLM locally, we can choose between different providers, like `llama-cpp`, `vllm` or `ollama`. For simplicity, we will use a local transformers pipeline and a model hosted on the Inference API.

#### SmolLM in transformers

We will use [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and use [the transformers integration attached to the model on the Hub](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct?library=transformers).

In [6]:
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")


def generate_response_transformers(query: str):
    messages = [
        {
            "role": "system",
            "content": "You will receive a query and context. Only return the answer based on the context without mentioning the context.",
        },
        {"role": "user", "content": query},
    ]
    return pipe(messages, max_new_tokens=4000)


generate_response_transformers("What is the future of AI?")

Device set to use mps:0


[{'generated_text': [{'role': 'user', 'content': 'What is the future of AI?'},
   {'role': 'assistant',
    'content': "AI is evolving rapidly, and we're on the cusp of a new era. Here are some of the key trends and advancements that will shape the future of AI:\n\n1. **Quantum Computing**: Quantum computers are expected to revolutionize AI by solving complex problems that are currently unsolvable by classical computers. They will also enable AI to learn from itself, making it more self-aware and capable of adapting to new situations.\n\n2. **Neuromorphic Computing**: Neuromorphic computing is a type of AI that mimics the human brain's structure and function. It's expected to be used in AI systems that can learn and adapt in real-time, much like the human brain.\n\n3. **Explainable AI (XAI)**: Explainable AI aims to make AI systems more transparent and understandable. This is important because AI systems are often used in critical applications, and it's crucial that they are transparen
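
The pipeline returns the whole conversation rather than just the answer. A minimal sketch of pulling out the assistant's reply, assuming the output keeps the list-of-messages structure shown above; `extract_reply` is a hypothetical helper and the example output is hand-written.

```python
def extract_reply(pipeline_output):
    """Return the assistant's text from a chat-style text-generation output."""
    conversation = pipeline_output[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("expected the last message to come from the assistant")
    return last_message["content"]


# Hand-written example mirroring the structure of the output above.
example_output = [
    {
        "generated_text": [
            {"role": "user", "content": "What is the future of AI?"},
            {"role": "assistant", "content": "AI is evolving rapidly."},
        ]
    }
]
reply = extract_reply(example_output)
```

With the real pipeline this would be called as `extract_reply(generate_response_transformers("What is the future of AI?"))`.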

#### SmolLM in Hugging Face Inference API

We will use the [serverless Hugging Face Inference API](https://huggingface.co/inference-api). This is free and means we don't have to worry about hosting the model. We can find models available for inference [through some basic filters on the Hub](https://huggingface.co/models?inference=warm&pipeline_tag=text-generation&sort=trending). We will use the same [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) model as above and call it through the chat completion interface of the `InferenceClient`.

In [28]:
from huggingface_hub import get_token, InferenceClient


inference_client = InferenceClient(api_key=get_token())


def generate_response_api(query: str):
    messages = [
        {
            "role": "system",
            "content": "You will receive a query and context. Only return the answer based on the context without mentioning the context.",
        },
        {"role": "user", "content": query},
    ]
    completion = inference_client.chat.completions.create(
        model="HuggingFaceTB/SmolLM2-360M-Instruct", messages=messages, max_tokens=2000
    )

    return completion.choices[0].message


generate_response_api("What is the future of AI?")

ChatCompletionOutputMessage(role='assistant', content='The future of AI is growing rapidly with a potential boom in various industries like healthcare, finance, transportation, and education. The integration of AI into these sectors creates new opportunities for better user experience and business performance. For instance, AI can aid hospitals to combat misinformation and provide more personalized patient care. In the financial sector, AI helps make smarter investment decisions. Furthermore, AI-powered robots can improve the efficiency of logistics, such as helping package and shipping products efficiently. Education is another sector poised for transformation with AI-driven tools enabling students to learn at a faster pace and more effectively.', tool_calls=None)

## The actual RAG pipeline

We will now build a RAG pipeline that uses the vector search backend to retrieve documents, reranks them and then uses the LLM to generate a response.

In [32]:
def rag_pipeline(query: str, k_retrieved: int = 10, k_reranked: int = 5):
    documents = query_and_rerank_documents(query, k_retrieved=k_retrieved)
    query_with_context = (
        f"Context: {documents['chunk'].to_list()[:k_reranked]}\n\nQuery: {query}"
    )
    return generate_response_api(query_with_context), documents


rag_pipeline("What is the future of AI?")

(ChatCompletionOutputMessage(role='assistant', content='Based on the provided context, I would say that the future of AI looks bright with the development and implementation of AGI (Artificial General Intelligence). AI is expected to outperform human intelligence, improve tasks and systems, and bring about transformative changes in various industries. The main drivers of this advancement include the explosion of data, increasing AI power and computational resources, and the emergence of widespread access to powerful computing. The smartphones that we use today could very well comprise AI, the streets of cities could contain smart city systems, and AI agents are already making decisions far beyond our manual control.\n\nThis suggests that AI encompasses a realm of possibilities beyond the knowledge of human intelligence and fortunately for us, we can expect to see the opportunities for human and AI collaboration in unprecedented ways. The question of a future where machines usurp jobs a
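
The pipeline above places the Python list of chunks directly into the prompt. One possible variation, sketched here, formats the chunks as numbered passages instead. `build_prompt` is a hypothetical helper, and whether this layout improves answers for a given model is untested.

```python
def build_prompt(query, chunks, k=5):
    """Format the top-k chunks as numbered passages followed by the query."""
    passages = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks[:k], start=1)
    )
    return f"Context:\n{passages}\n\nQuery: {query}"


prompt = build_prompt(
    "What is the future of AI?",
    ["AI is one of the most discussed technologies.\n", "Google has its own AI labs."],
    k=2,
)
```

Inside `rag_pipeline`, this would replace the f-string, with `documents['chunk'].to_list()` passed as `chunks`.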

## Gradio as RAG interface

We will use [Gradio](https://github.com/gradio-app/gradio) as a web application tool to create a demo interface for our RAG pipeline. We can develop this locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to interact directly with our RAG pipeline.

### Gradio as sharable app


In [38]:
import gradio as gr


def rag_interface(query: str, k_retrieved: int, k_reranked: int):
    # Sliders can deliver floats; cast so slicing and the search API receive integers.
    response, documents = rag_pipeline(
        query, k_retrieved=int(k_retrieved), k_reranked=int(k_reranked)
    )
    return response.content, documents


with gr.Blocks() as demo:
    gr.Markdown("""# RAG Hub Datasets 
                
                Part of [smol blueprint](https://github.com/davidberenstein1957/smol-blueprint) - a smol blueprint for AI development, focusing on practical examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs.""")

    with gr.Row():
        query_input = gr.Textbox(
            label="Query", placeholder="Enter your question here...", lines=3
        )

    with gr.Row():
        with gr.Column():
            retrieve_slider = gr.Slider(
                minimum=1,
                maximum=20,
                value=10,
                step=1,  # whole numbers only
                label="Number of documents to retrieve",
            )
        with gr.Column():
            rerank_slider = gr.Slider(
                minimum=1,
                maximum=10,
                value=5,
                step=1,  # whole numbers only
                label="Number of documents to use after reranking",
            )

    submit_btn = gr.Button("Submit")
    response_output = gr.Textbox(label="Response", lines=10)
    documents_output = gr.Dataframe(
        label="Documents", headers=["chunk", "url", "distance", "rank"], wrap=True
    )

    submit_btn.click(
        fn=rag_interface,
        inputs=[query_input, retrieve_slider, rerank_slider],
        outputs=[response_output, documents_output],
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.




<iframe
	src="https://smol-blueprint-rag-hub-datasets.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

### Deploying Gradio to Hugging Face Spaces

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=rag-hub-datasets).

- Click on the "Create Space" button.
- Copy the code from the Gradio interface and paste it into an `app.py` file. Don't forget to copy the `generate_response_*` function you use, along with the code that executes the RAG pipeline.
- Create a `requirements.txt` file with `gradio-client` and `sentence-transformers`.
- Set a Hugging Face API token as the `HF_TOKEN` secret variable in the Space settings, if you are using the Inference API.
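
Secrets configured in the Space settings are exposed to the app as environment variables. A minimal sketch of reading the token explicitly in `app.py`, so a missing secret fails with a clear message; `read_hf_token` is a hypothetical helper, and relying on `get_token()` as in the cell above works as well.

```python
import os


def read_hf_token(env=os.environ):
    """Return the HF_TOKEN secret, or raise a clear error when it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it as a secret in the Space settings"
        )
    return token
```

The client would then be created with `InferenceClient(api_key=read_hf_token())`.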

We wait a couple of minutes for the application to deploy, et voilà, we have [a public RAG interface](https://huggingface.co/spaces/smol-blueprint/rag-hub-datasets)!

### Gradio as REST API

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to interact directly with our RAG pipeline. Each Gradio app has an API documentation page that describes the available endpoints and their parameters, which you can open from the button at the bottom of the Gradio app's Space page.

In [40]:
from gradio_client import Client

client = Client("https://smol-blueprint-rag-hub-datasets.hf.space/")
result = client.predict(
    query="What is the future of AI?",
    k_retrieved=10,
    k_reranked=5,
    api_name="/rag_pipeline",
)
result

Loaded as API: https://smol-blueprint-rag-hub-datasets.hf.space/ ✔
('In the future, artificial intelligence (AI) is expected to play an increasingly significant role in various aspects of society, including object recognition, computer vision, natural language processing, robotics, and more. AI is already being used to develop products and services that enhance personal convenience, make life more efficient, and improve the quality of lives.\n\nAI work will also be more methodical and less reliant on high-bandwidth parallelism, setting it free from the "wonder years" of the 21st century. This will allow AI systems to generalize and adapt more effectively, borrowing from examples and experiences to solve problems more efficiently with fewer examples.\n\nArtificial intelligence is expected to see an industry growth of $23 billion by 2023, a $3.75 billion growth compared to the previous year, and to continue growing at $6.9 billion each year.\n\nResearchers are actively exploring various 
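
The result is a tuple holding the response text and the documents. A minimal sketch of turning the documents part into row dictionaries, assuming it has the same `headers`/`data` shape as the `similarity_search` response at the start of this notebook; `rows_from_payload` is a hypothetical helper and the example values are made up.

```python
def rows_from_payload(payload):
    """Turn a {'headers': [...], 'data': [[...], ...]} payload into a list of dicts."""
    headers = payload["headers"]
    return [dict(zip(headers, row)) for row in payload["data"]]


# Hand-written payload with made-up values, for illustration only.
example_result = (
    "An example answer.",
    {
        "headers": ["chunk", "url", "distance", "rank"],
        "data": [["Some chunk", "https://example.com/a", 0.26, 0.91]],
    },
)
answer, documents_payload = example_result
rows = rows_from_payload(documents_payload)
```

The app's API documentation page shows the exact return types, so it is worth checking there before relying on this shape.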

## Next steps

We have seen how to build a RAG pipeline with a SmolLM and some rerankers. Next steps would be to monitor the performance of the RAG pipeline and improve it.