# Building a RAG pipeline with a SmolLM and some rerankers

This notebook builds on the output of the [indexing](./indexing.ipynb) notebook, where we saw how to index the Hugging Face Hub and perform vector search on it. Here, we build a RAG pipeline that uses this vector search index to retrieve relevant information from a company's documents and a SmolLM model to answer questions. We also show how rerankers can improve the quality of the RAG pipeline.

## Hugging Face as a vector search backend

As a brief recap of the previous notebook: we use Hugging Face as the vector search backend and can call it as a REST API through the Gradio Python Client.

In [5]:
from gradio_client import Client
import pandas as pd

client = Client("https://smol-blueprint-vector-search-hub.hf.space/")

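def similarity_search(query: str, k: int = 5) -> pd.DataFrame:
    """Query the hosted vector search Space and return the top-k chunks as a DataFrame."""
    results = client.predict(
        api_name="/similarity_search",
        query=query,
        k=k,
    )
    # The endpoint returns a dict with "headers" (column names) and "data" (rows).
    return pd.DataFrame(data=results["data"], columns=results["headers"])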

similarity_search("Optimizing LLM inference", k=5)

Loaded as API: https://smol-blueprint-vector-search-hub.hf.space/ ✔


| Unnamed: 0 | title | author | date | local | tags | URL | chunk | distance |
|---|---|---|---|---|---|---|---|---|
| 0 | Introducing the Private Hub: A New Way to Buil... | FedericoPascual | August 3, 2022 | introducing-private-hub | announcement, enterprise, hub | https://huggingface.co/blog/introducing-privat... | Training accurate models faster | 0.192108 |
| 1 | Fine-tuning Llama 2 70B using PyTorch FSDP | smangrul | September 13, 2023 | ram-efficient-pytorch-fsdp | llm, guide, nlp | https://huggingface.co/blog/ram-efficient-pyto... | Fine-Tuning | 0.193254 |
| 2 | Making ML-powered web games with Transformers.js | Xenova | July 5, 2023 | ml-web-games | game-dev, guide, web, javascript, transformers.js | https://huggingface.co/blog/ml-web-games | 1. Training the neural network | 0.196486 |
| 3 | Open-Source Text Generation & LLM Ecosystem at... | merve | July 17, 2023 | os-llms | LLM, inference, nlp | https://huggingface.co/blog/os-llms | Tools in the Hugging Face Ecosystem for LLM Se... | 0.197265 |
| 4 | Comparing the Performance of LLMs: A Deep Dive... | mehdiiraqui | November 7, 2023 | Lora-for-sequence-classification-with-Roberta-... | nlp, guide, llm, peft | https://huggingface.co/blog/Lora-for-sequence-... | Pre-trained Models | 0.198704 |
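
To make the `distance` column more concrete, here is a minimal sketch of what a top-k similarity search computes. It assumes the index uses cosine distance (the actual metric is chosen in the indexing notebook, so treat this as an assumption). The three-dimensional vectors, the document ids, and the `cosine_distance` and `top_k` helpers are hypothetical stand-ins for real embeddings and the hosted index.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; smaller values mean more similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 5):
    """Return (doc_id, distance) pairs for the k closest documents, closest first."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical toy embeddings; real ones come from an embedding model.
docs = {
    "inference": [0.9, 0.1, 0.0],
    "training": [0.1, 0.9, 0.0],
    "games": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

nearest = top_k(query, docs, k=2)
print(nearest)
```

Sorting in ascending order matters here: with a distance, the best match is the smallest value, which is the opposite of the descending sort used for relevance scores.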


## Reranking retrieved documents

Whenever we retrieve documents from the vector search backend, we can use a reranker to improve their ordering before passing them to the LLM. A reranker scores each (query, document) pair jointly, which is slower than vector search but usually more precise, so it is applied only to the retrieved candidates. We will use the [sentence-transformers library](https://huggingface.co/sentence-transformers). You can compare candidate models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

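We will first retrieve 10 documents and then use [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) to rerank the documents and return the top 5. Note that this checkpoint is a bi-encoder embedding model rather than a trained cross-encoder, so `CrossEncoder` adds a randomly initialized classification head to it (see the warning in the output). For meaningful scores, swap in a checkpoint trained for reranking, such as `cross-encoder/ms-marco-MiniLM-L-6-v2`.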

In [14]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("sentence-transformers/all-MiniLM-L12-v2")

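def rerank_retrieved_documents(query: str, k_retrieved: int = 10, k_reranked: int = 5):
    """Retrieve k_retrieved chunks, score them with the reranker, and keep the k_reranked best."""
    documents = similarity_search(query, k_retrieved)
    # Score each (query, chunk) pair; despite its name, "rank" holds a relevance score (higher is better).
    documents["rank"] = reranker.predict([[query, hit] for hit in documents["chunk"]])
    documents = documents.sort_values(by="rank", ascending=False)
    return documents.head(k_reranked)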

rerank_retrieved_documents("How can I optimize LLM inference?", k_retrieved=10, k_reranked=5)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L12-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Unnamed: 0 | title | author | date | local | tags | URL | chunk | distance | rank |
|---|---|---|---|---|---|---|---|---|---|
| 8 | StarCoder: A State-of-the-Art LLM for Code | lvwerra | May 4, 2023 | starcoder | nlp, community, research | https://huggingface.co/blog/starcoder | # StarCoder: A State-of-the-Art LLM for Code | 0.240642 | 0.494235 |
| 9 | Deploy Hugging Face models easily with Amazon ... | philschmid | July 8, 2021 | deploy-hugging-face-models-easily-with-amazon-... | guide, partnerships, aws | https://huggingface.co/blog/deploy-hugging-fac... | pseudo code end | 0.245591 | 0.493449 |
| 6 | Porting fairseq wmt19 translation system to tr... | stas | November 3, 2020 | porting-fsmt | open-source-collab, nlp | https://huggingface.co/blog/porting-fsmt | Notes | 0.239382 | 0.493295 |
| 2 | Open-Source Text Generation & LLM Ecosystem at... | merve | July 17, 2023 | os-llms | LLM, inference, nlp | https://huggingface.co/blog/os-llms | Tools in the Hugging Face Ecosystem for LLM Se... | 0.233712 | 0.49288 |
| 3 | Block Sparse Matrices for Smaller and Faster L... | madlag | Sep 10, 2020 | pytorch_block_sparse | research, nlp | https://huggingface.co/blog/pytorch_block_sparse | # Block Sparse Matrices for Smaller and Faster... | 0.234537 | 0.492682 |
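
The `rank` scores in the output above all sit in a narrow band around 0.49, which is consistent with the randomly initialized classification head reported in the warning. A quick sanity check, sketched here with the scores copied from that output and a hypothetical threshold of 0.01 (an illustrative value, not a library default), is to look at the spread between the best and the worst score.

```python
# Reranker scores copied from the `rank` column of the output above.
rank_scores = [0.494235, 0.493449, 0.493295, 0.49288, 0.492682]

def score_spread(scores: list[float]) -> float:
    """Return the difference between the highest and the lowest score."""
    return max(scores) - min(scores)

# Hypothetical threshold chosen for illustration only.
MIN_USEFUL_SPREAD = 0.01

spread = score_spread(rank_scores)
is_suspicious = spread < MIN_USEFUL_SPREAD
print(f"spread={spread:.6f}, suspicious={is_suspicious}")
```

A spread this small suggests the reranker barely distinguishes between the candidates, so the resulting order should not be trusted without a model trained for the task.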

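
Because the quality of reranking depends entirely on the scoring model, it can help to keep the scorer pluggable. The sketch below shows the retrieve-then-rerank pattern with the scorer passed in as a function. The `rerank` helper, the `Scorer` alias, the `token_overlap_scorer` toy scorer, and the candidate strings are all hypothetical and only illustrate the shape of the pattern; the toy scorer is a stand-in, not a substitute for a trained cross-encoder.

```python
from typing import Callable, Sequence

# A scorer maps (query, document) pairs to one relevance score per pair.
Scorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank(query: str, documents: Sequence[str], scorer: Scorer, k: int = 5) -> list[tuple[str, float]]:
    """Score every (query, document) pair and return the k best, highest score first."""
    pairs = [(query, doc) for doc in documents]
    scores = scorer(pairs)
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:k]]

def token_overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy scorer: fraction of query tokens that also appear in the document."""
    scores = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        scores.append(len(query_tokens & doc_tokens) / max(len(query_tokens), 1))
    return scores

# Hypothetical candidate chunks, as if returned by the vector search step.
candidates = [
    "Training accurate models faster",
    "Tools for LLM inference and serving",
    "Making web games with Transformers.js",
]

top = rerank("optimize LLM inference", candidates, token_overlap_scorer, k=2)
print(top)
```

Since `CrossEncoder.predict` takes a list of pairs and returns one score per pair, passing `reranker.predict` as the `scorer` argument should fit this shape, which makes it easy to compare different reranker checkpoints on the same candidates.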