
**What are RAG Rerankers?**
- Definition of RAG Rerankers
- Introduction to the RAG (Retrieval-Augmented Generation) framework
- Overview of reranking process in natural language generation

**Why Use RAG Rerankers?**
- Improving answer relevance
- Enhancing response diversity
- Increasing model performance and user satisfaction

**How Do RAG Rerankers Work?**
- Retrieval phase: obtaining candidate passages/documents
- Reranking phase: reordering the retrieved passages by relevance score
- Augmentation and generation phase: passing the top-ranked passages to the LLM together with the prompt
- Fine-tuning and optimization techniques

**Implementing RAG Rerankers**
- Choosing appropriate retrieval methods
- Integrating RAG rerankers with existing NLP pipelines
- Selecting relevant prompts and inputs
- Fine-tuning and hyperparameter tuning

**Benefits of RAG Rerankers**
- Improved answer quality and relevance
- Enhanced user experience
- Better performance on specific tasks
- Potential for domain adaptation and transfer learning

**Challenges and Considerations**
- Computational complexity and resource requirements
- Fine-tuning data availability and quality
- Overfitting and generalization issues
- Ethical considerations and biases in retrieval and reranking

**Applications of RAG Rerankers**
- Question answering systems
- Chatbots and virtual assistants
- Information retrieval and summarization
- Content recommendation systems

**Conclusion**
- Recap of the importance and functionality of RAG rerankers
- Future directions and research avenues
- Final thoughts on the potential impact of RAG rerankers in NLP tasks

## **What?**

**The problem:**

Semantic search over vector representations of text documents loses information, so the top results frequently fail to capture relevant content. How can we ensure that potentially useful but lower-ranked information is included in the results to improve the language model's response?

**Introduction:** 

A Retrieval-Augmented Generation (RAG) system combines information retrieval capabilities with the generative prowess of LLMs. RAG is easy to use but hard to master. In out-of-the-box RAG, we put documents into a vector database and add an LLM on top, expecting it to return relevant results.

However, this might not always work. 
- RAG performs semantic search across many text documents, anywhere from tens to billions of them.
- Typically, we transform our text into vectors so that the search stays fast at that scale, and this compression loses information.


**Terminology:** 

- Context window: The maximum number of tokens the model can take as input when generating a response. A larger context window lets the model see more retrieved text at once, which can make it more useful across applications.

- LLM recall: The ability of an LLM to find information in the text placed within its context window.

- Retrieval recall: The percentage of correct documents returned by the retriever in response to a query.

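- Context stuffing: Filling the context window with as many retrieved documents as will fit, in the hope that the relevant ones are among them. Putting too much text into the context can reduce LLM recall.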

- Reranking: The reranker component, a learning-to-rank approach, sifts through the documents returned by the retriever and orders them by their relevance to the query.

- Bi-encoder: There are two separate encoders, one for the input query and another for the candidate documents. They work independently, producing one embedding for the query and one for each document. During inference, the model computes a similarity score between the query embedding and each document embedding, and the document with the highest score is considered the most relevant. Bi-encoders are commonly used where document retrieval or ranking is the primary goal, such as search engines or recommendation systems.

- Cross-encoder: The query and document are processed together in a single encoder. The model takes both the query and the document as input and produces a joint representation, from which it outputs a single similarity score for each query-document pair. The document with the highest score is considered the most relevant. Cross-encoders are commonly used where understanding the relationship between the query and the document is essential, such as duplicate detection, question answering, or information retrieval.
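
To make the difference between the last two definitions concrete, here is a minimal Python sketch. It does not use a real model: `embed` is a hypothetical bag-of-words stand-in for a trained encoder, and `cross_encoder_score` is a hypothetical word-order heuristic standing in for a trained cross-encoder. The toy deliberately exaggerates the gap between the two; only the data flow (encode separately versus score the pair jointly) is meant to carry over.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bi_encoder_scores(query: str, docs: list[str]) -> list[float]:
    """Bi-encoder flow: documents and query are encoded independently."""
    doc_vecs = [embed(d) for d in docs]  # could be precomputed before any query arrives
    q_vec = embed(query)
    return [cosine(q_vec, v) for v in doc_vecs]


def bigrams(text: str) -> set:
    """Set of adjacent word pairs, used as a crude word-order signal."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def cross_encoder_score(query: str, doc: str) -> float:
    """Cross-encoder flow: one score computed from the (query, document) pair.

    Toy heuristic (assumption): fraction of the query's bigrams found in the document.
    """
    q = bigrams(query)
    return len(q & bigrams(doc)) / len(q) if q else 0.0


query = "dog bites man"
docs = ["man bites dog", "dog bites man today"]
print("bi-encoder:   ", bi_encoder_scores(query, docs))
print("cross-encoder:", [cross_encoder_score(query, d) for d in docs])
```

In this toy, the independently built vectors cannot tell "man bites dog" from "dog bites man", so the first document wins on similarity. The pair-aware scorer sees both texts at once and prefers the second document.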

<img src="./rerank.jpeg" style="height: 300px;"/>

**Two-stage Retrieval:**

1. First-stage model: an embedding model (the retriever) retrieves a set of relevant documents from a larger dataset.
2. Second-stage model: the reranker reorders the documents retrieved by the first-stage model.
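
The two stages above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a production setup: `overlap_score` is a hypothetical cheap first-stage scorer standing in for vector search, `order_aware_score` is a hypothetical stand-in for a reranker model, and `two_stage_search`, `top_k`, and `top_n` are names chosen for this example.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Cheap first-stage stand-in (assumption): number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def order_aware_score(query: str, doc: str) -> float:
    """Costlier second-stage stand-in (assumption): number of shared adjacent word pairs."""
    def pairs(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(pairs(query) & pairs(doc)))


def two_stage_search(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    top_k: int = 3,
    top_n: int = 1,
) -> List[str]:
    # Stage 1: score every document with the cheap scorer and keep the top_k candidates.
    candidates = sorted(corpus, key=lambda d: first_stage(query, d), reverse=True)[:top_k]
    # Stage 2: run the expensive scorer on the candidates only, then keep the top_n.
    reranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return reranked[:top_n]


CORPUS = [
    "man bites dog in the park",
    "dog bites man in the park",
    "the weather is sunny today",
    "stock markets fell sharply",
]

print(two_stage_search("dog bites man", CORPUS, overlap_score, order_aware_score, top_k=2))
```

The reranker is called only `top_k` times, however large the corpus is. Choosing `top_k` is the trade-off: a larger value gives the reranker more chances to recover a relevant document the first stage ranked low, at the cost of more reranker calls.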




## **Why?**

- Because retrieving a small set of documents from a large dataset is much faster than reranking a large set of documents. 
- Rerankers are slow, and retrievers are fast. However, rerankers are much more accurate than embedding models. 
- Intuition behind the problem: Bi-encoders compress all of the possible meanings of a document into a single vector, thereby losing information. These embeddings are created before the query is known, so they carry no query-specific context, which sometimes yields irrelevant results.
- A reranker, on the other hand, receives the raw query and document text together in one large transformer computation, so less information is lost.
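
The speed argument in the first two bullets can be put into rough numbers. The sketch below is a back-of-the-envelope cost model, and every figure in it is a made-up placeholder rather than a measurement: it assumes a fixed reranker cost per document and a fixed cost for one vector search, which real systems will not match exactly.

```python
def rerank_everything_ms(n_docs: int, rerank_ms_per_doc: float) -> float:
    """Cost of scoring every document in the dataset with the reranker."""
    return n_docs * rerank_ms_per_doc


def two_stage_ms(retrieve_ms: float, top_k: int, rerank_ms_per_doc: float) -> float:
    """Cost of one fast retrieval plus reranking only the top_k candidates."""
    return retrieve_ms + top_k * rerank_ms_per_doc


# Hypothetical placeholder values, chosen only to show the shape of the trade-off.
N_DOCS = 1_000_000
RERANK_MS_PER_DOC = 5.0
RETRIEVE_MS = 50.0
TOP_K = 25

print("rerank everything:", rerank_everything_ms(N_DOCS, RERANK_MS_PER_DOC), "ms")
print("two-stage:        ", two_stage_ms(RETRIEVE_MS, TOP_K, RERANK_MS_PER_DOC), "ms")
```

Under these assumptions the reranker's cost grows with the number of documents it scores, so limiting it to `TOP_K` candidates keeps query time roughly independent of dataset size.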






## **How?**