The first LLMs relied entirely on their own (parametric) knowledge. Problem: the knowledge of the first ChatGPT was 2 years behind the actual date. Overcoming this would require continuously retraining the whole model, which is infeasible.

Classic RAG system:
1. The retriever fetches candidate documents (the focus is on recall)<br>
   - using lexical matching of the query terms
   - using embedding similarity
2. The LLM uses them as context for generation
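A minimal sketch of this two-stage pipeline, assuming a toy term-overlap scorer in place of a real lexical (e.g. BM25) or embedding retriever. `retrieve` and `build_prompt` are hypothetical helpers, and the LLM call itself is left out: the sketch stops at the prompt that would be sent to it.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, docs, k=2):
    """Toy lexical retriever: score = how often the query terms occur in the doc."""
    q_terms = set(tokenize(query))
    scored = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        scored.append((sum(counts[t] for t in q_terms), doc))
    # Highest score first; documents with no overlap are dropped
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, context):
    """Pack the retrieved documents into the prompt that would be sent to the LLM."""
    ctx = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context))
    return f"Answer using only the context below.\n\nContext:\n{ctx}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
]
query = "What is the capital of France?"
context = retrieve(query, docs)
prompt = build_prompt(query, context)
print(prompt)
```

Note that the toy scorer also pulls in the Berlin document because of shared common words; a recall-oriented first stage tolerates such noise and leaves it to the LLM (or a reranker) to ignore.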

Challenge, not only in LLM generation but in IR generally: the original user query might not be enough to retrieve all relevant documents

# User query insufficiency (1982)
---
[[paper]](https://arxiv.org/pdf/2503.00223)<br> Users often cannot articulate their needs precisely because they don’t fully understand the problem => IR systems must be context-dependent, personalized and able to maintain a dialogue

# Rocchio (1971)
---
The Rocchio algorithm defined the first personalized version of document retrieval (relevance feedback). It shifts the query vector (in the shared document-query vector space) towards the documents this user previously judged relevant, and away from those judged non-relevant
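The classic Rocchio update moves the query toward the centroid of the relevant documents $D_r$ and away from the centroid of the non-relevant ones $D_{nr}$:

$$\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d} \;-\; \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}\in D_{nr}}\vec{d}$$

A minimal pure-Python sketch on a made-up 3-term vocabulary. The weights α=1.0, β=0.75, γ=0.15 are commonly quoted defaults, not values from the original paper; treat them as assumptions.

```python
def centroid(vectors, dim):
    """Mean of a list of equal-length vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant docs and away from non-relevant ones."""
    dim = len(query)
    rel = centroid(relevant, dim)
    nonrel = centroid(nonrelevant, dim)
    updated = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, nonrel)]
    # Negative term weights are commonly clipped to zero
    return [max(0.0, w) for w in updated]


# Toy vocabulary: ["jaguar", "cat", "car"]
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # docs the user marked as relevant
nonrelevant = [[1.0, 0.0, 1.0]]                # doc the user marked as not relevant
new_query = rocchio(query, relevant, nonrelevant)
print(new_query)
```

The updated query gains weight on "cat" (present in the relevant documents) while "car" (only in the rejected document) stays at zero.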

__Query Augmentation__ = rewriting the user query to reflect all aspects of the user's intent

# PRF (2021)
---
[[paper]](https://arxiv.org/abs/2108.11044)<br>PRF = Pseudo-Relevance Feedback. It is an algorithm that enhances the original query with related / synonymous terms. The top-ranked documents of a first retrieval round are assumed to be relevant (hence "pseudo"), and the expansion terms are the most frequent terms in that list

<img src="img/prf.png" width=400>
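A minimal sketch of classic term-based PRF (the linked paper applies the idea to dense representations instead). It assumes regex tokenization, a tiny hand-written stopword list and raw term frequency as the selection criterion; real systems usually weight terms (e.g. tf-idf or RM3). `prf_expand` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones)
STOPWORDS = {"the", "a", "an", "and", "is", "of", "in", "to", "for", "on"}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def prf_expand(query, ranked_docs, top_k=3, n_terms=2):
    """Assume the top_k first-round results are relevant and append their
    n_terms most frequent terms that are not already in the query."""
    q_terms = set(tokenize(query))
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS and t not in q_terms)
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return " ".join([query] + expansion)


first_round = [
    "The jaguar is a big cat and a fast predator",
    "Jaguar cat hunting speed and predator behaviour",
    "Cat facts: the jaguar is the largest cat in the Americas",
    "Jaguar car dealership prices",  # below top_k, ignored
]
expanded = prf_expand("jaguar speed", first_round)
print(expanded)
```

The expanded query is then sent to the retriever for a second round.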




# Dense Retrieval
---
Dense Retrieval is a family of algorithms in which queries and documents are mapped into the same latent vector space by an encoder. Query-document proximity is then computed with cosine or dot-product similarity
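A minimal sketch of the scoring step, assuming the embeddings have already been produced by some encoder (the vectors below are made up). It ranks documents by cosine similarity with a linear scan; at scale an approximate nearest-neighbor index would typically replace the scan.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dense_search(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents closest to the query embedding."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]


# Made-up 3-d embeddings standing in for encoder outputs
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [1.0, 0.0, 0.0]
print(dense_search(query_vec, doc_vecs))
```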




# REALM (2020)
---
[[paper]]()<br>
REALM = Retrieval-Augmented Language Model Pre-Training: one of the first implementations of Dense Retrieval in which the retriever is trainable, and is trained jointly with the language model that reads the retrieved documents

The retriever is a two-tower (bi-encoder) model: query and document are encoded separately into single vectors and scored by their inner product

Training procedure
1. pre-training: spans of a text corpus are masked (MLM), which turns it into a labeled, cloze-style QA dataset for self-supervised training
2. fine-tuning: trained on an open-domain QA dataset

<img src="img/realm.png" width=750>
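Joint training is possible because the retriever's scores enter the training objective: the retrieved document $z$ is treated as a latent variable and marginalized over, $p(y\mid x)=\sum_{z} p(y\mid x,z)\,p(z\mid x)$, with $p(z\mid x)$ a softmax over query-document inner products. A minimal numeric sketch; the embeddings and the reader's per-document answer probabilities are made-up inputs (in REALM both come from BERT-based encoders).

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_likelihood(query_emb, doc_embs, p_y_given_doc):
    """p(y|x) = sum_z p(y|x,z) * p(z|x), where p(z|x) is a softmax over
    query-document inner products (the retriever's scores)."""
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    p_doc = softmax(scores)
    return sum(pz * py for pz, py in zip(p_doc, p_y_given_doc))


query_emb = [1.0, 0.0]
doc_embs = [[2.0, 0.0], [0.0, 2.0]]  # retrieved candidates z1, z2
p_y_given_doc = [0.9, 0.1]           # reader's probability of the answer given each z
p = marginal_likelihood(query_emb, doc_embs, p_y_given_doc)
loss = -math.log(p)                  # gradients reach both retriever and reader
print(round(p, 4), round(loss, 4))
```

Minimizing the loss pushes up the retrieval score of documents that help the reader produce the correct answer.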



# ColBERT (2020)
---
[[paper]](https://arxiv.org/pdf/2004.12832)<br>
In the late 2010s, transformer-based encoders (BERT) substantially improved ranking quality by learning the latent semantics of queries and documents, but running BERT over every query-document pair is expensive. Late-interaction models like ColBERT made the process efficient




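Embeddings are per-token => every query token is compared with every document token; the maximum similarity per query token is kept and these maxima are summed (MaxSim)<br>
Interaction is late: query and documents are encoded independently, so document embeddings can be precomputed offline<br>
Used only for retrieval / ranking, not generation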



<img src="img/colbert.png" width=750>
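A minimal sketch of the MaxSim scoring, assuming the per-token embeddings are already computed (the vectors below are made up and roughly unit-normalized, so the dot product behaves like cosine similarity).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token embedding is matched
    against every document token embedding; the best (max) similarity per
    query token is kept and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Made-up 2-d token embeddings
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]
doc_b = [[0.5, 0.5]]
score_a, score_b = maxsim(query_tokens, doc_a), maxsim(query_tokens, doc_b)
print(score_a, score_b)
```

Since `doc_tokens` never depends on the query, the document side can be encoded and indexed once; at query time only the query is encoded and the cheap MaxSim is computed.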

# Zero-shot retrieval
---
Out-of-the-box pretrained encoders do not know how to rank by relevance. They need to be fine-tuned on a relevance dataset, usually with contrastive losses. But labeling is expensive => the challenge of __zero-shot retrieval__: retrieving without prior training on human-labeled data

# DPR-CTL (2020)
---
A neural retrieval method trained in a self-supervised fashion: neighboring chunks of text are treated as positive examples, random chunks as negative ones. It achieves performance comparable to fine-tuned alternatives
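One plausible way to build such self-supervised training data, sketched under assumptions: the chunk list is ordered as in the source text, the next chunk is the positive, and a random non-adjacent chunk is the negative. `make_triples` is a hypothetical helper, not the method's published code.

```python
import random


def make_triples(chunks, seed=0):
    """Self-supervised (anchor, positive, negative) triples: the next chunk is
    the positive, a random non-adjacent chunk is the negative."""
    rng = random.Random(seed)
    triples = []
    for i in range(len(chunks) - 1):
        far = [j for j in range(len(chunks)) if abs(j - i) > 1]
        if not far:
            continue  # corpus too small to sample a non-neighbor
        triples.append((chunks[i], chunks[i + 1], chunks[rng.choice(far)]))
    return triples


chunks = ["c0", "c1", "c2", "c3"]
for triple in make_triples(chunks):
    print(triple)
```

The triples can then feed a contrastive loss exactly as human-labeled (query, positive, negative) examples would.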



# DPR (2020)
---
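[[paper]](https://arxiv.org/pdf/2004.04906)<br>DPR = Dense Passage Retrieval, an implementation of Dense Retrieval from Facebook AI (now Meta). It uses two BERT encoders (one for queries, one for passages) and the dot product as the similarity measure. The encoders are trained with a contrastive loss: each question has one positive passage, contrasted with negatives (the positives of the other questions in the batch, plus a hard negative mined with BM25)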
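A minimal sketch of the contrastive objective with in-batch negatives, assuming the question and passage embeddings are already computed (made-up 2-d vectors here). Passage `i` is the positive for question `i`; every other passage in the batch acts as a negative, and extra hard negatives could simply be appended to the passage list.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(question_embs, passage_embs):
    """Mean negative log-likelihood of each question's positive passage,
    using a softmax over the dot products with all passages in the batch."""
    total = 0.0
    for i, q in enumerate(question_embs):
        scores = [dot(q, p) for p in passage_embs]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
        total += log_norm - scores[i]
    return total / len(question_embs)


questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # positives point the same way as their questions
swapped = [[0.0, 1.0], [1.0, 0.0]]  # positives point the wrong way
print(in_batch_nll(questions, aligned), in_batch_nll(questions, swapped))
```

Reusing the other questions' positives as negatives gives many negatives per question at no extra encoding cost.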

# RAG (2020)
---
[[paper]](https://arxiv.org/pdf/2005.11401)<br>Retrieval-Augmented Generation: the first implementation of the approach, in the paper that coined the term RAG. It combines a DPR retriever with a seq2seq generation model (BART in the original paper)

<img src="img/rag.png" width=600>

# ExaRanker (2023)
---
[[paper]](https://arxiv.org/pdf/2402.06334)<br>
Before training the ranker on a relevance dataset, use a strong LLM (such as ChatGPT) to generate a textual "explanation" for each example in this dataset ("this document is relevant because ...")

During training, the model must not only predict the correct label but also generate this explanation<br>This is meant to develop the model's reasoning about relevance and to discourage shortcut learning ("prediction hacking"). The explanation is supervised with the standard sequence-to-sequence (token-level cross-entropy) loss

<img src="img/exarank.png" width=400>
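A sketch of how one training example could be turned into a single seq2seq (source, target) pair so that one loss covers both the label and the explanation. The exact text template below is an assumption in the style of T5-based rankers, not necessarily the paper's format, and `make_example` is a hypothetical helper.

```python
def make_example(query, document, is_relevant, explanation):
    """One (source, target) pair for a seq2seq ranker trained to emit the
    relevance label followed by an LLM-written explanation."""
    source = f"Query: {query} Document: {document} Relevant:"
    target = f"{'true' if is_relevant else 'false'}. Explanation: {explanation}"
    return source, target


src, tgt = make_example(
    "who wrote pride and prejudice",
    "Pride and Prejudice is an 1813 novel by Jane Austen.",
    True,
    "The document names Jane Austen as the author of the novel.",
)
print(src)
print(tgt)
```

Because the label comes first, at inference one would typically score documents by the probability of the label token alone, without generating the explanation.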

# CONQRR (2022)
---
[[paper]](https://arxiv.org/pdf/2112.08558)<br>
Retrieval for conversational (dialog) systems is more challenging because the information the query needs may be spread across the previous conversation ("What about his birthplace?"). CONQRR rewrites the conversational query into a standalone query that contains all the necessary information, so that it can be processed effectively. It is retriever-agnostic and relies only on query rewriting

<img src="img/conqrr.png" width=400>

# Contextual Clues Sampling (2022)
---
[[paper]](https://arxiv.org/pdf/2210.07093)<br>
The authors suggest using a large language model to enhance the original query by generating a list of related terms, which they call "contextual clues"<br>Open question: what exact prompt do they use? Something like "Generate related terms for this query"?

The retriever runs multiple searches, one per augmented query, and the resulting document lists are then fused into one large list.

Diversity is achieved by sampling multiple generations, followed by deduplication: identical or similar clues are grouped into clusters<br>Precision is achieved by ranking first the clues and then the retrieved documents according to the clues' generation probabilities. Only the top-K are used.

<img src="img/contextual_clues.png" width=400>
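A minimal sketch of the selection and fusion steps under simplifying assumptions: clues are merged only when identical after normalization (the paper also groups *similar* clues), a cluster is scored by its summed generation probability, and each fused document inherits the score of the best clue that retrieved it. The retrieval results are made up, and both the scoring and the fusion rule are hypothetical choices.

```python
from collections import defaultdict


def normalize(clue):
    return " ".join(clue.lower().split())


def select_clues(samples, top_k=2):
    """samples: (clue, generation_probability) pairs from repeated sampling.
    Clues identical after normalization form one cluster scored by the sum of
    their probabilities; the top_k clusters are kept."""
    clusters = defaultdict(float)
    for clue, prob in samples:
        clusters[normalize(clue)] += prob
    return sorted(clusters.items(), key=lambda item: item[1], reverse=True)[:top_k]


def fuse(results):
    """results: (clue_score, ranked doc ids) per clue. Each document inherits the
    score of the best clue that retrieved it; returns one merged ranking."""
    best = {}
    for clue_score, doc_ids in results:
        for doc_id in doc_ids:
            best[doc_id] = max(best.get(doc_id, 0.0), clue_score)
    return sorted(best, key=best.get, reverse=True)


samples = [("Eiffel Tower", 0.3), ("eiffel  tower", 0.2), ("Louvre", 0.4), ("Seine", 0.1)]
clues = select_clues(samples)
print(clues)
# Hypothetical retrieval results for "query + clue", one list per kept clue
results = [(clues[0][1], ["d1", "d2"]), (clues[1][1], ["d2", "d3"])]
print(fuse(results))
```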

# DeepRetrieval (2025)
---
[[paper]](https://arxiv.org/pdf/2503.00223)<br>Enhances the user query by rewriting it with a (reasoning) LLM

The LLM is trainable and is updated with RL (PPO). The reward has two parts: a format reward (whether the rewritten query follows the required output format) + a recall-based retrieval reward (how many of the relevant documents the rewritten query fetches)

Requires a dataset with relevance labels to compute the retrieval reward

<img src="img/deep_retrieval.png" width=400>
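A sketch of a two-part reward of this kind. The `<answer>` tag convention, the equal weighting of the two parts and plain recall@k are assumptions for illustration, not the paper's exact reward; `fake_search` is a hypothetical stand-in for the real search engine.

```python
import re

ANSWER = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def format_reward(completion):
    """1.0 if the rewritten query is wrapped in <answer> tags, else 0.0."""
    return 1.0 if ANSWER.search(completion) else 0.0


def recall_at_k(retrieved, relevant, k=10):
    relevant = set(relevant)
    return len(relevant & set(retrieved[:k])) / len(relevant) if relevant else 0.0


def reward(completion, search, relevant, k=10):
    """Format reward + recall of the documents retrieved by the rewritten query."""
    match = ANSWER.search(completion)
    retrieved = search(match.group(1).strip()) if match else []
    return format_reward(completion) + recall_at_k(retrieved, relevant, k)


def fake_search(query):
    # Stand-in for a real search engine (hypothetical)
    return ["d1", "d7", "d3"] if "aspirin" in query.lower() else []


relevant = ["d1", "d2", "d3", "d4"]
good = "<think>...</think><answer>aspirin AND stroke</answer>"
bad = "aspirin stroke"
print(reward(good, fake_search, relevant), reward(bad, fake_search, relevant))
```

The relevance labels (`relevant`) are exactly the pre-labeled data the reward depends on.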

# Search-R1 (2025)
---
[[paper]](https://arxiv.org/pdf/2503.09516)<br>Treats retrieval as a multi-step, iterative process: the model inserts search (API) calls during its reasoning and fetches new data.

Retrieval here is part of the generator's reasoning loop. The search itself is not trainable (it is just an API call); what is learned is the reasoning, i.e. when and what to search.

The model is updated with RL (PPO or GRPO). The reward is determined by the correctness of the final answer (exact match), so a dataset with labeled answers is required
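A sketch of an exact-match outcome reward using the common SQuAD-style answer normalization. Whether Search-R1 normalizes answers in exactly this way is an assumption.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """SQuAD-style normalization: lowercase, drop punctuation and the articles
    a/an/the, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


print(exact_match_reward("The Eiffel Tower.", ["Eiffel Tower", "La Tour Eiffel"]))  # 1.0
print(exact_match_reward("the Louvre", ["Eiffel Tower"]))                          # 0.0
```

The reward is binary and only looks at the final answer, so the intermediate search calls are credited only indirectly, through whether they led to a correct answer.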

# S3 (2025)
---
[[paper]](https://arxiv.org/abs/2505.14146)<br>
S3 = Search, select, serve. When the generator is not trainable, focus on training the retriever instead

In s3, the retrieval (search) model is trained with RL.

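The reward is the uplift over a baseline (naive RAG): how much better the frozen generator answers with the documents s3 retrieves than with those retrieved by naive RAG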




# GraphRAG (2024)
---
[[paper]](https://arxiv.org/abs/2404.16130)<br>

For "broad" queries (those that require aggregating knowledge across the corpus), regular RAG tends to give overly fragmented answers<br>Instead of an exhaustive full scan over all documents in the corpus, GraphRAG makes the knowledge hierarchical and queries it in a tree-like fashion.

Example of a broad query<br>"What are the main research themes and their interconnections in the latest COVID-19 scientific literature?"

Graph = named entities linked by their relationships. Communities = clusters of closely connected nodes. They can exist at different levels of aggregation (large communities consisting of smaller communities)

__Algorithm__
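- Graph building
    - split documents into manageable chunks
    - detect named entities and relationships
    - build a graph
    - build a hierarchy of communities
    - generate community summaries bottom-up: first for low-level communities, then for high-level ones
- Aggregate in map-reduce style: answer the query against each community summary, then combine the partial answers
- Generate the final answer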

<img src="img/graphRAG.png" width=500>
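A sketch of the query-time map-reduce over community summaries. `llm_map` and `llm_reduce` are hypothetical stand-ins for LLM calls (the map call returns a partial answer plus a helpfulness score); the keyword-counting stubs below exist only to make the sketch runnable.

```python
def global_search(query, community_summaries, llm_map, llm_reduce, top_n=3):
    """Map: answer the query against every community summary independently,
    each partial answer carrying a helpfulness score.
    Reduce: keep the most helpful partial answers and merge them."""
    partials = []
    for summary in community_summaries:
        answer, score = llm_map(query, summary)
        if score > 0:  # discard communities that contributed nothing
            partials.append((score, answer))
    partials.sort(key=lambda item: item[0], reverse=True)
    return llm_reduce(query, [answer for _, answer in partials[:top_n]])


# Hypothetical stand-ins for the two LLM calls
def fake_map(query, summary):
    score = sum(word in summary.lower() for word in query.lower().split())
    return f"Point from: {summary}", score


def fake_reduce(query, partial_answers):
    return " | ".join(partial_answers)


summaries = [
    "Community on long COVID symptoms",
    "Community on COVID vaccines and trials",
    "Community on football results",
]
final = global_search("covid vaccines", summaries, fake_map, fake_reduce)
print(final)
```

Because each map call sees only one summary, the map step parallelizes trivially; the reduce step then sees only the filtered, ranked partial answers instead of the whole corpus.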




# LightRAG (2024)
---
[[paper]](https://arxiv.org/abs/2410.05779)<br>
LightRAG = an independent, parallel implementation of a similar idea, with a focus on <u>fast</u> indexing and retrieval

LightRAG is a <u>hybrid</u> approach: it is intended to handle both specific queries (through vector retrieval) and broad queries (through graph retrieval)

Examples of specific / broad queries:<br>
“Who wrote ’Pride and Prejudice’?”<br>
“How does artificial intelligence influence modern education?”

__Algorithm__
1. Graph building
    - split documents into manageable chunks
    - encode each chunk with an embedding (for example, using Sentence-BERT)
    - extract entities and relations from each chunk using "LLM profiling"
    - build a graph
        - nodes
            - chunks
            - entities
        - edges
            - embedding proximity (chunk to chunk)
            - entity mentions (chunk to entity)
            - relationships between entities
2. Retrieval
    - find seed nodes using a) query proximity b) entity matching
    - expand with neighboring nodes
    - rerank and filter documents
3. Generate the answer

<img src="img/light_rag.png" width=1000>

The authors compare LightRAG with GraphRAG and report that LightRAG wins while being much more efficient
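The retrieval step of the algorithm above (seed nodes from query proximity and entity matching, then neighbor expansion) can be sketched on a toy graph. The graph, the chunk vectors and the one-hop expansion are illustrative assumptions, the rerank/filter step is omitted, and this is not LightRAG's actual code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def graph_retrieve(query_vec, query_entities, chunk_vecs, graph, k=1):
    """Seeds = top-k chunk nodes by embedding proximity (a) plus entity nodes
    named in the query (b); the seeds are then expanded with their neighbors."""
    by_proximity = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)[:k]
    by_entity = [e for e in query_entities if e in graph]
    seeds = set(by_proximity) | set(by_entity)
    expanded = set(seeds)
    for node in seeds:
        expanded.update(graph.get(node, []))
    return seeds, expanded


# Toy graph mixing chunk nodes and entity nodes (adjacency lists)
graph = {
    "chunk1": ["Jane Austen", "chunk2"],
    "chunk2": ["chunk1", "Elizabeth Bennet"],
    "chunk3": ["Mr Darcy"],
    "Jane Austen": ["chunk1", "Pride and Prejudice"],
    "Pride and Prejudice": ["Jane Austen"],
    "Elizabeth Bennet": ["chunk2"],
    "Mr Darcy": ["chunk3"],
}
chunk_vecs = {"chunk1": [1.0, 0.0], "chunk2": [0.6, 0.8], "chunk3": [0.0, 1.0]}
seeds, expanded = graph_retrieve([1.0, 0.1], ["Pride and Prejudice"], chunk_vecs, graph)
print(sorted(seeds), sorted(expanded))
```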


# PathRAG (2025)
---
[[paper]](https://arxiv.org/abs/2502.14902)<br>
PathRAG = graph-based RAG that, instead of retrieving all relevant communities / subgraphs, detects only the crucial relational paths in the graph and turns them into a compact textual prompt

__Algorithm__
1. Graph building
    - Nodes = entity or text-chunk nodes extracted from the corpus.
    - Edges represent relations (e.g., co-occurrence or semantic links).
2. Retrieval
    - select anchor nodes (by embedding proximity or entity matching)
    - select paths connecting anchor nodes to other relevant nodes (multi-hop graph traversal).
    - rank paths by total “resource” score and prune low-value or redundant paths.
    - add other path features (e.g., length, connectivity).
    - order paths by reliability
    - format them as prompt bullets for the LLM
3. Generation
    - Feed the structured prompt into the LLM to generate a logical, coherent response using the curated paths.




<img src="img/path_rag.png" width=500>
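A simplified sketch of path retrieval and ranking between anchor nodes. The scoring is a per-path variant of the resource idea: resource starts at 1.0 on the first node, each hop passes on `alpha * resource / degree`, and reliability is the mean resource along the path. The paper's flow-based pruning propagates resource over the whole graph, so treat this scoring, the decay `alpha=0.8` and the hop limit as assumptions, not the paper's algorithm.

```python
def simple_paths(graph, start, goal, max_hops):
    """All simple paths from start to goal with at most max_hops edges (DFS)."""
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            paths.append(path)
            continue
        if len(path) - 1 == max_hops:
            continue
        for neighbor in graph.get(path[-1], []):
            if neighbor not in path:
                stack.append(path + [neighbor])
    return paths


def reliability(graph, path, alpha=0.8):
    """Resource starts at 1.0 on the first node; each hop passes on
    alpha * resource / degree(current node). Score = mean resource on the path."""
    resource, total = 1.0, 1.0
    for node in path[:-1]:
        resource = alpha * resource / len(graph[node])
        total += resource
    return total / len(path)


def top_paths(graph, anchors, max_hops=3, keep=2, alpha=0.8):
    """Score every path between pairs of anchor nodes and keep the best ones."""
    scored = []
    for i, a in enumerate(anchors):
        for b in anchors[i + 1:]:
            for path in simple_paths(graph, a, b, max_hops):
                scored.append((reliability(graph, path, alpha), " -> ".join(path)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:keep]


graph = {  # undirected toy graph as adjacency lists
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C"],
    "E": ["C"],
}
best = top_paths(graph, ["A", "D"])
for score, path in best:
    print(f"- {path} (reliability {score:.3f})")  # prompt-style bullets
```

Dividing by the degree penalizes paths through hub nodes, whose resource is spread over many neighbors, which is one way to prefer specific, informative connections.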