# Text Retrieval
---
Information Retrieval (of Text documents) is often associated with three tasks: 
- Search
- Recommendations
- Generation

Search and Recommendations usually do not require any substantial post-processing of retrieved documents. Generation does require it (summarize, find answers etc), but in most cases it still relies on the original documents => standard retrieval can be used

# Retrieval for Generation
---

The first LLMs relied entirely on their own weights. Problem: the knowledge of the first ChatGPT was 2 years behind the actual time. To overcome this limitation you had to continuously retrain the whole model, which is infeasible.

Then came RAG = Retrieval Augmented Generation. Idea: relevant fresh data is attached directly to the LLM context on the fly (in real time)

Classic RAG system:
1. Retriever component fetches candidate documents (focus on recall)<br>
   - using query matching
   - using embedding matching
2. LLM uses them as a context

<img src="img/rag1.png" width=500>

An important problem, not only in the domain of LLM generation but in IR in general = the original query might not be sufficient for accurate retrieval

# User query misspecification
---
One of the main challenges in Information Retrieval (known since before the 1980s) is that users often cannot articulate their needs precisely because they don’t fully understand the problem. IR practitioners of that time ([paper](https://arxiv.org/pdf/2503.00223)) formulated the following target: a good IR system must be context-dependent, personalized, and able to maintain a dialogue with the user

Examples of poor specification:

- Ambiguous Queries / Polysemy<br>
   ```jaguar, apple history```<br><br>
- Synonymy / Different Vocabulary<br>
   ```heart attack symptoms```<br><br>
- Contextual or Temporal Mismatch<br>
   ```president bush```<br><br>
- Mismatch Due to Domain Jargon<br>
   ```How do I fix my computer’s blue screen?```<br><br>
- Overspecification (too narrow queries)<br>
   ```best restaurants for vegan ramen in San Francisco open late```<br><br>
- Underspecification (too broad queries)<br>
   ```python tutorial```

Main takeaway = Retrieval requires some kind of query preprocessing / adjustment. Classic sparse retrievers like TF-IDF / BM25 might not be enough

# Sparse Retrieval
---
"Sparse Retrieval" is an old approach where documents are represented as vocabulary-sized vectors and the process looks up an inverted index using the terms from the query. The resulting posting lists are then merged, filtered and passed on for further ranking

Matching algorithms
- TF-IDF
- BM25
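
The lookup-and-score flow can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization, a toy in-memory corpus and the commonly used defaults `k1=1.5`, `b=0.75`; the function names are made up for this example and do not come from any library.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    doc_len = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_len

def bm25_search(query, index, doc_len, k1=1.5, b=0.75):
    """Score only the documents found in the posting lists of the query terms."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed IDF variant that never goes negative
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": "the jaguar is a large cat",
    "d2": "jaguar cars are built in england",
    "d3": "python tutorial for beginners",
}
index, doc_len = build_index(docs)
results = bm25_search("jaguar cat", index, doc_len)
```

Only `d1` and `d2` are ever touched, because `d3` appears in none of the posting lists of the query terms. This is what makes inverted-index retrieval cheap.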

# Dense Retrieval
---
Dense Retrievers are a family of algorithms where queries and documents are mapped into the same latent vector space by some sort of Encoder. Query/document proximity is then calculated by cosine / dot-product similarity

Dense Extraction can be implemented by either performing a "full-scan" (suitable for small indices) or "approximate fetch" (suitable for large indices but requires additional filtering of the result)
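
A minimal sketch of the "full-scan" variant, assuming the embeddings are already computed (the three-dimensional vectors here are made-up toy values, not the output of a real encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def full_scan(query_vec, doc_vecs, top_k=2):
    """Compare the query with every document vector and keep the top-k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (assumed values for illustration only)
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 1.0, 0.2],
    "d3": [0.7, 0.3, 0.1],
}
top = full_scan([1.0, 0.0, 0.0], doc_vecs)
```

The cost grows linearly with the number of documents, which is why large indices switch to approximate search.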

# Hybrid Retrieval
---
Hybrid Retrievers use dense encoding as an intermediary representation, but still use sparse retrieval from an inverted index. They tend to take the best of both worlds - semantic richness of Dense representation and efficiency of Sparse vectors. Often referred to as neural sparse retrieval.

The output is a sparse vocabulary-sized vector of tokens that constitute the document/query, but instead of deterministic TF weights we use more expressive, context-dependent weights

Methods include:
- DeepImpact
- DeepCT
- COIL
- uniCOIL
- TILDE
- SPLADE / SPLADE v2

# Approximate Retrieval
---
Retrieving documents from a large database, whether using sparse or dense keys, naively requires a full scan of this database, which is <u>unsatisfactory</u> for most production-level systems

This means we need to find a way to accelerate retrieval. Putting hardware acceleration (sharding, caching etc) aside, the best option is to store documents in a manner optimized for fast retrieval using proximity-based queries

Most dense retrievers are approximate (at least at the candidate generation phase). Approximate nearest neighbour algorithms and libraries worth mentioning include:
- k-D trees<br>
- FAISS (Meta)<br>
- Annoy (Spotify)<br>we build k random-projection trees, navigate to the analyzed observation in $O(\log n)$ and collect neighbours from the neighbouring nodes
- ScaNN (Google)<br>
- HNSW (Malkov & Yashunin)<br>build a proximity graph, aggregate it several times to make it hierarchical and navigate it to find neighbouring observations

See the notebook for more detailed explanation

# Two-stage Retrieval
---
Most modern systems implement two-stage retrieval process:
1. fast but unfiltered candidate generation<br>- sparse matching: tf-idf / bm25 weighted matching through inverted index<br>- approximate dense matching (for moderate size systems): approximate algorithms like FAISS, Annoy, HNSW
2. slow but precise filtering & reranking


# Retriever as a Model
---
Retrievers can be trainable

Methods:
- REALM
- ColBERT
- DPR
- RAG
- DPR-CTL
- S3

# Rocchio (1971)
---
The Rocchio algorithm defined the first personalized version of document Retrieval. What it does - it shifts the query vector (in the shared document-query vector space) towards documents previously evaluated positively by this user
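
A minimal sketch of the update, assuming the commonly cited form $q' = \alpha q + \beta \cdot \text{mean}(D^+) - \gamma \cdot \text{mean}(D^-)$. The negative-feedback term and the weights `alpha`, `beta`, `gamma` are assumptions of this example, not something stated above.

```python
def centroid(vectors, dim):
    """Mean vector; a zero vector if there is no feedback of that kind."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards liked documents and away from disliked ones."""
    dim = len(query)
    pos = centroid(relevant, dim)
    neg = centroid(non_relevant, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

query = [1.0, 0.0]
relevant = [[0.0, 1.0], [0.0, 3.0]]   # documents the user rated positively
updated = rocchio(query, relevant, non_relevant=[])
```

The updated query keeps its original component and gains weight on the second dimension, where the positively rated documents live.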

# Retrieval Augmentation
---
- Query Augmentation = rewriting OR augmentation of the user query to reflect all sides of the user's intent
- Document Augmentation = enhancing documents stored in the index

Algorithms include:
- Doc2Query
- ExaRanker
- DeepRetrieval
- Search-R1
- CONQRR

## Zero-shot retrieval
Out-of-the-box pretrained Encoders do not know how to rank by relevance. They need to be fine-tuned on some relevance dataset, usually with contrastive losses. But labeling is expensive => there is a challenge of __zero-shot retrieval__ - fetching without previous training on human-labeled data

# PRF (2021)
---
[[paper]](https://arxiv.org/abs/2108.11044)<br>PRF = Pseudo-Relevance Feedback. It is an algorithm that enhances the original query with related / synonymous terms. Synonymous terms = the most frequent terms that appear in the list of documents returned by the first retrieval
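
A minimal sketch of the term-frequency flavour of this idea, assuming whitespace tokenization and a first-pass ranking that is already available. The helper name `expand_query` and its parameters are illustrative; the linked paper applies feedback to dense representations rather than raw terms.

```python
from collections import Counter

def expand_query(query, ranked_docs, top_docs=2, n_terms=3, stopwords=frozenset()):
    """Append the most frequent new terms from the top-ranked documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in ranked_docs[:top_docs]:
        for token in text.lower().split():
            if token not in query_terms and token not in stopwords:
                counts[token] += 1
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return query_terms + expansion

# Results of a first retrieval pass (toy data)
first_pass = [
    "heart attack symptoms chest pain",
    "myocardial infarction symptoms chest pain",
    "python tutorial for beginners",
]
expanded = expand_query("heart attack", first_pass)
```

The expanded query is then sent to the retriever a second time. No user labels are needed, hence "pseudo" relevance.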

<img src="img/prf.png" width=400>

# REALM (2020)
---
[[paper]](https://arxiv.org/abs/2002.08909)<br>
Retrieval‑Augmented Language Model Pre‑Training = one of the first implementations of Dense Retrieval for Generation where the Retriever model is trainable and is trained jointly with the Generator LLM

Retriever is an early-linkage two-tower model

Model architecture:
- consists of "Retriever" and "Reader"
- Retriever is a two-tower network with the same BERT model and [CLS] output
- embeddings are combined with a dot-product
- for all retrieved candidates apply Reader model to generate answer
- average answers over the retrieved candidates

Training procedure
1. pretraining: a text corpus is masked (MLM) => we get a labeled QA dataset for self-supervised training
2. fine-tuning: trained on QA dataset

<img src="img/realm.png" width=750>

# ColBERT (2020)
---
[[paper]](https://arxiv.org/pdf/2004.12832)<br>
ColBERT is a tower network of two Encoders (BERT) - one for query encoding and one for document encoding. Each tower outputs token embeddings, which in turn are combined using MaxSim pooling to get a scalar "relevance" score

Model architecture:
- each tower uses the same BERT model with full output
- a projection linear layer is additionally attached to reduce the dimensionality of the outputs (768 -> 128)
- we compare sets of output embeddings in a "cross-attentional" manner
- MaxSim (Maximum Similarity) = each query embedding is multiplied by the document embedding that gives the maximum product
- all individual scores are summed into a global "relevance" score

Training process:
- use contrastive learning - query is compared to one positive and one negative example
- ranking loss evaluates how far apart these two scores are
- all weights of the model are fine-tuned (embedding layer, BERT layers, linear projection)

Inference:
- for each document $D$ from a database compute the "relevance" score $s(Q,D)$
- select top-K documents with highest score

Acceleration (optional):
- precompute outputs for all documents
- store those outputs in an index

<img src="img/colbert.png" width=750>

## MaxSim
MaxSim is a "cross-attention-like" method of aggregating scores

Suppose the query encoder returned $Q = [q_1, q_2, ..., q_m]$ and the document encoder returned $D = [d_1, d_2, ..., d_n]$

Then for each query token we take its best-matching document token and sum these maxima over the query: $\text{score}(Q, D) = \sum_{i=1}^{m} \max_{j=1}^{n} \langle q_i, d_j \rangle$
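
The formula translates almost directly into code. A minimal sketch, assuming token embeddings are plain lists of floats (toy two-dimensional values here, not real BERT outputs):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(Q, D):
    """Sum over query tokens of the best dot-product with any document token."""
    return sum(max(dot(q, d) for d in D) for q in Q)

Q = [[1.0, 0.0], [0.0, 1.0]]              # m = 2 query token embeddings
D = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]  # n = 3 document token embeddings
score = maxsim(Q, D)
```

Here the first query token matches `D[1]` best (product 1.0) and the second matches `D[0]` best (product 0.5), so the total score is 1.5.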



# DPR (2020)
---
[[paper]](https://arxiv.org/pdf/2004.04906)<br>DPR = Dense Passage Retrieval is an implementation of Dense Retrieval offered by Meta. It uses two BERT models to encode queries and documents and a dot-product as the similarity measure. Encoders are trained with a contrastive loss (one positive and one negative example)

Architecture:
- two-tower model with two <u>different</u> BERT encoders (called Question encoder and Passage encoder) with a single [CLS] output
- relevance score is a dot-product of towers output

Training process:
- use contrastive learning: we compare a query $Q$ with one positive $D^+$ and one negative example $D^-$
- compute predicted "relevance" score for all documents
- normalize output scores to get a probability<br>probability of event "$D^+$ is relevant to $Q$"
- use negative loglikelihood as a loss function<br>[why not triplet loss?]
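
A minimal sketch of this loss, assuming the embeddings are already produced by the two encoders (toy vectors here). The function accepts a list of negatives so that the one-negative case described above is just a special case.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dpr_nll(q, d_pos, d_negs):
    """Negative log-likelihood of the positive passage under a softmax over scores."""
    scores = [dot(q, d_pos)] + [dot(q, d) for d in d_negs]
    # log-sum-exp with max subtraction for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

q = [1.0, 0.0]
d_pos = [2.0, 0.0]   # relevant passage embedding
d_neg = [0.0, 1.0]   # irrelevant passage embedding
loss = dpr_nll(q, d_pos, [d_neg])
```

The loss shrinks as the positive score pulls ahead of the negative one, and it grows if the two passages are swapped.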

Index construction:
- apply trained Passage Encoder to all documents (passages)
- store output embeddings in some ANN index

Inference
- encode the query with the Question encoder to get its output embedding
- retrieve top-K documents
- optionally send them to further processing like a "Reader" or "Reranker" model



# DPR-CTL (2020)
---
A neural retrieval method that is trained in a self-supervised fashion - neighboring chunks of text are considered positive examples, random chunks - negative ones. It achieves performance comparable to fine-tuned alternatives






# RAG (2020)
---
[[paper]](https://arxiv.org/pdf/2005.11401)<br>Retrieval Augmented Generation = the first implementation of the RAG approach, from the paper where the term was coined. It uses the DPR model for retrieval together with some generation model (BART in the original paper)

<img src="img/rag.png" width=600>

# ExaRanker (2023)
---
[[paper]](https://arxiv.org/pdf/2402.06334)<br>
Before training the Retriever on some relevance dataset let's use a strong LLM (like ChatGPT) to generate a textual "explanation" for each example in this dataset ("this document is relevant because ...")

During training phase make the Retriever model not only predict the correct label, but also reconstruct this explanation<br>This develops model's reasoning ability about relevance, avoids prediction hacking. Distance to ground-truth explanation is measured using standard text (sequence-to-sequence) loss

<img src="img/exarank.png" width=400>

# CONQRR (2022)
---
[[paper]](https://arxiv.org/pdf/2112.08558)<br>
Retrieval for conversational (dialog) systems is more challenging since the query might be distributed along the previous conversation ("What about his birthplace?"). CONQRR summarizes all necessary information for the query to be effectively processed. It is retriever agnostic and relies only on query rewriting
<img src="img/conqrr.png" width=400>

# Contextual Clues Sampling (2022)
---
[[paper]](https://arxiv.org/pdf/2210.07093)<br>
The authors suggest using some strong LLM (ChatGPT) to enhance the original query by generating a list of related terms - they call them "contextual clues"<br>Not sure what exact prompt they use: "Model, generate me related terms"?

The Retriever model runs multiple fetches - one for each enhancement - and extracts lists of documents which are then fused into one large list.

Diversity is achieved by multiple generations. They are followed by deduplication - identical or similar "clues" are grouped into clusters<br>Precision is achieved by first ranking the clues and then the retrieved documents according to their generation probabilities. Only top-K are used.

<img src="img/contextual_clues.png" width=400>

# DeepRetrieval (2025)
---
[[paper]](https://arxiv.org/pdf/2503.00223)<br>Enhances user query by rewriting it with a (reasoning) LLM model

The LLM is trainable and is updated using RL (PPO). The reward consists of two pieces: query consistency (how well the new query is formatted) + recall-based reward (how relevant the fetched documents are)

Requires some kind of pre-labeled dataset to be able to evaluate the relevance reward
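
A rough sketch of how such a two-part reward could be computed from a labeled example. Everything specific here is an assumption made for illustration: the non-empty check standing in for "formatting", the `format_bonus` weight and the cutoff `k` are placeholders, not the values or rules used in the paper.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the labeled relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reward(rewritten_query, retrieved, relevant, k=3, format_bonus=0.1):
    # Hypothetical format check: the rewritten query must simply be non-empty
    format_reward = format_bonus if rewritten_query.strip() else 0.0
    return format_reward + recall_at_k(retrieved, relevant, k)

# Labeled example: documents d1 and d2 are known to be relevant
r = reward(
    "vegan ramen San Francisco late night",
    retrieved=["d3", "d1", "d7", "d2"],
    relevant={"d1", "d2"},
)
```

Only `d1` lands in the top 3, so recall is 0.5 and the total reward is 0.6. The need for the `relevant` set is exactly why a pre-labeled dataset is required.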

<img src="img/deep_retrieval.png" width=400>

# Search-R1 (2025)
---
[[paper]](https://arxiv.org/pdf/2503.09516)<br>Treats retrieval as a multi-step iterative enhancement process. Model inserts API calls during reasoning and fetches new data. 

Retrieval here is a part of generator model. The fetch itself is not trainable, just an API call. Reasoning can be adjusted. 

The model is updated using PPO/GRPO. Reward is determined by the correctness of the final answer (ExactMatch). Requires some pre-labeled dataset

# S3
---
[[paper]](https://arxiv.org/abs/2505.14146)<br>
S3 = Search, select, serve. When Generator is not trainable, focus on fine-tuning the Retriever model

In S3 we train the Retriever model using RL.

The reward is the uplift compared to some baseline (RAG)




# GraphRAG (2024)
---
[[paper]](https://arxiv.org/abs/2404.16130)<br>

In case of "broad" queries (that require some aggregation of knowledge) regular RAG tends to give too fragmented answers<br>Instead of doing an exhaustive full-scan over all documents in a corpus, let's make the knowledge hierarchical and query it in a tree-like fashion.

Example of a broad query<br>"What are the main research themes and their interconnections in the latest COVID-19 scientific literature?"

Graph = named entities linked by their relationships. Communities = clusters of similar nodes. They might have different levels of aggregation (large communities consisting of smaller communities)

__Algorithm__
- Graph building
    - split documents into manageable chunks
    - detect Named Entities and Relationships
    - build a graph
    - create a hierarchy
    - generate a summarization - first on low level, then on high level
- Aggregate in map-reduce style
- Generate answer

<img src="img/graphRAG.png" width=500>




# LightRAG (2024)
---
[[paper]](https://arxiv.org/abs/2410.05779)<br>
LightRAG = a separate parallel implementation of the similar idea, but with focus on <u>fast</u> indexing and retrieval

LightRAG is a <u>Hybrid</u> approach - it is intended to work with both specific queries (through vector retrieval) and broad queries (through graph retrieval)

Examples of specific / broad queries:<br>
“Who wrote ’Pride and Prejudice’?”<br>
“How does artificial intelligence influence modern education?”

__Algorithm__
1. Graph building
    - split documents into manageable chunks
    - encode each chunk with an embedding (for example using Sentence-BERT)
    - extract Entities and Relations from each chunk using "LLM Profiling"
    - build a graph 
        - nodes 
            - chunks 
            - entities 
        - edges 
            - embedding proximity 
            - having entities 
            - relationship between entities
2. Retrieval
    - find nodes using a) query proximity b) entity matching
    - expand using neighbors + 
    - rerank and filter documents
3. Generate answer

<img src="img/light_rag.png" width=1000>

They compare their performance with GraphRAG and declare LightRAG winning while being way more efficient


# PathRAG (2025)
---
[[paper]](https://arxiv.org/abs/2502.14902)<br>
PathRAG = Graph based RAG but instead of retrieving all relevant communities / subgraphs detect only crucial dependency paths in these graphs and rewrite by summarizing them

__Algorithm__
1. Graph building
    - Nodes = entity or text-chunk nodes extracted from the corpus.
    - Edges represent relations (e.g., co-occurrence or semantic links).
2. Retrieval
    - select anchor nodes (by embedding proximity or entity matching)
    - select paths connecting anchor nodes to other relevant nodes (multi-hop graph traversal).
    - rank paths by total “resource” score and prune low-value or redundant paths.
    - add other path features (e.g., length, connectivity).
    - order paths by reliability
    - format them as prompt bullets for the LLM
3. Generation
    - Feed the structured prompt into the LLM to generate a logical, coherent response using the curated paths.




<img src="img/path_rag.png" width=500>




# SPLADE (2021)
---
[[paper]](https://arxiv.org/pdf/2107.05720)<br>
SPLADE = Sparse Lexical and Expansion Model for Information Retrieval

SPLADE is a two-tower Encoder (BERT) model, used to rank documents by relevance with query augmentation 

Each tower has an additional projection layer at the output - it maps each output embedding to a distribution over vocabulary tokens.
The purpose of this distribution is to encode tokens "related" to each input token. It works as query enhancement<br>
$z_i = W \cdot h_i + b, \quad z_i \in \mathbb{R}^{|V|}$

<img src="img/splade2.png" width=750>

Document-level aggregation of embeddings is done by MaxPooling of token scores - the maximal seen signal for each token goes to the output<br>
$z = \max_{i=1..n}(z_i), \quad z \in \mathbb{R}^{|V|}$

Sparsification is enforced by Lasso ($L_1$) regularization + a smoothed (log-saturated) activation<br>
$\mathcal{L}_\text{total} = \mathcal{L} + \lambda \cdot (\|v_q\|_1 + \|v_{d^+}\|_1 + \|v_{d^-}\|_1)$

The smoothed activation clamps negative logits to exactly zero and dampens large values, which gives less fluctuation around zero<br>
$v_x = \log(1 + \text{ReLU}(z)), \quad v_x \in \mathbb{R}^{|V|}$

At inference time we compute the sparsified distribution for the query and fetch candidate documents from the inverted index

<img src="img/splade.png" width=500>

Model is trained on a labeled relevance dataset (e.g. MS MARCO) using a contrastive loss function with regularization<br>
$\mathcal{L} = \max(0, \text{margin} - S(q, d^+) + S(q, d^-))$

Where model output is predicted relevance score:<br>
$S(q, d) = \langle v_q, v_d \rangle$
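
The pooling, activation and scoring formulas above can be traced on a toy example. A minimal sketch, assuming the per-token logits $z_i$ are already given (a made-up 4-word vocabulary and made-up values here, not the output of a real BERT head):

```python
import math

def splade_vector(token_logits):
    """token_logits: one row of |V| logits z_i per input token."""
    vocab_size = len(token_logits[0])
    # z = max_i z_i : max pooling over input tokens, per vocabulary entry
    pooled = [max(row[j] for row in token_logits) for j in range(vocab_size)]
    # v = log(1 + ReLU(z)) : negative logits become exactly zero
    return [math.log1p(max(0.0, z)) for z in pooled]

def relevance(v_q, v_d):
    """S(q, d) = <v_q, v_d>"""
    return sum(a * b for a, b in zip(v_q, v_d))

def l1_norm(v):
    """Regularization term that pushes weights towards zero."""
    return sum(abs(x) for x in v)

vocab = ["heart", "attack", "cardiac", "python"]
# Logits for the two tokens of the query "heart attack" (assumed values)
query_logits = [[2.0, 0.3, 0.0, -3.0], [0.5, 1.5, 1.0, -2.0]]
v_q = splade_vector(query_logits)
active_terms = [term for term, w in zip(vocab, v_q) if w > 0]
```

In this toy case `cardiac` receives a non-zero weight although it is not in the query (the expansion effect), while `python` is zeroed out and never touches the inverted index.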




# SPLADE++ (2021)
---
[[paper]](https://arxiv.org/abs/2109.10086)<br>
Same idea as SPLADE, but with several advancements in comparison to v1:
1. they propose more aggregation strategies (max, sum, avg, weighted)
2. the two-tower model is additionally trained by distillation from a cross-encoder
3. they added weighting in regularization based on token frequency in batch




# COIL (2021)
---
[[paper]](https://arxiv.org/pdf/2104.07186)<br>
COIL = Contextualized Inverted List Retrieval<br>It is an example of Hybrid retrieval: sparse, but with neural enhancements<br>

__Idea:__ we rely on classic sparse matching, but instead of computing standard (non-contextual) TFs, we use <u>multidimensional embeddings</u> 

Inverted index still stores documents with non-zero occurrences => the fetch process remains the same. But document ranking is different - it is done by summing dot-products of query token embeddings and the matching document token embeddings

Interpretation:<br>
Per-document token embeddings reflect their importance to this particular document. Query token embeddings reflect their importance to the query being processed. Their combination models query-document relevance in terms of this token. Total relevance is modeled as a sum of all token matches
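
A minimal sketch of this scoring, assuming token embeddings are given as toy two-dimensional lists. When a query token occurs several times in a document, this sketch keeps the best match (a max over occurrences); the dense [CLS] component of the full model is left out.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coil_score(query_tokens, doc_tokens):
    """Both arguments are lists of (token, contextual embedding) pairs."""
    total = 0.0
    for q_tok, q_vec in query_tokens:
        # Only exact lexical matches are compared, as in an inverted index
        matches = [dot(q_vec, d_vec) for d_tok, d_vec in doc_tokens if d_tok == q_tok]
        if matches:
            total += max(matches)
    return total

query = [("bank", [1.0, 0.0]), ("river", [0.0, 1.0])]
doc = [
    ("bank", [0.2, 0.9]),   # "bank" in one context
    ("bank", [0.8, 0.1]),   # "bank" in another context
    ("loan", [1.0, 1.0]),
]
score = coil_score(query, doc)
```

The token `river` has no lexical match and contributes nothing; `bank` contributes the product with its closest contextual occurrence.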

<img src="img/COIL.png" width=750>

# uniCOIL (2022)
---
[[paper]](https://arxiv.org/pdf/2106.14807)<br>
uniCOIL is a newer and simplified version of COIL - it drops complexity to make retrieval faster<br>

Two main simplifications:
1) we compute a scalar value as a score, instead of computing a multidimensional embedding as COIL does<br>it is done by appending a single MLP head to the Encoder
2) unlike COIL, we do not compute scores for the query, only for documents - for queries we stay with one-hot encodings (constant values)

Apart from this, the methodology is the same = we fetch posting lists from the inverted index and compute per-token dot-products to rerank

Algorithm:
For each token $t$ in a document $D$ we predict a scalar score $s(D,t)$ using a BERT model with one scalar head (token importance score). All documents with non-zero scores are appended to the postings list of that token. For query encoding at inference time we use standard one-hot encoding. Alternatively, we can also apply BERT with the scalar head. Both variants are applicable: the first is faster, the second is (a little bit) more accurate. The relevance score is a dot-product

The second simplification is optional - we can also compute context-dependent scores for the query:
<img src="img/uniCOIL.png" width=750>


# DeepImpact
---
[[paper]](https://arxiv.org/pdf/2104.12016)<br>
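DeepImpact predicts a scalar "impact score" for each document token and stores these scores in an inverted index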

Model architecture:
- encode input sequence with BERT model
- additional MLP layer that outputs a scalar "impact score"<br>interpretation = how important this token is to the input

Training process:
- contrastive loss is used: query and 2 candidate documents
- we compute two scores for positive and negative documents
- aggregate scores to get 2 predicted "relevances"

Index construction:
- iterate over all documents from a database
- append document tokens to an inverted index with their scores

Inference:
- get the input
- fetch the posting lists from the inverted index
- for each document from a posting list compute the relevance by summing pre-stored scores
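
The index construction and inference steps can be sketched as follows. This is a minimal illustration, assuming the impact scores have already been predicted by the model (the numbers in `impact_index` are made up) and that queries are tokenized by whitespace.

```python
from collections import defaultdict

# Inverted index built offline: term -> {doc_id: pre-computed impact score}
impact_index = {
    "blue": {"d1": 1.2, "d2": 0.3},
    "screen": {"d1": 2.0, "d3": 0.9},
    "fix": {"d2": 1.1},
}

def impact_search(query, index, top_k=2):
    """Sum the stored impacts of the query terms for every matching document."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        for doc_id, impact in index.get(term, {}).items():
            scores[doc_id] += impact
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

hits = impact_search("fix blue screen", impact_index)
```

No neural model runs at query time in this sketch: inference is a handful of dictionary lookups and additions, which is the main appeal of this family of methods.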

<img src="img/DEEPIMPACT.png" width=750>


# DeepCT
---
[[paper]](https://arxiv.org/pdf/1910.10687)<br>
DeepCT = Deep Contextualized Term Weighting

<img src="img/DEEPCT.png" width=500>

Model architecture:
- encode input sequence with BERT model
- additional MLP layer that outputs a scalar "impact score"<br>interpretation = how important this token is to the input
- enforce sparsity 

Training process:
- done in a self-supervised fashion: for each document we make the model predict the occurrence of a token in the input<br>interpretation = likelihood of the token in that context
- logistic loss is used

Index construction:
- iterate over all documents from a database
- append document tokens to an inverted index with their scores

Inference:
- get the input
- fetch the posting lists from the inverted index
- for each document from a posting list compute the relevance by summing pre-stored scores


# TILDE
---
[[paper]](https://arxiv.org/pdf/2108.08513)<br>
TILDE =  Term Independent Likelihood moDEl for Passage Re‑ranking

Model architecture:
- encode input sequence with BERT model, but use only [CLS] output
- a small MLP decoder maps this embedding back to token space, returning the probabilities vector $[P_i]$<br>here P = probability that token x occurs in X
- enforce sparsity either by applying threshold OR top-k selection
- output = one vector of token probabilities

<img src="img/TILDE.png" width=750>

Training process:
- done in self-supervised fashion: for each document MLP should predict whether each token occurs in the input
- logistic loss is used

Index construction:
- iterate over all documents from a database
- append selected tokens to an inverted index with their probabilities

Inference:
- get the input
- fetch the posting lists from the inverted index
- for each document from a posting list compute the relevance by summing pre-stored probabilities

Model architecture is very similar to Autoencoders: Encoder learns to extract semantics, Decoder learns to extract tokens

Advancements in TILDEv2:
- instead of binary occurrence (0,1) the Encoder is trained on TF-IDF
- model is distilled from more powerful models like SPLADE<br>SPLADE is more powerful because it is token-level, not document-level like TILDE






# Doc2Query (2019)
---
[[paper]](https://arxiv.org/pdf/1904.08375)<br>
__Idea:__ instead of enhancing user queries let's __enhance documents__ by generating potential queries and appending them to document text

Training process: Take some sequence-to-sequence model like T5 and train it on a labeled dataset (MS MARCO), but in reversed fashion (prompt = document, answer = question)

Such document enhancement will increase the recall of the retriever

<img src="img/doc2query.png" width=500>











