In general, the first search is optimised for recall and reranking is optimised
for precision.

The first search in general is a sparse vector search or a dense vector search
(bi-encoder). The advantage is that it is fast because documents can be
pre-indexed and stored in a DB. Cosine similarity is used to find the most
similar pre-indexed document embedding given the query embedding. Reranking is a
technique to boost the performance of the top-K documents from the first search.
For example, a cross-encoder takes both query and document into a language model
and scores their relevance jointly.
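
A minimal sketch of this two-stage pipeline (the `embed` and
`cross_encoder_score` functions are toy placeholders for a real bi-encoder and
cross-encoder):

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder for a real bi-encoder; returns one vector per text.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))

def cross_encoder_score(query: str, doc: str) -> float:
    # Placeholder for a real cross-encoder that jointly encodes (query, doc);
    # here: a trivial word-overlap score.
    return float(len(set(query.split()) & set(doc.split())))

def two_stage_search(query: str, docs: list[str], doc_embs: np.ndarray, k: int = 10) -> list[str]:
    # Stage 1 (recall): cosine similarity against pre-indexed document embeddings.
    q = embed([query])[0]
    sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
    top_k = np.argsort(-sims)[:k]
    # Stage 2 (precision): re-rank only the top-k hits with the slower cross-encoder.
    reranked = sorted(top_k, key=lambda i: cross_encoder_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked]

docs = ["sparse vector search", "dense vector search", "cosine similarity"]
print(two_stage_search("vector search", docs, embed(docs), k=2))
```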
The embedding-based approach ingests retrieved information directly into the
model, e.g. via an additional encoder and cross-attention.

### 3a: Direct injection via cross-attention

Simply inject embeddings via cross-attention or a similar mechanism.
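
A minimal sketch of such an injection layer (module names and shapes are
illustrative):

```python
import torch
import torch.nn as nn

class RetrievalCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq_len, d_model)  decoder hidden states
        # retrieved: (batch, n_chunks, d_model) encoded retrieved chunks
        injected, _ = self.attn(query=hidden, key=retrieved, value=retrieved)
        return self.norm(hidden + injected)  # residual + norm around the injection

layer = RetrievalCrossAttention(d_model=512)
out = layer(torch.randn(2, 16, 512), torch.randn(2, 4, 512))  # (2, 16, 512)
```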

### 3b: Model-decided retrieval and injection

Inject embeddings based on a more sophisticated architecture, e.g. make the
model decide to do retrieval and only then inject embeddings. This might be hard
to train, since a hard retrieve-or-skip decision is not differentiable.
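
A sketch of a soft relaxation, where a learned gate blends retrieved
information in continuously (names are illustrative):

```python
import torch
import torch.nn as nn

class RetrievalGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor, injected: torch.Tensor) -> torch.Tensor:
        # g near 0: ignore the retrieved information; g near 1: rely on it fully.
        g = torch.sigmoid(self.gate(hidden))  # (batch, seq_len, 1)
        return (1 - g) * hidden + g * injected
```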

### 3c: Jointly trained retrieval index

Train the retrieval index jointly with the injection. Possibly infeasible, as
the index needs to be rebuilt repeatedly during training.

### Relevant papers

- RETRO: [http://arxiv.org/abs/2112.04426](http://arxiv.org/abs/2112.04426)
- REALM: [https://arxiv.org/abs/2002.08909](https://arxiv.org/abs/2002.08909)
- RAG: [https://arxiv.org/abs/2005.11401](https://arxiv.org/abs/2005.11401)
- Atlas: [https://arxiv.org/abs/2208.03299](https://arxiv.org/abs/2208.03299)
- ...

## Paper summaries

### Borgeaud et al. 2021: Improving Language Models by Retrieving from Trillions of Tokens - "RETRO"

Idea: use BERT (Devlin et al. 2018) as a contextual encoder for 64-token chunks
of the training data. Then train an encoder-decoder transformer model on the
inputs together with similar (but not too similar / identical) chunks retrieved
by BERT embedding similarity - all done in a causal way (retrieve only "from the
past"). The standard cross-attention is replaced by a chunked cross-attention
optimized for batches of retrieved chunks. They pre-filter their dataset such
that data duplicates cannot easily leak information via retrieval. This was
scaled to 2T tokens and a 7.5B-parameter model exceeding GPT-3 performance.
RETROfitting a pre-trained transformer also works, with a small loss in
perplexity (0.3) but much faster training (6% of the training sequences, i.e. 6M
sequences of 2,048 tokens). This is not fine-tuning but training only the
cross-attention while keeping the pre-trained weights fixed. Larger models
benefit from more nearest neighbors, e.g. the 7.5B model can utilize 40
nearest-neighbor chunks, a 172M model only 10.
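
A sketch of the causality constraint on retrieval (illustrative code; assumes
pre-computed, frozen BERT embeddings per 64-token chunk):

```python
import numpy as np

CHUNK_SIZE = 64  # tokens per chunk, as in RETRO

def causal_neighbors(chunk_embs: np.ndarray, i: int, k: int) -> list[int]:
    # chunk_embs: (n_chunks, dim) frozen BERT embeddings of the 64-token chunks.
    # For chunk i, retrieve the k most similar chunks strictly from the past
    # (index < i), so no information leaks from the future.
    if i == 0:
        return []
    sims = chunk_embs[:i] @ chunk_embs[i]
    return list(np.argsort(-sims)[:k])
```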

[http://arxiv.org/abs/2112.04426](http://arxiv.org/abs/2112.04426)

### Izacard et al. 2022: Unsupervised Dense Information Retrieval with Contrastive Learning - "Contriever"

They present Contriever, an open-source implementation of their approach to
information retrieval with neural networks, which outperforms traditional
methods and can be applied to a wide range of retrieval settings. The main idea
behind Contriever is to use contrastive learning to train dense retrievers for
information retrieval. Their key contribution is showing that this approach
leads to strong performance in various retrieval settings, including
cross-lingual retrieval, and outperforms traditional unsupervised term-frequency
methods such as BM25.
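
For illustration, a simplified contrastive objective with in-batch negatives
(Contriever's actual training adds further techniques, such as a momentum queue
of negatives; this is only a sketch):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, d_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # q, d_pos: (batch, dim) query and positive-document embeddings.
    # Row i of the similarity matrix scores query i against every document
    # in the batch; the matching (positive) document sits on the diagonal.
    logits = q @ d_pos.T / tau
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)
```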

Specifically, on the BEIR benchmark their unsupervised model outperforms BM25 on
11 out of 15 datasets in terms of Recall@100. When used as pre-training before
fine-tuning, either on a few thousand in-domain examples or on the large MS
MARCO dataset, their contrastive model leads to improvements on the BEIR
benchmark.

The pre-trained model and source code are available on Hugging Face and GitHub.

### Schick et al. 2023: Toolformer: Language Models Can Teach Themselves to Use Tools

They use in-context learning with GPT-3 and a few handcrafted samples to
annotate a language-modeling dataset with potential uses of external tools - QA,
Wikipedia search, a calculator, machine translation and a calendar - via text
tags for those tools and the respective tool queries. They then use this data to
fine-tune GPT-2/GPT-J models, implement the corresponding tools, and train with
up to 25k examples per API at a maximum sequence length of 1,024. When using
tools, they outperform other language models by a large margin, and they are
comparable to larger models when only fine-tuned on the tool-annotated dataset.

[http://arxiv.org/abs/2302.04761](http://arxiv.org/abs/2302.04761)
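
For illustration, a toy version of such inline tool tags and their execution
(the tag syntax is simplified relative to the paper, and the calculator is a
stand-in):

```python
import re

# Illustrative tool registry; the paper's tools include QA, Wikipedia search,
# a calculator, machine translation and a calendar.
TOOLS = {"Calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}  # toy only

CALL = re.compile(r"\[(\w+)\(([^)]*)\)\]")  # matches tags like [Calculator(400 / 1400)]

def execute_calls(text: str) -> str:
    # Find each tool tag, run the tool, and splice its result back into the text.
    def run(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        return f"[{name}({arg}) -> {TOOLS[name](arg)}]"
    return CALL.sub(run, text)

print(execute_calls("That is [Calculator(400 / 1400)] of the total."))
# That is [Calculator(400 / 1400) -> 0.2857142857142857] of the total.
```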

### Guu et al. 2020: REALM: Retrieval-Augmented Language Model Pre-Training

They use retrieved information from a knowledge base (KB) to train a masked
language model (MLM) self-supervised and evaluate on QA tasks. A predecessor to
RETRO.

The authors of the paper structure the retriever in REALM such that the
computation performed for each document can be cached and asynchronously
updated, and selection of the best documents can be formulated as Maximum Inner
Product Search (MIPS). This allows for efficient retrieval of potentially
relevant documents from a large corpus during pre-training.

During pre-training, REALM backpropagates through the retrieval step that
considers millions of documents, but it does not backpropagate to each
individual document. Instead, it uses a single encoder to encode the subset of
retrieved samples and then backpropagates through this encoder. This approach
allows for efficient computation during pre-training while still allowing for
effective utilization of world knowledge.
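
A sketch of that retrieval step (illustrative; REALM uses an approximate MIPS
index over millions of cached document embeddings):

```python
import torch
import torch.nn.functional as F

def retrieval_distribution(q: torch.Tensor, doc_embs: torch.Tensor, k: int):
    # doc_embs are precomputed (refreshed only asynchronously), so candidate
    # selection is a Maximum Inner Product Search over the whole corpus.
    scores = doc_embs @ q                  # (n_docs,) inner products
    top_scores, top_idx = scores.topk(k)   # in practice: an approximate MIPS index
    p_z = F.softmax(top_scores, dim=0)     # p(z | x) over the retrieved documents
    # Training marginalizes over the latent document z:
    #   p(y | x) = sum_z p(y | x, z) p(z | x),
    # so gradients flow into the encoders through these softmax weights.
    return p_z, top_idx
```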

[https://arxiv.org/abs/2002.08909](https://arxiv.org/abs/2002.08909)

### Zamani et al. 2022: Retrieval-Enhanced Machine Learning

This paper introduces a new research program called Retrieval-Enhanced Machine
Learning (REML), which combines information retrieval techniques with machine
learning to improve model accuracy and interpretability. The authors describe
the core principles of indexing, representation, retrieval, and ranking that
underlie REML models, and provide examples of how these models have been applied
in real-world scenarios.

The main contribution of this paper is to lay out a research agenda for REML
that includes several key challenges and opportunities for future work. These
include developing new optimization algorithms that can handle large-scale data
sets, exploring the use of deep learning architectures in conjunction with
retrieval-based methods, and investigating the impact of different retrieval
strategies on model performance.

Overall, the key idea behind REML is to leverage the strengths of both
information retrieval and machine learning to create more powerful and flexible
models that can handle complex data sets and produce more accurate results. By
combining these two fields, researchers hope to pave the way for new advances in
artificial intelligence and information access research.

[https://arxiv.org/abs/2205.01230](https://arxiv.org/abs/2205.01230)

### Thakur et al. 2021: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

The BEIR benchmarking tool is designed to provide a comprehensive evaluation of
information retrieval models across diverse tasks and domains. It includes 18
retrieval datasets for comparison and evaluation of model generalization,
spanning nine different retrieval tasks such as fact checking, citation
prediction, duplicate question retrieval, argument retrieval, news retrieval,
question answering, tweet retrieval, bio-medical IR, and entity retrieval. The
selection methodology is motivated by the need for diverse tasks and domains to
evaluate the zero-shot capabilities of retrieval systems. The tool is
open-sourced with a standardized data format and easy-to-adapt code examples for
many different retrieval strategies.

They compare neural retrieval to legacy systems like BM25 and show that BM25 is
still a very strong baseline. The best model is a BM25-based search with
additional re-ranking by a neural classifier.

Observations:

1. "In-domain performance is not a good indicator for out-of-domain
generalization"
2. "Term-weighting fails, document expansion captures out-of-domain keyword
vocabulary"
3. "Dense retrieval models with issues for out-of-distribution data"
4. "Re-ranking and Late-Interaction models generalize well to
out-of-distribution data"
5. "Strong training losses for dense retrieval leads to better
out-of-distribution performances"
6. "TAS-B model prefers to retrieve documents with shorter lengths"

Conclusion: maybe do not focus only on a vector-based index; use a standard
index as the base + neural re-ranking, as sketched below.
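
For reference, a minimal sketch of BM25 scoring, i.e. the kind of standard index
such a pipeline builds on (simplified; real systems use tuned Lucene-style
implementations):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], corpus: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    # Score every (tokenized) document in the corpus against a tokenized query.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter(t for d in corpus for t in set(d))  # document frequencies
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```

The top-scoring hits would then go to a neural re-ranker, as in the best BEIR
system above.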

[https://arxiv.org/pdf/2104.08663.pdf](https://arxiv.org/pdf/2104.08663.pdf)

## Other interesting papers

- Nakano et al.: WebGPT (predecessor to ChatGPT) - fine-tune GPT-3 to search the
  web for QA tasks
  [https://arxiv.org/pdf/2112.09332.pdf](https://arxiv.org/pdf/2112.09332.pdf)

- Schick et al.: PEER: A Collaborative Language Model
  [https://arxiv.org/pdf/2208.11663.pdf](https://arxiv.org/pdf/2208.11663.pdf)

- Goyal et al. 2022: Retrieval-Augmented Reinforcement Learning

- Humphreys et al. 2022: Large-Scale Retrieval for Reinforcement Learning
