# What is DPR?

We've implemented a Dense Passage Retriever (DPR) as our retriever in the last few sections, but we haven't really discussed what it actually is - other than a *retriever*.

Our DPR model is doing more than just performing a similarity search. Before even filtering through FAISS for relevant contexts, DPR is converting the question you have provided into an *approximate* context.

DPR does not save data like FAISS, and so we can think of the transformation it performs as a linguistic exercise. The model takes a question and, based on its knowledge of question-context pairs, converts this into a best-guess answer approximation.

So, let's take a few steps back, and get to grips with how DPR is doing this.

We already know that in open-domain Q&A, we typically design a model architecture that contains a data source, retriever, and reader/generator.

The first of these components is our document store. Two popular choices are Elasticsearch and FAISS.

Next up is our retriever - the topic of this section. The job of the retriever is to filter through our document store for relevant chunks of information (the documents) and pass them to the reader/generator model.

The reader/generator model is the final model in our Q&A stack. We can either have a reader, which extracts an answer directly from the context, or a generator, which uses language generation to produce an answer from the context.

![The retriever-reader and retriever-generator stacks](../../assets/images/qa_retriever_reader_and_retriever_generator_stack.png)
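To make the flow between the three components concrete, here is a minimal sketch of a retriever-reader stack. Everything in it is a hypothetical stand-in: the document store is a plain list, `keyword_retriever` ranks by simple word overlap, and `first_context_reader` just returns the top context instead of extracting an answer span. None of these names come from a real library.

```python
from typing import Callable, List


def answer_question(
    question: str,
    document_store: List[str],
    retriever: Callable[[str, List[str]], List[str]],
    reader: Callable[[str, List[str]], str],
) -> str:
    """Run the stack: the retriever picks contexts, the reader answers from them."""
    contexts = retriever(question, document_store)
    return reader(question, contexts)


def keyword_retriever(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank documents by shared words with the question."""
    q_words = set(question.lower().replace("?", "").split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def first_context_reader(question: str, contexts: List[str]) -> str:
    """Toy reader (assumption): return the best context rather than an answer span."""
    return contexts[0] if contexts else ""


docs = [
    "The hippocampus is part of the brain",
    "FAISS is a library for similarity search",
]
answer = answer_question("What is FAISS?", docs, keyword_retriever, first_context_reader)
print(answer)
```

Swapping `keyword_retriever` for a dense retriever, or `first_context_reader` for a real reader or generator model, would leave `answer_question` unchanged, which is the point of treating the stack as separate components.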

The job of the retriever is critical to our reader performance. Given a query, it must find the most relevant contexts.

If the reader is being given *incorrect* contexts, it will output *incorrect* answers.

If the reader is being given *correct* contexts, it *may* output *correct* answers.

So, if we want to have any chance of outputting good answers, the retriever must work well.

---

## Sparse Retrievers

Before DPR, we relied on sparse vector retrievers for the task of finding relevant information from our document stores. For this, we used TF-IDF or BM25.

### TF-IDF

The TF-IDF algorithm is a popular option for calculating the similarity of two pieces of text.

* **TF** (term frequency) measures how often a word from the query appears in the context.
* **IDF** (inverse document frequency) is the inverse of the fraction of documents containing this word.

These two values are then multiplied to give the TF-IDF score.

Now, we may find that the word *"hippocampus"* is shared between the query and context. This would increase the TF-IDF score because:

* **TF** - the word is found in both the query and the context (high score).
* **IDF** - the word *"hippocampus"* is not found in many other documents (so the inverse of the document frequency is a high number).

Alternatively, if we took the word *"the"*, we would return a low TF-IDF score because:

* **TF** - the word is found in both the query and the context (high score).
* **IDF** - the word *"the"* is found in many other documents (so the inverse of the document frequency is a low number).

Because IDF is a low number due to how common *"the"* is, the TF-IDF score is low too.

So, the TF-IDF score is great for finding sequences that contain the *same uncommon* words.
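The *"hippocampus"* versus *"the"* comparison can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the whitespace tokenizer, the length-normalized TF, and the log-scaled IDF are one common choice among several variants, and the three documents are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_score(query, doc_index, docs):
    """Sum TF * IDF over the distinct query words for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    doc_tokens = tokenized[doc_index]
    counts = Counter(doc_tokens)
    score = 0.0
    for term in set(tokenize(query)):
        # TF: share of the document's tokens that are this term.
        tf = counts[term] / len(doc_tokens)
        # Document frequency: how many documents contain the term at all.
        df = sum(1 for tokens in tokenized if term in tokens)
        if df == 0:
            continue  # term appears nowhere, so it contributes nothing
        # IDF (log-scaled, an assumption): rare terms get a large weight.
        idf = math.log(len(docs) / df)
        score += tf * idf
    return score


docs = [
    "the hippocampus is a part of the brain",
    "the cat sat on the mat",
    "the weather is nice today",
]

rare_word_score = tf_idf_score("hippocampus", 0, docs)
common_word_score = tf_idf_score("the", 0, docs)
print(rare_word_score, common_word_score)
```

Because *"the"* appears in every document, its IDF is `log(3 / 3) = 0` and it contributes nothing, while *"hippocampus"* appears in only one document and dominates the score.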

### BM25

BM25 is a variation of TF-IDF. Here, we still calculate TF and IDF, but the TF contribution saturates: each additional match of a word between the query and context adds less to the score than the one before it.

BM25 also considers the document length. The score is normalized so that short documents will score better than long documents given they both have the same number of word matches.

When using sparse retrievers, BM25 is typically favored over TF-IDF.
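Both BM25 behaviours, TF saturation and length normalization, can be seen in a small sketch. The formula below is the widely used Okapi BM25 form with a non-negative IDF variant; the parameter defaults `k1=1.5` and `b=0.75` and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query, doc_index, docs, k1=1.5, b=0.75):
    """Okapi-style BM25 score of docs[doc_index] for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    doc_tokens = tokenized[doc_index]
    counts = Counter(doc_tokens)
    n_docs = len(docs)
    score = 0.0
    for term in set(query.lower().split()):
        freq = counts[term]
        if freq == 0:
            continue
        df = sum(1 for tokens in tokenized if term in tokens)
        # Non-negative IDF variant: rare terms get a larger weight.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalization: documents longer than average are penalized.
        length_norm = 1.0 - b + b * len(doc_tokens) / avg_len
        # Saturating TF: grows with freq but is bounded by (k1 + 1).
        score += idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)
    return score


docs = [
    "hippocampus memory",
    "hippocampus memory and many other unrelated filler words here",
    "the cat sat on the mat",
]

short_doc_score = bm25_score("hippocampus", 0, docs)
long_doc_score = bm25_score("hippocampus", 1, docs)
print(short_doc_score, long_doc_score)
```

Both documents contain *"hippocampus"* exactly once, but the shorter one scores higher. Setting `b=0` switches length normalization off, and a larger `k1` makes the TF term saturate more slowly.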

---

## Dense Passage Retrieval

Dense Passage Retrieval (DPR) for open-domain Q&A (ODQA) was introduced in 2020 as an alternative to the traditional TF-IDF and BM25 techniques for passage retrieval.

### Pros

The [paper that introduced DPR](https://arxiv.org/pdf/2004.04906.pdf) begins by stating that this new approach outperforms a strong Lucene-BM25 system (Lucene being the search library behind the BM25 baseline) by 9–19% absolute in top-20 passage retrieval accuracy.

DPR is able to outperform the traditional sparse retrieval methods for two key reasons:

* Semantically similar words (*"hi", "hello", "hey"*) will not be viewed as a match by TF. DPR uses dense vectors encoded with semantic meaning (*so "hi", "hello", and "hey" will closely match*).

* Sparse retrievers are not trainable. DPR uses embedding functions that we can train and fine-tune for specific tasks.

### Cons

Despite these clear performance benefits, it's not all good news. Yes, we can train our DPR model, but that's also a disadvantage: TF-IDF and BM25 come ready-to-go, whereas DPR does not.

As is usually the case in ML, DPR requires a lot of training data - which in this case is a curated dataset of question and context pairs.

### Two BERTs, And Training

DPR works by using two separate BERT encoder models. One of those models - Eᴘ - encodes passages of text into *passage* vectors (these are the context vectors we store in our document store).

The other model - EQ - maps a question into an encoded *question* vector.

During training, we feed a question-context pair into our DPR model, and the model weights will be optimized to maximize the dot product between two respective Eᴘ/EQ model outputs:

![Information flow during DPR training](../../assets/images/qa_dpr_training.png)

The dot product between the two model outputs Eᴘ(p) and EQ(q) serves as the similarity score between both vectors. A higher dot product means a higher similarity: for vectors of comparable length, the more closely they point in the same direction, the larger the dot product.

By training the two models to maximize this dot product, we are training the context encoder and question encoder to output very similar vectors for related question-context pairs.
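Maximizing the dot product for matching pairs only works as a training signal if non-matching pairs are pushed down at the same time. One common way to express this, and the kind of objective the DPR paper describes, is a negative log-likelihood loss in which the other passages in the batch act as negatives. The sketch below is an assumption-laden illustration: the hand-written 2-d vectors stand in for encoder outputs, and `in_batch_nll` is a hypothetical helper, not part of any library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(question_vecs, passage_vecs):
    """Mean negative log-likelihood where passage i is the positive for question i.

    Every other passage in the batch is treated as a negative for question i.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_s = max(scores)
        log_sum_exp = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
        # -log softmax(scores)[i]: small when the positive has the top score.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Toy stand-ins for EQ(q) and Eᴘ(p) outputs (assumed values).
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[2.0, 0.0], [0.0, 2.0]]   # each positive matches its question
shuffled_passages = [[0.0, 2.0], [2.0, 0.0]]  # positives point the wrong way

aligned_loss = in_batch_nll(questions, aligned_passages)
shuffled_loss = in_batch_nll(questions, shuffled_passages)
print(aligned_loss, shuffled_loss)
```

The loss is low when each question has its largest dot product with its own passage and high otherwise, so minimizing it with gradient descent pulls matching question and passage vectors together.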

### At Runtime

Once the two encoders have been trained, they are ready for Q&A indexing and retrieval.

When we first build our document store, we need to encode the data we store in there using the Eᴘ encoder. So, during document store initialization (or when adding new documents), we run every passage of text through the Eᴘ encoder and store the output vectors in our document store.

For real-time Q&A, we only need the EQ encoder. When we ask a question, it will be sent to the EQ encoder which then outputs our EQ vector EQ(q).

Next, the EQ(q) vector is compared against the already indexed Eᴘ(p) vectors in our document store - where we filter for the vectors which return the highest similarity score:

**sim(q,p) = EQ(q)ᵀ Eᴘ(p)**

And that's it - the retriever will have identified the most relevant contexts for a given question.
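The retrieval step itself is a small amount of code once the vectors exist. The sketch below assumes the passage vectors have already been produced by the Eᴘ encoder and the question vector by the EQ encoder; here they are hand-written 3-d toy values, and `retrieve` is a hypothetical brute-force helper rather than a FAISS call.

```python
def dot(u, v):
    """sim(q, p): dot product of the question and passage vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(question_vec, passage_vecs, top_k=2):
    """Return (passage_index, score) for the top_k highest-scoring passages."""
    scored = [(dot(question_vec, p), idx) for idx, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Assumed toy vectors standing in for indexed Eᴘ(p) outputs.
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
# Assumed toy vector standing in for EQ(q).
question_vec = [1.0, 0.2, 0.0]

results = retrieve(question_vec, passage_vecs, top_k=2)
for idx, score in results:
    print(idx, round(score, 2))
```

This scan compares the question against every stored vector, which is fine for a handful of passages. A vector index such as FAISS exists to make the same top-k search fast over much larger collections.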

These relevant contexts are then passed on to our reader (or generator) model, which will create an answer based on the contexts - the Q&A process is complete.

Let's move on to that final step in the process.