<a href="https://colab.research.google.com/github/kandloic/haystack/blob/master/Findings_Tutorial6_Better_Retrieval_via_DPR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Better retrieval via "Dense Passage Retrieval"


### Importance of Retrievers

The Retriever has a huge impact on the performance of our overall search pipeline.


### Different types of Retrievers
#### Sparse
A family of algorithms based on counting the occurrences of words (bag-of-words), resulting in very sparse vectors whose length equals the vocabulary size.

Examples: BM25, TF-IDF  
Pros: Simple, fast, well explainable  
Cons: Relies on exact keyword matches between query and text  
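
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not BM25 or TF-IDF, which additionally weight terms by rarity and document length; it only counts words. The documents, the query and the helper names (`bag_of_words`, `overlap_score`) are invented for illustration.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def overlap_score(query_vec, doc_vec):
    """Dot product of two count vectors: it grows only with shared words."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Invented toy corpus: both documents say roughly the same thing.
docs = [
    "the king rules the north",
    "the monarch governs the northern realm",
]
query = "king of the north"

# The vocabulary is every distinct word in the documents and the query.
vocab = sorted(set(" ".join(docs + [query]).lower().split()))

doc_vectors = [bag_of_words(d, vocab) for d in docs]
query_vector = bag_of_words(query, vocab)
scores = [overlap_score(query_vector, d) for d in doc_vectors]

print(len(vocab), scores)
```

Each vector has one entry per vocabulary word and most entries are zero. The second document scores lower than the first even though it is a paraphrase, because the only word it shares with the query is "the".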
 

#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two different approaches: 

a) Single encoder: Use a **single model** to embed both query and passage.  
b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...). 

Examples: REALM, DPR, Sentence-Transformers ...  
Pros: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics ...)  
Cons: Computationally heavier, requires initial training of the model  
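
The dual-encoder idea can be sketched in a few lines. The two "encoders" below are only lookup tables with hand-made 3-dimensional vectors; real models produce such vectors from text, and every number and name here is invented for illustration. Query and passage are scored by the dot product of their embeddings.

```python
# Invented embeddings standing in for the outputs of two separate encoders.
QUERY_EMBEDDINGS = {
    "who rules the north?": [0.9, 0.1, 0.0],
}
PASSAGE_EMBEDDINGS = {
    "The monarch governs the northern realm.": [0.8, 0.2, 0.1],
    "Dragons breathe fire.": [0.0, 0.1, 0.9],
}

def encode_query(text):
    """Hypothetical query encoder (a lookup table in this sketch)."""
    return QUERY_EMBEDDINGS[text]

def encode_passage(text):
    """Hypothetical passage encoder, separate from the query encoder."""
    return PASSAGE_EMBEDDINGS[text]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query, passages):
    """Score every passage against the query and sort best-first."""
    q = encode_query(query)
    scored = [(dot(q, encode_passage(p)), p) for p in passages]
    return sorted(scored, reverse=True)

ranking = rank("who rules the north?", list(PASSAGE_EMBEDDINGS))
for score, passage in ranking:
    print(round(score, 2), passage)
```

The passage about the monarch ranks first although it shares no content word with the query, which is exactly the kind of match a purely keyword-based retriever misses.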


### "Dense Passage Retrieval"

In this tutorial, we want to highlight one "Dense Dual-Encoder" called Dense Passage Retriever. 
It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). 

Original Abstract: 

_"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."_

Paper: https://arxiv.org/abs/2004.04906  
Original Code: https://fburl.com/qa-dpr 
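
For reference, the scoring function and training objective as formulated in the DPR paper can be written as follows. Here $E_Q$ and $E_P$ denote the question and passage encoders, $p^+$ is a passage that answers question $q$, and $p^-_1, \dots, p^-_n$ are negative passages; the symbol names follow the paper as best recalled and should be checked against it.

$$
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)
$$

$$
L(q, p^+, p^-_1, \dots, p^-_n) = -\log \frac{\exp\big(\mathrm{sim}(q, p^+)\big)}{\exp\big(\mathrm{sim}(q, p^+)\big) + \sum_{j=1}^{n} \exp\big(\mathrm{sim}(q, p^-_j)\big)}
$$

Minimizing this loss pushes the embedding of a question closer (in dot-product terms) to its positive passage than to the negatives.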


*Use this [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb) to open the notebook in Google Colab.*


## Prepare environment

### Colab: Enable the GPU runtime 
Make sure you enable the GPU runtime to experience decent speed in this tutorial.  
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg">

In [None]:
# Make sure you have a GPU running
!nvidia-smi

Thu Aug 13 12:21:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
! pip install git+https://github.com/deepset-ai/haystack.git

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-t11prui6
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-t11prui6
Collecting farm==0.4.6
  Downloading https://files.pythonhosted.org/packages/e2/93/1beb613753a9845b689eee4571ba4a7f3210b60b4bd90f024fc324c96785/farm-0.4.6-py3-none-any.whl (184kB)
     |████████████████████████████████| 194kB 12.7MB/s 
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/82/cb/96cb7cc6a807af493f0083e7d854fdd568ae5335f8f93b96c966fabd8d2f/fastapi-0.61.0-py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 7.3MB/s 
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/32/9a/5f619c02f36e751071c2b7eaa37a7c4b767feb41e4c2de48e8fbe4e7b451/uvicorn-0.11.8-py3-none-any.whl (43kB)
     |████████████████████████████████| 51kB 7.3MB/s 
Collectin

In [None]:
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

## Document Store

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), you can instead download the Elasticsearch release archive manually and run it from there.

In [None]:
# Recommended: Start Elasticsearch using Docker
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2
# wait until ES has started
#! sleep 30

In [None]:
# In Colab / No Docker environments: Start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
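
A fixed `sleep 30` can be too short on a slow machine and wastes time on a fast one. As an alternative, here is a minimal polling sketch that uses only the standard library. The helper name `wait_for_http` and its default values are assumptions made for this example; it is not part of Haystack or Elasticsearch.

```python
import time
import urllib.error
import urllib.request

def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until a server answers or `timeout` seconds have passed.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server is up; it just answered with a non-2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet (or the request timed out): retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_http("http://localhost:9200")` after launching the server would then block only until Elasticsearch responds on the port used in this tutorial.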

In [None]:
# Connect to Elasticsearch
from haystack.database.elasticsearch import ElasticsearchDocumentStore

# `embedding_field` and `embedding_dim` must be set when we plan to use a dense retriever
# (768 matches the output size of the BERT-base encoders used by DPR below)
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document", 
                                            embedding_field="embedding", embedding_dim=768)

08/13/2020 12:25:03 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.401s]
08/13/2020 12:25:03 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.223s]


## Cleaning & indexing documents

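As in the previous tutorials, we convert some documents and index them in our DocumentStore. The original tutorial downloads Game of Thrones articles for this; in this run the download is commented out, so the files already present in `doc_dir` are indexed instead.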

In [None]:
# Let's first get some files that we want to use
doc_dir = "data"
#s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
#fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

08/13/2020 12:25:04 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.608s]


## Initialize Retriever, Reader, & Finder

### Retriever

**Here:** We use a `DensePassageRetriever`

**Alternatives:**

- The `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging

In [None]:
from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store, embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)

# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)

Loading from  https://dl.fbaipublicfiles.com/dpr/checkpoint/retriever/single/nq/hf_bert_base.cp


08/13/2020 12:26:23 - INFO - haystack.retriever.dpr_utils -   Loading saved model from models/dpr/checkpoint/retriever/single/nq/bert-base-encoder.cp


Saved to  models/dpr/checkpoint/retriever/single/nq/bert-base-encoder.cp


08/13/2020 12:26:23 - INFO - haystack.retriever.dense -   Loaded encoder params:  {'do_lower_case': True, 'pretrained_model_cfg': 'bert-base-uncased', 'encoder_model_type': 'hf_bert', 'pretrained_file': None, 'projection_dim': 0, 'sequence_length': 256}
08/13/2020 12:26:24 - INFO - filelock -   Lock 139669666690216 acquired on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock


08/13/2020 12:26:25 - INFO - filelock -   Lock 139669666690216 released on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock





08/13/2020 12:26:25 - INFO - filelock -   Lock 139669666691280 acquired on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock


08/13/2020 12:26:33 - INFO - filelock -   Lock 139669666691280 released on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock





08/13/2020 12:26:41 - INFO - haystack.retriever.dense -   Loading saved model state ...
08/13/2020 12:26:41 - INFO - haystack.retriever.dense -   Loading saved model state ...
08/13/2020 12:26:42 - INFO - filelock -   Lock 139669666663168 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


08/13/2020 12:26:44 - INFO - filelock -   Lock 139669666663168 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock





08/13/2020 12:26:44 - INFO - elasticsearch -   POST http://localhost:9200/document/_search?scroll=5m&size=1000 [status:200 request:0.150s]
08/13/2020 12:26:45 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.016s]
08/13/2020 12:26:45 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.010s]
08/13/2020 12:26:45 - INFO - haystack.database.elasticsearch -   Updating embeddings for 18 docs ...
08/13/2020 12:26:55 - INFO - elasticsearch -   POST http://localhost:9200/_bulk [status:200 request:0.600s]


### Reader

Similar to the previous tutorials, we now initialize our reader.

Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)



#### FARMReader

In [None]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

08/13/2020 12:26:55 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/13/2020 12:26:55 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
08/13/2020 12:26:56 - INFO - filelock -   Lock 139669670832504 acquired on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock


08/13/2020 12:26:57 - INFO - filelock -   Lock 139669670832504 released on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock





08/13/2020 12:26:58 - INFO - filelock -   Lock 139669649674536 acquired on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock


08/13/2020 12:27:06 - INFO - filelock -   Lock 139669649674536 released on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock





	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
08/13/2020 12:27:20 - INFO - filelock -   Lock 139669651643192 acquired on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock


08/13/2020 12:27:22 - INFO - filelock -   Lock 139669651643192 released on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock





08/13/2020 12:27:23 - INFO - filelock -   Lock 139669651643192 acquired on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock


08/13/2020 12:27:25 - INFO - filelock -   Lock 139669651643192 released on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock





08/13/2020 12:27:27 - INFO - filelock -   Lock 139669651643192 acquired on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock


08/13/2020 12:27:28 - INFO - filelock -   Lock 139669651643192 released on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock





08/13/2020 12:27:29 - INFO - filelock -   Lock 139669645157936 acquired on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock


08/13/2020 12:27:30 - INFO - filelock -   Lock 139669645157936 released on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock





08/13/2020 12:27:31 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
08/13/2020 12:27:31 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
08/13/2020 12:27:31 - INFO - farm.infer -    0 
08/13/2020 12:27:31 - INFO - farm.infer -   /w\
08/13/2020 12:27:31 - INFO - farm.infer -   /'\
08/13/2020 12:27:31 - INFO - farm.infer -   


### Finder

The Finder sticks together reader and retriever in a pipeline to answer our actual questions. 
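
Conceptually, such a retrieve-then-read pipeline can be sketched as below. This is a toy illustration and not Haystack's `Finder` implementation: the retriever and reader are simple word-overlap stand-ins, and all names (`toy_retriever`, `toy_reader`, `get_answers`) and documents are invented. It only shows how the two `top_k` parameters act at different stages.

```python
def word_set(text):
    """Lower-case a text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def toy_retriever(question, documents, top_k):
    """Stage 1: keep the top_k documents sharing the most words with the question."""
    q = word_set(question)
    ranked = sorted(documents, key=lambda doc: len(q & word_set(doc)), reverse=True)
    return ranked[:top_k]

def toy_reader(question, document):
    """Stage 2: pick the best-matching sentence of one document as the 'answer'."""
    q = word_set(question)
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    best = max(sentences, key=lambda s: len(q & word_set(s)))
    return {"answer": best, "score": len(q & word_set(best)), "context": document}

def get_answers(question, documents, top_k_retriever, top_k_reader):
    """Run the reader only on retrieved candidates, then keep the best answers."""
    candidates = toy_retriever(question, documents, top_k_retriever)
    answers = [toy_reader(question, doc) for doc in candidates]
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]

documents = [
    "Arya is a daughter of Ned Stark. She trained in Braavos.",
    "Dragons are large reptiles. They breathe fire.",
    "Sansa is the sister of Arya. She lived in Winterfell.",
]
result = get_answers("Who is the sister of Arya?", documents,
                     top_k_retriever=2, top_k_reader=1)
print(result[0]["answer"])
```

The reader, which is the expensive component in a real system, only ever sees the documents that the retriever passes on. That is why a larger `top_k_retriever` tends to improve recall at the cost of speed.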

In [None]:
finder = Finder(reader, retriever)

## Voilà! Ask a question!

In [None]:
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers. 
prediction = finder.get_answers(question="What are the past findings related to governance?", top_k_retriever=20, top_k_reader=20)

#prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)
#prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_retriever=10, top_k_reader=5)

08/13/2020 12:27:31 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.112s]
08/13/2020 12:27:31 - INFO - haystack.finder -   Reader is looking for detailed answer in 386006 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:21<00:00, 21.50s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:21<00:00, 21.34s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:24<00:00, 24.74s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:15<00:00, 15.73s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:09<00:00,  9.60s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:09<00:00,  9.58s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:16<00:00, 16.56s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:10<00:00, 10.59s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:10<00:00, 10.04s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:14<00:00, 14.66s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [0

In [None]:
print_answers(prediction, details="minimal")

[   {   'answer': 'similar to those of the previous years',
        'context': 't for the CRA were as follows:\n'
                   'These results, which are similar to those of the previous '
                   'years, indicate that there is an opportunity to make '
                   'improvem'},
    {   'answer': 'In the past, the Agency has had a lack of dedicated benefit '
                  'management expertise for FC projects',
        'context': 'cy level when applicable. \n'
                   'In the past, the Agency has had a lack of dedicated '
                   'benefit management expertise for FC projects. In 2014, RMD '
                   'began to inc'},
    {   'answer': 'The internal audit also found that the committee and the '
                  'taxpayer relief general enquiries mailbox are not supported '
                  'by processes that validate the receipt and actions needed '
                  'to address identified issues',
        'context': ' The intern