# Question Answering with arXiv papers at scale
This notebook is about neural question answering using transformer models (RoBERTa) at SCALE. The approach below can perform Q&A across millions of documents in a few seconds.

I will be using the abstracts of arXiv papers for Q&A at this point in time, as I do not have access to the actual PDF texts. However, the same approach can be used to seek answers in the full text instead of just the abstracts.

I will post another notebook when I get my hands on the papers' full texts. Now let's dive in...

### Activate GPU support

In [None]:
import torch
torch.cuda.is_available()

True

If successful, the cell above should print `True`. Note that Google Colaboratory also offers [TPU](https://cloud.google.com/tpu/) support. These *Tensor Processing Units* are specifically designed for machine learning tasks and may outperform conventional GPUs. While support for TPUs in PyTorch is still pending, [TensorFlow](https://www.tensorflow.org/) models may benefit from using TPUs (see [this tutorial](https://colab.research.google.com/notebooks/tpu.ipynb)).

### Useful commands

Within the notebook environment, you can not only execute Python code, but also bash commands by prepending a `!`. For example, you can install new Python packages via the package manager `pip`. Here, we just check the installed version of PyTorch:

In [None]:
!pip show torch

Name: torch
Version: 1.6.0+cu101
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.6/dist-packages
Requires: future, numpy
Required-by: torchvision, torchtext, fastai


Another useful command is `!kill -9 -1`. It will reset all running kernels and free up memory (including GPU memory). Furthermore, there are a few commands for taking a closer look at the hardware specifications, i.e. for getting information about the installed CPU and GPU:

In [None]:
!lscpu |grep 'Model name'

Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz


In [None]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-a6dcd1b8-f639-9840-8c50-c78f7241b836)


In addition, you can check the available RAM and disk space:

In [None]:
!cat /proc/meminfo | grep 'MemAvailable'

MemAvailable:   12458600 kB


In [None]:
!df -h / | awk '{print $4}'

Avail
35G


Finally, one can execute the following command to get a snapshot of the current GPU usage. This is useful for checking how much of the GPU memory is in use when tuning the batch size for training. Note that whenever the training routine in a notebook is still running, you need to execute this command in another Colaboratory notebook to get an instant response:

In [None]:
!nvidia-smi

Sat Sep 12 05:04:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    31W / 250W |     10MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
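
If you only need the memory figures rather than the full table, `nvidia-smi` can also emit them as CSV. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_memory` and `gpu_memory_usage` are made up for this illustration.

```python
import subprocess


def parse_gpu_memory(csv_line):
    """Parse one 'used, total' line (values in MiB) from nvidia-smi's CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_usage():
    """Return a list of (used_mib, total_mib) tuples, one per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return [parse_gpu_memory(line)
            for line in result.stdout.splitlines() if line.strip()]
```

Calling `gpu_memory_usage()` before and after loading a model gives a rough idea of how much headroom is left for a larger batch size.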

### Mount Google Drive
Another important prerequisite for training our neural network is a place to save checkpoints of the trained model and to store obtained training data. Colaboratory provides convenient access to Google Drive via the `google.colab` Python module. The following command will mount your Google Drive contents to the folder path `/content/gdrive` on the Colaboratory instance. For authentication, you have to click the generated link and paste the authorization code into the input field:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Download the arXiv metadata with gsutil
We will need the gsutil utility from the Google Cloud SDK. First, you need to authenticate yourself in Colab. Once you run the code below, it will ask you to follow a link to log in and enter the access token that you receive upon successful login.


In [None]:
from google.colab import auth
auth.authenticate_user()

We will use the gsutil command to upload and download files, so we first need to install the Google Cloud SDK.

In [None]:
!curl https://sdk.cloud.google.com | bash
!gcloud init

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='/content/.config/logs/2020.09.12/05.06.17.945611.log'>
Call stack:
  File "/tools/google-cloud-sdk/lib/gcloud.py", line 104, in <module>
    main()
  File "/tools/google-cloud-sdk/lib/gcloud.py", line 100, in main
    sys.exit(gcloud_main.main())
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 186, in main
    gcloud_cli.Execute()
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 983, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/tools/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 808, in Run
    resources = command_instance.Run(args)
  File "/tools/google-cloud-sdk/lib/surface/init.py", line 150, in

### Download the JSON metadata from the cloud

In [None]:
!gsutil cp -n gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json /content/gdrive/My\ Drive/arxiv-metadata-oai.json
!ls -l /content/gdrive/My\ Drive

Copying gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json...
\ [1 files][  4.2 GiB/  4.2 GiB]   32.1 MiB/s                                   
Operation completed over 1 objects/4.2 GiB.                                      
total 4877599
-rw------- 1 root root      31744 Aug 25  2017  654.doc
-rw------- 1 root root        151 Aug 25  2017  654.doc.gdoc
drwx------ 2 root root       4096 Jun  4 15:55  ADHD
-rw------- 1 root root 4503980272 Sep 12 05:13  arxiv-metadata-oai.json
drwx------ 2 root root       4096 Apr 29  2016 'Augmented Reality'
-rw------- 1 root root        151 Sep 15  2018 'Banque de France.gdoc'
-rw------- 1 root root        151 Oct  3  2016 'BPCE - Dylag Héléna.gdoc'
-rw------- 1 root root    1506388 Nov 16  2018  cdc-app-word.docx
-rw------- 1 root root        151 Nov 16  2018  cdc-app-word.gdoc
-rw------- 1 root root    2291303 Nov 16  2018  cdc-site-word.docx
-rw------- 1 root root        151 Nov 16  2018  cdc-site-word.gdoc
-rw------- 1 root root       6493 J


### Reading the entire JSON metadata
This cell may take a few minutes to run, considering the volume of data.

In [None]:
import os
import tqdm
import json

input_file = "/content/gdrive/My Drive/arxiv-metadata-oai.json"

data = []
with tqdm.tqdm(total=os.path.getsize(input_file)) as pbar:
    with open(input_file, 'r') as f:
        for line in f:
            pbar.update(len(line))
            data.append(json.loads(line))

100%|██████████| 4503980272/4503980272 [02:34<00:00, 29065691.60it/s]


I'm limiting my analysis to just 50,000 documents because of compute limits.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.DataFrame(data[:50000])
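
Since only the first 50,000 records are kept, one alternative is to stop parsing once that many lines have been read, so the whole 4.2 GiB file never has to sit in memory. The sketch below is one way to do that with `itertools.islice`; the helper name `take_records` is made up for this illustration.

```python
import itertools
import json


def take_records(lines, limit):
    """Parse at most `limit` JSON lines from an iterable of strings.

    `lines` can be an open file handle; iteration stops after `limit`
    lines, so the remainder of the file is never read.
    """
    return [json.loads(line) for line in itertools.islice(lines, limit)]


# Hypothetical usage with the file from this notebook:
# with open(input_file, "r") as f:
#     data = pd.DataFrame(take_records(f, 50000))
```

This yields the same first 50,000 rows as slicing the full list, provided the file is read in the same order.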

### Welcome Haystack!

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/sketched_concepts_white.png">

The secret sauce behind scaling up is **Haystack**. It lets you scale QA models to large collections of documents! 
You can read more about this amazing library here https://github.com/deepset-ai/haystack

For installation: `! pip install git+https://github.com/deepset-ai/haystack.git`

But just to give a background, there are 3 major components to Haystack.
1. **Document Store:** Database storing the documents for our search. The Haystack authors recommend Elasticsearch, but there are also more lightweight options for fast prototyping (SQL or In-Memory).
2. **Retriever:** Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever helps to narrow down the scope for Reader to smaller units of text where a given question could be answered.
3. **Reader:** Powerful neural model that reads through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD-like tasks. The Reader takes multiple passages of text as input and returns the top-n answers with corresponding confidence scores. You can just load a pretrained model from Hugging Face's model hub or fine-tune it on your own domain data.

And then there is **Finder** which glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.
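
To make the division of labour concrete, here is a toy sketch of the retrieve-then-read idea in plain Python. It is not Haystack's implementation: `answer_question`, `toy_retrieve`, `toy_read` and the three example documents are all invented for illustration, and word overlap stands in for both BM25 and the neural reader.

```python
import re

DOCS = [
    "Symbiotic stars are interacting binary systems.",
    "Morse theory relates critical points to topology.",
    "Binary stars orbit a common centre of mass.",
]


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(question):
    """Cheap first stage: rank documents by word overlap, drop non-matches."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc)), doc) for doc in DOCS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored if overlap > 0]


def toy_read(question, passage):
    """Stand-in for the expensive reader: score one passage for the question."""
    q = tokenize(question)
    return {"answer": passage, "score": len(q & tokenize(passage)) / len(q)}


def answer_question(question, retrieve, read, top_k_retriever=10, top_k_reader=2):
    """Run the retriever first, then the reader only on its top candidates."""
    candidates = retrieve(question)[:top_k_retriever]
    answers = [read(question, passage) for passage in candidates]
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]


best = answer_question("What are symbiotic stars?", toy_retrieve, toy_read,
                       top_k_reader=1)
```

The point of the design is that the slow reader only ever sees the handful of passages the fast retriever lets through, which is what keeps the approach tractable on large collections.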

In [None]:
# installing haystack
! pip install git+https://github.com/deepset-ai/haystack.git

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-iesig7ix
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-iesig7ix
Collecting farm==0.4.7
  Downloading https://files.pythonhosted.org/packages/d6/ab/dd3bf1921645519ecc8a416f9a87ac6a5c24fe90b841115f8d9654778b2f/farm-0.4.7-py3-none-any.whl (187kB)
     |████████████████████████████████| 194kB 2.4MB/s 
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/48/65/454fb440d48098845875b5ba8599efafee1efabb97720a584c78674e6d26/fastapi-0.61.1-py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 3.5MB/s 
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/32/9a/5f619c02f36e751071c2b7eaa37a7c4b767feb41e4c2de48e8fbe4e7b451/uvicorn-0.11.8-py3-none-any.whl (43kB)
     |████████████████████████████████| 51kB 4.7MB/s 
Collecting

In [None]:
# importing necessary dependencies
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

### Setting up DocumentStore
Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

The Haystack authors recommend `ElasticsearchDocumentStore` because it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

So let's set up an `ElasticsearchDocumentStore`.

In [None]:
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2
 
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
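
A fixed `sleep 30` can be too short on a slow instance and needlessly long on a fast one. As an alternative, the sketch below polls the HTTP endpoint until it responds. It assumes Elasticsearch listens on `http://localhost:9200` (the address that appears in the logs of this notebook); the helper names `wait_until_ready` and `elasticsearch_is_up` are made up for this illustration.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=60.0, interval=1.0):
    """Call `probe()` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET on `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the server is not ready yet.
        return False
```

With these in place, `wait_until_ready(elasticsearch_is_up)` could replace the fixed sleep, and a `False` return value signals that the server never came up.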

In [None]:
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

09/12/2020 05:24:42 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.413s]
09/12/2020 05:24:42 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.266s]


Once `ElasticsearchDocumentStore` is set up, we will write our documents/texts to the DocumentStore.
* Writing documents to `ElasticsearchDocumentStore` requires a specific format: a **list of dictionaries**.
The default format here is: 
`[{"name": "<some-document-name>", "text": "<the-actual-text>"},
{"name": "<some-document-name>", "text": "<the-actual-text>"},
{"name": "<some-document-name>", "text": "<the-actual-text>"}]`

(Optionally, you can add more key-value pairs here; they will be indexed as fields in Elasticsearch and can be accessed later for filtering or shown in the responses of the Finder.)

* We will pass the **title** column as `name` and the **abstract** column as `text`.

In [None]:
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(data[['title', 'abstract']].rename(columns={'title':'name','abstract':'text'}).to_dict(orient='records'))

09/12/2020 05:25:02 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.082s]
09/12/2020 05:25:03 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.037s]
09/12/2020 05:25:04 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.032s]
09/12/2020 05:25:05 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.090s]
09/12/2020 05:25:07 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.012s]
09/12/2020 05:25:08 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.005s]
09/12/2020 05:25:09 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.001s]
09/12/2020 05:25:10 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.010s]
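
The `name`/`text` format described earlier can also be built without pandas, which makes it easy to carry extra fields along. This is a minimal sketch: `to_haystack_dicts` is a made-up helper, and the `id` field in the example record is an assumption about what the metadata contains.

```python
def to_haystack_dicts(records, name_key="title", text_key="abstract", extra_keys=()):
    """Map raw metadata records to the name/text dictionaries described above.

    Any key listed in `extra_keys` is copied over unchanged, so it would be
    indexed as an additional field.
    """
    docs = []
    for record in records:
        doc = {"name": record[name_key], "text": record[text_key]}
        for key in extra_keys:
            doc[key] = record[key]
        docs.append(doc)
    return docs


# Example with a made-up record:
example = [{"id": "0000.0000", "title": "A title", "abstract": "Some text."}]
docs = to_haystack_dicts(example, extra_keys=("id",))
```

The resulting list has the same shape as the output of the `rename(...).to_dict(orient='records')` call used in the cell above.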


### Let's prepare Retriever, Reader,  & Finder
**Retrievers** help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.

Here, we use Elasticsearch's default BM25 algorithm.

In [None]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)
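
For intuition about what BM25 ranking does, here is a small pure-Python sketch of the textbook Okapi BM25 formula. It is not Elasticsearch's implementation: the whitespace tokenisation is a simplification, `bm25_scores` is a made-up name, and `k1=1.2`, `b=0.75` are commonly cited defaults assumed here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["symbiotic stars are binaries",
          "morse theory and topology",
          "stars in the galaxy"]
scores = bm25_scores("symbiotic stars", corpus)
```

In this toy corpus the first document matches both query terms, the third matches only the more common term, and the second matches nothing, so the scores are ordered accordingly.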

A **Reader** scans the texts returned by retrievers in detail and extracts the k best answers. Readers are based on powerful, but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

Here, we use a medium-sized RoBERTa QA model with a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2).

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True, context_window_size=500)

09/12/2020 05:29:31 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
09/12/2020 05:29:31 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
09/12/2020 05:29:32 - INFO - filelock -   Lock 140450060292392 acquired on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock


09/12/2020 05:29:33 - INFO - filelock -   Lock 140450060292392 released on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock





09/12/2020 05:29:34 - INFO - filelock -   Lock 140450060325776 acquired on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock


09/12/2020 05:30:20 - INFO - filelock -   Lock 140450060325776 released on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock





	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/12/2020 05:31:26 - INFO - filelock -   Lock 140450060343448 acquired on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock


09/12/2020 05:31:28 - INFO - filelock -   Lock 140450060343448 released on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock





09/12/2020 05:31:28 - INFO - filelock -   Lock 140450059230568 acquired on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock


09/12/2020 05:31:30 - INFO - filelock -   Lock 140450059230568 released on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock





09/12/2020 05:31:32 - INFO - filelock -   Lock 140450059230568 acquired on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock


09/12/2020 05:31:33 - INFO - filelock -   Lock 140450059230568 released on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock





09/12/2020 05:31:33 - INFO - filelock -   Lock 140450059230568 acquired on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock


09/12/2020 05:31:34 - INFO - filelock -   Lock 140450059230568 released on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock





09/12/2020 05:31:35 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
09/12/2020 05:31:35 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
09/12/2020 05:31:35 - INFO - farm.infer -    0 
09/12/2020 05:31:35 - INFO - farm.infer -   /w\
09/12/2020 05:31:35 - INFO - farm.infer -   /'\
09/12/2020 05:31:35 - INFO - farm.infer -   


And finally, the **Finder** joins the Reader and the Retriever in a pipeline to answer our actual questions.

In [None]:
finder = Finder(reader, retriever)

### And we're done!
Below is the list of questions I asked the model; the results were pleasing.

In [None]:
sample_questions = ["What do we know about Bourin and Uchiyama?",
                    "How is structure of event horizon linked with Morse theory?",
                    "What do we know about symbiotic stars"]

In [None]:
prediction = finder.get_answers(question="What do we know about symbiotic stars", top_k_retriever=10, top_k_reader=2)
result = print_answers(prediction, details="minimal")

09/12/2020 05:31:56 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:1.896s]
09/12/2020 05:31:56 - INFO - haystack.retriever.sparse -   Got 10 candidates from retriever
09/12/2020 05:31:56 - INFO - haystack.finder -   Reader is looking for detailed answer in 7283 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.98s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 37.59 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 42.20 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.28 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 39.81 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 50.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 43.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 36.59 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 48.61 Batches/s]
Inferencing Samples: 100%|█

[   {   'answer': 'Their observed population in the\n'
                  'Galaxy is however poorly known, and is one to three orders '
                  'of magnitudes\n'
                  'smaller than the predicted population size',
        'context': '  The study of symbiotic stars is essential to understand '
                   'important aspects of\n'
                   'stellar evolution in interacting binaries. Their observed '
                   'population in the\n'
                   'Galaxy is however poorly known, and is one to three orders '
                   'of magnitudes\n'
                   'smaller than the predicted population size. IPHAS, the INT '
                   'Photometric Halpha\n'
                   'survey of the Northern Galactic plane, gives us the '
                   'opportunity to make a\n'
                   'systematic, complete search for symbiotic stars in a '
                   'magnitude-limited volume,\n'
                   'and discover a s
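
The answers and contexts printed above still carry the hard line breaks of the original abstracts. A small helper can tidy them up for display. This is a sketch that assumes a list of dicts with `answer` and `context` keys, the shape visible in the printed output; `tidy_answers` is a made-up name.

```python
def tidy_answers(answers):
    """Collapse line breaks and repeated spaces in each answer and its context."""
    tidy = []
    for item in answers:
        tidy.append({
            "answer": " ".join(item["answer"].split()),
            "context": " ".join(item["context"].split()),
        })
    return tidy


# Example using a fragment of the output shown above:
sample = [{
    "answer": "Their observed population in the\nGalaxy is however poorly known",
    "context": "  The study of symbiotic stars is essential to understand "
               "important aspects of\nstellar evolution in interacting binaries.",
}]
cleaned = tidy_answers(sample)
```

The helper leaves the input untouched and returns new dictionaries, so the raw prediction stays available for inspection.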