<a href="https://colab.research.google.com/github/leomaurodesenv/big-qa-architecture/blob/main/jupyter/1_Question_Answering_in_Wikipedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering in Wikipedia

Question Answering (QA) is the task of automatically answering questions posed in natural language (typically reading comprehension questions). QA systems serve a variety of use cases. For example, they can extract information from knowledge bases, like a "sophisticated search engine". A knowledge base can be a set of websites, internal documents, or a collection of reports.

This Jupyter Notebook implements a Question Answering system with [Haystack](https://haystack.deepset.ai) over a knowledge base of Wikipedia articles, using [Elasticsearch](https://www.elastic.co) as the Document Store. _This code was executed in Google Colab_.


## Setup

Package installation and setup.

### Package Installation

In [1]:
# Check whether a GPU is available
# The code also runs on CPU
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [2]:
# %%capture
# Install the Haystack
!pip install pip==22.2.2 --quiet
!pip install farm-haystack[colab]==1.8.0 --quiet
# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

# Install Huggingface
!pip install datasets==2.4.0 --quiet
!pip install transformers==4.20.1 --quiet
!pip install sentence-transformers==2.2.2 --quiet
!echo "Silent installation with success!"


### Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.

In [3]:
import logging

# Setup the Haystack logs
logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
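
To see what this configuration does, here is a small sketch using only the standard library. The logger name `some_other_library` is a made-up placeholder, and the levels are set directly on the loggers so the result does not depend on whether `basicConfig` was already called in the session.

```python
import logging

# Root logger: WARNING and above (mirrors level=logging.WARNING in the cell above)
logging.getLogger().setLevel(logging.WARNING)
# The "haystack" logger is more verbose: INFO and above
logging.getLogger("haystack").setLevel(logging.INFO)

# Child loggers inherit the level of their closest configured ancestor
haystack_child = logging.getLogger("haystack.nodes")
other = logging.getLogger("some_other_library")  # hypothetical name, inherits from root

haystack_info_enabled = haystack_child.isEnabledFor(logging.INFO)
other_info_enabled = other.isEnabledFor(logging.INFO)
print(haystack_info_enabled, other_info_enabled)
```

INFO messages from Haystack modules are shown, while INFO messages from other libraries stay hidden.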

---
## Document Store

We use Elasticsearch as the Document Store.  
Elasticsearch supports [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 ranking](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [dense vectors for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

### Starting Elasticsearch
We download the Elasticsearch release archive and start the server manually.

In [4]:
# In Colab / no-Docker environments: start Elasticsearch from the release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
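
A fixed `sleep 30` can be too short on a slow machine and wasteful on a fast one. As an alternative, here is a minimal polling sketch using only the standard library. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the URL assumes Elasticsearch is listening on its default port 9200.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_ready(probe, max_wait=60.0, interval=2.0, sleep=time.sleep):
    """Call probe() repeatedly until it returns True or max_wait seconds have passed."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Usage in the notebook (assumes the default port 9200):
# ready = wait_until_ready(lambda: http_ok("http://localhost:9200"))
```

The `probe` and `sleep` arguments are injected so the waiting logic can be tried without a running server.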

In [5]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


### Download SQuAD Dataset

We are going to use the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/).  
SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles.

The training set contains roughly 19k unique paragraphs, drawn from a few hundred Wikipedia articles on a wide range of topics, such as:
* [Immune system](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Immune_system.html?version=1.1), [Pharmacy](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Pharmacy.html?version=1.1), Antibiotics, Bacteria,
* Windows 8, Database, Software testing, Games,
* Companies, Artists, Geology, Teacher & School, etc
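
In the SQuAD v1.1 JSON file, the top-level `data` list holds articles. Each article has a `title` and a list of `paragraphs`, each paragraph has a `context` and a list of `qas`, and each question has an `id`, a `question`, and a list of `answers` with `text` and `answer_start`. The sketch below walks this nesting on a made-up record (not real dataset content). The helper name `iter_answers` is hypothetical.

```python
# A made-up record in the SQuAD v1.1 layout (illustrative only)
sample = {
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The library opened in 1901. It holds two million books.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "When did the library open?",
                            "answers": [{"text": "1901", "answer_start": 22}],
                        },
                        {
                            "id": "q2",
                            "question": "How many books does the library hold?",
                            "answers": [{"text": "two million", "answer_start": 37}],
                        },
                    ],
                }
            ],
        }
    ]
}


def iter_answers(squad_json):
    """Yield (title, question, answer_text, answer_start, context) for every answer."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (article["title"], qa["question"], answer["text"],
                           answer["answer_start"], paragraph["context"])


rows = list(iter_answers(sample))
for title, question, text, start, context in rows:
    # answer_start is a character offset into the paragraph context
    print(question, "->", context[start:start + len(text)])
```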


In [6]:
# Download SQuAD dataset
doc_dir = "data/SQuAD"
filename = "train-v1.1.json"
dataset_url = f"https://rajpurkar.github.io/SQuAD-explorer/dataset/{filename}"

!mkdir -p {doc_dir}
!wget {dataset_url} -P {doc_dir}

--2023-02-20 17:55:35--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [application/json]
Saving to: ‘data/SQuAD/train-v1.1.json’


2023-02-20 17:55:36 (133 MB/s) - ‘data/SQuAD/train-v1.1.json’ saved [30288272/30288272]



In [7]:
# Read the SQuAD dataset
import mmh3
import json
import pandas as pd

def read_squad_format(data):
    '''Process SQuAD dataset format'''
    flat = []
    for document in data:
        title = document.get("title", "")
        for paragraph in document["paragraphs"]:
            context = paragraph["context"]
            document_id = paragraph.get("document_id", "{:02x}".format(mmh3.hash128(str(context), signed=False)))
            for question in paragraph["qas"]:
                q = question["question"]
                question_id = question["id"]  # avoid shadowing the built-in id()
                for answer in question["answers"]:
                    flat.append({
                        "title": title,
                        "context": context,
                        "question": q,
                        "id": question_id,
                        "answer_text": answer["text"],
                        "document_id": document_id,
                    })
    return pd.DataFrame.from_records(flat)

# Load the JSON file and flatten it into one row per answer
with open(f"{doc_dir}/{filename}") as f:
    data = json.load(f)
squad = read_squad_format(data["data"])
squad

index,title,context,question,id,answer_text,document_id
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?,5733be284776f41900661182,Saint Bernadette Soubirous,4a46195c99b673b0cb59b083fe7a95e9
1,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",What is in front of the Notre Dame Main Building?,5733be284776f4190066117f,a copper statue of Christ,4a46195c99b673b0cb59b083fe7a95e9
2,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",The Basilica of the Sacred heart at Notre Dame is beside to which structure?,5733be284776f41900661180,the Main Building,4a46195c99b673b0cb59b083fe7a95e9
3,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",What is the Grotto at Notre Dame?,5733be284776f41900661181,a Marian place of prayer and reflection,4a46195c99b673b0cb59b083fe7a95e9
4,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",What sits on top of the Main Building at Notre Dame?,5733be284776f4190066117e,a golden statue of the Virgin Mary,4a46195c99b673b0cb59b083fe7a95e9
...,...,...,...,...,...,...
87594,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to promote international relatio...",In what US state did Kathmandu first establish an international relationship?,5735d259012e2f140011a09d,Oregon,eaebf79a40f3f4ed142ccedd04f76fce
87595,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to promote international relatio...",What was Yangon previously known as?,5735d259012e2f140011a09e,Rangoon,eaebf79a40f3f4ed142ccedd04f76fce
87596,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to promote international relatio...",With what Belorussian city does Kathmandu have a relationship?,5735d259012e2f140011a09f,Minsk,eaebf79a40f3f4ed142ccedd04f76fce
87597,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to promote international relatio...",In what year did Kathmandu create its initial international relationship?,5735d259012e2f140011a0a0,1975,eaebf79a40f3f4ed142ccedd04f76fce


### Documents Preprocessing

In this tutorial, we apply a basic cleaning function to the documents and index them in Elasticsearch:
 - cleaning the texts; and
 - writing them to the Document Store.
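
Both steps can be illustrated without Haystack or Elasticsearch. The sketch below uses a simplified line cleaner as a stand-in for `clean_wiki_text` (it is not the Haystack implementation) and a plain dictionary as a stand-in for the Document Store. The function `simple_clean` and the sample rows are hypothetical.

```python
def simple_clean(text):
    """Simplified stand-in for a cleaning step: strip each line and drop empty ones."""
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


# Made-up rows: the same paragraph appears once per question asked about it
rows = [
    {"document_id": "a1", "title": "Example", "context": "  First paragraph. \n\n"},
    {"document_id": "a1", "title": "Example", "context": "  First paragraph. \n\n"},
    {"document_id": "b2", "title": "Example", "context": "Second paragraph."},
]

store = {}
for row in rows:
    # Keying by document_id keeps one entry per unique paragraph
    if row["document_id"] not in store:
        store[row["document_id"]] = {
            "content": simple_clean(row["context"]),
            "meta": {"title": row["title"]},
        }

print(len(store), repr(store["a1"]["content"]))
```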


In [8]:
from haystack.schema import Document
from haystack.utils import clean_wiki_text

# Get unique dataset documents
unique_docs = squad[["title", "context", "document_id"]].drop_duplicates()
display(unique_docs)
list_docs = []

# Create Haystack Document objects
for _, row in unique_docs.iterrows():
    content = clean_wiki_text(row["context"])
    content_type = "text"
    meta = {"title": row["title"]}
    doc = Document(content=content, content_type=content_type,
                   id=row["document_id"], meta=meta)
    list_docs.append(doc)

index,title,context,document_id
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building...",4a46195c99b673b0cb59b083fe7a95e9
5,University_of_Notre_Dame,"As at most other universities, Notre Dame's students run a number of news me...",1ddb47d56c935012234b3e62a1294a8d
10,University_of_Notre_Dame,The university is the major seat of the Congregation of Holy Cross (albeit n...,34ee1f4d577d600af1fef4007a99f352
15,University_of_Notre_Dame,"The College of Engineering was established in 1920, however, early courses i...",82f722702349c8925f5650d12b6a520c
20,University_of_Notre_Dame,All of Notre Dame's undergraduate students are a part of one of the five und...,519d6984c7dfa8e90ba6a9fe30993f88
...,...,...,...
87574,Kathmandu,"Institute of Medicine, the central college of Tribhuwan University is the fi...",2818bd8ff6af5594b6162eea0baa863d
87579,Kathmandu,Football and Cricket are the most popular sports among the younger generatio...,8016ee5feb81e59ac63926ef66a989ed
87584,Kathmandu,"The total length of roads in Nepal is recorded to be (17,182 km (10,676 mi))...",faed62811921b9bafaa7f0e8d5618874
87589,Kathmandu,The main international airport serving Kathmandu and thus Nepal is the Tribh...,59cfd040e2fe242468391bd69ad532c6


In [9]:
# Now, write the documents to Elasticsearch
document_store.write_documents(list_docs)

---
## Question Answering Pipeline

### Retriever

The Retriever selects the `k` most relevant documents (or passages) for a given query.  
The Reader then consumes these documents to generate an answer.

* We use Elasticsearch's default BM25 algorithm
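
To give an intuition for how BM25 ranks documents, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Elasticsearch's implementation: tokenization is a simple whitespace split, the function name `bm25_scores` and the three sample documents are made up, and only the defaults `k1=1.2` and `b=0.75` mirror Elasticsearch's documented defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "antibiotics treat bacterial infections",
    "the pharmacy sells antibiotics and other drugs",
    "geology is the study of rocks",
]
scores = bm25_scores("antibiotics infections", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

The first document matches both query terms and is the shortest, so it ranks first. The third shares no term with the query and scores zero.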

In [10]:
# Instantiate the Retriever
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

### Reader

A Reader scans the documents returned by the Retriever and extracts the `n` most likely answers.  
Readers are usually based on powerful, but slower, deep learning models.

* We use a RoBERTa-based deep learning model fine-tuned on SQuAD 2.0
* https://huggingface.co/deepset/roberta-base-squad2
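
Extractive readers of this kind typically predict, for every token, a score for being the start of the answer and a score for being its end. The answer is then the span with the best combined score. Here is a minimal sketch of that selection step with made-up tokens and scores. The function name `best_spans` is hypothetical, and this is not FARMReader's actual code.

```python
def best_spans(start_scores, end_scores, top_n=3, max_len=5):
    """Return the top_n (score, start, end) spans with start <= end and length <= max_len."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        last = min(start + max_len, len(end_scores))
        for end in range(start, last):
            candidates.append((start_score + end_scores[end], start, end))
    candidates.sort(reverse=True)
    return candidates[:top_n]


tokens = ["demonstrated", "in", "1943", "by", "the", "experiment"]
start_scores = [0.1, 0.2, 4.0, 0.1, 0.3, 0.2]  # made-up numbers
end_scores = [0.1, 0.1, 3.5, 0.2, 0.1, 1.0]    # made-up numbers

top = best_spans(start_scores, end_scores)
score, start, end = top[0]
answer = " ".join(tokens[start:end + 1])
print(answer, score)
```

Returning several candidates instead of only the best one corresponds to asking the Reader for its top `n` answers.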

In [11]:
# Load the Reader model from HuggingFace's hub
import torch
from haystack.nodes import FARMReader

use_gpu = torch.cuda.is_available()
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=use_gpu)

INFO:haystack.modeling.utils:Using devices: CPU
INFO:haystack.modeling.utils:Number of GPUs: 0


INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


INFO:haystack.modeling.utils:Using devices: CPU
INFO:haystack.modeling.utils:Number of GPUs: 0
INFO:haystack.modeling.infer:Got ya 2 parallel workers to do inference ...
INFO:haystack.modeling.infer: 0     0  
INFO:haystack.modeling.infer:/w\   /w\ 
INFO:haystack.modeling.infer:/'\   / \ 


### Pipeline

Let's combine the Retriever and the Reader in one Pipeline.
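
Conceptually, the pipeline first narrows the corpus down to a few candidate documents and then lets the reader extract answers only from those. The toy sketch below shows this two-stage flow in plain Python. The functions `run_qa`, `toy_retrieve`, and `toy_read` and the two-sentence corpus are all made up, and the scoring is deliberately naive. It illustrates the control flow, not Haystack's `ExtractiveQAPipeline`.

```python
corpus = ["The experiment took place in 1943.", "Rocks are studied in geology."]


def toy_retrieve(query):
    """Rank documents by the number of words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().rstrip(".").split()))

    return sorted(corpus, key=overlap, reverse=True)


def toy_read(query, doc):
    """Return every numeric token as a candidate answer with a fixed made-up score."""
    return [{"answer": token.strip("."), "score": 1.0, "document": doc}
            for token in doc.split() if token.strip(".").isdigit()]


def run_qa(query, retrieve, read, retriever_top_k=20, reader_top_k=3):
    """Retrieve candidate documents, then let the reader extract answers from them."""
    documents = retrieve(query)[:retriever_top_k]
    answers = []
    for doc in documents:
        answers.extend(read(query, doc))
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:reader_top_k]


result = run_qa("when was the experiment", toy_retrieve, toy_read, retriever_top_k=1)
print(result)
```

The two `top_k` arguments play the same role as the `Retriever` and `Reader` `top_k` values passed via `params` when running the real pipeline.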

In [12]:
# Creating the QA pipeline
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

pipe = ExtractiveQAPipeline(reader, retriever)

---
## Asking questions!

### Question 1 - Pharmaceutical Industry

Querying a named entity from the documents.

In [13]:
# Check a Question & Answer
# squad[squad.title == "Pharmaceutical_industry"]
squad.iloc[36840]

title                                                                  Pharmaceutical_industry
context        Advertising is common in healthcare journals as well as through more mainstr...
question                                          What law regulates drug marketing in the US?
id                                                                    571d2f3bdd7acb1400e4c251
answer_text                                            Prescription Drug Marketing Act of 1987
document_id                                                   eb53f77b7f46947136c15779c07105fe
Name: 36840, dtype: object

In [14]:
# Question
question = "What law regulates drug marketing in the pharmaceutical industry?"
prediction = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 3}})


In [15]:
# Answers
print_answers(prediction, details="all")


Query: What law regulates drug marketing in the pharmaceutical industry?
Answers:
[   <Answer {'answer': 'Prescription Drug Marketing Act of 1987', 'type': 'extractive', 'score': 0.7634869813919067, 'context': 'loy lobbyists to influence politicians. Marketing of prescription drugs in the US is regulated by the federal Prescription Drug Marketing Act of 1987.', 'offsets_in_document': [{'start': 563, 'end': 602}], 'offsets_in_context': [{'start': 110, 'end': 149}], 'document_id': 'eb53f77b7f46947136c15779c07105fe', 'meta': {'__pydantic_initialised__': True, 'title': 'Pharmaceutical_industry'}}>,
    <Answer {'answer': 'Food and Drug Administration (FDA)', 'type': 'extractive', 'score': 0.19770637154579163, 'context': 'ates, new pharmaceutical products must be approved by the Food and Drug Administration (FDA) as being both safe and effective. This process generally ', 'offsets_in_document': [{'start': 74, 'end': 108}], 'offsets_in_context': [{'start': 58, 'end': 92}], 'document_id': '6
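
The `offsets_in_context` field gives character positions within the `context` snippet, with `end` being exclusive. A quick check using the values copied from the first answer printed above:

```python
# Values copied from the first answer in the output above
context = (
    "loy lobbyists to influence politicians. Marketing of prescription drugs "
    "in the US is regulated by the federal Prescription Drug Marketing Act of 1987."
)
offsets_in_context = {"start": 110, "end": 149}

# Slicing the context with the offsets recovers the answer string
span = context[offsets_in_context["start"]:offsets_in_context["end"]]
print(span)
```

The same slicing applies to `offsets_in_document`, which indexes into the full document text rather than the snippet.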

### Question 2 - Antibiotics

Querying a date from documents.

In [16]:
# Check a Question & Answer
# squad[squad.title == "Antibiotics"]
squad.iloc[1316]

title                                                                              Antibiotics
context        The emergence of resistance of bacteria to antibiotics is a common phenomeno...
question                                               When was the Luria-Delbruck experiment?
id                                                                    5733bc38d058e614000b6189
answer_text                                                                               1943
document_id                                                   ed3987cf71b8b06f3ba9270e76102aa4
Name: 1316, dtype: object

In [17]:
# Question
question = "When was the Luria-Delbruck?"
prediction = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 3}})


In [18]:
# Answers
print_answers(prediction, details="all")


Query: When was the Luria-Delbruck?
Answers:
[   <Answer {'answer': '1943', 'type': 'extractive', 'score': 0.29898403584957123, 'context': 'g previously acquired antibacterial-resistance genes was demonstrated in 1943 by the Luria–Delbrück experiment. Antibiotics such as penicillin and ery', 'offsets_in_document': [{'start': 610, 'end': 614}], 'offsets_in_context': [{'start': 73, 'end': 77}], 'document_id': 'ed3987cf71b8b06f3ba9270e76102aa4', 'meta': {'__pydantic_initialised__': True, 'title': 'Antibiotics'}}>,
    <Answer {'answer': '14', 'type': 'extractive', 'score': 0.06844992004334927, 'context': 'n 1960, when his soccer coach took his team to a local gym. At the age of 14, he chose bodybuilding over soccer as a career. Schwarzenegger has respon', 'offsets_in_document': [{'start': 188, 'end': 190}], 'offsets_in_context': [{'start': 74, 'end': 76}], 'document_id': '3b4dc15200c58f184c2348edafbe45d3', 'meta': {'__pydantic_initialised__': True, 'title': 'Arnold_Schwarzenegger'}}>,
  

---
## Document Augmentation

Let's add FAQ (Frequently Asked Questions) data about COVID-19.

### Adding more documents

In [19]:
# Download FAQ about COVID
from haystack.utils import fetch_archive_from_http

doc_dir = "data/covid_faq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip to `data/covid_faq`


True

In [20]:
# Open CSV file
faq = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")

# Data preprocessing
faq.fillna(value="", inplace=True)
faq["question"] = faq["question"].apply(lambda x: x.strip())
faq = faq[["question", "answer", "source", "category", "country", "last_update"]]
display(faq)

index,question,answer,source,category,country,last_update
0,What is a novel coronavirus?,A novel coronavirus is a new coronavirus that has not been previously identi...,Center for Disease Control and Prevention (CDC),Coronavirus Disease 2019 Basics,USA,2020/03/17
1,"Why is the disease being called coronavirus disease 2019, COVID-19?","On February 11, 2020 the World Health Organization announced an official nam...",Center for Disease Control and Prevention (CDC),Coronavirus Disease 2019 Basics,USA,2020/03/17
2,Why might someone blame or avoid individuals and groups (create stigma) beca...,People in the U.S. may be worried or anxious about friends and relatives who...,Center for Disease Control and Prevention (CDC),Coronavirus Disease 2019 Basics,USA,2020/03/17
3,How can people help stop stigma related to COVID-19?,"People can fight stigma and help, not hurt, others by providing social suppo...",Center for Disease Control and Prevention (CDC),How It Spreads,USA,2020/03/17
4,What is the source of the virus?,"Coronaviruses are a large family of viruses. Some cause illness in people, a...",Center for Disease Control and Prevention (CDC),How It Spreads,USA,2020/03/17
...,...,...,...,...,...,...
208,Is water a possible source of infection in the transmission of SARS-CoV-2?,SARS-CoV-2 is similar to other coronaviruses for which water does not consti...,Bundesministerium für Gesundheit,,Germany,2020/03/18
209,Where can doctors and clinics obtain additional information?,The Robert Koch Institute posts information for professionals (in German) on...,Bundesministerium für Gesundheit,,Germany,2020/03/18
210,When was the first information about the outbreak received?,"On 31 December 2019, China’s WHO country office was informed of a cluster of...",Bundesministerium für Gesundheit,,Germany,2020/03/18
211,Where did the outbreak start?,"According to information from the Chinese authorities in Wuhan, some patient...",Bundesministerium für Gesundheit,,Germany,2020/03/18


In [21]:
# Create Document objects
list_docs = []
for _, row in faq.iterrows():
    content = f"{row['question']} {row['answer']}"
    content = clean_wiki_text(content)
    meta = {"title": "FAQ_covid", "country": row["country"],
            "source": row["source"], "category": row["category"]}
    # Set content_type explicitly instead of relying on a variable left over from an earlier cell
    doc = Document(content=content, content_type="text",
                   id=mmh3.hash128(str(content), signed=False), meta=meta)
    list_docs.append(doc)

In [22]:
# Add the documents to Elasticsearch
document_store.write_documents(list_docs)

### Question 3 - COVID

Querying about the new COVID documents.

In [23]:
# Check a Question & Answer
# faq[faq.question.str.contains("SARS-CoV-2")]
faq_item = faq.iloc[188]
print("Question:", faq_item["question"])
print("Answer:", faq_item["answer"])

Question: What do SARS-CoV-2 and Covid-19 stand for?
Answer: On 11 February, the novel coronavirus that had provisionally been known as 2019-nCoV, was given a new name: SARS-CoV-2. The acronym SARS stands for severe acute respiratory syndrome. The name denotes its close relationship to the SARS coronavirus that caused an epidemic in 2002/2003.

The respiratory disease that can be caused by SARS-CoV-2 has also been given a new name. It is now called Covid-19 (Corona Virus Disease 2019).


In [24]:
# Question
question = "What is the novel coronavirus?"
prediction = pipe.run(query=question, params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 3}})


In [25]:
# Answers
print_answers(prediction, details="all")


Query: What is the novel coronavirus?
Answers:
[   <Answer {'answer': 'SARS-CoV-2', 'type': 'extractive', 'score': 0.8770715892314911, 'context': ' put in place to protect the German public from the novel coronavirus SARS-CoV-2? The Robert Koch Institute has been granted wider powers in coordinat', 'offsets_in_document': [{'start': 103, 'end': 113}], 'offsets_in_context': [{'start': 70, 'end': 80}], 'document_id': '277577168658522816254511899994281499324', 'meta': {'__pydantic_initialised__': True, 'title': 'FAQ_covid', 'country': 'Germany', 'source': 'Bundesministerium für Gesundheit', 'category': ''}}>,
    <Answer {'answer': 'Prevention for 2019', 'type': 'extractive', 'score': 0.7678199410438538, 'context': 'r or alcohol-based hand sanitizer\nYou can find additional information on preventing COVID-19 disease at CDC’s (Prevention for 2019 Novel Coronavirus).', 'offsets_in_document': [{'start': 423, 'end': 442}], 'offsets_in_context': [{'start': 111, 'end': 130}], 'document_id': '33