# Implementation of Open Domain Q.A. system for Covid19

In this assignment you will build an application that performs __information retrieval__ and __question answering__, the two core tasks that make up the **Open Domain Q.A.** task.

*   __Information Retrieval__: retrieve documents whose information is relevant to the user’s information need and helps the user complete a task.

* __Question Answering__: automatically answer questions posed by humans in natural language.

The steps are the following:
- Preprocess the data and init models for I.R.
- Preprocess the data and init models for Q.A.
- Implement the functions `retrieve_documents` and `answer_question`, which take a query as input.
- Run some analysis of the model output.

Note that you are not required to strictly follow the steps suggested in the notebook, as long as you build a model and provide some analysis.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/assignments

/content/drive/MyDrive/LAP/Subjects/AP2/assignments


## 1. Install packages and load libraries

In this section we will install all the packages and load all the libraries needed to run the code below.

In [None]:
!pip install Whoosh # search engine library
!pip install transformers

Collecting Whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)

In [None]:
import codecs # base classes for standard Python codecs, like text encodings (UTF-8,...)
import glob
import os # used below to create the index directory
import random

import torch
from IPython.display import display, HTML # object displaying in different formats
from whoosh import analysis, qparser
from whoosh.index import * # whoosh: full-text indexing and searching
from whoosh.fields import *

## Information Retrieval

Based on https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8
(deepest appreciation to Jon Ander Campos and Arantxa Otegi, winners of the task 'What do we know about diagnostics and surveillance?' in the COVID-19 Open Research Dataset Challenge (CORD-19))

The goal of this lab is to build an I.R. system that retrieves the most relevant documents given a query related to Covid19.

We only use the freely available [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research), which contains metadata of over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS).

The main component of the system is an Information Retrieval (IR) system based on the classical BM25F search algorithm. This system indexes the abstracts and the full-text paragraphs of the papers.

### Load info from data file

CORD19-dataset includes research papers related to coronavirus and COVID-19. In this section we first load the info. As we are not interested in all the metadata info from papers, we will select just text information, such as title, abstract and body text (already done for you).

CORD-19.v7 includes info of 51,078 papers, but some of them are repeated (they have the same *cord_uid*). Thus, we already filter out the repeated ones. 

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we want to filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS). For that purpose, we created a list of synonyms of COVID-19 and we check if a synonym appears in the title or the abstract of a paper. 

List of synonyms used for filtering:

    'coronavirus 2019',
    'coronavirus disease 19',
    'cov2',
    'cov-2',
    'covid',
    'ncov 2019',
    '2019ncov',
    '2019-ncov',
    '2019 ncov',
    'novel coronavirus',
    'sarscov2',
    'sars-cov-2',
    'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus',
    'wuhan pneumonia',
    'wuhan virus'

In that way, we filter out those papers that do not include any of the synonyms. From now on, we will consider only the papers that we keep after filtering.
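The filtering itself has already been done for you, but it can be sketched in a few lines. The snippet below is only an illustration: it assumes a case-insensitive substring match, and the function name `mentions_covid` and the dictionary keys `title` and `abstract` are hypothetical, not taken from the original preprocessing code.

```python
# Synonyms of COVID-19 used for filtering (the list given above).
COVID_SYNONYMS = [
    'coronavirus 2019', 'coronavirus disease 19', 'cov2', 'cov-2', 'covid',
    'ncov 2019', '2019ncov', '2019-ncov', '2019 ncov', 'novel coronavirus',
    'sarscov2', 'sars-cov-2', 'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus', 'wuhan pneumonia', 'wuhan virus',
]

def mentions_covid(title, abstract, synonyms=COVID_SYNONYMS):
    """Return True if any synonym occurs in the title or abstract.

    Assumption: matching is a case-insensitive substring test.
    """
    haystack = f"{title or ''} {abstract or ''}".lower()
    return any(synonym in haystack for synonym in synonyms)

# Two made-up papers: only the first one mentions a COVID-19 synonym.
papers = [
    {"title": "Clinical features of SARS-CoV-2 infection", "abstract": ""},
    {"title": "MERS outbreak in hospitals", "abstract": "A study of MERS-CoV."},
]
kept = [p for p in papers if mentions_covid(p["title"], p["abstract"])]
print(len(kept))  # 1
```

A substring test is deliberately permissive: 'covid' also matches 'covid-19' and 'covid19', which is probably why the list contains such short entries.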

This is the number of passages that remain after filtering:

In [None]:
def read_passages(path):
    count = 0
    passages = []
    with open(path) as f:
        for line in f:
            count += 1
            passages.append(line)
    return passages, count

In [None]:
path='../data/passages'
passages, count = read_passages(path)
print("Number of passages related to 'COVID-19':", count)
print()
print("3 Random passages:")
print()
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])

Number of passages related to 'COVID-19': 308665

3 Random passages:

Background As the coronavirus (COVID-19) pandemic spreads globally, hospitals are rushing to adapt their facilities which may not have been designed to deal with infections adequately. We present the management of a patient with suspected COVID-19 pneumonia. Case A 66-years-old man presented to the hospital and his recent travel history, infective symptoms and CXR made him a possible COVID-19 suspect. Emergency surgery was decided considering the septic condition. The patient was transported to operating theatre with supplemental oxygen over a face mask and plastic covering over the trolley. Rapid sequence intubation was performed by an experienced anesthetist using a videolaryngoscope. After surgery, the patient remained intubated to avoid re-intubation due to initial presentation of respiratory distress. Droplet, contact and airborne infection precautions were instituted. Conclusions Our objective was to facilitate

### 3. Create an index for the paper retrieval system

The first system that we are going to develop is the information retrieval system. An information retrieval system is a tool that searches a collection of documents for those that are relevant to an information need. This system has two main modules: the indexing system and the query system.

The first module is in charge of creating the primary data structure of the system, which is the index. The second module is the one users interact with: they submit a query based on their information need, and the module uses the index to retrieve documents for that query. In this section we will create an index, and in the next section we will develop the query system. For the implementation of these modules, we will use the [Whoosh library](https://pypi.org/project/Whoosh/), which contains functions for indexing text and then searching the index.

The index is a data structure that makes it possible to search for information in a document collection in a very efficient way. In short, it lists, for every word, all documents that contain it.
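To make the idea concrete, here is a minimal sketch of such an inverted index in plain Python. It is not how Whoosh stores its index internally; the function name `build_inverted_index` and the whitespace tokenization are simplifying assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = [
    "incubation period of the virus",
    "the virus spreads fast",
    "incubation lasts days",
]
toy_index = build_inverted_index(docs)
print(toy_index["virus"])       # [0, 1]
print(toy_index["incubation"])  # [0, 2]
```

Answering a query then only requires looking up the query words, instead of scanning every document.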

In order to create an index, we must first define its schema, which lists the fields in the index. A field is a piece of information for each document in the index, for example, id, path of the document, title and text. In our schema the text field is of type “TEXT”, which means that it will be searchable. As is common practice, we also apply the Stemming Analyzer to this text field. With this analyzer, the text is tokenized, the tokens are converted to lowercase, a stopword filter removes words that are too common, and finally a stemming algorithm is applied.
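The four stages of such an analyzer can be sketched in plain Python. This is a rough illustration and not the Whoosh implementation: the stopword list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Assumption: a tiny illustrative stopword list, much shorter than a real one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}
# Assumption: naive suffix stripping instead of a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                   # 1. tokenize
    tokens = [t.lower() for t in tokens]                # 2. lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3. remove stopwords
    return [naive_stem(t) for t in tokens]              # 4. stem

print(analyze("The viruses are spreading in hospitals"))
# ['virus', 'spread', 'hospital']
```

Because the same analysis is applied to the query at search time, a query containing "spread" can match a passage containing "spreading".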

In [None]:
# Schema definition:
# - id: type ID, unique, stored; doc id in order given the passages file
# - text: type TEXT processed by StemmingAnalyzer; not stored; content of the passage
schema = Schema(id = ID(stored=True,unique=True),
                text = TEXT(analyzer=analysis.StemmingAnalyzer())
               )

Once we have the schema, we can create an index.


In [None]:
# Create an index
if not os.path.exists("index"):
    os.mkdir("index")

ix = create_in("index", schema)
writer = ix.writer() #run once! or restart runtime

Next, we will add documents to the index. We will index the papers related to COVID-19, not only the abstracts that are in the metadata file, but also the full text provided in PMC or PDF JSON format. As having shorter documents is better for the answering system that we will develop later, we will not index the whole text in a paper together. Instead, the indexing unit will be an abstract or each of the paragraphs of the full text (as marked in JSON files).

This could take several minutes.

In [None]:
# Add the passages to the index; the position of each passage in the list is used as its id
for ind,passage_text in enumerate(passages): 
    writer.add_document(id=str(ind),text=passage_text)

Finally, we will save the added documents to the index.

In [None]:
# Save the added documents
writer.commit()
print("Index successfully created")

# Sanity check
print("Number of documents (abstracts and paragraphs of papers) in the index: ", ix.doc_count())

Index successfully created
Number of documents (abstracts and paragraphs of papers) in the index:  308665


### Define a function to query the index and retrieve relevant papers

In this section we will define a function that takes a question and a maximum number of documents as input, and uses the question as a query to retrieve relevant passages from the index created in the previous section.

In this function we set the algorithm used for scoring (we will be using the default BM25F algorithm), and we also set the query parser to use, defining the default field to search (in our case the '*text*' field). Then, we run the query and get the most relevant documents in the index (at most *topk* documents).

The output of the function is a list of (at most *topk*) texts and a list of their scores.
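To give an intuition for what the scoring algorithm rewards, here is a minimal sketch of plain BM25 on a single text field. It is not Whoosh's code: the function name `bm25_scores`, the whitespace tokenization and the exact IDF variant are assumptions for illustration, and `k1=1.2`, `b=0.75` are simply commonly used values.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with plain BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length with b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "incubation period is five days",
    "masks reduce transmission",
    "incubation incubation period",
]
scores = bm25_scores("incubation period", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)  # index of the highest-scoring document, then all scores
```

In this toy collection the third document wins: it is short and repeats a query term, while the second document shares no term with the query and scores zero.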

In [None]:
# Input: question and maximum number of documents to retrieve
# Output: the retrieved passage texts and their scores, best match first
def retrieve_documents(query, topk):
    scores=[]
    text=[]
    # Define the query parser ('text' will be the default field to search), and set the input query.
    # OrGroup means that a passage matches if it contains any of the query terms.
    q = qparser.QueryParser("text", ix.schema, group=qparser.OrGroup).parse(query)

    # Open the searcher for reading the index. The default BM25F algorithm will be used for scoring
    with ix.searcher() as searcher:
        # Search using the query q, and get the topk documents, sorted with the highest-scoring documents first
        results = searcher.search(q, limit=topk)

        # Iterate over the retrieved hits while the searcher is still open;
        # each hit exposes its score and its stored fields (here only 'id')
        for hit in results:
            scores.append(hit.score)
            text.append(passages[int(hit['id'])])
    return text, scores

Retrieve the 3 most relevant documents and their scores for the query "How long individuals are contagious?":

In [None]:
retrieve_documents("How long individuals are contagious?", 3)

(['The model gives what fractions are susceptible, exposed, infectious and immune, respectively, for short and long-term M -individuals as well as P -individuals over time. The main focus is to study how many people in the province become infected as a result of the M -individuals visiting the province.\n',
  "In a time of such extreme uncertainty, making economic decisions becomes challenging because pandemics are rare. The most recent comparable episode is the Spanish flu of 1918 (Trilla et al., 2008) , so pandemics are likely to occur at most once during one's lifetime. Nevertheless, individuals need to make everyday decisions such as how to manage inventories of staples, how much to consume and save, when to buy or sell stocks, etc., and these decisions depend on the expectation of how long and severe the epidemic is. Governments must also make decisions such as to what extent imposing travel restrictions, social distancing, closure of schools and businesses, etc., and for how long

## Question Answering

What does it mean for BERT to achieve "human-level performance on Question Answering"? Is BERT the greatest search engine ever, able to find the answer to any question we pose it?

For something like text classification, you definitely want to fine-tune BERT on your own dataset. For question answering, however, it seems like you may be able to get decent results using a model that's already been fine-tuned on the SQuAD benchmark. In this Notebook, we'll do exactly that, and see that it performs well on text that wasn't in the SQuAD dataset.

### The SQuAD v1.1 Benchmark

When someone mentions "Question Answering" as an application of BERT, what they are really referring to is applying BERT to the Stanford Question Answering Dataset (SQuAD).

The task posed by the SQuAD benchmark is a little different than you might think. Given a question, and *a passage of text containing the answer*, BERT needs to highlight the "span" of text corresponding to the correct answer. 

The SQuAD homepage has a fantastic tool for exploring the questions and reference text for this dataset, and even shows the predictions made by top-performing models.

For example, here are some [interesting examples](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Super_Bowl_50.html?model=r-net+%20(ensemble)%20(Microsoft%20Research%20Asia)&version=1.1) on the topic of Super Bowl 50.


### BERT Input Format

To feed a QA task into BERT, we pack both the question and the reference text into the input.



[![Input format for QA](https://drive.google.com/uc?export=view&id=1dfgTaE_SABpr2blqwTjq9PTyhYabO8_m)](https://drive.google.com/uc?export=view&id=1dfgTaE_SABpr2blqwTjq9PTyhYabO8_m)



The two pieces of text are separated by the special `[SEP]` token. 

BERT also uses "Segment Embeddings" to differentiate the question from the reference text. These are simply two embeddings (for segments "A" and "B") that BERT learned, and which it adds to the token embeddings before feeding them into the input layer. 
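The packing described above can be sketched without the real tokenizer. In the snippet below the function name `pack_qa_input` is hypothetical and the inputs are already split into tokens, whereas the actual BERT tokenizer also splits words into WordPiece subwords and maps them to ids.

```python
def pack_qa_input(question_tokens, context_tokens):
    """Build the token sequence and segment ids for a question/context pair."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment A covers [CLS], the question and the first [SEP]; segment B covers the rest.
    n_seg_a = len(question_tokens) + 2
    segment_ids = [0] * n_seg_a + [1] * (len(tokens) - n_seg_a)
    return tokens, segment_ids

tokens, segment_ids = pack_qa_input(["how", "big", "?"], ["it", "is", "big"])
print(tokens)       # ['[CLS]', 'how', 'big', '?', '[SEP]', 'it', 'is', 'big', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```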

### Start & End Token Classifiers

BERT needs to highlight a "span" of text containing the answer--this is represented as simply predicting which token marks the start of the answer, and which token marks the end.

![Start token classification](http://www.mccormickml.com/assets/BERT/SQuAD/start_token_classification.png)

For every token in the text, we feed its final embedding into the start token classifier. The start token classifier only has a single set of weights (represented by the blue "start" rectangle in the above illustration) which it applies to every word.

After taking the dot product between the output embeddings and the 'start' weights, we apply the softmax activation to produce a probability distribution over all of the words. Whichever word has the highest probability of being the start token is the one that we pick.

We repeat this process for the end token, for which we have a separate weight vector.

![End token classification](http://www.mccormickml.com/assets/BERT/SQuAD/end_token_classification.png)
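The two classifiers can be sketched with toy numbers. The snippet below is a simplified illustration, not the model's code: the 2-dimensional "embeddings" and the weight vectors are made up, and `predict_span` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_span(token_embeddings, start_weights, end_weights):
    """Pick the most probable start token and end token independently."""
    start_probs = softmax([dot(e, start_weights) for e in token_embeddings])
    end_probs = softmax([dot(e, end_weights) for e in token_embeddings])
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(len(end_probs)), key=end_probs.__getitem__)
    return start, end, start_probs, end_probs

# Made-up final embeddings for four tokens, and one weight vector per classifier.
embeddings = [[0.1, 0.0], [2.0, 0.1], [0.3, 0.2], [0.0, 3.0]]
start, end, start_probs, end_probs = predict_span(embeddings, [1.0, 0.0], [0.0, 1.0])
print(start, end)  # 1 3
```

Since the start and the end are chosen independently, nothing in this scheme prevents the end from landing before the start; code that reconstructs the answer has to cope with that case.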

### Load Fine-Tuned BERT-large

For Question Answering we use the `BertForQuestionAnswering` class from the `transformers` library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The `transformers` library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation [here](https://huggingface.co/transformers/pretrained_models.html).

For Question Answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark. 

BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance. 


In [None]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Load the tokenizer as well. 


In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


### Answer Questions


In [None]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Returns the answer string and its score (the sum of the best
    start and end logits).
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Report how long the input sequence is.
    # print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example through the model. Gradients are not needed for inference.
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                        token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                        return_dict=True)

    # Keep only the scores of the answer_text tokens (segment B without the final [SEP]).
    start_scores = outputs.start_logits[0][(num_seg_a):len(input_ids)-1]
    end_scores = outputs.end_logits[0][(num_seg_a):len(input_ids)-1]

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores,
    # shifting the indices back to positions in the full input.
    answer_start = int(torch.argmax(start_scores)) + num_seg_a
    answer_end = int(torch.argmax(end_scores)) + num_seg_a
    
    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    s_scores = start_scores.detach().numpy().flatten()
    e_scores = end_scores.detach().numpy().flatten()

    score = max(s_scores) + max(e_scores)

    return answer, score

In [None]:
question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

In [None]:
answer_question(question, answer_text)

('340m', 11.890032)

## Analysis of the output
It is important to analyze the output of our model in order to gain some insight into its strengths and weaknesses. Use some questions from the lab 8.IR_for_covid19 to test your model.

In addition, each function should return a score. This will allow us to better understand how confident the model is when returning both documents and answers. By default, the scoring should pick the best retrieved document and the best answer in that retrieved document.

### TO-DO 
- Build the Open Domain Q.A. system.
- Show some correct and incorrect examples for IR & QA for Covid 19 and its scores.
### TO-DO (optional)
- Try to improve over the presented baseline scoring system, for example, returning the highest scored answer in the first 10 most relevant documents, feel free to explore any other scoring strategy.
- Apply a threshold based on the score to filter out unwanted results (low scored answers).
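
One possible way to combine both optional ideas is sketched below. It is only a starting point under stated assumptions: the function name `best_answer_over_documents` and the default `min_score=0.0` are hypothetical, and the retriever and reader are passed in as arguments (with stubs here) so that the snippet runs without the index or the BERT model.

```python
def best_answer_over_documents(query, retrieve, answer, topk=10, min_score=0.0):
    """Return (answer, answer_score, doc_score, doc) for the best answer, or None.

    `retrieve(query, topk)` must return (texts, scores) and `answer(query, text)`
    must return (answer, score), mirroring the two functions of this notebook.
    """
    docs, doc_scores = retrieve(query, topk)
    candidates = []
    for doc, doc_score in zip(docs, doc_scores):
        text, score = answer(query, doc)
        if score >= min_score:  # threshold: drop low-scored answers
            candidates.append((text, score, doc_score, doc))
    if not candidates:
        return None
    # Keep the answer with the highest answer score among the topk documents.
    return max(candidates, key=lambda c: c[1])

# Stub retriever and reader with made-up documents and scores.
def fake_retrieve(query, topk):
    return ["doc a", "doc b", "doc c"][:topk], [3.0, 2.0, 1.0][:topk]

def fake_answer(query, doc):
    return {"doc a": ("x", -1.0), "doc b": ("two to 14 days", 6.9), "doc c": ("y", 2.0)}[doc]

result = best_answer_over_documents("incubation?", fake_retrieve, fake_answer, min_score=1.0)
print(result)  # ('two to 14 days', 6.9, 2.0, 'doc b')
```

With the real functions, the call would be `best_answer_over_documents(query, retrieve_documents, answer_question)`. A suitable threshold has to be chosen by inspecting the scores of correct and incorrect answers.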

In [None]:
import pandas as pd

def best_answers(query, count):
    # Retrieve the `count` best passages for the query, extract one answer
    # from each passage, and display the results sorted by answer score.
    answers = []
    answer_scores = []
    docs, scores = retrieve_documents(query, count)
    for doc in docs:
        answer, score = answer_question(query, doc)
        answers.append(answer)
        answer_scores.append(score)
    df = pd.DataFrame({"Document": docs, "Document Score": scores, 
                       "Answer": answers, "Answer Score": answer_scores})
    df = df.sort_values(by=['Answer Score'], ascending=False)
    display(df)

In [None]:
best_answers('How long individuals are contagious?', 10)
# the retrieved documents are not good, they don't contain the answer to the question
# therefore, the correct answer to the question cannot be extracted
# answers have high scores because they could be answers to similar questions

Unnamed: 0,Document,Document Score,Answer,Answer Score
0,The model gives what fractions are susceptible...,17.929984,short and long - term,8.046329
1,"In a time of such extreme uncertainty, making ...",16.118159,once during one ' s lifetime,6.973122
9,The epidemiological parameter R-naught (R0) si...,14.829481,how long an infected person can infect others,6.337094
2,"2. At the current stage, it is still not clear...",15.852532,1,3.391026
6,We model the evolution of the number of indivi...,14.9033,duration in months,3.288077
7,COVID-19 is a global pandemic and serious thre...,14.877905,long term,2.952847
3,As it is currently unknown how long antibodies...,15.78049,previously infected,2.405083
8,Time is another factor that likely contributes...,14.861068,how long it takes after initial infection for ...,1.50426
5,How is the data/information stored and for how...,15.270143,how long,-3.30806
4,In settings where restrictions on the free mov...,15.55561,take too long,-4.072904


In [None]:
best_answers("Range of incubation periods for the disease in humans", 10)
# many retrieved documents contain answers to the question
# answers are extracted correctly in those cases
# answers are similar, which means that there is agreement between passages

Unnamed: 0,Document,Document Score,Answer,Answer Score
7,"The high risk of infection, ambiguous characte...",23.39197,between one day to two weeks,9.249015
3,Evidence that COVID-19 is distributed from hum...,24.265407,at least 14 days,7.837053
2,A virus incubates for some time after it enter...,24.48347,two to 14 days,6.982939
8,The incubation period is the period between ex...,23.232804,between 2 and 14 days,2.211344
1,Where b is the transmission rate and q is a pa...,24.795063,from 0 to 1,1.829202
0,Our current understanding of the incubation pe...,25.277069,our current understanding of the incubation pe...,1.105138
6,To the Editor-A large global outbreak of coron...,23.487681,6 . 4 days,1.079832
5,"In this study, we assumed that the incubation ...",23.655545,1 / ω p ) was the same as latent period,-1.716398
4,A noteworthy exception to the inverse relation...,23.870067,longer incubation period among human influenza...,-2.788375
9,Our analysis show that the spread of the disea...,23.215506,π,-6.929626


In [None]:
best_answers("Prevalence of asymptomatic shedding and transmission", 10)
# some retrieved documents are quite good, they are related to the question
# answers are extracted correctly, answers with the highest scores are quite good

Unnamed: 0,Document,Document Score,Answer,Answer Score
9,-Viral shedding has been demonstrated up to 63...,19.373313,it is likely that asymptomatic and presymptoma...,4.966821
5,"We report persistent shedding of SARS-CoV-2, b...",20.009836,transmission potential for asymptomatic or min...,4.495231
7,We analyzed 2 clusters of 12 patients in Vietn...,19.568091,one asymptomatic patient demonstrated virus sh...,4.128008
8,We analyzed 2 clusters of 12 patients in Vietn...,19.568091,one asymptomatic patient demonstrated virus sh...,4.128008
1,SARS-CoV-2 spread rapidly within months despit...,21.097671,viral shedding is highest before symptom onset,4.078474
0,Recently published reports suggested potential...,21.931981,transmission dynamics of asymptomatic individuals,3.206972
4,"Likewise, both symptomatic and asymptomatic pa...",20.301687,contagious,1.325464
2,"11, 12 However, other studies have questioned ...",21.064213,asymptomatic transmission accounted for 6 . 4 ...,0.710997
6,Asymptomatic carriers of other HCoVs including...,19.688799,uncommon,-0.791206
3,to evaluate the prevalence of asymptomatic inf...,20.353246,duration of sars - cov - 2 viral shedding,-1.236726


In [None]:
best_answers("Persistence of virus on surfaces of different materials", 10)
# some retrieved documents are related to the question
# only the first answer is an answer to the question
# other answers are too short or unrelated

Unnamed: 0,Document,Document Score,Answer,Answer Score
4,"Besides, the half-life, stability, and decay o...",23.093292,the virus seems to be more stable on plastic a...,2.199644
0,The ability of a virus to transfer between and...,29.465962,skin,-0.395259
2,Most studies on the stability of a virus on a ...,24.896343,persistence of coronaviruses on different type...,-0.425359
9,"Persistence of most bacteria, fungi, and virus...",20.794876,materials,-0.492468
6,Inanimate surfaces are the most prone site for...,22.443088,inanimate,-0.730319
3,Studies have shown that viruses adsorbed on su...,23.133594,persistence of sars - cov - 2 and other corona...,-1.51208
1,Severe acute respiratory syndrome coronavirus ...,27.136969,persistence time on inanimate surfaces varied ...,-1.630709
7,Human coronavirus strain 229E (HuCoV-229E) can...,22.429773,persisted,-2.071662
8,Although a number of studies have quantified t...,22.215731,porous materials,-2.347842
5,"A Medline search has been done on January 28, ...",22.91679,surfaces,-6.756984


In [None]:
best_answers("Immune response and immunity", 10)
# retrieved documents are not very good, but the question is very open
# few answers are related to the question
# some answers are too short, a longer answer is expected for this question

Unnamed: 0,Document,Document Score,Answer,Answer Score
2,Immunological studies have led to a partial un...,14.451851,the body ' s response to viral infections and ...,3.436003
1,In the pathogenesis of a standard viral infect...,14.467251,adaptive immunity,2.56083
4,Clearance time measures the earliest time unti...,14.415005,"immune response , clearance times show varying...",2.232779
3,"From the perspective of antiviral immunity, a ...",14.443661,antiviral immunity,1.217393
6,Inflammation is a 'side effect' of the immune ...,13.88367,inflammation is a ' side effect ' of the immun...,0.78523
7,"gene is known to function in immune cells, pla...",13.799215,"gene is known to function in immune cells , pl...",0.587603
8,To have a signature picture of immune response...,13.76716,immune responses,-1.122201
9,"Finally, from a prophylactic point of view, th...",13.757107,immune response to vaccination,-1.440553
5,"Taken together, this work provokes questions a...",13.939726,potential diversity of immune responses to sar...,-2.844353
0,Respiratory viral infections can cause patholo...,14.668925,immune,-6.139703


In [None]:
best_answers("Does smoking increase risk for COVID-19?", 10)
# retrieved documents are quite good and related to the question
# most answers are correctly extracted and answer the question

Unnamed: 0,Document,Document Score,Answer,Answer Score
0,The copyright holder for this preprint this ve...,22.69417,even if smoking does have a small protective e...,7.799596
5,All 19 studies were of patients who had alread...,20.474601,does not represent the effect of smoking on th...,7.53517
4,All 12 studies were of patients who had alread...,20.474601,does not represent the effect of smoking on th...,7.534319
8,(ii) Smokers have 15% chance of severe COVID19...,19.255714,greater risk,6.816445
3,44 The absence of evidence suggesting harm fro...,21.272013,considerably increased risk of disease transmi...,6.069552
1,"The objective is to show (as before) that, und...",22.357154,smoking reduces the risk,5.608153
9,To produce relative risk factors for comorbidi...,19.239323,relative increase in incidence of severe cases,5.539629
7,c. Smoking should be stopped as it increases t...,20.033095,increases the risk and severity of covid - 19,5.071908
2,"There is no template for telehealth, let alone...",22.041327,smoking has innumerable adverse health 33 effe...,4.903539
6,"Unexpectedly, we found that having asthma, imm...",20.21711,advanced age is the main factor for hospitaliz...,4.340332


In [None]:
best_answers("Risk of fatality among symptomatic hospitalized patients", 10)
# some of the retrieved documents contain the answer to the question
# some answers contain percentages that answer the question
# the percentages are different in some cases

Unnamed: 0,Document,Document Score,Answer,Answer Score
2,Although this study was not designed to evalua...,23.683454,~ 15 - 20 %,5.088109
9,Coronavirus disease 2019 (COVID- 19) is an ong...,21.757563,patients with comorbid conditions including hy...,2.269718
4,"As of April 18, 2020, 2.16 million patients in...",22.650262,12 . 38 %,1.608995
0,Clinical manifestations of COVID-19 are simila...,24.873457,4e15 % have died,1.459875
1,"Since the first reports of covid-19 in Wuhan, ...",24.284328,high frequencies of diabetic individuals,0.429442
3,Acute infections are associated with increased...,23.58699,studies indicated that there is an increased r...,-0.658474
8,The fatality risk after 35 days of onset of sy...,21.777711,12 . 38 %,-1.488249
7,In this paper we apply survival analysis metho...,21.8798,covid - 19 fatality,-2.234887
6,Patient trajectories and risk factors for seve...,21.906851,risk factors for severe outcomes,-2.397173
5,COVID-19 Fatality and Comorbidity Risk Factors...,22.462238,covid - 19 fatality and comorbidity risk facto...,-5.813991


In [None]:
best_answers("Efforts targeted at a universal coronavirus vaccine", 10)
# most documents are quite good, they are related to the question
# some answers mention universal coronavirus vaccines

Unnamed: 0,Document,Document Score,Answer,Answer Score
4,Despite recent efforts in basic and translatio...,21.092022,no universal influenza vaccine available again...,1.954384
0,The continued explosive spread of severe acute...,25.489049,the apparent inevitability of future novel cor...,-0.654698
5,Note: BCG coverage is the age at which a vacci...,20.977075,bcg at birth,-0.946686
7,"In Canada, while there was no consistent effor...",20.348944,no consistent effort of universal vaccination,-1.777842
1,Rapid identification and deployment of effecti...,23.340901,vaccines and broad - spectrum antiviral agents,-3.76339
9,To streamline coronavirus vaccine and drug eff...,20.041853,to streamline coronavirus vaccine and drug,-4.328826
6,Universal coronavirus vaccines: the time to st...,20.709963,universal coronavirus vaccines : the time to s...,-5.053947
2,The novel coronavirus infection (COVID-19 or C...,21.791502,no,-5.663885
3,There are 18 biotechnology companies and unive...,21.737127,there are 18 biotechnology companies and unive...,-5.734766
8,Vaccination is the most effective and economic...,20.194432,development,-5.815524


In [None]:
best_answers("What is known about the efficacy of school closures?", 10)
# retrieved documents are bad, they don't contain the answer to the question
# answers have no relation with the question

Unnamed: 0,Document,Document Score,Answer,Answer Score
4,What else do you need? At least one parent who...,20.110155,at least one parent,1.412675
9,This article summarizes what is currently know...,19.65906,coronavirus,0.673179
8,What is already known about this topic • COVID...,19.936274,what is already known about this topic,0.528483
3,What is already known about this subject?\n,21.07679,what is already known,-0.097616
5,Hydroxychloroquine in COVID-19 patients: what ...,20.044282,hydroxychloroquine in covid - 19 patients : wh...,-0.478912
0,What is already known about this topic?\n,21.07679,what is already known,-0.649346
1,What is already known about this topic?\n,21.07679,what is already known,-0.649346
2,What is already known about the topic?\n,21.07679,what is already known,-0.843583
6,We examined what is known about the prevalence...,20.044282,prevalence,-1.66956
7,Informed and trusted communication. Physical d...,19.97539,known risks,-3.181487


In [None]:
best_answers("Is there any evidence to suggest geographic based virus mutations?", 10)
# some documents are related to the question
# few answers are quite good, and are somewhat related to the question

Unnamed: 0,Document,Document Score,Answer,Answer Score
0,We suggest that close contact with an infected...,22.080783,there is no evidence that part of covid - 19 i...,7.909936
8,Covid-19 is a human to human spreading disease...,19.763864,there is growing evidence to suggest that the ...,5.750208
9,The COVID-19 pandemic caused by the SARS-Cov2 ...,19.741909,local transmission refers to acquisition withi...,3.877108
1,Our analysis demonstrates that the genome of t...,21.622683,the results also suggest that there are four d...,3.533509
6,"Trott et al. also agree that we ""convincingly ...",20.000722,had any of these studies provided robust evide...,2.596944
2,Our secondary aims are to establish the incide...,21.065055,any evidence to suggest faecal virus transmitt...,1.776677
5,The 2020 coronavirus pandemic is developing at...,20.050778,there is reasonable evidence to suggest the co...,1.553917
7,Our study found 67% of our sample from across ...,19.848596,there were demographic and geographical variat...,0.218135
3,"Based on the above data and considerations, we...",20.43295,based on the above data and considerations,-0.707712
4,The high incidence of cough and fever in COVID...,20.054813,any evidence to suggest faecal virus transmitt...,-2.282377
