# Implementation of Open Domain Q.A. system for Covid19

In this assignment you will build an application that performs __information retrieval__ and __question answering__, which are the core tasks that take care the **Open Domain Q.A.** task. 

*   __Information Retrieval__ Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.

* __Question Answering__ Systems that automatically answer questions posed by humans in a natural language.

The steps are the following:
- Preprocess the data and init models for I.R.
- Preprocess the data and init models for Q.A.
- Implement the functions: retrieve_documents & answer question given a query.
- Run some analysis of the model output.

Note you are not requiered to strictly follow the steps suggested in the notebook as long as you build a model and provide some analysis.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/assignments

/content/drive/MyDrive/LAP/Subjects/AP2/assignments


## 1. Install packages and load libraries

In this section we will install all the packages and load all the libraries needed to run the code below.

In [3]:
!pip install Whoosh # search engine library
!pip install transformers

Collecting Whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)
[?25l[K     |▊                               | 10 kB 21.6 MB/s eta 0:00:01[K     |█▍                              | 20 kB 28.6 MB/s eta 0:00:01[K     |██                              | 30 kB 20.0 MB/s eta 0:00:01[K     |██▉                             | 40 kB 9.2 MB/s eta 0:00:01[K     |███▌                            | 51 kB 7.6 MB/s eta 0:00:01[K     |████▏                           | 61 kB 8.9 MB/s eta 0:00:01[K     |█████                           | 71 kB 9.7 MB/s eta 0:00:01[K     |█████▋                          | 81 kB 9.9 MB/s eta 0:00:01[K     |██████▎                         | 92 kB 10.8 MB/s eta 0:00:01[K     |███████                         | 102 kB 9.6 MB/s eta 0:00:01[K     |███████▊                        | 112 kB 9.6 MB/s eta 0:00:01[K     |████████▍                       | 122 kB 9.6 MB/s eta 0:00:01[K     |█████████                       | 133 kB 9.6 MB/s eta 0:00:01

In [4]:
import codecs # base classes for standard Python codecs, like text encodings (UTF-8,...)
from IPython.core.display import display, HTML # object displaying in different formats
from whoosh.index import * # whoosh: full-text indexing and searching
from whoosh.fields import *
from whoosh import qparser
import glob
import random
import torch

## Information Retrieval

Based on https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8
(deepest apreciation to Jon Ander Campos and Arantxa Otegi, winners of task 'What do we know about diagnostics and surveillance?' in COVID-19 Open Research Dataset Challenge (CORD-19))

The goal of this lab is to build a I.R. system that retrieves the most relevant documents given a query related to Covid19.

We only use the freely available [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research), which contains metadata of over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS).

The system has a main component that is an Information Retrieval system (IR), based on the classical BM25F search algorithm. This system indexes abstracts and paragraphs on the full text of the papers.

### Load info from data file

CORD19-dataset includes research papers related to coronavirus and COVID-19. In this section we first load the info. As we are not interested in all the metadata info from papers, we will select just text information, such as title, abstract and body text (already done for you).

CORD-19.v7 includes info of 51,078 papers, but some of them are repeated (they have the same *cord_uid*). Thus, we already filter out the repeated ones. 

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we want to filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS). For that purpose, we created a list of synonyms of COVID-19 and we check if a synonym appears in the title or the abstract of a paper. 

List of synonyms used for filtering:

    'coronavirus 2019',
    'coronavirus disease 19',
    'cov2',
    'cov-2',
    'covid',
    'ncov 2019',
    '2019ncov',
    '2019-ncov',
    '2019 ncov',
    'novel coronavirus',
    'sarscov2',
    'sars-cov-2',
    'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus',
    'wuhan pneumonia',
    'wuhan virus'

In that way, we filter out those papers that do not include any of the synonyms. From now on, we will consider only the papers that we keep after filtering.

This are the number of papers after filtering:

In [5]:
def read_passages(path):
    count = 0
    passages = []
    with open(path) as f:
        for line in f:
            count += 1
            passages.append(line)
    return passages, count

In [6]:
path='../data/passages'
passages, count = read_passages(path)
print("Number of passages related to 'COVID-19':", count)
print()
print("3 Random passages:")
print()
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])

Number of passages related to 'COVID-19': 308665

3 Random passages:

In this randomized pilot study, 99m Tc-MDP had an effective inhibitory effect on the inflammatory disease progression for the therapy of COVID-19, and it can accelerate the absorption of pulmonary inflammation in a short period of time during the process of treatment.

The under-reporting factor changes as the growth of the epidemic changes, dropping from higher values during the initial rapid growth phase to lower values during the slower growth phase. This simply reflects that relatively fewer new infections happen during the time between initial infection and actual confirmation of cases in the slower growing phases. Figure 6 shows the relation between actual infections and confirmed case numbers for a scenario where more effective government interventions reduce the reproductive rate to 0.94, causing a steady drop in new daily infections. This causes a significantly sharper drop in new infections during the adapt

### 3. Create an index for the paper retrieval system

The system that we are going to develop in our approach is the information retrieval system. An information retrieval system is a tool that searches for  documents that are relevant to an information need from a collection of documents. This system has two main modules: the indexing system and the query system. 

The first module is in charge of creating the primary data structure for the system, which is the index. The second component is the one with which users interact submitting a query based on their information need, and based on this query and using the index, retrieves documents. In this section we will create an index, and in the next section, we will develop the query system. For the implementation of these modules, we will use [Whoosh library](https://pypi.org/project/Whoosh/), which contains functions for indexing text and then searching the index.

The index is a data structure that makes it possible to search for information in a document collection in a very efficient way. In short, it lists, for every word, all documents that contain it.

In order to create an index, we must define the schema of the index. The schema lists the fields in the index. A field is a piece of information for each document in the index, for example, id, path of the document, title and text. We define the type of these last two fields as “TEXT”, which means that they will be searchable. As it is common practice, we also define to apply the Stemming Analyzer to these text fields. Applying this analyzer all the text will be tokenized, then all the tokens will be converted to lowercase, a stopword filter will be applied in order to remove too common words, and finally, a stemming algorithm will be applied.

In [7]:
# Schema definition:
# - id: type ID, unique, stored; doc id in order given the passages file
# - text: type TEXT processed by StemmingAnalyzer; not stored; content of the passage
schema = Schema(id = ID(stored=True,unique=True),
                text = TEXT(analyzer=analysis.StemmingAnalyzer())
               )

Once we have the schema, we can create an index.


In [8]:
# Create an index
if not os.path.exists("index"):
    os.mkdir("index")

ix = create_in("index", schema)
writer = ix.writer() #run once! or restart runtime

Next, we will add documents to the index. We will index the papers related to COVID-19, not only the abstracts that are in the metadata file, but also the full text provided in PMC or PDF JSON format. As having shorter documents is better for the answering system that we will develop later, we will not index the whole text in a paper together. Instead, the indexing unit will be an abstract or each of the paragraphs of the full text (as marked in JSON files).

This could take several minutes.

In [9]:
# Add papers to the index, iterating through each row in the metadata dataframe
for ind,passage_text in enumerate(passages): 
    writer.add_document(id=str(ind),text=passage_text)

Finally, we will save the added documents to the index.

In [10]:
# Save the added documents
writer.commit()
print("Index successfully created")

# Sanity check
print("Number of documents (abstracts and paragraphs of papers) in the index: ", ix.doc_count())

Index successfully created
Number of documents (abstracts and paragraphs of papers) in the index:  308665


### Define a function to query the index and retrieve relevant papers

In this section we will define a function that given a question and a maximum number of documents as input, it uses this query to retrieve relevant papers that were indexed in the previous section.

In this function we set the algorithm used for scoring (we will be using the default BM25 algorithm), and we  also set the query parser to use, defining the default field to search (in our case '*text*’ field). Then, we run the query and get the most relevant documents on the index (*n_docs* documents at maximum). 

The output of the function is a set (*n_docs*) of texts and scores.

In [11]:
# Input: Question and maximum number of documents to retrieve
def retrieve_documents(query, topk):
    scores=[]
    text=[]
    # Open the searcher for reading the index. The default BM25 algorithm will be used for scoring
    with ix.searcher() as searcher:
        searcher = ix.searcher()
        
        # Define the query parser ('text' will be the default field to search), and set the input query
        q = qparser.QueryParser("text", ix.schema, group=qparser.OrGroup).parse(query)
    
        # Search using the query q, and get the topk documents, sorted with the highest-scoring documents first
        results = searcher.search(q, limit=topk)
        # results is a list of dictionaries where each dictionary is the stored fields of the document

    # Iterate over the retrieved documents
    for hit in results:
        scores.append(hit.score)
        text.append(passages[int(hit['id'])])
    return text, scores

Retrieve 3 most relevant documents and scores given the query "How long individuals are contagious?":

In [12]:
retrieve_documents("How long individuals are contagious?", 3)

(['The model gives what fractions are susceptible, exposed, infectious and immune, respectively, for short and long-term M -individuals as well as P -individuals over time. The main focus is to study how many people in the province become infected as a result of the M -individuals visiting the province.\n',
  "In a time of such extreme uncertainty, making economic decisions becomes challenging because pandemics are rare. The most recent comparable episode is the Spanish flu of 1918 (Trilla et al., 2008) , so pandemics are likely to occur at most once during one's lifetime. Nevertheless, individuals need to make everyday decisions such as how to manage inventories of staples, how much to consume and save, when to buy or sell stocks, etc., and these decisions depend on the expectation of how long and severe the epidemic is. Governments must also make decisions such as to what extent imposing travel restrictions, social distancing, closure of schools and businesses, etc., and for how long

## Question Answering

### Load Fine-Tuned BERT-large

For Question Answering we use the `BertForQuestionAnswering` class from the `transformers` library.

This class supports fine-tuning, but for this example we will keep things simpler and load a BERT model that has already been fine-tuned for the SQuAD benchmark.

The `transformers` library has a large collection of pre-trained models which you can reference by name and load easily. The full list is in their documentation [here](https://huggingface.co/transformers/pretrained_models.html).

For Question Answering, they have a version of BERT-large that has already been fine-tuned for the SQuAD benchmark. 

BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance. 


In [13]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

Load the tokenizer as well. 


In [14]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

### Answer Questions


In [15]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Prints them out.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Report how long the input sequence is.
    # print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token istelf.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example through the model.
    outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                    return_dict=True) 

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)
    
    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    s_scores = start_scores.detach().numpy().flatten()
    e_scores = end_scores.detach().numpy().flatten()

    score = max(s_scores) + max(e_scores)

    return answer, score

In [16]:
question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

In [17]:
answer_question(question, answer_text)

('340m', 11.890032)

## Analysis of the output
It is important to run some analysis of the output of our model in order to gain some insights about good and bad things of the model. Use some questions from the lab 8.IR_for_covid19 to test your model.

In addition, each function should return a score. This will allow us to better understand how confident the model is returning both documents and answers. The scoring by default should be best retrieved document & best answer in that retrieved document.

### TO-DO 
- Build the Open Domain Q.A. system.
- Show some correct and incorrect examples for IR & QA for Covid 19 and its scores.
### TO-DO (optional)
- Try to improve over the presented baseline scoring system, for example, returning the highest scored answer in the first 10 most relevant documents, feel free to explore any other scoring strategy.
- Apply a threshold based on the score to filter out unwanted results (low scored answers).

In [44]:
def best_answers(query, count):
    best_score = -float("inf")
    best_answer = ""
    docs, scores = retrieve_documents(query, count)
    for i, doc in enumerate(docs):
        answer, score = answer_question(query, doc)
        if score > best_score:
            best_score = score
            best_answer = answer
        # print(answer, scores[i], score)
    print(best_answer, score)

In [45]:
best_answers('How long individuals are contagious?', 10)

short and long - term 6.337094


In [46]:
best_answers("Range of incubation periods for the disease in humans", 10)

between one day to two weeks 2.918795


In [47]:
best_answers("Prevalence of asymptomatic shedding and transmission", 10)

it is likely that asymptomatic and presymptomatic transmission is occurring 4.9668207


In [48]:
best_answers("Persistence of virus on surfaces of different materials", 10)

[CLS] -0.492468


In [49]:
best_answers("Immune response and immunity", 10)

the body ' s response to viral infections and the immune response to viruses -1.4405526


In [50]:
best_answers("Does smoking increase risk for COVID-19?", 10)

even if smoking does have a small protective effects against covid - 19 5.539629


In [51]:
best_answers("Risk of fatality among symptomatic hospitalized patients", 10)

~ 15 - 20 % 2.269718


In [52]:
best_answers("Efforts targeted at a universal coronavirus vaccine", 10)

no universal influenza vaccine available against all influenza virus subtypes -4.328826


In [53]:
best_answers("What is known about the efficacy of school closures?", 10)

at least one parent 0.67317945


In [54]:
best_answers("Is there any evidence to suggest geographic based virus mutations?", 10)

there is no evidence that part of covid - 19 is synthetic . 3.8771076
