# I.R. for Covid19

Based on https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8
(deepest appreciation to Jon Ander Campos and Arantxa Otegi, winners of the task 'What do we know about diagnostics and surveillance?' in the COVID-19 Open Research Dataset Challenge (CORD-19))

The goal of this lab is to build an I.R. system that retrieves the most relevant documents given a query related to Covid19.

We only use the freely available [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research), which contains metadata of over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS).

The main component of the system is an Information Retrieval (IR) engine based on the classical BM25F ranking algorithm. It indexes the abstracts and the full-text paragraphs of the papers.

## 1. Install packages and load libraries<a class="anchor" id="libraries"></a>

In this section we will install all the packages and load all the libraries needed to run the code below.

In [2]:
!pip install Whoosh # search engine library

Collecting Whoosh
  Downloading Whoosh-2.7.4-py2.py3-none-any.whl (468 kB)

In [3]:
import codecs # base classes for standard Python codecs, like text encodings (UTF-8,...)
import glob
import os # used below to create the index directory
import random
from IPython.display import display, HTML # object displaying in different formats
from whoosh import analysis, qparser # analyzers (StemmingAnalyzer) and query parsing
from whoosh.index import * # whoosh: full-text indexing and searching
from whoosh.fields import *

## 2. Load info from data file<a class="anchor" id="files"></a>

The CORD-19 dataset includes research papers related to coronaviruses and COVID-19. In this section we first load the data. Since we are not interested in all the paper metadata, we keep only the text fields, such as title, abstract and body text (already done for you).

CORD-19.v7 contains 51,078 papers, but some of them are duplicates (they have the same *cord_uid*). The duplicates have already been filtered out.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we want to filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS). For that purpose, we created a list of synonyms of COVID-19 and we check if a synonym appears in the title or the abstract of a paper. 

List of synonyms used for filtering:

    'coronavirus 2019',
    'coronavirus disease 19',
    'cov2',
    'cov-2',
    'covid',
    'ncov 2019',
    '2019ncov',
    '2019-ncov',
    '2019 ncov',
    'novel coronavirus',
    'sarscov2',
    'sars-cov-2',
    'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus',
    'wuhan pneumonia',
    'wuhan virus'

In that way, we filter out those papers that do not include any of the synonyms. From now on, we will consider only the papers that we keep after filtering.
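The filtering code itself is not part of this notebook (the *passages* file is already filtered). As a rough illustration only, the check described above could look like the sketch below. The function name `is_covid_related` and the two example papers are hypothetical, and case-insensitive substring matching is an assumption about how the synonyms were matched.

```python
# Hypothetical sketch of the synonym filter; not the code used to build the passages file.
COVID_SYNONYMS = [
    'coronavirus 2019', 'coronavirus disease 19', 'cov2', 'cov-2', 'covid',
    'ncov 2019', '2019ncov', '2019-ncov', '2019 ncov', 'novel coronavirus',
    'sarscov2', 'sars-cov-2', 'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus', 'wuhan pneumonia', 'wuhan virus',
]

def is_covid_related(title, abstract):
    """Return True if any synonym occurs in the title or in the abstract."""
    title = (title or '').lower()        # assumption: matching ignores case
    abstract = (abstract or '').lower()
    return any(syn in title or syn in abstract for syn in COVID_SYNONYMS)

# Two made-up papers: the first should be kept, the second filtered out
papers = [
    {"title": "Clinical features of SARS-CoV-2 infection", "abstract": ""},
    {"title": "MERS-CoV in dromedary camels", "abstract": "A serological survey."},
]
kept = [p for p in papers if is_covid_related(p["title"], p["abstract"])]
print(len(kept))  # 1
```

Substring matching is deliberately permissive: a short synonym such as 'covid' also matches 'COVID-19' and 'covid19'.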

This is the number of passages after filtering (note that the loading cell below stops after the first 5,000 lines of the file):

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/labs

/content/drive/MyDrive/LAP/Subjects/AP2/labs


In [7]:
path='../data/' #set path to nlp-app-II/data/ -> passages
max_passages=5000 # load at most this many passages
passages=[]
with open(path+'passages', encoding='utf-8') as f:
    for line in f:
        passages.append(line) # one passage per line (trailing newline kept)
        if len(passages) == max_passages:
            break
count=len(passages)
print("Number of passages related to 'COVID-19':", count)
print()
print("3 Random passages:")
print()
# sample without repetition so that the three passages are different
for passage in random.sample(passages, 3):
    print(passage)

Number of passages related to 'COVID-19': 5000

3 Random passages:

Parameters in the third phase (begins March 15 th for the province of Bergamo, March 17 th for the Lombardy Region) = 1,12 ± 0,01 = 1,09 ± 0,01

Erythema and pain improved, and he was discharged on apixaban for three months. [11] . Markers of thrombosis, such as elevation of D-dimer have been reported to correlate with disease severity, development of acute respiratory distress syndrome and mortality [12] [13] [14] [15] [16] [17] .

Adverse cardiac effects and proarrhythmogenic properties of hydroxychloroquine, especially in combination with macrolide antibiotics, such as Azithromycin, deserves particular mention (144) . Hydroxychloroquine, azithromycin and, to a lesser extent, lopinavir have been associated with prolongation of the QTc interval and increase the risk for tachyarrhythmias and sudden cardiac death. Careful consideration of patient risk profile, pre-treatment ECG assessment and monitoring of pharmacokinet

## 3. Create an index for the paper retrieval system <a class="anchor" id="index"></a>

The system we are going to develop is an information retrieval system: a tool that searches a document collection for documents relevant to an information need. It has two main modules: the indexing system and the query system.

The first module creates the system's primary data structure, the index. The second is the one users interact with: they submit a query expressing their information need, and the module uses the index to retrieve matching documents. In this section we will create an index, and in the next section we will develop the query system. To implement both modules we will use the [Whoosh library](https://pypi.org/project/Whoosh/), which contains functions for indexing text and then searching the index.

The index is a data structure that makes it possible to search for information in a document collection in a very efficient way. In short, it lists, for every word, all documents that contain it.
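To make the idea concrete, here is a minimal sketch of such a word-to-documents mapping (an inverted index) in plain Python. It is a toy: the function `build_inverted_index` and the three example documents are made up for illustration, and Whoosh's on-disk index is far more elaborate (it also stores frequencies and other statistics used for scoring).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

toy_docs = [
    "incubation period of the virus",
    "the virus persists on surfaces",
    "median incubation period",
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["virus"])       # [0, 1]
print(toy_index["incubation"])  # [0, 2]
```

Answering a one-word query is then a single dictionary lookup instead of a scan over every document.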

To create an index, we must first define its schema. The schema lists the fields in the index. A field is a piece of information stored for each document, for example an id, the path of the document, a title or the text. In this lab the schema has just two fields: *id* and *text*. We define the *text* field as “TEXT”, which makes it searchable. As is common practice, we also apply the Stemming Analyzer to this field: the text is tokenized, the tokens are converted to lowercase, a stopword filter removes overly common words, and finally a stemming algorithm is applied.
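The four analysis steps (tokenize, lowercase, remove stopwords, stem) can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not Whoosh's implementation: the stopword list `TOY_STOPWORDS` is a tiny made-up sample, and `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list (assumption); real analyzers ship a larger one
TOY_STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}

def toy_stem(token):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_analyze(text):
    tokens = re.findall(r"\w+", text)                        # 1. tokenize
    tokens = [t.lower() for t in tokens]                     # 2. lowercase
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # 3. remove stopwords
    return [toy_stem(t) for t in tokens]                     # 4. stem

print(toy_analyze("Testing infected patients in hospitals"))
# ['test', 'infect', 'patient', 'hospital']
print(toy_analyze("tests"))
# ['test']
```

Because the same analysis is applied to documents and to queries, a query containing "tests" can match a passage containing "Testing".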

In [8]:
# Schema definition:
# - id: type ID, unique, stored; doc id in order given the passages file
# - text: type TEXT processed by StemmingAnalyzer; not stored; content of the passage
schema = Schema(id = ID(stored=True,unique=True),
                text = TEXT(analyzer=analysis.StemmingAnalyzer())
               )

Once we have the schema, we can create an index.


In [9]:
# Create an index
if not os.path.exists("index"):
    os.mkdir("index")

ix = create_in("index", schema)
writer = ix.writer() #run once! or restart runtime

Next, we will add documents to the index. We index the papers related to COVID-19: not only the abstracts from the metadata file, but also the full text provided in PMC or PDF JSON format. Since shorter documents work better for the answering system that we will develop later, we do not index a whole paper as a single document. Instead, the indexing unit is an abstract or a single paragraph of the full text (as marked in the JSON files); each line of the *passages* file is one such unit.

This could take several minutes.

In [10]:
# Add the passages to the index; the position of each passage in the list is used as its document id
for ind,passage_text in enumerate(passages):
    writer.add_document(id=str(ind),text=passage_text)

Finally, we will save the added documents to the index.

In [11]:
# Save the added documents
writer.commit()
print("Index successfully created")

# Sanity check
print("Number of documents (abstracts and paragraphs of papers) in the index: ", ix.doc_count())

Index successfully created
Number of documents (abstracts and paragraphs of papers) in the index:  5000


## 4. Define a function to query the index and retrieve relevant papers <a class="anchor" id="retrieval"></a>

In this section we will define a function that takes a question and a maximum number of documents as input and retrieves the most relevant passages indexed in the previous section.

In this function we rely on the scoring algorithm (Whoosh's default, BM25F) and set up the query parser, defining the default field to search (in our case the *text* field). Then we run the query and get the most relevant documents in the index (at most *n_docs* documents).

The function returns two parallel lists with at most *n_docs* items each: the retrieved texts and their scores.
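To give an intuition of where the scores come from, below is a sketch of a textbook BM25 formula on a toy corpus. It is not Whoosh's code: Whoosh's BM25F may differ in details such as the IDF definition and per-field weighting, and the parameter values `k1=1.2` and `b=0.75` are common textbook choices assumed here. The function `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a textbook BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents in which each query term appears
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Rare terms get a higher weight than terms found in most documents
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "median incubation period was three days".split(),
    "the virus persists on surfaces for days".split(),
    "incubation period incubation range".split(),
]
scores = bm25_scores(["incubation", "period"], toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 0, 1]
```

The third document ranks first because it is short and repeats a query term; the second contains no query term and scores 0.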

In [12]:
# Input: Question and maximum number of documents to retrieve
# Output: texts and scores of the retrieved passages, best match first
def retrieve_docs(qstring, n_docs):
    scores=[]
    text=[]
    # Open the searcher for reading the index. The default BM25F algorithm will be used for scoring
    with ix.searcher() as searcher:
        # Define the query parser ('text' will be the default field to search), and set the input query.
        # OrGroup: a passage matches if it contains any of the query terms
        q = qparser.QueryParser("text", ix.schema, group=qparser.OrGroup).parse(qstring)

        # Search using the query q, and get the n_docs documents, sorted with the highest-scoring documents first
        results = searcher.search(q, limit=n_docs)

        # Iterate over the hits while the searcher is still open:
        # reading a stored field (hit['id']) needs the open index reader
        for hit in results:
            scores.append(hit.score)
            text.append(passages[int(hit['id'])])
    return text,scores

Retrieve the 3 most relevant documents and their scores for the query "How long individuals are contagious?":

In [13]:
retrieve_docs("How long individuals are contagious?",3)

(['As it is currently unknown how long antibodies against COVID-19 last after primary infection, repetitive antibody testing will be crucial to assess long term immunity in order to develop future vaccines. In addition, several COVID-19 strains with different virulence have been reported [70, 71] . It is not yet known how fast the virus mutates, creating strains for which previously infected (or vaccinated) individuals would no longer be immune to. It will be important to include pregnant woman in vaccination trials, since they are considered a high-risk population [50] .\n',
  'This method of automated contact tracing will work as long as A and C (and possibly E) are enrolled in the service even if B and D are not. However, D is completely isolated and by remaining so for a long time is observing social distancing from any other individual. B is representative of an individual who observes partial social distancing. Hence, for D this service is not necessary and for B it is of limited

Test as many queries as you want:

- Range of incubation periods for the disease in humans
- Prevalence of asymptomatic shedding and transmission
- Persistence of virus on surfaces of different materials
- Immune response and immunity
- Does smoking increase risk for COVID-19?
- Risk of fatality among symptomatic hospitalized patients
- Efforts targeted at a universal coronavirus vaccine
- What is known about the efficacy of school closures?
- Is there any evidence to suggest geographic based virus mutations?

Check more queries in section 6 of https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8

Are the retrieved documents relevant for each query?
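One simple way to put a number on that question is precision at k: read the top k passages for a query, mark each one as relevant (1) or not (0), and compute the fraction of relevant ones. The sketch below assumes such manual judgments; the function `precision_at_k` and the example labels are hypothetical, not results from this notebook.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k results judged relevant (labels are 1 or 0, in rank order)."""
    top = relevance_labels[:k]
    return sum(top) / len(top) if top else 0.0

# Hypothetical judgments for the top 3 results of one query:
# first and third passage judged relevant, second one not
labels = [1, 0, 1]
print(round(precision_at_k(labels, 3), 2))  # 0.67
```

Averaging this value over several queries gives a rough overall picture of how well the retriever is doing.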

In [14]:
retrieve_docs("Range of incubation periods for the disease in humans",3)

(['There exists an incubation period t c for the development of SARS-CoV-2-related symptoms, which average has been reported to range between 5.2 days [41, 42] and 6.4 days [43] . We can assume the incubation period Statistically-based methodology for revealing real contagion trends and correcting delay-induced errors in the assessment of COVID-19 pandemic follows an exponential distribution with λ = 1 tc :\n',
  'patients were local residents of Wuhan. 26.0% of patients outside of Wuhan did not have a recent travel to Wuhan or contact with people from Wuhan. The median incubation period was 3.0 days (range, 0 to 24.0).\n',
  'In summary, 2019-nCoV elicits a rapid spread of outbreak with human-to-human transmission, with a median incubation period of 3 days and a relatively low fatality rate. Absence of fever and . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.\n'],
 [19.76657561

In [15]:
retrieve_docs("Prevalence of asymptomatic shedding and transmission",3)

(['Children may shed HCoVs for extended periods of time after infection, potentially leading to additional transmissions within close-contact settings. There are very limited data regarding duration of shedding in children. In prospective childcare studies, 34% of children with HCoV had detectable virus at 1 week or more following symptom onset, with shedding documented for up to 18 days ( Figure 1 ) [47, 48] . A longitudinal study of weekly nasal swabs taken from symptomatic and asymptomatic adults and children similarly found that viral detection extended beyond 1 week [49] . Children aged <5 years with HCoV detection were frequently asymptomatic, especially with HCoV-229E. These findings reinforce serologic-based findings of asymptomatic infection in 7% of children with HCoV-229E in the 1960s [50] . Clinical symptoms associated with the 4 common HCoVs generally appear to be indistinguishable from cold symptoms or influenza-like illness (rhinorrhea, sore throat, cough, wheezing, and 

In [16]:
retrieve_docs("Persistence of virus on surfaces of different materials",3)

(["As for COVID-19, the reproductive rate or in other words the average number of people that an infected person transmits the virus to during the peak of the epidemic is between two and three (range 2.5-2.9), which is somewhat higher than seasonal influenza [27] . This number reflects both the virus' characteristics and infection potential, as well as the human behavior (e.g., social distancing or not). The virus is transmitted by droplet-infection as well as surface-contact (face-to-fomite), with certain reported data describing persistence of viable virus on surfaces up to four days [11, 28] . More evidence is needed on the possibility of airborne transmission and transmission during aerosol generating procedures and should be taken into consideration until further evidence is available [29] [30] [31] . Pre-symptomatic people can also transmit the disease, and for that reason, in China everyone is advised to wear face masks outside of the home environment [11, 32] . Guideline 1. Eve

In [17]:
retrieve_docs("Immune response and immunity",3)

(['Abstract The novel coronavirus SARS-CoV2 causes COVID-19, a pandemic threatening millions. As protective immunity does not exist in humans and the virus is capable of escaping innate immune responses, it can proliferate, unhindered, in primarily infected tissues. Subsequent cell death results in the release of virus particles and intracellular components to the extracellular space, which result in immune cell recruitment, the generation of immune complexes and associated damage. Infection of monocytes/macrophages and/or recruitment of uninfected immune cells can result in massive inflammatory responses later in the disease. Uncontrolled production of pro-inflammatory mediators contributes to ARDS and cytokine storm syndrome. Antiviral agents and immune modulating treatments are currently being trialled. Understanding immune evasion strategies of SARS-CoV2 and the resulting delayed massive immune response will result in the identification of biomarkers that predict outcomes as well a

In [18]:
retrieve_docs("Does smoking increase risk for COVID-19?",3)

(['Among patients with COVID-19, new-onset CVD increases in individuals who have risk factors, including smoking and diabetes. The Chinese Center for Disease Control and Prevention reported that COVID-19 patients with diabetes had higher mortality [11] . In South Korea, the KCDC reported that as of April 30, 247 deaths occurred, of which 244 are deaths with underlying disease. Among them, the mortality rate of COVID-19 patients with the underlying disease with a metabolic disease or cardiovascular diseases, such as diabetes, stroke, and hypertension [12] . Clinical data characterizing patients with COVID-19 give evidence that CVD risk factors, including smoking and diabetes, are likely associated with negative progression and adverse outcomes of COVID-19 [13] . Recently, a high level of ACE2 has been observed in the brains of smokers [14] . Hence, we consider that smoking and diabetes might increase the ability of SARS-CoV-2 to enter and infect the brain based on the high expression of

In [19]:
retrieve_docs("Risk of fatality among symptomatic hospitalized patients",3)

(['Clinical manifestations in pediatric patients have not been systematically described. Among the 31 pediatric patients with MERS-CoV infection documented from June 2012 to April 19, 2016, 13 patients (42%) were asymptomatic [94] . In another study, among 7 pediatric patients identified from April 2014 to November 2016, 3 were asymptomatic; fever, cough, shortness of breath, diarrhea, and vomiting were observed in 4 patients [95] . Although pediatric patients typically have mild disease, high-risk children with underlying conditions, including cystic fibrosis, nephrotic syndrome, and unidentified underlying conditions, died of MERS-CoV infection, with a fatality rate of 12% (4/33) (see Supplementary Table 1 ) [96, 97] .\n',
  'We found a low prevalence of SARS-CoV-2 (2.7% [5/188]) among pregnant and postpartum patients after initiating universal testing. Prevalence among symptomatic patients (22.2% [4/18]) was similar to initial targeted screening approaches (19.1% [8/42]). Among 170 

In [20]:
retrieve_docs("Efforts targeted at a universal coronavirus vaccine",3)

(['Engagement of S protein with the host receptor results in considerable changes in molecular conformation. The S protein has a critical function in host-cell entry, and thus is a major target for vaccine research and antibody-mediated VN efforts.\n',
  'The development of a vaccine against a disease is a combined effort from academicians and industries. Under normal circumstances, the final product of the vaccine for use in humans takes at least 15-20 years passing through six phases of assessment (Bregu et al. 2011) . In the first phase, the academician identifies a target that has the potential to be a vaccine candidate. Thereafter, this candidate is challenged for its vaccine potential by testing it in animal models for the disease where the safety as well as immune responses to the antigen is analyzed. This identification and development of the vaccine candidate is a bottleneck and takes majority of the time (*9-10 years) required for vaccine development. Once these steps are suc

In [21]:
retrieve_docs("What is known about the efficacy of school closures?",3)

(['While symptoms of COVID-19 disease may be (sometimes only slightly) milder in comparison to infections with SARS-CoV or MERS-CoV, several key pathogen-associated and clinical features of disease are similar and we can extrapolate knowledge from what is already known about the pathophysiology of SARS and MERS .\n',
  'What we know about the novel coronavirus’s proteins\n',
  'Coronaviruses contribute to the burden of respiratory diseases in children, frequently manifesting in upper respiratory symptoms considered to be part of the “common cold.” Recent epidemics of novel coronaviruses recognized in the 21st century have highlighted issues of zoonotic origins of transmissible respiratory viruses and potential transmission, disease, and mortality related to these viruses. In this review, we discuss what is known about the virology, epidemiology, and disease associated with pediatric infection with the common community-acquired human coronaviruses, including species 229E, OC43, NL63, an

In [22]:
retrieve_docs("Is there any evidence to suggest geographic based virus mutations?",3)

(['Although there is no definitive, evidence-based treatment for COVID-19, preliminary data suggest that remdesivir may improve outcomes in critically-ill patients 10 . Further carefully designed studies are needed to confirm the role of these agents in the treatment of COVID-19.\n',
  'The pathophysiology of this interaction remains poorly characterized. However, preliminary data suggest that acute inflammation superimposed on pre-existing CVD can precipitate cardiac injury, acute coronary syndrome, and myocardial dysfunction, and trigger arrhythmias in patients with COVID-19. 3, 4 Furthermore, there is evidence of direct myocardial infiltration, potentially as a result of the affinity of the SARS-CoV-2 virus for the angiotensinconverting enzyme 2 receptor. 6 Given the frequency of cardiac manifestations and injuries, the cardiac rehabilitation system will likely be overwhelmed by an unprecedented number of discharged patients with new or exacerbated CVD.\n',
  'Some experts advocate 

If you want to test this model in another domain, load your own *passages* file. The *passages* file stores one passage per line. You can create your own, for example, from Wikipedia abstracts (the first paragraph of each Wikipedia page) or from the passages of the SQuAD dataset. Feel free to use the provided data or your own *passages* file in the assignment.
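A minimal sketch of how such a file could be written from a list of texts follows. The helper `write_passages` and the example texts are hypothetical. The only requirement taken from this notebook is the one-passage-per-line format, so line breaks inside a passage are collapsed into spaces.

```python
import os
import tempfile

def write_passages(texts, out_path):
    """Write one passage per line, collapsing internal whitespace so each passage stays on a single line."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in texts:
            passage = " ".join(text.split())
            if passage:  # skip empty passages
                f.write(passage + "\n")
                written += 1
    return written

# Made-up example texts; replace them with your own paragraphs
my_texts = [
    "Information retrieval is the task of finding\nrelevant documents.",
    "   ",
    "BM25 is a ranking function used by search engines.",
]
# Written to a temporary directory here; in the lab you would write to your data folder
out_path = os.path.join(tempfile.mkdtemp(), "passages")
n_written = write_passages(my_texts, out_path)
print(n_written)  # 2
```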