# I.R. for Covid19

Based on https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8
(deepest appreciation to Jon Ander Campos and Arantxa Otegi, winners of the task 'What do we know about diagnostics and surveillance?' in the COVID-19 Open Research Dataset Challenge (CORD-19))

The goal of this lab is to build an I.R. system that retrieves the most relevant documents given a query related to Covid19.

We only use the freely available [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research), which contains metadata of over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS).

The main component of the system is an Information Retrieval (IR) system based on the classical BM25F search algorithm. It indexes the abstracts and the full-text paragraphs of the papers.

## 1. Install packages and load libraries<a class="anchor" id="libraries"></a>

In this section we will install all the packages and load all the libraries needed to run the code below.

In [None]:
!pip install Whoosh # search engine library

In [None]:
import codecs # base classes for standard Python codecs, like text encodings (UTF-8,...)
import os # paths and directory creation (used when creating the index)
from IPython.display import display, HTML # object displaying in different formats
from whoosh.index import * # whoosh: full-text indexing and searching
from whoosh.fields import *
from whoosh import analysis, qparser # text analyzers and query parsing
import glob
import random

## 2. Load info from data file<a class="anchor" id="files"></a>

The CORD-19 dataset includes research papers related to coronaviruses and COVID-19. In this section we first load the data. As we are not interested in all the metadata of the papers, we select just the text information, such as title, abstract and body text (already done for you).

CORD-19.v7 includes info on 51,078 papers, but some of them are repeated (they have the same *cord_uid*), so the repeated ones have already been filtered out.

As we are mostly interested in papers related to COVID-19 (and not other coronaviruses), we want to filter out papers that are about coronaviruses other than COVID-19 (for example, SARS-CoV and MERS). For that purpose, we created a list of synonyms of COVID-19 and we check if a synonym appears in the title or the abstract of a paper. 

List of synonyms used for filtering:

    'coronavirus 2019',
    'coronavirus disease 19',
    'cov2',
    'cov-2',
    'covid',
    'ncov 2019',
    '2019ncov',
    '2019-ncov',
    '2019 ncov',
    'novel coronavirus',
    'sarscov2',
    'sars-cov-2',
    'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus',
    'wuhan pneumonia',
    'wuhan virus'

In that way, we filter out those papers that do not include any of the synonyms. From now on, we will consider only the papers that we keep after filtering.
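
The filtering has already been done for you, but the idea can be sketched in a few lines. The snippet below is only an illustration, not the original preprocessing code: the function name `mentions_covid`, the case-insensitive substring matching and the two example records are assumptions.

```python
# Synonyms copied from the list above
COVID_SYNONYMS = [
    'coronavirus 2019', 'coronavirus disease 19', 'cov2', 'cov-2', 'covid',
    'ncov 2019', '2019ncov', '2019-ncov', '2019 ncov', 'novel coronavirus',
    'sarscov2', 'sars-cov-2', 'sars cov 2',
    'severe acute respiratory syndrome coronavirus 2',
    'wuhan coronavirus', 'wuhan pneumonia', 'wuhan virus',
]

def mentions_covid(title, abstract):
    """Return True if any synonym occurs in the title or the abstract."""
    # Assumption: missing fields (None) are treated as empty strings,
    # and matching is a case-insensitive substring search
    text = ((title or '') + ' ' + (abstract or '')).lower()
    return any(synonym in text for synonym in COVID_SYNONYMS)

# Hypothetical records, for illustration only
papers = [
    {'title': 'Clinical features of COVID-19 patients', 'abstract': ''},
    {'title': 'MERS-CoV outbreak in hospitals', 'abstract': 'A study of camels.'},
]
kept = [p for p in papers if mentions_covid(p['title'], p['abstract'])]
print(len(kept), "of", len(papers), "papers kept")
```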

The following cells load the passages kept after filtering (at most 5,000 of them) and print how many were loaded:

In [None]:
from google.colab import drive
drive.mount('/content/drive') # mount Google Drive under /content/drive (asks for authorization)

In [None]:
path='drive/My Drive/' #set path to nlp-app-II/data/ -> passages
count=0
passages=[]
with open(path+'passages') as f:
    for line in f:
        count+=1
        passages.append(line)
        if count == 5000: # keep only the first 5000 passages
            break
print("Number of passages related to 'COVID-19':", count)
print()
print("3 Random passages:")
print()
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])
print(passages[random.randrange(count)])

## 3. Create an index for the paper retrieval system <a class="anchor" id="index"></a>

In this lab we develop an information retrieval system: a tool that searches a collection of documents for those that are relevant to an information need. The system has two main modules: the indexing system and the query system.

The first module creates the primary data structure of the system, the index. The second module is the one users interact with: they submit a query that expresses their information need, and the module uses the index to retrieve documents for that query. In this section we will create an index, and in the next section we will develop the query system. To implement these modules we will use the [Whoosh library](https://pypi.org/project/Whoosh/), which contains functions for indexing text and then searching the index.

The index is a data structure that makes it possible to search for information in a document collection in a very efficient way. In short, it lists, for every word, all documents that contain it.
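
To make the idea concrete, here is a minimal sketch of such a word-to-documents mapping (an inverted index) in plain Python. It is not how Whoosh stores its index internally; the three mini-documents are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict

# Three hypothetical mini-documents, for illustration only
docs = {
    0: "incubation period of the virus",
    1: "virus persistence on surfaces",
    2: "incubation and transmission",
}

# Map every word to the set of ids of the documents that contain it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

print(sorted(inverted_index["virus"]))       # [0, 1]
print(sorted(inverted_index["incubation"]))  # [0, 2]
```

Answering a one-word query is then a single dictionary lookup instead of a scan over every document.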

In order to create an index, we must first define its schema. The schema lists the fields in the index. A field is a piece of information stored for each document in the index; in our case, the id of the passage and its text. We define the type of the text field as “TEXT”, which means that it will be searchable. As is common practice, we also apply the Stemming Analyzer to this field. With this analyzer, the text is tokenized, the tokens are converted to lowercase, a stopword filter removes very common words, and finally a stemming algorithm is applied.

In [None]:
# Schema definition:
# - id: type ID, unique, stored; doc id in order given the passages file
# - text: type TEXT processed by StemmingAnalyzer; not stored; content of the passage
schema = Schema(id = ID(stored=True,unique=True),
                text = TEXT(analyzer=analysis.StemmingAnalyzer())
               )
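
The four analyzer steps described above (tokenize, lowercase, remove stopwords, stem) can be imitated with a toy pipeline in plain Python. This is only a sketch for intuition: the stopword list and the suffix-stripping rule are simplified stand-ins chosen here, not the stopword list or the stemming algorithm that Whoosh actually uses.

```python
import re

# Tiny stopword and suffix lists, chosen for illustration only
STOPWORDS = {"the", "of", "on", "in", "a", "an", "and", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def toy_analyze(text):
    tokens = re.findall(r"\w+", text)                   # 1. tokenize
    tokens = [t.lower() for t in tokens]                # 2. lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3. remove stopwords
    return [naive_stem(t) for t in tokens]              # 4. stem (naively)

print(toy_analyze("The infected patients are coughing"))
# ['infect', 'patient', 'cough']
```

Because queries go through the same analysis as the indexed text, a query containing "infected" can match a passage that only contains "infecting" or "infects".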

Once we have the schema, we can create an index.


In [None]:
# Create an index
if not os.path.exists("index"):
    os.mkdir("index")

ix = create_in("index", schema)
writer = ix.writer() #run once! or restart runtime

Next, we will add documents to the index. The passages cover the papers related to COVID-19: not only the abstracts from the metadata file, but also the full text provided in PMC or PDF JSON format. As shorter documents are better for the answering system that we will develop later, we do not index the whole text of a paper as a single document. Instead, the indexing unit is an abstract or a single paragraph of the full text (as marked in the JSON files).

This could take several minutes.

In [None]:
# Add passages to the index; the id of each passage is its position in the passages list
for ind,passage_text in enumerate(passages):
    writer.add_document(id=str(ind),text=passage_text)

Finally, we will save the added documents to the index.

In [None]:
# Save the added documents
writer.commit()
print("Index successfully created")

# Sanity check
print("Number of documents (abstracts and paragraphs of papers) in the index: ", ix.doc_count())

## 4. Define a function to query the index and retrieve relevant papers <a class="anchor" id="retrieval"></a>

In this section we will define a function that takes a question and a maximum number of documents as input and retrieves the relevant passages that were indexed in the previous section.

In this function we set the scoring algorithm (we use the default one, BM25F) and the query parser, defining the default field to search (in our case, the *text* field). Then we run the query and get the most relevant documents in the index (at most *n_docs* documents).

The output of the function is a pair of lists: the texts of the retrieved documents and their scores (at most *n_docs* of each).

In [None]:
# Input: Question and maximum number of documents to retrieve
# Output: list of passage texts and list of their scores, best match first
def retrieve_docs(qstring, n_docs):
    scores=[]
    text=[]
    # Open the searcher for reading the index. The default BM25F algorithm will be used for scoring
    with ix.searcher() as searcher:
        # Define the query parser ('text' will be the default field to search), and set the input query
        q = qparser.QueryParser("text", ix.schema, group=qparser.OrGroup).parse(qstring)

        # Search using the query q, and get the n_docs documents, sorted with the highest-scoring documents first
        results = searcher.search(q, limit=n_docs)

        # Iterate over the hits while the searcher is still open;
        # each hit gives access to the stored fields of the document (here only 'id')
        for hit in results:
            scores.append(hit.score)
            text.append(passages[int(hit['id'])])
    return text,scores

Retrieve the 3 most relevant documents and their scores for the query "How long individuals are contagious?":

In [None]:
retrieve_docs("How long individuals are contagious?",3)
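
To get a feeling for where such scores come from, the sketch below implements the textbook single-field BM25 formula on a made-up, already tokenized corpus. It is not Whoosh's implementation: the exact idf variant, the function name `bm25_score` and the toy corpus are assumptions, and `k1=1.2`, `b=0.75` are simply commonly used values.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)                # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # rare terms weigh more
        length_norm = 1 - b + b * len(doc_terms) / avg_len    # penalize long documents
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical tokenized passages, for illustration only
corpus = [
    ["incubation", "period", "virus"],
    ["virus", "persistence", "surfaces"],
    ["school", "closures"],
]
query = ["incubation", "virus"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best)  # 0: the only passage that contains both query terms
```

A passage scores higher when it contains more of the query terms, when those terms are rare in the collection, and when the passage is short relative to the average.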

Test as many queries as you want:

- Range of incubation periods for the disease in humans
- Prevalence of asymptomatic shedding and transmission
- Persistence of virus on surfaces of different materials
- Immune response and immunity
- Does smoking increase risk for COVID-19?
- Risk of fatality among symptomatic hospitalized patients
- Efforts targeted at a universal coronavirus vaccine
- What is known about the efficacy of school closures?
- Is there any evidence to suggest geographic based virus mutations?

Check more queries in section 6 of https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task8

Are the retrieved documents relevant for each query?
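
To judge relevance for several queries at once, a small helper can print the top hits side by side. This is just a convenience sketch: `summarize_top_hits` and `fake_retrieve` are hypothetical names introduced here, and the stand-in retriever only exists so that the snippet runs on its own. In the notebook you would pass `retrieve_docs` instead.

```python
def summarize_top_hits(queries, retrieve, n_docs=3, width=80):
    """Run each query through `retrieve` and return one summary line per hit."""
    lines = []
    for query in queries:
        texts, scores = retrieve(query, n_docs)
        lines.append("Query: " + query)
        for rank, (text, score) in enumerate(zip(texts, scores), start=1):
            snippet = " ".join(text.split())[:width]  # collapse whitespace, truncate
            lines.append("  %d. [%.2f] %s" % (rank, score, snippet))
    return lines

# Stand-in retriever with made-up output (hypothetical);
# in the notebook, pass retrieve_docs instead
def fake_retrieve(query, n_docs):
    return ["A passage about " + query + "\n"] * n_docs, [1.0] * n_docs

for line in summarize_top_hits(["Immune response and immunity"], fake_retrieve, n_docs=2):
    print(line)
```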

If you want to test this model in another domain, load your own *passages* file. The *passages* file stores one passage per line. You can try to create your own *passages* file, for example, using Wikipedia abstracts (1st paragraph of each Wiki page) or SQuAD dataset passages. Feel free to use this data or your own *passages* in the assignment.
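
A minimal sketch of how such a file could be written is shown below. The helper name `write_passages`, the example paragraphs and the temporary output path are assumptions for illustration; the only requirement taken from the lab is the one-passage-per-line format, which is why line breaks inside a paragraph are collapsed into spaces.

```python
import os
import tempfile

def write_passages(paragraphs, out_path):
    """Write one passage per line, collapsing line breaks and skipping empty passages."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            passage = " ".join(paragraph.split())  # newlines and tabs become single spaces
            if passage:
                f.write(passage + "\n")
                written += 1
    return written

# Hypothetical paragraphs, for illustration only
paragraphs = ["First paragraph,\nsplit over two lines.", "   ", "Second paragraph."]
out_path = os.path.join(tempfile.mkdtemp(), "passages")
n_written = write_passages(paragraphs, out_path)
print(n_written)  # 2
```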