<a href="https://colab.research.google.com/github/mrkarezina/research-heatmap/blob/master/huggingface_t5_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CORD-19 doc2query-T5 Experiments

This notebook explores using the [doc2query-T5 model](https://github.com/castorini/docTTTTTquery#data-and-trained-models) to retrieve documents relevant to the research questions for each topic in the COVID-19 [Open Research Dataset](https://pages.semanticscholar.org/coronavirus-research) (CORD-19).

In [0]:
from IPython.core.display import display, HTML

In [0]:
from google.colab import drive
drive.mount('/content/drive')

## Data Loading

First we will install dependencies and download the T5 model checkpoint. 

In [1]:
!pip install transformers

# docTTTTTquery T5-base checkpoint
!curl -o t5-base.zip "https://storage.googleapis.com/doctttttquery_git/t5-base.zip"
!unzip t5-base.zip -d "t5-model-tf"
!gsutil cp gs://t5-data/pretrained_models/base/operative_config.gin t5-model-tf/
!rm t5-base.zip

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/99/50/93509f906a40bffd7d175f97fd75ea328ad9bd91f48f59c4bd084c94a25e/sacremoses-0.0.41.tar.gz (883kB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98dcac511c04f/tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)

Let's load the checkpoint of our doc2query-T5 model. We'll use the config and tokenizer from the Hugging Face T5-base model.

In [3]:
import torch
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
config = T5Config.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-model-tf/model.ckpt-1004000.index', from_tf=True, config=config)

We can check which GPU, and how much memory, the runtime was allocated. Ideally this is a Tesla P100-PCIE-16GB. With only 8GB of GPU memory, some cells may fail with a "CUDA out of memory" error.

In [4]:
# Check for GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('Using: ', torch.cuda.get_device_name(0))
else:
    print('No GPU available.')
    device = torch.device("cpu")

model = model.to(device)
!nvidia-smi

Using:  Tesla K80
Sun Apr 26 14:11:16 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    57W / 149W |   1317MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usa
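
To read the allocated GPU memory programmatically rather than from the `nvidia-smi` table, something like the following may help. It is a minimal sketch: `describe_gpu` is a hypothetical helper name, and the `torch` import is guarded so the function also runs on machines without PyTorch.

```python
def describe_gpu():
    """Return a one-line summary of the first CUDA device, if any."""
    try:
        import torch  # guarded: the sketch still runs where PyTorch is absent
    except ImportError:
        return "PyTorch is not installed."
    if not torch.cuda.is_available():
        return "No GPU available."
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024 ** 3  # total_memory is reported in bytes
    return f"{props.name}: {total_gib:.1f} GiB total memory"

print(describe_gpu())
```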

### Loading CORD-19

We will also need to download the CORD-19 dataset.

In [0]:
%%capture
%%shell
DATE=2020-04-10
DATA_DIR=./covid-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/comm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/custom_license.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/metadata.csv -P "${DATA_DIR}"

ls "${DATA_DIR}"/*.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"

We'll need to load the JSON documents into a dataframe. The following preprocessing script is adapted from https://www.kaggle.com/maksimeren/covid-19-literature-clustering.

In [0]:
import numpy as np
import pandas as pd
import glob
import json

Load all JSON documents.

In [0]:
root_path='./covid-2020-04-10'
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

59311

Load the metadata.

In [0]:
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
2,le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,False,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,fy4w7xz8,0104f6ceccf92ae8567a0102f89cbb976969a774,PMC,Association of HLA class I with severe acute r...,10.1186/1471-2350-4-9,PMC212558,12969506,no-cc,BACKGROUND: The human leukocyte antigen (HLA) ...,2003-09-12,"Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean...",BMC Med Genet,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
4,0qaoam29,5b68a553a7cbbea13472721cd1ad617d42b40c26,PMC,A double epidemic model for the SARS propagation,10.1186/1471-2334-3-19,PMC222908,12964944,no-cc,BACKGROUND: An epidemic of a Severe Acute Resp...,2003-09-10,"Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine",BMC Infect Dis,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...


In [0]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Dataset changed: papers with no abstract lack the 'abstract' key
            try:
                for entry in content['abstract']:
                    self.abstract.append(entry['text'])
            except KeyError:
                self.abstract.append("")
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'

def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data

first_row = FileReader(all_json[0])
print(first_row)

bdaa40d95b82093f60a1c5ac8b798d67cef3a52b: Here we propose a vaccination strategy for SARS-CoV-2 based on identification of both highly conserved regions of the virus and newly acquired adaptations that are presented by MHC class I and II acro... The current SARS-CoV-2 pandemic has precipitated an urgent need to rapidly develop and deploy a safe and effective vaccine. Optimally designed vaccines maximize immunogenicity towards regions of prote...
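
The `KeyError` handling above can also be written with `dict.get`. The sketch below assumes the same JSON layout that `FileReader` reads (a `paper_id`, an optional `abstract` list, and a `body_text` list whose entries have a `text` field). `parse_paper` and the sample record are made up for illustration.

```python
import json

def parse_paper(content):
    """Extract id, abstract and body text from one parsed CORD-19 JSON record."""
    # .get returns an empty list when the 'abstract' key is missing
    abstract = "\n".join(entry["text"] for entry in content.get("abstract", []))
    body_text = "\n".join(entry["text"] for entry in content["body_text"])
    return {"paper_id": content["paper_id"], "abstract": abstract, "body_text": body_text}

# Made-up record with no 'abstract' key
sample = json.loads(
    '{"paper_id": "abc123", "body_text": [{"text": "First."}, {"text": "Second."}]}'
)
paper = parse_paper(sample)
print(paper)
```

A missing abstract ends up as an empty string, which matches what `FileReader` produces.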


Load all of the documents, including the full body text, into a dataframe.

In [0]:
from tqdm.notebook import tqdm

dict_ = {'paper_id': [], 'doi':[], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for entry in tqdm(all_json):
    
    try:
        content = FileReader(entry)
    except Exception as e:
        continue  # invalid paper format, skip
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['abstract'].append(content.abstract)
    dict_['paper_id'].append(content.paper_id)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    elif len(content.abstract.split(' ')) > 100:
        # abstract provided is too long for plot, take first 100 words and append "..."
        info = content.abstract.split(' ')[:100]
        summary = get_breaks(' '.join(info), 40)
        dict_['abstract_summary'].append(summary + "...")
    else:
        # abstract is short enough
        summary = get_breaks(content.abstract, 40)
        dict_['abstract_summary'].append(summary)
        
    # reuse meta_data from the lookup above
    
    try:
        # split the semicolon-separated author string
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # more than 2 authors may not fit in the plot, so insert line breaks
            dict_['authors'].append(get_breaks('. '.join(authors), 40))
        else:
            # authors will fit in plot
            dict_['authors'].append(". ".join(authors))
    except Exception as e:
        # authors is a null value, so .split() fails; store the raw value
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information, add breaks when needed
    try:
        title = get_breaks(meta_data['title'].values[0], 40)
        dict_['title'].append(title)
    # if title was not provided
    except Exception as e:
        dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
    # add doi
    dict_['doi'].append(meta_data['doi'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'doi', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

Unnamed: 0,paper_id,doi,abstract,body_text,authors,title,journal,abstract_summary
0,bdaa40d95b82093f60a1c5ac8b798d67cef3a52b,10.1101/2020.03.31.018978,Here we propose a vaccination strategy for SAR...,The current SARS-CoV-2 pandemic has precipitat...,Mark Yarmarkovich. John M. Warrington. Alvi...,A SARS-CoV-2 Vaccination Strategy Focused on<...,,Here we propose a vaccination strategy for<br...
1,00340eea543336d54adda18236424de6a5e91c9d,10.1101/2020.03.16.20034470,"During the past three months, a new coronaviru...","In December 2019, a novel coronavirus, SARS-Co...",Carla Mavian. Simone Marini. Costanza Manes...,Regaining perspective on SARS-CoV-2<br>molecu...,,"During the past three months, a new coronavir..."
2,2ea102f58147dab02e4dea90eb90dbc67149f678,10.1101/2020.02.29.971101,The 2019 novel coronavirus (2019-nCoV or SARS-...,"A new coronavirus, named the Novel Coronavirus...",Aiping Wu. Peihua Niu. Lulan Wang. Hangyu ...,"Mutations, Recombination and Insertion in the...",,The 2019 novel coronavirus (2019-nCoV or<br>S...
3,132837acbb324a3845909b6482b90045b25519ca,10.1101/2020.03.24.20042168,"On January 23, 2020, China imposed a quarantin...","On December 31, 2019, China reported to the co...",Gustavo Cruz-Pacheco. Fernando J<br>Bustaman...,Dispersion of a new coronavirus SARS-CoV-2 by...,,"On January 23, 2020, China imposed a quaranti..."
4,073d74442e2655d79b0b3f764a627ec667ad422c,10.1101/2020.03.08.20032946,The newly emergent human virus SARS-CoV-2 is r...,Environmental transmission: transmission via c...,Luca Ferretti. Chris Wymant. Michelle<br>Ke...,Quantifying dynamics of SARS-CoV-2<br>transmi...,,The newly emergent human virus SARS-CoV-2 is<...
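
The loop above scans `meta_df` once per paper to find the matching `sha`. A dictionary built once up front would make each lookup constant-time. The sketch below uses only the standard library on a made-up two-row CSV. `build_sha_index` is a hypothetical helper, and splitting the `sha` field on `;` is an assumption about how rows with several PDF hashes are stored, not something shown in this notebook.

```python
import csv
import io

def build_sha_index(rows):
    """Map every sha hash to its metadata row (rows are dicts from csv.DictReader)."""
    index = {}
    for row in rows:
        # Assumption: a row may list several hashes separated by ';'
        for sha in row["sha"].split(";"):
            sha = sha.strip()
            if sha:  # rows without a PDF parse have an empty sha
                index[sha] = row
    return index

# Made-up metadata for illustration only
sample_csv = io.StringIO(
    "sha,title,doi\n"
    "aaa111; bbb222,First paper,10.0000/example.1\n"
    ",Paper without a PDF parse,10.0000/example.2\n"
)
index = build_sha_index(csv.DictReader(sample_csv))
print(sorted(index))
print(index["bbb222"]["title"])
```

With such an index, the per-paper lookup becomes `index.get(content.paper_id)`, and a result of `None` plays the role of the "no metadata, skip this paper" branch.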


In [0]:
# Drop empty abstracts
df_covid['abstract'] = df_covid['abstract'].replace('', np.nan)
df_covid = df_covid[df_covid['abstract'].notna()]

df_covid.shape

(26305, 8)

## Scoring CORD-19 Questions

In the following section we will experiment with querying different types of questions against CORD-19 documents to see if the loss scores meaningfully reflect the relevance of the question to the document.

In [0]:
batch_size = 5
# Truncate inputs to avoid the "token indices sequence length is longer than the specified maximum" warning
max_sequence_len = 512

def encode(doc):
  return tokenizer.encode_plus(doc, max_length=max_sequence_len, return_tensors="pt")["input_ids"]

def eval(document, questions, target_ques=None):
  display(HTML(f"<b>Doc Sample:</b> {document[:500]}"))

  scores = []
  with torch.no_grad():
    for q in questions:
      input_ids = encode(f"{document} </s>")
      question_ids = encode(f"{q} </s>")
      outputs = model(input_ids.to(device),
                      lm_labels=question_ids.to(device))
      scores.append([outputs[0], q])

  scores = sorted(scores, key=lambda x: x[0])
  for s in scores:
    if s[1] == target_ques:
      display(HTML(f"<p><b>Loss: {s[0]} Target question: {target_ques}</b></p>"))
    else:
      display(HTML(f"<p>Loss: {s[0]}  Question: {s[1]}</p>"))

To get an idea of whether the loss scores make sense, we can check if relevant questions rank higher (that is, receive a lower loss) than random ones. The loss here is the cross-entropy loss.

Consider the input document, denoted "input", and the "target" labels, which are the tokens of the question, denoted $(w_1, w_2, ...)$. We define the loss as:

$loss = -\log P(w_1 \mid input) - \log P(w_2 \mid w_1, input) - \log P(w_3 \mid w_1, w_2, input) - \dots - \log P(w_i \mid w_{i-1}, \dots, w_1, input)$

where $P(w_i \mid w_{i-1}, \dots, w_1, input)$ is the probability assigned by the model (decoder) to the token $w_i$ when fed the "input" document and the previously generated tokens $w_{i-1}, w_{i-2}, \dots, w_1$. Thus, the loss reflects the probability of the model producing all the tokens of the question given the document as input: the lower the loss, the more likely the model considers the question. Note that the value returned by the model is this sum averaged over the question tokens.
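
As a worked example of the formula above, the sketch below sums the negative log-probabilities for a three-token question. The probabilities are made up for illustration and `sequence_loss` is a hypothetical helper. It also reports the per-token mean, because PyTorch's cross-entropy loss averages over tokens by default.

```python
import math

def sequence_loss(token_probs):
    """Return (summed, per-token mean) negative log-likelihood."""
    total = -sum(math.log(p) for p in token_probs)
    return total, total / len(token_probs)

# Made-up values of P(w_i | w_{i-1}, ..., w_1, input) for a three-token question
probs = [0.5, 0.25, 0.8]
total, mean = sequence_loss(probs)
print(f"sum: {total:.4f}  mean per token: {mean:.4f}")
```

The three probabilities multiply to 0.1, so the summed loss equals $-\log 0.1 \approx 2.3026$, and the per-token mean is one third of that.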

In [0]:
# Sample queries from MS-MARCO + random questions
questions = ["what was the goal of the manhattan project",
             "who was briefed by president on the manhattan project", 
             "what was the manhattan project", 
             "who led the development of the atomic bomb",

             # Random questions
             "Efforts to support sustained education, access, and capacity building in the area of ethics",
             "does she like apples",
             "what organs are in the pancreas",
             "what is my favorite color", 
             "how many days until christmas"]
document = 'The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 … 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves.'
eval(document, questions)

print("\n")

document = 'Manhattan Project. The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons. It was led by the United States with the support of the United Kingdom and Canada. From 1942 to 1946, the project was under the direction of Major General Leslie Groves of the U.S. Army Corps of Engineers. Nuclear physicist Robert Oppenheimer was the director of the Los Alamos Laboratory that designed the actual bombs. The Army component of the project was designated the'
eval(document, questions)





It looks like relevant queries are ranking higher. Now we can check whether queries related to specific documents in the CORD-19 dataset rank higher than unrelated queries. The top document associated with each query was retrieved from [Covidex](https://covidex.ai/).

Some long-form queries are also followed by more specific "what, how, who" questions, created by breaking the long-form query down into the style of question that the doc2query-T5 model was trained on.

In [0]:
questions = [
           # Queries specific to documents
           "Tools and studies to monitor phenotypic change and potential adaptation of the virus",
 
           "Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services",
           "What is the economic impact of a pandemic",
           "What is the financial impact of a pandemic",
           "How can the economic impact of pandemic be reduced",
 
           "Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time",
           "How to track the variations of the virus over time",
           "How can monitoring of whole genomes help the development of diagnostics and therapeutics",
 
           "Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.",
 
           "Public health mitigation measures that could be effective for control",
 
           "Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries",
           "What are the best practices for telemedicine",
           "What are the barriers for telemedicine",
 
 
           # Several unrelated queries from MS-MARCO
           "what was the goal of the manhattan project",
           "why do we use cookies at chipotle",
           "who plays the team in quidditch",
            ]

# Pairs of target query and doc
for document in [("Tools and studies to monitor phenotypic change and potential adaptation of the virus", "Moving away from genome scan methods used for human GWAS (ultimately inappropriate for the short highly polymorphic genomes of RNA viruses), our work shows the power and potential of multi-class machine learning algorithms in inferring the functional genetic changes associated with phenotypic change (e.g. crossing a species barrier). We show that even distantly related viruses within a viral family share highly conserved genetic signatures of host specificity; reinforce how fitness landscapes of host adaptation are shaped by host phylogeny; and highlight the evolutionary trajectories of RNA viruses in rapid expansion and under great evolutionary pressure. We do so by (for each dataset) unveiling a set of phenotype characteristic mutations which are shown to be functionally relevant, thus providing new insights into phenotypic relationships between RNA viruses. These methods also provide a solid statistical framework with which the degree of host adaptation can be inferred, thus serving as a valuable tool for studying host transition events with particular relevance for emerging infectious diseases. These methods can then serve as rigorous tools of emergence potential assessment, specifically in scenarios where rapid host classification of newly emerging viruses can be more important than identifying putative functional sites."),
                ("Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services", "Mitigation of a severe influenza pandemic can be achieved using a range of interventions to reduce transmission. Interventions can reduce the impact of an outbreak and buy time until vaccines are developed, but they may have high social and economic costs. The non-linear effect on the epidemic dynamics means that suitable strategies crucially depend on the precise aim of the intervention. National pandemic influenza plans rarely contain clear statements of policy objectives or prioritization of potentially conflicting aims, such as minimizing mortality (depending on the severity of a pandemic) or peak prevalence or limiting the socio-economic burden of contact-reducing interventions. We use epidemiological models of influenza A to investigate how contact-reducing interventions and availability of antiviral drugs or pre-pandemic vaccines contribute to achieving particular policy objectives. Our analyses show that the ideal strategy depends on the aim of an intervention and that the achievement of one policy objective may preclude success with others, e.g., constraining peak demand for public health resources may lengthen the duration of the epidemic and hence its economic and social impact. Constraining total case numbers can be achieved by a range of strategies, whereas strategies which additionally constrain peak demand for services require a more sophisticated intervention. If, for example, there are multiple objectives which must be achieved prior to the availability of a pandemic vaccine (i.e., a time-limited intervention), our analysis shows that interventions should be implemented several weeks into the epidemic, not at the very start."),
             ("Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time", "In recent decades, many infectious diseases have significantly increased in incidence and/or geographic range, in some cases impacting heavily on human, animal or plant populations. Some of these ‘emerging infectious diseases’ are associated with pathogens that have appeared in populations for the first time as a result of cross-species transmission (e.g. human immunodeficiency virus—acquired immunodeficiency syndrome (HIV-AIDS), severe acute respiratory syndrome (SARS)), while others were previously known but are rapidly increasing in incidence or geographic range as a result of underlying epidemiological changes (e.g. multi-drug resistant Staphylococcus aureus (MRSA) infection, dengue, West Nile encephalitis, foot and mouth disease, cassava mosaic disease). The latter include prominent diseases as tuberculosis, malaria and yellow fever that were once on the decline but are now ‘re-emerging diseases’."),
             ("Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.", "Immune enhancement (antibody-dependent enhancement, ADE) has been clearly shown to occur in experimental laboratory infections of cats previously infected by natural or experimental infection, and of cats previously vaccinated with Primucell FIP vaccine, experimental MLV vaccines, experimental inactivated vaccines, and experimental recombinant vaccines containing the S gene (McArdle et al., 1992, 1995; Ngichabe, 1992; Scott et al., 1992, 1995a,b; Weiss and Scott, 1981). Antibodies to the S protein produced by the host result in enhanced infection of macrophages via Fc receptors, and the infected macrophages then transport the virus throughout the body. In the enhanced infection there is a decrease in incubation time—as short as 1–2 days—after exposure to virulent FIPV. The relative amount of virus and antibodies is important in order for ADE to occur. Higher concentrations of antibody neutralize the virus, but as the concentration of antibody decreases a concentration occurs where enhanced infection results. Other related coronaviruses can cause enhanced FCoV infection in the cat, including CCV."),
                           ("Public health mitigation measures that could be effective for control", "The novel coronavirus disease 2019 (COVID-19) outbreak on the Diamond Princess ship has caused over 634 cases as of February 20, 2020. We model the transmission process on the ship with a stochastic model and estimate the basic reproduction number at 2.2 (95%CI: 2.1−2.4). We estimate a large dispersion parameter than other coronaviruses, which implies that the virus is difficult to go extinction. The epidemic doubling time is at 4.6 days (95%CI: 3.0−9.3), and thus timely actions were crucial. The lesson learnt on the ship is generally applicable in other settings."),
              ("Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries", "Even before the arrival of COVID-19, telemedicine was increasingly being adopted to bring specialty-palliative care into the homes of seriously ill patients and their families. Patients who receive palliative care by telemedicine are typically very satisfied with the convenience and timesaving of video care. Telemedicine also saves valuable drive-time for home-visiting palliative care clinicians and increases capacity at brick-and-mortar clinics.1 With the emergence of COVID-19, telemedicine has been catapulted into the role of a critically essential service for patients to help mitigate the spread of COVID19 and preserve valuable personal protective equipment. For example, the University of California, SanFrancisco (UCSF) has mandated telemedicine be used to care for palliative care and nonpalliative care patients in ambulatory settings, whenever possible. Similarly, many hospice agencies are currently offering most, if not all, social work and chaplaincy support by telemedicine. For hospitals, strict limitations on visitors have meant that some inpatient palliative care consult programs are performing family meetings and consults virtually. To support these changes, many telemedicine")
             ]:
  eval(document[1], questions, target_ques=document[0])
  print("")
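
A single number per document makes these comparisons easier to summarize: the position of the target question once all questions are sorted by loss. A minimal sketch, with made-up loss values and a hypothetical helper `target_rank`:

```python
def target_rank(scored_questions, target):
    """Return the 1-based rank of `target` when questions are sorted by ascending loss."""
    ranked = sorted(scored_questions, key=lambda pair: pair[0])
    for position, (_, question) in enumerate(ranked, start=1):
        if question == target:
            return position
    return None  # target not among the scored questions

# Made-up (loss, question) pairs for illustration
scores = [
    (4.1, "why do we use cookies at chipotle"),
    (2.3, "What are the barriers for telemedicine"),
    (3.0, "what was the goal of the manhattan project"),
]
print(target_rank(scores, "What are the barriers for telemedicine"))
```

A result of `None` signals that the target string does not exactly match any entry in the question list, which is worth checking because the comparison is plain string equality.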

We can query different types of questions over many CORD-19 documents to see if there are certain queries the model favors.

We see that some queries from the MS-MARCO training set, as well as the "childhood death" query, which is not from the training set, consistently rank near the top.

In [0]:
questions = [
             # Specific to documents
             "what is the leading cause of childhood death in the world", # Doc 3
             "how can regulation help reduce foreign pathogens", # Doc 2
             
             # CORD-19 topic questions
             "what new drugs are being developed", 
             "effectiveness of drugs being developed and tried to treat COVID-19 patients.", 
             "exploration of use of best animal models and their predictive value for a human vaccine.",
             "capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.",
             "natural history of the virus and shedding of it from an infected person",
             "implementation of diagnostics and products to improve clinical processes",

              # What, how style questions
              "what is the incubation period of COVID-19",
              "what is the effectiveness of chloroquine for COVID-19",
              "what is the duration of viral shedding for COVID-19",
              "how does COVID-19 bind to the ACE2 receptor",
              "how do weather conditions affect the transmission of COVID-19",
              "tell me about IgG and IgM tests for COVID-19",
              "what is the prognostic value of IL-6 levels in COVID-19",
             
             # Predicted queries from MS-MARCO
             "what was the goal of the manhattan project",
             "who was briefed by president on the manhattan project",
             "why do we use cookies at chipotle",
             "who plays the team in quidditch",
             
             # Random
             "what is my favorite color", 
             "how many days until christmas"]


for index, row in df_covid.iterrows():
  if index > 200:
    break
  display(HTML(f"<p><b>Title:</b> {row['title']}</p>"))
  document = row['abstract']
  eval(document, questions)
  print("")

KeyboardInterrupt: ignored

Now let's compute the losses for a question over the dataset. We can check whether "what is the financial impact of a pandemic" or "what was the goal of the manhattan project" has a lower mean loss, and also compare the losses of the top-scoring documents.

This will also allow us to estimate the inference speed for scoring one query over the entire CORD-19 dataset.

In [0]:
# # For CUDA out of memory error
# torch.cuda.empty_cache()
# import gc
# gc.collect()

import numpy as np
from tqdm.notebook import tqdm

test_size = 40000
top_n = 10

def save_results(data, id):
  with open(f"/content/drive/My Drive/results_{id}.json", "w") as file:
    file.write(json.dumps(data, indent=4)) 
  print("Saved to Drive")

def test_question(question):
  question_ids = tokenizer.encode(question, return_tensors="pt")
  losses = []
  results = {
      "top_docs": []
  }

  with torch.no_grad():
    for doc in tqdm(df_covid['abstract'][:test_size]):
      input_ids = encode(doc)

      outputs = model(input_ids.to(device), 
                    lm_labels=question_ids.to(device))
      
      # Outputs of forward pass
      # Prediction scores for each vocabulary token before softmax
      # lm_labels provided to return loss
      loss, prediction_scores = outputs[:2]
      losses.append(loss.item())

    # Display the documents with the lowest loss; argpartition returns the top_n smallest indices, not in sorted order
    losses = np.array(losses)
    min_indices = np.argpartition(losses, top_n)[:top_n]

    print(f"Mean loss: {np.mean(losses)}  Question: {question}")
    print("Best Matches:")
    for i in min_indices:
      results["top_docs"].append({'Loss': losses[i], 'Title': df_covid.iloc[i]['title'], 'Abstract': df_covid.iloc[i]['abstract'][:500]})
      print(f"Loss: {losses[i]} Title: {df_covid.iloc[i]['title']}\n  Sample: {df_covid.iloc[i]['abstract'][:500]}")

    return results
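
`np.argpartition` returns the `top_n` smallest losses in no particular order, which is why the best matches are not printed in sorted order. If an ordered list is preferred, one option is `heapq.nsmallest` from the standard library, swapped in here for `np.argpartition`. The sketch uses made-up losses and a hypothetical helper `lowest_loss_indices`.

```python
import heapq

def lowest_loss_indices(losses, top_n):
    """Return indices of the top_n smallest losses, ordered from lowest to highest loss."""
    return heapq.nsmallest(top_n, range(len(losses)), key=lambda i: losses[i])

# Made-up losses for illustration
losses = [3.4, 1.2, 2.8, 1.1, 3.9]
best = lowest_loss_indices(losses, 3)
print(best, [losses[i] for i in best])
```

Unlike `np.argpartition`, this also works when `top_n` is larger than the number of documents, in which case it simply returns every index in sorted order.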


In [0]:
res = {}
for q in ['what is the financial impact of a pandemic', 'what was the goal of the manhattan project']:
  res[q] = test_question(q)

save_results(res, 1)

Mean loss: 3.3629893023893875  Question: what is the financial impact of a pandemic
Best Matches:
Loss: 1.2454053163528442 Title:  Impact of pandemic control over airport<br>economics: Reconciling public health with airport<br>business through a streamlined approach in pandemic<br>control
  Sample: Rapid aviation commercialisation and upsurge in worldwide affluence created a new avenue for disease proliferation across countries at an unprecedented rate. Epidemic and pandemic occurrences over the last decade demonstrate airports' role in disease transmission; while also exhibiting their importance as containment nodes. Tremendous amount of resources and effort are necessary to achieve the latter but inevitably, disrupt normal operations. The contrasting objectives between public health auth
Loss: 1.1110522747039795 Title:  Pandemic Influenza Planning in Nursing Homes:<br>Are We Prepared?
  Sample: Avian influenza or Influenza A (H5N1) is caused by a viral strain that occurs naturally i


Mean loss: 3.6796209783704534  Question: what was the goal of the manhattan project
Best Matches:
Loss: 2.1684350967407227 Title:  The Third Annual Meeting of the European Virus<br>Bioinformatics Center
  Sample: viral infections and outbreaks, being successfully used to detect, control, and treat infections of humans and animals. This active field of research has attracted approximately 110 experts in virology and bioinformatics/computational biology from Europe and other parts of the world to attend the two-day meeting in Glasgow to increase scientific exchange between laboratory-and computer-based researchers. The meeting was held at the McIntyre Building of the University of Glasgow; a perfect locati
Loss: 2.344064950942993 Title:  Reverse Genetics of Measles Virus and<br>Resulting Multivalent Recombinant Vaccines:<br>Applications of Recombinant Measles Viruses
  Sample: An overview is given on the development of technologies to allow reverse genetics of RNA viruses, i.e., the res

The "what is the financial impact of a pandemic" question has a slightly lower mean loss (3.36 vs. 3.68), even though the Manhattan Project question was seen frequently in training.

The losses of the top documents returned for the financial impact query are significantly lower than those returned for the Manhattan Project query, although the mean losses of the two queries are similar.
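One way to put these two observations side by side is to report, for each query, the lowest-loss documents in ascending order together with the gap between the mean loss and the best loss. The sketch below is only an illustration: `summarize_losses` is a hypothetical helper that is not part of this notebook, and the toy values are made up rather than taken from the runs above.

```python
import numpy as np

def summarize_losses(losses, top_n):
    """Return the top_n lowest-loss indices (best first) and the mean-to-best gap."""
    losses = np.asarray(losses, dtype=float)
    top_n = min(top_n, len(losses))
    # argpartition finds the top_n smallest losses but leaves them unordered
    candidates = np.argpartition(losses, top_n - 1)[:top_n]
    # Sort only those candidates so the best match comes first
    ordered = candidates[np.argsort(losses[candidates])]
    # A large gap means the best document stands out from the corpus average
    gap = losses.mean() - losses[ordered[0]]
    return ordered, gap

# Toy losses (illustrative values only)
toy_losses = [3.2, 1.1, 4.0, 1.3, 3.6, 2.9]
ordered, gap = summarize_losses(toy_losses, top_n=3)
print("Best-first indices:", ordered)
print("Mean minus best loss:", round(gap, 3))
```

A query whose best documents sit far below its mean loss is likely to have genuinely relevant abstracts in the corpus, whereas a small gap suggests no abstract stands out.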

Open question: how do these results compare with COVIDEX?

Now let's see whether particular COVID-19-related questions receive low loss scores.
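As a reminder of what a low score means: with labels supplied, the loss returned by the model is, to my understanding, the cross-entropy between the predicted token distributions and the question tokens, averaged over the question's tokens. The NumPy sketch below reproduces that calculation on toy logits. The function name, the four-word vocabulary, and the values are assumptions made for illustration, not the model's actual outputs.

```python
import numpy as np

def mean_token_cross_entropy(logits, label_ids):
    """Average negative log-likelihood of label_ids under per-position logits."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target token
    token_nll = -log_probs[np.arange(len(label_ids)), label_ids]
    return token_nll.mean()

# Toy example: 3 question tokens over a 4-word vocabulary
toy_logits = [[4.0, 0.0, 0.0, 0.0],   # confident and correct
              [0.0, 3.0, 0.0, 0.0],   # fairly confident and correct
              [0.0, 0.0, 0.0, 0.0]]   # uniform, i.e. no idea
toy_labels = [0, 1, 2]

toy_loss = mean_token_cross_entropy(toy_logits, toy_labels)
print("Loss:", toy_loss)
print("Perplexity:", np.exp(toy_loss))
```

Because the value is averaged per token, questions of different lengths remain roughly comparable, and `exp(loss)` can be read as the model's perplexity on the question given the abstract.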

In [0]:
questions = ["what new drugs are being developed",
             "what is the incubation period of COVID-19",
             "what is the effectiveness of chloroquine for COVID-19",
             "what is the duration of viral shedding for COVID-19",
             "how does COVID-19 bind to the ACE2 receptor",
             "how do weather conditions affect the transmission of COVID-19",
             "tell me about IgG and IgM tests for COVID-19",
             "what is the prognostic value of IL-6 levels in COVID-19"]

res = {}
for q in questions:
  res[q] = test_question(q)

save_results(res, 2)


Mean loss: 4.359434314133028  Question: what new drugs are being developed
Best Matches:
Loss: 1.6767781972885132 Title:  Challenges and recent progress in drug<br>discovery for tropical diseases
  Sample: Infectious tropical diseases have a huge effect in terms of mortality and morbidity, and impose a heavy economic burden on affected countries. These diseases predominantly affect the world's poorest people. Currently available drugs are inadequate for the majority of these diseases, and there is an urgent need for new treatments. This Review discusses some of the challenges involved in developing new drugs to treat these diseases and highlights recent progress. While there have been notable succ
Loss: 1.6075297594070435 Title:  Novel Inhibitor Design for Hemagglutinin<br>against H1N1 Influenza Virus by Core Hopping Method
  Sample: The worldwide spread of H1N1 avian influenza and the increasing reports about its resistance to the current drugs have made a high priority for developin


Mean loss: 4.668094021432007  Question: what is the incubation period of COVID-19
Best Matches:
Loss: 0.6816296577453613 Title:  In-flight Transmission Cluster of COVID-19: A<br>Retrospective Case Series
  Sample: Objectives: No data were available about in-flight transmission of SARS-CoV-2. Here, we report an in-flight transmission cluster of COVID-19 and describe the clinical characteristics of these patients.
Methods: After a flight, laboratory-confirmed COVID-19 was reported in 12 patients. Ten patients were admitted to the designated hospital. Data were collected from 25 th January to 28 th February 2020. Clinical information was retrospectively collected.
Results: All patients are passengers without
Loss: 0.690512478351593 Title:  Epidemiologic Characteristics of COVID-19 in<br>Guizhou, China
  Sample: 162 laboratory-confirmed cases related to COVID-19. We described the demographic 15 characteristics of the cases and estimated the incubation period, serial interval and 16 basic 


Mean loss: 5.334950432854899  Question: what is the effectiveness of chloroquine for COVID-19
Best Matches:
Loss: 0.9819251894950867 Title:  A systematic review on the efficacy and safety<br>of chloroquine for the treatment of COVID-19
  Sample: a b s t r a c t a r t i c l e i n f o
Purpose: COVID-19 (coronavirus disease 2019) is a public health emergency of international concern. As of this time, there is no known effective pharmaceutical treatment, although it is much needed for patient contracting the severe form of the disease. The aim of this systematic review was to summarize the evidence regarding chloroquine for the treatment of COVID-19. Methods: PubMed, EMBASE, and three trial Registries were searched for studies on the use of
Loss: 1.6743583679199219 Title:  Azithromycin and ciprofloxacin have a<br>chloroquine-like effect on respiratory epithelial cells
  Sample: There is interest in the use of chloroquine/hydroxychloroquine (CQ/HCQ) and azithromycin (AZT) in COVID-19 thera


Mean loss: 5.83694266549219  Question: what is the duration of viral shedding for COVID-19
Best Matches:
Loss: 1.2328652143478394 Title:  Clinical course and risk factors for mortality<br>of adult inpatients with COVID-19 in Wuhan,<br>China: a retrospective cohort study
  Sample: Background Since December, 2019, Wuhan, China, has experienced an outbreak of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Epidemiological and clinical characteristics of patients with COVID-19 have been reported but risk factors for mortality and a detailed clinical course of illness, including viral shedding, have not been well described.
In this retrospective, multicentre cohort study, we included all adult inpatients (≥18 ye
Loss: 1.1800122261047363 Title:  Virologic and clinical characteristics for<br>prognosis of severe COVID-19: a retrospective<br>observational study in Wuhan, China
  Sample: The severe acute respiratory syndrome coron


Mean loss: 6.173504550682232  Question: how does COVID-19 bind to the ACE2 receptor
Best Matches:
Loss: 2.231472969055176 Title:  The SARS-CoV-2 exerts a distinctive strategy<br>for interacting with the ACE2 human receptor
  Sample: The COVID-19 disease has plagued over 110 countries and has resulted in over 4,000 deaths within 10 weeks. We compare the interaction between the human ACE2 receptor and the SARS-CoV-2 spike protein with that of other pathogenic coronaviruses using molecular dynamics simulations. SARS-CoV, SARS-CoV-2, and HCoV-NL63 recognize ACE2 as the natural receptor but present a distinct binding interface to ACE2 and a different network of residue-residue contacts. SARS-CoV and SARS-CoV-2 have comparable bi
Loss: 2.3017055988311768 Title:  The first-in-class peptide binder to the<br>SARS-CoV-2 spike protein
  Sample: Coronavirus disease 19 is an emerging global health crisis. With over 200,000 29 confirmed cases to date, this pandemic continues to expand, spurring res


Mean loss: 6.3802170541344925  Question: how do weather conditions affect the transmission of COVID-19
Best Matches:
Loss: 2.1481802463531494 Title:  A mathematical model of COVID-19 transmission<br>between frontliners and the general public
  Sample: The number of COVID-19 cases is continuously increasing in different countries (as of March 2020) including the Philippines. It is estimated that the basic reproductive number of COVID-19 is around 1.5 to 4. The basic reproductive number characterizes the average number of persons that a primary case can directly infect in a population full of susceptible individuals.
However, there can be superspreaders that can infect more than this estimated basic reproductive number. In this study, we formul
Loss: 1.8399919271469116 Title:  Evidence that higher temperatures are<br>associated with lower incidence of COVID-19 in pandemic<br>state, cumulative cases reported up to March 27, 2020
  Sample: Seasonal temperature variation may impact the tra


Mean loss: 5.999646185179034  Question: tell me about IgG and IgM tests for COVID-19
Best Matches:
Loss: 2.5236029624938965 Title:  Clinical significance of IgM and IgG test for<br>diagnosis of highly suspected COVID-19 infection
  Sample: Quick, simple and accurate diagnosis of suspected COVID-19 is very important for the screening and therapy of patients. Although several methods were performed in clinical practice, however, the IgM and IgG diagnostic value evaluation was little performed. 57 suspected COVID-19 infection patients were enrolled in our study. 24 patients with positive and 33 patients with negative nucleic acid test. The positive rate of COVID-19 nucleic acid was 42.10%. The positive detection rate of combination o
Loss: 2.9075546264648438 Title:  Antibody responses to SARS-CoV-2 in patients<br>of novel coronavirus disease 2019
  Sample: The novel coronavirus SARS-CoV-2 is a newly emerging virus. The antibody response in infected patient remains largely unknown, and th


Mean loss: 4.875280444777481  Question: what is the prognostic value of IL-6 levels in COVID-19
Best Matches:
Loss: 1.9586209058761597 Title:  Potential Factors for Prediction of Disease<br>Severity of COVID-19 Patients
  Sample: Objective: Coronavirus disease 2019 (COVID-19) is an escalating global epidemic caused by SARS-CoV-2, with a high mortality in critical patients. Effective indicators for predicting disease severity in SARS-CoV-2 infected patients are urgently needed. Methods: In this study, 43 COVID-19 patients admitted in Chongqing Public Health Medical Center were involved. Demographic data, clinical features, and laboratory examinations were obtained through electronic medical records. Peripheral blood speci
Loss: 1.5714422464370728 Title:  Detectable serum SARS-CoV-2 viral load<br>(RNAaemia) is closely associated with drastically<br>elevated interleukin 6 (IL-6) level in critically ill<br>COVID-19 patients
  Sample: Although the SARS-CoV-2 viral load detection of respira

# Testing Hugging Face API

As a sanity check, we can test whether the Hugging Face API combined with the doc2query-T5 checkpoint predicts queries similar to the ones generated by the [doc2query model and T5 CLI](https://github.com/castorini/docTTTTTquery#t5-inference-predicting-queries-from-passages).

In [0]:
%%capture
# Predicted questions
!curl -o predicted_queries_topk_sampling.zip "https://storage.googleapis.com/doctttttquery_git/predicted_queries_topk_sampling.zip"
!unzip predicted_queries_topk_sampling.zip -d "predicted_queries"
!rm -f predicted_queries_topk_sampling.zip

# MS MARCO Dataset
!curl "https://storage.googleapis.com/doctttttquery_git/collection.tar.gz" --output collection.tar.gz
!tar -xvf collection.tar.gz
!rm collection.tar.gz

In [0]:
import pandas as pd

df = pd.read_csv("collection.tsv",sep='\t', header=None)

In [7]:
pd.options.display.max_colwidth = 200
df.head(20)

Unnamed: 0,0,1
0,0,The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievemen...
1,1,The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2,2,Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making...
3,3,"The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 … 2-1946 under the ..."
4,4,"versions of each volume as well as complementary websites. The first website–The Manhattan Project: An Interactive History–is available on the Office of History and Heritage Resources website,..."
5,5,"The Manhattan Project. This once classified photograph features the first atomic bomb – a weapon that atomic scientists had nicknamed Gadget.. The nuclear age began on July 16, 1945, when it was..."
6,6,Nor will it attempt to substitute for the extraordinarily rich literature on the atomic bombs and the end of World War II. This collection does not attempt to document the origins and development ...
7,7,Manhattan Project. The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons. It was led by the United States with the support of...
8,8,"In June 1942, the United States Army Corps of Engineersbegan the Manhattan Project- The secret name for the 2 atomic bombs."
9,9,"One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the Nort..."


Compare the generated queries.
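Beyond reading the two sets of queries side by side, a rough numeric check is possible. The sketch below scores each Hugging Face query by its best word-level Jaccard overlap with any of the doc2query-T5 predictions. Both helper functions are hypothetical and not part of this notebook, and Jaccard overlap is only a crude proxy for query similarity. The example strings are copied from the output printed for the first passage.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercase word sets of two queries."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def best_overlap(generated, reference):
    """For each generated query, the highest Jaccard score against any reference query."""
    return [max(jaccard(g, r) for r in reference) for g in generated]

# Queries copied from the notebook output for the first passage
hf_queries = [
    "what is their success",
    "which is true of the success of the atomic scientists and engineers",
]
t5_queries = [
    "what was important to the success of the manhattan project",
    "why was the manhattan project important?",
]

for query, score in zip(hf_queries, best_overlap(hf_queries, t5_queries)):
    print(f"{score:.2f}  {query}")
```

Consistently low scores across many passages would suggest that the two inference paths differ in more than sampling noise, for example in tokenization or decoding settings.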

In [11]:
docs_to_test = 10
num_questions = 5

for i, doc in enumerate(df[1][:docs_to_test]):
  doc_token_ids = tokenizer.encode(doc, return_tensors="pt")
  
  # Top-k sampling (not greedy decoding), one sequence per requested question
  sampled_outputs = model.generate(
    doc_token_ids.to(device),
    do_sample=True,
    max_length=64,
    top_k=10,
    num_return_sequences=num_questions
  )

  for j, sample_output in enumerate(sampled_outputs):
    print("{}: {}".format(j, tokenizer.decode(sample_output, skip_special_tokens=True)))

  print("\n ---- doc2query-T5 predictions ---- \n")
  for j in range(num_questions):
    with open(f"predicted_queries/predicted_queries_topk_sample00{j}.txt000-1004000", 'r') as qs:
      for line_num, q in enumerate(qs):
        if line_num == i:
          print(q.strip())
          break

  print("-"*50)

0: what was the success of the atomic researchers and engineers means
1: what is their success
2: which cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant
3: which is true of the success of the atomic scientists and engineers
4: what is the cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success means; hundreds of thousands of innocent lives were obliterated.

 ---- doc2query-T5 predictions ---- 

what was important to the success of the manhattan project
why was the manhattan project important?
what was important about the manhattan project
why was the success of the manhattan project so important?
who was the manhattan project a scientific project for
--------------------------------------------------
0: why did the manhattan project impact the united states
1: what legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2: what