# Questions and Answers on the COVID-19 articles
### - Find Documents with TF-IDF, Ask Questions with Transformers

This notebook searches over the COVID research papers dataset and tries to find some answers to the questions posed for the related Kaggle tasks. These are not intended to be direct answers (in all cases) but to point to relevant papers and highlight potentially interesting points that specific papers cover related to the task questions.

From the technical perspective this uses [TF-IDF scores](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) across the papers to find papers related to the questions, sorts them according to these scores, and then uses the [Huggingface Transformers library](https://github.com/huggingface/transformers) and its [Question-Answers model (pipeline)](https://huggingface.co/transformers/main_classes/pipelines.html#questionansweringpipeline) to answer the questions. 

The answers given by the model are ranked by the confidence given by the model. Only answers above a specific confidence threshold are selected. The threshold confidence is varied across questions based on running the model with the question on a smaller subset and tuning the threshold to what produces useful results for that question (in my opinion).

I believe the results are most useful to help someone going through a large body of research to find interesting papers to look into that they might have otherwise not noticed, or to help focus search over large sets of documents. They may also highlight some ideas that could otherwise be missed.

In [None]:
import os
import numpy as np
import pandas as pd 
from memory_profiler import profile
from typing import List
import pickle
from gensim.models.word2vec import Word2Vec

from tqdm.auto import tqdm
tqdm.pandas()


Install some profiling tools to help track where the memory and CPU goes:

In [None]:
!pip install memory_utils

In [None]:
!pip install codeprofile

In [None]:
import memory_utils
from codeprofile import profiler

Set a few parameters that are used later in the notebook. These limit the resource use to fit in the limits of a Kaggle kernel. They determine how many documents at most to parse looking for answers to a question (DOC_LIMIT), and the maximum size of a document that is parsed for question-answering (DOC_SIZE_CAP):

In [None]:
#limit number of documents if DOC_LIMIT != None. 
#allows testing the code without waiting 2 days, and to run a large set of questions in the notebook timelimit / resources
DOC_LIMIT = 42
DOC_SIZE_CAP = 100000 #caps document length at 100k characters for processing, saves memory and processing time

# Preprocessed data and libraries

## Transformers

Transformers are a deep learning model for natural language processing. Here I import and later use the one by [Hugging Face](https://github.com/huggingface/transformers/).

In [None]:
!pip install --upgrade transformers

In [None]:
from transformers import pipeline

nlp = pipeline("question-answering")

## Preprocessed Datasets

This notebook uses a number of datasets I created previously. They save me resources as they provide a lot of preprocessed data to use, and allow focusing this notebook on the question-answering:

- word2vec: Relations of the dataset words in 300-dimensional vector space. Useful to find related words, such as synonyms for queries.
- Inverted index: Maps words to their TF-IDF scores in different documents. Useful to find highly relevant documents for a set of keywords. Those docs can then be fed to the question-answer model.
- TF-IDF matrix: A set of statistical scores on how frequent words are across all documents vs. in a specific document. Used as input to the inverted index.
- Doc ids: Maps the document identifiers in the inverted index back to the original Kaggle data, and metadata such as publication dates, authors, and journal.

In [None]:
with open("/kaggle/input/covid-word2vec/word2vec.pickle", "rb") as f:
    w2v = pickle.load(f)

In [None]:
with open("/kaggle/input/covid-tfidf/i_index.pickle", "rb") as f:
    i_index = pickle.load(f)

In [None]:
with open("/kaggle/input/covid-tfidf/tfidf_matrix.pickle", "rb") as f:
    tfidf_matrix = pickle.load(f)

In [None]:
with open("/kaggle/input/covid-tfidf/doc_ids.pickle", "rb") as f:
    doc_ids = pickle.load(f)

### Load the document metadata provided by Kaggle

In [None]:
#need to be able to load all the documents for the question-answering later, so load up all the paths first to identify where to find the document later.
import glob, os, json

def load_doc_paths():
    all_file_paths = []
    base_paths = [
        "/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/*",
        "/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pmc_json/*",
        "/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json/*",
        "/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pmc_json/*",
        "/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pdf_json/*",
        "/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pmc_json/*",
        "/kaggle/input/CORD-19-research-challenge/custom_license/custom_license/pdf_json/*",
        "/kaggle/input/CORD-19-research-challenge/custom_license/custom_license/pmc_json/*",
    ]
    for base_path in base_paths:
        file_paths_glob = glob.glob(base_path)
        all_file_paths.extend(file_paths_glob)
    return all_file_paths

In [None]:
all_doc_paths = load_doc_paths()

The metadata describes the journal, authors, and similar details about each article in the dataset. It is needed to give a more meaningful context for the answers once they are found.

In [None]:
df_metadata = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
df_metadata.head()

### Inverted Index

The inverted index comes from one of my [previous notebooks](https://www.kaggle.com/donkeys/tf-idf-and-inverted-index-creation-for-covid19), and is made available as a [dataset](https://www.kaggle.com/donkeys/covid-tfidf) I import in this notebook. What does an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) mean? I would consider a normal index to map documents to the words they contain. This inverted index goes the other way: it maps each word to the documents that contain it, stored as document-weight pairs.

For example, if we take the word "patient" from the  inverted index *i_index*:

In [None]:
i_index["patient"]

Depending on how the dataset evolves, the specific numbers above might change over time. But at the time of writing this, it said the following for the first two rows:

```
array([[3.2264000e+04, 6.1301637e-01],
       [9.9830000e+03, 6.1036140e-01],
       ...,
```

This shows an array where each element is another array of two elements. The above is basically saying:
- Document at index 32264 has a TF-IDF score of 0.61301637 for the word "patient"
- Document at index  9983 has a TF-IDF score of 0.61036140 for the word "patient"
- And so on...

Notice that all the documents are sorted by their TF-IDF score, so in this case the first item in the array is the document with the highest score for "patient", and the last one has the lowest score. I may have capped the scores at some small value when the index was created, to save space and computation.

In any case, this index allows quickly looking up the documents with the highest scores for words of interest. The idea was to use this to build the set of documents for question-answering per specific terms, but in this case I first translate it to a word-doc weight dictionary.
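
To make the structure concrete, here is a minimal sketch of how such an index could be built from a TF-IDF matrix. The matrix, the vocabulary, and the function name `build_toy_inverted_index` are made up for illustration. The real index was built in the earlier notebook and may differ in its details.

```python
import numpy as np

# Toy TF-IDF matrix (made-up numbers): rows are documents, columns are words.
toy_vocab = ["patient", "virus"]
toy_tfidf = np.array([
    [0.10, 0.00],   # doc 0
    [0.61, 0.20],   # doc 1
    [0.30, 0.45],   # doc 2
])

def build_toy_inverted_index(tfidf, vocab):
    """Map each word to an array of [doc_idx, weight] rows, highest weight first."""
    index = {}
    for col, word in enumerate(vocab):
        weights = tfidf[:, col]
        # skip documents that do not contain the word at all
        doc_idxs = np.nonzero(weights)[0]
        # order the remaining documents by descending weight
        order = doc_idxs[np.argsort(-weights[doc_idxs])]
        index[word] = np.column_stack([order, weights[order]])
    return index

toy_index = build_toy_inverted_index(toy_tfidf, toy_vocab)
print(toy_index["patient"])
```

As in the real *i_index*, the document indices come out as floats, because they share one array with the weights.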

**Weight Dictionaries**

Besides the inverted index, which stores the documents for each word as a sorted array, it is also useful to have a dictionary-based mapping available: from each word to a dictionary of document IDs and their weights for that word. The following builds that. For example:

```
  word_dicts["bob"]={1:0.5,
                     2:0.1,
                     7:0.3}
```

The above shows an example where word "bob" maps to 3 documents with ID's of 1,2, and 7. Each having a weight of 0.5, 0.1, and 0.3 respectively, for the word "bob".

This weight dictionary is actually what gets used in this notebook later to find the docs for a set of (key)words of interest.


In [None]:
def build_dicts(threshold):
    word_weight_dicts = {}
    for word in tqdm(i_index.keys()):
        doc_weights = i_index[word]
        doc_weight_dict = {}
        word_weight_dicts[word] = doc_weight_dict
        for doc_idx, doc_weight in doc_weights:
            doc_idx = int(doc_idx)
            doc_weight_dict[doc_idx] = doc_weight
            #reduce sizes by capping on some number of docs
            if doc_weight < threshold and len(doc_weight_dict) > 1000:
                break
    return word_weight_dicts
            

### Word2Vec

Word2vec is quite a traditional NLP model these days. It describes relations of words across documents. For example, maybe "patient" and some other words are used similarly? Example:

In [None]:
w2v.init_sims()

In [None]:
w2v.similar_by_word("patient", 10)

I use word2vec to find pairs of words used similarly. Looking at the top 10 words reported (as for "patient" above), I loop through them and pick the ones that have a similarity score of at least 0.5. These are later used as synonyms to identify documents related to a topic of interest. Why 10 and 0.5? Consider those as hyperparameters I chose based on some experiments with this data.

In [None]:
def find_pairs(words_sentence):
    words = words_sentence.split(" ")
    word_lists = []
    for word in words:
        synonyms = w2v.similar_by_word(word, topn = 10)
        selected = [word]
        for synonym in synonyms:
            #if word2vec similarity score is less than 0.5, stop adding. expect input to be sorted..
            if synonym[1] < 0.5:
                break
            selected.append(synonym[0])
        word_lists.append(selected)
    
    for word_list in word_lists:
        print("synonyms found:")
        print(f"{word_list[0]}: {word_list[1:]}")
    
    return word_lists


Since there can be quite many documents with the keywords of interest, I need to filter them. To pick the cut-off score, I compute the 70th percentile over all the document-word TF-IDF weights in the inverted index and use that as the threshold. So if a word is in a document, but its TF-IDF weight for that document falls in the lowest 70% of all weights, the document is generally not included for that word. The exception is that *build_dicts* above keeps roughly the 1000 highest-scoring documents for each word even when they fall below the threshold.

For example:

- Imagine 70% of all document-word TF-IDF scores are 0.6 or less. 0.6 is then selected as the threshold, and a document with a TF-IDF score of less than 0.6 for "patient" is not included for that word.
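
The percentile cut-off can be tried out on a handful of numbers. This is only a sketch with made-up weights, and it ignores the extra rule in *build_dicts* that keeps a minimum number of documents per word:

```python
import numpy as np

# Made-up weights standing in for all the doc-word TF-IDF scores in the index.
toy_weights = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90])

# The 70th percentile: about 70% of the weights are at or below this value.
toy_threshold = np.percentile(toy_weights, 70)

# Only weights above the threshold survive the cut.
kept = toy_weights[toy_weights > toy_threshold]
print(f"threshold={toy_threshold}, kept={kept}")
```

With these ten values the threshold lands between 0.6 and 0.7, so only the three highest weights are kept.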

In [None]:
def find_weight_threshold():
    arrays = [weights for weights in i_index.values()]
    print(f"loaded doc weights for {len(arrays)} words.")
    all_data = np.concatenate(arrays)
    print(f"a total of {all_data.shape} doc-word weights loaded.")
    d_mean = np.mean(all_data[:,1])
    d_med = np.median(all_data[:,1])
    d_max = np.max(all_data[:,1])
    d_min = np.min(all_data[:,1])
    p80 = np.percentile(all_data[:,1],80)
    p70 = np.percentile(all_data[:,1],70)
    p30 = np.percentile(all_data[:,1],30)
    print(f"min={d_min}, max={d_max}\n"+
          f"avg={d_mean}, median={d_med}\n"+
          f"p30={p30}, p70={p70}, p80={p80}")
    with np.printoptions(precision=20, suppress=True):
        print(np.array([d_min, d_max, d_mean, d_med, p30, p80]))
    #threshold = np.max([d_mean, d_med])
    threshold = p70 #using 70 to limit the size of the notebook
    print(f"threshold: {threshold}")
    return threshold

In [None]:
%%time
threshold = find_weight_threshold()
word_dicts = build_dicts(threshold)


Calculate weights for documents given the set of search terms. For example, assume search "incubation period":
- get list of synonyms for "incubation" and "period". call this list "synonyms".
- two examples what it could be for each (will use these as examples):
   - incubation: preincubation, incubating
   - period: interval, duration
- loop every word/synonym:
   - use the above built dictionary of word weights for all docs to find TFIDF weight for every doc
   - build a list of weights for all docs for each word / synonym
   - sum all per doc
   - example doc 1:
      - incubation: incubation=0.2, preincubation=0.14, total=0.2+0.14=0.34
      - period: period=0.2, interval=0.1, duration=0.22, total=0.2+0.1+0.22=0.52
      - sum: 0.34+0.52=0.86
   - example doc 2:
      - incubation: incubation=0.5, preincubation=0.3, total=0.5+0.3=0.8
      - period: None, total=None  <- doc 2 does not have word "period" in this example, or the score is too low
      - sum: 0.8+None=None <- if score for one word is missing, the doc gets removed

In selecting the actual documents for question-answering, I use the above formula to calculate scores for documents, and then sort them by their scores. I then ask the question of each of the top N of those documents.

Note that all the documents that make it into this formula should already have quite a high TF-IDF score due to the earlier threshold filtering. So I just take the ones I find here and sum their keyword scores into one score per document.
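
The worked example above can be reproduced with a few lines of Python. This is a simplified sketch with hypothetical names (`toy_word_dicts`, `score_docs`) and the made-up weights from the example. It folds the summing and the filtering into one function, whereas the notebook does them in separate steps.

```python
from collections import defaultdict

# Hypothetical per-word weight dictionaries, using the numbers from the example.
toy_word_dicts = {
    "incubation":    {1: 0.2, 2: 0.5},
    "preincubation": {1: 0.14, 2: 0.3},
    "period":        {1: 0.2},
    "interval":      {1: 0.1},
    "duration":      {1: 0.22},
}
# One list per base word: the base word first, then its synonyms.
toy_synonym_lists = [
    ["incubation", "preincubation", "incubating"],
    ["period", "interval", "duration"],
]

def score_docs(synonym_lists, word_dicts):
    per_doc = defaultdict(list)
    for word_list in synonym_lists:
        # sum the weights of a base word and its synonyms per document
        group_scores = defaultdict(float)
        for word in word_list:
            # words missing from the dictionaries contribute nothing
            for doc_idx, weight in word_dicts.get(word, {}).items():
                group_scores[doc_idx] += weight
        for doc_idx, score in group_scores.items():
            per_doc[doc_idx].append(score)
    # keep only documents that have a score for every base word
    n_words = len(synonym_lists)
    return {doc: sum(scores) for doc, scores in per_doc.items() if len(scores) == n_words}

toy_scores = score_docs(toy_synonym_lists, toy_word_dicts)
print(toy_scores)
```

Doc 1 ends up with a total of about 0.86, while doc 2 is dropped because it has no score for "period" or its synonyms.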


In [None]:
from collections import defaultdict

@profiler.profile_func
def find_docs_for_words(synonym_lists):
    total_scores = defaultdict(lambda: [])
    #assume query (keyword) string "incubation period"
    #word_list1 = incubation and its synonyms
    #word_list2 = period and its synonyms
    for word_list in synonym_lists:
        doc_scores = defaultdict(lambda: [])
        #each word list represent one base word and its synonyms, 
        #so first sum up all weights for a single word and its synonyms
        #note that above we filtered by threshold and number of words, so should not have very small scores in it
        #TODO: improve score weights and filter count threshold
        for word in word_list:
            if word not in word_dicts:
                #some of the synonyms from word2vec may be rare and were dropped by earlier preprocessing steps
                #this prints those so we can see if it is a real loss (should we go back and add it) or not
                print(f"missed word: {word}")
                continue
            word_doc_scores = word_dicts[word]
            for doc_idx in word_doc_scores:
                #get weights for this word for each document the word appears in
                doc_scores[doc_idx].append(word_doc_scores[doc_idx])
        for doc_idx in doc_scores:
            #sum up all synonyms into one score for each document. 
            #after this each doc has as many lists as it has base words with weights
            total_scores[doc_idx].append(sum(doc_scores[doc_idx]))
        #so at this point total_scores has one entry per doc_id: (weights1, weights2). 
        #if there are less than 2, it did not have weight in one of the two, and will be removed later
        #TODO: nicer filtering schema        
    return total_scores

Often a large number of documents is found to have one or more of the keywords. To be included for question-answering, the document must have scores for all the keywords, or their synonyms. In the above example that would be "incubation" and "period", or their synonyms.

In [None]:
@profiler.profile_func
def filter_weighted_docs(n_words, total_scores):
    #remove all docs that do not have a weight for one of the N base words
    to_remove = []
    for doc_id in total_scores:
        ds = total_scores[doc_id]
        if len(ds) < n_words:
            to_remove.append(doc_id)
            continue
        total_scores[doc_id] = sum(ds)
    print(f"removing {len(to_remove)} docs")
    for key in to_remove:
        del total_scores[key]


For further processing, we build a list of the documents that match the given keyword search criteria, highest scoring first.

In [None]:
@profiler.profile_func
def doc_scores_to_df(total_scores):
    #NOTE: here we sort the docs by score so from this on highest scoring will be first
    #filter_weighted_docs has summed all the word weights for keywords into one, stores in item[1] here 
    ts_dict = {k: v for k, v in sorted(total_scores.items(), key=lambda item: item[1], reverse=True)}
    #print(ts_dict)
    df_ts = pd.DataFrame(ts_dict.items(), columns=['DocID', 'WeightScore'])
    return df_ts

Also, need to match the numerical document id's from the preprocessed datasets to the Kaggle provided metadata. This allows us to load the original file, and to access metadata for final presentation of results in this notebook (to include author, journal, etc. info).

In [None]:
@profiler.profile_func
def get_doc_ids_to_load(doc_ids, df_total_scores):
    doc_ids_to_load = []
    for index, row in df_total_scores.iterrows():
        doc_idx = int(row["DocID"])
        #print(doc_idx)
        doc_ids_to_load.append(doc_ids[doc_idx])
    return doc_ids_to_load

Matching the document ID's also allows us to load the original, full-text documents in the Kaggle dataset.

In [None]:
@profiler.profile_func
def load_docs(doc_ids_to_load, filepaths, df_metadata):
    loaded_docs = {}
    new_ids = []
    
    for doc_id in tqdm(doc_ids_to_load):
        if doc_id in loaded_docs:
            print(f"WARNING: duplicate doc id to load: {doc_id}, skipping")
            continue
         
        #TODO: this should not work if SHA is nan, why does it?
        doc_sha = df_metadata[df_metadata["cord_uid"] == doc_id]["sha"]
        if doc_sha.shape[0] > 0:
            doc_sha = doc_sha.values[0]
        else:
            doc_sha = None
        #print(doc_sha)
        #TODO: this should not work if PMCID is nan, why does it?
        doc_pmcid = df_metadata[df_metadata["cord_uid"] == doc_id]["pmcid"]
        if doc_pmcid.shape[0] > 0:
            doc_pmcid = doc_pmcid.values[0]
        else:
            doc_pmcid = None
        pmc_path = None
        sha_path = None
        for filepath in filepaths:
            if isinstance(doc_pmcid, str) and doc_pmcid in filepath:
                pmc_path = filepath
                break
            if isinstance(doc_sha, str) and doc_sha in filepath:
                sha_path = filepath
        if pmc_path is not None:
            #always favour PMC docs since they are described as higher quality (not scanned from PDF but direct machine format)
            filepath = pmc_path
        else:
            filepath = sha_path
        #print(filepath)
        if filepath is None:
            print(f"WARNING: cannot find path for doc id {doc_id}. Possibly Kaggle dataset has changed?")
            continue
        with open(filepath) as f:
            d = json.load(f)
            body = ""
            for idx, paragraph in enumerate(d["body_text"]):
                body += f"{paragraph['text']}\n"
                #print(paragraph)
                #print("---------")
            loaded_docs[doc_id] = body
            new_ids.append(doc_id)
            
    return loaded_docs, new_ids

## Search Functions

There are actually two types of "queries" used in this notebook. The first one consists of keywords I selected that I thought could have a high TF-IDF score for documents related to a question. The second query is the question-answering query to provide the final answers based on the selected documents.

So, first we find and load the docs for the keywords.

In [None]:
@profiler.profile_func
def find_docs_for_query(query):
    print(f"query: {query}")
    pairs = find_pairs(query)
    n_words = len(pairs)
    print(f"query has {n_words} words")
    total_scores = find_docs_for_words(pairs)
    #print(total_scores)
    print(f"number of docs with some search terms (at high score): {len(total_scores)}")
    filter_weighted_docs(n_words, total_scores)
    print(f"number of docs with all search terms (at high score): {len(total_scores)}")
    #this also sorts the scores before creating the dataframe. so highest scoring are first
    df_scores = doc_scores_to_df(total_scores)
    query_doc_ids = get_doc_ids_to_load(doc_ids, df_scores)
    print(f"num. doc ids to load for the final docs: {len(query_doc_ids)}")
    if DOC_LIMIT is not None:
        #this avoid the overhead of loading thousands of extra documents. memory+processing time
        #the multiplier is to give it an extra chance to find answers that meet the confidence level
        query_doc_ids = query_doc_ids[:DOC_LIMIT*2]
    print(f"num. doc ids to load for the final docs after capping: {len(query_doc_ids)}")
    loaded_docs, query_doc_ids = load_docs(query_doc_ids, all_doc_paths, df_metadata)
    return query_doc_ids, loaded_docs



Asking the actual question(s) is possible once we have found the source set of documents of interest. We can then present the question to the question-answer model, using each of the selected documents as the question context, one at a time.
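
The loop can be sketched without loading the real model. Here `toy_qa` is a hypothetical stand-in that only mimics the shape of the result used in this notebook (a dict with `"score"` and `"answer"`). Its scoring rule is invented and has nothing to do with how the Transformers model works.

```python
def toy_qa(question, context):
    """Hypothetical stand-in for the QA model: returns a dict with "score" and "answer"."""
    # Invented rule: confidence grows with how often "days" appears in the context.
    hits = context.count("days")
    return {"score": min(1.0, 0.4 * hits), "answer": context.split(".")[0]}

# Made-up documents, keyed by a made-up document id.
toy_docs = {
    "doc-a": "The incubation period was 5 days. Most cases showed symptoms within 12 days.",
    "doc-b": "We describe the protein structure.",
}

def ask_all(question, docs, score_limit):
    results = []
    # present the same question with each document as the context, one at a time
    for doc_id, context in docs.items():
        result = toy_qa(question=question, context=context)
        # keep only the answers the model is confident enough about
        if result["score"] >= score_limit:
            results.append((doc_id, result["score"], result["answer"]))
    return results

toy_answers = ask_all("What is the incubation period?", toy_docs, score_limit=0.7)
print(toy_answers)
```

Only "doc-a" passes the 0.7 limit here. In the notebook the same limit is applied later, when the results table is built.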

In [None]:
@profiler.profile_func
def run_query(loaded_docs, doc_ids, question):
    scores = []
    answers = []
    processed_ids = set()
    for doc_id in tqdm(doc_ids):
        if doc_id in processed_ids:
            print(f"skipping already processed doc id: {doc_id}")
            continue
        processed_ids.add(doc_id)
        #doc_id = doc_ids[idx]
        context = loaded_docs.get(doc_id)
        if context is None:
            #check before slicing: slicing None would raise a TypeError
            print(f"skipping doc id {doc_id}, not found")
            continue
        #cap the document length to limit memory use and processing time
        context = context[:DOC_SIZE_CAP]
        with profiler.profile("nlp question"):
            answer = nlp(question=question, context=context)
        score = answer["score"]
        answer_text = answer["answer"]
        print(f"question: {question}")
        print(f"  doc id: {doc_id}")
        print(f"  answer: {answer_text}")
        print(f"  score: {score}")
        scores.append(score)
        answers.append(answer_text)
        if DOC_LIMIT is not None and len(answers) > DOC_LIMIT:
            break
    return scores, answers

A nice results presentation also needs to be built to show the questions and answers along with the name of the article, its authors, and journal it was published in.

In [None]:
@profiler.profile_func
def build_results_df(query_doc_ids, df_metadata, scores, answers, score_limit):
    titles = []
    publish_times = []
    journals = []
    author_lists = []
    filtered_scores = []
    filtered_answers = []
    for idx, doc_id in enumerate(query_doc_ids):
        if DOC_LIMIT is not None and idx > DOC_LIMIT:
            break
        
        doc_meta = df_metadata[df_metadata["cord_uid"] == doc_id]
        title = doc_meta["title"].values[0]
        publish_time = doc_meta["publish_time"].values[0]
        journal = doc_meta["journal"].values[0]
        authors = doc_meta["authors"].values[0]
        score = scores[idx]
        answer = answers[idx]
        if (score < score_limit):
            continue

        titles.append(title)
        publish_times.append(publish_time)
        journals.append(journal)
        author_lists.append(authors)
        filtered_scores.append(score)
        filtered_answers.append(answer)
    
    df_result = pd.DataFrame({
        "Article title": titles,
        "Published": publish_times,
        "Journal": journals,
        "Authors": author_lists,
        "Confidence": filtered_scores,
        "Answer": filtered_answers
    })
    return df_result
    

In [None]:
def answer_a_question(tfidf_sentence, question, df_metadata, score_limit):
    query_doc_ids, query_docs = find_docs_for_query(tfidf_sentence)
    print(f"doc_ids={len(query_doc_ids)}, docs={len(query_docs)}")
    scores, answers = run_query(query_docs, query_doc_ids, question)
    df = build_results_df(query_doc_ids, df_metadata, scores, answers, score_limit)
    return df

In [None]:
pd.set_option('display.max_rows', 1000)

In [None]:
memory_utils.print_memory()

# Questions and Answers

The following sections are coarsely divided to roughly match four of the Kaggle tasks and their related questions.

In each section, a number of questions formulated to match some aspects of the Kaggle task questions are presented to the question-answer (QA) model, along with a set of keywords that are first used to filter a set of documents to use as context for the questions. These questions are a result of my own experiments with different questions.

Based on the confidence level given by the QA model for the question and each presented document, a set of answers is collected. The answers with high enough confidence are presented as the final results. The threshold confidence is set as a hyperparameter based on earlier experiments, and varies by question.

The answers are combined with metadata for the article from the Kaggle dataset. This includes the following:

- Article title: Well, the title given by the authors to their article.. :)
- Published: The publication date as listed in the metadata for this article.
- Journal: The journal given in the metadata for this article.
- Authors: Given article authors from metadata.
- Confidence: The confidence score given for the answer by the question-answer model.
- Answer: The actual answer to the posed question, based on the article text, as given by the question-answer model.

The TF-IDF score for the document is not shown in the table, but the entries are sorted so that the ones with the highest TF-IDF score are first in the table. Articles with a low confidence score are filtered out, so there might be others in the dataset with a higher TF-IDF score that are not included here. That would be because the QA model was not as confident in its answer for those, and the threshold (hyperparameter) set for this question filtered them out.
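
A small sketch of how the ordering and the filtering interact, using a made-up table. The column name `tfidf` and all the values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Made-up results, already ordered by TF-IDF score (highest first).
toy_results = pd.DataFrame({
    "Article title": ["Paper A", "Paper B", "Paper C"],
    "tfidf": [0.9, 0.8, 0.7],
    "Confidence": [0.75, 0.10, 0.82],
})

# Filter on confidence only. The TF-IDF order of the remaining rows is preserved,
# and the TF-IDF score itself is dropped from the table that is shown.
shown = toy_results[toy_results["Confidence"] >= 0.7].drop(columns="tfidf")
print(shown)
```

"Paper B" has a higher TF-IDF score than "Paper C" but is left out because its confidence is below the limit.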

# Task 1

Task 1 refers to the task that was the highest ranked task in the list of tasks at the time. This section tries to answer questions related to this task. The task description:

Task Details

What is known about transmission, incubation, and environmental stability? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?

Specifically, we want to know what the literature reports about:

- Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
- Prevalence of asymptomatic shedding and transmission (e.g., particularly children).
- Seasonality of transmission.
- Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
- Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
- Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).
- Natural history of the virus and shedding of it from an infected person
- Implementation of diagnostics and products to improve clinical processes
- Disease models, including animal models for infection, disease and transmission
- Tools and studies to monitor phenotypic change and potential adaptation of the virus
- Immune response and immunity
- Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings
- Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings
- Role of the environment in transmission



In [None]:
q = "What is the incubation period?"
df = answer_a_question("covid19 incubation period",  q, df_metadata, 0.7)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How prevalent is asymptomatic shedding and transmission?"
df = answer_a_question("asymptomatic shedding transmission",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the transmission seasonality?"
df = answer_a_question("covid19 transmission seasonality",  q, df_metadata, 0.02)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is covid19 chemical structure?"
df = answer_a_question("covid19 chemical structure",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How long does covid19 survive?"
df = answer_a_question("covid19 persistent host",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How long is host infectious?"
df = answer_a_question("covid19 infect host",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What does covid19 persist on?"
df = answer_a_question("covid19 copper steel plastic",  q, df_metadata, 0.005)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the history of covid19?"
df = answer_a_question("covid19 history",  q, df_metadata, 0.05)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the disease model?"
df = answer_a_question("covid19 disease model",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are effective diagnostics processes?"
df = answer_a_question("covid19 diagnostic process",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How does the virus change and adapt?"
df = answer_a_question("covid19 phenotypic change adaptation",  q, df_metadata, 0.01)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is effective movement control strategy?"
df = answer_a_question("covid19 movement control strategy",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is effective protective equipment?"
df = answer_a_question("covid19 personal protective equipment",  q, df_metadata, 0.01)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the role of environment in transmission?"
df = answer_a_question("covid19 environment transmission",  q, df_metadata, 0.05)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is immune response?"
df = answer_a_question("covid19 immune response",  q, df_metadata, 0.05)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How long is immunity?"
df = answer_a_question("covid19 immunity period",  q, df_metadata, 0.05)

In [None]:
print(f"Q: {q}")
df

# Task 2

2nd ranked in the list, at the time:

Task Details

What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?

Specifically, we want to know what the literature reports about:

- Data on potential risks factors
- Smoking, pre-existing pulmonary disease
- Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
- Neonates and pregnant women
- Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
- Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
- Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
- Susceptibility of populations
- Public health mitigation measures that could be effective for control


In [None]:
q = "What are risk factors?"
df = answer_a_question("covid19 risk factor",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Does smoking increase risk?"
df = answer_a_question("covid19 smoke risk",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How does coinfection affect transmission?"
df = answer_a_question("covid19 coinfection transmission",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is risk to pregnant women?"
df = answer_a_question("covid19 pregnant woman",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Does social status affect risk?"
df = answer_a_question("covid19 social economic",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How is the virus transmitted?"
df = answer_a_question("covid19 transmission dynamic",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the reproductive number?"
df = answer_a_question("covid19 reproduction number",  q, df_metadata, 0.7)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How is covid19 transmitted?"
df = answer_a_question("covid19 transmission mode",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the impact of environment on transmission?"
df = answer_a_question("covid19 environment factor",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the severity of covid19?"
df = answer_a_question("covid19 severity risk",  q, df_metadata, 0.03)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Who is most at risk?"
df = answer_a_question("covid19 risk population",  q, df_metadata, 0.01)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are effective mitigation measures?"
df = answer_a_question("covid19 mitigation measure",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the symptomatic fatality rate?"
df = answer_a_question("covid19 symptomatic fatality",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

# Task 4

4th in the list, with questions:

- Effectiveness of drugs being developed and tried to treat COVID-19 patients.
- Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
- Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
- Exploration of use of best animal models and their predictive value for a human vaccine.
- Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
- Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
- Efforts targeted at a universal coronavirus vaccine.
- Efforts to develop animal models and standardize challenge studies
- Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers
- Approaches to evaluate risk for enhanced disease after vaccination
- Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]

In [None]:
q = "What drugs are effective?"
df = answer_a_question("covid19 drug effective",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How effective are drugs?"
df = answer_a_question("covid19 drug effective", q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What inhibitors are effective?"
df = answer_a_question("covid19 viral inhibitor",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is the effectiveness of inhibitors?"
df = answer_a_question("covid19 viral inhibitor",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How effective is antibody-dependent enhancement?"
df = answer_a_question("covid19 antibody enhancement",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Are animal models effective for humans?"
df = answer_a_question("covid19 animal model",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
#q = "How can production capacity be expanded?"
#df = answer_a_question("covid19 production capacity",  q, df_metadata, 0.1)

In [None]:
#print(f"Q: {q}")
#df

In [None]:
q = "How to get people to use masks?"
df = answer_a_question("covid19 mask respirator",  q, df_metadata, 0.15)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is effective vaccine?"
df = answer_a_question("covid19 vaccine develop",  q, df_metadata, 0.35)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are vaccine risks?"
df = answer_a_question("covid19 vaccine risk",  q, df_metadata, 0.01)

In [None]:
print(f"Q: {q}")
df

# Task 5

5th in the list, questions given:

- Resources to support skilled nursing facilities and long term care facilities.
- Mobilization of surge medical staff to address shortages in overwhelmed communities
- Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies
- Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients
- Outcomes data for COVID-19 after mechanical ventilation adjusted for age.
- Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.
- Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.
- Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.
- Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.
- Guidance on the simple things people can do at home to take care of sick people and manage disease.
- Oral medications that might potentially work.
- Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.
- Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.
- Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials
- Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials
- Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)

In [None]:
q = "How can we support nursing facilities?"
df = answer_a_question("covid19 support nurse facility",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Does age have effect?"
df = answer_a_question("organ failure mortality",  q, df_metadata, 0.2)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How do ethical principles map to covid19?"
df = answer_a_question("covid19 ethical principle",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What do people fear?"
df = answer_a_question("covid19 fear anxiety",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are local barriers and enablers?"
df = answer_a_question("covid19 barrier enabler",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
#q = "How are healthcare providers affected?"
#df = answer_a_question("covid19 provider health psychological",  q, df_metadata, 0.4)

In [None]:
#print(f"Q: {q}")
#df

In [None]:
q = "What is the experiment outcome?"
df = answer_a_question("membrane oxygenation",  q, df_metadata, 0.02)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "Is heart attack more likely?"
df = answer_a_question("covid19 heart attack",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "How does regulation affect care level?"
df = answer_a_question("covid19 regulatory regulation",  q, df_metadata, 0.1)

In [None]:
print(f"Q: {q}")
df

In [None]:
#q = "How to get people to use masks?"
#df = answer_a_question("covid19 mask respirator",  q, df_metadata, 0.15)

In [None]:
#print(f"Q: {q}")
#df

In [None]:
q = "How to provide effective remote support?"
df = answer_a_question("covid19 telemedicine support",  q, df_metadata, 0.35)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What can people do at home?"
df = answer_a_question("covid19 home guidance",  q, df_metadata, 0.01)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are effective diagnostics processes?"
df = answer_a_question("covid19 diagnostic process",  q, df_metadata, 0.002)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What oral medication works?"
df = answer_a_question("covid19 oral medication",  q, df_metadata, 0.35)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What are best practices at hospitals?"
df = answer_a_question("covid19 hospital best practice",  q, df_metadata, 0.15)

In [None]:
print(f"Q: {q}")
df

In [None]:
q = "What is effective protective equipment?"
df = answer_a_question("covid19 personal protective equipment",  q, df_metadata, 0.4)

In [None]:
print(f"Q: {q}")
df

In [None]:
#q = "How is artificial intelligence used?"
#df = answer_a_question("covid19 intervention automation",  q, df_metadata, 0.1)

In [None]:
#print(f"Q: {q}")
#df

That's all folks.

# Conclusions

I was quite impressed by how well the Transformers question-answering model was able to find matching information in large sets of text. My approach of using TF-IDF scores for keywords to find relevant documents to ask about also seemed to work quite well.

Next time I do a literature search, I will definitely try this in my own areas of research. In that situation, I would likely be better placed to fine-tune the approach, to identify when it finds the most relevant papers and information, and to work out how best to apply it. Of course, the research community still has some way to go before nice sets of papers are available in a format suitable for this type of processing. Here, Kaggle has done it for us, but in my experience raw PDF is what you mostly get.

But as I said, I found the results and their quality very interesting. I think such an approach could be quite helpful for identifying documents and topics in large sets of text when researching a specific problem. For example, to:
- Find missed documents that could be highly relevant
- Group different insights quickly
- Reduce a large set of documents to go through manually
- Highlight new ideas and viewpoints that could otherwise be missed

I believe this type of service would be a useful aid for literature search in any domain. For academic texts, further improvements could include:
- Ranking by journal (higher ranked journals get more points)
- In many cases, titles also seem to be good indicators of usefulness. Weighting the different article elements differently in the scoring could be an option.
- Fancy UI for a search engine to let you explore the results
- Integrating explorative data analysis approaches with tuning the keywords, questions, hyperparameters, and results
- Running multiple queries in parallel, assuming you have the resources
- Using previous question answers to produce more detailed questions. For example, when running this notebook, a question about vaccines lists many random-looking items in the answers, but also things like rVSV-ZEBOV, which seems to be an experimental Ebola vaccine. Such terms could further be used to guide the search.
- The answers given by the transformer model are very short and concise. Slicing larger parts of text near the answer could make it easier to directly deduce whether it is worth the effort to explore that answer with a more in-depth manual review of the paper.
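
The last idea in the list, showing more text around a short answer, could be sketched roughly as below. This is only an illustration and not part of the notebook's pipeline: `answer_with_context` and its `window` parameter are hypothetical names, and it assumes the answer's character offsets into the source text are available (the Transformers question-answering pipeline reports these as `start` and `end` in its result).

```python
def answer_with_context(text: str, start: int, end: int, window: int = 200) -> str:
    """Return the answer span in [brackets] with up to `window` characters on each side.

    Hypothetical helper: `start` and `end` are assumed to be character
    offsets of the answer inside `text`.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("start/end must lie within the text")

    left = max(0, start - window)
    right = min(len(text), end + window)

    # Avoid cutting words in half: move each edge inward to the nearest space.
    if left > 0:
        space = text.find(" ", left, start)
        if space != -1:
            left = space + 1
    if right < len(text):
        space = text.rfind(" ", end, right)
        if space != -1:
            right = space

    # Mark truncation so the reader knows the snippet is partial.
    prefix = "..." if left > 0 else ""
    suffix = "..." if right < len(text) else ""
    return f"{prefix}{text[left:start]}[{text[start:end]}]{text[end:right]}{suffix}"


# Made-up sentence, for illustration only.
example_text = "The experimental vaccine rVSV-ZEBOV was tested in a trial."
example_start = example_text.index("rVSV-ZEBOV")
example_end = example_start + len("rVSV-ZEBOV")
print(answer_with_context(example_text, example_start, example_end, window=12))
```

A larger `window` gives more context at the cost of longer result tables, so it would probably need tuning per question in the same way as the confidence thresholds.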

Hope this is helpful for some researcher :)

In [None]:
profiler.print_run_stats()