## COVID-19 Open Research Dataset Challenge - What do we know about vaccines and therapeutics?
The following questions were analysed specifically: 
- Effectiveness of drugs being developed and tried to treat COVID-19 patients.
  - Clinical and bench trials to investigate less common viral inhibitors against COVID-19, such as naproxen, clarithromycin, and minocycline, that may exert effects on viral replication.
- Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
- Exploration of use of best animal models and their predictive value for a human vaccine.
- Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
- Efforts targeted at a universal coronavirus vaccine.
- Efforts to develop animal models and standardize challenge studies.
- Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics).

## Our approach - Creating a timeline visualizing the progress of vaccines/cures on COVID-19 and other similar viral diseases.
Our goal is to create an intuitive visualization of the progress of research on vaccines and therapeutics for COVID-19. This gives professional researchers a quick overview of the clinical trial stage of each investigated vaccine or therapeutic, and it gives the public a better understanding of the time frame in which to expect a cure or solution. We decided to visualize research progress on other viruses as well as COVID-19, to get a better picture of the timescale and amount of research that goes into developing a vaccine or therapeutic.

Several steps were taken to create the visualizations:
1. Load and preprocess the data:
    - lemmatize all texts and remove stopwords
2. Select papers containing words relevant to the research question
    - using either string pattern matching or word embeddings
    - relevant words were selected manually, based on the research questions and on how indicative they are of the clinical trial stage (e.g. mouse vs. human test subjects, words expressing certainty)
3. Extract keywords from selected papers
    - TODO: write how we do this @Simon, @Silvan
4. Extract links between selected papers
    - TODO: write how we do this @Levi @Miguel
5. Visualize extracted papers, links and summaries
    - TODO: explain how (after we know how) @Levi @Gloria


### 0.a Imports

In [None]:
# TODO: write your imports here
import os
import json

### 0.b Functions

In [None]:
# As kaggle only allows notebook submissions, all functions should be in the notebook. Just copy your functions and paste them here.

class Dataset:
    """COVID19 Papers Dataset
    
    Attributes:
        data_dir: string location where data files can be found.
        paper_ids: list containing str of unique pdf ids for each paper. 
            ie. ['sha1', 'sha2', 'sha3', ...]
        titles: list containing str titles of papers. 
            ie. ['title1', 'title2', 'title3', ...]
        abstracts: list containing str abstracts of each paper. 
            ie. ['abstract1', 'abstract2', 'abstract3', ...]
        n_paragraphs: list of integers specifying the amount of paragraphs in each paper. 
            ie. [n1, n2, n3, ...]
        contents: nested list containing contents of paper; contents of each paper stored in a list of strings containing paragraphs. 
            ie. [['paper1_p1', 'paper1_p2', ...], ['paper2_p1', 'paper2_p2', ...], ...]
    
    Attributes are initially empty. To populate them, call the load_data() method.
    
    Usage:
        # declare directory where data is stored
        data_dir = '/kaggle/input/CORD-19-research-challenge'  
        data = Dataset(data_dir)
        data.load_data()
        
        # get attributes
        data.paper_ids
        data.titles
        ...
    """
    
    def __init__(self, data_dir:str):
        # init lists to store data
        self.data_dir = data_dir
        self.paper_ids = []
        self.titles = []
        self.abstracts = []
        self.n_paragraphs = []
        self.contents = []
        
        print("[INFO] Empty Dataset object created.")
        
    def __len__(self):
        """Return the total number of samples, so that len(dataset) works."""
        return len(self.paper_ids)
    
    def load_data(self):
        """Load data from dataset data directory."""
        data_dir = str(self.data_dir)
        subdir = [x for x in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir,x))]
        
        initial_samples = len(self.paper_ids)
        
        print(f"[INFO] Loading data from {data_dir}...")
        # loop through folders with json files
        for folder in subdir:
#             path = os.path.join(data_dir,folder, folder)
            path = os.path.join(data_dir,folder, folder, 'pdf_json')
            # loop through json files and scrape data
            for file in os.listdir(path):
                file_path = os.path.join(path, file)
                
                # open file only if it is a file
                if os.path.isfile(file_path):
                    with open(file_path) as f:
                        data_json = json.load(f)
                        self.paper_ids.append(data_json['paper_id'])
                        self.titles.append(data_json['metadata']['title'])

                        # combine abstract texts / process
                        combined_str = ''
                        for text in data_json['abstract']:
                            combined_str += text['text'].lower()

                        self.abstracts.append(combined_str)

                        # take only text part for content
                        paragraphs = []
                        content = data_json['body_text']

                        for paragraph in content:
                            paragraphs.append(paragraph['text'].lower())

                        self.n_paragraphs.append(len(content))
                        self.contents.append(paragraphs) 
                else:
                    print('[WARNING]', file_path, 'not a file. Check pointed path directory in load_data().')
        
        end_samples = len(self.paper_ids)
        loaded_samples = end_samples - initial_samples
        print(f"[INFO] Data loaded into dataset instance. {loaded_samples} samples added. | Start amount = {initial_samples}; End amount = {end_samples}")

# def load_data(path):
#     #TODO @Kwan: function for loading data goes here
#     return data

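def remove_stopwords(text):
    """Return `text` with stopwords removed."""
    # TODO @Kwan: function for removing stopwords
    raise NotImplementedError("remove_stopwords is not implemented yet")

def lemmatize(text):
    """Return the lemmatized version of `text`."""
    # TODO @Kwan: function for lemmatizing text
    raise NotImplementedError("lemmatize is not implemented yet")

def select_papers(data, virus_strings, clinical_stage_strings):
    """Return the selected papers plus the strings found in them.

    The matched strings are meant to be stored under 'found_substrings'.
    """
    # TODO @Pooja @Miguel: select papers based on relevant_strings
    raise NotImplementedError("select_papers is not implemented yet")

def extract_keywords(text):
    """Return the keywords extracted from `text`."""
    # TODO @Simon @Silvan: extract keywords
    raise NotImplementedError("extract_keywords is not implemented yet")

def extract_links(data):
    """Return the links between the papers in `data`."""
    # TODO @Levi @Miguel: extract links between papers
    raise NotImplementedError("extract_links is not implemented yet")

def visualize_data(data, keywords, summaries):
    """Visualize the selected papers together with their keywords and summaries."""
    # TODO @Levi @Kwan: visualize data
    raise NotImplementedError("visualize_data is not implemented yet")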

### 0.c Relevant strings

In [None]:
# patterns defining different coronavirus diseases
virus_strings = {
    'covid-19': ['COVID-19', 'SARS-CoV-2', 'coronavirus disease',
                 'severe acute respiratory syndrome coronavirus 2'],
    'common cold': ['common cold', 'human coronavirus 229E',
                    'human coronavirus OC43', '229E', 'OC43'],
    'SARS-CoV (2003)': ['SARS-CoV'],
    'HCoV NL63 (2004)': ['HCoV NL63'],
    'HKU1 (2005)': ['HKU1'],
    'MERS-CoV (2012)': ['MERS-CoV'],
}

# patterns defining different stages of clinical trials:
# if this works I will put it in a txt file and load it here

clinical_stage_strings = {}
clinical_stage_strings['preclinical'] = 'Preclinical trial/study/studies/development nonclinical studies \
laboratory animals prior to moving to the phase one trials \
adverse effects and immunogenicity vaccine safety and the immunological response to the drug, such as toxicity, \
toxic effects at all possible dosage levels and the interactions with the immune system. \
preclinical protocol in hopes to more accurately determine drug reactions in humans. \
transgenic animals, genetically modified animals, Typically, in drug development studies animal testing involves two species. The most commonly used models are murine and canine, although primate and porcine are also used. \
mice, ferrets, monkey, in vitro, pig, dog, This data allows researchers to allometrically estimate a safe starting dose of the drug for clinical trials in humans. \
Importantly, the regulatory guidelines of FDA, EMA, and other similar international and regional authorities usually require safety testing in at least two mammalian species, including one non-rodent species, prior to human trials authorization \
Deciding whether a drug is ready for clinical trials (the so-called move from bench to bedside) involves extensive preclinical studies that yield preliminary efficacy, toxicity, pharmacokinetic and safety information. Wide doses of the drug are tested using in vitro (test tube or cell culture) and in vivo (animal) experiments, and it is also possible to perform in silico profiling using computer models of the drug–target interactions. \
Much like for clinical trials, there are certain types of trials that have to be done, such as toxicology studies in most cases, and other trials that are specific to the particular study compound or question. Understanding that the goal of preclinical trials is to move into the clinical stage is key and the studies should be designed around that goal.'

clinical_stage_strings['phase_0'] = 'Phase 0 phase zero (optional) trial/study \
Phase 0 trials are optional first-in-human trials. Single subtherapeutic doses of the study drug or treatment are given to a small number of subjects (typically 10 to 15) to gather preliminary data on the agent\'s pharmacodynamics (what the drug does to the body) and pharmacokinetics (what the body does to the drugs). For a test drug, the trial documents the absorption, distribution, metabolization, and removal (excretion) \
of the drug, and the drug\'s interactions within the body, to confirm that these appear to be as expected. \
Phase 0 involves exploratory, first-in-human (FIH) trials that are run according to FDA guidelines. Also called human microdose studies, they have single sub-therapeutic doses given to 10 to 15 subjects and yield pharmacokinetic data or help with imaging specific targets without introducing pharmacological effects.'

clinical_stage_strings['phase_1'] = 'Phase 1 phase I trial/study / Early phase clinical trial \
Screening for safety Often are first-in-person trials. Testing within a small group of people (typically 20–80) to evaluate safety, determine safe dosage ranges, and identify side effects. \
The following stage in vaccine trials is the phase one study, which consists of introducing the drug into the human population. \
A vaccine trial might involve forming two groups from the target population. For example, from the set of trial subjects, each subject may be randomly assigned to receive either a new vaccine or a "control" treatment: The control treatment may be a placebo, or an adjuvant-containing cocktail, or an established vaccine (which might be intended to protect against a different pathogen). \
After the administration of the vaccine or placebo, the researchers collect data on antibody production, on health outcomes (such as illness due to the targeted infection or to another infection). This data is summarized as a statistic, which is used to estimate the protective efficacy of the vaccine. Then, following the trial protocol, the specified statistical test is performed to gauge the statistical significance of the observed differences in the outcomes between the treatment and control groups. \
Side effects of the vaccine are also noted, and these too contribute to the decision on whether to license it. \
One very typical version of phase one studies in vaccines involves an escalation study, which is used in mainly medicinal research trials. The drug is introduced into a small cohort of healthy volunteers. Vaccine escalation studies aim to minimize chances of serious adverse effects (SAE) by slowly increasing the drug dosage or frequency. The first level of an escalation study usually has two or three groups of around 10 healthy volunteers. \
Each subgroup receives the same vaccine dose, which is the expected lowest dose necessary to invoke an immune response (the main goal in a vaccine - to create immunity). New subgroups can be added to experiment with a different dosing regimen as long as the previous subgroup did not experience SAEs. There are variations in the vaccination order that can be used for different studies. For example, the first subgroup could complete the entire regimen before the second subgroup starts or the second can begin \
before the first ends as long as SAEs were not detected. The vaccination schedule will vary depending on the nature of the drug (i.e. the need for a booster or several doses over the course of short time period). Escalation studies are ideal for minimizing risks for SAEs that could occur with less controlled and divided protocols. \
They are primarily designed to assess the safety and tolerability of a drug, but the pharmacokinetics and, if possible, the pharmacodynamics are also measured. \
The typical Phase I trial has a single ascending dose (SAD) design, meaning that subjects are dosed in small groups called cohorts. Each member of a cohort might receive a single dose of the study drug or a placebo. A very low dose is used for the first cohort. The dose is then escalated in the next cohort if safety and tolerability allow. \
Dose escalation is stopped when maximum tolerability and/or maximum exposure is reached. \
SAD studies are usually followed by multiple ascending dose (MAD) studies, which have a very similar design, with cohorts and escalating doses. The only difference is that the subjects receive multiple doses of the study drug or placebo. \
While safety and tolerability are still important endpoints, the multiple dose setting often allows first investigations of the pharmacodynamic effects in addition to the pharmacokinetics. \
Finally, food effect studies are often conducted to investigate the potential impact of food intake on the absorption of the drug. \
new vaccine or a "control" treatment \
The control treatment may be a placebo, or an adjuvant-containing cocktail, or an established vaccine \
infection which is used to estimate the protective efficacy of the vaccine \
Side effects of the vaccine are also noted, and these too contribute to the decision on whether to license it \
escalation study Vaccine escalation studies aim to minimize chances of serious adverse effects (SAE) by slowly increasing the drug dosage or frequency. \
expected lowest dose necessary to invoke an immune response dosing regimen'

clinical_stage_strings['phase_2'] = 'Phase 2 phase II trial/study / Early phase clinical trial \
Establishing the preliminary efficacy of the drug, usually against a placebo \
Testing with a larger group of people (typically 100–300) to determine efficacy and to further evaluate its safety. \
The transition to phase two relies on the immunogenic and toxicity results from phase one and the small cohort of healthy volunteers. Phase two will consist of more healthy volunteers in the vaccine target population (~hundreds of people) to determine reactions in a more diverse set of humans and test different schedules. \
Phase II trials are performed on larger groups of patients and are designed to assess the efficacy of the drug and to continue the Phase I safety assessments. Most importantly, Phase II clinical studies help to establish therapeutic doses for the large-scale Phase III studies. \
Phase II studies are sometimes divided into Phases IIA and IIB. Phase IIA is designed to assess dosing requirements whereas Phase IIB focuses on drug efficacy. \
In addition, a treatment study with several different doses of the compound in comparison with a placebo and/or an active comparator over a treatment duration of 12 to 16 weeks is usually an essential part of the Phase II program.'

clinical_stage_strings['phase_3'] = 'Phase 3 phase III trial/study / Late phase clinical trial \
Final confirmation of safety and efficacy Testing with large groups of people (typically 1,000–3,000) to confirm its efficacy, evaluate its effectiveness, monitor side effects, compare it to commonly used treatments, and collect information that will allow it to be used safely. \
Similarly, phase three trials continue to monitor toxicity, immunogenicity, and SAEs on a much larger scale. The vaccine must be shown to be safe and effective in natural disease conditions before being submitted for approval and then general production. In the United States, the Food and Drug Administration (FDA) is responsible for approving vaccines. \
Phase III trials are randomized controlled multicentre trials and provide most of the long-term safety data. Phase III trials investigate the efficacy and safety of a new drug over 6 to 12 months or longer in a large patient population (several hundred patients or more) under conditions that reflect daily clinical life much more closely than the Phase I or II trials and allow evaluation of the overall benefit-risk relationship of the drug. These trials are usually conducted on an outpatient basis with no in-house days and include an active comparator \
Phase IIIA studies are used for the approval of the drug from the appropriate regulatory agencies (known as Pivotal study). \
Phase IIIB studies are often performed to obtain additional safety data or to support publication, marketing claims (label extension) or to prepare launch for the drug.'

clinical_stage_strings['phase_4'] = 'Phase 4 phase IV trial/study / Late phase clinical trial \
    Safety studies during sales. Postmarketing studies delineate risks, benefits, and optimal use. As such, they are ongoing during the drug\'s lifetime of active medical use \
    Phase four trials are typically monitor stages that collect information continuously on vaccine usage, adverse effects, and long-term immunity. \
    Phase IV trials are also known as post-marketing surveillance trials involving safety surveillance (pharmacovigilance) and ongoing technical support after approval. \
    There are multiple observational designs and evaluation schemes that can be used in Phase IV studies to assess the effectiveness, cost-effectiveness, and safety of an intervention in real-world settings. \
    This could entail the drug being tested in a certain new population (e.g. pregnant women). The safety surveillance is designed to detect any rare or long-term adverse effects over a much larger patient population and longer time period.'

### 1. Load and Preprocess the data
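
The preprocessing step (remove stopwords, then lemmatize) is not implemented yet. The sketch below only illustrates the intended flow. The stopword set, the lemma lookup table, and all function names here are assumptions made for illustration; a real run would more likely rely on a full stopword list and a proper lemmatizer such as NLTK's `WordNetLemmatizer` or spaCy.

```python
import re

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to",
             "is", "are", "was", "were", "with"}

# Assumption: a toy lemma lookup table standing in for a real lemmatizer.
LEMMAS = {"mice": "mouse", "vaccines": "vaccine", "studies": "study",
          "trials": "trial", "tested": "test", "antibodies": "antibody"}


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping hyphenated terms
    such as 'covid-19' together."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stopwords_sketch(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize_sketch(tokens):
    """Map each token to its lemma if the lookup table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Tokenize, remove stopwords, lemmatize, and join back into one string."""
    return " ".join(lemmatize_sketch(remove_stopwords_sketch(tokenize(text))))


example = "The vaccines were tested in mice and the trials are ongoing."
print(preprocess(example))  # vaccine test mouse trial ongoing
```

Working on token lists rather than raw strings keeps the two steps independent, so either one can be swapped for a library implementation later.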

In [None]:
# path to data
path = '../input/CORD-19-research-challenge'

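# load data in pandas dataframe
# TODO: load_data() is still commented out in 0.b; the Dataset class is the loader implemented so far
data = load_data(path)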

# add columns with processed text (no stopwords, lemmatized)
# pass the functions themselves to apply(); do not call them here
processed_text = data['text'].apply(remove_stopwords)
processed_text = processed_text.apply(lemmatize)

# append processed text to data
data['processed_text'] = processed_text

### 2. Select papers containing words relevant to the research question 
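
As a rough idea of the string pattern matching variant of this step, the sketch below checks each paper for the virus patterns and records which ones were found. It is only one possible approach, not the final `select_papers`. The helper names, the list-of-dicts paper format, and the trimmed `virus_strings_demo` dictionary are assumptions for illustration.

```python
import re

# A trimmed copy of the notebook's virus_strings, for illustration only.
virus_strings_demo = {
    "covid-19": ["COVID-19", "SARS-CoV-2"],
    "MERS-CoV (2012)": ["MERS-CoV"],
}


def find_substrings(text, patterns_by_label):
    """Return {label: [patterns found]} for every label with at least one hit.

    Matching is case-insensitive and treats each pattern as a literal string.
    """
    found = {}
    for label, patterns in patterns_by_label.items():
        hits = [p for p in patterns
                if re.search(re.escape(p), text, flags=re.IGNORECASE)]
        if hits:
            found[label] = hits
    return found


def select_papers_sketch(papers, patterns_by_label):
    """Keep papers that mention at least one pattern and attach the matches."""
    selected = []
    for paper in papers:
        found = find_substrings(paper["text"], patterns_by_label)
        if found:
            selected.append({**paper, "found_substrings": found})
    return selected


# Hypothetical toy papers; real input would come from the loaded dataset.
papers = [
    {"paper_id": "a", "text": "A candidate vaccine against sars-cov-2 was tested in mice."},
    {"paper_id": "b", "text": "Camel exposure and MERS-CoV seroprevalence."},
    {"paper_id": "c", "text": "Influenza surveillance in 2009."},
]
selected = select_papers_sketch(papers, virus_strings_demo)
print([p["paper_id"] for p in selected])
```

One caveat with plain substring matching: a short pattern such as 'SARS-CoV' also matches inside 'SARS-CoV-2', so papers on COVID-19 would be counted under the 2003 virus as well unless the patterns are made stricter.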

In [None]:
selected_papers = select_papers(data, virus_strings, clinical_stage_strings)
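
The long `clinical_stage_strings` descriptions suggest comparing each paper against a description of every trial stage. How that comparison will be done is still open. The sketch below uses bag-of-words cosine similarity purely as a simple stand-in for the word-embedding variant; the function names and the shortened `stage_demo` descriptions are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_stages(paragraph, stage_descriptions):
    """Return (stage, score) pairs sorted from most to least similar."""
    para_vec = bag_of_words(paragraph)
    scores = {stage: cosine_similarity(para_vec, bag_of_words(desc))
              for stage, desc in stage_descriptions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Shortened, hypothetical stage descriptions; the notebook's are much longer.
stage_demo = {
    "preclinical": "preclinical study in laboratory animals mice ferrets in vitro toxicity",
    "phase_1": "phase 1 trial small group of healthy volunteers safety dosage escalation",
}
paragraph = "The vaccine candidate was evaluated in mice and ferrets for toxicity."
ranking = rank_stages(paragraph, stage_demo)
print(ranking[0][0])  # preclinical
```

Common words such as "in" contribute to the score here, so running the stopword removal from step 1 first would likely make the ranking more meaningful.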

### 3. Extract keywords from selected papers

In [1]:
keywords = extract_keywords(selected_papers)

### 4. Extract links between selected papers

In [5]:
paper_links = extract_links(selected_papers)

KeyError: (0, 0)

### 5. Visualize extracted papers, links and summaries

In [None]:
visualize_data(selected_papers,keywords,paper_links)