## COVID-19 Open Research Dataset Challenge - What do we know about vaccines and therapuetics?
The following questions were analysed specifically: 
- Effectiveness of drugs being developed and tried to treat COVID-19 patients.
  - Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocyclinethat that may exert effects on viral replication.
- Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
- Exploration of use of best animal models and their predictive value for a human vaccine.
- Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
- Efforts targeted at a universal coronavirus vaccine.
- Efforts to develop animal models and standardize challenge studies
- Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics)

## Our approach - Creating a timeline visualizing the progress of vaccines/cures on COVID-19 and other similar viral diseases.
Our goal is to create an intuitive visualization of the progress of research on vaccines and therapuetics regarding COVID-19. Not only is this useful for professional researchers in having a quick overview of the clinical trial stages of each investigated vaccine/therapeutic, but also for the public, to have a better understanding of the time frame for which to expect a cure or solution. We decided to create vizualizations of research progress of other virusses as well as COVID-19, to get a better picture of the timescale and ammount of research that goes into making a vaccine or therapeutics.

Several steps were taken to create the visualizations:
1. Load and preprocess the data:
    - lemmatize all texts and remove stopwords
2. Select papers containing words relevant to the research question
    - using either string pattern matching or word embeddings
    - relevant words were manually selected based on the research questions and indicativaty of clinical stage trial (e.g. mouse vs human test subject, words expressing certainty etc.)
3. Extract keywords from selected papers
    - TODO: write how we do this @Simon, @Silvan
4. Extract links between selected papers
    - TODO: write how we do this @Levi @Miguel
5. Visualize extracted papers, links and summaries
    - TODO: explain how (after we know how) @Levi @Gloria


### 0.a Imports

In [5]:
# TODO: write your imports here
import os
import json

import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

from nltk.stem import WordNetLemmatizer

# path to data
data_dir = '../../src'  
keyword_dir = '../../keywords'

### 0.b Functions

In [6]:
# As kaggle only allows notebook submissions, all functions should be in the notebook. Just copy your functions and paste them here.

class Dataset:
    """COVID19 Papers Dataset
    
    Attributes:
        data_dir: string location where data files can be found.
        paper_ids: list containing str of unique pdf ids for each paper. 
            ie. ['sha1', 'sha2', 'sha3', ...]
        titles: list containing str titles of papers. 
            ie. ['title1', 'title2', 'title3', ...]
        abstracts: list containing str abstracts of each paper. 
            ie. ['abstract1', 'abstract2', 'abstract3', ...]
        n_paragraphs: list of integers specifying the amount of paragraphs in each paper. 
            ie. [n1, n2, n3, ...]
        contents: nested list containing contents of paper; contents of each paper stored in a list of strings containing paragraphs. 
            ie. [['paper1_p1', 'paper1_p2', ...], ['paper2_p1', 'paper2_p2', ...], ...]
    
    Attributes are initially empty. To populate data, run class method of load_data().
    
    Usage:
        # declare directory where data is stored
        data_dir = '/kaggle/input/CORD-19-research-challenge'  
        data = Dataset(data_dir)
        data.load_data()
        
        # get attributes
        data.paper_ids
        data.titles
        ...
    """
    
    def __init__(self, data_dir:str):
        # init lists to store data
        self.data_dir = data_dir
        self.paper_ids = []
        self.titles = []
        self.abstracts = []
        self.n_paragraphs = []
        self.contents = []
        
        self.dates = []
        self.authors = []
        
        print("[INFO] Empty Dataset object created.")
        
    @property
    def __len__(self):
        """Denotes the total number of samples."""
        return f"Dataset instance has {len(self.paper_ids)} samples"
    
    def load_data(self):
        """Load data from dataset data directory."""
        data_dir = str(self.data_dir)
        subdir = [x for x in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir,x))]
        
        initial_samples = len(self.paper_ids)
        
        print(f"[INFO] Loading data from {data_dir}...")
        # loop through folders with json files
        for folder in subdir:
#             path = os.path.join(data_dir,folder, folder)
            path = os.path.join(data_dir,folder, folder, 'pdf_json')
            # loop through json files and scrape data
            for file in os.listdir(path):
                file_path = os.path.join(path, file)
                
                # open file only if it is a file
                if os.path.isfile(file_path):
                    with open(file_path) as f:
                        data_json = json.load(f)
                        self.paper_ids.append(data_json['paper_id'])
                        self.titles.append(data_json['metadata']['title'])
                        self.dates.append(data_json['metadata'][''])

                        # combine abstract texts / process
                        combined_str = ''
                        for text in data_json['abstract']:
                            combined_str += text['text'].lower()

                        self.abstracts.append(combined_str)

                        # take only text part for content
                        paragraphs = []
                        content = data_json['body_text']

                        for paragraph in content:
                            paragraphs.append(paragraph['text'].lower())

                        self.n_paragraphs.append(len(content))
                        self.contents.append(paragraphs) 
                else:
                    print('[WARNING]', file_path, 'not a file. Check pointed path directory in load_data().')
        
        end_samples = len(self.paper_ids)
        loaded_samples = end_samples - initial_samples
        print(f"[INFO] Data loaded into dataset instance. {loaded_samples} samples added. | Start amount = {initial_samples}; End amount = {end_samples}")

        
def tokenize_check(text):
    if isinstance(text, str):
        word_tokens = word_tokenize(text)
    elif isinstance(text, list):
        word_tokens = text
    else:
        raise TypeError
    
    return word_tokens
    

def remove_stopwords(text, remove_symbols=False):
    """ Tokenize and/or remove stopwords and/or unwanted symbols from string"""
    list_stopwords = set(stopwords.words('english'))
    # list of signs to be removed if parameter remove_symbols set to True
    list_symbols = ['.', ',', '(', ')', '[', ']']
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)

    # filter out stopwords
    text_without_stopwords = [w for w in word_tokens if not w in list_stopwords] 
    
    if remove_symbols is True:
        text_without_stopwords = [w for w in text_without_stopwords if not w in list_symbols]
    
    return text_without_stopwords

# from nltk.stem import WordNetLemmatizer 

def lemmatize(text):
    """ Tokenize and/or lemmatize string """
    lemmatizer = WordNetLemmatizer()
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)
    
    lemmatized_text = [lemmatizer.lemmatize(w) for w in word_tokens]
    
    return lemmatized_text

def select_papers(data, virus_strings, clinical_stage_strings):
    #TODO @Pooja @Miguel: select papers based on relevant_strings

    # output: selected papers + the strings that were found in these papers
    selected_papers['found_substrings'] = found_substrings
    return selected_papers

def extract_keywords(text):
    # TODO @Simon @Silvan: extract keywords
    return keywords

def extract_links(data):
    # TODO @Levi @Miguel: extract links between papers    
    return links

#def visualize_data(data,keywords,summaries):
#    #TODO @Levi @Kwan: visualize data

### 0.c Relevant strings

In [7]:
# keywords that define the virus the paper is about (likely in title)
virus_keywords = pd.read_csv(keyword_dir+'/virus_keywords.csv')

# keywords describing clinical phase
clinical_stage_keywords = pd.read_csv(keyword_dir+'/phase_keywords.csv')

# keywords describing treatment types
drug_keywords = pd.read_csv(keyword_dir+'/drug_keywords.csv')

### 1. Load and Preprocess the data

In [None]:
meta_data = pd.read_csv(data_dir+'/meta_data.csv')

In [8]:
# create dataset object
data = Dataset(data_dir)

# load data
data.load_data()

[INFO] Empty Dataset object created.
[INFO] Loading data from ../../src...


KeyError: 'date'

### 2. Select papers containing words relevant to the research question 

In [6]:
selected_papers = select_papers(data, virus_keywords, clinical_stage_keywords, drug_keywords)

TypeError: select_papers() takes 3 positional arguments but 4 were given

### 3. Extract keywords from selected papers

In [None]:
keywords = extract_keywords(selected_papers)

### 4. Extract links between selected papers

In [None]:
paper_links = extract_links(selected_papers)

### 5. Visualize extracted papers, links and summaries

In [None]:
visualize_data(selected_papers,keywords,paper_links)