## COVID-19 Open Research Dataset Challenge - What do we know about vaccines and therapeutics?
The following questions were analysed specifically: 
- Effectiveness of drugs being developed and tried to treat COVID-19 patients.
  - Clinical and bench trials to investigate less common viral inhibitors against COVID-19, such as naproxen, clarithromycin, and minocycline, that may exert effects on viral replication.
- Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
- Exploration of use of best animal models and their predictive value for a human vaccine.
- Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
- Efforts targeted at a universal coronavirus vaccine.
- Efforts to develop animal models and standardize challenge studies.
- Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics).

## Our approach - Creating a timeline visualizing the progress of vaccines/cures on COVID-19 and other similar viral diseases.
Our goal is to create an intuitive visualization of the progress of research on vaccines and therapeutics for COVID-19. This gives professional researchers a quick overview of the clinical trial stage of each investigated vaccine or therapeutic, and it gives the public a better understanding of the time frame in which to expect a cure or solution. We decided to visualize the research progress for other viruses as well as COVID-19, to get a better picture of the timescale and the amount of research that goes into developing a vaccine or therapeutic.

Several steps were taken to create the visualizations:
1. Load and preprocess the data:
    - lemmatize all texts and remove stopwords
2. Categorize papers based on keywords 
    - using either string pattern matching or word embeddings
    - relevant words were manually selected based on the research questions and on how indicative they are of the clinical trial stage (e.g. mouse vs. human test subjects, words expressing certainty, etc.)
    - categories are: virus, clinical stage, drug type
3. Extract keywords/summaries from selected papers
    - TODO: write how we do this @Simon, @Silvan
4. Visualize extracted papers, links and summaries
    - TODO: explain how (after we know how) @Levi @Gloria
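
As a rough illustration of step 1, the sketch below lower-cases a sentence, tokenizes it and drops stopwords. It is only a minimal stand-in: the tiny `STOPWORDS` set, the regular-expression tokenizer and the function name `preprocess` are assumptions made for this example (the notebook itself relies on NLTK), and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "was", "were", "to"}

def preprocess(text):
    """Lower-case, tokenize, and drop stopwords (no lemmatization in this sketch)."""
    # Keep hyphenated terms such as "phase-1" together as a single token.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The vaccine was tested in mice and in a Phase-1 trial.")
print(tokens)
```

Keeping hyphenated terms as one token matters later on, because virus names and trial phases are often written with hyphens.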


### 0.a Imports

In [1]:
# TODO: write your imports here
import os
import json

import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

from nltk.stem import WordNetLemmatizer

import pickle as pk
import numpy as np

# path to data
data_dir = '../../src'  
keyword_dir = '../../keywords'
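
For reference, `load_data` reads only two fields from each JSON file under `data_dir`: `paper_id` and the `text` of every entry in `body_text`. The sketch below parses a hypothetical, heavily shortened record to show that shape. The helper name `extract_text` and the record contents are made up for illustration, and joining paragraphs with a space is a choice of this sketch.

```python
import json

# Hypothetical, heavily shortened record; real files contain many more fields.
raw = '{"paper_id": "abc123", "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}]}'

def extract_text(record):
    """Return (paper_id, lower-cased body text) with paragraphs separated by a space."""
    paragraphs = [section["text"].lower() for section in record["body_text"]]
    return record["paper_id"], " ".join(paragraphs)

paper_id, text = extract_text(json.loads(raw))
print(paper_id, "->", text)
```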

### 0.b Functions

In [6]:
# As kaggle only allows notebook submissions, all functions should be in the notebook. Just copy your functions and paste them here.
          
def load_data(data_dir):
    """Load data from dataset data directory."""
    sha = []
    full_text = []

    subdir = [x for x in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir,x))]

    print(f"[INFO] Loading data from {data_dir}...")
    # loop through folders with json files
    for folder in subdir:
        path = os.path.join(data_dir,folder, folder)
#       path = os.path.join(data_dir,folder, folder, 'pdf_json')
        # loop through json files and scrape data
        for file in os.listdir(path):
            file_path = os.path.join(path, file)

            # open file only if it is a file
            if os.path.isfile(file_path):
                with open(file_path) as f:
                    data_json = json.load(f)
                    sha.append(data_json['paper_id'])

                    # combine the body-text paragraphs into one lower-cased string,
                    # separated by spaces so words at paragraph boundaries do not merge
                    combined_str = ' '.join(text['text'].lower() for text in data_json['body_text'])
                        
                    full_text.append(combined_str)

            else:
                print('[WARNING]', file_path, 'not a file. Check pointed path directory in load_data().')

    loaded_samples = len(sha)
    print(f"[INFO] Data loaded into dataset instance. {loaded_samples} samples added.")
    
    df = pd.DataFrame()
    df['sha'] = sha
    df['full_text'] = full_text
    
    return df

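from datetime import datetime  # required by clean_time

def clean_time(val):
    """Parse a publish_time string into a datetime; return None if no known format matches."""
    if not isinstance(val, str):
        return None
    # try the known formats from most to least specific
    for fmt in ('%Y-%m-%d', '%Y %b %d', '%Y %b', '%Y'):
        try:
            return datetime.strptime(val, fmt)
        except ValueError:
            continue
    # fall back to the first three space-separated tokens, e.g. '2020 Mar 15 ...'
    try:
        return datetime.strptime('-'.join(val.split(' ')[:3]), '%Y-%b-%d')
    except ValueError:
        return None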
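
The date formats handled by `clean_time` can be checked in isolation. The example strings below are made up to mirror those formats. Note that `%b` matches abbreviated month names in the current locale, so the sketch assumes an English (or default C) locale.

```python
from datetime import datetime

# Illustrative inputs, one per format string tried by clean_time.
examples = {
    '2020-03-15': '%Y-%m-%d',
    '2020 Mar 15': '%Y %b %d',
    '2020 Mar': '%Y %b',
    '2020': '%Y',
}

parsed = {raw: datetime.strptime(raw, fmt) for raw, fmt in examples.items()}

for raw, value in parsed.items():
    print(f"{raw!r:>14} -> {value.date()}")
```

Missing components default to the first month and the first day, so a bare year is placed on 1 January of that year.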

In [19]:
def tokenize_check(text):
    if isinstance(text, str):
        word_tokens = word_tokenize(text)
    elif isinstance(text, list):
        word_tokens = text
    else:
        raise TypeError(f"expected str or list, got {type(text).__name__}")
    return word_tokens
    

def remove_stopwords(text, remove_symbols=False):
    """ Tokenize and/or remove stopwords and/or unwanted symbols from string"""
    list_stopwords = set(stopwords.words('english'))
    # list of signs to be removed if parameter remove_symbols set to True
    list_symbols = ['.', ',', '(', ')', '[', ']']
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)

    # filter out stopwords
    text_without_stopwords = [w for w in word_tokens if w not in list_stopwords]
    
    if remove_symbols:
        text_without_stopwords = [w for w in text_without_stopwords if w not in list_symbols]
    
    return text_without_stopwords

# from nltk.stem import WordNetLemmatizer 

def lemmatize(text):
    """ Tokenize and/or lemmatize string """
    lemmatizer = WordNetLemmatizer()
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)
    
    lemmatized_text = [lemmatizer.lemmatize(w) for w in word_tokens]
    
    return lemmatized_text

def flatten_list(l):
    """ Flatten a list of lists """
    return [item for sublist in l for item in sublist]

def dfkw_cleaning(df):
    """ Clean df for a better keyword finding """


    # Data cleaning:
    # Turn df into a dictionary with a list of key phrases
    # Lower all of them and remove null values
    dfd = {k: [x.lower() for x in v if not pd.isnull(x)] for k, v in df.to_dict('list').items()}
    
    for k, v in dfd.items():

        # Split terms that are in brackets, like "Acyclovir (Aciclovir)", and strip the
        # surrounding whitespace so that "acyclovir " becomes "acyclovir"
        v = flatten_list([x.replace('\xa0', ' ').replace(')', '').split('(') for x in v])
        v = [x.strip() for x in v if x.strip()]
        # Remove redundant values: a phrase that contains another phrase adds nothing
        # (e.g. 'coronavirus disease' is dropped when 'coronavirus' is present)
        v = [x for x in v if not any(y in x for y in v if y != x)]

        # Store the updated v
        dfd[k] = v

    # Return the clean df
    return pd.DataFrame.from_dict({k: pd.Series(v) for k, v in dfd.items()})

def find_keywords(text, df):
    """ Find the category in df that best matches the text
    Returns a tuple (category, sentences): the best-scoring category and the
    sentences in which its key phrases were found, or (np.nan, np.nan) if nothing matches """

    # Turn df into a dictionary with a list of key phrases
    # Lower all of them and remove null values
    dfd = {k: [x.lower() for x in v if not pd.isnull(x)] for k, v in df.to_dict('list').items()}

    matches = {}
    scores = {}
    
    for k, v in dfd.items():

        # Find matches

        # A plain substring test would treat "phase i" and "phase ii" the same way and count both;
        # the additional token check below prevents this
      
        for sentence in sent_tokenize(text):
            # Lower-case the sentence for better pattern finding
            sentence_l = sentence.lower()
            # Tokenize the sentence, because a substring test alone would count "sars-cov" when the actual word is "sars-cov-23"
            words = tokenize_check(sentence_l)
            # A match requires the key phrase to be in the non-tokenized sentence and all of its words to be in the token list

            for keyphrase in v:

                # Check that the individual words that compose the key phrase are all 
                # in the words list
                words_in = all([words.count(x) > 0 for x in keyphrase.split(' ')])

                # Check if the keyphrase is in the non-tokenized sentence
                insentence = keyphrase in sentence_l

                # The key phrase is in the sentence if both conditions meet
                insentence = insentence and words_in

                # Now add the match
                if insentence:
                    try:
                        already_a_match = sentence in matches[k]
                    except KeyError:
                        matches[k] = [sentence]
                    else:
                        if not already_a_match:
                            matches[k].append(sentence)
                          
        # score: number of matched sentences, scaled by the number of key phrases in the category
        if k in matches:
            scores[k] = len(matches[k]) / len(v)

    # return the key with the highest score, together with its matching sentences
    if len(scores) > 0:
        best_key = max(scores, key=scores.get)
        return best_key, matches[best_key]
    else:
        # Returning np.nan allows detecting these nan's with .isnull()
        return np.nan, np.nan

def kw_match_tables(df):
    """ Build table with boolean values indicating kw matches """

    keywords = {
        'virus': virus_keywords.columns.tolist(), 
        'stage': clinical_stage_keywords.columns.tolist(), 
        'drug': drug_keywords.columns.tolist(), 
    }

    # headers = [item for sublist in [x.columns.tolist() for x in [virus_keywords, clinical_stage_keywords, drug_keywords]] for item in sublist]
    headers = flatten_list([x.columns.tolist() for x in [virus_keywords, clinical_stage_keywords, drug_keywords]])

    # Use titles instead of sha hashes because there's a lot of papers without sha that contain the keywords; yet all papers have a title
    table = pd.DataFrame(False, index=df.title, columns=headers)

    # Fill with True values
    for k, kws in keywords.items():
        for kw in kws:
            table.loc[df[df[k] == kw].title.tolist(), kw] = True

    # Merge
    df = pd.merge(df, table, on='title')

    return df

def summarize(text):
    # TODO @Simon @Silvan: extract keywords
    return 'summary'

#def visualize_data(data,keywords,summaries):
#    #TODO @Levi @Kwan: visualize data
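
The two-part match condition used in `find_keywords` can be tried on its own. In the sketch below, the function name `phrase_in_sentence` is hypothetical and a plain whitespace split stands in for NLTK's `word_tokenize`, which is an assumption made so that the example runs without NLTK data.

```python
def phrase_in_sentence(keyphrase, sentence):
    """True if the phrase occurs in the sentence and every word of it is a whole token."""
    sentence_l = sentence.lower()
    # Assumption: whitespace split stands in for NLTK's word_tokenize.
    tokens = sentence_l.split()
    all_tokens_present = all(word in tokens for word in keyphrase.split(' '))
    return keyphrase in sentence_l and all_tokens_present

# "phase i" is a substring of "phase ii", but "i" is not a token of the sentence.
print(phrase_in_sentence('phase i', 'Results of the Phase II trial'))
print(phrase_in_sentence('phase i', 'A Phase I trial was started'))
# "sars-cov" is a substring of "sars-cov-2", but not a token on its own.
print(phrase_in_sentence('sars-cov', 'SARS-CoV-2 spreads quickly'))
```

The substring test alone would accept all three sentences. The token test rejects the first and the third.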

### 0.c Relevant strings

In [20]:
# keywords that define the virus the paper is about (likely in title)
virus_keywords = pd.read_csv(keyword_dir+'/virus_keywords.csv')

# keywords describing clinical phase
clinical_stage_keywords = pd.read_csv(keyword_dir+'/phase_keywords.csv')

# keywords describing treatment types
drug_keywords = pd.read_csv(keyword_dir+'/drug_keywords.csv')
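
Before matching, the keyword lists are normalized by `dfkw_cleaning`. The sketch below applies the same two ideas, splitting bracketed alternatives and dropping phrases that contain another phrase, to a plain Python list. The function name `clean_keyphrases` and the toy phrases are illustrative, and stripping whitespace around each part is an assumption of this sketch.

```python
def clean_keyphrases(phrases):
    """Lower-case, split bracketed alternatives, and drop phrases that contain another phrase."""
    parts = []
    for phrase in phrases:
        # "acyclovir (aciclovir)" -> ["acyclovir", "aciclovir"]
        for part in phrase.lower().replace('\xa0', ' ').replace(')', '').split('('):
            part = part.strip()
            if part:
                parts.append(part)
    # A phrase that contains another (different) phrase is redundant for substring matching.
    return [p for p in parts if not any(other in p for other in parts if other != p)]

cleaned = clean_keyphrases(['Coronavirus', 'Coronavirus disease', 'Acyclovir (Aciclovir)'])
print(cleaned)
```

Here `'coronavirus disease'` is dropped because any sentence containing it also contains `'coronavirus'`.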

### 1. Load and Preprocess the data

In [21]:
# try the preloaded dataframe to speed up the process
try:
    with open('df.pkl', 'rb') as f:
        df = pk.load(f)
except FileNotFoundError:
    # no cached dataframe yet: create the dataset object
    # (.loc[::100, :] keeps every 100th metadata row)
    meta_data = pd.read_csv(data_dir+'/metadata.csv').loc[::100, :]
    meta_data['publish_time'] = meta_data['publish_time'].apply(clean_time)
    full_texts = load_data(data_dir)

    # merge full text and metadata, so the paper selection can be performed either on full text
    # or abstract, if the full text is not available.
    df = pd.merge(meta_data,full_texts,on='sha',how='outer')
    # fall back to the abstract where no full text is available
    df['full_text'] = df['full_text'].fillna(df['abstract'])

    # drop papers that lack an abstract or a full text
    df = df.dropna(subset=['abstract','full_text'])
    df = df[df['full_text'] != 'Unknown']
    with open('df.pkl', 'wb') as f:
        pk.dump(df, f)

[INFO] Loading data from ../../src...
[INFO] Data loaded into dataset instance. 20 samples added.
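
The merge-and-fallback logic of this cell can be reproduced on toy data. In the sketch below the two small frames are made up, and `fillna` is used to copy the abstract into `full_text` where no full text exists, which assigns a whole column in one step instead of writing into a slice.

```python
import pandas as pd

# Made-up metadata and full-text tables, joined on the same 'sha' key as in the notebook.
meta = pd.DataFrame({
    'sha': ['a1', 'b2', 'c3'],
    'abstract': ['abstract a', 'abstract b', None],
})
texts = pd.DataFrame({'sha': ['a1'], 'full_text': ['body a']})

merged = pd.merge(meta, texts, on='sha', how='outer')
# Use the abstract wherever the full text is missing.
merged['full_text'] = merged['full_text'].fillna(merged['abstract'])
# Drop rows that still lack an abstract or a text.
merged = merged.dropna(subset=['abstract', 'full_text'])

print(merged)
```

Paper `b2` keeps its abstract as text, while `c3` is dropped because it has neither an abstract nor a full text.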


In [22]:
df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,full_text
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,,"Overall, James C.",American Heart Journal,,,False,custom_license,Abstract The etiologic basis for the vast majo...
2,,Elsevier,Detection and characterization of subgenomic R...,10.1016/0042-6822(88)90585-5,,2841794.0,els-covid,Abstract Defective viral particles containing ...,,"Nüesch, Jürg; Krech, Sabine; Siegl, Günter",Virology,,,False,custom_license,Abstract Defective viral particles containing ...
12,,Elsevier,Chapter 2 Virus Replication,10.1016/B978-0-12-375158-4.00002-X,,,els-covid,Publisher Summary This chapter describes virus...,,,Fenner's Veterinary Virology,,,False,custom_license,Publisher Summary This chapter describes virus...
16,,Elsevier,Chapter 8 The Industry and the Developing World,10.1016/B978-044451868-2/50010-6,,,els-covid,Publisher Summary Effective medicines are hard...,,"Dukes, Graham",The Law and Ethics of the Pharmaceutical Industry,,,False,custom_license,Publisher Summary Effective medicines are hard...
29,,Elsevier,Threatwatch: Is the Saudi virus a new SARS?,10.1016/S0262-4079(13)61215-4,,,els-covid,The Middle Eastern coronavirus has started beh...,,,New Scientist,,,False,custom_license,The Middle Eastern coronavirus has started beh...


### 2. Define virus type, clinical stage and drug type

In [23]:
try:
    with open('df_kw.pkl', 'rb') as f:
        df = pk.load(f)
except FileNotFoundError:
    # First clean the keyword dataframes for a better keyword finding
    virus_keywords = dfkw_cleaning(virus_keywords)
    clinical_stage_keywords = dfkw_cleaning(clinical_stage_keywords)
    drug_keywords = dfkw_cleaning(drug_keywords)
    # the keyword search currently runs on the abstract; running it on the full text is still to be decided
    df['virus'], df['virus_sentence'] = zip(*df['abstract'].apply(find_keywords, df=virus_keywords))
    df['stage'], df['stage_sentence'] = zip(*df['abstract'].apply(find_keywords, df=clinical_stage_keywords))
    df['drug'], df['drug_sentence'] = zip(*df['abstract'].apply(find_keywords, df=drug_keywords))
    # drop papers with nan values?
    with open('df_kw.pkl', 'wb') as f:
        pk.dump(df, f)
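
The category assigned to each paper is the one with the highest score, where the score is the number of matched sentences divided by the number of key phrases in that category. The sketch below shows that selection on made-up data. The function name `best_category` and the toy dictionaries are assumptions for illustration.

```python
def best_category(matches, keyphrases):
    """Pick the category with the most matched sentences per available key phrase.

    matches:    category -> list of matched sentences
    keyphrases: category -> list of key phrases for that category
    """
    scores = {
        category: len(sentences) / len(keyphrases[category])
        for category, sentences in matches.items()
        if sentences
    }
    if not scores:
        return None
    # max over the dictionary keys, ranked by their score
    return max(scores, key=scores.get)

keyphrases = {
    'preclinical': ['in vitro', 'mouse', 'animal model', 'cell culture'],
    'phase 1': ['phase i', 'first-in-human'],
}
matches = {
    'preclinical': ['sentence 1', 'sentence 2'],  # 2 / 4 = 0.5
    'phase 1': ['sentence 3', 'sentence 4'],      # 2 / 2 = 1.0
}

winner = best_category(matches, keyphrases)
print(winner)
```

Dividing by the number of key phrases keeps categories with long keyword lists from winning simply because they offer more phrases to match.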

In [24]:
df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,...,WHO #Covidence,has_full_text,full_text_file,full_text,virus,virus_sentence,stage,stage_sentence,drug,drug_sentence
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,,"Overall, James C.",...,,False,custom_license,Abstract The etiologic basis for the vast majo...,,,,,,
2,,Elsevier,Detection and characterization of subgenomic R...,10.1016/0042-6822(88)90585-5,,2841794.0,els-covid,Abstract Defective viral particles containing ...,,"Nüesch, Jürg; Krech, Sabine; Siegl, Günter",...,,False,custom_license,Abstract Defective viral particles containing ...,,,,,,
12,,Elsevier,Chapter 2 Virus Replication,10.1016/B978-0-12-375158-4.00002-X,,,els-covid,Publisher Summary This chapter describes virus...,,,...,,False,custom_license,Publisher Summary This chapter describes virus...,,,preclinical,[Before the development of in vitro cell cultu...,,
16,,Elsevier,Chapter 8 The Industry and the Developing World,10.1016/B978-044451868-2/50010-6,,,els-covid,Publisher Summary Effective medicines are hard...,,"Dukes, Graham",...,,False,custom_license,Publisher Summary Effective medicines are hard...,,,preclinical,[A pharmaceutical corporation that wishes to m...,,
29,,Elsevier,Threatwatch: Is the Saudi virus a new SARS?,10.1016/S0262-4079(13)61215-4,,,els-covid,The Middle Eastern coronavirus has started beh...,,,...,,False,custom_license,The Middle Eastern coronavirus has started beh...,,,,,,


#### 2.2. Add boolean table for keyword findings

In [25]:
# Update df with the tables with boolean values
df = kw_match_tables(df)
df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,...,preclinical,Phase 0,Phase 1,Phase 2,Phase 3,Phase 4,antiviral drugs,less common viral inhibitors,therapeutics,vaccine
0,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,,"Overall, James C.",...,False,False,False,False,False,False,False,False,False,False
1,,Elsevier,Detection and characterization of subgenomic R...,10.1016/0042-6822(88)90585-5,,2841794.0,els-covid,Abstract Defective viral particles containing ...,,"Nüesch, Jürg; Krech, Sabine; Siegl, Günter",...,False,False,False,False,False,False,False,False,False,False
2,,Elsevier,Chapter 2 Virus Replication,10.1016/B978-0-12-375158-4.00002-X,,,els-covid,Publisher Summary This chapter describes virus...,,,...,True,False,False,False,False,False,False,False,False,False
3,,Elsevier,Chapter 8 The Industry and the Developing World,10.1016/B978-044451868-2/50010-6,,,els-covid,Publisher Summary Effective medicines are hard...,,"Dukes, Graham",...,True,False,False,False,False,False,False,False,False,False
4,,Elsevier,Threatwatch: Is the Saudi virus a new SARS?,10.1016/S0262-4079(13)61215-4,,,els-covid,The Middle Eastern coronavirus has started beh...,,,...,False,False,False,False,False,False,False,False,False,False
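
A boolean table of this kind can also be built with `pd.get_dummies`. This is an alternative to the loop in `kw_match_tables`, not the method used above. The sketch assumes a toy frame with a single `stage` column, and `stage_labels` is an illustrative subset of the stage names.

```python
import pandas as pd

stage_labels = ['preclinical', 'Phase 1', 'Phase 2']  # illustrative subset of the stage columns

papers = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C'],
    'stage': ['preclinical', 'Phase 1', None],  # None: no stage keyword matched
})

# One boolean column per stage that occurs; rows without a stage are False everywhere.
flags = pd.get_dummies(papers['stage']).astype(bool)
# Add columns for stages that never occurred, so the table always has the same headers.
flags = flags.reindex(columns=stage_labels, fill_value=False)

result = pd.concat([papers, flags], axis=1)
print(result)
```

Because the flags share the row index of `papers`, no merge on `title` is needed, which sidesteps duplicate titles.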


### 3. Summarize the texts

In [26]:
df['summary'] = df['full_text'].apply(summarize)

### 4. Visualize extracted papers, links and summaries

In [14]:
visualize_data(df)

NameError: name 'visualize_data' is not defined