## COVID-19 Open Research Dataset Challenge - What do we know about vaccines and therapeutics?
The following questions were analysed specifically: 
- Effectiveness of drugs being developed and tried to treat COVID-19 patients.
  - Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
- Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
- Exploration of use of best animal models and their predictive value for a human vaccine.
- Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
- Efforts targeted at a universal coronavirus vaccine.
- Efforts to develop animal models and standardize challenge studies.
- Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics).

## Our approach - Creating a timeline visualizing the progress of vaccines/cures on COVID-19 and other similar viral diseases.
Our goal is to create an intuitive visualization of the progress of research on vaccines and therapeutics for COVID-19. This gives professional researchers a quick overview of the clinical trial stage of each investigated vaccine or therapeutic, and it gives the public a better understanding of the time frame in which to expect a cure or solution. We decided to visualize research progress for other viruses as well as COVID-19, to get a better picture of the timescale and amount of research that goes into making a vaccine or therapeutic.

Several steps were taken to create the visualizations:
1. Load and preprocess the data:
    - lemmatize all texts and remove stopwords
2. Categorize papers based on keywords 
    - using either string pattern matching or word embeddings
    - relevant words were manually selected based on the research questions and on how indicative they are of the clinical trial stage (e.g. mouse vs human test subject, words expressing certainty etc.)
    - categories are: virus, clinical stage, drug type
3. Extract keywords/summaries from selected papers
    - TODO: write how we do this @Simon, @Silvan
4. Visualize extracted papers, links and summaries
    - TODO: explain how (after we know how) @Levi @Gloria
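
Step 2 can be illustrated with a small, self-contained sketch of keyword-based categorization. It uses plain string matching only. The category names, key phrases and the helper `categorize` are hypothetical placeholders, not the contents of the keyword files used in this notebook, and the regex sentence splitter is only a stand-in for NLTK's `sent_tokenize`.

```python
import re

# Hypothetical key phrases per category (illustrative, not the real keyword files)
EXAMPLE_KEYWORDS = {
    "preclinical": ["mouse", "mice", "in vitro"],
    "clinical": ["patients", "randomized"],
}

def split_sentences(text):
    """Naive sentence splitter; a stand-in for nltk.sent_tokenize."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def categorize(text, keywords):
    """Return (best_category, matching_sentences), or (None, []) if nothing matches.

    The score of a category is the number of distinct matching sentences
    divided by the number of key phrases in that category.
    """
    text = text.lower()
    scores, matches = {}, {}
    for category, phrases in keywords.items():
        hits = [s for s in split_sentences(text) if any(p in s for p in phrases)]
        if hits:
            matches[category] = hits
            scores[category] = len(hits) / len(phrases)
    if not scores:
        return None, []
    best = max(scores, key=scores.get)
    return best, matches[best]

example = "The vaccine was tested in mice. Antibody titers rose in vitro. No patients were enrolled."
label, sentences = categorize(example, EXAMPLE_KEYWORDS)
print(label, len(sentences))
```

Dividing by the number of key phrases keeps a category with a long keyword list from winning merely because it has more chances to match.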


### 0.a Imports

In [1]:
# TODO: write your imports here
import os
import json
from datetime import datetime  # needed by clean_time()

import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

from nltk.stem import WordNetLemmatizer

import pickle as pk
import numpy as np

# path to data
data_dir = '../../src'  
keyword_dir = '../../keywords'

### 0.b Functions

In [2]:
# As kaggle only allows notebook submissions, all functions should be in the notebook. Just copy your functions and paste them here.
          
def load_data(data_dir):
    """Load data from dataset data directory."""
    sha = []
    full_text = []

    subdir = [x for x in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir,x))]

    print(f"[INFO] Loading data from {data_dir}...")
    # loop through folders with json files
    for folder in subdir:
#             path = os.path.join(data_dir,folder, folder)
        path = os.path.join(data_dir,folder, folder, 'pdf_json')
        # loop through json files and scrape data
        for file in os.listdir(path):
            file_path = os.path.join(path, file)

            # open file only if it is a file
            if os.path.isfile(file_path):
                with open(file_path) as f:
                    data_json = json.load(f)
                    sha.append(data_json['paper_id'])

                    # combine the body-text paragraphs into one lowercase string,
                    # separated by spaces so words at paragraph borders do not fuse
                    combined_str = ' '.join(
                        text['text'].lower() for text in data_json['body_text'])
                        
                    full_text.append(combined_str)

            else:
                print('[WARNING]', file_path, 'not a file. Check pointed path directory in load_data().')

    loaded_samples = len(sha)
    print(f"[INFO] Data loaded into dataset instance. {loaded_samples} samples added.")
    
    df = pd.DataFrame()
    df['sha'] = sha
    df['full_text'] = full_text
    
    return df

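def clean_time(val):
    """Parse a publish_time value into a datetime; return None if no format fits."""
    if not isinstance(val, str):
        return None
    # try the known formats in order; the last candidate keeps only the first
    # three space-separated parts of the value, joined with '-'
    candidates = [
        (val, '%Y-%m-%d'),
        (val, '%Y %b %d'),
        (val, '%Y %b'),
        (val, '%Y'),
        ('-'.join(val.split(' ')[:3]), '%Y-%b-%d'),
    ]
    for text, fmt in candidates:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None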

In [3]:
def tokenize_check(text):
    if isinstance(text, str):
        word_tokens = word_tokenize(text)
    elif isinstance(text, list):
        word_tokens = text
    else:
        raise TypeError
    return word_tokens
    

def remove_stopwords(text, remove_symbols=False):
    """ Tokenize and/or remove stopwords and/or unwanted symbols from string"""
    list_stopwords = set(stopwords.words('english'))
    # list of signs to be removed if parameter remove_symbols set to True
    list_symbols = ['.', ',', '(', ')', '[', ']']
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)

    # filter out stopwords
    text_without_stopwords = [w for w in word_tokens if w not in list_stopwords]
    
    if remove_symbols is True:
        text_without_stopwords = [w for w in text_without_stopwords if w not in list_symbols]
    
    return text_without_stopwords

# from nltk.stem import WordNetLemmatizer 

def lemmatize(text):
    """ Tokenize and/or lemmatize string """
    lemmatizer = WordNetLemmatizer()
    
    # check input type and tokenize if not already
    word_tokens = tokenize_check(text)
    
    lemmatized_text = [lemmatizer.lemmatize(w) for w in word_tokens]
    
    return lemmatized_text

def find_keywords(text, df):
    """ Find the best-matching category in df for a text.
    Each column of df is a category and its values are the key phrases of that category.
    Returns the category with the highest score and the sentences in which its key phrases were found,
    or ('nan', 'nan') if nothing matches. """

    # Data cleaning:
    # Turn df into a dictionary with a list of key phrases
    # Lower all of them and remove null values
    dfd = {k: [x.lower() for x in v if not pd.isnull(x)] for k, v in df.to_dict('list').items()}
    
    matches = {}
    scores = {}
    
    # Remove redundant values (i.e., ['coronavirus', 'coronavirus disease'] can be left as ['coronavirus']; the element 'coronavirus disease' is useless)
    for k, v in dfd.items():
        # print(k)
        v = [x for x in v if not any([y in x for y in [z for z in v if z != x]])]
        dfd[k] = v

        # Find matches
        # Use the loop we're in where we've already cleaned the data to find the matches
        
        # caveat: plain substring matching treats 'phase i' and 'phase ii' the same way,
        # so a sentence about 'phase ii' is also counted for 'phase i'
        
        for sentence in sent_tokenize(text):
            for keyphrase in v:
                if keyphrase in sentence:
                    try:
                        already_a_match = sentence in matches[k]
                    except KeyError:
                        matches[k] = [sentence]
                    else:
                        if not already_a_match:
                            matches[k].append(sentence)
                            
        # score: number of matched sentences for this category, scaled by the number of key phrases to choose from
        if k in matches:
            scores[k] = len(matches[k])/len(v)

    # return the key with the highest score. also return the sentences for this.
    # (np.argmax on dict.values() does not see the individual scores, so use max() on the dict)
    if len(scores) > 0:
        max_score = max(scores, key=scores.get)
        return max_score, matches[max_score]
    else:
        return 'nan','nan'

def summarize(text):
    # TODO @Simon @Silvan: extract keywords
    return 'summary'

#def visualize_data(data,keywords,summaries):
#    #TODO @Levi @Kwan: visualize data
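
The substring test `keyphrase in sentence` used in `find_keywords` cannot tell `phase i` from `phase ii`, as the comment in that function notes. One possible remedy, shown here as a sketch rather than as the method used in this notebook, is to match key phrases as whole words with a regular expression. The helper name `contains_phrase` is hypothetical.

```python
import re

def contains_phrase(sentence, phrase):
    """True if `phrase` occurs in `sentence` as whole words, not inside a longer word."""
    # no word character directly before or after the phrase
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, sentence) is not None

sentence = "the drug entered a phase ii trial last year."
print("phase i" in sentence)                   # True: the substring also hits 'phase ii'
print(contains_phrase(sentence, "phase i"))    # False
print(contains_phrase(sentence, "phase ii"))   # True
```

The lookarounds are used instead of `\b` so that phrases that begin or end with a non-word character are still handled sensibly.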

### 0.c Relevant strings

In [4]:
# keywords that define the virus the paper is about (likely in title)
virus_keywords = pd.read_csv(keyword_dir+'/virus_keywords.csv')

# keywords describing clinical phase
clinical_stage_keywords = pd.read_csv(keyword_dir+'/phase_keywords.csv')

# keywords describing treatment types
drug_keywords = pd.read_csv(keyword_dir+'/drug_keywords.csv')

### 1. Load and Preprocess the data

In [6]:
# try the preloaded dataframe to speed up the process
try:
    with open('df.pkl','rb') as f:
        df = pk.load(f)
except FileNotFoundError:
    # create dataset object
    meta_data = pd.read_csv(data_dir+'/metadata.csv')
    meta_data['publish_time'] = meta_data['publish_time'].apply(clean_time)
    full_texts = load_data(data_dir)

    # merge full text and metadata, so the paper selection can be performed either on full text
    # or abstract, if the full text is not available.
    df = pd.merge(meta_data,full_texts,on='sha',how='outer')
    df['full_text'] = df['full_text'].fillna(df['abstract'])

    # drop papers that lack an abstract or any text to analyse
    df = df.dropna(subset=['abstract','full_text'])
    df = df[df['full_text'] != 'Unknown']
    with open('df.pkl','wb') as f:
        pk.dump(df, f)
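
The cell above follows a load-or-build caching pattern: reuse a pickled result when it exists, otherwise compute it and store it. A generic sketch of that pattern using only the standard library is given below. `load_or_build`, `expensive_step` and the example file name are hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build):
    """Load a pickled object from cache_path, or call build() and cache its result."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = build()
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result

calls = []

def expensive_step():
    calls.append(1)  # record how often the slow path actually runs
    return {"papers": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_cache.pkl")
    first = load_or_build(path, expensive_step)   # builds and writes the cache
    second = load_or_build(path, expensive_step)  # served from the cache

print(first == second, len(calls))
```

The pattern does not detect a stale cache, so the pickle file has to be deleted by hand whenever the preprocessing changes.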

In [7]:
df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,full_text
0,8q5ondtn,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,"Overall, James C.",American Heart Journal,,,False,False,custom_license,https://doi.org/10.1016/0002-8703(72)90077-4,Abstract The etiologic basis for the vast majo...
3,cjuzul89,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285.0,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,"Garibaldi, Richard A.",The American Journal of Medicine,,,False,False,custom_license,https://doi.org/10.1016/0002-9343(85)90361-4,Abstract Upper respiratory tract infections ar...
4,jhx90hh0,,Elsevier,Monoclonal antibodies identify multiple epitop...,10.1016/0006-291x(85)91946-1,,2409966.0,els-covid,Abstract Nine hybridoma cell lines secreting a...,1985-06-28,"Cherel, Isabelle; Grosclaude, Jeanne; Rouze, P...",Biochemical and Biophysical Research Communica...,,,False,False,custom_license,https://doi.org/10.1016/0006-291x(85)91946-1,Abstract Nine hybridoma cell lines secreting a...
15,iqswl5kh,,Elsevier,Morphology and morphogenesis of a coronavirus ...,10.1016/0014-4800(76)90045-9,,187445.0,els-covid,Abstract The morphology and morphogenesis of v...,1976-12-31,"Doughri, A.M.; Storz, J.; Hajer, I.; Fernando,...",Experimental and Molecular Pathology,,,False,False,custom_license,https://doi.org/10.1016/0014-4800(76)90045-9,Abstract The morphology and morphogenesis of v...
28,z65m48tn,,Elsevier,Demonstration of viral antigen and immunoglobu...,10.1016/0021-9975(89)90122-9,,2469703.0,els-covid,Abstract Haemagglutinating encephalomyelitis v...,1989-02-28,"Narita, M.; Kawamura, H.; Haritani, M.; Kobaya...",Journal of Comparative Pathology,,,False,False,custom_license,https://doi.org/10.1016/0021-9975(89)90122-9,Abstract Haemagglutinating encephalomyelitis v...


### 2. Define virus type, clinical stage and drug type

In [8]:
try:
    with open('df_kw.pkl','rb') as f:
        df = pk.load(f)
except FileNotFoundError:
    # the keyword search currently runs on the abstract --> think about applying it to the full text instead
    df['virus'], df['virus_sentence'] = zip(*df['abstract'].apply(find_keywords,df=virus_keywords))
    df['stage'], df['stage_sentence'] = zip(*df['abstract'].apply(find_keywords,df=clinical_stage_keywords))
    df['drug'], df['drug_sentence'] = zip(*df['abstract'].apply(find_keywords,df=drug_keywords))
    
    # drop papers with nan values?
    with open('df_kw.pkl','wb') as f:
        pk.dump(df, f)
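
Each call to `find_keywords` returns a pair (category, sentences), and `zip(*...)` transposes the resulting sequence of pairs into two sequences that can be assigned to two columns. The idiom can be seen in isolation with plain lists. `fake_find` is a hypothetical stand-in for `find_keywords`.

```python
def fake_find(text):
    """Hypothetical stand-in that returns a (label, evidence) pair per text."""
    if "mice" in text:
        return "preclinical", [text]
    return "nan", "nan"

texts = ["tested in mice", "no keywords here"]
pairs = [fake_find(t) for t in texts]

# zip(*pairs) turns [(a1, b1), (a2, b2)] into (a1, a2) and (b1, b2)
labels, evidence = zip(*pairs)
print(labels)
print(evidence)
```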

In [9]:
df

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,...,has_pmc_xml_parse,full_text_file,url,full_text,virus,virus_sentence,stage,stage_sentence,drug,drug_sentence
0,8q5ondtn,,Elsevier,Intrauterine virus infections and congenital h...,10.1016/0002-8703(72)90077-4,,4361535.0,els-covid,Abstract The etiologic basis for the vast majo...,1972-12-31,...,False,custom_license,https://doi.org/10.1016/0002-8703(72)90077-4,Abstract The etiologic basis for the vast majo...,,,preclinical,[Abstract The etiologic basis for the vast maj...,therapeutics,[Successful control of virus-induced congenita...
3,cjuzul89,,Elsevier,Epidemiology of community-acquired respiratory...,10.1016/0002-9343(85)90361-4,,4014285.0,els-covid,Abstract Upper respiratory tract infections ar...,1985-06-28,...,False,custom_license,https://doi.org/10.1016/0002-9343(85)90361-4,Abstract Upper respiratory tract infections ar...,common cold,[Serotypes of the rhinoviruses account for 20 ...,preclinical,[Pneumonia remains an important cause of morbi...,therapeutics,"[Given the diversity of pathogenic agents, it ..."
4,jhx90hh0,,Elsevier,Monoclonal antibodies identify multiple epitop...,10.1016/0006-291x(85)91946-1,,2409966.0,els-covid,Abstract Nine hybridoma cell lines secreting a...,1985-06-28,...,False,custom_license,https://doi.org/10.1016/0006-291x(85)91946-1,Abstract Nine hybridoma cell lines secreting a...,,,,,,
15,iqswl5kh,,Elsevier,Morphology and morphogenesis of a coronavirus ...,10.1016/0014-4800(76)90045-9,,187445.0,els-covid,Abstract The morphology and morphogenesis of v...,1976-12-31,...,False,custom_license,https://doi.org/10.1016/0014-4800(76)90045-9,Abstract The morphology and morphogenesis of v...,,,,,,
28,z65m48tn,,Elsevier,Demonstration of viral antigen and immunoglobu...,10.1016/0021-9975(89)90122-9,,2469703.0,els-covid,Abstract Haemagglutinating encephalomyelitis v...,1989-02-28,...,False,custom_license,https://doi.org/10.1016/0021-9975(89)90122-9,Abstract Haemagglutinating encephalomyelitis v...,,,preclinical,[Abstract Haemagglutinating encephalomyelitis ...,,
29,8zilr4cy,,Elsevier,Bovine herpesvirus-1-induced pharyngeal tonsil...,10.1016/0021-9975(92)90053-w,,1602058.0,els-covid,Abstract The potential involvement of the phar...,1992-04-30,...,False,custom_license,https://doi.org/10.1016/0021-9975(92)90053-w,Abstract The potential involvement of the phar...,,,,,,
30,i60trxri,,Elsevier,Pathogenicity and antigen detection of the Nou...,10.1016/0021-9975(92)90068-6,,1313460.0,els-covid,Abstract We compared the pathogenicity and the...,1992-01-31,...,False,custom_license,https://doi.org/10.1016/0021-9975(92)90068-6,Abstract We compared the pathogenicity and the...,,,preclinical,[Abstract We compared the pathogenicity and th...,,
31,49nvudqj,,Elsevier,Western and dot immunoblotting analysis of vir...,10.1016/0022-1759(84)90043-7,,6208281.0,els-covid,Abstract Viral proteins were separated by sodi...,1984-10-12,...,False,custom_license,https://doi.org/10.1016/0022-1759(84)90043-7,Abstract Viral proteins were separated by sodi...,,,,,,
32,x5ygn82d,,Elsevier,Plaque/focus immunoassay: a simple method for ...,10.1016/0022-1759(84)90301-6,,6389707.0,els-covid,"Abstract A new, simple enzyme-linked immunosor...",1984-11-30,...,False,custom_license,https://doi.org/10.1016/0022-1759(84)90301-6,"Abstract A new, simple enzyme-linked immunosor...",,,preclinical,"[Abstract A new, simple enzyme-linked immunoso...",,
43,tcrw7lzd,,Elsevier,Presence of infectious polyadenylated RNA in t...,10.1016/0042-6822(77)90498-6,,193262.0,els-covid,Abstract Avian infectious bronchitis virus (IB...,1977-04-30,...,False,custom_license,https://doi.org/10.1016/0042-6822(77)90498-6,Abstract Avian infectious bronchitis virus (IB...,,,,,,


### 3. Summarize the texts

In [10]:
df['summary'] = df['full_text'].apply(summarize)

### 4. Visualize extracted papers, links and summaries

In [15]:
visualize_data(df)

NameError: name 'visualize_data' is not defined