#### We are a team of three: one machine learning/NLP scientist, one front-end senior software engineer and one co-op data science intern/engineer. We all work in the same place, Thomson Reuters, and that’s how we know each other.  

### Task: 
What do we know about COVID-19 risk factors? 

### How we approached the challenge: 
Our platform was built based on the following principles: 
- Empower users (health researchers) to conduct literature survey efficiently. 
- Adaptable to future needs and challenges of health researchers.  
- Modularized to have the capability of being improved and polished in a short time and in parallel.  

### Our hypotheses/why? 

AI can benefit researchers by extracting and visualizing information in the most efficient and relevant manner at scale.  In the absence of expert feedback and annotations, we have built/provided a platform that extracts and ranks relevant info with potential for improvement in the future by taking advantage of expert annotation.

### How we solve it/why? 

The project was done in two slightly different approaches.  
The first approach focused on developing an end to end pipeline to address the first subtask which was: 
- Data on potential risks factors 
- Smoking, pre-existing pulmonary disease 
- Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities 
- Neonates and pregnant women 
- Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences. 

We have developed a web app to visually accompany the analysis carried out for this task. 
The second approach was focused on question information retrieval but isn’t accompanied by with a web-app yet due to time constraint. 

### First approach had the following recipe: 
The platform is designed to visualize snippets of relevant topics through the following process: 
- Merge different sources of the data
- Extract all COVID 19 related papers 
- Expand and process the list of keywords
- Find excerpts of papers that include the keywords 
- Rank the excerpts
- Create an extractive summary for each paper
- Serve it to the web-app for visualization

### Second approach had the following recipe: 
The platform is designed to rank snippets of relevant topics through the following process: 

- Merge different sources of the data
- Extract all COVID 19 related paper s
- Convert a question into set of keywords (The questions are extracted from the [medical dictionary](https://docs.google.com/spreadsheets/d/1NoiAFJoydk3zuc-G0qqROarkhaGpfgbQhTVYhbYtLCM/edit#gid=0) shared with Kaggle participants.
- Expand the list of keywords
- Find excerpts of papers that include the keywords 
- Rank the excerpts
- Create an extractive summary for each paper 

### Pros and Cons of our approach and platform: 

#### Pros: 

- The code base is highly modular and simple  
- The code is very well documented 
- The visualization resonates with researchers. We interviewed an expert in epidemiology, since that is our target user. He liked all aspects of it including the extractive summaries of the papers. He also noted that using this platform could reduce the time taken for a literature survey, which normally takes 3 to 4 months, to less than a month.  
- The platform is taking a high recall approach to include everything 
- With a little bit of effort, we can convert this platform to an expert annotation platform where experts can interactively click on irrelevant sentences in the snippets and convert a completely unsupervised approach to a supervised learning task. 

#### Cons: 
- Keyword search is not the most efficient search. Word embedding is known to be a better approach for text analysis but there wasn’t enough time to implement and evaluate that. 
- Our ranking follows a simple approach, where an expert should decide on the metric for ranking the importance of a snippet. 
- The second approach is not coming with a visualization. 
- There may be more risk factors than what we currently searched for and an algorithm should extract the unknown risks too. 

In [1]:
import os
import io
import re
import sys
import glob
import json
import string
import requests

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

stemmer = PorterStemmer()
root = '../../kaggle_data/'
stop_words = list(set(stopwords.words('english')))
stop_words.extend(['within', 'what', 'how', 'eg', 'ie'])

## Data Parsing and Extraction

### `PaperLoader` class will load all papers for the challenge and provide an interface for us to obtain `DataFrames` to work with. The focus will be on:
- Obtaining Paper title, Abstract, Body
    - The text body is filtered to remove sections containing lots of citations and hyperlinks
- Obtaining Authors, Journal of Publication, Publication Date and Publication Date
- Obtaining journal ratings(H index) to potentially sort paper based on journal quality
    - For the journal ratings, we use a list we obtained from **INSERT LINK HERE**

In [2]:
class PaperLoader():
    """
    Loads, parses and merges metadata for papers
    """
    
    def __init__(self, root_dir, no_bib=True):
        """
        Initializes PaperLoader class to read all .json files from root_directory
            
            no_bib: if true, clean noisy sections with bibliographies
            root_dir: root directory for papers
        """
        self.ROOT_DIR = root_dir
        self.JSON_FILES = glob.glob(f'{root}/**/*.json', recursive=True)
        self.PAPERS_COLUMN = {
            "doc_id": [None],
            "title": [None],
            "abstract": [None],
            "text_body": [None]
        }
        self.PAPERS_DF = None
        self.NO_BIB = no_bib

    
    def __clean_bib(self, body_text, thres):
        """
        Removes sections with more than 5 URL/DOI/HTTP instances
            
            body_text: array of dictionaries for text_body
            thres: number of hyperlinks tolerated before removal 
        """
        # Sometimes, the text body has duplicate sections consecutively.
        merged_body = []
        for segment in body_text:
            # We will combine these duplicate sections
            if len(merged_body) > 0:
                if merged_body[-1]['section'] == segment['section']:
                    merged_body[-1]['text'] += '\n' + segment['text']
                    continue
            merged_body.append(segment)

        merged_body = [
            segment for segment in merged_body
            if len(re.findall("(http|doi|www)", segment['text'])) <= thres
        ]
        return merged_body


    def create_paper_df(self):
        """
        Creates a Pandas DataFrame from all json files in root_directory
        Each json file represents a paper. 
        Features extracted are: doc_id, title, abstract, text_body
        """
        df_list = []
        
        for i in tqdm(range(len(self.JSON_FILES))):
            file_name = self.JSON_FILES[i]
            
            #Initialize row for returned df. Each row represents a paper
            row = {x: None for x in self.PAPERS_COLUMN}

            with open(file_name) as json_data:
                data = json.load(json_data)

                row['doc_id'] = data['paper_id']
                row['title'] = data['metadata']['title']
                
                # If title is empty, we skip the paper
                if len(row['title']) <= 2:
                    continue

                # If a paper does not have an abstract of a body, we will skip it
                if ('abstract' not in data or 'body_text' not in data):
                    continue
                else:
                    # Now need all of the abstract. Put it all in
                    # a list then use str.join() 
                    abstract_list = [abst['text'] for abst in data['abstract']]
                    abstract = "\n ".join(abstract_list)

                # Skip the paper if abstract is empty
                if len(abstract) <= 2:
                    continue

                row['abstract'] = abstract

                # And lastly the body of the text.
                # These clauses check if the user wants to clean up references
                if self.NO_BIB:
                    body_list = self.__clean_bib(data['body_text'], 4)
                else:
                    body_list = [bt for bt in data['body_text']]

                row['text_body'] = body_list

                df_list.append(row)
        # create final dataframe
        self.PAPERS_DF = pd.DataFrame(df_list)


    def merge_metadata(self, metadata = 'metadata.csv'):
        """
            Joins paper information with information on journal for paper,
            authors, doi and published date  
                metadata: path to csv file containing metadata
        """
        metadata_df = pd.read_csv(self.ROOT_DIR + metadata)
        metadata_df = metadata_df.loc[:, 
                          ['sha', 'publish_time', 'authors', 'journal', 'doi']]
        self.PAPERS_DF = self.PAPERS_DF.merge(metadata_df,
                                              left_on='doc_id',
                                              right_on='sha',
                                              how='inner')

    def merge_journals(self):
        """
        Joins paper information with information on journal ratings
        Important column: H_Index
        """
        journal_df = pd.read_csv(root + 'scimagoj_2018.csv', sep=';')
        papers_ratings_df = self.PAPERS_DF.merge(
            journal_df.loc[:, ['Title', 'H index']],
            left_on='journal',
            right_on='Title',
            how='left')
        papers_ratings_df = papers_ratings_df.drop(
            ['sha', 'Title'], axis=1).reset_index(drop=True)
        self.PAPERS_DF = papers_ratings_df

    def get_df(self):
        """
        Returns processed dataframe
        """
        self.PAPERS_DF = self.PAPERS_DF.dropna(
            subset=['abstract', 'text_body'])
        return self.PAPERS_DF

We will now parse the papers from our data(root) directory and store them in `papers_df`.

In [3]:
paper_loader = PaperLoader(root)
paper_loader.create_paper_df()
paper_loader.merge_metadata()
paper_loader.merge_journals()
papers_df = paper_loader.get_df()

HBox(children=(FloatProgress(value=0.0, max=59311.0), HTML(value='')))




In [4]:
papers_df.head(2)

Unnamed: 0,doc_id,title,abstract,text_body,publish_time,authors,journal,doi,H index
0,306ef95a3a91e13a93bcc37fb2c509b67c0b5640,A Novel Approach for a Novel Pathogen: using a...,Thousands of people in the United States have ...,[{'text': 'The 2019 novel coronavirus (SARS-Co...,2020-03-12,"Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic...",Clin Infect Dis,10.1093/cid/ciaa256,
1,6599ebbef3d868afac9daa4f80fa075675cf03bc,International aviation emissions to 2025: Can ...,"International aviation is growing rapidly, res...","[{'text': 'Sixty years ago, civil aviation was...",2009-01-31,"Macintosh, Andrew; Wallace, Lailey",Energy Policy,10.1016/j.enpol.2008.08.029,178.0


In [5]:
papers_df.shape

(25312, 9)

## Filtering for covid-19 related papers released after 2019
There is a lot of noise in this dataset due to information about other strains of coronavirus so we will select only the papers that are related to Covid-19. 

While the older papers may contain some important insight on the variance among the  different strains of coronavirus, for our purposes, we will only be looking at papers published on 2019 or later because that is when Covid-19 was first discovered in humans.

In [6]:
# List of keywords for covid-19
cov_list = [
    'novel coronavi',
    'covid',
    'cov_2',
    'cord-19',
    'cord 19',
    '2019-nCoV',
    '2019 ncov',
    '2019 cov',
    'wuhan coronavi',
]

### `RelevantFilter` class will filter the dataframe from `PaperLoader` and filter for covid-19 papers published on 2019 or later. 
We will need to supply a list of covid-related keywords to filter from to the `constructor`


In [7]:
class RelevantFilter():
    
    def __init__(self, keywords, year='2019'):
        """
        constructor for RelevantFilter
            keywords: keywords to filter for
            year: papers written before this year will be discarded
        """
        self.KEYWORDS = keywords
        self.YEAR = year

    def extract_recent(self, df):
        """
        extracts documents published on or after self.YEAR
        """
        return df[df['publish_time'] >= self.YEAR]

    def filter_papers(self, df):
        """
        Filters for papers whose title have mention of 
        any of the terms in self.KEYWORDS
        """
        pattern = re.compile('(' + "|".join(self.KEYWORDS) + ')',
                                 re.IGNORECASE)
        # We will filter for rows with one or more matches 
        # for title and covid keywords
        df = df[df['title'].apply(lambda x: 
                                  len(pattern.findall(x)) >= 1
                                  if x else False)]
        
        return df

We will filter through `papers_df` to get only covid-19 related papers in `covid_df`

In [8]:
covid_filter = RelevantFilter(cov_list, '2019')
covid_df = covid_filter.filter_papers(papers_df)
covid_df = covid_filter.extract_recent(covid_df)

In [9]:
covid_df.shape

(929, 9)

In [10]:
covid_df.head(1)

Unnamed: 0,doc_id,title,abstract,text_body,publish_time,authors,journal,doi,H index
0,306ef95a3a91e13a93bcc37fb2c509b67c0b5640,A Novel Approach for a Novel Pathogen: using a...,Thousands of people in the United States have ...,[{'text': 'The 2019 novel coronavirus (SARS-Co...,2020-03-12,"Bryson-Cahn, Chloe; Duchin, Jeffrey; Makarewic...",Clin Infect Dis,10.1093/cid/ciaa256,


In [11]:
list(covid_df.head(2)['title'].values)

['A Novel Approach for a Novel Pathogen: using a home assessment team to evaluate patients for 2019 novel coronavirus (SARS-CoV-2)',
 'Modeling the dynamics of novel coronavirus (2019-nCov) with fractional derivative']

## Keyword Analysis (Phase 1)

We will now go through the papers to extract and rank excerpts that contain relevant information about risk factors for covid-19. 
We will do this through an analysis of:

- Risk factors for covid-19
- Study designs
    - We will use this to evaluate the quality of a paper's methodologies for our rankings
- Outcomes
    - We will incentivise excerpts to explicitly mention outcomes that we have found researchers look for(in our interviews)
- Fatality
    - We have determined that information on mortality and fatality would be of high value to researcehrs, and rightly so.

**Note**: The list of keywords were all obtained from a crowdsourced medical dictionary researchers had assembled. You can find more details [here](https://docs.google.com/spreadsheets/d/1t2e3CHGxHJBiFgHeW0dfwtvCG4x0CDCzcTFX7yz9Z2E/edit#gid=1217643351)

In [12]:
risk_factors = [{
    'name': 'smoking',
    'pattern': 'smok'
}, {
    'name': 'diabetes',
    'pattern': 'diabete'
}, {
    'name': 'pregnancy',
    'pattern': 'pregnan'
}, {
    'name': 'tuberculosis',
    'pattern': '(tubercul|mtb|\btb[A-Za-z0-9]\b)'
}, {
    'name': 'hypertension',
    'pattern': 'hypertension'
}, {
    'name': 'cancer',
    'pattern': 'cancer'
}, {
    'name': 'neonates',
    'pattern': '(baby|neonate|enfant)'
},
    {
    'name': 'liver disease',
    'pattern': 'liver disease'
},{
    'name': 'COPD',
    'pattern': 'COPD'
},{
    'name': 'pulmonary disease',
    'pattern': 'pulm'
},{
    'name': 'race/ethnicity',
    'pattern': 'ethnic'
}]

In [13]:
design_list = [
    'mathemat', 'profil', 'cross sectional case control',
    'matched case control', 'contact', 'surviv', 'tracing,', 'time to event',
    'time-to-event', 'risk factor analysis', 'logistic regression',
    'cross-sectional case-control', 'matched case-control',
    'observational case series', 'time series analysis', 'survival analysis',
    'investigati', 'model', 'outbreak', 'stochast', 'statist', 'analysi',
    'experiment', 'excret', 'investig',
    'retrospective cohort', 'cross-sectional case-control',
    'cross sectional case control', 'prevalence survey', 'systematic review ',
    'meta-analysis', 'meta analysis', 'matched case-control',
    'matched case control', 'medical record review',
    'observational case series', 'time series analysis',
    'pseudo-randomized controlled', 'pseudo randomized controlled',
    'randomized controlled', 'retrospective analysis', 'retrospective study',
    'retrospective studies'
]

In [14]:
outcome_list = [
    'risk', 'range', 'duration', 'asymptomatic', 'infecti', 'reproducti',
    'route', 'age', 'transmm'
    'stratifi', 'period,', 'health', 'r0', 'shedd', 'viral'
    'period', 'incub', 'generat', 'factor', 'interval,', 'serial'
]

In [15]:
fatality_list = ['icu', 'fatal', 'death', 'die', 'dead', 'dying', 'mortal']

#### The next cell will contain the default coeffecients for the algorithm's prioritization of different features. These coeffecients are a work in progress and we seek to constantly improve them with more expert feedback.

In [16]:
evaluation_weights= {
    'risk': 2,
    'design': 1,
    'outcome': 2,
    'fatality': 1,
    'section': 1,
    'inverse_length': 5
}

### `PaperAnalyzer` class will take in a DataFrame of papers and then analyze each paper. 
The analysis is done with with its `analyze_risks()`, `analyze_designs()` and `analyze_outcomes()` methods that will analyze the risk factors, designs and outcomes respectively for excerpts in the paper. Finally, the `get_df()` method will return a new DataFrame with rankings for relevancy of excerpts. The rankings also factor in the `section` of the paper that the excerpt is from, with sections like **discussion** or **results** that seem to have pertinent, concise information ranked higher. Furthermore, these rankings are also normalized by the lenght of the excerpts

*Note: These rankings for sections were determined through our interviews with epidimiologists.*

#### Helper functions for `PaperAnalyzer` 

In [17]:
def rank_design(design_keyword):
    """
    This helper function ranks study designs. So far we have
    confirmed rankings for only three study designs, but this
    data will be expanded and improved further with time as we 
    speak to more epidimiologists
    """
    design_rankings = {
        'meta': 10,
        'random': 8,
        'pseudo': 6,
    }
    current_ranking = -1
    for key in design_rankings.keys():
        if key in design_keyword.lower():
            current_ranking = min(current_ranking, design_rankings[key])
    
    if current_ranking == -1:
        current_ranking = 4
    return current_ranking    

def flatten(arr):
    """
    Returns a single flat list from a list of lists
    """
    return [item 
            for sublist in arr 
            for item in sublist]

In [18]:
class PaperAnalyzer():    
    """
    Takes in a dataframe of papers and sets it up for analysis
    """
    # Setting up static constants
    DEFAULT_RISKS = risk_factors
    DEFAULT_DESIGNS = design_list
    DEFAULT_OUTCOMES = outcome_list
    DEFAULT_FATAL = fatality_list
    DEFAULT_WEIGHTS = evaluation_weights
    
    def __init__(self, parent_df, weights = None):
        """
        Explodes the passed dataframe on sections for more granular analysis
        Sets up ranks to be updated later by methods. Client can supply their
        own dictionary of weights for different features.
        """
        # Section ratings
        self.section_ratings = {
                        'discus': 10,
                        'concl': 10,
                        'resul': 10,
                        'analy': 9,
                        'impli': 9,
                        'valu': 9,
                        'intro': 6
                        }
        
        parent_df['full_text'] = parent_df['text_body'].apply(lambda x: '\n'.join([sec['text'] for sec in x]))
        self.df = parent_df.explode('text_body')
        # Extracting section headers
        self.df['section'] = self.df['text_body'].apply(lambda x: 
                                                        x['section'] 
                                                        if type(x) == dict 
                                                        else None)
        # Extracting section texts
        self.df['text_body'] = self.df['text_body'].apply(lambda x:
                                                          x['text'] 
                                                          if type(x) == dict 
                                                          else None)
        # Dropping rows where section text is empty
        self.df = self.df[self.df['text_body'].notna()]
        self.df['total_rank'] = 0
        if weights:
            self.weights = weights
        else:
            self.weights = PaperAnalyzer.DEFAULT_WEIGHTS
        # TQDM is used for progress bars
        tqdm.pandas()

    def analyze_risks(self, risk_factors):
        """
        Analyses papers in self.df for risk factors and returns a report df
        with columns has_{risk_factor}?, {risk_factor}_count, 
        {risk_factor}_in_title and updates {total_rank} for each row.
        The match_indices column is produced for ease of visualization
        in the web app.
        """
        if risk_factors == None:
            risk_factors = PaperAnalyzer.DEFAULT_RISKS
        
        if type(risk_factors[0]) == dict:
            patterns = [risk['pattern'] for risk in risk_factors]
        elif type(risk_factors[0]) == str:
            patterns = [risk for risk in risk_factors]
        
        self.df = self.df[self.df['text_body'].apply(lambda x:
                                                    any(re.compile(pattern, re.IGNORECASE).findall(x)
                                                       for pattern in patterns)
                                                    )]
        self.df['risk_factors'] = [[]] * len(self.df)
        self.df['match_indices'] = [[]] * len(self.df)
        for i in tqdm(range(len(risk_factors))):
            factor = risk_factors[i]
            if type(factor) == dict:
                name = factor['name']
                pattern = re.compile(factor['pattern'], re.IGNORECASE)
            elif type(factor) == str:
                name = factor
                pattern = re.compile(factor, re.IGNORECASE)
    
            self.df['_matches'] = self.df['text_body'].apply(lambda x: 
                                                                      [(m.start(), m.group()) 
                                                                       for m in pattern.finditer(x)])
            
            self.df[name + '_count'] = self.df['_matches'].apply(lambda x: len(x))
            self.df['has_' + name + '?'] = self.df[name + '_count'].apply(lambda x: x > 0)
            self.df[name + '_in_title'] = self.df['title'].apply(lambda x:
                                                                         len(pattern.findall(x)) > 0)
            self.df[name + '_count'] = self.df.apply(lambda x:
                                                             x[name + '_count'] + 10
                                                            if x[name + '_in_title'] 
                                                             else x[name + '_count'],
                                                            axis=1)
            self.df['total_rank'] += self.weights['risk'] * self.df[name + '_count']
            
            self.df['risk_factors'] = self.df.apply(lambda x: 
                                    x['risk_factors'] + [name] if x['has_' + name + '?']
                                    else x['risk_factors'],
                                   axis=1)
            self.df['match_indices'] = self.df.apply(lambda x: 
                                    x['match_indices'] + [n[0] for n in x['_matches']] if x['has_' + name + '?']
                                    else x['match_indices'],
                                   axis=1)
            self.df.drop('_matches', axis=1, inplace=True)
            

    def analyze_designs(self, design_list):
        """
        Analyses papers in self.df for study designs and returns a report df 
        with 'design' and 'design_rank'. 'design_rank' is decided upon from the 
        input in crowdsourced medical dictionary.
        """
        if design_list == None:
            design_list = PaperAnalyzer.DEFAULT_DESIGNS
        self.df['design'] = self.df['text_body'].progress_apply(lambda x:
                                                                      [re.findall(des, x, re.IGNORECASE) 
                                                                       for des in design_list])
        self.df['design_rank'] = self.df['design'].apply(lambda x:
                                                                len(x))
        self.df['design_rank'] += self.df['design'].apply(lambda x: rank_design(' '.join(flatten(x))))
        self.df['total_rank'] += self.weights['design'] * self.df['design_rank']

    def analyze_outcomes(self, outcomes):
        """
        Analyses papers in self.df for outcomes and returns a report df 
        with 'outcomes' and 'outcome_rank'. 'outcome_rank' is decided upon 
        by the frequency of mentions of outcomes in the excerpt
        """
        
        if outcomes == None:
            outcomes = PaperAnalyzer.DEFAULT_OUTCOMES
        self.df['outcomes'] = self.df['text_body'].progress_apply(lambda x:
                                                                        [re.findall(outcome, x, re.IGNORECASE)
                                                                         for outcome in outcomes])
        self.df['outcome_rank'] = self.df['outcomes'].apply(lambda x: len(x))
        self.df['total_rank'] += self.weights['outcome']* self.df['outcome_rank']
        
    def analyze_fatality(self, fatality_list):
        """
        Analyses papers in self.df for information on fatality 
        returns a report df with 'fatality_rank'. 
        'fatality_rank' is decided upon by the frequency of 
        mentions of fatality in the excerpt
        """
        
        if fatality_list == None:
            fatality_list = PaperAnalyzer.DEFAULT_FATAL
        self.df['fatality_count'] = self.df['text_body'].progress_apply(lambda x:
                                                                        len([re.findall(key, x, re.IGNORECASE)
                                                                         for key in fatality_list]))
        self.df['fatal_info?'] = self.df['fatality_count'].apply(lambda x: x > 0)
        self.df['total_rank'] += self.weights['fatality']* self.df['fatality_count']

    def perform_analysis(self, risk_factors, design_list=None, outcomes=None, fatality_list = None):
        """
        This function is a wrapper function that provides interface
        to conduct analysis on all of risk factors, study designs and
        outcomes. Users may specify their own design_list or outcomes. If not,
        the default is used.
        """
        print("Analyzing risks")
        self.analyze_risks(risk_factors)
        print("Analyzing study designs")
        self.analyze_designs(design_list)
        print("Analyzing outcomes")
        self.analyze_outcomes(outcomes)
        print("Analyzing fatality")
        self.analyze_fatality(fatality_list)
        print("Generating final rankings")
        self.generate_risk_rankings()

        
    def generate_risk_rankings(self):
        """
        Appends columns in self.df that contain individual rankings for 
        each risk factor
        """
        self.df['section_rank'] = self.df['section'].apply(lambda x: self.section_ratings[x] 
                                                           if x in self.section_ratings else 5)
        # Obtaining list of risk factors
        risk_factors = [column for column in self.df.columns if 'has_' in column]
        risk_factors = [factor[4:-1] for factor in risk_factors]
        for i in tqdm(range(len(risk_factors))):
            factor = risk_factors[i]
            self.df[factor + '_rank'] = (
                            self.weights['risk']*self.df[factor + '_count'] + 
                            self.weights['section']*self.df['section_rank'] + 
                            self.weights['design']*self.df['design_rank'] + 
                            self.weights['outcome']*self.df['outcome_rank'] + 
                            self.weights['fatality']* self.df['fatality_count']
            )
            
            # Normalizing risk rank for length of excerpts
            self.df[factor + '_rank'] = self.df.apply(lambda x: x[factor + '_rank'] + 
                                                        (self.weights['inverse_length']/
                                                         (len(word_tokenize(x['text_body'])))), 
                                              axis=1)
        
        self.df['max_rank'] = self.df[[factor + '_rank' for factor in risk_factors]].max(axis=1)
        self.df['total_rank'] += self.weights['section'] * self.df['section_rank']
        # Normalizing total rank for length of excerpts
        self.df['total_rank'] = self.df.apply(lambda x: x['total_rank'] + 
                                                        (self.weights['inverse_length']/
                                                         (len(word_tokenize(x['text_body'])))), 
                                              axis=1)
    
    def get_df(self, risk_factor=None):
        """
        Applies section ratings, updates total ratings and returns reporting df
            risk_factor: if specified, the returned df will only have excerpts
                            that mention this risk factor
        """
        if risk_factor:
            if not self.__ANALYZED_RISKS:
                raise ValueError(self.__ERROR_MESSAGE)
            return self.df[self.df['has_' + risk_factor + '?'] == True]
        return self.df

In [19]:
covid_analysis = PaperAnalyzer(covid_df)
covid_analysis.analyze_risks(risk_factors)
covid_analysis.analyze_designs(design_list)
covid_analysis.analyze_outcomes(outcome_list)
covid_analysis.analyze_fatality(fatality_list)
covid_analysis.generate_risk_rankings()
enriched_covid_df = covid_analysis.get_df()

  from pandas import Panel


HBox(children=(FloatProgress(value=0.0, max=11.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=652.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=652.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=652.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=11.0), HTML(value='')))




In [20]:
enriched_covid_df.shape

(652, 66)

In [21]:
enriched_covid_df.sort_values(by='smoking_rank', ascending=False).iloc[0]['text_body']

"We found a significant higher ACE2 gene expression in smoker (including current smoker and former smoker) samples compared to non-smoker samples in the TCGA (p-value=0.05) and GSE40419 RNA-seq datasets (p-value=0.01, Fig. 2A ). Smokers in GSE10072 showed a higher mean of ACE2 gene expression than non-smokers. The difference is not significant (p-value=0.18), which may be due to the small sample size of this study (n=33) with insufficient power to detect the difference. The GSE19804 data which has only non-smoker samples available was not included into the analysis. Adjusted by other factors (age, gender, race and platforms) in multivariate analysis, smoking still shows a significant disparity in ACE2 gene expression (p-value=0.01, Fig. 1B) . These data were from the normal lung tissue of patients with lung adenocarcinoma, which may be different with the lung tissue of healthy people. Therefore, we also analyzed a gene expression dataset of airway epithelium from healthy smokers and he

In [22]:
enriched_covid_df.to_json("../../enriched_covid_df.json", orient='records')
enriched_covid_df.to_csv("../../enriched_covid_df.csv")

#### This marks the end of phase 1. The resulting dataframe will be stored as a json to be served by the web app.

## Question Search (Phase 2)
We will extend the capabilities from the `PaperAnalyzer` class and attempt to answer some questions.

### The `Question` class will decompose and resolve a question about risk factors.
The result will then be piped to an instance of `PaperAnalyzer` to conduct similar analysis. Users will be able to specify their own list of outcomes. If not specified, the default set of outcomes will be used.

In [23]:
class Question():
    """
    The purpose of this class is to resolve a question for 
    keyword searching
    """
    def __init__(self, question, design_list=None, outcomes=None):
        """
        The constuctor does most of the method-calling for question resolution
        """
        self.DESIGN_LIST = design_list
        self.OUTCOMES = outcomes
        self.RISK = question
        self.risk_factors = None
        self.design_list = None
        self.outcome_list = None
        self.__resolve_question()
        if design_list:
            self.__resolve_design()
        if outcomes:
            self.__resolve_outcomes()

    def __question_tokenize(self, sent):
        """
        Cleans the question string
        """
        abbvr_pattern = re.compile('(e.g.|i.e.)')
        sent = abbvr_pattern.sub('', sent)
        remove_punct_dict = {key: " " for key in string.punctuation}
        remove_punct_dict['.'] = ''
        remove_punct = str.maketrans(remove_punct_dict)
        sent = sent.translate(remove_punct)
        return sent.replace('R', 'R0').replace('-', ' ')

    def __resolve_question(self):
        """
        stems and removes irreleavnt words from questions
        to create keywords for keyword analysis
        """
        subquestion = self.RISK
        sub_q = self.__question_tokenize(subquestion)
        keywords = set([
            stemmer.stem(word) for word in word_tokenize(sub_q)
            if word.lower() not in stop_words and 'cov' not in word.lower()
            and word.lower().islower()  #This checks and removes numbers
        ])
        self.risk_factors = list(keywords)

    def __resolve_design(self):
        """
        Resolves study designs to allow for study-design evaluation
        """
        design_keys = self.DESIGN_LIST.split(",")
        self.design_list = list(set(design_keys))

    def __resolve_outcomes(self):
        """
        Resolves outcomes to allow for outcome evaluation
        """
        outcome_keys = self.__question_tokenize(self.OUTCOMES)
        outcome_keys = set([
            stemmer.stem(word) for word in word_tokenize(outcome_keys)
            if word.lower() not in stop_words
            and word.lower().islower()  #This checks and removes numbers
        ])
        self.outcome_list = list(outcome_keys)

    def get_keywords(self):
        """
        Returns keywords from earlier methods
        """
        result = {'risk': None, 'design': None, 'outcome': None}
        result['risk'] = self.risk_factors
        if self.design_list:
            result['design'] = self.design_list
        if self.outcome_list:
            result['outcome'] = self.outcome_list
        return result

In [24]:
def analyze_question(df, question):
    """
    Function to take in a Question instance and a 
    dataframe with covid-excerpts to perform 
    evaluation and rankings on information relevancy
    """
    reference_df = PaperAnalyzer(df)
    keys = question.get_keywords()
    reference_df.perform_analysis(keys['risk'], keys['design'],
                                  keys['outcome'])
    return reference_df.get_df()

#### We will be trying out the question answering pipeline now with a few questions from the aforementionned [medical dictionary](https://docs.google.com/spreadsheets/d/1t2e3CHGxHJBiFgHeW0dfwtvCG4x0CDCzcTFX7yz9Z2E/edit#gid=1217643351).

In [25]:
def get_google_sheet(url, sheet_name):
    response=requests.get(url=url)
    sample_file = io.BytesIO(response.content)
    df = pd.read_excel(sample_file, sheet_name = sheet_name)
    return df

dict_url = 'https://docs.google.com/spreadsheets/d/1t2e3CHGxHJBiFgHeW0dfwtvCG4x0CDCzcTFX7yz9Z2E/export?format=xlsx&id=1t2e3CHGxHJBiFgHeW0dfwtvCG4x0CDCzcTFX7yz9Z2E'
questions_df = get_google_sheet(dict_url, 'sub.question.matching')

In [26]:
questions_df.head(2)

Unnamed: 0.1,Unnamed: 0,Question,Subquestion,Outcome.list,Differences.list,Design.list,Notes
0,"What is known about transmission, incubation, ...",Range of incubation periods for the disease in...,Range of incubation periods for humans: genera...,incubation period,age.,"contact tracing, survival analysis, time-to-ev...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
1,"What is known about transmission, incubation, ...",Range of incubation periods for the disease in...,Range of incubation periods for humans: by age...,"incubation period, stratified by age group",age.,"contact tracing, survival analysis, time-to-ev...",Notes


In [27]:
# Picking a sample question for analysis
ques = questions_df.iloc[127]['Subquestion']
ques = sent_tokenize(ques)[0]
ques

'Is COVID-19 transmitted on droplets?'

In [28]:
# Designs recommended for sample question
des = questions_df.iloc[127]['Design.list']
des

'risk factor analysis, logistic regression, cross-sectional case-control, matched case-control, observational case series, time series analysis, survival analysis'

In [29]:
# Outcomes recommended for sample question
outc = questions_df.iloc[127]['Outcome.list']
outc

'odds of COVID-19 acquisition by occupation, age group, PPE use, observed/self-reported risk behaviors (e.g., inappropriately lowering mask to speak, touching face, eating without washing hands)'

In [30]:
report_df = analyze_question(covid_df, Question(ques, des, outc))

Analyzing risks


HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))


Analyzing study designs


HBox(children=(FloatProgress(value=0.0, max=304.0), HTML(value='')))


Analyzing outcomes


HBox(children=(FloatProgress(value=0.0, max=304.0), HTML(value='')))


Analyzing fatality


HBox(children=(FloatProgress(value=0.0, max=304.0), HTML(value='')))


Generating final rankings


HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




#### We will now look at the 5 most relevant excerpts to the question we picked earlier

In [31]:
for i in range(5, 10):
    print(report_df.sort_values(by='droplet_rank', ascending = False).iloc[i]['text_body'][:1000])
    print("-----------------------------","\n")

Intubating a patient with COVID-19 is a high-risk procedure, due to the proximity of the health care workers to the patients' oropharynx and the exposure to airway secretions, which can carry a high viral load. 47 During the SARS outbreak in 2003, health care workers performing intubations were shown to be at a significantly increased risk of nosocomial transmission. 48 This risk was shown to be greatly reduced where PPE was used appropriately and infection control measures were followed. 10 The availability and suitability of facemasks and respirators has escalated into an emotive, as well as scientific debate. A fluid resistant surgical facemask protects the wearer against sprays of bodily fluids and large droplets, whereas N95, FFP2 and FFP3 respirators are thought to protect the wearer against aerosolised and airborne pathogens as well. In laboratory studies, a FFP2 mask filters at least 94% of all particles that are 0.3 microns in diameter or larger; N95 masks block at least 95%, 