# **APPROACH:** The newest implementation outlined:
- open the meta CSV file in a dataframe - 44K entries
- drop the duplicates by abstract
- loop over a list of keyword lists, one list per search, e.g. search=[['incubation','period','range']]
- stem the keywords so that incubation becomes incub, etc.
- independently append different variations of coronavirus references (covid, cov, -cov, hcov) to the keywords; this avoids pulling in older research
- query the dataframe in pandas for the above terms together with the keywords: the abstract must contain all the keywords and at least one of (covid, cov, -cov, hcov)
- drop the duplicates in the df again after the queries
- calculate a relevance score for all the abstracts in the query result df
- raw_score = total count of the keywords in the abstract
- final_score = (raw_score/len(abstract))*raw_score
- sort the df on the final_score ascending=False (best on top)
- drop rows with a relevance score < .02
- parse the relevant abstracts into sentences; split('. ') works well
- test if keywords are in the sentences
- if sentence has all keywords, it is added to the df for display.
- if you are seeking statistics in the data, adding % to the search terms works well
- df with search results displayed in HTML - currently limiting results to 3 for ease of scanning
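
The scoring steps in the list above can be illustrated with a small worked example. This is a minimal sketch rather than the notebook's own function: `relevance_score` is a hypothetical helper, the abstract length is taken to be its word count, and the keywords are assumed to be lowercase already.

```python
def relevance_score(abstract, keywords):
    """Return (raw_score, final_score) for one abstract.

    raw_score   = total count of the keywords among the abstract's words
    final_score = (raw_score / number_of_words) * raw_score
    """
    words = abstract.lower().split()
    if not words:
        return 0, 0.0
    raw_score = sum(words.count(k) for k in keywords)
    final_score = (raw_score / len(words)) * raw_score
    return raw_score, final_score


# 10 words, 2 keyword hits: raw_score is 2, final_score is (2/10)*2
example = "the incubation period was five days in this small cohort"
print(relevance_score(example, ["incubation", "period"]))
```

Multiplying the keyword density by the raw count again favours abstracts with many hits, while the division by length keeps very long abstracts from winning on volume alone.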

**Pros:** The system is currently very simple **(as Einstein said, "make it as simple as possible, but no simpler")**, yet it is quite an effective solution for providing insight on topical queries. The code is easy to read and simple to understand.

**Cons:** The system currently requires some human judgment in crafting keyword combinations. However, the recent changes to the relevance measure and the stemming of keywords mean the keywords can stay very close to the natural-language questions and still return very good results.


In [18]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# keep only documents that mention covid, -cov-2, cov2 or ncov
def search_focus(df):
    dfa = df[df['abstract'].str.contains('covid')]
    dfb = df[df['abstract'].str.contains('-cov-2')]
    dfc = df[df['abstract'].str.contains('cov2')]
    dfd = df[df['abstract'].str.contains('ncov')]
    frames=[dfa,dfb,dfc,dfd]
    df = pd.concat(frames)
    df=df.drop_duplicates(subset='title', keep="first")
    return df
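
The four separate filters above could also be written as a single regular expression. The sketch below is one possible alternative, not the notebook's code: `search_focus_regex` is a hypothetical name, and it assumes the `abstract` column is already lowercase and free of NaN values (as it is after the `fillna` step in the next cell). Unlike the `concat` version, it keeps the original row order.

```python
import pandas as pd

def search_focus_regex(df):
    """Keep rows whose abstract mentions any novel-coronavirus term."""
    # '|' means "or" in a regular expression; none of the terms
    # contains a character that needs escaping
    pattern = 'covid|-cov-2|cov2|ncov'
    mask = df['abstract'].str.contains(pattern, regex=True)
    return df[mask].drop_duplicates(subset='title', keep='first')

# toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'a'],
    'abstract': ['covid-19 cases', 'influenza in 2009',
                 'sars-cov-2 and 2019-ncov', 'covid-19 cases'],
})
print(search_focus_regex(toy)['title'].tolist())  # ['a', 'c']
```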


In [19]:
# load the metadata from the CSV file using 7 columns
# (title, journal, abstract, authors, doi, publish_time, sha)
df=pd.read_csv('metadata.csv', usecols=['title','journal','abstract','authors','doi','publish_time','sha'])
print (df.shape)
# replace NaNs with a placeholder string
df=df.fillna('no data provided')
# drop duplicates by title
df = df.drop_duplicates(subset='title', keep="first")
# keep only documents published in 2020
df=df[df['publish_time'].str.contains('2020')]
# lowercase the abstract and append the lowercased title so both are searched
df["abstract"] = df["abstract"].str.lower()+df["title"].str.lower()
# keep only documents that reference the novel coronavirus
df=search_focus(df)
print (df.shape)
# show the first 5 rows of the new dataframe
df.head()


(204823, 7)
(66159, 7)


Unnamed: 0,sha,title,doi,abstract,publish_time,authors,journal
4662,no data provided,Latest assessment on COVID-19 from the Europea...,10.2807/1560-7917.es.2020.25.8.2002271,no data providedlatest assessment on covid-19 ...,2020-02-27,no data provided,Euro Surveill
4698,no data provided,Updated rapid risk assessment from ECDC on the...,10.2807/1560-7917.es.2020.25.9.2003051,no data providedupdated rapid risk assessment ...,2020-03-05,no data provided,Euro Surveill
4732,no data provided,Updated rapid risk assessment from ECDC on the...,10.2807/1560-7917.es.2020.25.10.2003121,no data providedupdated rapid risk assessment ...,2020-03-12,no data provided,Euro Surveill
4800,601e6ac1ad98e359dc021e8896a1a604331ca774,Empfehlungen zur intensivmedizinischen Therapi...,10.1007/s00063-020-00674-3,no data providedempfehlungen zur intensivmediz...,2020-03-12,"Kluge, Stefan; Janssens, Uwe; Welte, Tobias; W...",Med Klin Intensivmed Notfmed
5683,no data provided,The impact of COVID-19 on the provision of don...,10.1038/s41409-020-0873-x,no data providedthe impact of covid-19 on the ...,2020-03-23,"Szer, Jeff; Weisdorf, Daniel; Querol, Sergio; ...",Bone Marrow Transplant


In [20]:
import functools
from IPython.display import display, HTML
from nltk import PorterStemmer

# function to stem keyword list into a common base word
def stem_words(search_words):
    stemmer = PorterStemmer()
    singles=[]
    for w in search_words:
        singles.append(stemmer.stem(w))
    return singles


In [21]:
def search_dataframe(df,search_words):
    search_words=stem_words(search_words)
    df1=df[functools.reduce(lambda a, b: a&b, (df['abstract'].str.contains(s) for s in search_words))]
    return df1
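
The `SettingWithCopyWarning` messages that appear in the output further down are raised when `search_relevance` adds a `score` column to the filtered frame returned here. One common way to avoid them is to return an explicit copy. The following is a minimal sketch under stated assumptions: `search_dataframe_copy` is a hypothetical name, the keywords are assumed to be stemmed already, and `regex=False` is an added choice so that keywords are matched literally.

```python
import functools
import pandas as pd

def search_dataframe_copy(df, stemmed_words):
    """Return rows whose abstract contains every stemmed keyword."""
    # one boolean Series per keyword, combined with a logical AND
    masks = (df['abstract'].str.contains(s, regex=False) for s in stemmed_words)
    all_present = functools.reduce(lambda a, b: a & b, masks)
    # .copy() makes the result independent of df, so adding a column
    # to it later does not trigger SettingWithCopyWarning
    return df[all_present].copy()

# toy data, invented for illustration only
toy = pd.DataFrame({'abstract': ['incubation period of covid-19',
                                 'viral shedding duration',
                                 'the incubation time']})
hits = search_dataframe_copy(toy, ['incub', 'period'])
hits['score'] = 0.0          # safe: hits is its own DataFrame
print(hits.index.tolist())   # [0]
```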


In [22]:
# function analyze search results for relevance with word count / abstract length
def search_relevance(rel_df,search_words):
    rel_df['score']=""
    search_words=stem_words(search_words)
    for index, row in rel_df.iterrows():
        abstract = row['abstract']
        result = abstract.split()
        len_abstract=len(result)
        score=0
        for word in search_words:
            score=score+result.count(word)
        final_score=(score/len_abstract)
        rel_score=score*final_score
        rel_df.loc[index, 'score'] = rel_score
    rel_df=rel_df.sort_values(by=['score'], ascending=False)
    #rel_df= rel_df[rel_df['score'] > .01]
    return rel_df
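
One property of the scoring loop above is worth noting: `result.count(word)` only counts tokens that are exactly equal to the stemmed keyword, so inflected forms such as `periods` are not counted for the stem `period`. The sketch below contrasts that with a prefix-based count. `count_exact` and `count_prefix` are hypothetical helpers, the stems are example values, and prefix matching is an assumption about the desired behaviour, not what the notebook does.

```python
def count_exact(tokens, stems):
    """Count tokens equal to a stem (the behaviour of list.count)."""
    return sum(tokens.count(s) for s in stems)

def count_prefix(tokens, stems):
    """Count tokens that start with a stem, so inflections are included."""
    return sum(1 for t in tokens for s in stems if t.startswith(s))

tokens = "incubation periods and the incubation period".split()
stems = ["incub", "period"]
print(count_exact(tokens, stems))   # 1  (only 'period')
print(count_prefix(tokens, stems))  # 4  (incubation x2, periods, period)
```

Switching the counting rule would change the ranking, so the results shown later in this notebook correspond to the exact-match behaviour.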


In [23]:
# function to get best sentences from the search results
def get_sentences(df1,search_words):
    df_table = pd.DataFrame(columns = ["pub_date","authors","title","excerpt","rel_score"])
    search_words=stem_words(search_words)
    for index, row in df1.iterrows():
        pub_sentence=''
        sentences_used=0
        #break apart the abstract to sentence level
        sentences = row['abstract'].split('. ')
        #loop through the sentences of the abstract
        for sentence in sentences:
            # missing lets the system know whether any keyword is absent from the sentence
            missing=0
            #loop through the search keywords
            for word in search_words:
                #if a keyword is missing, flag the sentence
                if word not in sentence:
                    missing=1
                #if '%' in sentence:
                    #missing=missing-1
            # after all keywords are checked, keep the sentence if none are missing
            if missing==0 and len(sentence)<1000 and sentence!='':
                sentence=sentence.capitalize()
                if sentence[-1]!='.':
                    sentence=sentence+'.'
                pub_sentence=pub_sentence+'<br><br>'+sentence
        if pub_sentence!='':
            sentence=pub_sentence
            sentences_used=sentences_used+1
            authors=row["authors"].split(" ")
            link=row['doi']
            title=row["title"]
            score=row["score"]
            linka='https://doi.org/'+link
            linkb=title
            sentence='<p fontsize="tiny" align="left">'+sentence+'</p>'
            final_link='<p align="left"><a href="{}">{}</a></p>'.format(linka,linkb)
            to_append = [row['publish_time'],authors[0]+' et al.',final_link,sentence,score]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
    return df_table
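
The sentence-selection step can be looked at separately from the HTML formatting. The following sketch assumes lowercase text and already-stemmed keywords; `matching_sentences` is a hypothetical helper that mirrors the filter used above (split on `'. '`, keep sentences that contain every stem as a substring).

```python
def matching_sentences(abstract, stems, max_len=1000):
    """Return the sentences that contain every stem as a substring."""
    selected = []
    for sentence in abstract.split('. '):
        # skip empty fragments and overly long "sentences"
        if not sentence or len(sentence) >= max_len:
            continue
        if all(stem in sentence for stem in stems):
            sentence = sentence.capitalize()
            if not sentence.endswith('.'):
                sentence += '.'
            selected.append(sentence)
    return selected

# toy abstract, invented for illustration only
abstract = ("background: covid-19 spread quickly. "
            "the incubation period ranged from 2 to 14 days. "
            "further study is needed.")
print(matching_sentences(abstract, ['incub', 'period', 'rang']))
# ['The incubation period ranged from 2 to 14 days.']
```

Because the match is on substrings, a stem such as `rang` also matches unrelated words like `arrange`; that trade-off is shared with the function above.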


In [27]:
display(HTML('<h1>Task 1: What is known about transmission, incubation, and environmental stability?</h1>'))

display(HTML('<h3>Results currently limited to ten (10) for ease of scanning</h3>'))

# list of lists of search terms
questions=[
['Q: What is the range of incubation periods for the disease in humans?'],
['Q: How long are individuals contagious?'],

['Q: Does the range of incubation period vary across age groups?'],
['Q: Does the range of incubation period vary with children?']]

# one keyword list per question, in the same order as the questions above
search=[['incubation','period','range'],
['viral','shedding','duration'],
['incubation','period','age','statistically','significant'],
['incubation','period','child']]

q=0
for search_words in search:
    str1=''
    # build a readable version of the question to print above the table
    str1=' '.join(questions[q])
    
    #search the dataframe for all words
    df1=search_dataframe(df,search_words)

    # analyze search results for relevance 
    df1=search_relevance(df1,search_words)

    # get best sentences
    df_table=get_sentences(df1,search_words)
    
    length=df_table.shape[0]
    #limit the number of results shown
    df_table=df_table.head(15)
    df_table=df_table.drop(['rel_score'], axis=1)
    #convert df to html
    df_table=HTML(df_table.to_html(escape=False,index=False))
    
    # display search topic
    display(HTML('<h3>'+str1+'</h3>'))
    
    #display table
    if length<1:
        print ("No reliable answer could be located in the literature")
    else:
        display(df_table)
    q=q+1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


pub_date,authors,title,excerpt
2020-03-18,"Jiang, et al.",Is a 14-day quarantine period optimal for effectively controlling coronavirus disease 2019 (COVID-19)?,Results the full range of incubation periods of the covid-19 cases ranged from 0 to 33 days among 2015 cases.
2020-05-15,"Bui, et al.",Estimation of the incubation period of SARS-CoV-2 in Vietnam,"Average incubation periods estimated using different distribution model ranged from 6.0 days to 6.4 days with the weibull distribution demonstrated the best fit to the data. The estimated mean of incubation period using weibull distribution model was 6.4 days (95% credible interval (ci): 4.89 - 8.5), standard deviation (sd) was 3.05 (95%ci 3.05 - 5.30), median was 5.6, ranges from 1.35 to 13.04 days (2.5th to 97.5th percentiles)."
2020-03-08,"Xia, et al.",Transmission of corona virus disease 2019 during the incubation period may lead to a quarantine loophole,"Results: the estimated mean incubation period for covid-19 was 4.9 days (95% confidence interval [ci], 4.4 to 5.4) days, ranging from 0.8 to 11.1 days (2.5th to 97.5th percentile)."
2020,"Yang, et al.","Estimation of incubation period and serial interval of COVID-19: analysis of 178 cases and 131 transmission chains in Hubei province, China","Our estimated median incubation period of covid-19 is 5.4 days (bootstrapped 95% confidence interval (ci) 4.8-6.0), and the 2.5th and 97.5th percentiles are 1 and 15 days, respectively; while the estimated serial interval of covid-19 falls within the range of -4 to 13 days with 95% confidence and has a median of 4.6 days (95% ci 3.7-5.5)."
2020,"Yang, et al.",[The preliminary analysis on the characteristics of the cluster for the COVID-19],"We selected 325 cases to estimate the incubation period and its range was 1 to 20 days, median was 7 days, and mode was 4 days."
2020-03-08,"Yang, et al.",[The preliminary analysis on the characteristics of the cluster for the Corona Virus Disease].,"We selected 325 cases to estimate the incubation period and found its range is 1 to 20 days, median was 7 days, and mode was 4 days."
2020-01-28,"Linton, et al.",Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data,Our results show that the incubation period falls within the range of 2-14 days with 95% confidence and has a mean of around 5 days when approximated using the best-fit lognormal distribution.
2020-01-28,"Backer, et al.","The incubation period of 2019-nCoV infections among travellers from Wuhan, China","Using the travel history and symptom onset of 88 confirmed cases that were detected outside wuhan, we estimate the mean incubation period to be 6.4 (5.6 - 7.7, 95% ci) days, ranging from 2.1 to 11.1 days (2.5th to 97.5th percentile)."
2020-02-06,"Backer, et al.","Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20–28 January 2020","Using the travel history and symptom onset of 88 confirmed cases that were detected outside wuhan in the early outbreak phase, we estimate the mean incubation period to be 6.4 days (95% credible interval: 5.6–7.7), ranging from 2.1 to 11.1 days (2.5th to 97.5th percentile)."
2020,"Backer, et al.","Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20-28 January 2020","Using the travel history and symptom onset of 88 confirmed cases that were detected outside wuhan in the early outbreak phase, we estimate the mean incubation period to be 6.4 days (95% credible interval: 5.6-7.7), ranging from 2.1 to 11.1 days (2.5th to 97.5th percentile)."


pub_date,authors,title,excerpt
2020-05-23,"Weiss, et al.",Spatial and temporal dynamics of SARS-CoV-2 in COVID-19 patients: A systematic review,"In this study, we aimed to provide a coherent overview from published studies of the duration of viral detection and viral load in covid-19 patients, stratified by specimen type, clinical severity and age."
2020,"Dodds, et al.",Model-Informed Drug Repurposing: Viral Kinetic Modeling to Prioritize Rational Drug Combinations for COVID-19,"The endpoints and metrics included viral load area under the curve (auc), duration of viral shedding, and epithelial cells infected. In addition, we observed that the time-window opportunity for a therapeutic intervention to effect duration of viral shedding exceeds the effect on sparing epithelial cells from infection or impact on viral load auc."
2020-07-21,"Dodds, et al.",Model-Informed Drug Repurposing: Viral Kinetic Modeling to Prioritize Rational Drug Combinations for COVID-19.,"The endpoints and metrics included viral load area under the curve (auc), duration of viral shedding, and epithelial cells infected. In addition, we observed that the time-window opportunity for a therapeutic intervention to effect duration of viral shedding exceeds the effect on sparing epithelial cells from infection or impact on viral load auc."
2020-07-28,"Cevik, et al.","SARS-CoV-2 viral load dynamics, duration of viral shedding and infectiousness: a living systematic review and meta-analysis","Background viral load kinetics and the duration of viral shedding are important determinants for disease transmission. We aim i) to characterise viral load dynamics, duration of viral rna, and viable virus shedding of sars-cov-2 in various body fluids and ii) to compare sars-cov-2 viral dynamics with sars-cov-1 and mers-cov. Methods: medline, embase, europe pmc, preprint servers and grey literature were searched to retrieve all articles reporting viral dynamics and duration of sars-cov-2, sars-cov-1 and mers-cov shedding. Funding: no funding was received.sars-cov-2 viral load dynamics, duration of viral shedding and infectiousness: a living systematic review and meta-analysis."
2020,"Warabi, et al.",Effects of oral care on prolonged viral shedding in coronavirus disease 2019 (COVID-19),"Methods and results: we evaluated the clinical course of eight covid-19 patients, including their duration of viral shedding, by pcr testing of nasopharyngeal swabs."
2020-07-24,"Warabi, et al.",Effects of oral care on prolonged viral shedding in coronavirus disease 2019 (COVID-19).,"Methods and results we evaluated the clinical course of eight covid-19 patients, including their duration of viral shedding, by pcr testing of nasopharyngeal swabs."
2020,"Chen, et al.",Associations of Clinical Characteristics and Treatment Regimens with Viral RNA Shedding Duration in Patients with COVID-19,"Results: the median viral rna shedding duration was 12 days (interquartile range, 8-16 d) after the onset of illness. Lopinavir/ritonavir use may be associated with prolonged viral rna shedding in non-severe patients; further randomized controlled trials are needed to confirm this finding.associations of clinical characteristics and treatment regimens with viral rna shedding duration in patients with covid-19."
2020-03-26,"Tan, et al.",Viral Kinetics and Antibody Responses in Patients with COVID-19,"The routes and duration of viral shedding, antibody response, and their associations with disease severity and clinical manifestations were systematically evaluated."
2020,"Li, et al.",Duration of SARS-CoV-2 RNA shedding and factors associated with prolonged viral shedding in patients with COVID-19,"A retrospective cohort of covid-19 patients admitted to a designated hospital in beijing was analyzed to study the factors affecting the duration of viral shedding. The median duration of viral shedding was 11 days (iqr, 8-14.3 days) as measured from illness onset. Univariate regression analysis showed that disease severity, corticosteroid therapy, fever (temperature>38.5°c), and time from onset to hospitalization were associated with prolonged duration of viral shedding (p < .05). Multivariate regression analysis showed that fever (temperature>38.5°c) (or, 5.1, 95%ci: 1.5-18.1), corticosteroid therapy (or, 6.3, 95%ci: 1.5-27.8), and time from onset to hospitalization (or, 1.8, 95%ci: 1.19-2.7) were associated with increased odds of prolonged duration of viral shedding. Corticosteroid treatment, fever (temperature>38.5°c), and longer time from onset to hospitalization were associated with prolonged viral shedding in covid-19 patients.duration of sars-cov-2 rna shedding and factors associated with prolonged viral shedding in patients with covid-19."
2020-07-09,"Li, et al.",Duration of SARS‐CoV‐2 RNA shedding and factors associated with prolonged viral shedding in patients with COVID‐19,"Methods: a retrospective cohort of covid‐19 patients admitted to a designated hospital in beijing was analyzed to study the factors affecting the duration of viral shedding. Results: the median duration of viral shedding was 11 days (iqr, 8‐14.3 days) as measured from illness onset. Univariate regression analysis showed that disease severity, corticosteroid therapy, fever (temperature>38.5℃), and time from onset to hospitalization were associated with prolonged duration of viral shedding (p<0.05). Multivariate regression analysis showed that fever (temperature>38.5℃) (or 5.1, 95%ci: 1.5‐18.1), corticosteroid therapy (or 6.3, 95%ci: 1.5‐27.8), and time from onset to hospitalization (or 1.8, 95%ci: 1.19‐2.7) were associated with increased odds of prolonged duration of viral shedding. All rights reserved.duration of sars‐cov‐2 rna shedding and factors associated with prolonged viral shedding in patients with covid‐19."


pub_date,authors,title,excerpt
2020-02-29,"Han, et al.",Estimate the incubation period of coronavirus 2019 (COVID-19),We found that the incubation periods of the groups with age>=40 years and age<40 years demonstrated a statistically significant difference.


pub_date,authors,title,excerpt
2020-03-18,"Jiang, et al.",Is a 14-day quarantine period optimal for effectively controlling coronavirus disease 2019 (COVID-19)?,The median incubation period of both male and female adults was similar (7-day) but significantly shorter than that (9-day) of child cases (p=0.02).
2020,"Merino-Navarro, et al.",Prevención y tratamiento del Covid-19 en la población pediátrica desde una perspectiva familiar y comunitaria./ [Prevention and treatment of Covid-19 in the pediatric population from the family and community perspective],Our objective is to analyze the scientific evidence on the specific recommendations for pediatric care in cases of covid-19 from the family and community settings.the main recommendations and preventive measures in primary health care settings and at home have been selected and analyzed from an integrative approach that includes the biopsychosocial aspects of the child during confinement.the importance of caring for children in the face of the disease lies above all in ensuring the correct measures for the prevention of contagion due to the condition of acting as possible carriers during an incubation period of up to 21 days.
2020-05-16,"Merino-Navarro, et al.",Prevención y tratamiento del Covid-19 en la población pediátrica desde una perspectiva familiar y comunitaria,The importance of caring for children in the face of the disease lies above all in ensuring the correct measures for the prevention of contagion due to the condition of acting as possible carriers during an incubation period of up to 21 days.
2020,"Liu, et al.",Risk factors associated with COVID-19 infection: a retrospective cohort study based on contacts tracing,"Children, old people, females, and family members are susceptible of covid-19 infection, while index cases in the incubation period had lower contagiousness."
2020-07-01,"Liu, et al.",Risk factors associated with COVID-19 infection: a retrospective cohort study based on contacts tracing.,"Conclusion children, old people, females and family members are susceptible to be infected with covid-19, while index cases in incubation period had lower contagiousness."
2020-06-01,"Paglia, et al.",COVID-19 and Paediatric Dentistry after the lockdown.,"The average incubation period is about 5 days, with an estimated range from 2 to 14 days; the incubation period in children is similar, however some have exhibited a longer incubation."
2020,"Paglia, et al.",COVID-19 and Paediatric Dentistry after the lockdown,"The average incubation period is about 5 days, with an estimated range from 2 to 14 days; the incubation period in children is similar, however some have exhibited a longer incubation."
2020,"Shen, et al.","Novel coronavirus infection in children outside of Wuhan, China","Six children had a family exposure and could provide the exact dates of close contact with someone who was confirmed to have 2019-ncov infection, among whom the median incubation period was 7.5 days."
2020,"Han, et al.",A comparative-descriptive analysis of clinical characteristics in 2019-coronavirus-infected children and adults,"The median incubation period of children and adults was 5 days (ranged, 3-12 days) and 4 days (ranged, 2-12 days), respectively."
2020-04-06,"Han, et al.",A comparative-descriptive analysis of clinical characteristics in 2019-Coronavirus-infected children and adults.,"The median incubation period of children and adults was 5 days (range 3-12 days) and 4 days (range 2-12 days), respectively."
