# **APPROACH:** The newest implementation outlined:
- open the meta CSV file in a dataframe - 44K entries
- drop the duplicates by abstract
- loop (comprehension) a list of lists for the keyword for searches. eg. search=[['incubation','period','range']]
- stem the keywords so incubation becomes incubat etc.
- append indpendently - different variations of corona references - (covid, cov, -cov, hcov) to the keywords avoids pulling older research
- query function in pandas to query the above terms with the keywords - the abstract has to have all the keywords and at least one of (covid, cov, -cov, hcov) 
- drop the duplicates in the df again after the queries
- caclulate a relevance score for all the abstracts in the query result df
- raw_score = total count of the keywords in the abstract
- final_score = (raw_score/len(abstract))*raw_score
- sort the df on the final_score ascending=False (best on top)
- drop rows with a relevance score < .02
- parse the relevant abstarcts into sentences split('. ') works well.
- test if keywords are in the sentences
- if sentence has all keywords, it is added to the df for display.
- if you are seeking statistics in the data, adding % to the search terms works well
- df with search results displayed in HTML - currently limiting results to 3 for ease of scanning

**Pros:** Currently the system is a very simple **(as Einstein said "make it as simple as possible, but no simpler")**, but quite effective solutiuon to providing insight on topical queries.  The code is easy to read and simple to understand.

**Cons:** Currently the system requires some human understanding in crafting keyword combinations, however, the recent relevance measure changes and stemming of keywords, have made it so the keywords can be very close to the NL questions and return very good results.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# keep only documents with covid -cov-2 and cov2
def search_focus(df):
    dfa = df[df['abstract'].str.contains('covid')]
    dfb = df[df['abstract'].str.contains('-cov-2')]
    dfc = df[df['abstract'].str.contains('cov2')]
    dfd = df[df['abstract'].str.contains('ncov')]
    frames=[dfa,dfb,dfc,dfd]
    df = pd.concat(frames)
    df=df.drop_duplicates(subset='title', keep="first")
    return df


In [2]:
# load the meta data from the CSV file using 3 columns (abstract, title, authors),
df=pd.read_csv('metadata.csv', usecols=['title','journal','abstract','authors','doi','publish_time','sha'])
print (df.shape)
#drop duplicates
#df=df.drop_duplicates()
#drop NANs 
df=df.fillna('no data provided')
df = df.drop_duplicates(subset='title', keep="first")
df=df[df['publish_time'].str.contains('2020')]
# convert abstracts to lowercase
df["abstract"] = df["abstract"].str.lower()+df["title"].str.lower()
#show 5 lines of the new dataframe
df=search_focus(df)
print (df.shape)



  interactivity=interactivity, compiler=compiler, result=result)


(204823, 7)
(66159, 7)


In [6]:
import functools
from IPython.core.display import display, HTML
from nltk import PorterStemmer

# function to stem keyword list into a common base word
def stem_words(search_words):
    stemmer = PorterStemmer()
    singles=[]
    for w in search_words:
        singles.append(stemmer.stem(w))
    return singles


In [10]:
def search_dataframe(df,search_words):
    search_words=stem_words(search_words)
    df1=df[functools.reduce(lambda a, b: a&b, (df['abstract'].str.contains(s) for s in search_words))]
    return df1


In [11]:
# function analyze search results for relevance with word count / abstract length
def search_relevance(rel_df,search_words):
    rel_df['score']=""
    search_words=stem_words(search_words)
    for index, row in rel_df.iterrows():
        abstract = row['abstract']
        result = abstract.split()
        len_abstract=len(result)
        score=0
        for word in search_words:
            score=score+result.count(word)
        final_score=(score/len_abstract)
        rel_score=score*final_score
        rel_df.loc[index, 'score'] = rel_score
    rel_df=rel_df.sort_values(by=['score'], ascending=False)
    #rel_df= rel_df[rel_df['score'] > .01]
    return rel_df


In [12]:
# function to get best sentences from the search results
def get_sentences(df1,search_words):
    df_table = pd.DataFrame(columns = ["pub_date","authors","title","excerpt","rel_score"])
    search_words=stem_words(search_words)
    for index, row in df1.iterrows():
        pub_sentence=''
        sentences_used=0
        #break apart the absracrt to sentence level
        sentences = row['abstract'].split('. ')
        #loop through the sentences of the abstract
        highligts=[]
        for sentence in sentences:
            # missing lets the system know if all the words are in the sentence
            missing=0
            #loop through the words of sentence
            for word in search_words:
                #if keyword missing change missing variable
                if word not in sentence:
                    missing=1
                #if '%' in sentence:
                    #missing=missing-1
            # after all sentences processed show the sentences not missing keywords
            if missing==0 and len(sentence)<1000 and sentence!='':
                sentence=sentence.capitalize()
                if sentence[len(sentence)-1]!='.':
                    sentence=sentence+'.'
                pub_sentence=pub_sentence+'<br><br>'+sentence
        if pub_sentence!='':
            sentence=pub_sentence
            sentences_used=sentences_used+1
            authors=row["authors"].split(" ")
            link=row['doi']
            title=row["title"]
            score=row["score"]
            linka='https://doi.org/'+link
            linkb=title
            sentence='<p fontsize=tiny" align="left">'+sentence+'</p>'
            final_link='<p align="left"><a href="{}">{}</a></p>'.format(linka,linkb)
            to_append = [row['publish_time'],authors[0]+' et al.',final_link,sentence,score]
            df_length = len(df_table)
            df_table.loc[df_length] = to_append
    return df_table


In [13]:
display(HTML('<h1>Task 1: What do we know about vaccines and therapeutics? What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?</h1>'))

display(HTML('<h3>Results currently limited to two (10) for ease of scanning</h3>'))

# list of lists of search terms
questions=[
['Q: What are the Effectiveness of drugs being developed and tried to treat COVID-19 patients?'],
['Q: What Efforts targeted at a universal coronavirus vaccine?'],
['Q: What Approaches to evaluate risk for enhanced disease after vaccination?']
]
#search=[['incubation','period','range','mean','%']]
search=[['drugs','treat','patients'],
['Efforts','coronavirus', 'vaccine'],
['Approaches','risk','vaccination']
]
q=0
for search_words in search:
    str1=''
    # a make a string of the search words to print readable version above table
    str1=' '.join(questions[q])
    
    #search the dataframe for all words
    df1=search_dataframe(df,search_words)

    # analyze search results for relevance 
    df1=search_relevance(df1,search_words)

    # get best sentences
    df_table=get_sentences(df1,search_words)
    
    length=df_table.shape[0]
    #limit 3 results
    df_table=df_table.head(15)
    df_table=df_table.drop(['rel_score'], axis=1)
    #convert df to html
    df_table=HTML(df_table.to_html(escape=False,index=False))
    
    # display search topic
    display(HTML('<h3>'+str1+'</h3>'))
    
    #display table
    if length<1:
        print ("No reliable answer could be located in the literature")
    else:
        display(df_table)
    q=q+1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


pub_date,authors,title,excerpt
2020-04-01,"Ottesen, et al.",Efficacy of a high-dose antiresorptive drug holiday to reduce the risk of medication-related osteonecrosis of the jaw (MRONJ): A systematic review.,"In 2 studies, patients were being treated with denosumab, but neither showed that a drug holiday was effective."
2020,"Pawar, et al.",Combating devastating COVID-19 by drug repurposing,"Some countries are against the use of these drugs because of adverse effects associated with drug repurposing and lack of statistically significant clinical data, but they have been found to be effective in some countries to treat covid-19 patients (off-label/investigational)."
2020-06-22,"Bishara, et al.",Emerging and experimental treatments for COVID-19 and drug interactions with psychotropic agents,"An even higher threshold of vigilance should be maintained for patients with pre-existing conditions and older adults due to added toxicity and drug interactions, especially with psychotropic agents.emerging and experimental treatments for covid-19 and drug interactions with psychotropic agents."
2020-07-09,"Mummed, et al.",Molecular targets for COVID-19 drug development: Enlightening Nigerians about the pandemic and future treatment,"However, as patient management and drug repositioning are taking place, it is imperative to identify other promising targets used by sars-cov-2 to establish infection, to develop novel therapeutics.molecular targets for covid-19 drug development: enlightening nigerians about the pandemic and future treatment."
2020-04-17,"Pawar, et al.",Combating Devastating COVID -19 by Drug Repurposing,• further investigations of these drugs are recommended to treat covid-19 patients on top priority.combating devastating covid -19 by drug repurposing.
2020-06-12,Au et al.,Anaesthetic Considerations for Rationalizing Drug Use in the Operating Theatre: Strategies in a Singapore Hospital During COVID-19,"Covid-19 patients in the critical care unit tend to have prolonged hospital stay requiring high doses of sedation and paralysis to treat acute respiratory distress syndrome, resulting in a shortage of these drugs."
2020-05-12,"Jafari, et al.",Considerations for interactions of drugs used for the treatment of COVID-19 with anti-Cancer treatments,"Because of the long-term use of chemotherapy drugs, drug interactions are important in these patients especially with sars-cov2 treatments now."
2020,"Jafari, et al.",Considerations for interactions of drugs used for the treatment of COVID-19 with anti-cancer treatments,"Because of the long-term use of chemotherapy drugs, drug interactions are important in these patients especially with sars-cov2 treatments now."
2020-07-17,"Zhu, et al.",Identification of SARS-CoV-2 3CL Protease Inhibitors by a Quantitative High-throughput Screening,"Conclusion and implications some of the newly identified inhibitors of sars-cov-2 3clpro may be used in combination therapy with other drugs for synergistic effect to treat covid-19 patients. Clinical significance some of the newly identified 3clpro inhibitors can be evaluated in drug combination therapy for synergistic effect to treat covid-19 patients, while the others can serve as starting points for medicinal chemistry optimization to improve potency and drug like properties for drug development.identification of sars-cov-2 3cl protease inhibitors by a quantitative high-throughput screening."
2020,"Mohanty, et al.",Application of Artificial Intelligence in COVID-19 drug repurposing,"This technology has the potential to improve the drug discovery, planning, treatment, and reported outcomes of the covid-19 patient, being an evidence-based medical tool. With prior usage experiences in patients, few of the old drugs, if shown active against sars-cov-2, can be readily applied to treat the covid-19 patients."


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


pub_date,authors,title,excerpt
2020,"Kaiser, et al.","To streamline coronavirus vaccine and drug efforts, NIH and firms join forces","More than 100 treatments and vaccines are in development to stem the covid-19 pandemic, and some onlookers have worried that this sprawling and potentially duplicative effort is wasting time and resources hoping to bring order to the chaos, the national institutes of health (nih) and major drug companies today announced a plan to stage carefully designed clinical trials of the drugs and vaccines they have decided are the highest priorities for testing and development the public-private partnership involves nih, other u s government agencies, 16 pharma companies and biotechs, and the nonprofit foundation for the national institutes of health (fnih) it aims to develop “an international strategy” for covid-19 research, a press release states however, nih director francis collins told reporters during a press call today that “it is primarily a u s -focused effort ”to streamline coronavirus vaccine and drug efforts, nih and firms join forces."
2020-03-24,"Victor, et al.",MATHEMATICAL PREDICTIONS FOR COVID-19 AS A GLOBAL PANDEMIC,"The effort to evaluate the disease equilibrium shows that unless there is a dedicated effort from government, decision makers and stakeholders, the world would hardly be reed of the covid-19 coronavirus and further spread is eminent and the rate of infection will continue to increase despite the increased rate of recovery because of the absence of vaccine at the moment.mathematical predictions for covid-19 as a global pandemic."
2020-06-16,"Luo, et al.","Combating the Coronavirus Pandemic: Early Detection, Medical Treatment, and a Concerted Effort by the Global Community","We also summarize potential treatments and vaccines against covid-19 and discuss ongoing clinical trials of interventions to reduce viral progression.combating the coronavirus pandemic: early detection, medical treatment, and a concerted effort by the global community."
2020,"Anonymous, et al.",Podcast: Beating a killer coronavirus,"As covid-19 continues to spread, so does the effort to find treatment and vaccinations against sars-cov-2, the coronavirus that causes the disease around the world, scientists are working nonstop on therapies they hope will stem the loss of life during this pandemic while trying to set us up to prevent future outbreaks what’s not clear is which of these treatments will work much about sars-cov-2 remains unknown in this episode of stereo chemistry, we dig into the efforts to beat the novel coronavirus and why in some cases it’s like throwing spaghetti against the wall to see what sticks learn more at http://cenm ag/coronapodcastpodcast: beating a killer coronavirus."
2020-04-29,"Wang, et al.","Decoding SARS-CoV-2 transmission, evolution and ramification on COVID-19 diagnosis, vaccine, and medicine","Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (covid-19) caused by severe acute respiratory syndrome coronavirus 2 (sars-cov-2)."
2020,"Wang, et al.","Decoding SARS-CoV-2 Transmission and Evolution and Ramifications for COVID-19 Diagnosis, Vaccine, and Medicine","Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (covid-19) caused by severe acute respiratory syndrome coronavirus 2 (sars-cov-2)."
2020-04-28,"Pawelec, et al.",Recent advances in influenza vaccines,"Since the beginning of 2020, an unprecedented international academic and industrial effort to develop effective vaccines against the new coronavirus sars-cov-2 has diverted attention away from influenza, but many of the lessons learned for the one will synergize with the other to mutual advantage."
2020,"Hanney, et al.",From COVID-19 research to vaccine application: why might it take 17 months not 17 years and what are the wider lessons?,"We noted four main approaches to reducing time-lags, namely increasing resources, working in parallel, starting or working at risk, and improving processes.examining these approaches alongside the matrix helps interpret the enormous global effort to develop a vaccine for the 2019 novel coronavirus sars-cov-2, the causative agent of covid-19."
2020-06-23,"Singh, et al.",Emerging Prevention and Treatment Strategies to Control COVID-19,The unprecedented rise of this pandemic has rapidly fueled research efforts to discover and develop new vaccines and treatment strategies against this novel coronavirus.
2020-06-04,Di et al.,Immune checkpoint inhibitors: A physiology-driven approach to the treatment of COVID-19,"While confirmed cases of infections of the deadly coronavirus disease 2019 (covid-19) have exceeded 4.7 million globally, scientists are pushing forward with efforts to develop vaccines and treatments in an attempt to slow the pandemic and lessen the disease’s damage."


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


pub_date,authors,title,excerpt
2020,"Gambichler, et al.",On the use of immune checkpoint inhibitors in patients with viral infections including COVID-19,"In detail, we provide available information on (1) safety regarding the risk of new infections, (2) effects on the outcome of pre-existing infections, (3) whether immunosuppressive drugs used to treat ici-related adverse events affect the risk of infection or virulence of pre-existing infections, (4) whether the use of vaccines in ici-treated patients is considered safe, and (5) whether there are beneficial effects of icis that even qualify them as a therapeutic approach for these viral infections.on the use of immune checkpoint inhibitors in patients with viral infections including covid-19."
2020,"Sheng, et al.",Management of Breast Cancer During the COVID-19 Pandemic: A Stage- and Subtype-Specific Approach,Guidelines such as these will be important as we continue to balance treatment of breast cancer against risk of sars-cov-2 exposure and infection until approval of a vaccine.management of breast cancer during the covid-19 pandemic: a stage- and subtype-specific approach.
2020-06-30,"Sheng, et al.",Management of Breast Cancer During the COVID-19 Pandemic: A Stage- and Subtype-Specific Approach.,Guidelines such as these will be important as we continue to balance treatment of breast cancer against risk of sars-cov-2 exposure and infection until approval of a vaccine.management of breast cancer during the covid-19 pandemic: a stage- and subtype-specific approach.
2020-03-17,"Handel, et al.","If long-term suppression is not possible, how do we minimize mortality for COVID-19 and other emerging infectious disease outbreaks?","If covid-19 containment policies fail and social distancing measures cannot be sustained until vaccines becomes available, the next best approach is to use interventions that reduce mortality and prevent excess infections while allowing low-risk individuals to acquire immunity through natural infection until population level immunity is achieved."
2020,"Hanney, et al.",From COVID-19 research to vaccine application: why might it take 17 months not 17 years and what are the wider lessons?,"We noted four main approaches to reducing time-lags, namely increasing resources, working in parallel, starting or working at risk, and improving processes.examining these approaches alongside the matrix helps interpret the enormous global effort to develop a vaccine for the 2019 novel coronavirus sars-cov-2, the causative agent of covid-19."
