# Abstract
The objective of this notebook is to perform keyword analysis on the papers to find COVID-19 risk factors.

In [174]:
import re
from datetime import datetime
import numpy as np
import pandas as pd

In [160]:
papers_df = pd.read_csv('output/papers_df.csv',
                        index_col=0,
                        keep_default_na=False,
                        parse_dates=['publish_time'])

In [161]:
papers_df.head(2)

Unnamed: 0,doc_id,title,abstract,text_body,publish_time,authors,journal,doi,Title,H index
0,0015023cc06b5362d332b3baf348d11567ca2fbb,The RNA pseudoknots in foot-and-mouth disease ...,word count: 194 22 Text word count: 5168 23 24...,"VP3, and VP0 (which is further processed to VP...",2020-01-11,Joseph C. Ward; Lidia Lasecka-Dykes; Chris Nei...,,10.1101/2020.01.10.901801,,
1,00340eea543336d54adda18236424de6a5e91c9d,Analysis Title: Regaining perspective on SARS-...,"During the past three months, a new coronaviru...","In December 2019, a novel coronavirus, SARS-Co...",2020-03-20,Carla Mavian; Simone Marini; Costanza Manes; I...,,10.1101/2020.03.16.20034470,,


In [162]:
transmission_keywords = ['transmi', 'sneez', 'contact trac', 'reproduc', 'environ']

In [163]:
smoke_keywords = ['smok']
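
The keyword lists above are stems rather than whole words ('transmi' covers both "transmission" and "transmitted"). A minimal sketch of how such stems might be counted in a piece of text follows; `count_stem_hits` and the example sentence are hypothetical and not part of the original notebook.

```python
import re

def count_stem_hits(text, stems):
    """Count case-insensitive occurrences of each stem in `text`.

    Hypothetical helper, shown for illustration only.
    """
    lowered = text.lower()
    # re.escape makes each stem match literally, even if it contains regex metacharacters.
    return {stem: len(re.findall(re.escape(stem.lower()), lowered)) for stem in stems}

# Made-up sentence, not taken from the dataset.
example = "Smoking and smokers: transmission was traced; the virus is transmitted by sneezing."
hits = count_stem_hits(example, ['transmi', 'sneez', 'smok'])
print(hits)
```

Substring matching of this kind is deliberately loose: a short stem such as 'environ' will also match unrelated words that contain it, so the counts are best read as a rough signal.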

In [164]:
covid_keywords = ['COVID-19', 'HCoV-19', 'CORD-19', '2019-nCoV', 'Wuhan coronavirus', 'SARS-CoV-2', 'SARS-COV-2']
covid_keywords = [word.lower() for word in covid_keywords]
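
Note that 'SARS-CoV-2' and 'SARS-COV-2' collapse into the same string once lowercased, so that keyword appears twice in the list and every hit on it is reported twice in the output further down. A small sketch of an order-preserving de-duplication, assuming the names `raw_keywords` and `unique_keywords` (both introduced here for illustration):

```python
raw_keywords = ['COVID-19', 'HCoV-19', 'CORD-19', '2019-nCoV',
                'Wuhan coronavirus', 'SARS-CoV-2', 'SARS-COV-2']

# dict.fromkeys keeps the first-seen order while dropping repeated keys.
unique_keywords = list(dict.fromkeys(word.lower() for word in raw_keywords))
print(unique_keywords)
```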

In [165]:
covid_papers = papers_df[papers_df['text_body'].apply(lambda x : 
                                                any(key in x.lower() for
                                                    key in covid_keywords)
                                               )]

In [179]:
covid_papers = covid_papers[covid_papers['publish_time'] >= datetime(2019,1,1)]

In [180]:
covid_papers.shape

(1241, 10)

In [181]:
sample_df = covid_papers.sort_values(by='publish_time', ascending = True)

In [183]:
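for _, sample in sample_df.iterrows():
    print(sample['publish_time'], sample['title'])
    text = sample['text_body']
    matches = []
    for key in covid_keywords:
        # Keywords are already lowercase; escape them so they match literally.
        match = re.search(re.escape(key), text.lower())
        if match:
            print(match)
            matches.append(match)
    for match in matches:
        # Clamp the window start at 0: a negative start would index from the end
        # of the string and usually print an empty snippet.
        print(text[max(match.start() - 20, 0):match.end() + 20])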

2019-08-27 00:00:00 A molecular cell atlas of the human lung from single cell RNA sequencing
<re.Match object; span=(30102, 30110), match='covid-19'>
2, used by SARS and COVID-19 coronaviruses, and 
2019-11-22 00:00:00 VADR: validation and annotation of virus sequence submissions to GenBank
<re.Match object; span=(48307, 48315), match='covid-19'>
<re.Match object; span=(48461, 48471), match='sars-cov-2'>
<re.Match object; span=(48461, 48471), match='sars-cov-2'>
riction. Due to the COVID-19 outbreak beginning 
sequences including SARS-CoV-2, even though VADR w
sequences including SARS-CoV-2, even though VADR w
2019-12-31 00:00:00 
<re.Match object; span=(262, 270), match='covid-19'>
 for as long as the COVID-19 resource centre rem
2020-01-01 00:00:00 The novel coronavirus outbreak in Wuhan, China
<re.Match object; span=(87, 95), match='covid-19'>
<re.Match object; span=(73, 82), match='2019-ncov'>
irus (2019-nCoV, or COVID-19) was identified in 
 novel coronavirus (2019-nCoV, or COVID-

-CoV, MERS-CoV, and 2019-nCoV are all betacoronav
2020-02-18 00:00:00 Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2
<re.Match object; span=(79, 87), match='covid-19'>
<re.Match object; span=(299, 309), match='sars-cov-2'>
<re.Match object; span=(299, 309), match='sars-cov-2'>
er 2019, the recent COVID-19 pandemic has caused
(CoV) virus, dubbed SARS-CoV-2, of the betacoronav
(CoV) virus, dubbed SARS-CoV-2, of the betacoronav
2020-02-18 00:00:00 Title Page 1 2 Simulating and Forecasting the Cumulative Confirmed Cases of SARS-CoV-2 in China by Boltzmann Function-based Regression Analyses 4 5 of SARS-CoV-2 in mainland China (including Hubei Province) will become minimal between
<re.Match object; span=(146, 156), match='sars-cov-2'>
<re.Match object; span=(146, 156), match='sars-cov-2'>
 [2] [3] [4] . This SARS-CoV-2 outbreak was declar
 [2] [3] [4] . This SARS-CoV-2 outbreak was declar
2020-02-18 00:00:00 
<re.Match object; span=(788, 797), match='2019-nco

<re.Match object; span=(21, 31), match='sars-cov-2'>
<re.Match object; span=(21, 31), match='sars-cov-2'>
atterns (1) . While COVID-19 frequently induces 
 novel coronavirus (SARS-CoV-2) that emerged out o
 novel coronavirus (SARS-CoV-2) that emerged out o
2020-02-29 00:00:00 Clinical characteristics of 36 non-survivors with COVID-19 in Wuhan, China
<re.Match object; span=(196, 204), match='covid-19'>
<re.Match object; span=(1547, 1557), match='sars-cov-2'>
<re.Match object; span=(1547, 1557), match='sars-cov-2'>
 non-survivors with COVID-19.
 For this retrosp
rome coronavirus 2 (SARS-CoV-2) happened since Dec
rome coronavirus 2 (SARS-CoV-2) happened since Dec
2020-02-29 00:00:00 
<re.Match object; span=(421, 429), match='covid-19'>
on-to-person.
 The COVID-19 spread even shows t
2020-02-29 00:00:00 Sex differences in clinical findings among patients with coronavirus disease 2019 (COVID-19) and severe condition
<re.Match object; span=(985, 993), match='covid-19'>
<re.Match object; sp

<re.Match object; span=(24, 32), match='covid-19'>
<re.Match object; span=(709, 719), match='sars-cov-2'>
<re.Match object; span=(709, 719), match='sars-cov-2'>
shift hospitals for COVID-19 patients: where hea
 novel coronavirus (SARS-CoV-2). According to the 
 novel coronavirus (SARS-CoV-2). According to the 
2020-03-10 00:00:00 Aerodynamic Characteristics and RNA Concentration of SARS-CoV-2 Aerosol in Wuhan Hospitals during COVID-19 Outbreak
<re.Match object; span=(65, 73), match='covid-19'>
<re.Match object; span=(523, 533), match='sars-cov-2'>
<re.Match object; span=(523, 533), match='sars-cov-2'>
nd territories, the COVID-19 epidemic has result
rome Coronavirus 2 (SARS-CoV-2). [1] [2] [3] The t
rome Coronavirus 2 (SARS-CoV-2). [1] [2] [3] The t
2020-03-10 00:00:00 Title: Ascertainment rate of novel coronavirus disease (COVID- 1 19) in Japan 2 Running title: Ascertainment in Japan 3 Correspondence to
<re.Match object; span=(412, 420), match='covid-19'>
t mild case data of COVID-19 

<re.Match object; span=(1971, 1981), match='sars-cov-2'>
<re.Match object; span=(1971, 1981), match='sars-cov-2'>
rovince, to control COVID-19 spread. 2,3 WHO dec
rome coronavirus 2 (SARS-CoV-2) and the serial int
rome coronavirus 2 (SARS-CoV-2) and the serial int
2020-03-17 00:00:00 Title: International expansion of a novel SARS-CoV-2 mutant 2
<re.Match object; span=(616, 626), match='sars-cov-2'>
<re.Match object; span=(616, 626), match='sars-cov-2'>
nformation of human SARS-CoV-2 genome was derived 
nformation of human SARS-CoV-2 genome was derived 
2020-03-17 00:00:00 A Cybernetics-based Dynamic Infection Model for Analyzing SARS-COV-2 Infection Stability and Predicting Uncontrollable Risks
<re.Match object; span=(1696, 1704), match='covid-19'>
<re.Match object; span=(20, 30), match='sars-cov-2'>
<re.Match object; span=(20, 30), match='sars-cov-2'>
nfection process of COVID-19 in a city could be 
The spread speed of SARS-CoV-2 has been emergently
The spread speed of SARS-CoV-2 has 

<re.Match object; span=(13, 22), match='2019-ncov'>
<re.Match object; span=(24, 34), match='sars-cov-2'>
<re.Match object; span=(24, 34), match='sars-cov-2'>
makes up 1/8 of the COVID-19 genome, so this vac

oduction 2019-nCoV (SARS-CoV-2) was first reported
oduction 2019-nCoV (SARS-CoV-2) was first reported
2020-03-21 00:00:00 COVID-19 coronavirus vaccine design using reverse vaccinology and machine 1 learning 2 3
<re.Match object; span=(1149, 1157), match='covid-19'>
<re.Match object; span=(3100, 3110), match='sars-cov-2'>
<re.Match object; span=(3100, 3110), match='sars-cov-2'>
icacy and safety of COVID-19 88 vaccine developm
dhesive proteins in SARS-CoV-2 identified as poten
dhesive proteins in SARS-CoV-2 identified as poten
2020-03-21 00:00:00 Journal of Infection The index case of SARS-CoV-2 in Scotland: a case report
<re.Match object; span=(252, 260), match='covid-19'>
<re.Match object; span=(326, 336), match='sars-cov-2'>
<re.Match object; span=(326, 336), match='sars-cov-2'>
 

 novel coronavirus (SARS-CoV-2) infection since De
2020-07-31 00:00:00 Genetic diversity and evolution of SARS-CoV-2
<re.Match object; span=(18, 28), match='sars-cov-2'>
<re.Match object; span=(18, 28), match='sars-cov-2'>


2020-07-31 00:00:00 Pathogenic viruses: Molecular detection and characterization
<re.Match object; span=(24215, 24224), match='2019-ncov'>
 novel coronavirus (2019-nCoV) which emerged in W
2020-07-31 00:00:00 Cardiothoracic Imaging Asymptomatic novel coronavirus pneumonia patient outside Wuhan: The value of CT images in the course of the disease
<re.Match object; span=(239, 247), match='covid-19'>
<re.Match object; span=(194, 203), match='2019-ncov'>
ly officially named COVID-19 by the WHO on Febru
 novel coronavirus (2019-nCoV) and subsequently o
2020-07-31 00:00:00 Severe air pollution events not avoided by reduced anthropogenic activities during COVID-19 outbreak
<re.Match object; span=(3184, 3192), match='covid-19'>
virus disease 2019 (COVID-19) is an infectiou
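
Because `re.search` stops at the first match, the loop above shows at most one snippet per keyword per paper. A minimal sketch of collecting every occurrence with its surrounding context is given below; `keyword_contexts`, its `width` parameter, and the example text are assumptions made for illustration, not part of the original notebook.

```python
import re

def keyword_contexts(text, keywords, width=20):
    """Return (keyword, snippet) pairs for every case-insensitive keyword hit.

    Hypothetical helper; `width` is the number of context characters
    kept on each side of the hit.
    """
    contexts = []
    for key in keywords:
        for match in re.finditer(re.escape(key), text, flags=re.IGNORECASE):
            start = max(match.start() - width, 0)  # never wrap around to the end of the string
            contexts.append((key, text[start:match.end() + width]))
    return contexts

# Made-up text, not taken from the dataset.
text = "SARS-CoV-2 causes COVID-19. Early reports called COVID-19 a pneumonia."
found = keyword_contexts(text, ['covid-19', 'sars-cov-2'], width=5)
for key, snippet in found:
    print(key, '|', snippet)
```

Matching with `re.IGNORECASE` on the original text, rather than on a lowercased copy, keeps the match offsets aligned with the string that is actually sliced.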

In [185]:
sample_df.iloc[1]['text_body']

'As of September 2019, GenBank [1] contained more than 3 million viral sequences totaling over 4 billion nucleotides in length and including over 180,000 complete genomes for viruses other than influenza. More than 250,000 of these sequences were submitted in 2018. All sequence submissions are validated prior to deposition in GenBank. Automated validation and annotation methods become increasingly important as sequence submission numbers grow. Table 1 shows the number of sequences for the 16 virus species with the most sequences in GenBank. Influenza sequences are the second most abundant and the National Center of Biotechnology Information (NCBI), where GenBank is housed, has expended considerable effort to organize flu sequences and streamline the submission of new influenza virus sequences, including a tool to validate and annotate flu submissions called FLAN [2] . The influenza virus sequence submission tool (https://www.ncbi.nlm.nih.gov/ genome/viruses/variation/help/flu-help-cent

In [133]:
matches = []
for key in covid_keywords:
    match = re.search(key.lower(), sample['text_body'].lower())
    if match:
        print(match)
        matches.append(match)

<re.Match object; span=(27, 44), match='novel coronavirus'>


In [126]:
for match in matches:
    # Clamp the window start at 0 so matches near the beginning still print a snippet.
    print(sample['text_body'][max(match.start() - 20, 0):match.end() + 20])

 coronavirus 229E (HcoV-229E) (Sanchez et al.
