# COVID-19 Thematic tagging with Regular Expressions

***UPDATE: Most of the ideas in this Notebook can be accessed via the [covid19_tools](https://www.kaggle.com/ajrwhite/covid19-tools) utility script, which you can import into your own Notebook, and is being updated more frequently than this.***

Goal: tag papers with themes (e.g. `tag_disease_covid19` or `tag_risk_smoking`) _using handcrafted rules_ based on synonyms and related terms, searching the **all_sources_metadata** file.

The chart below shows how multiple synonyms for Covid-19 become a single boolean field in the metadata:

In [1]:
import plotly.express as px
import plotly.graph_objects as go
    
hardcoded_data_for_intro_chart = {
    'covid': 1231,
    '2019 ncov': 576,
    'sars cov 2': 501,
    r'coronavirus 2\b': 154,
    'coronavirus 2019': 61,
    'wuhan coronavirus': 13,
    'coronavirus disease 19': 12,
    'ncov 2019': 10,
    'wuhan pneumonia': 7,
    '2019ncov': 6,
    'wuhan virus': 3,
    r'2019n cov\b': 2,
    r'2019 n cov\b': 2,
    r'\bn cov 2019': 0
}

title = 'Covid19 synonyms in title / abstract metadata<br><i>Hover over dots for exact values</i>'
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=list(hardcoded_data_for_intro_chart.values())[::-1],
    y=list(hardcoded_data_for_intro_chart.keys())[::-1],
    marker=dict(color="crimson", size=12),
    mode="markers",
    name='Synonyms'
))

fig.add_trace(go.Scatter(
    x=[2105],
    y=['ncov 2019'],
    marker=dict(color='blue', size=20),
    mode='markers',
    text='tag_disease_covid19',
    name='tag_disease_covid19'
))

fig.add_annotation

fig.update_layout(title=title,
              xaxis_title='Counts',
              yaxis_title='Regular Expressions')
fig.show()

## Contents

1. [Motivation](#Motivation) - Why filter papers with regular expressions?
2. **[Diseases and conditions](#Diseases)** - Does paper discuss Covid-19, SARS, MERS, etc.?
3. **[Research Design](#Design)** - Is research design indicated in the abstract?
4. **[Potential risk factors](#Risks)** - Does paper indicate characteristics, comorbidities?
5. **[Immunity and vaccinations](#Immunity)** - Does paper discuss immunity and / or vaccines?
6. **[Geographies](#Geographies)** - Does paper cover specific continents, countries, etc?
7. **[Climate](#Climate)** - Does paper cover issues relating to climate and weather?
8. **[Transmission](#Transmission)** - Does paper mention transmission routes / rates?
9. [Output](#Output) - File outputs
10. [Filtering Tool](#Filtering) - TODO


## Motivation

**Unfocused dataset**: Dataset contains >44k papers, but most of them aren't specifically about Covid-19.

**Inconsistent terminology**: Terminology for Covid-19 wasn't standardised \[[1](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-%28covid-2019%29-and-the-virus-that-causes-it)\] \[[2](https://qz.com/1820422/coronavirus-why-wont-who-use-the-name-sars-cov-2/)\] when papers first emerged. 

**Handcrafted features can outperform inferred features**: Domain-specific handcrafted synonym lists can tag papers more efficiently than generic topic modelling approaches.

**Faster filtering on metadata**: Most of these themes can be extracted from the metadata (title and abstract). We can filter on these tags to identify useful papers for more involved analysis.


Click on **Code** button below to see code to import libraries and load data.

In [2]:
# Data libraries
import pandas as pd
import re
import pycountry

# Visualisation libraries
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

pd.set_option('display.max_columns', 500)

# Load data
metadata_file = '../input/CORD-19-research-challenge/metadata.csv'
df = pd.read_csv(metadata_file,
                 dtype={'Microsoft Academic Paper ID': str,
                        'pubmed_id': str})

def doi_url(d):
    if d.startswith('http'):
        return d
    elif d.startswith('doi.org'):
        return f'http://{d}'
    else:
        return f'http://doi.org/{d}'
    
df.doi = df.doi.fillna('').apply(doi_url)

print(f'loaded DataFrame with {len(df)} records')

loaded DataFrame with 45774 records


In [3]:
# Helper function for filtering df on abstract + title substring
def abstract_title_filter(search_string):
    return (df.abstract.str.lower().str.replace('-', ' ').str.contains(search_string, na=False) |
            df.title.str.lower().str.replace('-', ' ').str.contains(search_string, na=False))

In [4]:
# Helper function for Cleveland dot plot visualisation of count data
def dotplot(input_series, title, x_label='Count', y_label='Regex'):
    subtitle = '<br><i>Hover over dots for exact values</i>'
    fig = go.Figure()
    fig.add_trace(go.Scatter(
    x=input_series.sort_values(),
    y=input_series.sort_values().index.values,
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="Count",
    ))
    fig.update_layout(title=f'{title}{subtitle}',
                  xaxis_title=x_label,
                  yaxis_title=y_label)
    fig.show()

In [5]:
# Helper function which counts synonyms and adds tag column to DF
def count_and_tag(df: pd.DataFrame,
                  synonym_list: list,
                  tag_suffix: str) -> (pd.DataFrame, pd.Series):
    counts = {}
    df[f'tag_{tag_suffix}'] = False
    for s in synonym_list:
        synonym_filter = abstract_title_filter(s)
        counts[s] = sum(synonym_filter)
        df.loc[synonym_filter, f'tag_{tag_suffix}'] = True
    return df, pd.Series(counts)

In [6]:
# Function for printing out key passage of abstract based on key terms
def print_key_phrases(df, key_terms, n=5, chars=300):
    for ind, item in enumerate(df[:n].itertuples()):
        print(f'{ind+1} of {len(df)}')
        print(item.title)
        print('[ ' + item.doi + ' ]')
        try:
            i = len(item.abstract)
            for kt in key_terms:
                kt = kt.replace(r'\b', '')
                term_loc = item.abstract.lower().find(kt)
                if term_loc != -1:
                    i = min(i, term_loc)
            if i < len(item.abstract):
                print('    "' + item.abstract[i-30:i+chars-30] + '"')
            else:
                print('    "' + item.abstract[:chars] + '"')
        except:
            print('NO ABSTRACT')
        print('---')

# Diseases

- Covid-19
- Severe Acute Respiratory Syndrome (SARS)
- Middle East Respiratory Syndrome (MERS)
- Coronaviruses
- Acute Respiratory Distress Syndrome (ARDS)

## Covid-19

We are looking for papers that specifically refer to the recent outbreak, known variously as Covid-19, SARS-CoV-2, 2019-nCoV, Wuhan Pneumonia, novel coronavirus.

See: https://en.wikipedia.org/wiki/Coronavirus_disease_2019

In [7]:
covid19_synonyms = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # Note that search function replaces '-' with ' '
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b']

In [8]:
df, covid19_counts = count_and_tag(df, covid19_synonyms, 'disease_covid19')

In [9]:
covid19_counts.sort_values(ascending=False)

covid                     1571
sars cov 2                 615
2019 ncov                  592
coronavirus 2\b            176
coronavirus 2019            64
coronavirus disease 19      16
wuhan coronavirus           13
ncov 2019                   11
wuhan pneumonia              7
2019ncov                     6
wuhan virus                  4
2019 n cov\b                 3
2019n cov\b                  2
\bn cov 2019                 0
dtype: int64

In [10]:
dotplot(covid19_counts, 'Covid-19 synonyms in title / abstract metadata')

In [11]:
novel_corona_filter = (abstract_title_filter('novel corona') &
                       df.publish_time.str.startswith('2020', na=False))
print(f'novel corona (published 2020): {sum(novel_corona_filter)}')
df.loc[novel_corona_filter, 'tag_disease_covid19'] = True

novel corona (published 2020): 1068


In [12]:
df.tag_disease_covid19.value_counts()

False    43294
True      2480
Name: tag_disease_covid19, dtype: int64

In [13]:
# SENSE CHECK: Confirm these all published 2020 (or missing date)
df[df.tag_disease_covid19].publish_time.str.slice(0, 4).value_counts(dropna=False)

2020    2465
NaN        8
2016       2
2018       1
2017       1
2011       1
2015       1
2019       1
Name: publish_time, dtype: int64

In [14]:
# Fix the earlier papers that are about something else
df.loc[df.tag_disease_covid19 & ~df.publish_time.str.startswith('2020', na=True),
       'tag_disease_covid19'] = False

## Severe Acute Respiratory Syndrome (SARS)

SARS typically means the related coronavirus that caused an outbreak in 2003, although Covid-19 is sometimes referred to with a SARS name.

See: https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus

In [15]:
sars_synonyms = [r'\bsars\b',
                 'severe acute respiratory syndrome']

In [16]:
df, sars_counts = count_and_tag(df, sars_synonyms, 'disease_sars')

In [17]:
sars_counts

\bsars\b                             4379
severe acute respiratory syndrome    2869
dtype: int64

In [18]:
df.tag_disease_sars.value_counts()

False    40972
True      4802
Name: tag_disease_sars, dtype: int64

In [19]:
df.groupby('tag_disease_covid19').tag_disease_sars.value_counts()

tag_disease_covid19  tag_disease_sars
False                False               39355
                     True                 3946
True                 False                1617
                     True                  856
Name: tag_disease_sars, dtype: int64

## Middle East Respiratory Syndrome (MERS)

See: https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome

In [20]:
mers_synonyms = [r'\bmers\b',
                 'middle east respiratory syndrome']

In [21]:
df, mers_counts = count_and_tag(df, mers_synonyms, 'disease_mers')

In [22]:
mers_counts

\bmers\b                            1694
middle east respiratory syndrome    1470
dtype: int64

In [23]:
df.tag_disease_mers.value_counts()

False    43850
True      1924
Name: tag_disease_mers, dtype: int64

In [24]:
df.groupby('tag_disease_covid19').tag_disease_mers.value_counts()

tag_disease_covid19  tag_disease_mers
False                False               41534
                     True                 1767
True                 False                2316
                     True                  157
Name: tag_disease_mers, dtype: int64

## Coronaviruses

**IMPORTANT: This tag needs more work.**

Coronaviruses are a group of related viruses that cause disease in mammals and birds.

See: https://en.wikipedia.org/wiki/Coronavirus

In [25]:
corona_synonyms = ['corona', r'\bcov\b']

In [26]:
df, corona_counts = count_and_tag(df, corona_synonyms, 'disease_corona')

In [27]:
corona_counts

corona     9298
\bcov\b    3626
dtype: int64

In [28]:
df.tag_disease_corona.value_counts()

False    35971
True      9803
Name: tag_disease_corona, dtype: int64

In [29]:
df.groupby('tag_disease_covid19').tag_disease_corona.value_counts()

tag_disease_covid19  tag_disease_corona
False                False                 35400
                     True                   7901
True                 True                   1902
                     False                   571
Name: tag_disease_corona, dtype: int64

## Acute Respiratory Distress Syndrome (ARDS)

ARDS is a possible consequence of Covid-19 infection.

See: https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome

In [30]:
ards_synonyms = ['acute respiratory distress syndrome',
                 r'\bards\b']

In [31]:
df, ards_counts = count_and_tag(df, ards_synonyms, 'disease_ards')

In [32]:
ards_counts

acute respiratory distress syndrome    283
\bards\b                               207
dtype: int64

In [33]:
df.tag_disease_ards.value_counts()

False    45445
True       329
Name: tag_disease_ards, dtype: int64

In [34]:
n = (df.tag_disease_covid19 & df.tag_disease_ards).sum()
print(f'There are {n} papers on Covid-19 and ARDS.')

There are 55 papers on Covid-19 and ARDS.


# Design

Research design (thanks to Savanna Reid for input on these):

- risk factor analysis
    - retrospective cohort
    - cross-sectional case-control
    - prospective case-control
    - matched case-control
    - medical records review
    - seroprevalence survey
    - syndromic surveillance
- time series analysis
    - survival analysis

In [35]:
riskfac_synonyms = [
    'risk factor analysis',
    'cross sectional case control',
    'prospective case control',
    'matched case control',
    'medical records review',
    'seroprevalence survey',
    'syndromic surveillance'
]
df, riskfac_counts = count_and_tag(df, riskfac_synonyms, 'design_riskfac')
dotplot(riskfac_counts, 'Risk factor analysis synonyms in title / abstract metadata')

In [36]:
riskfac_counts.sort_values(ascending=False)

syndromic surveillance          71
prospective case control        13
risk factor analysis            10
matched case control             9
seroprevalence survey            4
medical records review           0
cross sectional case control     0
dtype: int64

In [37]:
n = (df.tag_disease_covid19 & df.tag_design_riskfac).sum()
print(f'There are {n} papers on Covid-19 with a Risk Factor Analysis research design.')

There are 2 papers on Covid-19 with a Risk Factor Analysis research design.


# Risks

Potential risk factors:

- Generic risk factors
- _Demographic_:
    - Age
    - Sex
    - Bodyweight
    - Blood type
    - Ethnicity (TODO)
- _Behavioural:
    - Smoking
    - Occupation (TODO)
    - Animal contact (TODO)
    - Social activity (TODO)
- _Pre-existing conditions_:
    - Diabetes
    - Hypertension
    - Immunodeficiency (general)
    - Cancer (general)
    - Chronic respiratory disease (general - inc. asthma, bronchitis)
    - Asthma
    - Cardiovascular disease (TODO)
    - Chronic respiratory disease / bronchitis (TODO)
    - Cerebral infarction (TODO)

See _Estimation of risk factors for COVID-19 mortality - preliminary results_, https://doi.org/10.1101/2020.02.24.20027268

## Generic risk factors

Look for text that indicates that risk factors are assessed in the paper.

In [38]:
risk_factor_synonyms = ['risk factor',
                        'risk model',
                        'risk by',
                        'comorbidity',
                        'comorbidities',
                        'coexisting condition',
                        'co existing condition',
                        'clinical characteristics',
                        'clinical features',
                        'demographic characteristics',
                        'demographic features',
                        'behavioural characteristics',
                        'behavioural features',
                        'behavioral characteristics',
                        'behavioral features',
                        'predictive model',
                        'prediction model',
                        'univariate', # implies analysis of risk factors
                        'multivariate', # implies analysis of risk factors
                        'multivariable',
                        'univariable',
                        'odds ratio', # typically mentioned in model report
                        'confidence interval', # typically mentioned in model report
                        'logistic regression',
                        'regression model',
                        'factors predict',
                        'factors which predict',
                        'factors that predict',
                        'factors associated with',
                        'underlying disease',
                        'underlying condition']
df, risk_generic_counts = count_and_tag(df, risk_factor_synonyms, 'risk_generic')
dotplot(risk_generic_counts,
        'Count of generic risk factor indicated in title / abstract')

In [39]:
risk_generic_counts.sort_values(ascending=False)

risk factor                    850
confidence interval            568
odds ratio                     378
logistic regression            357
clinical features              345
clinical characteristics       335
multivariate                   289
factors associated with        246
comorbidities                  195
regression model               195
multivariable                  146
underlying disease              93
univariate                      91
comorbidity                     74
demographic characteristics     58
predictive model                48
prediction model                41
underlying condition            20
risk by                         13
univariable                     11
risk model                      11
factors predict                  9
demographic features             8
factors that predict             5
coexisting condition             3
co existing condition            2
behavioral features              1
behavioural characteristics      1
factors which predic

In [40]:
n = (df.tag_disease_covid19 & df.tag_risk_generic).sum()
print(f'There are {n} papers on Covid-19 and generic risk factors.')

There are 355 papers on Covid-19 and generic risk factors.


Printing out 5 examples, and key text from the Abstract.

In [41]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_generic],
                  risk_factor_synonyms)

1 of 355
Real time estimation of the risk of death from novel coronavirus (2019-nCoV) infection: Inference using exported cases
[ http://doi.org/10.1101/2020.01.29.20019547 ]
    " estimated at 5433 cases (95% confidence interval (CI): 3883, 7160) and 17780 cases (95% CI: 9646, 28724), respectively. The latest estimates of the cCFR were 4.6% (95% CI: 3.1-6.6) for scenario 1 and 7.7% (95% CI: 4.9-11.3%) for scenario 2, respectively. The basic reproduction number was estimated "
---
2 of 355
Transmission and epidemiological characteristics of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infected Pneumonia (COVID-19): preliminary evidence obtained in comparison with 2003-SARS
[ http://doi.org/10.1101/2020.01.30.20019836 ]
    "ed separately and compared. A multivariate function model was constructed based on the confirmed COVID-19 case data. Results: The growth rate of new cases and deaths of COVID-19 were significantly faster than those of 2003-SARS. The number of confirm

## Demographic risk factors

## Age

In [42]:
age_synonyms = ['median age',
                'mean age',
                'average age',
                'elderly',
                r'\baged\b',
                r'\bold',
                'young',
                'teenager',
                'adult',
                'child'
               ]
df, age_counts = count_and_tag(df, age_synonyms, 'risk_age')
dotplot(age_counts, 'Age synonyms in title / abstract metadata')

In [43]:
age_counts.sort_values(ascending=False)

child          2492
\bold          1913
adult          1755
young          1144
\baged\b        705
elderly         374
median age      231
mean age        205
average age      36
teenager          6
dtype: int64

In [44]:
n = (df.tag_disease_covid19 & df.tag_risk_age).sum()
print(f'There are {n} papers on Covid-19 and age.')

There are 352 papers on Covid-19 and age.


## Sex

e.g. _Sex difference and smoking predisposition in patients with COVID-19_, https://doi.org/10.1016/S2213-2600(20)30117-X

In [45]:
sex_synonyms = ['sex',
                'gender',
                r'\bmale\b',
                r'\bfemale\b',
                r'\bmales\b',
                r'\bfemales\b',
                r'\bmen\b',
                r'\bwomen\b'
               ]
df, sex_counts = count_and_tag(df, sex_synonyms, 'risk_sex')
dotplot(sex_counts, 'Sex / gender synonyms in title / abstract metadata')

In [46]:
sex_counts.sort_values(ascending=False)

\bmale\b       594
sex            494
\bfemale\b     471
\bwomen\b      375
\bmales\b      254
gender         251
\bfemales\b    231
\bmen\b        198
dtype: int64

In [47]:
n = (df.tag_disease_covid19 & df.tag_risk_sex).sum()
print(f'There are {n} papers on Covid-19 and sex / gender.')

There are 226 papers on Covid-19 and sex / gender.


## Bodyweight

Obesity and related problems (e.g. diabetes, hypertension) have been widely speculated as risk factors, e.g. _The confluence of the COVID19 pandemic with the obesity epidemic_, https://doi.org/10.1136/bmj.m810

In [48]:
bodyweight_synonyms = [
    'overweight',
    'over weight',
    'obese',
    'obesity',
    'bodyweight',
    'body weight',
    r'\bbmi\b',
    'body mass',
    'body fat',
    'bodyfat',
    'kilograms',
    r'\bkg\b', # e.g. 70 kg
    r'\dkg\b'  # e.g. 70kg
]
df, bodyweight_counts = count_and_tag(df, bodyweight_synonyms, 'risk_bodyweight')
dotplot(bodyweight_counts, 'Bodyweight synonyms in title / abstract data')

In [49]:
bodyweight_counts.sort_values(ascending=False)

\bkg\b         384
body weight    188
obesity        100
body mass       34
obese           31
\dkg\b          29
\bbmi\b         20
overweight       9
bodyweight       6
body fat         2
kilograms        1
bodyfat          0
over weight      0
dtype: int64

In [50]:
n = (df.tag_disease_covid19 & df.tag_risk_bodyweight).sum()
print(f'There are {n} papers on Covid-19 and bodyweight')

There are 11 papers on Covid-19 and bodyweight


In [51]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_bodyweight],
                  bodyweight_synonyms)

1 of 11
Early, low-dose and short-term application of corticosteroid treatment in patients with severe COVID-19 pneumonia: single-center experience from Wuhan, China
[ http://doi.org/10.1101/2020.03.06.20032342 ]
    ""
---
2 of 11
Epidemiological, Clinical Characteristics and Outcome of Medical Staff Infected with COVID-19 in Wuhan, China: A Retrospective Case Series Analysis
[ http://doi.org/10.1101/2020.03.09.20033118 ]
    ""
---
3 of 11
Clinical Characteristics of 74 Children with Coronavirus Disease 2019
[ http://doi.org/10.1101/2020.03.19.20027078 ]
    ""
---
4 of 11
An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 and multiple endemic, epidemic and bat coronavirus
[ http://doi.org/10.1101/2020.03.19.997890 ]
    ", and reduced virus titer and body weight loss. Decreased MERS-CoV yields in vitro and in vivo were associated with increased transition mutation frequency in viral but not host cell RNA, supporting a mechanism of lethal mutagenesis. The potency of 

## Smoking

e.g. _Sex difference and smoking predisposition in patients with COVID-19_,  https://doi.org/10.1016/S2213-2600(20)30117-X

- smoking
- smoke(rs)
- cigarette(s)
- cigar(s)
- e-cigarette(s)
- cannabis / marijuana / thc

In [52]:
smoking_synonyms = ['smoking',
                    'smoke',
                    'cigar', # this picks up cigar, cigarette, e-cigarette, etc.
                    'nicotine',
                    'cannabis',
                    'marijuana']
df, smoking_counts = count_and_tag(df, smoking_synonyms, 'risk_smoking')
dotplot(smoking_counts, 'Smoking synonym counts in title / abstract metadata')

In [53]:
smoking_counts.sort_values(ascending=False)

smoking      121
smoke         98
cigar         35
nicotine      10
cannabis       2
marijuana      1
dtype: int64

In [54]:
df.groupby('tag_disease_covid19').tag_risk_smoking.value_counts()

tag_disease_covid19  tag_risk_smoking
False                False               43115
                     True                  186
True                 False                2458
                     True                   15
Name: tag_risk_smoking, dtype: int64

In [55]:
n = (df.tag_disease_covid19 & df.tag_risk_smoking).sum()
print(f'tag_disease_covid19 x tag_risk_smoking currently returns {n} papers')

tag_disease_covid19 x tag_risk_smoking currently returns 15 papers


In [56]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_smoking],
                  smoking_synonyms, n=12)

1 of 15
Integrative Bioinformatics Analysis Provides Insight into the Molecular Mechanisms of 2019-nCoV
[ http://doi.org/10.1101/2020.02.03.20020206 ]
    " on the expression of ACE2 in smoking individuals, we inferred that long-term smoking might be a risk factor for 2019-nCoV. Analyzing the ACE2 in SARS-CoV infected cells suggested that ACE2 was more than just a receptor but also participated in post-infection regulation, including immune response, c"
---
2 of 15
Bulk and single-cell transcriptomics identify tobacco-use disparity in lung gene expression of ACE2, the receptor of 2019-nCov
[ http://doi.org/10.1101/2020.02.05.20020107 ]
    "ated to race, age, gender and smoking status in ACE2 gene expression and its distribution among cell types. We didn't find significant disparities in ACE2 gene expression between racial groups (Asian vs Caucasian), age groups (>60 vs <60) or gender groups (male vs female). However, we observed signi"
---
3 of 15
Comorbidity and its impact on 1,590 p

## Diabetes

- Type I Diabetes
- Type II Diabetes

In [57]:
diabetes_synonyms = [
    'diabet', # picks up diabetes, diabetic, etc.
    'insulin', # any paper mentioning insulin likely to be relevant
    'blood sugar',
    'blood glucose',
    'ketoacidosis',
    'hyperglycemi', # picks up hyperglycemia and hyperglycemic
]
df, diabetes_counts = count_and_tag(df, diabetes_synonyms, 'risk_diabetes')
dotplot(diabetes_counts, 'Diabetes synonym counts in title / abstract metadata')

In [58]:
diabetes_counts.sort_values(ascending=False)

diabet           350
insulin          112
blood glucose     31
hyperglycemi      18
ketoacidosis       2
blood sugar        2
dtype: int64

In [59]:
n = (df.tag_disease_covid19 & df.tag_risk_diabetes).sum()
print(f'There are {n} papers on Covid-19 and diabetes')

There are 57 papers on Covid-19 and diabetes


In [60]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_diabetes],
                  diabetes_synonyms, n=49)

1 of 57
Clinical features and progression of acute respiratory distress syndrome in coronavirus disease 2019
[ http://doi.org/10.1101/2020.02.17.20024166 ]
    "xistent conditions, including diabetes (20.8% vs. 1.8%), cerebrovascular disease (11.3% vs. 0%), and chronic kidney disease (15.1% vs. 3.6%). Compared to mild ARDS patients, those with moderate and severe ARDS had higher mortality rates. No significant effect of antivirus, glucocorticoid, or immunog"
---
2 of 57
Clinical characteristics of 25 death cases infected with COVID-19 pneumonia: a retrospective review of medical records in a single medical center, Wuhan, China
[ http://doi.org/10.1101/2020.02.19.20025239 ]
    "ion (16/25, 64%), followed by diabetes (10/25, 40%), heart diseases (8/25, 32%), kidney diseases (5/25, 20%), cerebral infarction (4/25, 16%), chronic obstructive pulmonary disease (COPD, 2/25, 8%), malignant tumors (2/25, 8%) and acute pancreatitis (1/25, 4%). The most common organ damage outside t"
---
3 of 57

## Hypertension

In [61]:
hypertension_synonyms = [
    'hypertension',
    'blood pressure',
    r'\bhbp\b', # HBP = high blood pressure
    r'\bhtn\b' # HTN = hypertension
]
df, hypertension_counts = count_and_tag(df, hypertension_synonyms, 'risk_hypertension')
dotplot(hypertension_counts, 'Hypertension synonyms in title / abstract metadata')

In [62]:
hypertension_counts.sort_values(ascending=False)

hypertension      199
blood pressure     77
\bhbp\b             4
\bhtn\b             2
dtype: int64

In [63]:
n = (df.tag_disease_covid19 & df.tag_risk_hypertension).sum()
print(f'There are {n} papers on Covid-19 and hypertension')

There are 49 papers on Covid-19 and hypertension


## Immunodeficiency

Immunodeficiency (e.g. HIV / AIDS, side effect of chemotherapy, etc.) may be important.

In [64]:
immunodeficiency_synonyms = [
    'immune deficiency',
    'immunodeficiency',
    r'\bhiv\b',
    r'\baids\b'
    'granulocyte deficiency',
    'hypogammaglobulinemia',
    'asplenia',
    'dysfunction of the spleen',
    'spleen dysfunction',
    'complement deficiency',
    'neutropenia',
    'neutropaenia', # alternate spelling
    'cell deficiency' # e.g. T cell deficiency, B cell deficiency
]
df, immunodeficiency_counts = count_and_tag(df,
                                            immunodeficiency_synonyms,
                                            'risk_immunodeficiency')
dotplot(immunodeficiency_counts, 'Immunodeficiency synonyms in title / abstract metadata')

In [65]:
immunodeficiency_counts.sort_values(ascending=False)

\bhiv\b                           1326
immunodeficiency                   745
neutropenia                         62
immune deficiency                   43
hypogammaglobulinemia                9
cell deficiency                      6
asplenia                             1
neutropaenia                         0
complement deficiency                0
spleen dysfunction                   0
dysfunction of the spleen            0
\baids\bgranulocyte deficiency       0
dtype: int64

In [66]:
n = (df.tag_disease_covid19 & df.tag_risk_immunodeficiency).sum()
print(f'tag_disease_covid19 x tag_risk_immunodeficiency currently returns {n} papers')

tag_disease_covid19 x tag_risk_immunodeficiency currently returns 22 papers


In [67]:
df[df.tag_disease_covid19 & df.tag_risk_immunodeficiency].head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url,tag_disease_covid19,tag_disease_sars,tag_disease_mers,tag_disease_corona,tag_disease_ards,tag_design_riskfac,tag_risk_generic,tag_risk_age,tag_risk_sex,tag_risk_bodyweight,tag_risk_smoking,tag_risk_diabetes,tag_risk_hypertension,tag_risk_immunodeficiency
138,1qniriu0,6ae20454d1a9f228864de24660c2460becbc8151,biorxiv,Machine intelligence design of 2019-nCoV drugs,http://doi.org/10.1101/2020.01.30.927889,,,biorxiv,"AbstractWuhan coronavirus, called 2019-nCoV, i...",2020-02-04,Kaifu Gao; Duc Duy Nguyen; Rui Wang; Guo-Wei Wei,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/2020.01.30.927889,True,True,False,True,False,False,False,False,False,False,False,False,False,True
144,cszqykpu,ff54e3e961a72eb1d2500166809b3651b2f98cf6,biorxiv,Predicting commercially available antiviral dr...,http://doi.org/10.1101/2020.01.31.929547,,,biorxiv,AbstractThe infection of a novel coronavirus f...,2020-02-02,Bo Ram Beck; Bonggun Shin; Yoonjung Choi; Sung...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/2020.01.31.929547,True,False,False,True,False,False,False,False,False,False,False,False,False,True
145,vbgf50os,9e94f9379fd74fcacc4f3a57e03cbe9035efee8e,biorxiv,Molecular Modeling Evaluation of the Binding E...,http://doi.org/10.1101/2020.01.31.929695,,,biorxiv,"AbstractThree anti-HIV drugs, ritonavir, lopin...",2020-02-03,Shen Lin; Runnan Shen; Jingdong He; Xinhao Li;...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/2020.01.31.929695,True,True,False,True,False,False,False,False,False,False,False,False,False,True
166,mtv80pjo,8a1fde8c65e439496ac5810504de23ef77312f28,biorxiv,Protein structure and sequence re-analysis of ...,http://doi.org/10.1101/2020.02.04.933135,,,biorxiv,AbstractAs the infection of 2019-nCoV coronavi...,2020-02-08,Chengxin Zhang; Wei Zheng; Xiaoqiang Huang; Er...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/2020.02.04.933135,True,False,False,True,False,False,False,False,False,False,False,False,False,True
396,7s9ot4vq,875b7c463f00772fa0dc18ada678bc1ff16a4274,medrxiv,"Comorbidity and its impact on 1,590 patients w...",http://doi.org/10.1101/2020.02.25.20027664,,,medrvix,Objective: To evaluate the spectrum of comorbi...,2020-02-27,Wei-jie Guan; Wen-hua Liang; Yi Zhao; Heng-rui...,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/2020.02.25.20027664,True,False,False,True,False,False,True,True,True,False,True,True,True,True


## Cancer

In [68]:
cancer_synonyms = [
    'cancer',
    'malignant tumour',
    'malignant tumor',
    'melanoma',
    'leukemia',
    'leukaemia',
    'chemotherapy',
    'radiotherapy',
    'radiation therapy',
    'lymphoma',
    'sarcoma',
    'carcinoma',
    'blastoma',
    'oncolog'
]
df, cancer_counts = count_and_tag(df, cancer_synonyms, 'risk_cancer')
dotplot(cancer_counts, 'Cancer synonyms in title / abstract metadata')

In [69]:
cancer_counts.sort_values(ascending=False)

cancer               1196
leukemia              268
carcinoma             262
chemotherapy          136
lymphoma              122
sarcoma               112
melanoma               82
oncolog                74
blastoma               69
leukaemia              51
malignant tumor        19
radiotherapy           18
malignant tumour        5
radiation therapy       3
dtype: int64

In [70]:
n = (df.tag_disease_covid19 & df.tag_risk_cancer).sum()
print(f'There are {n} papers on Covid-19 and cancer')

There are 56 papers on Covid-19 and cancer


## Chronic respiratory disease

In [71]:
chronicresp_synonyms = [
    'chronic respiratory disease',
    'asthma',
    'chronic obstructive pulmonary disease',
    r'\bcopd',
    'chronic bronchitis',
    'emphysema'
]
df, chronicresp_counts = count_and_tag(df, chronicresp_synonyms, 'risk_chronicresp')
dotplot(chronicresp_counts, 'Chronic respiratory disease terms in title / abstract metadata')

In [72]:
chronicresp_counts.sort_values(ascending=False)

asthma                                   511
chronic obstructive pulmonary disease    203
\bcopd                                   181
chronic respiratory disease               41
chronic bronchitis                        25
emphysema                                 22
dtype: int64

In [73]:
n = (df.tag_disease_covid19 & df.tag_risk_chronicresp).sum()
print(f'There are {n} papers on Covid-19 and chronic respiratory disease')

There are 21 papers on Covid-19 and chronic respiratory disease


In [74]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_chronicresp],
                  chronicresp_synonyms, n=15)

1 of 21
Clinical characteristics of 25 death cases infected with COVID-19 pneumonia: a retrospective review of medical records in a single medical center, Wuhan, China
[ http://doi.org/10.1101/2020.02.19.20025239 ]
    "ebral infarction (4/25, 16%), chronic obstructive pulmonary disease (COPD, 2/25, 8%), malignant tumors (2/25, 8%) and acute pancreatitis (1/25, 4%). The most common organ damage outside the lungs was the heart, followed by kidney and liver. In the patients' last examination before death, white blood"
---
2 of 21
Estimation of risk factors for COVID-19 mortality - preliminary results
[ http://doi.org/10.1101/2020.02.24.20027268 ]
    "10.2736; 15.8643], along with chronic respiratory disease (OR=7.7925 CI95%[5.5446; 10.4319]). Males are more likely to die from COVID-19 (OR=1.8518 (CI95%[1.5996; 2.1270]). Some limitations such as the lack of information about the correct prevalence of gender per age or about comorbidities per age "
---
3 of 21
Comorbidity and its impact o

## Asthma

In [75]:
# Only really one term for asthma
df, asthma_counts = count_and_tag(df, ['asthma'], 'risk_asthma')
asthma_counts

asthma    511
dtype: int64

In [76]:
n = (df.tag_disease_covid19 & df.tag_risk_asthma).sum()
print(f'There are {n} papers on Covid-19 and asthma')

There are 5 papers on Covid-19 and asthma


In [77]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_asthma],
                  ['asthma'])

1 of 5
Exploring diseases/traits and blood proteins causally related to expression of ACE2, the putative receptor of 2019-nCov: A Mendelian Randomization analysis
[ http://doi.org/10.1101/2020.03.04.20031237 ]
    "ER+) breast and lung cancers, asthma, smoking and elevated ALT, among others. We also uncovered a number of plasma/serum proteins potentially linked to altered ACE2 expression, and the top enriched pathways included cytokine-cytokine-receptor interaction, VEGF signaling, JAK-STAT signaling etc. We a"
---
2 of 5
Potential Factors for Prediction of Disease Severity of COVID-19 Patients
[ http://doi.org/10.1101/2020.03.20.20039818 ]
    "oms of anhelation (78.6%) and asthma (71.4%). For laboratory examination, 57.1% severe cases showed significant reduction in lymphocyte count. The levels of Interluekin-6 (IL6), IL10, erythrocyte sedimentation rate (ESR) and D-Dimer (D-D) were significantly higher in severe patients than mild patien"
---
3 of 5
Clinical characteristics of 140 p

# Immunity

Looking for terms which indicate factors relating to vaccination and immunity.

## Generic immunity / vaccination

Papers which mention generic themes relating to immunity / vaccination. (As the research develops, we may extend this section to include specific lines of research relating to immunity / vaccination.

In [78]:
immunity_synonyms = [
    'immunity',
    r'\bvaccin',
    'innoculat'
]
df, immunity_counts = count_and_tag(df, immunity_synonyms, 'immunity_generic')
immunity_counts

immunity     2120
\bvaccin     5336
innoculat       0
dtype: int64

In [79]:
n = (df.tag_disease_covid19 & df.tag_immunity_generic).sum()
print(f'There are {n} papers on Covid-19 and immunity / vaccines')

There are 156 papers on Covid-19 and immunity / vaccines


In [80]:
print('Intersection of tag_disease_covid19, tag_risk_generic & tag_immunity_generic')
print('=' * 76)
print_key_phrases(df[df.tag_disease_covid19 &
                     df.tag_risk_generic &
                     df.tag_immunity_generic],
                  risk_factor_synonyms + immunity_synonyms)

Intersection of tag_disease_covid19, tag_risk_generic & tag_immunity_generic
1 of 8
COVID-2019: the role of the nsp2 and nsp3 in its pathogenesis
[ http://doi.org/10.1002/jmv.25719 ]
    "essure could account for some clinical features of this virus compared to SARS and Bat SARS-like CoV. The stabilizing mutation falling in the endosome-associated-protein-like domain of the nsp2 protein could account for COVID-2019 high ability of contagious, while the destabilizing mutation in nsp3 "
---
2 of 8
An outbreak of <scp>COVID</scp> ‐19 caused by a new coronavirus: what we know so far
[ http://doi.org/10.5694/mja2.50530 ]
    "g diagnostic, therapeutic and vaccine development efforts. Information on the new virus and its impact is being updated constantly. We know that SARS‐CoV‐2 can cause severe disease, although active surveillance of contacts is required to define the milder end of the disease spectrum and to estimate "
---
3 of 8
Persistence and clearance of viral RNA in 2019 novel coron

In [81]:
tag_columns = df.columns[df.columns.str.startswith('tag_')].tolist()

# Geographies

**IMPORTANT: This section is still under development, as have been focusing on Risk Factors and Research Design**

- Continents (inc. continental regions)
- Countries
- Key regions of countries
- Key cities

## Continents

These search strings include continents and subregions of continents, with particular focus on countries where initial outbreaks have been studied (e.g. China, Korea, Japan, Iran, Italy).

In [82]:
# Note that this section needs more work - have been focusing on later sections
continental_regions = {
    'asia': 'asia|china|korea|japan|hubei|wuhan|malaysia|singapore|hong kong',
    'east_asia': 'east asia|china|korea|japan|hubei|wuhan|hong kong',
    'south_asia': 'south asia|india|pakistan|bangladesh|sri lanka',
    'se_asia': r'south east asia|\bse asia|malaysia|thailand|indonesia|vietnam|cambodia|viet nam',
    'europe': 'europe|italy|france|spain|germany|austria|switzerland|united kingdom|ireland',
    'africa': 'africa|kenya',
    'middle_east': 'middle east|gulf states|saudi arabia|\buae\b|iran|persian',
    'south_america': 'south america|latin america|brazil|argentina',
    'north_america': 'north america|usa|united states|canada|caribbean',
    'australasia': 'australia|new zealand|oceania|australasia|south pacific'
}

counts = {}
for cr, s in continental_regions.items():
    con_filter = abstract_title_filter(s)
    counts[cr] = sum(con_filter)
    df.loc[con_filter, f'tag_continent_{cr}'] = True
    df[f'tag_continent_{cr}'].fillna(False, inplace=True)
counts = pd.Series(counts)
dotplot(counts, 'Continent counts in title / abstract metadata')

## Countries

We will just use countries that appear >50 times in the dateset. Can be adjusted to get more detail.

_**IMPORTANT**: This takes several minutes to run. Skip if not important. PyCountry uses official names which are different from commonly used names - need to fix this._

_TO DO: Add in country subregions (e.g. Hubei -> China, Lombardy -> Italy)_

In [83]:
### THIS SECTION TAKES A LONG TIME TO RUN SO COMMENTED OUT WHILE DEVELOPING
# MIN_PAPERS_PER_COUNTRY = 50
# counts = {}

# for i, country in enumerate(pycountry.countries):
#     if i % 20 == 0:
#         print(f'Checking country {i} ({country.name})')
#     country_filter = abstract_title_filter(r'\b' + re.escape(country.name.lower()) + r'\b')
#     n = sum(country_filter)
#     if n >= MIN_PAPERS_PER_COUNTRY:
#         counts[country.name] = n
#         df.loc[country_filter, f'tag_country_{country.alpha_3.lower()}'] = True
#         df[f'tag_country_{country.alpha_3.lower()}'].fillna(False, inplace=True)
# counts = pd.Series(counts)
# plt.figure(figsize=(5,7))
# dotplot(counts, 'Country counts in title / abstract metadata')
# df.groupby('tag_disease_covid19').tag_country_chn.value_counts()

# Climate

Climate has been hypothesised as a factor in the spread of Covid-19

In [84]:
climate_synonyms = [
    'climate',
    'weather',
    'humid',
    'sunlight',
    'air temperature',
    'meteorolog', # picks up meteorology, meteorological, meteorologist
    'climatolog', # as above
    'dry environment',
    'damp environment',
    'moist environment',
    'wet environment',
    'hot environment',
    'cold environment',
    'cool environment'
]
df, climate_counts = count_and_tag(df, climate_synonyms, 'climate_generic')
dotplot(climate_counts, 'Climate synonyms by title / abstract metadata')

In [85]:
climate_counts.sort_values(ascending=False)

climate              257
humid                167
weather               73
meteorolog            64
air temperature       18
climatolog             6
sunlight               6
hot environment        1
wet environment        1
moist environment      1
dry environment        1
cool environment       0
cold environment       0
damp environment       0
dtype: int64

In [86]:
n = (df.tag_disease_covid19 & df.tag_climate_generic).sum()
print(f'There are {n} papers on Covid-19 and climate:\n')
print_key_phrases(df[df.tag_disease_covid19 & df.tag_climate_generic],
                  climate_synonyms, n=n)

There are 16 papers on Covid-19 and climate:

1 of 16
The role of absolute humidity on transmission rates of the COVID-19 outbreak
[ http://doi.org/10.1101/2020.02.12.20022467 ]
    "at cold and dry (low absolute humidity) environments facilitate the survival and spread of droplet-mediated viral diseases, and warm and humid (high absolute humidity) environments see attenuated viral transmission (i.e., influenza). However, the role of absolute humidity in transmission of COVID-19"
---
2 of 16
Analysis of meteorological conditions and prediction of epidemic trend of 2019-nCoV infection in 2020
[ http://doi.org/10.1101/2020.02.13.20022715 ]
    "Objective: To investigate the meteorological condition for incidence and spread of 2019-nCoV infection, to predict the epidemiology of the infectious disease, and to provide a scientific basis for prevention and control measures against the new disease. Methods: The meteorological factors during the"
---
3 of 16
The Effects of "Fangcang, Huoshensh

# Transmission

## Transmission / incubation generic

In [87]:
transmission_synonyms = [
    'transmiss', # Picks up 'transmission' and 'transmissibility'
    'transmitted',
    'incubation',
    'environmental stability',
    'airborne',
    'via contact',
    'human to human',
    'through droplets',
    'through secretions',
    r'\broute',
    'exportation'
]
df, transmission_counts = count_and_tag(df, transmission_synonyms, 'transmission_generic')
dotplot(transmission_counts, 'Transmission / incubation synonyms by title / abstract metadata')

In [88]:
transmission_counts.sort_values(ascending=False)

transmiss                  4181
\broute                     752
transmitted                 663
incubation                  470
airborne                    316
human to human              214
exportation                  22
via contact                  14
through droplets              5
environmental stability       5
through secretions            0
dtype: int64

In [89]:
n = (df.tag_disease_covid19 & df.tag_transmission_generic).sum()
print(f'There are {n} papers on Covid-19 and transmission / incubation / environmental stability')
print('\nThis entire dataset is exported to thematic_tagging_output_transmission.csv')

There are 591 papers on Covid-19 and transmission / incubation / environmental stability

This entire dataset is exported to thematic_tagging_output_transmission.csv


## Reproduction rates ($R$ / $R_0$)

- Basic reproduction rate ($R_0$)
- Effective reproduction rate ($R$)

In [90]:
repr_synonyms = [
    r'reproduction \(r\)',
    'reproduction rate',
    'reproductive rate',
    '{r}_0',
    r'\br0\b',
    r'\br_0',
    '{r_0}',
    r'\b{r}',
    r'\br naught',
    r'\br zero'
]
df, repr_counts = count_and_tag(df,repr_synonyms, 'transmission_repr')
dotplot(repr_counts, 'R<sub>0</sub> synonyms by title / abstract metadata')

In [91]:
repr_counts.sort_values(ascending=False)

\br0\b                71
\br_0                  7
reproduction rate      7
reproductive rate      4
{r}_0                  1
\br zero               0
\br naught             0
\b{r}                  0
{r_0}                  0
reproduction \(r\)     0
dtype: int64

In [92]:
n = (df.tag_disease_covid19 & df.tag_transmission_repr).sum()
print(f'There are {n} papers on Covid-19 and R or R_0')
print('=')
print_key_phrases(df[df.tag_disease_covid19 & df.tag_transmission_repr], 
                  repr_synonyms, n=52, chars=500)

There are 67 papers on Covid-19 and R or R_0
=
1 of 67
A mathematical model for simulating the transmission of Wuhan novel Coronavirus
[ http://doi.org/10.1101/2020.01.19.911669 ]
    "he basic reproduction number (R0) was calculated from the RP model to assess the transmissibility of the 2019-nCoV."
---
2 of 67
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak
[ http://doi.org/10.1101/2020.01.23.916395 ]
    "he basic reproduction number, R0, of 2019-nCoV in the early phase of the outbreak.MethodsAccounting for the impact of the variations in disease reporting rate, we modelled the epidemic curve of 2019-nCoV cases time series, in mainland China from January 10 to January 24, 2020, through the exponential growth. With the estimated intrinsic growth rate (γ), we estimated R0 by using the serial intervals (SI) of two other well-known coronavirus diseases, MERS an

In [93]:
# DATA_FOLDER = '../input/CORD-19-research-challenge'

# import json
# import os

# json_list = []

# for row in df[df.tag_disease_covid19 &
#               df.tag_transmission_repr & 
#               df.has_full_text].itertuples():
#     filename = f'{row.sha}.json'
#     sources = ['biorxiv_medrxiv', 'comm_use_subset',
#                'custom_license', 'noncomm_use_subset']
#     for source in sources:
#         if filename in os.listdir(os.path.join(DATA_FOLDER, source, source)):
#             with open(os.path.join(DATA_FOLDER, source, source, filename), 'rb') as f:
#                 json_list.append(json.load(f))

In [94]:
# candidate_sections = [
#     'results',
#     'conclusion',
#     'conclusions',
#     'reproduction',
#     'r_0',
#     'r0',
#     'reproductive'
# ]

In [95]:
# for i, item in enumerate(json_list):
#     print(i)
#     body_text = item['body_text']
#     for sub_item in body_text:
#         found = False
#         for cs in candidate_sections:
#             if cs in sub_item['section'].lower():
#                 found = True
#         if found:
#             print(sub_item['section'])
#             print(sub_item['text'])
#             print()
#     print()

In [96]:
# for i, item in enumerate(json_list):
#     print(i)
#     body_text = item['body_text']
#     for sub_item in body_text:
#         if sub_item['section'] in ['Methods and Results', 'Results', 'Conclusions']:
#             print(sub_item['text'])
#     print()

# Output

## Covid-19 papers only

In [97]:
filename = 'thematic_tagging_output_covid19_only.csv'
print(f'Outputting {df.tag_disease_covid19.sum()} records to {filename}')
df[df.tag_disease_covid19].to_csv(filename, index=False)

Outputting 2473 records to thematic_tagging_output_covid19_only.csv


## Covid-19 papers x questions

### Risk factors

In [98]:
file_filter = df.tag_disease_covid19 & df.tag_risk_generic
filename = 'thematic_tagging_output_risk_factors.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 355 records to thematic_tagging_output_risk_factors.csv


### Diabetes

In [99]:
file_filter = df.tag_disease_covid19 & df.tag_risk_diabetes
filename = 'thematic_tagging_output_risk_diabetes.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 57 records to thematic_tagging_output_risk_diabetes.csv


### Smoking

In [100]:
file_filter = df.tag_disease_covid19 & df.tag_risk_smoking
filename = 'thematic_tagging_output_risk_smoking.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 15 records to thematic_tagging_output_risk_smoking.csv


### Climate

In [101]:
file_filter = df.tag_disease_covid19 & df.tag_climate_generic
filename = 'thematic_tagging_output_climate.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 16 records to thematic_tagging_output_climate.csv


## Transmission / incubation

In [102]:
file_filter = df.tag_disease_covid19 & df.tag_transmission_generic
filename = 'thematic_tagging_output_transmission.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 591 records to thematic_tagging_output_transmission.csv


## $R$ / $R_0$

In [103]:
file_filter = df.tag_disease_covid19 & df.tag_transmission_repr
filename = 'thematic_tagging_output_repr.csv'
print(f'Outputting {file_filter.sum()} records to {filename}')
df[file_filter].to_csv(filename, index=False)

Outputting 67 records to thematic_tagging_output_repr.csv


## Full dataset

In [104]:
filename = 'thematic_tagging_output_full.csv'
print(f'Outputting {len(df)} records to {filename}')
df.to_csv(filename, index=False)

Outputting 45774 records to thematic_tagging_output_full.csv


# Filtering tool

TO DO: Add tool for filtering on tags.