# Visualised: How SARS-CoV-2 compares to other infectious diseases
## LUH ZQS Data Literacy Seminar

In this exercise we use a few data sources to compare SARS-CoV-2 to other infectious diseases using basic data processing and visualization in Python. The exercise is inspired by a 2014 [The Guardian visualization](https://www.theguardian.com/news/datablog/ng-interactive/2014/oct/15/visualised-how-ebola-compares-to-other-infectious-diseases) that compared Ebola with other infectious diseases.

The data sources include:
* The Microbe Scope data published by The Guardian available at [bit.ly/KIB_Microbescope](https://bit.ly/KIB_Microbescope). A copy of this dataset is included (see `data.csv`).
* [ORKG](https://orkg.org) Comparisons for SARS-CoV-2 [basic reproductive rate](https://www.orkg.org/orkg/comparison/R44930) and [case fatality rate](https://www.orkg.org/orkg/comparison/R41466)

To make this exercise interactive and fun, you'll need to fill the occasional `[BLANKS]` in the code before you can execute it.

Let's get started ...

First, we import the libraries we will need in our code. They must already be installed in your environment.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from orkg import ORKG
from scipy.spatial.distance import squareform, pdist

Now let's read the Microbe Scope `data.csv` and get its shape, i.e. the number of rows and columns.

In [None]:
df_microbe_scope = pd.read_csv("[FILENAME]", skiprows=[1]) # Replace [FILENAME] accordingly
df_microbe_scope.shape

... and take a look at the data.

In [None]:
df_microbe_scope.head(3)

Of particular interest are the first three columns: the disease name, the case fatality rate and the average basic reproductive rate.

Let's select and rename them. 

Take a look at the shape and explain what happened: why is the second number different from before?

In [None]:
df_microbe_scope = df_microbe_scope.iloc[:, 0:[TO_COLUMN]] # Replace [TO_COLUMN]; Hint: index starts with zero and the ending index is excluded
df_microbe_scope.columns = ['disease', 'case_fatality_rate', 'basic_reproductive_rate']
df_microbe_scope.shape

In [None]:
df_microbe_scope.head(3)
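
As a minimal sketch of how positional slicing with `iloc` behaves, the following uses a small made-up frame (the column names `a` to `e` and all values are invented for illustration and are not taken from `data.csv`). The end index of a slice is excluded, so `0:2` keeps the columns at positions 0 and 1.

```python
import pandas as pd

# Toy frame with five columns; names and values are invented for illustration
df_toy = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=list("abcde"))

# Keep all rows (:) and the columns at positions 0 and 1 (end index 2 is excluded)
df_first_two = df_toy.iloc[:, 0:2]

print(df_toy.shape)                # (rows, columns) before slicing
print(df_first_two.shape)          # same number of rows, fewer columns
print(list(df_first_two.columns))  # names of the columns that were kept
```

The first number of `shape` counts rows and the second counts columns, so only the second number changes when columns are dropped.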

Next, we compute some descriptive statistics for case fatality and basic reproductive rates.

Let's do mean and max case fatality rate first.

In [None]:
df_microbe_scope['case_fatality_rate'].mean()

In [None]:
df_microbe_scope['case_fatality_rate'].max()

Now try the same for basic reproductive rate.

The following is a shorthand for descriptive statistics.

Try it for both basic reproductive and case fatality rates and compare the values with the mean and max computed above. They should be the same.

In [None]:
df_microbe_scope['case_fatality_rate'].describe()

Next, let's plot the data and use visualizations to understand it better.

Try this for both basic reproductive and case fatality rates.

In [None]:
df_microbe_scope.plot(kind='barh', x='disease', y='basic_reproductive_rate', figsize=(10,10))

That's great, but sorting the bars makes it easier to see the top three and how they compare to the bottom three.

In [None]:
df_microbe_scope.sort_values(by='basic_reproductive_rate').plot(kind='barh', x='disease', y='basic_reproductive_rate', figsize=(10,10))
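
If you prefer numbers over bars, the top and bottom three can also be read off directly. This is a small sketch on invented data (the disease names `A` to `E` and their rates are placeholders, not values from the Microbe Scope dataset), using the pandas methods `nlargest` and `nsmallest`.

```python
import pandas as pd

# Toy data; disease names and rates are invented for illustration
df_toy = pd.DataFrame({
    "disease": ["A", "B", "C", "D", "E"],
    "basic_reproductive_rate": [2.0, 12.0, 1.5, 5.0, 3.0],
})

# Rows with the three largest and the three smallest values of the chosen column
top_three = df_toy.nlargest(3, "basic_reproductive_rate")
bottom_three = df_toy.nsmallest(3, "basic_reproductive_rate")

print(top_three["disease"].tolist())
print(bottom_three["disease"].tolist())
```

The same two calls should work on `df_microbe_scope` once its columns have been renamed as above.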

The disease data is not univariate but multivariate: for each disease we have two variables (the two rates).

It thus makes sense to visualize the diseases along both dimensions, using a scatter plot.

With this we can easily identify a few characteristics:
* Most diseases cluster in the lower left corner, meaning that most diseases tend to be on the lower ends of case fatality and basic reproductive rates
* A few diseases are exceptional with respect to one dimension, either very high case fatality rate or very high basic reproductive rate
* No disease has both a high case fatality rate and a high basic reproductive rate

Hint: With the additional parameters `logy=True` and `loglog=True` you can modify the axes to log scaling.

In [None]:
df_microbe_scope.plot.scatter(x='basic_reproductive_rate', y='case_fatality_rate', figsize=(10,10))
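
To illustrate the hint about log scaling, here is a minimal sketch on invented values (the numbers are placeholders, not taken from the Microbe Scope dataset). It passes `loglog=True` to switch both axes to a logarithmic scale and then checks the scale on the returned matplotlib axes.

```python
import pandas as pd

# Toy data; all values are invented for illustration
df_toy = pd.DataFrame({
    "basic_reproductive_rate": [1.5, 2.0, 5.0, 12.0],
    "case_fatality_rate": [0.1, 50.0, 1.0, 0.2],
})

# loglog=True sets both axes to log scale; logy=True would change only the y axis
ax = df_toy.plot.scatter(
    x="basic_reproductive_rate",
    y="case_fatality_rate",
    loglog=True,
    figsize=(6, 6),
)

print(ax.get_xscale(), ax.get_yscale())
```

Keep in mind that a log axis cannot display zero or negative values, so such points would not appear in the plot.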

So far we have only inspected the Microbe Scope dataset, which does not include SARS-CoV-2.

The ORKG provides data extracted from the scholarly literature for SARS-CoV-2 basic reproductive rate and case fatality rate.

Take a look at the two ORKG comparisons:
* SARS-CoV-2 [basic reproductive rate](https://www.orkg.org/orkg/comparison/R44930)
* SARS-CoV-2 [case fatality rate](https://www.orkg.org/orkg/comparison/R41466)

Both comparisons include numerous published articles, each with information about these rates (see the `has value` property in the comparisons).

Let's load the two comparisons into data frames from the respective CSV files, which we already downloaded from the ORKG and made available here.

The data for the basic reproductive rate comparison first.

In [None]:
df_orkg_basic_reproductive_rate = pd.read_csv("comparison-R44930.csv")

Now we do the same for the case fatality rate comparison.

In [None]:
df_orkg_case_fatality_rate = pd.read_csv("[FILENAME]") # Replace [FILENAME] accordingly

Use the following code cell to take a look at this data. Where are the titles of the compared articles and can you find the `Has value` property?

In [None]:
df_orkg_basic_reproductive_rate

Now, what we actually want is the mean estimates for SARS-CoV-2 basic reproductive rate and case fatality rate as published by the numerous articles. That's easy to compute, right? 

Try to understand what is going on in the following line of code.

In [None]:
orkg_basic_reproductive_rate = np.mean(df_orkg_basic_reproductive_rate.loc[:, 'Has value'].to_numpy(dtype=np.float32))

orkg_basic_reproductive_rate
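
One way to read the line above is as three separate steps. The sketch below spells them out on a made-up frame (the `Title` column and all values are invented for illustration; only the column name `Has value` mirrors the ORKG data).

```python
import numpy as np
import pandas as pd

# Toy frame; titles and values are invented for illustration
df_toy = pd.DataFrame({
    "Title": ["Paper 1", "Paper 2", "Paper 3"],
    "Has value": [2.0, 3.0, 4.0],
})

# Step 1: select all rows of the 'Has value' column (a pandas Series)
values = df_toy.loc[:, "Has value"]

# Step 2: convert the Series to a NumPy array of 32-bit floats
as_array = values.to_numpy(dtype=np.float32)

# Step 3: compute the arithmetic mean of the array
mean_value = np.mean(as_array)

print(mean_value)
```

On a purely numeric column, `df_toy["Has value"].mean()` would be a shorter pandas-only alternative.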

... and do the same for case fatality rate, which is a bit more involved because for some articles the property `Has value` has two values.

In [None]:
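def try_join(l):
    # A plain string is iterable too; return it unchanged so it is not split into characters
    if isinstance(l, str):
        return l
    try:
        # Lists of values become one comma-separated string
        return ','.join(map(str, l))
    except TypeError:
        # Single non-iterable values (e.g. floats) are converted directly
        return str(l)

df = df_orkg_case_fatality_rate
df['Has value'] = [try_join(l) for l in df['Has value']]
# Keep the first number found in each string
df['Has value'] = df['Has value'].str.extract(r'([0-9]*[.]?[0-9]+)', expand=False)
df['Has value'] = pd.to_numeric(df['Has value'], errors='coerce')
orkg_case_fatality_rate = df['Has value'].mean()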

orkg_case_fatality_rate
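
To see what the regular expression does, here is a small sketch on invented strings (the three values are placeholders, not entries from the ORKG comparison). `str.extract` returns only the first match per string, so when an entry holds two values, only the first one enters the mean.

```python
import pandas as pd

# Toy strings: one entry with two values, one with a single value, one without any number
raw = pd.Series(["0.5,2.3", "1.5", "n/a"])

# Extract the first integer or decimal number from each string (NaN if there is none)
first_number = raw.str.extract(r"([0-9]*[.]?[0-9]+)", expand=False)

# Convert the extracted text to floats; anything unparsable becomes NaN
numeric = pd.to_numeric(first_number, errors="coerce")

print(numeric.tolist())
print(numeric.mean())  # NaN entries are skipped by default
```

Whether taking only the first value is appropriate depends on what the two values in the comparison represent, so it is worth checking the affected articles.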

Now that we also have mean values for SARS-CoV-2 basic reproductive and case fatality rates, we can create a new dataset that combines the rates for all infectious diseases.

In [None]:
# The values must follow the column order: disease, case fatality rate, basic reproductive rate
df_sars_cov_2 = pd.DataFrame([['SARS-CoV-2', orkg_case_fatality_rate, orkg_basic_reproductive_rate]], columns=['disease', 'case_fatality_rate', 'basic_reproductive_rate'])
df_diseases = pd.concat([df_microbe_scope, df_sars_cov_2])

Finally, let's compute the top N diseases that are closest to SARS-CoV-2 in terms of basic reproductive and case fatality rates. For this, we first compute the Euclidean distance matrix and then extract the N diseases with the smallest distance to SARS-CoV-2. Note that SARS-CoV-2 itself appears in the result because its Euclidean distance to itself is zero.

In [None]:
df_distance_matrix = pd.DataFrame(squareform(pdist(df_diseases.iloc[:, 1:])), columns=df_diseases['disease'].unique(), index=df_diseases['disease'].unique())

In [None]:
print(df_distance_matrix.nsmallest([N], 'SARS-CoV-2').index.tolist()) # Replace [N] with a number for top-N
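
The distance computation can be followed step by step on a tiny example. The sketch below uses three invented diseases `A`, `B` and `C` with made-up rates, chosen so that the distances are easy to check by hand.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy data; disease names and rates are invented for illustration
df_toy = pd.DataFrame({
    "disease": ["A", "B", "C"],
    "case_fatality_rate": [0.0, 3.0, 6.0],
    "basic_reproductive_rate": [0.0, 4.0, 8.0],
})

# pdist returns the pairwise Euclidean distances in condensed (flat) form
condensed = pdist(df_toy.iloc[:, 1:].to_numpy())

# squareform expands them into a symmetric matrix with zeros on the diagonal
matrix = squareform(condensed)

names = df_toy["disease"].tolist()
df_dist = pd.DataFrame(matrix, columns=names, index=names)

# The two rows with the smallest distance to 'A' (which includes 'A' itself)
closest = df_dist.nsmallest(2, "A").index.tolist()
print(closest)
```

Because the Euclidean distance adds up squared differences across both columns, a column with a larger numeric range contributes more to the result. Rescaling both rates before computing distances is one possible refinement, which is not part of the original exercise.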