In [None]:
%matplotlib inline

In [None]:
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import symfit

# Analysis of COVID-19 time series data

**PHYS 395 project 1**
**Matt Wiens - #301294492**

## Notebook setup 

The first command here sets the default figure size to be a bit larger than normal. The second command sets it so all figure output areas are expanded by default.

In [None]:
# Set default plot size
plt.rcParams["figure.figsize"] = (12, 9)

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999

# Introduction

In this notebook we will be interested in how the basic reproduction number $R_0$ varies across countries grouped by several different metrics. The basic reproduction number is the average number of people that one person with COVID-19 goes on to infect in a fully susceptible population. The metrics we will consider in this notebook are (i) population density, (ii) GDP, and (iii) the EIU Democracy Index, which is a measure of how "democratic" countries are.

The point of this notebook is twofold: The first goal is to provide a convenient framework for comparing how effective countries are at combating the COVID-19 epidemic based on various metrics. The second goal is to specifically test the metrics listed above based on a representative sample of countries, to test whether these metrics are good indicators of how effective a country is at managing COVID-19.

# Methods

The first part of the analysis will provide an argument for us to *not* consider China's data, as it is highly suspect and I presume it to be unreliable.

The remainder of our analysis will be geared towards finding the basic reproduction numbers $R_0$ of different countries using the SIR model, and then using this $R_0$ value to compare countries based on the metrics listed in the introduction. The SIR model models the temporal behavior of an infectious outbreak through the equations

\begin{align}
     \frac{dS}{dt} &= - \frac{\beta I S}{N}, \\
     \frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I, \\
     \frac{dR}{dt} &= \gamma I,
\end{align}

where

+ $\beta$, $\gamma$ are constants
+ $S$ is the size of the susceptible population
+ $I$ is the size of the infected population
+ $R$ is the size of the recovered population
+ $N = S + I + R$ is the total size of the population being considered

The basic reproductive number $R_0$ is related to the constants $\beta$ and $\gamma$ through

\begin{equation}
    R_0 = \frac{\beta}{\gamma}
    .
\end{equation}
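
One way to see why $R_0$ is the quantity of interest, using only the equations above, is to factor $\gamma I$ out of the equation for $dI/dt$:

\begin{equation}
    \frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I = \gamma I \left( R_0 \frac{S}{N} - 1 \right)
    .
\end{equation}

The infected population therefore grows while $R_0 S / N > 1$ and shrinks otherwise. Early in an outbreak $S \approx N$, so the condition for growth reduces to $R_0 > 1$.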

Because we only have data for $I$ and $R$ (where in $R$ we will include recovered *and* deaths data), we can reformulate the above system of differential equations to be

\begin{align}
     \frac{dI}{dt} &= \frac{\beta I (N - I - R)}{N} - \gamma I, \\
     \frac{dR}{dt} &= \gamma I.
\end{align}

Having the data for $I$ and $R$ and the above differential equations, we can then estimate the parameters $\beta$, $\gamma$ and $N$ (and thus $R_0$) using [symfit](https://symfit.readthedocs.io/en/stable/index.html), which is a Python library which combines the power of [SciPy optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html) and [SymPy](https://www.sympy.org/en/index.html) to perform curve fitting with the power of symbolic math.
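
To get a feel for the curves this reduced system produces, here is a minimal sketch that integrates it with a hand-written fixed-step fourth-order Runge-Kutta scheme. This is not how symfit solves the system internally. The function names and the parameter values ($\beta = 0.4$, $\gamma = 0.1$, $N = 10^4$) are made up purely for illustration.

```python
def sir_reduced_rhs(i, r, beta, gamma, n):
    """Right-hand side of the reduced (I, R) system, with S = N - I - R."""
    di_dt = beta * i * (n - i - r) / n - gamma * i
    dr_dt = gamma * i
    return di_dt, dr_dt


def integrate_sir(i_init, r_init, beta, gamma, n, days, steps_per_day=10):
    """Integrate with fixed-step RK4 and return daily values of I and R."""
    h = 1.0 / steps_per_day
    i, r = float(i_init), float(r_init)
    i_out, r_out = [i], [r]
    for _ in range(days):
        for _ in range(steps_per_day):
            k1i, k1r = sir_reduced_rhs(i, r, beta, gamma, n)
            k2i, k2r = sir_reduced_rhs(
                i + 0.5 * h * k1i, r + 0.5 * h * k1r, beta, gamma, n
            )
            k3i, k3r = sir_reduced_rhs(
                i + 0.5 * h * k2i, r + 0.5 * h * k2r, beta, gamma, n
            )
            k4i, k4r = sir_reduced_rhs(i + h * k3i, r + h * k3r, beta, gamma, n)
            i += h * (k1i + 2 * k2i + 2 * k3i + k4i) / 6
            r += h * (k1r + 2 * k2r + 2 * k3r + k4r) / 6
        i_out.append(i)
        r_out.append(r)
    return i_out, r_out


# Made-up parameters, chosen only for illustration
beta, gamma, n = 0.4, 0.1, 10_000.0
i_curve, r_curve = integrate_sir(
    i_init=10, r_init=0, beta=beta, gamma=gamma, n=n, days=120
)

# Day on which the simulated infected population is largest
peak_day = max(range(len(i_curve)), key=lambda d: i_curve[d])
print(f"R0 = {beta / gamma:.1f}, infections peak on day {peak_day}")
```

The infected curve rises, peaks, and then decays, while the recovered curve only ever increases. This is the qualitative shape the fit tries to match against the reported data.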

The data for population density and GDP will be obtained from [World Bank Open Data](https://data.worldbank.org/), where the latest data point for each country (and each metric) will be considered. EIU Democracy Index scores are obtained from [The Economist Intelligence Unit](https://www.eiu.com/n/). COVID-19 data will be obtained from [Johns Hopkins Whiting School of Engineering](https://systems.jhu.edu/).

# Analysis

## Fetching data

Here, we'll fetch the latest COVID-19 data from Johns Hopkins CSSE.

In [None]:
# URLs to fetch
url_confirmed_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url_deaths_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
url_recovered_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

In [None]:
# Get data
df_confirmed = pd.read_csv(url_confirmed_csv, sep=",")
df_deaths = pd.read_csv(url_deaths_csv, sep=",")
df_recovered = pd.read_csv(url_recovered_csv, sep=",")

## Defining a function to "massage" the data

Here we'll define a function to "massage" the data into a useful form for our analysis. It takes one of the above dataframes (possibly filtered down by country) and transforms it into a dataframe with two columns: the dates (parsed to datetimes), with column label `date`, and the total cases for each date, with column label `cases`.

In [None]:
def massage_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns we don't need
    df = df.drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1)

    # Collapse all provinces/regions into single row
    df = df.sum(axis=0).to_frame("cases")

    # Move dates into a column
    df.index.name = "date"
    df.reset_index(inplace=True)

    # Parse dates to datetime
    df.date = pd.to_datetime(df.date)

    return df
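
As a quick check of the output shape, the sketch below runs the same steps on a tiny made-up frame laid out like the JHU files. The country name and counts are invented. The explicit `format` argument is an addition here and assumes the month/day/two-digit-year column labels used in those files.

```python
import pandas as pd

# Made-up rows in the same layout as the JHU CSSE time series files
toy = pd.DataFrame(
    {
        "Province/State": ["A", "B"],
        "Country/Region": ["Examplia", "Examplia"],
        "Lat": [0.0, 1.0],
        "Long": [0.0, 1.0],
        "1/22/20": [1, 0],
        "1/23/20": [3, 2],
    }
)

# Same steps as massage_dataframe, written out inline
summed = toy.drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1).sum(
    axis=0
)
toy_massaged = summed.to_frame("cases")
toy_massaged.index.name = "date"
toy_massaged = toy_massaged.reset_index()
toy_massaged["date"] = pd.to_datetime(toy_massaged["date"], format="%m/%d/%y")

print(toy_massaged)
```

The two province rows collapse into one total per date, so the `cases` column here holds 1 and 5.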

## Plotting confirmed cases, deaths, and recovered

Now we'll plot the number of confirmed cases, deaths, and recovered cases for both China and the rest of the world (excluding China).

### Confirmed cases

In [None]:
# Filter by China/rest of world
df_confirmed_total_china = massage_dataframe(
    df_confirmed[df_confirmed["Country/Region"].eq("China")]
)
df_confirmed_total_remaining = massage_dataframe(
    df_confirmed[~df_confirmed["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_confirmed_total_china.date.values,
    y=df_confirmed_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_confirmed_total_remaining.date.values,
    y=df_confirmed_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Deaths

In [None]:
# Filter by China/rest of world
df_deaths_total_china = massage_dataframe(
    df_deaths[df_deaths["Country/Region"].eq("China")]
)
df_deaths_total_remaining = massage_dataframe(
    df_deaths[~df_deaths["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_deaths_total_china.date.values,
    y=df_deaths_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_deaths_total_remaining.date.values,
    y=df_deaths_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("COVID-19 Deaths")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Recovered cases

In [None]:
# Filter by China/rest of world
df_recovered_total_china = massage_dataframe(
    df_recovered[df_recovered["Country/Region"].eq("China")]
)
df_recovered_total_remaining = massage_dataframe(
    df_recovered[~df_recovered["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_recovered_total_china.date.values,
    y=df_recovered_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_recovered_total_remaining.date.values,
    y=df_recovered_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("Recovered COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Brief discussion of plots (on why China's data makes no sense)

The above plots are pretty good justification to not even take China's data into consideration. The data shows that there was no exponential growth phase (including in Hubei province which contains Wuhan), which makes no sense given that they didn't immediately start quarantining.

## Fitting the world's data to an SIR model

Now let's try to fit the world's data (excluding China) to the "modified" SIR model given in the Methods section. For the purposes of the SIR model we will take

+ infected = confirmed data - recovered data - deaths data
+ recovered = recovered data + deaths data

Our goal is to determine $\beta$, $\gamma$, and $N$ for the data; using this, we can determine the $S$ data and the basic reproduction number $R_0$.
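
The two definitions above amount to simple array arithmetic on the cumulative series. Here is a small sketch with made-up counts; the variable names are only for illustration.

```python
import numpy as np

# Made-up cumulative counts for five consecutive days
confirmed = np.array([100, 150, 230, 340, 500])
recovered = np.array([0, 5, 12, 30, 61])
deaths = np.array([1, 2, 4, 7, 11])

# "Recovered" compartment of the model: recoveries plus deaths
r_toy = recovered + deaths

# "Infected" compartment: confirmed cases that are in neither group yet
i_toy = confirmed - recovered - deaths

print(i_toy)  # [ 99 143 214 303 428]
print(r_toy)  # [ 1  7 16 37 72]
```

Note that by construction $I + R$ equals the cumulative confirmed count on every day.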

### Defining a fitting function

The below function takes in data for $I$, $R$ and returns a tuple which contains

+ The symfit fit results (an instance of `symfit.core.fit_results.FitResults`). This, importantly, has a `params` attribute which is an ordered dictionary containing the estimated parameters $\beta$, $\gamma$, and $N$;

+ A function which evaluates the SIR system of differential equations given time data and the parameter estimates (an instance of `symfit.core.models.ODEModel`);

+ An array containing the time data used when fitting. The function makes the simplifying assumption that one day is equal to one unit of time.

The function also (optionally) takes in initial estimates for $\beta$, $\gamma$, and $N$. Note that having *reasonable* estimates is extremely important to having the fit produce good parameter estimates.

One other important point is that the data supplied to the fit should start when the infection data becomes non-negligible. Supplying data from before any infections occurred will result in a worse fit.

In [None]:
def estimate_SIR_constants(
    i_data: np.ndarray,
    r_data: np.ndarray,
    beta_est: float = 0.01,
    gamma_est: float = 0.01,
    N_est: float = 15000,
) -> tuple:
    # Set up data for time
    t_data = np.arange(i_data.shape[0], dtype=float)

    # Set up variables and parameters
    I, R, t = symfit.variables("I, R, t")
    beta = symfit.Parameter("beta", beta_est)
    gamma = symfit.Parameter("gamma", gamma_est)
    N = symfit.Parameter("N", N_est)

    # Set up and run the model
    model_dict = {
        symfit.D(I, t): beta * I * (N - I - R) / N - gamma * I,
        symfit.D(R, t): gamma * I,
    }
    ode_model = symfit.ODEModel(
        model_dict, initial={t: 0.0, I: i_data[0], R: r_data[0]}
    )

    fit = symfit.Fit(ode_model, t=t_data, I=i_data, R=r_data)

    # Run
    res = fit.execute()

    return (res, ode_model, t_data)

### Choosing the data to fit

Now we need to determine which data we want to use for the fit. By inspecting the plots in the "Plotting confirmed cases, deaths, and recovered" section above, we can decide which data point to start at. Looking at the points, starting in late February seems reasonable.

In [None]:
# Choose where to start the data
remaining_start_idx = 35
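
The start index above was chosen by eye. A threshold rule is one possible way to make that choice reproducible. The sketch below is not what this notebook does: the helper name `first_index_at_or_above`, the threshold of 100, and the counts are all hypothetical.

```python
import numpy as np


def first_index_at_or_above(cases, threshold):
    """Return the first index where cases >= threshold, or None if never reached."""
    hits = np.flatnonzero(np.asarray(cases) >= threshold)
    return int(hits[0]) if hits.size else None


# Made-up cumulative counts
toy_cases = np.array([0, 0, 2, 9, 40, 120, 600, 2500])

print(first_index_at_or_above(toy_cases, threshold=100))  # 5
```

Any such threshold is itself a judgment call, so it is still worth checking the result against the plots.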

In [None]:
# Collect infected and recovered data using our above definitions
i_data = (
    df_confirmed_total_remaining.cases.values
    - df_recovered_total_remaining.cases.values
    - df_deaths_total_remaining.cases.values
)[remaining_start_idx:]
r_data = (
    df_recovered_total_remaining.cases.values + df_deaths_total_remaining.cases.values
)[remaining_start_idx:]

### Estimating parameters using our fitting function

Now we'll use the function we defined to predict the SIR constants.

In [None]:
res, ode_model, t_data = estimate_SIR_constants(
    i_data=i_data, r_data=r_data, beta_est=0.01, gamma_est=0.01, N_est=5e5
)

Let's look at what the estimated parameters are.

In [None]:
pprint(dict(res.params.items()))

And let's look at how well this agrees with the data we supplied to the fitting function.

In [None]:
# Get I and R from our ODE model and determine S
I, R = ode_model(t=t_data, **res.params)
S = res.params["N"] - I - R

# Get the dates corresponding to our time data
t_data_dates = df_confirmed_total_remaining.date.values[remaining_start_idx:]

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=t_data_dates, y=S, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=I, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=R, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=i_data, fmt="o", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=r_data, fmt="o", xdate=True,
)

# Add cosmetics
ax.set_title("World (excluding China) SIR curves")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["S (estimated)", "I (estimated)", "R (estimated)", "I (data)", "R (data)"])

ax.grid(alpha=0.3)

We can see here that the curves produced by our estimated parameters agree closely with the data we supplied to the fit.
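
Visual agreement can be backed up with a number. One common choice is the coefficient of determination $R^2$ (not to be confused with the recovered population $R$). The helper `r_squared` below is a hypothetical addition, shown here on made-up arrays.

```python
import numpy as np


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up observed values and model predictions
toy_observed = np.array([10.0, 20.0, 40.0, 80.0])
toy_predicted = np.array([12.0, 19.0, 41.0, 78.0])

print(f"R^2 = {r_squared(toy_observed, toy_predicted):.4f}")
```

With the names used in this notebook, the equivalent calls would be `r_squared(i_data, I)` and `r_squared(r_data, R)`. A value close to 1 means the model curve explains most of the variation in the data.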

### Determine the basic reproductive number 

Now that we've found that our parameter estimates are reasonable, we can determine the basic reproductive number for our data.

In [None]:
print("R0 = %.2f" % (res.params["beta"] / res.params["gamma"]))

For the data used when I ran this computation (on 2020-03-30), the basic reproductive number $R_0$ was approximately 8. This means, as a global average, each infected person will infect an additional 8 non-infected people, which is quite worrying!

## Determine the basic reproductive number for a sample of countries

Now we will determine the basic reproductive number $R_0$ for a number of different countries, so we can see how this number varies based on the metrics discussed in the introduction.

Because we need to provide good estimates for

+ what data to supply to the fitting function
+ the parameters $\beta$, $\gamma$, and $N$

we will need to plot the data for each country we consider prior to fitting.

### Choosing which countries to sample

Here we will choose for which countries we will determine the basic reproductive number $R_0$.

Below is a list of all countries we have data for.

In [None]:
pprint(list(np.unique(df_confirmed["Country/Region"])))

I'll choose 15 of these countries to include in our analysis.

In [None]:
included_countries = np.array(
    [
        "Argentina",
        "Brazil",
        "Canada",
        "Chile",
        "Egypt",
        "Germany",
        "Indonesia",
        "Iran",
        "Jordan",
        "Korea, South",
        "Mexico",
        "Spain",
        "South Africa",
        "Thailand",
        "Ukraine",
    ]
)

We'll also need each country's population density, GDP, and EIU Democracy Index scores.

In [None]:
# Population density is measured in people per square kilometer
# of land area (2018 data)
included_countries_pop_density = np.array(
    [
        16.2585100979651,  # Argentina
        25.0617162430876,  # Brazil
        4.07530821431988,  # Canada
        25.1894460666519,  # Chile
        98.8734692852479,  # Egypt
        237.3709697733,  # Germany
        147.75219008926,  # Indonesia
        50.222420123284,  # Iran
        112.14249831043,  # Jordan
        529.652103632681,  # Korea, South
        64.9146264050001,  # Mexico
        93.5290582615872,  # Spain
        47.6301197767684,  # South Africa
        135.897206835131,  # Thailand
        77.0296673514129,  # Ukraine
    ]
)

# GDP is measured in USD (2018 data, except Iran is 2017)
included_countries_gdp = np.array(
    [
        519871519807.795,  # Argentina
        1868626087908.48,  # Brazil
        1713341704877.01,  # Canada
        298231133532.749,  # Chile
        250894760351.232,  # Egypt
        3947620162502.96,  # Germany
        1042173300625.55,  # Indonesia
        454012768723.589,  # Iran
        42231295774.6479,  # Jordan
        1619423701169.63,  # Korea, South
        1220699479845.98,  # Mexico
        1419041949909.82,  # Spain
        368288939768.322,  # South Africa
        504992757704.997,  # Thailand
        130832374404.882,  # Ukraine
    ]
)

included_countries_democracy_score = np.array(
    [
        7.02,  # Argentina
        6.86,  # Brazil
        9.22,  # Canada
        8.08,  # Chile
        3.06,  # Egypt
        8.68,  # Germany
        6.48,  # Indonesia
        2.38,  # Iran
        3.93,  # Jordan
        8.00,  # Korea, South
        6.09,  # Mexico
        8.29,  # Spain
        7.24,  # South Africa
        6.32,  # Thailand
        5.90,  # Ukraine
    ]
)

Let's plot their confirmed cases so that we can come up with good parameter estimates for our fitting function.

In [None]:
fig, ax = plt.subplots()

# Make sure colours used are unique
colours = sns.color_palette("hls", len(included_countries))
ax.set_prop_cycle("color", colours)

# Plot data for each country
for country in included_countries:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries)

ax.grid(alpha=0.3)

It's pretty difficult to discern which colour corresponds to which country, so we'll split the above plot into three separate plots, grouped according to the severity of their confirmed cases.

In [None]:
included_countries_plot_1 = np.array(["Germany", "Iran", "Korea, South", "Spain",])

fig, ax = plt.subplots()

# Plot data for each country
for country in included_countries_plot_1:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases (group 1)")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries_plot_1)

ax.grid(alpha=0.3)

In [None]:
included_countries_plot_2 = np.array(["Brazil", "Canada", "Chile",])

fig, ax = plt.subplots()

# Plot data for each country
for country in included_countries_plot_2:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases (group 2)")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries_plot_2)

ax.grid(alpha=0.3)

In [None]:
included_countries_plot_3 = np.array(
    [
        "Argentina",
        "Egypt",
        "Indonesia",
        "Jordan",
        "Mexico",
        "South Africa",
        "Thailand",
        "Ukraine",
    ]
)

fig, ax = plt.subplots()

# Plot data for each country
for country in included_countries_plot_3:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases (group 3)")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries_plot_3)

ax.grid(alpha=0.3)

### Estimating initial parameter values and start indices

From these plots we can come up with initial $N$ estimates, as well as the starting indices where we will consider data for the fit. Further testing was done to make sure these $N$ estimates lead to good fits (testing code is shown in the next subsection below). Sometimes, with the method we are using, we need to give it an $N$ estimate much lower than the value it will actually predict, due to how the algorithm works (I'm not sure what the details are here).

In [None]:
included_countries_n_est = np.array(
    [
        1e3,  # Argentina
        1e4,  # Brazil
        1e4,  # Canada
        1e4,  # Chile
        1e3,  # Egypt
        3e3,  # Germany
        2e3,  # Indonesia
        5e4,  # Iran
        7e2,  # Jordan
        2e4,  # Korea, South
        1e2,  # Mexico
        2e4,  # Spain
        1e3,  # South Africa
        3e3,  # Thailand
        1.5e3,  # Ukraine
    ]
)

included_countries_start_idx = np.array(
    [
        45,  # Argentina
        45,  # Brazil
        40,  # Canada
        45,  # Chile
        45,  # Egypt
        40,  # Germany
        45,  # Indonesia
        32,  # Iran
        45,  # Jordan
        30,  # Korea, South
        45,  # Mexico
        40,  # Spain
        45,  # South Africa
        45,  # Thailand
        45,  # Ukraine
    ]
)

#### (Optional) testing code for selecting good initial parameter estimates

In [None]:
do_test = False

index_to_test = 0

country = included_countries[index_to_test]
n_est = included_countries_n_est[index_to_test]
start_idx = included_countries_start_idx[index_to_test]

# These initial estimates seem to be good globally
beta_est = 0.01
gamma_est = 0.01

# Run the test and produce a plot
if do_test:
    # Filter data by country
    df_confirmed_total = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    df_recovered_total = massage_dataframe(
        df_recovered[df_recovered["Country/Region"].eq(country)]
    )
    df_deaths_total = massage_dataframe(
        df_deaths[df_deaths["Country/Region"].eq(country)]
    )

    # Slice data
    i_data = (
        df_confirmed_total.cases.values
        - df_recovered_total.cases.values
        - df_deaths_total.cases.values
    )[start_idx:]
    r_data = (df_recovered_total.cases.values + df_deaths_total.cases.values)[
        start_idx:
    ]

    # Estimate parameters
    res, ode_model, t_data = estimate_SIR_constants(
        i_data=i_data, r_data=r_data, beta_est=beta_est, gamma_est=gamma_est, N_est=n_est,
    )

    # Print the parameter values
    pprint(dict(res.params.items()))

    # Get I and R from our ODE model
    I, R = ode_model(t=t_data, **res.params)

    # Get the dates corresponding to our time data
    t_data_dates = df_confirmed_total.date.values[start_idx:]

    fig, ax = plt.subplots()

    # Plot data
    ax.plot_date(
        x=t_data_dates, y=I, fmt="-*", xdate=True,
    )
    ax.plot_date(
        x=t_data_dates, y=R, fmt="-*", xdate=True,
    )
    ax.plot_date(
        x=t_data_dates, y=i_data, fmt="o", xdate=True,
    )
    ax.plot_date(
        x=t_data_dates, y=r_data, fmt="o", xdate=True,
    )

    # Add cosmetics
    ax.set_title("Estimate testing curves")
    ax.set_xlabel("date")
    ax.set_ylabel("count")
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

    ax.legend(["I (estimated)", "R (estimated)", "I (data)", "R (data)"])

    ax.grid(alpha=0.3)

### Initializing data frame to store data

Before we start computing parameters, let's set up a dataframe to store our results. 

In [None]:
big_df = pd.DataFrame(
    data={
        "country": included_countries,
        "pop_density": included_countries_pop_density,
        "gdp": included_countries_gdp,
        "democracy_score": included_countries_democracy_score,
        "start_idx": included_countries_start_idx,
        "beta_est": np.ones(len(included_countries)) * 0.01,
        "gamma_est": np.ones(len(included_countries)) * 0.01,
        "n_est": included_countries_n_est,
        "n": np.zeros(len(included_countries)),
        "beta": np.zeros(len(included_countries)),
        "gamma": np.zeros(len(included_countries)),
        "r0": np.zeros(len(included_countries)),
    }
)

### Determining parameters for each country

Now that we have our data in order, let's estimate parameters using our fitting function.

In [None]:
for index, row in big_df.iterrows():
    # Filter data by country
    df_confirmed_total = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(row.country)]
    )
    df_recovered_total = massage_dataframe(
        df_recovered[df_recovered["Country/Region"].eq(row.country)]
    )
    df_deaths_total = massage_dataframe(
        df_deaths[df_deaths["Country/Region"].eq(row.country)]
    )

    # Only include the data we want to fit
    i_data = (
        df_confirmed_total.cases.values
        - df_recovered_total.cases.values
        - df_deaths_total.cases.values
    )[row.start_idx :]
    r_data = (df_recovered_total.cases.values + df_deaths_total.cases.values)[
        row.start_idx :
    ]

    # Estimate parameters
    res, ode_model, t_data = estimate_SIR_constants(
        i_data=i_data,
        r_data=r_data,
        beta_est=row.beta_est,
        gamma_est=row.gamma_est,
        N_est=row.n_est,
    )

    # Save the parameters
    big_df.at[index, "beta"] = res.params["beta"]
    big_df.at[index, "gamma"] = res.params["gamma"]
    big_df.at[index, "n"] = res.params["N"]
    big_df.at[index, "r0"] = res.params["beta"] / res.params["gamma"]

Now let's show all of the results stored in our dataframe.

In [None]:
big_df

## Analyzing countries based on metrics

Now that we have estimates for the basic reproductive number $R_0$ for the countries we sampled, we can look at how $R_0$ correlates with the metrics we will consider.
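
The plots below are judged by eye. A correlation coefficient is one way to put a number on the same question. The sketch uses `np.corrcoef` for the Pearson coefficient and a simple rank-based (Spearman-style) variant that assumes no tied values. The helper names and the toy values are made up and are not results from this notebook.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman_rho(x, y):
    """Rank correlation; this simple ranking assumes no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson_r(rank_x, rank_y)


# Made-up values standing in for R0 and one of the metrics
toy_r0 = np.array([1.5, 2.0, 3.5, 5.0, 8.0])
toy_metric = np.array([10.0, 40.0, 35.0, 90.0, 400.0])

print(f"Pearson r    = {pearson_r(toy_r0, toy_metric):.3f}")
print(f"Spearman rho = {spearman_rho(toy_r0, toy_metric):.3f}")
```

With the dataframe from this notebook the calls would look like `pearson_r(big_df.r0.values, big_df.gdp.values)`. The rank-based version is less sensitive to a single extreme point, which may matter for metrics like GDP that span orders of magnitude. With only 15 countries, either coefficient should be read cautiously.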

### Population density

In [None]:
fig, ax = plt.subplots()

# Plot
ax.scatter(x=big_df.r0.values, y=big_df.pop_density.values)

# Annotations
for i in range(len(big_df.index)):
    ax.annotate(
        big_df.country.values[i],
        (big_df.r0.values[i] + 0.6, big_df.pop_density.values[i] + 0.5),
    )

# Add cosmetics
ax.set_title("Population density against basic reproductive number")
ax.set_xlabel(r"$R_0$")
ax.set_ylabel("population density (people per square kilometer land)")

ax.grid(alpha=0.3)

Here I don't see much of a correlation between population density and basic reproductive number. The outliers of South Korea and South Africa are interesting, although I don't have any particular insight into those datapoints.

### GDP

In [None]:
fig, ax = plt.subplots()

# Plot
ax.scatter(x=big_df.r0.values, y=big_df.gdp.values)

# Annotations
for i in range(len(big_df.index)):
    ax.annotate(
        big_df.country.values[i],
        (big_df.r0.values[i] + 0.6, big_df.gdp.values[i] + 1.5e3),
    )

# Add cosmetics
ax.set_title("GDP against basic reproductive number")
ax.set_xlabel(r"$R_0$")
ax.set_ylabel("GDP (USD)")

ax.grid(alpha=0.3)

Again, there isn't much of a correlation here.

### EIU Democracy Index score

In [None]:
fig, ax = plt.subplots()

# Plot
ax.scatter(x=big_df.r0.values, y=big_df.democracy_score.values)

# Annotations
for i in range(len(big_df.index)):
    ax.annotate(
        big_df.country.values[i],
        (big_df.r0.values[i] + 0.6, big_df.democracy_score.values[i] + 0.03),
    )

# Add cosmetics
ax.set_title("EIU Democracy Index score against basic reproductive number")
ax.set_xlabel(r"$R_0$")
ax.set_ylabel("EIU Democracy Index score")

ax.grid(alpha=0.3)

This plot seems to suggest that the more democratic a country is, the less adept it is at handling COVID-19. However, what I suspect is happening here is that less democratic countries report less accurate data, and if we had accurate data we would not see this particular correlation.

# Discussion

Having performed the above analysis, it appears that the metrics of population density, GDP, and democraticness are not useful in predicting how well a country will handle COVID-19. I suspect the results of this notebook may change in the future when COVID-19 is more uniformly distributed across countries. However, currently I suspect that COVID-19 SIR rates have more to do with specific events (for example, the outbreak in Wuhan or the church service in South Korea) and geography (how close a country is to high density areas of COVID-19) than more "macro-level" metrics like GDP. With that said, I am surprised population density didn't have a noticeable correlation with the basic reproductive number, since it is intuitive that it would.

In terms of future directions for this analysis, I would like to discuss my methods with someone well-versed in epidemiology, since I currently do not have the intuition to verify whether my results are reasonable. Having gained that intuition, I think a fruitful approach using the same methods would be to look at which measures countries are using to combat COVID-19, and then produce similar plots as above to see if those measures are actually effective.