In [None]:
%matplotlib inline

In [None]:
from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import symfit

# Analysis of COVID-19 time series data

**PHYS 395 project 1;**
**Matt Wiens - #301294492**

## Notebook setup 

The first cell below makes the default figure size a bit larger than normal. The second raises the output auto-scroll threshold so that all figure output areas are expanded by default.

In [None]:
# Set default plot size
plt.rcParams["figure.figsize"] = (12, 9)

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999

# Introduction

In this notebook we will be interested in how the basic reproduction number $R_0$ varies across countries grouped by several different metrics. The basic reproduction number is the average number of people that one person with COVID-19 infects in a fully susceptible population. The metrics we will consider are (i) population density, (ii) GDP, and (iii) the EIU Democracy Index, which is a measure of how "democratic" countries are. However, the analysis in this notebook can easily be extended to additional metrics.

# Methods

The first part of the analysis argues that we should *not* consider China's data, as it is highly suspect and I presume it to be unreliable.

The remainder of our analysis is geared towards finding the basic reproduction numbers $R_0$ of different countries using the SIR model, and then using these $R_0$ values to compare countries based on the metrics listed in the introduction. The SIR model describes the temporal behavior of an infectious outbreak through the equations

\begin{align}
     \frac{dS}{dt} &= - \frac{\beta I S}{N}, \\
     \frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I, \\
     \frac{dR}{dt} &= \gamma I,
\end{align}

where

+ $\beta$, $\gamma$ are constants
+ $S$ is the size of the susceptible population
+ $I$ is the size of the infected population
+ $R$ is the size of the recovered population
+ $N = S + I + R$ is the total size of the population being considered

The basic reproductive number $R_0$ is related to the constants $\beta$ and $\gamma$ through

\begin{equation}
    R_0 = \frac{\beta}{\gamma}
    .
\end{equation}

Because we only have data for $I$ and $R$ (where in $R$ we will include recovered *and* deaths data), we can reformulate the above system of differential equations to be

\begin{align}
     \frac{dI}{dt} &= \frac{\beta I (N - I - R)}{N} - \gamma I, \\
     \frac{dR}{dt} &= \gamma I.
\end{align}
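
To make the reduced system concrete, here is a minimal sketch that integrates it numerically with SciPy's `solve_ivp`. The parameter values and initial conditions are illustrative assumptions only (they are not fitted to any data), and the function name `reduced_sir_rhs` is made up for this example.

```python
import numpy as np
from scipy.integrate import solve_ivp


def reduced_sir_rhs(t, y, beta, gamma, N):
    """Right-hand side of the reduced (I, R) system, with S = N - I - R."""
    I, R = y
    dI_dt = beta * I * (N - I - R) / N - gamma * I
    dR_dt = gamma * I
    return [dI_dt, dR_dt]


# Illustrative values only (assumptions, not fitted to any data)
beta, gamma, N = 0.4, 0.1, 1_000_000.0
t_eval = np.arange(0.0, 120.0, 1.0)  # one unit of time = one day

sol = solve_ivp(
    reduced_sir_rhs,
    t_span=(t_eval[0], t_eval[-1]),
    y0=[10.0, 0.0],  # initial infected and recovered counts
    t_eval=t_eval,
    args=(beta, gamma, N),
    rtol=1e-8,
    atol=1e-8,
)
I_sim, R_sim = sol.y
S_sim = N - I_sim - R_sim  # susceptible population recovered from N, I, R

print("R0 = %.2f" % (beta / gamma))
print("peak infected = %.0f on day %d" % (I_sim.max(), I_sim.argmax()))
```

Because $S$ is eliminated, only two equations are integrated; $S$ is recovered afterwards from $N - I - R$.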

Having the data for $I$ and $R$ and the above differential equations, we can then estimate the parameters $\beta$, $\gamma$ and $N$ (and thus $R_0$) using [symfit](https://symfit.readthedocs.io/en/stable/index.html), a Python library that combines [SciPy optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html) and [SymPy](https://www.sympy.org/en/index.html) to perform curve fitting on symbolically defined models.

# Analysis

## Fetching data

Here, we'll fetch the latest COVID-19 data from Johns Hopkins CSSE.

In [None]:
# URLs to fetch
url_confirmed_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url_deaths_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
url_recovered_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

In [None]:
# Get data
df_confirmed = pd.read_csv(url_confirmed_csv, sep=",")
df_deaths = pd.read_csv(url_deaths_csv, sep=",")
df_recovered = pd.read_csv(url_recovered_csv, sep=",")

## Defining a function to "massage" the data

Here we'll define a function to "massage" the data into a useful form for our analysis. It takes one of the above dataframes (possibly filtered down by country) and transforms it into a dataframe with two columns: the dates (as `datetime.datetime`s), with column label `date`, and the total cases for each date, with column label `cases`.

In [None]:
def massage_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns we don't need
    df = df.drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1)

    # Collapse all provinces/regions into single row
    df = df.sum(axis=0).to_frame("cases")

    # Move dates into a column
    df.index.name = "date"
    df.reset_index(inplace=True)

    # Parse dates to datetime
    df.date = pd.to_datetime(df.date)

    return df
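
As a quick check of what the transformation does, the sketch below applies the same steps to a tiny made-up dataframe in the wide layout of the CSSE files. The country name and counts are invented for illustration, the function is repeated so the example runs on its own, and the explicit `format="%m/%d/%y"` is an assumption about how the date columns are labelled.

```python
import pandas as pd


def massage_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Same steps as the function above, repeated so this example is self-contained
    df = df.drop(["Province/State", "Country/Region", "Lat", "Long"], axis=1)
    df = df.sum(axis=0).to_frame("cases")
    df.index.name = "date"
    df = df.reset_index()
    # Assumes column labels such as "1/22/20" (month/day/two-digit year)
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")
    return df


# Made-up rows: two provinces of one fictional country, two dates
toy = pd.DataFrame(
    {
        "Province/State": ["A", "B"],
        "Country/Region": ["Examplia", "Examplia"],
        "Lat": [0.0, 1.0],
        "Long": [0.0, 1.0],
        "1/22/20": [1, 0],
        "1/23/20": [3, 2],
    }
)

toy_massaged = massage_dataframe(toy)
print(toy_massaged)
```

The two province rows are summed into a single series, leaving one row per date.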

## Plotting confirmed cases, deaths, and recovered

Now we'll plot the number of confirmed cases, deaths, and recoveries for both China and the rest of the world (excluding China).

### Confirmed cases

In [None]:
# Filter by China/rest of world
df_confirmed_total_china = massage_dataframe(
    df_confirmed[df_confirmed["Country/Region"].eq("China")]
)
df_confirmed_total_remaining = massage_dataframe(
    df_confirmed[~df_confirmed["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_confirmed_total_china.date.values,
    y=df_confirmed_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_confirmed_total_remaining.date.values,
    y=df_confirmed_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Deaths

In [None]:
# Filter by China/rest of world
df_deaths_total_china = massage_dataframe(
    df_deaths[df_deaths["Country/Region"].eq("China")]
)
df_deaths_total_remaining = massage_dataframe(
    df_deaths[~df_deaths["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_deaths_total_china.date.values,
    y=df_deaths_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_deaths_total_remaining.date.values,
    y=df_deaths_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("COVID-19 Deaths")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Recovered cases

In [None]:
# Filter by China/rest of world
df_recovered_total_china = massage_dataframe(
    df_recovered[df_recovered["Country/Region"].eq("China")]
)
df_recovered_total_remaining = massage_dataframe(
    df_recovered[~df_recovered["Country/Region"].eq("China")]
)

In [None]:
fig, ax = plt.subplots()

# Plot data
ax.plot_date(
    x=df_recovered_total_china.date.values,
    y=df_recovered_total_china.cases.values,
    fmt="-*",
    xdate=True,
)
ax.plot_date(
    x=df_recovered_total_remaining.date.values,
    y=df_recovered_total_remaining.cases.values,
    fmt="-*",
    xdate=True,
)

# Add cosmetics
ax.set_title("Recovered COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Brief discussion of plots (on why China's data makes no sense)

The above plots are pretty good justification to not even take China's data into consideration. The data shows that there was no exponential growth phase (including in Hubei province which contains Wuhan), which makes no sense given that they didn't immediately start quarantining.

## Fitting the world's data to an SIR model

Now let's try to fit the world's data (excluding China) to the "modified" SIR model written in the Methods section. For the purposes of the SIR model we will take

+ infected = confirmed data - recovered data - deaths data
+ recovered = recovered data + deaths data

Our goal is to determine $\beta$, $\gamma$, and $N$ for the data; using this, we can determine the $S$ data and the basic reproduction number $R_0$.

### Defining a fitting function

The below function takes in data for $I$, $R$ and returns a tuple which contains

+ The symfit fit results (an instance of `symfit.core.fit_results.FitResults`). This, importantly, has a `params` attribute which is an ordered dictionary containing the estimated parameters $\beta$, $\gamma$, and $N$;

+ A function which evaluates the SIR system of differential equations given time data and the parameter estimates (an instance of `symfit.core.models.ODEModel`);

+ An array containing the time data used when fitting. The function makes the simplifying assumption that one day is equal to one unit of time.

The function also (optionally) takes in initial estimates for $\beta$, $\gamma$, and $N$. Note that having *reasonable* estimates is extremely important to having the fit produce good parameter estimates.

One other important point is that the data supplied to the fit should start when the infection data becomes non-negligible. Supplying data from before any infections occurred will result in a worse fit.

In [None]:
def estimate_SIR_constants(
    i_data: np.ndarray,
    r_data: np.ndarray,
    beta_est: float = 1.0,
    gamma_est: float = 0.01,
    N_est: float = 150000,
) -> tuple:
    # Set up data for time (one day = one unit of time)
    t_data = np.arange(i_data.shape[0], dtype=float)

    # Set up variables and parameters, seeded with the initial estimates
    # (the default beta_est of 1.0 is symfit's own default initial value)
    I, R, t = symfit.variables("I, R, t")
    beta = symfit.Parameter("beta", beta_est)
    gamma = symfit.Parameter("gamma", gamma_est)
    N = symfit.Parameter("N", N_est)

    # Set up and run the model
    model_dict = {
        symfit.D(I, t): beta * I * (N - I - R) / N - gamma * I,
        symfit.D(R, t): gamma * I,
    }
    ode_model = symfit.ODEModel(
        model_dict, initial={t: 0.0, I: i_data[0], R: r_data[0]}
    )

    fit = symfit.Fit(ode_model, t=t_data, I=i_data, R=r_data)

    # Run
    res = fit.execute()

    return (res, ode_model, t_data)

### Choosing the data to fit

Now we need to determine which data we want to use for the fit. By inspecting the plots in the "Plotting confirmed cases, deaths, and recovered" section above, we can decide at which data point to start. Looking at the points, starting in late February seems reasonable.

In [None]:
# Choose where to start the data
remaining_start_idx = 35
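
The start index above was chosen by eye. As an alternative, the sketch below picks it from a threshold on the case counts. The helper `first_index_above` and the threshold of 50 are hypothetical and not part of the original analysis; the counts are made up.

```python
import numpy as np


def first_index_above(cases: np.ndarray, threshold: float) -> int:
    """Return the first index at which `cases` reaches `threshold`.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    above = np.flatnonzero(np.asarray(cases) >= threshold)
    if above.size == 0:
        raise ValueError("cases never reach the threshold")
    return int(above[0])


# Made-up cumulative counts for illustration
toy_cases = np.array([0, 0, 1, 3, 8, 20, 55, 140, 390])
start_idx = first_index_above(toy_cases, threshold=50)
print(start_idx)
```

A rule like this makes the choice reproducible across countries, though the threshold itself remains a judgment call.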

In [None]:
# Collect infected and recovered data using our above definitions
i_data = (
    df_confirmed_total_remaining.cases.values
    - df_recovered_total_remaining.cases.values
    - df_deaths_total_remaining.cases.values
)[remaining_start_idx:]
r_data = (
    df_recovered_total_remaining.cases.values + df_deaths_total_remaining.cases.values
)[remaining_start_idx:]

### Estimating parameters using our fitting function

Now we'll use the function we defined to estimate the SIR constants.

In [None]:
res, ode_model, t_data = estimate_SIR_constants(i_data, r_data)

Let's look at what the estimated parameters are.

In [None]:
pprint(dict(res.params.items()))

And let's look at how well this agrees with the data we supplied to the fitting function.

In [None]:
# Get I and R from our ODE model and determine S
I, R = ode_model(t=t_data, **res.params)
S = res.params["N"] - I - R

# Get the dates corresponding to our time data
t_data_dates = df_confirmed_total_remaining.date.values[remaining_start_idx:]

In [None]:
fig, ax = plt.subplots(figsize=(15, 13))

# Plot data
ax.plot_date(
    x=t_data_dates, y=S, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=I, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=R, fmt="-*", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=i_data, fmt="o", xdate=True,
)
ax.plot_date(
    x=t_data_dates, y=r_data, fmt="o", xdate=True,
)

# Add cosmetics
ax.set_title("World (excluding China) SIR curves")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["S (estimated)", "I (estimated)", "R (estimated)", "I (data)", "R (data)"])

ax.grid(alpha=0.3)

We can see here that the curves produced by our estimated parameters agree closely with the data.

### Determine the basic reproductive number 

Now that we've found that our parameter estimates are reasonable, we can determine the basic reproductive number for our data.

In [None]:
print("R0 = %.2f" % (res.params["beta"] / res.params["gamma"]))

For the data used when I ran this computation (on 2020-03-30), the basic reproductive number $R_0$ was approximately 8. This means, as a global average, each infected person will infect an additional 8 non-infected people, which is quite worrying!
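
Since $R_0 = \beta / \gamma$ is a ratio of two fitted parameters, its uncertainty follows from first-order error propagation. The sketch below assumes that standard deviations for $\beta$ and $\gamma$ (and optionally their covariance) are available from the fit. The function name and the numbers are made up for illustration and are not results from this notebook.

```python
import math


def r0_with_uncertainty(beta, gamma, beta_std, gamma_std, cov_beta_gamma=0.0):
    """Return (R0, std of R0) for R0 = beta / gamma via first-order propagation.

    Hypothetical helper; the inputs are assumed to come from a fit.
    """
    r0 = beta / gamma
    # Relative variance of a ratio, including the covariance term
    rel_var = (
        (beta_std / beta) ** 2
        + (gamma_std / gamma) ** 2
        - 2.0 * cov_beta_gamma / (beta * gamma)
    )
    # Guard against a slightly negative value caused by rounding
    return r0, abs(r0) * math.sqrt(max(rel_var, 0.0))


# Made-up inputs for illustration only
r0, r0_std = r0_with_uncertainty(beta=0.4, gamma=0.05, beta_std=0.02, gamma_std=0.005)
print("R0 = %.2f +/- %.2f" % (r0, r0_std))
```

Reporting $R_0$ with an uncertainty makes later comparisons between countries easier to interpret.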

## Determine the basic reproductive number for a sample of countries

Now we will determine the basic reproductive number $R_0$ for a number of different countries, so we can see how this number varies based on the metrics discussed in the introduction.

Because we need to provide good estimates for

+ what data to supply to the fitting function
+ the parameters $\beta$, $\gamma$, and $N$

we will need to plot the data for each country we consider prior to fitting.

### Choosing which countries to sample

Here we will choose for which countries we will determine the basic reproductive number $R_0$.

Below is a list of all the countries we have data for.

In [None]:
pprint(list(np.unique(df_confirmed["Country/Region"])))

I'll choose 16 of these countries to include in our analysis.

In [None]:
included_countries = np.array(
    [
        "Angola",
        "Argentina",
        "Brazil",
        "Canada",
        "Chile",
        "Colombia",
        "Egypt",
        "Germany",
        "Indonesia",
        "Iran",
        "Jordan",
        "Korea, South",
        "Mexico",
        "Spain",
        "Thailand",
        "Ukraine",
    ]
)

Let's plot their confirmed cases so that we can come up with good parameter estimates for our fitting function.

In [None]:
fig, ax = plt.subplots()

# Plot data for each country
for country in included_countries:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries)

ax.grid(alpha=0.3)

It's a little hard to see what's going on with some of the countries with lower numbers of confirmed cases, so let's plot those separately.

In [None]:
included_countries_plot_2 = np.array(
    [
        "Angola",
        "Argentina",
        "Brazil",
        "Canada",
        "Chile",
        "Colombia",
        "Egypt",
        "Germany",
        "Indonesia",
        "Iran",
        "Jordan",
        "Korea, South",
        "Mexico",
        "Spain",
        "Thailand",
        "Ukraine",
    ]
)

fig, ax = plt.subplots()

# Plot data for each country
for country in included_countries_plot_2:
    df_massaged = massage_dataframe(
        df_confirmed[df_confirmed["Country/Region"].eq(country)]
    )
    ax.plot_date(
        x=df_massaged.date.values, y=df_massaged.cases.values, fmt="-*", xdate=True,
    )

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(included_countries_plot_2)

ax.grid(alpha=0.3)

### Initializing data frame to store data

Before we start computing parameters, let's set up a dataframe to store our results. 
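
One possible layout is sketched below: a dataframe indexed by country, with one column per fitted parameter plus the derived $R_0$. The column names, the short country list, and the filled-in numbers are assumptions made for illustration; the notebook does not specify them.

```python
import numpy as np
import pandas as pd

# Stand-in for the full list of sampled countries
countries = ["Canada", "Germany", "Spain"]

# Hypothetical column layout; values start as NaN until a fit is run
df_results = pd.DataFrame(
    {"beta": np.nan, "gamma": np.nan, "N": np.nan, "R0": np.nan},
    index=pd.Index(countries, name="country"),
)

# After fitting a country, its row could be filled in like this (made-up numbers)
df_results.loc["Canada", ["beta", "gamma", "N"]] = [0.3, 0.05, 2.0e5]

# R0 is derived from the stored parameters; unfitted rows stay NaN
df_results["R0"] = df_results["beta"] / df_results["gamma"]
print(df_results)
```

Starting from NaN makes it obvious which countries have not been fitted yet.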

# Discussion

TODO: add discussion and move this BELOW analysis