In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Analysis of COVID-19 time series data

**PHYS 395 project 1**
**Matt Wiens - #301294492**

## Notebook setup 

The first cell below enlarges the default figure size. The second raises the notebook's auto-scroll threshold so that long outputs, including figures, are shown in full rather than inside a scrolling box.

In [None]:
# Set default plot size
plt.rcParams["figure.figsize"] = (12, 9)

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999

# Introduction

In this notebook we are interested in how the basic reproduction number $R_0$ varies across countries grouped by several different metrics. The basic reproduction number is the average number of people that one person with COVID-19 goes on to infect in a population where everyone else is susceptible. The metrics we consider are (i) population density, (ii) GDP, and (iii) the EIU Democracy Index, which is a measure of how "democratic" countries are.

# Methods

The first part of the analysis argues that we should *not* consider China's data, which is highly suspect and presumably unreliable.

The remainder of our analysis is geared towards finding the basic reproduction number $R_0$ of each country using the SIR model, and then using these $R_0$ values to compare countries based on the metrics listed in the introduction. The SIR model describes the time evolution of an infectious outbreak through the equations

\begin{align}
     \frac{dS}{dt} &= - \frac{\beta I S}{N}, \\
     \frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I, \\
     \frac{dR}{dt} &= \gamma I,
\end{align}

where

+ $\beta$ and $\gamma$ are constants (the transmission rate and the recovery rate, respectively)
+ $S$ is the size of the susceptible population
+ $I$ is the size of the infected population
+ $R$ is the size of the recovered population
+ $N = S + I + R$ is the total size of the population being considered

The basic reproduction number $R_0$ is related to the constants $\beta$ and $\gamma$ through

\begin{equation}
    R_0 = \frac{\beta}{\gamma}
    .
\end{equation}
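To make the model concrete, the equations above can be integrated numerically. The following is only a minimal sketch, not the fitting procedure used later in this notebook: it uses a fixed-step fourth-order Runge-Kutta scheme, and the function names `sir_rhs` and `simulate_sir`, the step size, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def sir_rhs(y, beta, gamma):
    """Right-hand side of the SIR equations for the state y = (S, I, R)."""
    s, i, r = y
    n = s + i + r  # total population, conserved by the equations
    new_infections = beta * i * s / n
    recoveries = gamma * i
    return np.array([-new_infections, new_infections - recoveries, recoveries])


def simulate_sir(beta, gamma, s_init, i_init, r_init, days, dt=0.1):
    """Integrate the SIR model with fixed-step fourth-order Runge-Kutta.

    Returns the time grid and an array whose columns are S, I, R.
    """
    steps = int(round(days / dt))
    t = np.linspace(0.0, steps * dt, steps + 1)
    y = np.empty((steps + 1, 3))
    y[0] = (s_init, i_init, r_init)
    for k in range(steps):
        k1 = sir_rhs(y[k], beta, gamma)
        k2 = sir_rhs(y[k] + 0.5 * dt * k1, beta, gamma)
        k3 = sir_rhs(y[k] + 0.5 * dt * k2, beta, gamma)
        k4 = sir_rhs(y[k] + dt * k3, beta, gamma)
        y[k + 1] = y[k] + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return t, y


# Illustrative parameters only (assumed, not fitted to any data)
beta, gamma = 0.5, 0.2
t, y = simulate_sir(
    beta, gamma, s_init=999_990.0, i_init=10.0, r_init=0.0, days=120
)
print("R_0 =", beta / gamma)
print("peak infected =", y[:, 1].max())
```

Because the three right-hand sides sum to zero, $S + I + R$ should stay constant along the solution, which is a convenient sanity check on any integrator.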

# Analysis

## Fetching data

Here, we'll fetch the latest COVID-19 data from Johns Hopkins CSSE.

In [None]:
# URLs to fetch
url_confirmed_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url_deaths_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
url_recovered_csv = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

In [None]:
# Get data
df_confirmed = pd.read_csv(url_confirmed_csv, sep=",")
df_deaths = pd.read_csv(url_deaths_csv, sep=",")
df_recovered = pd.read_csv(url_recovered_csv, sep=",")

## Plotting confirmed cases, deaths, and recovered

Now we'll plot the number of confirmed cases, deaths, and recoveries for both China and the rest of the world (excluding China).

### Confirmed cases

In [None]:
# Filter by China/rest of world
df_confirmed_total_china = df_confirmed[df_confirmed["Country/Region"].eq("China")]
df_confirmed_total_remaining = df_confirmed[~df_confirmed["Country/Region"].eq("China")]

# Drop columns we don't need
df_confirmed_total_china = df_confirmed_total_china.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)
df_confirmed_total_remaining = df_confirmed_total_remaining.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)

# Collapse all provinces/regions into single row
df_confirmed_total_china = df_confirmed_total_china.sum(axis=0).to_frame("cases")
df_confirmed_total_remaining = df_confirmed_total_remaining.sum(axis=0).to_frame(
    "cases"
)

# Move dates into a column
df_confirmed_total_china.index.name = "date"
df_confirmed_total_china.reset_index(inplace=True)

df_confirmed_total_remaining.index.name = "date"
df_confirmed_total_remaining.reset_index(inplace=True)

# Parse dates to datetime
df_confirmed_total_china.date = pd.to_datetime(df_confirmed_total_china.date)
df_confirmed_total_remaining.date = pd.to_datetime(df_confirmed_total_remaining.date)

In [None]:
fig, ax = plt.subplots()

# Plot data (ax.plot handles datetime x-values; plot_date is deprecated)
ax.plot(
    df_confirmed_total_china.date.values,
    df_confirmed_total_china.cases.values,
    "-*",
)
ax.plot(
    df_confirmed_total_remaining.date.values,
    df_confirmed_total_remaining.cases.values,
    "-*",
)

# Add cosmetics
ax.set_title("Confirmed COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)
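The filter, drop, sum, and date-parsing steps used for the confirmed-case totals are the same for every data set, so they could be factored into one helper. The sketch below is one possible way to do that. The function name `total_by_date`, its parameters, and the toy table are hypothetical and introduced only for illustration; the column names follow the layout of the CSSE files used in this notebook.

```python
import pandas as pd

ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def total_by_date(df, country="China", exclude=False, value_name="cases"):
    """Sum a wide time-series table over rows; return a (date, value) frame.

    `country`, `exclude`, and `value_name` are hypothetical parameter names
    introduced for this sketch.
    """
    mask = df["Country/Region"].eq(country)
    if exclude:
        mask = ~mask
    # Collapse all provinces/regions into a single series indexed by date
    totals = df[mask].drop(columns=ID_COLUMNS).sum(axis=0)
    # Move the dates into a column and parse them to datetime
    out = totals.rename_axis("date").reset_index(name=value_name)
    out["date"] = pd.to_datetime(out["date"])
    return out


# Tiny made-up table in the same layout, for illustration only
toy = pd.DataFrame(
    {
        "Province/State": ["Hubei", "Beijing", None],
        "Country/Region": ["China", "China", "Italy"],
        "Lat": [30.9, 40.2, 41.9],
        "Long": [112.3, 116.4, 12.6],
        "1/22/20": [10, 1, 0],
        "1/23/20": [20, 2, 3],
    }
)
china = total_by_date(toy)
rest = total_by_date(toy, exclude=True)
print(china)
print(rest)
```

With such a helper, each of the three data sets would need only two calls, one for China and one with `exclude=True` for the rest of the world.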

### Deaths

In [None]:
# Filter by China/rest of world
df_deaths_total_china = df_deaths[df_deaths["Country/Region"].eq("China")]
df_deaths_total_remaining = df_deaths[~df_deaths["Country/Region"].eq("China")]

# Drop columns we don't need
df_deaths_total_china = df_deaths_total_china.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)
df_deaths_total_remaining = df_deaths_total_remaining.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)

# Collapse all provinces/regions into single row
df_deaths_total_china = df_deaths_total_china.sum(axis=0).to_frame("cases")
df_deaths_total_remaining = df_deaths_total_remaining.sum(axis=0).to_frame(
    "cases"
)

# Move dates into a column
df_deaths_total_china.index.name = "date"
df_deaths_total_china.reset_index(inplace=True)

df_deaths_total_remaining.index.name = "date"
df_deaths_total_remaining.reset_index(inplace=True)

# Parse dates to datetime
df_deaths_total_china.date = pd.to_datetime(df_deaths_total_china.date)
df_deaths_total_remaining.date = pd.to_datetime(df_deaths_total_remaining.date)

In [None]:
fig, ax = plt.subplots()

# Plot data (ax.plot handles datetime x-values; plot_date is deprecated)
ax.plot(
    df_deaths_total_china.date.values,
    df_deaths_total_china.cases.values,
    "-*",
)
ax.plot(
    df_deaths_total_remaining.date.values,
    df_deaths_total_remaining.cases.values,
    "-*",
)

# Add cosmetics
ax.set_title("COVID-19 deaths")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Recovered cases

In [None]:
# Filter by China/rest of world
df_recovered_total_china = df_recovered[df_recovered["Country/Region"].eq("China")]
df_recovered_total_remaining = df_recovered[~df_recovered["Country/Region"].eq("China")]

# Drop columns we don't need
df_recovered_total_china = df_recovered_total_china.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)
df_recovered_total_remaining = df_recovered_total_remaining.drop(
    ["Province/State", "Country/Region", "Lat", "Long"], axis=1
)

# Collapse all provinces/regions into single row
df_recovered_total_china = df_recovered_total_china.sum(axis=0).to_frame("cases")
df_recovered_total_remaining = df_recovered_total_remaining.sum(axis=0).to_frame(
    "cases"
)

# Move dates into a column
df_recovered_total_china.index.name = "date"
df_recovered_total_china.reset_index(inplace=True)

df_recovered_total_remaining.index.name = "date"
df_recovered_total_remaining.reset_index(inplace=True)

# Parse dates to datetime
df_recovered_total_china.date = pd.to_datetime(df_recovered_total_china.date)
df_recovered_total_remaining.date = pd.to_datetime(df_recovered_total_remaining.date)

In [None]:
fig, ax = plt.subplots()

# Plot data (ax.plot handles datetime x-values; plot_date is deprecated)
ax.plot(
    df_recovered_total_china.date.values,
    df_recovered_total_china.cases.values,
    "-*",
)
ax.plot(
    df_recovered_total_remaining.date.values,
    df_recovered_total_remaining.cases.values,
    "-*",
)

# Add cosmetics
ax.set_title("Recovered COVID-19 cases")
ax.set_xlabel("date")
ax.set_ylabel("count")
ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))

ax.legend(["China", "rest of the world"])

ax.grid(alpha=0.3)

### Brief discussion of plots (on why China's data makes no sense)

The plots above are our justification for leaving China's data out of the analysis. The reported counts show no clear exponential growth phase (including in Hubei province, which contains Wuhan), which is hard to reconcile with the fact that quarantine measures were not imposed immediately.
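The expectation of an exponential phase comes from the SIR equations themselves: early in an outbreak $S \approx N$, so $dI/dt \approx (\beta - \gamma) I$ and $I(t) \approx I(0)\, e^{(\beta - \gamma) t}$, with cumulative confirmed counts growing at roughly the same rate. One way to check a series for such a phase is to fit a straight line to the logarithm of the counts over an early window. The sketch below is not part of the original analysis; the function name `fit_exponential_growth`, the window arguments, and the synthetic series are assumptions made for illustration.

```python
import numpy as np


def fit_exponential_growth(cases, start=0, stop=None):
    """Least-squares fit of log(cases) = log(c0) + r * t on a window of days.

    Returns the growth rate r (per day) and the doubling time ln(2) / r.
    Assumes daily samples and strictly positive counts inside the window.
    """
    window = np.asarray(cases, dtype=float)[start:stop]
    days = np.arange(window.size)
    rate, _intercept = np.polyfit(days, np.log(window), deg=1)
    return rate, np.log(2.0) / rate


# Synthetic series that doubles every 3 days (made up, for illustration only)
true_rate = np.log(2.0) / 3.0
synthetic = 50.0 * np.exp(true_rate * np.arange(30))
rate, doubling_time = fit_exponential_growth(synthetic)
print(rate, doubling_time)
```

A series with a genuine exponential phase looks like a straight line on a semi-log plot over that window, and the fitted slope gives an estimate of $\beta - \gamma$.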

## Fitting the world's data to an SIR model

Now let's try to fit the world's data (excluding China) to the SIR model. For the purposes of the SIR model we will take

+ susceptible = unknown
+ infected = confirmed data - recovered data - deaths data
+ recovered = recovered data + deaths data
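The mapping in the list above can be written directly in terms of the three cumulative series. The sketch below uses made-up numbers and a hypothetical function name, `sir_compartments`. The final step, a finite-difference estimate of $\gamma$ from $dR/dt = \gamma I$, is an illustrative technique and not necessarily the fitting method used in this notebook.

```python
import numpy as np


def sir_compartments(confirmed, recovered, deaths):
    """Map cumulative reported series to the SIR infected and removed sizes.

    infected = confirmed - recovered - deaths
    removed  = recovered + deaths   (the R compartment of the model)
    """
    confirmed = np.asarray(confirmed, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    deaths = np.asarray(deaths, dtype=float)
    removed = recovered + deaths
    infected = confirmed - removed
    return infected, removed


# Made-up cumulative daily counts, for illustration only
confirmed = [100, 150, 230, 340]
recovered = [5, 10, 20, 40]
deaths = [1, 2, 4, 7]
infected, removed = sir_compartments(confirmed, recovered, deaths)
print(infected)
print(removed)

# Rough day-by-day estimate of gamma from dR/dt = gamma * I, using a
# forward difference with a time step of one day (illustrative only)
gamma_est = np.diff(removed) / infected[:-1]
print(gamma_est)
```

The susceptible population does not appear here because it is not observed directly; it has to come from an assumed or fitted population size $N$ through $S = N - I - R$.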

# Discussion

TODO: add discussion and move this BELOW analysis