# CoVid-19 Pandemic Statistics

Analysis of the CoVid-19 outbreak in Europe and comparison with _official_ data of the Hubei province, China

Total population of European countries taken from https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe

Credits: a good part of this analysis is taken from
https://towardsdatascience.com/covid-19-infection-in-italy-mathematical-models-and-predictions-7784b4d7dd8d


In [None]:
from datetime import datetime,timedelta
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from scipy.optimize import curve_fit
from scipy.optimize import fsolve
import matplotlib.pyplot as plt
%matplotlib notebook

In [None]:
# Total population of some European countries of interest (Millions)
pop = {}
pop['Italy'] = 60
pop['France'] = 67
pop['Switzerland'] = 8.5
pop['Germany'] = 83
pop['Spain'] = 48
pop['Hubei'] = 58.5

In [None]:
# This is a GitHub repo maintained by Johns Hopkins University CSSE, which aggregates data from WHO and other official sources
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'

# these are the main data frames
df = pd.read_csv(url + 'time_series_19-covid-Confirmed.csv')
#df = pd.read_csv(url + 'time_series_19-covid-Deaths.csv')
#df = pd.read_csv(url + 'time_series_19-covid-Recovered.csv')
df.head()

In [None]:
df.describe()

In [None]:
### Some data manipulation

# extract the countries of interest and prepare data
countrydf = {}
pop_no_Hubei = (set(pop.keys()) - {'Hubei'})
for c in pop_no_Hubei:
  # transpose + drop the first 4 fields (Prov, Country, Lat, Long)
  # + rename x axis label -> 'date' and data label -> country name
  countrydf[c] = df[df['Country/Region'] == c] \
                 .transpose()[4:] \
                 .reset_index() \
                 .rename(columns={'index': 'date', \
                                  df[df['Country/Region'] == c].index[0]: c})
# The Hubei case is special...
c = 'Hubei'
countrydf[c] = df[df['Province/State'] == c] \
               .transpose()[4:] \
               .reset_index() \
               .rename(columns={'index': 'date', \
                                df[df['Province/State'] == c].index[0]: c})

for c in pop:
  # convert date to days since Feb 29th
  countrydf[c]['date'] = countrydf[c]['date'].map(lambda d : \
        (datetime.strptime(d, '%m/%d/%y') - datetime.strptime('2020-02-29','%Y-%m-%d')).days)
  # convert #cases to cases per 1M inhabitants
  countrydf[c][c] /= pop[c]

countrydf['Italy']

In [None]:
plt.rcParams['figure.figsize'] = [18, 10]
plt.rc('font', size=11)

for c in pop_no_Hubei:
  # drop first days' data as it's more noisy: keep points with value > 0.2 cases / 1M people
  # (filter on the cases column only: testing every column would also match the 'date' values)
  countrydf[c] = countrydf[c][countrydf[c][c] > 0.2]

  # get all data points
  t = list(countrydf[c].iloc[:, 0]) 
  y = list(countrydf[c].iloc[:, 1])
  plt.scatter(t, y, label=c)   # draw the dots
  plt.plot(t, y)   # draw the lines

plt.legend()
plt.title("Confirmed cases per 1M inhabitants")
plt.xlabel("Days since Feb 29, 2020")
plt.ylabel("Count")
plt.yscale('log')
plt.grid(which='both')
plt.show()

# Models and curve fitting

Now let's try to fit the data. How do we model such a growth?

## Geometric or Exponential growth

Many natural phenomena follow geometric or exponential evolutions. Examples:
* Radioactive decay
* Population growth (or virus spreading)

WHY?

Whenever the change per step (for example, the daily growth) is proportional to the current total, the resulting evolution is a geometric or exponential sequence!

Population growth:
  $$\Delta N(d) = N(d+1) - N(d) = k\cdot N(d)$$
  
Radioactive decay (the negative sign accounts for the fact that the decayed atoms disappear from the total):
  $$\Delta N_{total}(t) = N_{total}(t+1) - N_{total}(t) = - k\cdot N_{total}(t)$$

Solving the first:
  $$N(d) = (k + 1)\cdot N(d-1) = (k + 1)^{2}\cdot N(d-2) = \cdots = (k + 1)^{d}\cdot N(0)$$

It is a common convention in mathematics to express any exponential function in terms of the constant _e_, by redefining the other factors:

$$(k+1)^{d} = e^{d/\tau}, \quad \tau \equiv \frac{1}{\ln(k+1)}$$

Therefore, a generic exponential function is:

  $$f_{a,b,\tau}(t) = a\cdot e^{(t-b)/\tau}, \quad a, b, \tau \; \text{free parameters}$$

Let's have a look at it with some plots.

In [None]:
def exp_model(t, a, b, c):
  # a: scale factor, b: time offset, c: time constant (tau in the formula above)
  return a*np.exp((t-b)/c)

In [None]:
t_range = list(range(-50, +50))
plt.plot(t_range, [exp_model(i, 500, 0, 1.3) for i in t_range])
plt.plot(t_range, [exp_model(i, 500, 0, -1.3) for i in t_range])

plt.title("Exponential growth")
plt.xlabel("Days")
plt.ylabel("Count")
#plt.yscale('log')
plt.grid(which='both')
plt.show()
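
To connect the daily growth rate $k$ with the time constant $\tau$ defined above, here is a minimal sketch. The helper names `tau_from_daily_growth` and `doubling_time` and the value of `k` are assumptions made for illustration; they are not taken from the data.

```python
import numpy as np

def tau_from_daily_growth(k):
    """Time constant tau (in days) for a daily growth rate k: tau = 1/ln(1+k)."""
    return 1.0 / np.log(1.0 + k)

def doubling_time(tau):
    """Days needed for a*exp(t/tau) to double: tau*ln(2)."""
    return tau * np.log(2.0)

k = 0.25   # hypothetical 25% daily growth, for illustration only
tau = tau_from_daily_growth(k)

# The geometric and the exponential form give the same value after d days
d = 10
assert np.isclose((1 + k) ** d, np.exp(d / tau))

print("tau = %.2f days, doubling time = %.2f days" % (tau, doubling_time(tau)))
```

The doubling time is often easier to communicate than $\tau$ itself: a count that grows by 100% per day doubles, by construction, every day.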

## Back to our data

Now let's fit our data with such functions, that is, find the parameter values that "best" match the given data points.

WARNING: data fitting and model identification are large and complex topics in Data Science and Statistics. The tools shown here illustrate how easy it is to experiment with the data, but the scientist must always challenge the model before reaching any conclusions!

Paraphrasing Socrates (_Gnothi seauton_, "know yourself"), know your data!

In [None]:
exp_fit = {}
for c in pop:
  t = list(countrydf[c].iloc[:,0])
  y = list(countrydf[c].iloc[:,1])
  try:
    exp_fit[c] = curve_fit(exp_model, t, y, p0=[0.01, 0.5, 3])
  except RuntimeError:
    exp_fit[c] = None

exp_fit     # parameters of the fit: note how many useless decimal digits. The Covariance Matrix is given as result (but not for Hubei?! hmmm....)
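
`curve_fit` returns a pair: the best-fit parameters and their covariance matrix. The square roots of the diagonal of that matrix are commonly read as one-sigma uncertainties, which also tells us how many decimal digits are meaningful. Note that in `exp_model` the parameters `a` and `b` are not independent, since $a\cdot e^{(t-b)/\tau} = (a\,e^{-b/\tau})\cdot e^{t/\tau}$; this may be one reason why the covariance is sometimes ill-defined. The sketch below runs on synthetic data and is not part of the original analysis: `exp_model2` is a hypothetical two-parameter variant, and all numbers are made up.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model2(t, a, tau):
    # Two-parameter exponential: the offset b of exp_model is absorbed into a
    return a * np.exp(t / tau)

# Synthetic data with 2% multiplicative noise (NOT real case counts)
rng = np.random.default_rng(0)
t = np.arange(0, 20, dtype=float)
y = exp_model2(t, 2.0, 4.0) * (1 + 0.02 * rng.standard_normal(t.size))

popt, pcov = curve_fit(exp_model2, t, y, p0=[1.0, 3.0])
perr = np.sqrt(np.diag(pcov))   # one-sigma standard errors of the parameters

for name, val, err in zip(("a", "tau"), popt, perr):
    print("%s = %.2f +/- %.2f" % (name, val, err))
```

Quoting each parameter together with its error makes it clear that only the first couple of digits carry information.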

In [None]:
t_fit = list(range(-40, +30))

for c in ('Italy', 'Switzerland', 'France', 'Hubei'):  # 'Spain', 'Germany'
  # Real data
  t = list(countrydf[c].iloc[:, 0])
  y = list(countrydf[c].iloc[:, 1])
  plt.scatter(t, y, label=c)

  # Predicted exponential curve
  if exp_fit[c]:
    plt.plot(t_fit, [exp_model(i, exp_fit[c][0][0],exp_fit[c][0][1],exp_fit[c][0][2]) for i in t_fit],
             label=("Exponential %s" % c))

plt.legend()
plt.title("Confirmed cases per 1M inhabitants")
plt.xlabel("Days since Feb 29, 2020")
plt.ylabel("Count")
plt.ylim((0.1, 1e4))
plt.yscale('log')
plt.grid(which='both')
plt.show()

### [Advanced]

If we want to take into account that the population size is not infinite, a more realistic model is given by the "logistic" (or sigmoid) function:

  $$f_{a,b,\tau}(t) = \frac{a}{1+e^{-(t-b)/\tau}}, \quad a \equiv P_{final}$$

The derivation of this formula goes beyond the scope of this tutorial. See e.g. https://en.wikipedia.org/wiki/Logistic_function for further details.

Let's have a look at this one too.

In [None]:
def logistic_model(t, a, b, c):
  return a/(1+np.exp(-(t-b)/c))

In [None]:
t_range = list(range(-15, +15))
plt.plot(t_range, [logistic_model(i, 10000, 0, 1.3) for i in t_range])
plt.plot(t_range, [exp_model(i, 4, -10.2, 1.3) for i in t_range])

plt.title("Logistic growth vs. exponential growth")
plt.xlabel("Days")
plt.ylabel("Count")
plt.ylim((0.1, 1.2e4))
#plt.yscale('log')
plt.grid(which='both')
plt.show()
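
The two curves plotted above nearly coincide at early times. This is because for $t \ll b$ the term $e^{-(t-b)/\tau}$ dominates the denominator, so the logistic function behaves like $a\cdot e^{(t-b)/\tau}$. A minimal numerical check, reusing the same illustrative parameters as the plot (the sample times are arbitrary choices):

```python
import numpy as np

def logistic_model(t, a, b, c):
    return a / (1 + np.exp(-(t - b) / c))

def exp_model(t, a, b, c):
    return a * np.exp((t - b) / c)

a, b, c = 10000, 0, 1.3
t_early = np.array([-15.0, -12.0, -9.0])   # well before the inflection point b
t_late = np.array([0.0, 5.0])              # at and after the inflection point

# Ratio logistic / exponential: ~1 while the two models agree
ratio_early = logistic_model(t_early, a, b, c) / exp_model(t_early, a, b, c)
ratio_late = logistic_model(t_late, a, b, c) / exp_model(t_late, a, b, c)

print("early:", ratio_early)
print("late: ", ratio_late)
```

In practice this means that data collected only during the early phase can hardly tell the two models apart, and the plateau `a` is then poorly constrained.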

In [None]:
logit_fit = {}
for c in pop:
  t = list(countrydf[c].iloc[:,0])
  y = list(countrydf[c].iloc[:,1])
  try:
    logit_fit[c] = curve_fit(logistic_model, t, y, p0=[1000, 40, 3])
  except RuntimeError:
    logit_fit[c] = None

logit_fit     # parameters of the fit and Covariance Matrix - this time also for Hubei.
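
In the logistic model, `a` is the plateau and `b` is the day of the inflection point. A natural follow-up question is when the curve reaches a given fraction of the plateau. The sketch below uses `fsolve` (imported at the top of this notebook) with hypothetical parameter values, not the ones from the fits above; for the logistic function the answer is also available in closed form, $t = b + \tau \ln\frac{f}{1-f}$, which serves as a check.

```python
import numpy as np
from scipy.optimize import fsolve

def logistic_model(t, a, b, c):
    return a / (1 + np.exp(-(t - b) / c))

a, b, c = 1000.0, 20.0, 4.0   # hypothetical fit results, for illustration only
frac = 0.99                   # fraction of the plateau we are interested in

# Numerical solution of logistic_model(t) = frac * a, starting the search at t = b
t_num = fsolve(lambda t: logistic_model(t, a, b, c) - frac * a, b)[0]

# Closed-form solution for comparison
t_exact = b + c * np.log(frac / (1 - frac))

print("numerical: %.3f, exact: %.3f" % (t_num, t_exact))
```

The numerical route is only needed for models without a closed-form inverse, but it shows how the same question can be asked of any fitted curve.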

In [None]:
t_fit = list(range(-40, +30))

show_exp = False

for c in ('Italy', 'Switzerland', 'France', 'Hubei'):  # 'Spain', 'Germany'
  # Real data
  t = list(countrydf[c].iloc[:, 0])
  y = list(countrydf[c].iloc[:, 1])
  plt.scatter(t, y, label=c)

  if show_exp:
    # Predicted exponential curve
    if exp_fit[c]:
      plt.plot(t_fit, [exp_model(i, exp_fit[c][0][0],exp_fit[c][0][1],exp_fit[c][0][2]) for i in t_fit],
               label=("Exponential %s" % c))
  else:
    # Predicted logistic curve
    if logit_fit[c]:
      plt.plot(t_fit, [logistic_model(i, logit_fit[c][0][0],logit_fit[c][0][1],logit_fit[c][0][2]) for i in t_fit],
               label=("Logistic %s" % c))

plt.legend()
plt.title("Confirmed cases per 1M inhabitants")
plt.xlabel("Days since Feb 29, 2020")
plt.ylabel("Count")
plt.ylim((0.1, 1e4))
plt.yscale('log')
plt.grid(which='both')
plt.show()
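
Judging fits by eye on a plot is a start, but a number helps when challenging a model. One simple option is the mean squared error (MSE) between data and fitted curve; `mean_squared_error` from scikit-learn, imported at the top, computes the same quantity as the NumPy expression used here. The sketch below is an illustration on synthetic data that follows a logistic curve, so the outcome is known in advance. To avoid the redundancy between `a` and `b`, the exponential is fitted with `b` fixed to 0, which is an assumption of this example.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a, b, c):
    return a * np.exp((t - b) / c)

def logistic_model(t, a, b, c):
    return a / (1 + np.exp(-(t - b) / c))

# Synthetic logistic data with 3% multiplicative noise (NOT real case counts)
rng = np.random.default_rng(1)
t = np.arange(-20, 30, dtype=float)
y = logistic_model(t, 1000.0, 5.0, 4.0) * (1 + 0.03 * rng.standard_normal(t.size))

def fit_mse(model, p0):
    """Fit model to (t, y) and return the mean squared error of the fit."""
    popt, _ = curve_fit(model, t, y, p0=p0, maxfev=10000)
    return np.mean((y - model(t, *popt)) ** 2)

# Exponential with b fixed to 0, so that only a and c are free
mse_exp = fit_mse(lambda t, a, c: exp_model(t, a, 0.0, c), [50.0, 10.0])
mse_logistic = fit_mse(logistic_model, [800.0, 0.0, 3.0])

print("MSE exponential: %.1f, MSE logistic: %.1f" % (mse_exp, mse_logistic))
```

A lower MSE alone does not prove that a model is right: a model with more free parameters will usually fit better, so the comparison should be read together with the parameter uncertainties.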