## Covid case/mortality analysis for 173 countries

We use data from this site:

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

Download the csv data file and save it in the working directory with the file name "ecdpc.csv".

In [1]:
import numpy as np
import statsmodels.api as sm
import pandas as pd
from datetime import datetime

Load the data

In [2]:
df = pd.read_csv("ecdpc.csv")

Simplify some variable names

In [3]:
df["country"] = df.countryterritoryCode
df = df.loc[pd.notnull(df.country), :]

Clean up the dates

In [4]:
def f(x):
    u = x.split("/")
    return "%s-%s-%s" % tuple(u[::-1])
df["date_rep"] = df.dateRep.apply(f)
df["date"] = pd.to_datetime(df.date_rep)
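As an alternative to reversing the date string by hand, the format can be given to pandas directly. The snippet below is a minimal sketch that assumes the dateRep values are laid out as day/month/year; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values in the day/month/year layout assumed above.
raw = pd.Series(["01/06/2020", "31/05/2020", "12/03/2020"])

# An explicit format removes any ambiguity between day-first and
# month-first interpretations of the string.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed)
```

With an explicit format, a value that does not match the expected layout raises an error instead of being silently misread.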

Remove meaningless rows with future dates.

In [5]:
today = datetime.today().strftime('%Y-%m-%d')
today = pd.to_datetime(today)
df = df.loc[df.date <= today]

Keep only countries with more than 30 days of data

In [6]:
n = df.groupby("country").size()
n = n[n > 30]
df = df.loc[df.country.isin(n.index), :]
print(df.country.unique().size)

202


Sort first by country, then within country by date.

In [7]:
df = df.sort_values(["country", "date"])

days is the number of calendar days elapsed since January 1, 2020 for each record.

In [8]:
df["days"] = (df.date - pd.to_datetime("2020-01-01")).dt.days

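Create variables containing the number of new cases
within each of four consecutive one-week windows ending
on the current day: cumpos0 covers the current day and
the six days before it, cumpos1 the week before that, and so on.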

In [9]:
# Sum x from d2 days back in time to d1 days back in time, inclusive of
# both endpoints.  d2 must be greater than d1.
def wsum(x, d1, d2):
    w = np.ones(d2 + 1)
    if d1 > 0:
        w[-d1:] = 0
    y = np.zeros_like(x)
    y[d2:] = np.convolve(x.values, w[::-1], mode='valid')
    return y

for j in range(4):
    # Cases reported from 7*j+6 days back through 7*j days back (one week).
    xx = df.groupby("country").cases.transform(lambda x: wsum(x, 7*j, 7*j+6))
    df["cumpos%d" % j] = xx
    # Add 1 before taking logs so that weeks with zero cases are defined.
    df["logcumpos%d" % j] = np.log(xx + 1)
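To check that the convolution computes the intended window sums, it can be compared against an explicit loop. This is a small sketch on made-up data; wsum is restated here for plain arrays, and wsum_naive is a hypothetical helper introduced only for the comparison.

```python
import numpy as np

def wsum(x, d1, d2):
    # Same logic as the notebook's wsum, restated for a plain array.
    x = np.asarray(x, dtype=float)
    w = np.ones(d2 + 1)
    if d1 > 0:
        w[-d1:] = 0
    y = np.zeros_like(x)
    y[d2:] = np.convolve(x, w[::-1], mode="valid")
    return y

def wsum_naive(x, d1, d2):
    # Hypothetical reference version: sum x[t-d2], ..., x[t-d1] for each t.
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for t in range(d2, len(x)):
        y[t] = x[t - d2 : t - d1 + 1].sum()
    return y

x = np.arange(1, 11)      # made-up daily counts 1, 2, ..., 10
fast = wsum(x, 1, 3)      # sum from 3 days back to 1 day back
slow = wsum_naive(x, 1, 3)
print(fast)
print(np.allclose(fast, slow))
```

The first d2 entries are left at zero because the window is not fully observed there.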



Calculate the date of the first death in each country,
then remove data prior to 10 days after this date.
rdays is the number of days in each country since
its first reported Covid death.

In [10]:
def firstdeath(x):
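    if (x.deaths == 0).all():
        # No deaths reported: return NaT so the merged column keeps a
        # datetime dtype; these rows get a missing rdays and are dropped.
        return pd.NaT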
    ii = np.flatnonzero(x.deaths > 0)[0]
    return x.date.iloc[ii]

xx = df.groupby("country").apply(firstdeath)
xx.name = "firstdeath"
df = pd.merge(df, xx, left_on="country", right_index=True)

df["rdays"] = (df.date - df.firstdeath).dt.days
df = df.loc[df.rdays >= 10, :]
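The first-death date and rdays can be illustrated on a toy data frame. This is a sketch using a slightly different route (filter, then take the minimum date per country), not the notebook's exact code; the countries and counts are invented.

```python
import pandas as pd

# Invented data: AAA has its first death on 2020-03-02, BBB has none.
toy = pd.DataFrame({
    "country": ["AAA"] * 4 + ["BBB"] * 3,
    "date": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04",
        "2020-03-01", "2020-03-02", "2020-03-03",
    ]),
    "deaths": [0, 2, 0, 1, 0, 0, 0],
})

# Earliest date with at least one death, per country.  Countries with
# no deaths do not appear in this Series at all.
first = toy.loc[toy.deaths > 0].groupby("country").date.min()

# Mapping leaves NaT for countries without deaths, so rdays is missing there.
toy["firstdeath"] = toy.country.map(first)
toy["rdays"] = (toy.date - toy.firstdeath).dt.days
print(toy)
```

Rows with a missing rdays fail any comparison such as rdays >= 10, so countries without a reported death drop out of the filtered data.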

Here is a basic regression analysis looking at all the countries, with data filtered
as described above.

In [11]:
fml = "deaths ~ 0 + C(country) + "
fml += "logcumpos0 + logcumpos1 + logcumpos2 + logcumpos3"
m1 = sm.GEE.from_formula(fml, groups="country", data=df, family=sm.families.Poisson())
r1 = m1.fit(scale="X2")
print(r1.summary())

                               GEE Regression Results                              
Dep. Variable:                      deaths   No. Observations:                 8586
Model:                                 GEE   No. clusters:                      173
Method:                        Generalized   Min. cluster size:                  12
                      Estimating Equations   Max. cluster size:                 123
Family:                            Poisson   Mean cluster size:                49.6
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Mon, 01 Jun 2020   Scale:                          33.046
Covariance type:                    robust   Time:                         13:29:26
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
C(country)[ABW]    -3.9114      0.063    -61.801      0.000      -4.035     

Including either relative time or calendar time in the model could capture
risk-modifying effects of interest.  For example, if there is a role for
calendar time, this could reflect a changing case/fatality ratio as the
epidemic progresses.

In [12]:
fml = "deaths ~ 0 + C(country) + bs(rdays, 5) + "
fml += "logcumpos0 + logcumpos1 + logcumpos2 + logcumpos3"
m2 = sm.GEE.from_formula(fml, groups="country", data=df, family=sm.families.Poisson())
r2 = m2.fit(scale="X2")
print(r2.summary())
print(m2.compare_score_test(r1))

                               GEE Regression Results                              
Dep. Variable:                      deaths   No. Observations:                 8586
Model:                                 GEE   No. clusters:                      173
Method:                        Generalized   Min. cluster size:                  12
                      Estimating Equations   Max. cluster size:                 123
Family:                            Poisson   Mean cluster size:                49.6
Dependence structure:         Independence   Num. iterations:                     2
Date:                     Mon, 01 Jun 2020   Scale:                          32.362
Covariance type:                    robust   Time:                         13:29:28
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
C(country)[ABW]    -4.3474      0.112    -38.814      0.000      -4.567     

{'statistic': 8.038538341636286, 'df': 5, 'p-value': 0.15412445444462763}


The variance is around 30 times the mean, as shown by the estimated scale parameters.

In [13]:
print(r1.scale)
print(r2.scale)

33.04612879864114
32.361938777767186
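For intuition about what a scale far above 1 means, the sketch below computes a Pearson chi-square dispersion estimate on simulated overdispersed counts. It assumes that the scale reported above is this kind of estimate (sum of squared Pearson residuals divided by the residual degrees of freedom); the simulated data and the intercept-only mean are illustrative assumptions, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate counts with mean 5 and variance 10 times the mean.
# For a negative binomial, variance / mean = 1 / p.
mean, p = 5.0, 0.1
size = mean * p / (1 - p)
y = rng.negative_binomial(n=size, p=p, size=20000)

# Intercept-only Poisson fit: the fitted mean is the sample mean.
mu_hat = y.mean()

# Pearson chi-square statistic divided by residual degrees of freedom
# (one parameter was estimated).
pearson_x2 = np.sum((y - mu_hat) ** 2 / mu_hat)
scale = pearson_x2 / (len(y) - 1)
print(scale)
```

A value near 1 would be consistent with Poisson variation; a value near 10 here reflects the tenfold overdispersion built into the simulation.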
