Methodology used in WPROACM

December 30, 2020

Notes on estimating excess deaths during the COVID-19 pandemic using all-cause mortality data

by

Mark S. Handcock and Bart Blackburn

These notes provide technical details on the statistical methods used to estimate excess deaths in WPR Member States during the COVID-19 pandemic using all-cause mortality data.

A PDF version of these notes is available here.

We consider the case where we have multiple time series of all-cause mortality counts from each member state for each week from January 1, 2015 to a recent date. For some states only monthly data are available; much of the description below carries over to that case with natural changes. We consider the case where we have separate reported counts for each sex and age group (typically, five-year age groups).

The primary objective is to estimate the expected all-cause mortality counts for each week from January 1, 2020 onward in the counterfactual situation where there had not been a pandemic. The excess mortality is defined to be the difference between the reported counts and the expected counts for that week.

Current Model

To fix ideas, consider the case of females, aged 65-74 years, in Australia. Let $y_t$ be the count for week $t$, with $t = 1, \ldots, T$ indexing the period January 1, 2015 to December 31, 2020. We model $y_t$ as a random variable following a negative-binomial distribution with mean parameter $\mu_t$. We make this choice rather than a Poisson distribution to account for overdispersion in the counts. The overdispersion parameter $\theta$ is itself estimated from the data and the mean parameters are modelled as

$$\log(\mu_t) = c(w_t) + s(t),$$

where $c(w_t)$ represents the annual cycle in all-cause mortality as a function of the week of the year $w_t$, and $s(t)$ is the curvilinear trend of all-cause mortality over time. The annual cycle is modeled as a cyclic cubic spline function (Wegman and Wright 1983) of time with a period of 52 weeks (that is, $c(w) = c(w + 52)$). A spline is a piecewise polynomial. Conceptually, one can imagine a high-degree polynomial capable of passing through every data point. Such a polynomial would likely overfit the observed data, meaning it may not predict well on new data. Splines instead allow many low-degree (degree three, in this case) polynomials to fit the data in pieces. This achieves a good fit to the data without the same risk of overfitting.

Specifically, $c$ is modeled as a piecewise cubic polynomial that is continuous, has a continuous second derivative, has continuous first and second derivatives at the 52-week cycle boundaries, and best fits the recorded all-cause mortality while being smooth. The specific criterion for the last feature is to choose $c$ to minimize the penalized square error (PSE):

$$\mathrm{PSE}(c) = \sum_{t=1}^{T} \bigl(y_t - c(w_t)\bigr)^2 + \lambda \int \bigl(c''(u)\bigr)^2\,du,$$

where $c''$ is the second derivative of $c$ and $\lambda \ge 0$ is a smoothing parameter, chosen to balance the closeness of fit to the recorded counts (the first term) with the smoothness of $c$ (the second term). Hence, choosing the function $\hat{c}$ that minimizes $\mathrm{PSE}(c)$ provides a balanced representation of the annual cycle: it trades the smoothness of $c$ against the closeness of fit of $c$ to the recorded all-cause mortality. Note that the traditional least-squares estimator is the minimizer with $\lambda = 0$, that is, with no penalty for lack of smoothness. The choice of $\lambda$ is subjective. In this work we choose $\lambda$ to maximize the ability to predict unrecorded all-cause mortality counts. Specifically, we use Generalized Cross Validation (GCV) (Craven and Wahba 1979) to choose $\lambda$, and the R package mgcv by Simon Wood for the analysis (Wood 2004, Wood 2017). The annual cycle so obtained is the smoothest annual cycle that maximizes the likelihood of the observed all-cause mortality.
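For reference, the GCV criterion selects $\lambda$ by minimizing an estimate of out-of-sample prediction error. A standard form, for a linear smoother with influence (hat) matrix $A_\lambda$ as in Craven and Wahba (1979), is the following; this is stated for the penalized least-squares setting above, while mgcv applies the analogous penalized-likelihood version:

$$\mathrm{GCV}(\lambda) = \frac{n \sum_{t=1}^{n} \bigl( y_t - \hat{c}_\lambda(w_t) \bigr)^2}{\bigl[ \operatorname{tr}(I - A_\lambda) \bigr]^2}.$$

Dividing by the squared trace of $I - A_\lambda$ penalizes values of $\lambda$ that use up too many effective degrees of freedom, approximating leave-one-out cross-validation without refitting the model $n$ times.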

A similar approach is taken to the curvilinear trend $s(t)$. It is modeled as a (non-cyclic) cubic spline function, specifically, as a piecewise cubic polynomial that is continuous, has a continuous second derivative, and best fits the recorded all-cause mortality while being smooth. The specific criterion for the last feature is to choose $s$ to minimize the penalized square error (PSE):

$$\mathrm{PSE}(s) = \sum_{t=1}^{T} \bigl(y_t - s(t)\bigr)^2 + \lambda_s \int \bigl(s''(u)\bigr)^2\,du,$$

where $s''$ is the second derivative of $s$ and $\lambda_s \ge 0$ is a smoothing parameter, chosen to balance the closeness of fit to the recorded counts (the first term) with the smoothness of $s$ (the second term). Hence, choosing the function $\hat{s}$ that minimizes $\mathrm{PSE}(s)$ provides a balanced representation of the trend: it trades the smoothness of $s$ against the closeness of fit of $s$ to the recorded all-cause mortality. Note that the traditional least-squares estimator is the minimizer with $\lambda_s = 0$, that is, with no penalty for lack of smoothness. Like $\lambda$, the choice of $\lambda_s$ is subjective. As for the annual cycle, we choose $\lambda_s$ to maximize the ability to predict unrecorded all-cause mortality counts by using the Generalized Cross Validation criterion. The model allows for arbitrary time-varying covariates. Including both the date and the period within the year allows the model to detect trends across years as well as within years.

Negative-binomial regression is a natural choice in that we are seeking to estimate death counts over arbitrary time frames. It is preferred to Poisson regression because it allows for overdispersion, and it can also accommodate weeks with low or zero counts without issue.
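Concretely, under the standard NB2 parameterization (the one used by mgcv's nb() family), the negative-binomial variance adds a quadratic overdispersion term that the Poisson lacks, with $\theta$ the overdispersion parameter introduced above:

$$\operatorname{Var}(y_t) = \mu_t + \frac{\mu_t^2}{\theta} \qquad\text{versus}\qquad \operatorname{Var}(y_t) = \mu_t \ \text{(Poisson)}.$$

As $\theta \to \infty$ the negative binomial reduces to the Poisson, so the model loses nothing when the counts are not overdispersed.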

This particular negative-binomial regression model is a generalized additive model (GAM) in that it uses smoothing functions for the predictor variables. Since the date and the period within the year are input as discrete values, they are smoothed using cubic splines, a common smoothing technique. The parameters and the splines themselves are found through restricted maximum likelihood estimation (REML). GAMs are an extension of generalized linear models, which generalize ordinary linear regression to allow the response variable to have error distributions other than the normal distribution (in this case, the negative-binomial distribution).
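As an illustration, a minimal mgcv call matching this description might look as follows. The data frame `dat`, its column names, and the basis dimensions `k` are hypothetical; only the spline types, the negative-binomial family, and the REML smoothing-parameter estimation come from the description above.

```r
library(mgcv)

# Hypothetical input: one row per pre-pandemic week for a single
# state/sex/age-group stratum, with columns
#   deaths : weekly all-cause death count
#   week   : position within the year (1-52)
#   time   : running week index since January 1, 2015
fit <- gam(deaths ~ s(week, bs = "cc", k = 20)    # annual cycle: cyclic cubic spline
                  + s(time, bs = "cr", k = 20),   # curvilinear trend: cubic spline
           knots  = list(week = c(0, 52)),        # enforce the 52-week period
           family = nb(),                         # negative binomial; theta estimated
           method = "REML",                       # smoothing parameters via REML
           data   = dat)
summary(fit)
```

The cyclic basis (`bs = "cc"`) enforces matching values and first and second derivatives at the 52-week boundary, exactly the continuity constraints described for the annual cycle above.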

At the moment, this model is very simple in that it uses no information other than sex, age group, and time/date. Once more data become readily available, such as influenza counts, the model can easily be extended to incorporate them. There are also other ways to enhance the model, such as using hierarchical models to share information across groupings. As such, this preliminary approach should serve as a strong starting point.

The expected counts are then forecast stochastically to represent the uncertainty in the estimate of the expected. Thus, the statistical significance of the observed counts can be determined (i.e., whether they represent a substantial increase or decrease from the baseline). One detail of the forecast is that it averages over the sampling distribution of the parameter estimates. This is a simple way to account for uncertainty in our model for the expected deaths, in addition to the sampling variation of the counts given the model parameters. We prefer this to a formal Bayesian model due to its simplicity.
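A sketch of one common way to implement such a forecast with mgcv, averaging over the approximate sampling distribution of the coefficient estimates. It assumes the `fit` object from the sketch above; `newdat` (the pandemic weeks, with the same columns as `dat`) and the number of draws are hypothetical.

```r
library(mgcv)

# Simulate counterfactual (no-pandemic) weekly counts for the pandemic period.
nsim  <- 1000
Xp    <- predict(fit, newdata = newdat, type = "lpmatrix") # linear-predictor matrix
beta  <- rmvn(nsim, coef(fit), vcov(fit))                  # coefficient draws (mgcv::rmvn)
mu    <- exp(Xp %*% t(beta))                               # expected counts, one column per draw
theta <- fit$family$getTheta(TRUE)                         # estimated overdispersion
sims  <- matrix(rnbinom(length(mu), mu = mu, size = theta),
                nrow = nrow(mu))                           # simulated weekly counts

# Pointwise 95% interval for the baseline; observed counts outside it indicate
# significant excess (or deficit) mortality for that week.
baseline95 <- t(apply(sims, 1, quantile, probs = c(0.025, 0.975)))
```

Drawing the coefficients from their estimated sampling distribution before simulating the counts is what makes the forecast average over parameter uncertainty as well as count-level variation.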

For the moment, models are fit separately to each combination of sex, age group, and state. It is possible to improve the estimation by using information from both sexes and multiple age groups simultaneously. However, this involves a bias-variance trade-off that can be explored.
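Operationally, the stratified fitting is just a loop over groups. A sketch, assuming a hypothetical combined data frame `all_dat` with `state`, `sex`, and `agegrp` columns in addition to those used above:

```r
library(mgcv)

# Fit one independent model per state-sex-age-group stratum.
strata <- split(all_dat, interaction(all_dat$state, all_dat$sex,
                                     all_dat$agegrp, drop = TRUE))
fits <- lapply(strata, function(d)
  gam(deaths ~ s(week, bs = "cc", k = 20) + s(time, bs = "cr", k = 20),
      knots = list(week = c(0, 52)),
      family = nb(), method = "REML", data = d))
```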

For countries with missing (pandemic) weeks we can stochastically interpolate using simple time-series models. If there are a significant number of missing weeks, we will use a negative-binomial model like the one above to stochastically interpolate.

An issue that may be important is adjusting for reporting delay (mainly an issue for recent weeks). To do this, information is needed on the reporting delay. In the US, the NCHS reports deaths as they are received from the states and processed; counts of deaths from recent weeks are highly incomplete, reflecting delays in reporting. These "provisional" counts are updated regularly for past weeks, and the counts are not finalized until more than a year after the deaths occur. The estimate of completeness is based on the number of weeks that elapsed between the week in which the data set was obtained and the week in which the death occurred. We can model this relationship and use it to adjust the estimates, if necessary.
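As an illustration of the adjustment described, a minimal sketch. The data frames `vintages` and `recent` and all of their column names are hypothetical; the logic (completeness as a function of reporting lag, estimated from historical vintages) follows the paragraph above.

```r
# Estimate completeness as a function of reporting lag from historical
# vintages: 'vintages' has one row per (death week, vintage) pair with columns
#   lag      : weeks between the death week and the data-set vintage
#   reported : count reported at that lag
#   final    : finalized count for the same death week
completeness <- with(vintages, tapply(reported / final, lag, mean))

# Inflate recent provisional counts by the estimated completeness at their lag.
recent$adjusted <- recent$reported / completeness[as.character(recent$lag)]
```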

A PDF version of these notes, including a validation study is available here.

References

- Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. *Numerische Mathematik*, 31(4), 377-403.
- Wegman, E. J. and Wright, I. W. (1983). Splines in statistics. *Journal of the American Statistical Association*, 78(382), 351-365. http://www.jstor.org/stable/2288640
- Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. *Journal of the American Statistical Association*, 99(467), 673-686.
- Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd edition). Chapman and Hall/CRC.