# Deaths Occurred vs Deaths Reported

In a recent episode of *More or Less*, Tim Harford and David Spiegelhalter discussed the difference between the number of deaths that occurred on each date versus the number of deaths that were reported on each date.
At the time of their discussion, news reports often quoted the latter (deaths by the date they were reported) and much less often the former (deaths by the date they occurred).

In this notebook, we investigate when "highest in" statements based on the number of deaths that were reported, rather than the number of deaths that occurred, might have under-stated Coronavirus in the UK.
It has three sections:

1. Data Wrangling, where we *wrangle* or prepare the data
2. Chart Wrangling, where we prepare the charts
3. Analysis, where we discuss what we think the charts show us

We're interested in two variables from the [Coronavirus in the UK][1] API:

* `newDeaths28DaysByDeathDate`
    This variable records the number of deaths that **occurred** on each date.

* `newDeaths28DaysByPublishDate`
    This variable records the number of deaths that were **reported** on each date.

For more information about these variables, see [*Daily and cumulative deaths within 28 days of a positive test*][2].

[1]: https://coronavirus.data.gov.uk/
[2]: https://coronavirus.data.gov.uk/details/about-data#daily-and-cumulative-deaths-within-28-days-of-a-positive-test
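The helper `print_url.make_url()` used below lives in this notebook's `src` package and is not shown here. As a rough sketch only, a request URL for the two variables might be built as follows. The endpoint, the `filters` and `structure` parameter names, and the function name `make_example_url` are assumptions for illustration, not the contents of `src`.

```python
from json import dumps
from urllib.parse import urlencode

# Assumed endpoint; not taken from this notebook's src package.
API_ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"


def make_example_url(metrics, area_type="overview"):
    """Builds a query URL asking for `date` plus each named metric (hypothetical helper)."""
    # Map each output field name to the metric of the same name.
    structure = {"date": "date", **{metric: metric for metric in metrics}}
    params = {
        "filters": f"areaType={area_type}",
        "structure": dumps(structure, separators=(",", ":")),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


example_url = make_example_url(
    ["newDeaths28DaysByDeathDate", "newDeaths28DaysByPublishDate"]
)
```

The sketch only builds the URL string and makes no request, so it can be inspected without network access.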

In [1]:
import altair
import pandas
import requests
from src import print_url

## Data Wrangling

In [2]:
json = requests.get(print_url.make_url()).json()

In [3]:
timeseries = pandas.DataFrame(json["data"])

In [4]:
def convert_initial_values_to_none(s):
    """Converts initial values (values before the first non-zero value) to None."""
    s = s.copy()
    earliest_date = s.index[s > 0][0]
    s[s.index < earliest_date] = None
    return s

In [5]:
# Test that the function returns a copy of the series
s = pandas.Series([0, 0, 1])
assert id(s) != id(convert_initial_values_to_none(s))
del s  # Remove the series from global scope

In [6]:
# Initial values are zero
assert convert_initial_values_to_none(pandas.Series([0, 0, 1])).equals(pandas.Series([None, None, 1]))
# Initial values are None
assert convert_initial_values_to_none(pandas.Series([None, None, 1])).equals(pandas.Series([None, None, 1]))
# Don't convert values after the first non-zero value
assert convert_initial_values_to_none(pandas.Series([0, 1, 0])).equals(pandas.Series([None, 1, 0]))

In [7]:
timeseries = (timeseries
    .assign(date=pandas.to_datetime(timeseries.date))
    .set_index("date")
    .sort_index()
    .loc[:, ["newDeaths28DaysByDeathDate", "newDeaths28DaysByPublishDate"]]
    .apply(convert_initial_values_to_none))

We start our analysis on the earliest date on which either the number of deaths that occurred or the number of deaths that were reported is greater than zero.

In [8]:
start_date = min(
    timeseries.index[timeseries.newDeaths28DaysByDeathDate > 0][0],
    timeseries.index[timeseries.newDeaths28DaysByPublishDate > 0][0],
)

In [9]:
timeseries = timeseries[start_date:]

In [10]:
timeseries.head()

| date | newDeaths28DaysByDeathDate | newDeaths28DaysByPublishDate |
|---|---|---|
| 2020-03-02 | 1.0 | NaN |
| 2020-03-03 | 2.0 | NaN |
| 2020-03-04 | 0.0 | NaN |
| 2020-03-05 | 3.0 | NaN |
| 2020-03-06 | 0.0 | 1.0 |


As expected, the data that show when deaths occurred (the left chart) are smoother than the data that show when deaths were reported (the right chart).

In [11]:
(altair.Chart(timeseries.reset_index().melt(id_vars=["date"]))
    .mark_line(interpolate="step", strokeWidth=1)
    .encode(x="date", y="value", column="variable"))

For both the number of deaths that occurred and the number of deaths that were reported, let's compute the rank of each day with respect to the preceding 29 days (i.e. let's use a 30-day window that ends on that day).
A rank of 1 means that the day had the highest number of deaths, or tied for the highest, in its 30-day window.
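To make the rolling rank concrete, here is a small sketch on made-up data. The series `toy`, the helper `rank_of_last`, and the 3-day window are illustrative assumptions, chosen so that the result can be checked by hand.

```python
import pandas

# Hypothetical toy data: five consecutive days of counts.
toy = pandas.Series(
    [3, 5, 4, 5, 2],
    index=pandas.date_range("2021-01-01", periods=5, freq="D"),
    dtype=float,
)


def rank_of_last(window):
    """Rank of the window's final value: 1 is highest, ties share the best rank."""
    return window.rank(method="min", ascending=False).iloc[-1]


# A 3-day window: each day is ranked against itself and the two days before it.
toy_ranks = toy.rolling(window="3D").apply(rank_of_last)
print(toy_ranks.tolist())  # → [1.0, 1.0, 2.0, 1.0, 3.0]
```

The fourth day ties with the second day for the highest value in its window, so it also receives a rank of 1.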

In [12]:
timeseries_of_ranks = (timeseries
    .rolling(window="30D")
    # Ranks values from 1 (highest) to n (lowest), ignoring NaN values.
    # Uses the "min" method, so ties have the same rank.
    .apply(lambda x: x.rank(method="min", ascending=False).iloc[-1]))

timeseries_of_ranks.columns = [f"{x}Rank" for x in timeseries_of_ranks.columns]

In [13]:
timeseries_of_ranks.head()

| date | newDeaths28DaysByDeathDateRank | newDeaths28DaysByPublishDateRank |
|---|---|---|
| 2020-03-02 | 1.0 | NaN |
| 2020-03-03 | 1.0 | NaN |
| 2020-03-04 | 3.0 | NaN |
| 2020-03-05 | 1.0 | NaN |
| 2020-03-06 | 4.0 | 1.0 |


In [14]:
timeseries_of_highests = (timeseries_of_ranks
    .apply(lambda x: x == 1)
    .assign(true=lambda x: x.iloc[:, 0] & x.iloc[:, 1])
    .astype(int))

timeseries_of_highests.columns = [f"{x}Highest" for x in timeseries_of_highests.columns]

In [15]:
timeseries_of_highests.head()

| date | newDeaths28DaysByDeathDateRankHighest | newDeaths28DaysByPublishDateRankHighest | trueHighest |
|---|---|---|---|
| 2020-03-02 | 1 | 0 | 0 |
| 2020-03-03 | 1 | 0 | 0 |
| 2020-03-04 | 0 | 0 | 0 |
| 2020-03-05 | 1 | 0 | 0 |
| 2020-03-06 | 0 | 1 | 0 |


## Chart Wrangling

In [16]:
chart_kwargs = {"width": 800}

In [17]:
reported_chart = (altair.Chart(timeseries.reset_index())
    .properties(**chart_kwargs)
    .mark_line(interpolate="step", strokeWidth=1)
    .encode(x="date", y="newDeaths28DaysByPublishDate"))

In [18]:
occurred_highest_chart = (altair.Chart(timeseries_of_highests.reset_index())
    .properties(**chart_kwargs)
    .mark_area(interpolate="step", fillOpacity=.25)
    .encode(x="date", y=altair.Y("newDeaths28DaysByDeathDateRankHighest", axis=None)))

In [19]:
reported_highest_chart = (altair.Chart(timeseries_of_highests.reset_index())
    .properties(**chart_kwargs)
    .mark_area(interpolate="step", fillOpacity=.25)
    .encode(x="date", y=altair.Y("newDeaths28DaysByPublishDateRankHighest", axis=None)))

In [20]:
true_highest_chart = (altair.Chart(timeseries_of_highests.reset_index())
    .properties(**chart_kwargs)
    .mark_area(interpolate="step", fillOpacity=.25)
    .encode(x="date", y=altair.Y("trueHighest", axis=None)))

## Analysis

Each of the following charts shows the number of deaths that were reported.

First, let's highlight the dates when the number of deaths that were **reported** was at its highest.
In other words, let's highlight a bar if it is at least as tall as each of the preceding 29 bars.
Statements such as...

> "The number of deaths that were reported on \[date\] was the highest in the preceding 29 days."

...would have been accurate.

In [21]:
altair.layer(reported_chart, reported_highest_chart).resolve_scale(y="independent")

Second, let's highlight the dates when the number of deaths that **occurred** was at its highest. Remember, this chart still shows the number of deaths that were reported. The highlight, but not the height of the bar, shows the dates when the number of deaths that occurred was at its highest.

If we compare the first chart to the second chart, then we might conclude that, in late December 2020 and early January 2021, "highest in" statements based on the number of deaths that were reported, rather than the number of deaths that occurred, under-stated Coronavirus in the UK. This was roughly when Tim Harford and David Spiegelhalter discussed the difference between these variables.

In [22]:
altair.layer(reported_chart, occurred_highest_chart).resolve_scale(y="independent")

Third, and finally, let's highlight the *true highest* dates: those when **both** the number of deaths that were reported and the number of deaths that occurred were at their highest.
This chart shows us the dates when using the number of deaths that were reported to make "highest in" statements was accurate with respect to the number of deaths that occurred.
Clearly, there weren't many of these dates in late December 2020 and early January 2021.
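One way to put a number on "not many" is to compute the share of reported-highest dates that were also occurred-highest. The sketch below uses made-up 0/1 flags shaped like `timeseries_of_highests`. The column names, the values, and the date range are hypothetical, so the result says nothing about the real data.

```python
import pandas

# Hypothetical 0/1 flags (made-up values, not the real data).
flags = pandas.DataFrame(
    {
        "occurredHighest": [1, 1, 0, 1, 0, 1],
        "reportedHighest": [0, 1, 0, 1, 1, 0],
    },
    index=pandas.date_range("2021-01-01", periods=6, freq="D"),
)

# A date is a "true highest" only when both flags are set.
flags["trueHighest"] = flags["occurredHighest"] & flags["reportedHighest"]

# Share of "reported highest" dates that were also "occurred highest".
share_accurate = flags["trueHighest"].sum() / flags["reportedHighest"].sum()

# The same count restricted to a (hypothetical) period of interest.
in_period = flags.loc["2021-01-04":"2021-01-06"]
true_highest_in_period = int(in_period["trueHighest"].sum())
```

Applied to the real flags, a low share in a given period would support the reading of the chart above.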

In [23]:
altair.layer(reported_chart, true_highest_chart).resolve_scale(y="independent")