jasondorjeshort/CovidColorado

Simple Explanation

To track true epidemic data, we'd like to look at data by day of infection. Most sources only give numbers by day of public release, but release can be delayed by days, weeks, or even up to a year. Such data appears to be "up to date", but each value is just a snapshot built as a linear combination of values from a range of past dates: the true curve is obfuscated! Colorado is unusual in that the state tracks most numbers by day of symptom onset, and by subtracting 5 days we can estimate day of infection. This may be both imprecise and incomplete, but it is a better estimate of epidemic data than is available any other way.

The issue with these numbers is that they are always incomplete. Cases with an infection date of yesterday, hospitalizations of people infected a week ago, or deaths of people infected two weeks ago will almost certainly not be reported yet. One common approach, used by the CDPHE dashboard, is to pick an arbitrary cutoff in the past at which to mark numbers as "inaccurate". But this doesn't work very well: numbers lag by a greater or lesser amount depending on test turnaround, and, even more importantly, there is really no upper bound on how long they can keep changing. Colorado routinely adds new numbers from months ago, with a few outliers as much as a year old. In generating these graphs, I have taken a simple but effective approach to combating this problem: looking at how much these numbers increase in "recent" historical data to extrapolate a confidence interval for their final values.

The graphs shown here represent a few kinds of data. Tests, cases, hospitalizations, and deaths are given by infection date. The "rates" graphs are simply one of these values divided by another, expressed as a percentage. Most charts are expressed as an average over time.

Read On

Although analytics are used here, a key goal of the project is that it is not a model predicting the future: it only tries to give an accurate representation of what has happened in the past. It has been demonstrated that the best possible model for predicting future spread is choosing two points on the log graph and drawing a line; these graphs can be used to do exactly that, with any two points you choose.

I use logarithmic graphs because contagious diseases spread exponentially, so on a log graph they should almost always end up as a sequence of line segments. Geometric rolling averages are used in some of the algorithms because spread is geometric. Geometric rolling averages are used for display on some of the log graphs because they preserve the area under the curve, which makes the straight lines come out straighter. Arithmetic averages are used for graphs that have a Cartesian scale, and for log graphs where the numbers are too small for geometric averages to work out smoothly (daily numbers of 0 can be a problem!).
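To make the distinction concrete, here is a minimal sketch in Java (the project's language) of the two averages over a 7-day window. The class and method names are mine, not the project's:

```java
// Minimal sketch of arithmetic vs. geometric rolling averages over a
// 7-day window. Names are illustrative, not from the project's code.
public class RollingAverages {
    static double arithmeticMean(double[] values, int end, int window) {
        double sum = 0;
        for (int i = end - window + 1; i <= end; i++) {
            sum += values[i];
        }
        return sum / window;
    }

    // Geometric mean: the nth root of the product, computed in log space.
    // Requires all values > 0, which is why arithmetic averages are the
    // fallback when daily numbers can be 0.
    static double geometricMean(double[] values, int end, int window) {
        double logSum = 0;
        for (int i = end - window + 1; i <= end; i++) {
            logSum += Math.log(values[i]);
        }
        return Math.exp(logSum / window);
    }

    public static void main(String[] args) {
        // Roughly 30% daily growth: a straight line on a log graph.
        double[] daily = {100, 130, 169, 220, 286, 372, 483};
        System.out.println(arithmeticMean(daily, 6, 7)); // ~251, pulled up by the later values
        System.out.println(geometricMean(daily, 6, 7));  // ~220, the midpoint of the exponential
    }
}
```

On exponential data the geometric mean lands on the middle of the curve, which is why the line segments on the log graphs come out straighter.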

A central use of the epidemic curves is to estimate which of our actions have affected the epidemic. When we view the true epidemic curve by day of infection, the results of an action should be visible (eventually) on the very day it happened. Several of the "events" shown on the graphs were chosen as things that obviously affected spread: the day the state re-closed bars and the day it mandated masks were chosen because they obviously should change the curve - and they did. Several other events on the graph (the state of emergency, the stay-at-home order, the beginning of the Black Lives Matter protests, the state's first snowfall, Thanksgiving, Christmas, and the day around half of the state moved to "red" status) are included because they seem significant, but their effects are uncertain. Third, I have pinned two days as the start and end of the "November wave", which correspond exactly to the opening and closing of Denver Public Schools - but these were chosen retrospectively, and the match could just be coincidence. Finally, V+5 is the 5th day after vaccinations started, which (when measured by day of infection) should mark the start of a roughly linear decline in hospitalization and mortality rates (but this data is still incomplete).

The Math

CDPHE gives us data archives for each day, with each day's data including cases, hospitalizations, and deaths by date of onset. Subtracting 5 days gives a good estimate of the date of infection. We end up with a three-dimensional set of data. Dimension 1 is the "day of data": the day the numbers were released. Dimension 2 is the "day of infection" (or more generally, the "day of type"): for each day of data, we have a scalar number for each day of infection. Dimension 3 is the numbers themselves (e.g., cases). The graphs here show only the most recent snapshot, i.e. the most recent day of data (though historical data is used to create the confidence intervals); all three dimensions can be represented in the software with animated graphs.
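One way to picture the three dimensions as a data structure - a sketch only, with hypothetical names; the project's actual classes may be organized differently:

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Sketch of the three-dimensional data set described above.
public class CovidData {
    // Dimension 1: day of data (release day)
    //   -> Dimension 2: day of infection (onset day minus 5)
    //     -> Dimension 3: the numbers themselves
    private final Map<LocalDate, Map<LocalDate, Numbers>> snapshots = new HashMap<>();

    public static class Numbers {
        public int cases, hospitalizations, deaths;
    }

    public void record(LocalDate dayOfData, LocalDate dayOfOnset, Numbers n) {
        LocalDate dayOfInfection = dayOfOnset.minusDays(5);
        snapshots.computeIfAbsent(dayOfData, d -> new HashMap<>())
                 .put(dayOfInfection, n);
    }

    // The graphs show only the latest snapshot for each day of infection.
    public Map<LocalDate, Numbers> latestSnapshot(LocalDate mostRecentDayOfData) {
        return snapshots.getOrDefault(mostRecentDayOfData, Map.of());
    }
}
```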

If we make a simple line graph of the data we want (cases, CFR, or R value, for instance), it will be incomplete for recent days, since it's by day of infection and recent infections are under-represented in the data. But over time these numbers will (usually) stabilize. What I do is look at how they've stabilized in the past to build a confidence interval for where they will end up in 30 days. For instance, an infection day 10 days in the past will have some cases recorded, but the number is extremely incomplete. If we look at days of data from 40+ days ago, though, we have 30 days' worth of increase for the corresponding 10-day delay. We can use those values (always multiplicatively) to get a sample of possible endpoints (in 30 days) for today's numbers (from 10 days ago). This is done after all smoothing, making the confidence interval precise with respect to the graph curve. If you look at the same graph in 30 days, it should fall within the confidence interval the given percentage of the time - assuming circumstances remain roughly the same. Circumstances are often changing, and of course the data from one day to the next is not independent. It is probably possible to predict whether the final numbers will land toward the top or the bottom of expectations based on whether current circumstances (particularly testing turnaround times) are better or worse than the average over the previous 100+ days.
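A stripped-down sketch of that extrapolation, with hypothetical names and indexing (the real code applies its smoothing first and works over the full history):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the multiplicative confidence-interval extrapolation.
// values[dayOfData][dayOfInfection] holds a (smoothed) count; names
// and layout are illustrative, not the project's actual code.
public class ConfidenceInterval {
    static double[] interval(double[][] values, int today, int delay,
                             double lowPct, double highPct) {
        int horizon = 30;
        // For each historical day of data where both endpoints are visible,
        // record how much the number at this delay grew over the next 30 days.
        List<Double> ratios = new ArrayList<>();
        for (int day = delay; day + horizon <= today; day++) {
            int infectionDay = day - delay;
            double early = values[day][infectionDay];
            double late = values[day + horizon][infectionDay];
            if (early > 0) {
                ratios.add(late / early); // always multiplicative
            }
        }
        Collections.sort(ratios);

        // Apply the sampled growth ratios to today's (incomplete) number
        // and read off the requested percentiles.
        double current = values[today][today - delay];
        double lo = current * ratios.get((int) (lowPct * (ratios.size() - 1)));
        double hi = current * ratios.get((int) (highPct * (ratios.size() - 1)));
        return new double[] {lo, hi};
    }
}
```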

Estimating test numbers by infection day is a completely unrelated calculation. A negative test has no onset/infection day, so to find positivity by onset/infection day we have to estimate one. Put simply: for each new day of data, the tests are distributed to onset/infection days using the same distribution that the cases for that day of data have. More technically, on each day of data, new cases are added for each day of infection, while new tests are added as just a single total for the day. A positivity value for that day of data is easy to calculate, and dividing the number of new cases for each day of infection by that positivity gives us an estimate of the tests for that day of infection. This test value by itself isn't particularly noteworthy, but it is important that the total number of tests is not changed by this algorithm - we're simply moving tests from the day of data to the day of infection.
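A sketch of one day-of-data's worth of that redistribution, with hypothetical names; the key property is that the per-infection-day estimates sum back to the day's test total exactly:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of distributing one day-of-data's new tests across infection
// days in proportion to that day's new cases. Names are illustrative.
public class TestDistribution {
    static Map<Integer, Double> distributeTests(
            Map<Integer, Integer> newCasesByInfectionDay, int newTestsTotal) {
        int totalNewCases = newCasesByInfectionDay.values().stream()
                .mapToInt(Integer::intValue).sum();

        // Positivity for this day of data as a whole.
        double positivity = (double) totalNewCases / newTestsTotal;

        // Dividing each infection day's new cases by that positivity spreads
        // the tests out; the estimates sum to newTestsTotal, so the total
        // test count is preserved.
        Map<Integer, Double> testsByInfectionDay = new HashMap<>();
        newCasesByInfectionDay.forEach((infectionDay, cases) ->
                testsByInfectionDay.put(infectionDay, cases / positivity));
        return testsByInfectionDay;
    }
}
```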

By comparison, R values are trivial to calculate, being just the case growth over the time period of the serial interval. The serial interval is currently assumed to be 3.96 days, per the arithmetic mean found here. For smoothing, the total cases for the week before and the week after each day are used to form a ratio, which is then raised to the power SERIAL_INTERVAL/7. Using weekly totals minimizes day-of-week issues (these appear to exist even in symptom onset data, due to placeholders). It is worth noting that the geometric serial interval is a fairly complex thing to calculate: it should not equal the arithmetic average of the data points, but should instead be solved for, in the same way the golden ratio can be solved for from the Fibonacci sequence. The counterpoint is that the serial interval doesn't really matter most of the time; we could just as easily use weekly growth rates instead of attempting to calculate an actual R value in most calculations. R is useful in determining the effect of population immunity, however: R should decrease over a time period in proportion to the percentage of previously susceptible people who either recover or finish vaccination in that period, which could possibly be the target of a future graph.
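A sketch of that calculation, assuming a simple array of (smoothed) case counts indexed by infection day; names and the exact window placement are illustrative:

```java
// Sketch of the weekly-ratio R calculation described above.
public class RValue {
    static final double SERIAL_INTERVAL = 3.96;

    // R on a given day: the ratio of the weekly total starting on that day
    // to the weekly total for the preceding seven days, raised to the power
    // SERIAL_INTERVAL/7. Weekly totals damp day-of-week artifacts.
    static double r(double[] casesByInfectionDay, int day) {
        double before = 0, after = 0;
        for (int i = 1; i <= 7; i++) {
            before += casesByInfectionDay[day - i];     // days day-7 .. day-1
            after += casesByInfectionDay[day + i - 1];  // days day .. day+6
        }
        return Math.pow(after / before, SERIAL_INTERVAL / 7.0);
    }
}
```

The weekly totals are seven days apart, so their ratio is the growth over one week; raising it to SERIAL_INTERVAL/7 converts that to growth over one serial interval, which is R.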

As of early 2021, onset date numbers stopped having variability, and appear to just be a few days before the reported date. Subtracting 5 days to get infection date no longer seems to justify the added complexity, so I'm moving everything to simply use onset date.

The Programming

The code is included as an Eclipse project - and the data as a sequence of CSVs - in sister GitHub projects. There is a lot more that can be visualized with the charts the program creates: animated GIFs show the numbers over time; numbers can be seen by date of infection, onset, reporting, or death (all of which are incomplete); county final numbers are available (only state-wide numbers are available by onset day, so the county final numbers are no different from what can be found in any other data source); the average age from infection/onset/reporting/death to CDPHE release can be found; and so on.
