Smoothing might be done wrong #30

Open

lemeb opened this issue Apr 20, 2020 · 12 comments
@lemeb

lemeb commented Apr 20, 2020

First off: this is an absolutely kick-ass and necessary project. Thank you so much for this, and for the transparency that you're putting behind this.

I have two big concerns with this initiative. The first — about color coding on the website — is addressed in a tweet here, and the obvious fix would be to adopt a color scheme on the website similar to the one used in the Matplotlib plot_rt() function, which is not nearly as binary.

The second concern, discussed below, is about smoothing (or rolling average, or time-series Gaussian filter; the terms are used interchangeably here).

The problem with centering the rolling average

Try executing this code in your notebook, first with the center=True line present, and then with it commented out.

Slightly modified notebook cell (toggled for space)

state_name = 'NY'
# We will compute the smoothed lines, and the resulting R_ts, 6 times
# First with the data up to the latest day, then up to one day before 
# that day, and so on...
day_lags = [0,1,2,3,4,5]

# Using a different name to avoid interfering with prepare_cases()
def prepare_cases_new(cases, cutoff=25):
    # Unrelated note: new cases data are available in the COVID
    # Tracking Project in the column "positiveIncrease", so this line
    # could go, in theory.
    new_cases = cases.diff()

    smoothed = new_cases.rolling(7,
        win_type="gaussian",
        # The offending line is below.
        center=True,
        min_periods=1).mean(std=2).round()
    
    idx_start = np.searchsorted(smoothed, cutoff)
    smoothed = smoothed.iloc[idx_start:]
    original = new_cases.loc[smoothed.index]
    return original, smoothed


def prepare_cases_wrap(day_lag):
    lagidx = -day_lag if day_lag != 0 else None
    # We skip cases that are after our cutoff (or lag)
    cases = states.xs(state_name).rename(f"{state_name} cases")[:lagidx]
    # The new function is inserted here
    original, smoothed = prepare_cases_new(cases)
    return {"original": original, "smoothed": smoothed}

data = list(map(prepare_cases_wrap, day_lags))

original = data[0]["original"]
original[-21:].plot(title=f"{state_name} New Cases per Day",
                    c='k',
                    linestyle=':',
                    alpha=.5,
                    label='Actual',
                    legend=True,
                    figsize=(500/72, 300/72))

for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    smoothed[-21+day_lag:].plot(
                   label=f"Smoothed with {day_lag} day lag",
                   legend=True)

fig, ax = plt.subplots(figsize=(600/72,400/72))

# Display the Rt history with the different time cut-offs
for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    hdis = highest_density_interval(posteriors, p=.9)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    plot_rt(result, ax, state_name)

ax.set_title(f'Real-time $R_t$ for {state_name}')
ax.set_xlim(original.index.get_level_values('date')[-1]+pd.Timedelta(days=-21),
            original.index.get_level_values('date')[-1]+pd.Timedelta(days=1))
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

Note: the graphs below only show the 3 most recent weeks, for clarity's sake. There is no change to the underlying computation otherwise.

That code will first produce the following graph:

[figure: cases_with_center]

Each colored line represents the smoothed cases computed with data up to [5, 4, ..., 0] days before the latest data. The issue here is that the lines vary quite wildly.

For instance, here is what the model says about the number of new cases on April 16:

  • With data up to April 16 (3-day lag), the model says there were 8930 new cases that day, an increase of 292 from April 15.
  • With data up to April 19 (0-day lag), the model says that, in fact, there were more like 8187 cases that day, a decrease of 175 from April 15.
Code to find the precise figures (click to toggle)

for date in [16, 15, 14]:
    print(f"For April {date}:")
    for x in range(6):
        try:
            new_cases = data[x]["smoothed"][f"2020-4-{date}"]
            diff = new_cases - data[x]["smoothed"][f"2020-4-{date-1}"]
        except KeyError:
            continue
        else:
            print(f"with {x} day lag: {int(new_cases)} new cases. change: {int(diff)}")
    print(f"")

In other words, the model is re-writing history every day. That's not great for something that is meant to be tracking Rt incrementally, especially since, in turn, the estimates for Rt vary wildly as well:

[figure: realtime_rt_with_center]

Note: which line corresponds to which time cutoff / lag should be pretty self-explanatory.

So, someone checking rt.live would see the following:

  • On April 14 (5-day lag): Everything is under control in NY state:
    • Most likely Rt is 0.55.
    • The 80% CI is 0.40 - 0.66, so anything above 1 is very far-fetched.
  • With the data for two additional days (3-day lag):
    • Most likely Rt for April 14 is now just short of 1, at 0.97, something that the model deemed outside of the 80% CI only two days earlier.
    • Now, the 80% CI (0.82 - 1.07) puts the previous value, 0.55, outside of it. Hell, it's even outside of the 95% CI, which I calculated to be 0.79 - 1.12.

Same story for Rt for April 16:

  • The day of (3-day lag):
    • Most likely Rt at 1.23
    • 80% CI: 1.09 - 1.33 (for the record: 95% CI is 1.06-1.38)
  • On April 19:
    • Most likely Rt for April 16 is 0.86
    • 80% CI: 0.72 - 0.97 (for the record: 95% CI is 0.68-1.02)
  • Similarly, the two numbers, computed just 3 days apart, diverge in a way that makes their CIs look meaningless.
Code for the confidence intervals (click to toggle)

for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    # Use .975 for 95% CI, .9 for 80%, .995 for 99%, etc.
    hdis = highest_density_interval(posteriors, p=.975)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    try:
        print(day_lag, result.loc["2020-4-14"]) # or "2020-4-16"
    except KeyError:
        pass

I don't really think you can run a model that updates every day like that. Think about the models from 538: when you look at them every day, you would expect the latest numbers to change because new polling has come in. What you would not expect is for the previous numbers to have changed as well, unless the underlying model had been revised overnight.

I think that's the impression that many readers will have when looking at the website day to day and seeing data from previous days change all of a sudden. They will think that the model is being massively tweaked all the time, when in fact it is working as intended.

The perks of not centering the rolling average

The solution is simple: remove the center=True line from the previous code. These are the results:

[figure: cases_no_center]

What changes in the COVID Tracking Project data every day at 4 PM is not the data from previous days, but merely the data from that specific day. The smoothing line should behave accordingly, and does so here.

[figure: realtime_rt_no_center]

As far as the Rt calculations are concerned, they are also much more stable. (I haven't done the math, but I suspect that the confidence intervals might have been incorrect before and might be correct now.)
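To make the stability claim concrete, here is a minimal sketch (on made-up counts, not COVID Tracking Project data) comparing the two settings: with the trailing window, previously smoothed values stay put when a new day of data arrives, while the centered window rewrites the last few days.

import numpy as np
import pandas as pd

# Made-up daily new-case counts (hypothetical data, for illustration only)
counts = pd.Series([100, 120, 150, 170, 210, 260, 300, 340, 390, 450],
                   index=pd.date_range("2020-04-10", periods=10))

def smooth(series, center):
    # Same call as in the notebook cell above, with center toggled
    return series.rolling(7,
                          win_type="gaussian",
                          center=center,
                          min_periods=1).mean(std=2)

for center in (True, False):
    full = smooth(counts, center)               # all 10 days of data
    lagged = smooth(counts.iloc[:-1], center)   # last day withheld
    # How many previously smoothed days changed once the new day arrived?
    changed = int((~np.isclose(full.loc[lagged.index], lagged)).sum())
    print(f"center={center}: {changed} previously smoothed values changed")

# Expected: the last few values change with center=True,
# and none change with center=False.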

Note: As the code listed at the top of the comment makes clear, I did not modify the sigma across runs. #23 references a change in sigma as something that might modify the curves, but at first glance its impact seems smaller than that of centering the rolling average.

Alternative

I'm not an epidemiologist or a statistician by trade. Maybe centering the rolling average is the right thing to do, but the inevitable result is that Rts will change in a way that will be incomprehensible to readers who expect this to be an incremental dashboard.

This is all the more true given that you're using a Bayesian approach, where it's not really great to revisit priors' priors. But maybe I'm wrong! I didn't see any formal justification of the centering, though. If you do decide to go with the centering, however, then please freeze the history.

In other words, make sure that the Rt generated for a state on April 16 will be the same whether you're checking out rt.live on April 16, 17, 18, 19, and so on. That would make the charts even more unstable, and CIs would have to be revised, but this is better than seeing the history of Rts change drastically every day.

Appendix: more results

Other results with no centering

[figures: rt_no_center, rt_ranking_no_center, rt_high90_no_center]

For reference, same images generated with `center=True`

[figures: rt_with_center, rt_ranking_with_center, rt_high90_with_center]

@k-sys (Owner)

k-sys commented Apr 21, 2020

This is a fantastic bit of feedback; thanks for writing it up. I am unhappy with the smoothing as it stands, not because it's centered per se but because of the min_periods=1 arg. Effectively, every point is a (weighted) average of 7 days, except the last few days, whose averages become concentrated on fewer points. The problem I have with just uncentering is that you effectively have the same values, just delayed a few days, which would in turn delay Rt's value by a few days. I think this is right, but I'm posting my thoughts so there can be a discussion. I'm also wondering if there might be another kind of filter that would be better than a simple windowed technique (e.g. Kalman, etc.).

Notice the delay effect below on CT's new cases:

[figure: cases]

The red line suffers from overweighting the last value, whereas the black does not; but the black line is just the red line delayed by half the window size.
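To make the "same values, just delayed" point concrete, here is a small sketch (again on made-up counts, not the CT data plotted above) showing that, away from the edges, the un-centered line is exactly the centered line shifted by half the 7-day window:

import numpy as np
import pandas as pd

# Made-up, roughly exponential daily counts (illustrative only)
counts = pd.Series(np.round(np.exp(np.linspace(3, 6, 30))),
                   index=pd.date_range("2020-04-01", periods=30))

centered = counts.rolling(7, win_type="gaussian", center=True,
                          min_periods=1).mean(std=2)
trailing = counts.rolling(7, win_type="gaussian", center=False,
                          min_periods=1).mean(std=2)

# Where both windows are full, the trailing line equals the centered line
# shifted forward by half the window, i.e. 3 days.
diff = (trailing - centered.shift(3)).iloc[6:].abs().max()
print(f"max difference after shifting by 3 days: {diff:.6f}")  # ~0

Only the first few aligned values differ (where the trailing window is not yet full); everywhere else the two lines carry the same information, offset by three days.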

Thoughts?

@k-sys (Owner)

k-sys commented Apr 21, 2020

Here's a very simple g-h filter on the data, which actually looks quite promising and suffers from none of the windowing effects. Note that I zoomed in so you can see the effect.

Notice how the g-h filter tracks closer to the true value, without the uptick at the end?

[figure: all]
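For readers who haven't met one before, a bare-bones g-h (alpha-beta) filter looks roughly like the sketch below. This is not the code behind the plot above; the gains g and h and the sample counts are placeholders picked for illustration.

import pandas as pd

def gh_filter(data, g=0.3, h=0.1, dt=1.0):
    # Basic g-h (alpha-beta) filter over a 1-D series of daily counts.
    # g and h are illustrative guesses, not tuned values.
    x = float(data.iloc[0])   # state estimate: current level
    dx = 0.0                  # state estimate: daily trend
    out = []
    for z in data:
        # Predict one step ahead, then correct with the new measurement.
        x_pred = x + dx * dt
        residual = z - x_pred
        dx = dx + h * residual / dt
        x = x_pred + g * residual
        out.append(x)
    return pd.Series(out, index=data.index)

# Example on made-up counts (illustrative only)
counts = pd.Series([100, 130, 160, 150, 210, 260, 240, 330, 390, 380],
                   index=pd.date_range("2020-04-10", periods=10))
print(gh_filter(counts).round())

Because each estimate only uses past observations, earlier values never change when a new day arrives; the trade-off is choosing g and h, which comes up again later in the thread.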

@femto113

the model is re-writing history every day. That's not great for something that is meant to be tracking Rt incrementally

The reported cases are re-writing history every day: positive tests reported on day N could easily be for a person who became infected on day N-7. Here are a couple of highlighted screenshots I happen to have at hand (for WA and TN):

[screenshot: Screen Shot 2020-04-20 at 8 43 31 PM]

[screenshot: Screen Shot 2020-04-20 at 8 45 33 PM]

@k-sys (Owner)

k-sys commented Apr 21, 2020

This is a different issue: it's about taking current counts and moving them from reporting time to onset time (which would shift them backwards). I believe that's recorded as a separate issue (and I agree with it, btw).

@kmedved

kmedved commented Apr 21, 2020

I think @k-sys's suggestion to use a Kalman filter for this is a good one. Depending on whether or not you are concerned with centering, you can use either the filter or the smoother from the Pykalman package and handle this in a couple of lines of code. I would personally recommend the smoother, as I am not concerned about older estimates changing for these purposes.

A g-h filter is fine too, as is pd.ewm; they all get you to the same place for this problem. Pykalman is nice because it has a built-in optimization method for the parameters via the .em() method, and works as a smoother as well.
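For anyone who wants to try the pykalman route, a rough sketch with a local-linear-trend state (level plus daily trend) might look like the following; the model structure, iteration count, and sample counts are my assumptions, not anything from the notebook.

import pandas as pd
from pykalman import KalmanFilter

# Made-up daily new-case counts (illustrative only)
counts = pd.Series([100.0, 130, 160, 150, 210, 260, 240, 330, 390, 380],
                   index=pd.date_range("2020-04-10", periods=10))

# Local-linear-trend model: hidden state = [level, daily trend]
kf = KalmanFilter(
    transition_matrices=[[1, 1], [0, 1]],   # level_t = level_{t-1} + trend_{t-1}
    observation_matrices=[[1, 0]],          # we observe the level plus noise
    initial_state_mean=[counts.iloc[0], 0],
)

# Let the EM step estimate the noise covariances from the data
kf = kf.em(counts.values, n_iter=10)

# smooth() uses all observations (so it re-writes history);
# filter() is causal and leaves earlier estimates alone.
state_means, _ = kf.smooth(counts.values)
smoothed = pd.Series(state_means[:, 0], index=counts.index).round()
print(smoothed)

Swapping kf.smooth for kf.filter gives the causal variant, at the cost of noisier estimates near the end of the series.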

@k-sys (Owner)

k-sys commented Apr 21, 2020 via email

@femto113

This is a different issue - this is about taking current counts and moving them from reporting time to onset time (which would shift them backwards)

My point here was that the model @lemeb is arguing for, one that doesn't change previously published values, isn't really feasible, since each day's reports should affect the calculation for multiple prior days.

@lemeb (Author)

lemeb commented Apr 21, 2020

A g-h filter is a great choice. I didn't know about it until you brought it up, but it is indeed much better than a trailing average and it preserves older estimations. So that's great.

After playing around with a simple g-h filter for a couple of hours, I'm a little baffled about which g and h to choose, as it seems that some seemingly reasonable g and h make the highest_density_interval function crash because of a lack of good highs and lows... But that's solvable anyway.

farr pushed a commit to farr/covid-19 that referenced this issue Apr 25, 2020
@srikumarks

srikumarks commented May 7, 2020

Came here because I was unhappy with the end-of-series artifacts of the Gaussian filter used in this approach.

I'm thinking that a filter that will hold on to the near-end Rt estimates is valuable (*), especially given the growth factor is exp((Rt-1)/7) applied to day-on-day case arrival values. To meet this, I've chosen to do a centered Gaussian filter on day-to-day log ratios. This filter is somewhat capricious compared to filtering the raw case rate if you look at actual case values, but since the growth model is scale-independent, that is (I think) acceptable for the purpose of estimating Rt - i.e. for an Rt of 2.5, say, it wouldn't make much of a difference to Rt itself whether k is 100 or 1000, though it will likely affect the spread around the estimate.

edit: (*) This is because Gaussian filtering (or perhaps most filtering techniques that operate on daily case arrivals) typically holds the end value beyond the end, and therefore biases the Rt estimates near the end towards 1.0, which is terribly misleading given we're trying to track it in "real time".

(pseudo julia code)

logratio(x) = [ log(x[i]/x[i+1]) for i in 1:(length(x)-1) ]
logfilter(x) = vcat([x[end] * exp(a) for a in reverse(cumsum(reverse(gaussian(logratio(x)))))],[x[end]])
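For readers who prefer Python, here is a rough equivalent of the pseudocode above, assuming gaussian() stands for a centered Gaussian filter (scipy's gaussian_filter1d here); the sigma and the sample counts are placeholders, and strictly positive counts are assumed.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def log_ratio_smooth(x, sigma=2.0):
    # Gaussian-filter the day-to-day log ratios, then rebuild the series
    # anchored at the last observed value (so the end is held fixed).
    x = np.asarray(x, dtype=float)
    log_ratios = np.log(x[:-1] / x[1:])            # log(x[i] / x[i+1])
    smoothed_ratios = gaussian_filter1d(log_ratios, sigma)
    # Value at day i equals x[-1] * exp(sum of ratios from i to the end).
    cum = np.cumsum(smoothed_ratios[::-1])[::-1]
    return np.concatenate([x[-1] * np.exp(cum), [x[-1]]])

# Example on made-up counts (illustrative only)
counts = np.array([100, 130, 160, 150, 210, 260, 240, 330, 390, 380], float)
print(np.round(log_ratio_smooth(counts)))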

Given I'm willing to accept history rewrites, is there something that would introduce another kind of systematic bias with this approach that I'm missing?

@srikumarks

Update to the previous comment: the log ratio filter is a bit too sensitive to near-end fluctuations, so if we get a spike on the last day, it becomes significant as it gets extended. To counter that, I'm now using an exponential filter followed by the log ratio filter. I'm fairly content with it for the moment.

@ryancherrmann

Would you be able to show the code with the exp filter and log ratio filter implemented? I've also been having issues near the ends (like log(0) issues, etc.) and would like to see how you implemented this. Thanks!

@srikumarks

srikumarks commented May 9, 2020

Would you be able to show the code with the exp filter and log ratio filter implemented?

You can find the code here - https://github.com/Imaginea/covid19/blob/master/rt-turing.ipynb . It's in Julia since I used Turing.jl to implement the model ... much shorter code.

Please see the "true filter" section (no filter is "true" - see note below - but the name is to make the intention clear :).
Here's an explanation -

The filters are implemented by hand as repeatedly applied simple FIR (finite impulse response) filters. The filter(x,n) function applies a [0.25,0.5,0.25] centered FIR filter to the array x, n times; doing this a few times approximates a centered Gaussian filter. Similarly, fir2filter(x) does a [0.25,0.75] FIR convolution, which gives you a causal, finite geometric decay when repeated a few times, before it too starts to look like a Gaussian. The truefilter(x,n) function applies fir2filter(x) to the raw daily counts n times, followed by the Gaussian approximation on the log ratio, also n times. In my case, I chose n=5.
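A rough Python rendering of that description (the notebook itself is in Julia, and the weight ordering in the causal step plus the end-anchored log-ratio reconstruction are my reading of the comment, so treat this as a sketch):

import numpy as np

def fir_centered(x, n=5):
    # Apply a centered [0.25, 0.5, 0.25] FIR filter n times;
    # repeated application approximates a centered Gaussian filter.
    x = np.asarray(x, dtype=float)
    for _ in range(n):
        padded = np.pad(x, 1, mode="edge")
        x = 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
    return x

def fir_causal(x, n=5):
    # Apply a causal [0.25, 0.75] filter n times (finite geometric decay).
    # The weight ordering (0.25 on yesterday, 0.75 on today) is assumed.
    x = np.asarray(x, dtype=float)
    for _ in range(n):
        padded = np.concatenate([[x[0]], x])
        x = 0.25 * padded[:-1] + 0.75 * padded[1:]
    return x

def true_filter(x, n=5):
    # Causal pre-smoothing of the raw counts, then the centered
    # approximation on the day-to-day log ratios, rebuilt from the
    # final value so the end of the series is held in place.
    y = fir_causal(x, n)
    log_ratios = np.log(y[:-1] / y[1:])
    smoothed = fir_centered(log_ratios, n)
    cum = np.cumsum(smoothed[::-1])[::-1]
    return np.concatenate([y[-1] * np.exp(cum), [y[-1]]])

# Example on made-up counts (illustrative only)
counts = np.array([100, 130, 160, 150, 210, 260, 240, 330, 390, 380], float)
print(np.round(true_filter(counts)))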

There is no "true" filter

I think the bottom line is that, given the weekly cycles embedded in the case reporting, we have to live with about a week's latency for a reliable Rt estimate. Any attempt to reduce that latency can only mean we have a filter that's already capable of predicting Rt a week into the future, which, unless no changes in behaviour have been seen over the previous couple of weeks and none are planned for the coming week, is close to clairvoyance, or at the least requires rich data from diverse sources. This means we're essentially estimating Rt for a week ago, which I think is still valuable. I only want to avoid giving the false impression that "Rt has been going down over the past few days" just because I used an inappropriate assumption for the filter.

edit: I just added this note to the notebook itself.
