
Calculate change in frequency for nodes in tree #595

Merged: 15 commits merged into master from add-delta-frequency on Apr 13, 2021

Conversation

@huddlej (Contributor) commented Mar 30, 2021

Description of proposed changes

Adds a script, Snakemake rule, and configs to calculate the change in
frequency (i.e., "delta frequency") for nodes in a given tree over a
fixed period of time. The default time period is 4 weeks which
corresponds to 4 timepoints or "pivots" in the tip frequencies data. The
workflow exports this new delta_frequency annotation in auspice JSONs
for visualization as a color-by or scatterplot attribute.
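
For reference, a minimal sketch of the linear calculation described above (variable names are illustrative, not necessarily those used in the script):

```python
def delta_frequency(frequencies, pivots, delta_pivots=4):
    """Change in a node's frequency per unit time over the last `delta_pivots` pivots.

    `frequencies` and `pivots` are aligned lists, with pivots in decimal years.
    """
    delta_time = pivots[-1] - pivots[-(delta_pivots + 1)]
    return (frequencies[-1] - frequencies[-(delta_pivots + 1)]) / delta_time
```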

Related issue(s)

Fixes #594

Testing

Tested by CI and with the Nextstrain "global" build. Examples include the scatterplot view:

image

And the tree view:

image

Based on the tree view, we may want to limit the delta frequency calculations to internal nodes instead of calculating them for all tips.

@huddlej (Contributor, Author) commented Mar 31, 2021

858d2d0 updates the logic of these annotations to exclude tips by default and also omit annotations for zero-valued change in frequencies. The tree view now looks like this:

image

Even though the tips are gray, they still obscure the pattern in the internal nodes a bit. The scatterplot view is much more helpful for identifying recently growing/declining clades:

image

This most recent commit also adds an annotation for the current frequency of each internal node with a non-zero change in frequency. This annotation allows us to make plots like this one where we can identify rapidly growing clades that are still at lower frequencies, for example.

image

@huddlej (Contributor, Author) commented Mar 31, 2021

The build from the example screenshots above is available in the staging bucket and through the Nextstrain development server.

@huddlej marked this pull request as ready for review on March 31, 2021 18:04
@trvrb (Member) commented Mar 31, 2021

Awesome! This PR looks like it does a straightforward additive change in frequency? I.e., going from 5% to 10% in 4 weeks would get labeled as dfreq = 0.65?

Per https://github.com/nextstrain/ncov/pull/595/files#diff-3f5f25a4554430e151e3be1cdc169c1fe731bd67a92232e350a8a5faeac36539R79:

node_delta_frequency = (node.frequencies[-1] - node.frequencies[-(args.delta_pivots + 1)]) / delta_time

This is not the best way to capture logistic growth. I would recommend a Fisher-Pry transform. More information can be found here:

But all you need to do is run:

[screenshot of the Fisher-Pry (logit) transform formula]

on the input frequencies and then a simple regression will give you logistic growth rate. Most people are reporting these in terms of daily logistic growth rate, where for example B.1.1.7 is ~0.04 per day in the US and ~0.08 per day in Switzerland.

growth-rates-across-countries

The above comes from working with lineage frequencies for a little while now and trying to assess their growth rates. Linear regression on Fisher-Pry transformed data has been the most robust method for me.
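
A rough sketch of the suggested approach, assuming the Fisher-Pry transform is the standard logit with a small pseudocount (the script's actual transform may differ):

```python
import numpy as np

def logit(freqs, pc=1e-4):
    """Fisher-Pry (logit) transform with a small pseudocount to avoid log(0)."""
    freqs = np.asarray(freqs, dtype=float)
    return np.log((freqs + pc) / (1.0 - freqs + pc))

def logistic_growth_rate(freqs, pivots):
    """Slope of a linear regression of logit-transformed frequencies on time."""
    slope, _intercept = np.polyfit(pivots, logit(freqs), 1)
    return slope

# Example with four weekly pivots (decimal years) for a growing clade.
pivots = [2021.16, 2021.18, 2021.20, 2021.22]
freqs = [0.05, 0.07, 0.09, 0.12]
print(logistic_growth_rate(freqs, pivots))  # logistic growth rate per year
```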

I would:

  • Fisher-Pry transform the last 4 pivots
  • Run a regression on these 4 data points
  • Report slope of this regression as dfreq
    (if resulting estimates are too noisy, we could move to ~6 pivots)

Or we could give it a different name to avoid confusion with usage of dfreq for influenza. Perhaps a text label of "logistic growth rate" and a short label of logistic_growth?

Curious for @rneher's thoughts here as well. I do see these estimates breaking down when calculating the growth from 99% to 99.9%, etc. Perhaps we'd want to remove estimates for nodes with current frequency >99%, but I'd wait to do this until we can look at the behavior.

@trvrb (Member) commented Mar 31, 2021

And separate from the issue of exactly how to calculate growth rate from a vector of node (clade) frequencies, there is the issue of how to treat tips. I see that you've made the decision not to attach a growth rate to tips. This is well justified in a lot of ways; however, I think it would be better to cast growth rates to tips. Take a look at this zoomed portion of B.1.1.7 for example:

Screen Shot 2021-03-31 at 11 45 33 AM

Every internal branch gets an estimate of growth rate, but the final branch leading to each tip does not. These 2-tip internal branches will be about as wonky as single-tip branches. This causes it to look like the internal bits of B.1.1.7 aren't growing, but they really are.

Here, I'd set a tip count threshold (perhaps something like 10) so that only internal nodes with 10 or more subtending tips would have logistic growth directly calculated. Other branches (with 9 or fewer subtending tips) would inherit the closest calculated "upstream" growth rate estimate. This will emphasize single-tip branches coming from basal expansions, but I think this is okay.

I'd also calculate current frequency for tips in addition to 2-plus-subtending-tip internal nodes. This will highlight the contribution of more recent tips.

Both these will result in (I believe) more useful scatterplots where you can actually mouseover tips.

@huddlej (Contributor, Author) commented Apr 1, 2021

This PR looks like it does a straightforward additive change in frequency? I.e., going from 5% to 10% in 4 weeks would get labeled as dfreq = 0.65?

Yes, this is just the original dfreq formulation from flu, so 5 to 10% in 4 weeks (0.0833 years) would be (0.1 - 0.05) / 0.0833 or ~0.6 delta frequency.

I would recommend a Fisher-Pry transform. [...snip...]
I'd set a tip count threshold (perhaps something like 10) in which only internal nodes with 10 or more subtending tips would have the logistic growth directly calculated. Other branches (with 9 or fewer subtending nodes) would inherit the closest calculated "upstream" growth rate estimate. [...snip...]
I'd also calculate current frequency for tips in addition to 2-plus-subtending-tip internal nodes.

Ok, I’ve added this logistic transform and regression to the delta frequency script as an option along with a user-configurable option to limit these calculations to nodes with at least 10 tips. The script propagates the values of parents to nodes that are too small to have values calculated. I reannotated the original tree that I’ve shown in the PR with a logistic_growth field and also calculated the current_frequency field for all nodes and tips. I’ll show screenshots below of this tree that is also available on the staging server. The tree still includes the original delta_frequency or "Change in frequency" annotation.
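
A hypothetical sketch of the tip-count cutoff and parent-to-child propagation described above (this is not the actual script; the node attributes and names are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    frequencies: List[float]
    tip_count: int
    children: List["Node"] = field(default_factory=list)
    logistic_growth: Optional[float] = None

def annotate_growth(node: Node, growth_fn: Callable[[List[float]], float],
                    min_tips: int = 10, parent_value: Optional[float] = None) -> None:
    """Calculate growth for sufficiently large clades; smaller clades inherit
    the closest calculated upstream value."""
    if node.tip_count >= min_tips:
        node.logistic_growth = growth_fn(node.frequencies)
    else:
        node.logistic_growth = parent_value
    for child in node.children:
        annotate_growth(child, growth_fn, min_tips, node.logistic_growth)
```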

Here is the tree view colored by logistic growth with a minimum of 10 tips required for the growth calculation.

image

One issue here is that the color scale is linear, so growth >3 looks similar across the whole tree and it is difficult to identify clades at the higher end of the distribution. The coloring of tips also obscures the signal of internal clades even if those tips have the same color as their parents. Would it be difficult to implement a log color scale in Auspice, or should we define our own scale in a different way through a config file?

Next, we can look at the scatterplot of logistic growth by sample date.

image

This view is complicated by older tips that have inherited the logistic growth values of their parent clades. We also see there is a wide range of growth values, with the most extreme values representing small clades (those with <10 tips). This pattern seems consistent with the logit transformation's greater emphasis on small changes at the tails of the input distribution. Just for comparison, I repeated these growth annotations without propagating values from parents to children and got the following view that more clearly highlights rapidly growing smaller clades like the one at the top. I understand the need to include tip annotations in the tree view, though. For this kind of analysis, I wish that the scatterplot view had a way to toggle tips on and off the way it can toggle branches.

image

Back to the original tree with tip annotations, if we plot growth by current frequency, as shown below, we still see the wide range of growth values on the left, but we can also more clearly identify medium-sized clades (by current frequency) that have larger growth values (like 501Y.V1).

image

We can use this view to zoom into a recent 501Y.V1 subclade with higher growth. From the tree and map views, we can see that this growing clade has a declining subclade with an ORF1b:P218L mutation in West Africa and a growing subclade with an N:R204P mutation in Australia and central Europe.

image

One issue that you hinted at before is that propagating growth values to smaller clades can give the appearance that these small clades are doing as well as their parents when we really don’t know. This pattern is easier to see with the linear dfreq measurement and linear color scale for the following clade where its very small subclades appear to have the highest change in frequency, but they’ve really inherited their parent’s value.

image

Finally, here is the correlation between the logistic growth values and linear change in frequency values. For early detection of small clades that are rapidly growing, the logistic growth values provide a much better signal.

image

If this looks like a better direction, I'll update all the Auspice config files to reference these new logistic growth and current frequency fields.

@rneher (Member) commented Apr 2, 2021

Thanks a lot for pushing this, @huddlej !

The large negative values after log-transform could be driven by the KDE smoothing kernel and will depend rather sensitively on the value of pc. Maybe we should clamp the range of valid values.

The behavior of frequency trajectories near one is likely never perfectly logistic. Geographic and demographic penetration will slow a variant down before it reaches 100%. But logit should still be a better transform than the simple log that we have used in the past. The log transform has the advantage, though, that variants receive smaller growth values once they become common (growth from 90 to 95% is arguably not terribly important).
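
To make that trade-off concrete, here is a quick comparison (not from the PR) of how the two transforms score a variant moving from 90% to 95%:

```python
import numpy as np

f0, f1 = 0.90, 0.95
log_change = np.log(f1) - np.log(f0)                          # ~0.05: log damps growth near fixation
logit_change = np.log(f1 / (1 - f1)) - np.log(f0 / (1 - f0))  # ~0.75: logit still registers it
print(log_change, logit_change)
```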

Overall, frequencies are normalized such that they sum to one at each pivot, right? I have been thinking of using the explicit derivative of your smoothing kernel, but this is complicated by unequal weighting pivot by pivot.

@huddlej (Contributor, Author) commented Apr 2, 2021

The large negative values after log-transform could be driven by the KDE smoothing kernel and will depend rather sensitively on the value of pc. Maybe we should clamp the range of valid values.

That makes sense. For the figures above, I used the default value from the diffusion frequencies code (pc=1e-4). With pc=1e-3, we get the following range:

image

With pc=5e-3, we get the following range:

image

Overall, frequencies are normalized such that they sum to one at each pivot, right? I have been thinking of using the explicit derivative of your smoothing kernel, but this is complicated by unequal weighting pivot by pivot.

That's right. We could always switch to a more flexible implementation though.

@trvrb (Member) commented Apr 4, 2021

Thanks for all the detailed work here @huddlej. I'm starting with a replication check using different methodology to confirm that the final numbers make sense. Here, I looked at the base of B.1.1.7, which has a logistic growth rate of 4.0 in your analysis (https://dev.nextstrain.org/staging/ncov/global-with-dfreq?c=logistic_growth&f_clade_membership=20I/501Y.V1&l=scatter). It looks like this is measured in num_date units and so represents a logistic growth rate of 4.0 per year.

If I take the same 3984 strains present in your dataset and run my logistic rate estimate I get:

frequencies-B 1 1 7

global-projection

This rate of 0.04 per day corresponds to a rate of 13.0 per year, so about 3X what you have.

(This is a slightly odd comparison in that the data goes until Feb 15 in this example, but the final pivot is March 22, so the final 5 weeks in the PR example are almost all projection from the KDE.)

I followed up here by directly looking at the final 4 pivots of clade 501Y.V1 via Auspice which are 27%, 29%, 30%, 32%. Logit transform followed by linear regression gives a slope and logistic growth rate of 1.1 per year.

I think I need to run things myself to dig in deeper here to figure out what's going on with the difference between values of 1.1 and 4.0, but do you have any thoughts on what may be causing this John?

@trvrb (Member) commented Apr 4, 2021

Also, I think we're going to want 6 pivots (weeks) back. If you look at the latest live dataset, the final 2 weeks have ~30 genomes and the 2 weeks before this have ~330. I'm comfortable making estimates using 4 weeks of data, but the most recent two weeks shouldn't count for much here. We see consistently that the nowcast of B.1.1.7 frequency based on KDEs is less than "real" frequency once the data comes in. For example: https://nextstrain.org/ncov/global/2021-03-16 has pivot from 2021-02-28 at 36%, while https://nextstrain.org/ncov/global/2021-03-01 has this same pivot at 25%.

Either in addition to or instead of this, we could also pull back the latest pivot we calculate in general. This may actually be more principled, as we don't trust the current estimate. Maybe lop off the final pivot in current tip frequencies and move to 5 pivots back for growth estimates?

@huddlej (Contributor, Author) commented Apr 5, 2021

(This is a slightly odd comparison in that the data goes until Feb 15 in this example, but the final pivot is March 22, so the final 5 weeks in the PR example are almost all projection from the KDE.)

I think this is a bug with how I ran the build on March 22 without downloading the latest GISAID data. Hopefully, this won't be an issue in production, but I'll double-check.

I followed up here by directly looking at the final 4 pivots of clade 501Y.V1 via Auspice which are 27%, 29%, 30%, 32%. Logit transform followed by linear regression gives a slope and logistic growth rate of 1.1 per year.

I think I need to run things myself to dig in deeper here to figure out what's going on with the difference between values of 1.1 and 4.0, but do you have any thoughts on what may be causing this John?

It seems like the issue might be differences in how we calculate our date units. One small point is that the delta frequency script asks for "4 pivots back from the last pivot" to get the frequency 4 weeks back, so we end up with 5 data points to fit.

I attached a Jupyter notebook showing the frequencies and pivots for 501Y.V1 and how the model fitting works for those data:

In your example of 32% latest frequency and earliest of 27% frequency, the delta time between those pivots is 0.064. The logit transform of those values is logit(0.32) = -0.75 and logit(0.27) = -0.99. So, the slope between those two points is (-0.75 - (-0.99)) / 0.064 = 3.75. I've rounded all of the inputs here, but in the more precise calculations in the notebook the value is closer to 3.93.
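
For anyone following along, the two-point version of that calculation is easy to reproduce (using unrounded logits, which gives ~3.76 rather than the ~3.75 from the rounded values above):

```python
import numpy as np

def logit(x):
    return np.log(x / (1 - x))

delta_time = 0.064                                # ~4 weekly pivots in decimal years
slope = (logit(0.32) - logit(0.27)) / delta_time
print(slope)                                      # ~3.76 per year
```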

Either in addition to or instead of this, we could also pull back the latest pivot we calculate in general. This may be more principled actually. As we don't trust the current estimate. Maybe lop off the final pivot in current tip frequencies and move to 5 pivots back for growth estimates?

That is a great idea. With this change, the --delta-pivots argument of the current script would change to be "pivots back from the last pivot" and we'd add an argument like --drop-last-pivots that defaults to 1?

@trvrb (Member) commented Apr 5, 2021

Thanks for the well thought out notebook @huddlej. This really makes things clear. Seeing how close to a perfect fit the linear regression is on the logit scale is really comforting. There's perhaps a bit of evidence for the final pivot being slightly below expectation, but it's really not as big of an effect as I was expecting:

Screen Shot 2021-04-05 at 2 49 41 PM

With this change, the --delta-pivots argument of the current script would change to be "pivots back from the last pivot" and we'd add an argument like --drop-last-pivots that defaults to 1?

I was proposing an "upstream" change so that when tip-frequencies.json is calculated we'd move from a --max-date of "today" to 1 week back from today here: https://github.com/nextstrain/ncov/blob/master/workflow/snakemake_rules/common.smk#L150. I.e., we'd want to correct the Auspice view so that the current most recent pivot isn't shown, as we don't trust this pivot.

That said, the current B.1.1.7 example looks pretty spot-on without the final pivot being obviously off. I'm not sure I'd worry about this at the moment.

Given the notebook, I'm quite convinced about the logistic growth calculations and think the current parameters (4 pivots back, so 5 pivots total) are probably good.

In your example of 32% latest frequency and earliest of 27% frequency, the delta time between those pivots is 0.064.

Ah! Yes. I messed up with the delta time and was off by 1 week vs 4 weeks to give the factor of 4 error. Thanks for catching this.

@trvrb (Member) commented Apr 5, 2021

Also, thanks for all the work implementing the tip count cutoff for calculation of growth rate. I was a bit surprised by this behavior here:

Screen Shot 2021-04-05 at 3 07 04 PM

where the highest growth rate clade is just a clade of 10 tips with no nucleotide mutations and nothing particularly special about it.

Screen Shot 2021-04-05 at 3 08 11 PM

It looks like it just happens to have 10 recently sampled tips associated with it, and so there is apparent clade growth:

Screen Shot 2021-04-05 at 3 12 35 PM

Maybe the only recourse here is to have a larger cutoff value? Like 15 instead of 10? Because even if we didn't propagate internal node growth rates to the tips, this would still register as an internal node with 10 tips and a high growth rate.

Or perhaps dropping the final pivot would help in this scenario? I can imagine these small clades being influenced by a particular very recent sample pulling up the final pivot.

@huddlej (Contributor, Author) commented Apr 5, 2021

I was proposing an "upstream" change so that when tip-frequencies.json is calculated we'd move from --max-date of "today" to 1 week back from today

Ah, that seems much nicer!

Maybe the only recourse here is to have a larger cutoff value? Like 15 instead of 10? Because even if we didn't propagate internal node growth rates to the tips, this would still register as an internal node with 10 tips and a high growth rate.

Or perhaps dropping the final pivot would help in this scenario? I can imagine these small clades being influenced by a particular very recent sample pulling up the final pivot.

I bet dropping the final pivot would help. Initially, I tried setting a minimum current frequency of 1% instead of a fixed number of tips (after talking with @kistlerk about her similar work). But it turned out that viewing these data with current frequency on the x-axis helped focus on clades that are large enough to consider their high logistic growth as more relevant without applying any filters during the analysis.

I will re-run the analysis with the latest available data so the last pivot and available samples are more closely matched. We can see from those results whether we need to modify any filter thresholds or alter the pivots.

@trvrb (Member) commented Apr 5, 2021

Thanks John. I had run this trial build https://github.com/nextstrain/ncov/actions/runs/717668977 from commit 53053e2. You should be able to work from this:

nextstrain build --aws-batch --attach fb03df82-6055-4f0c-90ff-45329d713f5c .

I haven't looked at the results here yet.

@huddlej (Contributor, Author) commented Apr 6, 2021

@trvrb I ran a fresh build with the latest available GISAID data, to see if logistic growth annotations were less likely to pick up spurious signals from KDE distributions. As a sanity check, I inspected the frequency distribution of strains by their submission date. Although there were strains in the tree from early April (a couple of days ago), 48% of frequencies at the last timepoint came from strains submitted >7 days ago and 21% submitted more than 1 month ago.

image

This distribution suggests that calculating frequencies from one week back as the maximum date would be a better implementation. Next, I looked at the logistic growth values by sampled date and current frequency. The same issue that we noticed in the earlier analysis appeared here where the clade with the highest logistic growth was a small one (N=10 strains) with two more recent samples whose KDE distributions most likely drove the apparent recent growth.

image

To address this issue, I added a configuration parameter for frequencies called recent_days_to_censor that allows users to dynamically set the maximum date for frequency estimation such that the most recent days are excluded.

Without defining this parameter, its value defaults to 0 and the max date is 2021.2616438356165. When I set this parameter value to 7 days, the max date is 2021.2424657534248. I recalculated frequencies and logistic growth and still got the same results with the highest growth from the small clade above. I could move the max date back even farther, but this would seem to throw away information about larger clades that we care about.
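
For reference, the censored max date follows directly from the 7-day offset (a quick check, using a 365-day year consistent with the 0.00274 years/day figure in the aside below):

```python
days_per_year = 365
default_max_date = 2021.2616438356165     # today's date in decimal years
recent_days_to_censor = 7
print(default_max_date - recent_days_to_censor / days_per_year)  # ~2021.2424657534, matching the value above
```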

(As an aside, while I was inspecting the frequencies command, I noticed that the KDE bandwidth is 0.05. At 0.00274 years per day, this bandwidth corresponds to 18 days. That seems kind of wide now that we have such densely sampled trees.)

Other options to tune parameters and avoid these spurious signals would be to:

  • increase the numerical precision threshold from 1e-3 to 1e-2 to limit extreme growth values
  • increase the minimum number of tips required for growth calculations
  • set a minimum current clade frequency for growth calculation
  • limit growth calculations to branches with amino acid mutations

I reverted the frequencies censoring back to the original behavior and then tried increasing the numerical precision threshold to 1e-2, as a way to avoid censoring data while limiting extreme logistic growth values. This change capped logistic growth values at roughly 12 at the upper end and roughly -7 at the lower end. The resulting tree view more clearly highlights recently growing clades.

image

The highest growth clades with current frequencies >1% are actually reasonable clades with many amino acid mutations.

image

Some larger clades with high growth rates did not have amino acid mutations.

image

Their growth appeared to be driven by smaller subclades with a couple of recent samples.

image

This result suggests that censoring the last couple of days in frequency calculations could still be helpful. After adding censoring of the last 7 days back into the frequency calculations, the same larger clade still has a relatively high growth rate, but the subclade that appeared to be driving that growth is no longer solely responsible now that its two latest strains do not contribute to frequencies.

image

For what it is worth, this approach of censoring the last 7 days for frequency calculations and increasing the numerical precision threshold seems to produce the clearest signal from the tree view, too.

image

I’ve pushed this latest tree to the staging server, if you want to explore it more.

@trvrb (Member) commented Apr 7, 2021

Nice progress here John. However, I do think that the shrinkage from using pc=1e-2 is too strong. See for example clade 20H/501Y.V2:

Screen Shot 2021-04-06 at 9 45 57 PM

The base of the clade is estimated to be growing, as expected from just looking at the frequency plot. However, subclades within it quickly dive to 0. My desired behavior would be a tree that's much more blocky. I.e., we expect subclades within 20H/501Y.V2 to be growing at the same rate as the parent clade unless we have good evidence otherwise.

I'd suggest to go in the direction of something like pc=1e-4 but then having a tip count cutoff of something like ~25.

@huddlej (Contributor, Author) commented Apr 7, 2021

Ok, I see what you mean, @trvrb. Here is the same view for 501Y.V2 using pc=1e-4 and min_tips=25.

image

I added the frequency threshold (pc) as a command line argument to the delta frequency script and as an option in the Snakemake config.

The "logistic growth" and "current frequency" fields now have proper human-readable names now, too.

@trvrb (Member) commented Apr 10, 2021

Thanks John! This is behaving rather nicely I think. I'm quite happy with the "blockiness" of the tree where moderately sized clades are being estimated to have similar logistic growth rates. A couple observations:

The clade with the highest logistic growth rate is a subclade of 20I/501Y.V1, and generally, as expected, subclades of 20I/501Y.V1 rank above most other circulating viruses. You can see the growth of 501Y viruses here: https://nextstrain.org/staging/ncov/global-with-dfreq?c=gt-S_501&d=tree,frequencies&l=scatter&p=full&scatterY=logistic_growth, which is really quite striking.

501Y

I'm mostly okay keeping "Change in frequency" in, but it does have some odd behavior and I prefer how logistic growth is working. For example take a look at this parent clade of 20H/501Y.V2.

delta-freq

You can see that this parent clade (part of 20C, in green) is estimated at a higher "Change in frequency" than the 20H subclade (in orange). However, I think this is just because the 20C parent clade is at higher frequency and getting carried along by a rising 20H subclade. If we look at logistic growth, it nests in as expected and is mostly lower than 20H.

logistic

This parent clade to 20H is actually the thing with the highest "Change in frequency" and rather stands out in the scatterplot above all the 501Y viruses. (This is the main thing that gives me pause about including this metric: people will see clades like this one and think something is going on when it isn't.)

My only remaining complaint is the color ramp: for logistic growth, the vast majority of nodes have values greater than -10, but because the minimum value is around -30, the color ramp is compressed to basically oranges and reds. I think we should supply a min and max in the Auspice config JSON to ramp from -10 to 10.

The other way to tackle this would be to increase logistic_growth.frequency_threshold a bit so that numbers don't get quite this negative. But I'm worried about washing out all small clades to 0 as happened before if this increases too much.

This deserves a bit of testing, which I'll try to do.

@trvrb (Member) commented Apr 11, 2021

As specified in the commit above, I thought that linear delta frequency was just too confusing. For example, parent clades of B.1.1.7 were consistently getting assigned higher values of delta frequency. Generally, I think it is too biased toward assigning high values to high-frequency clades, which is exactly what we don't want to target.

I spent a fair bit of time playing around with different parameters for the logistic growth calculation. There were a few broad considerations:

  1. There is a trade-off between catching the growth of biologically distinct, newly arisen clades and geographic sampling artifacts that cause a clade to look like it's suddenly taking off. My best strategy here was to increase the minimum clade count to 50 subtending tips.
  2. The other sampling effect was a clade that's generally declining but then gets a bump due to sampling of some recent sequences. The best strategy I came up with here was to increase delta pivots from 4 to 6. This increase did not hide important new clades, but instead dealt with older clades "curving" in their trajectories.
  3. A major failure mode was basal clades that take off (or have subclades, like 20B, that carry them) and then have subclades branching directly off the basal clade that don't hit the minimum thresholds. Because parent estimates propagate down to daughter nodes, this caused a bunch of dead 20B clades to have some of the highest logistic growth estimates. I dealt with this by adding a 50% threshold for "max frequency" above which a clade receives a logistic growth estimate of nan. This causes old clades in the tree to be grayed out in Auspice, and I think it is generally an okay strategy, but it would be nice if there were a way to fix these non-propagating clades more directly.
  4. Some old clades have completely died out near the present. With pseudocounts applied to the logit transform, these dead basal clades were assigned logistic growth rates of 0, which placed them in the middle of the pack when they really should be ranked poorly. To deal with this, I added a criterion that sets the logistic growth rate to nan for clades with a current frequency of less than 0.0001. This basically removes the need for finessing the pseudocount (pc) passed to the logit transform function. (A sketch combining these cutoffs follows this list.)
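
A hypothetical sketch of how these four cutoffs might combine per node (parameter names are illustrative; the actual script may structure this differently):

```python
import math
from typing import Optional

def filtered_logistic_growth(growth: float, current_frequency: float,
                             max_frequency: float, tip_count: int,
                             min_tips: int = 50, max_frequency_cutoff: float = 0.5,
                             min_frequency_cutoff: float = 0.0001) -> Optional[float]:
    """Apply the max/min frequency and minimum clade size cutoffs described above."""
    if max_frequency > max_frequency_cutoff:
        return math.nan   # clade already peaked above the cutoff; grayed out in Auspice
    if current_frequency < min_frequency_cutoff:
        return math.nan   # clade has effectively died out near the present
    if tip_count < min_tips:
        return None       # too small to calculate directly; inherit the upstream value
    return growth
```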

I'm pretty happy with the results. By my eye, the trees generally have the appropriate behavior for known clades, and also when clicking around to look directly at frequency behavior.

Results are available at:

@trvrb (Member) commented Apr 11, 2021

I'm sure this could use some further finessing, but I'm happy enough with things as they stand that I'd now be okay merging. Thanks for all the detailed work on this PR John.

@rneher (Member) commented Apr 11, 2021

I think this generally works very nicely. But it does generate some odd dead branches:
image

I am wondering whether a more robust approach would be to consider only clades with tips in the last 3 months or so.

@trvrb (Member) commented Apr 12, 2021

Thanks Richard. Your example here is the edge case I was having the most trouble dealing with. It's been difficult to get the small basal clades that don't hit the 50-tip cutoff to register appropriately. If we want the basal 20I/501Y.V1 clades to be appropriately red, it's really difficult not to also have the basal clades of the parents of 20I be inappropriately red. The actual parent clade to 20I/501Y.V1 should have high logistic growth, but small offshoots of this parent lineage should not.

That said, a cutoff for only calculating logistic growth rates for clades with tips sampled in the last ~3 months seems like it could be a generally helpful thing. Worth experimenting with.

Also, I see that the merged PR is failing the CI tests. Should look into this as well.

Adds a script, Snakemake rule, and configs to calculate the change in
frequency (i.e., "delta frequency") for nodes in a given tree over a
fixed period of time. The default time period is 4 weeks which
corresponds to 4 timepoints or "pivots" in the tip frequencies data. The
workflow exports this new `delta_frequency` annotation in auspice JSONs
for visualization as a color-by or scatterplot attribute.

Fixes #594
Do not annotate change in frequency for tips by default (but allow users
to override this with a flag) and do not emit entries for nodes with
zero-valued delta frequencies. The resulting new annotations only color
internal nodes and branches with non-zero delta frequency values which
allows us to more rapidly identify clades of interest.
huddlej and others added 12 commits April 12, 2021 11:06
Adds a `--method` argument to the delta frequency calculation script and
logic to calculate a regression from the logistic transform of
frequencies. The resulting annotation estimates logistic growth for
clades. This new annotation is included as "logistic_growth" in the
auspice JSON along with the original linear annotation, for comparison.
Limits delta frequency annotations to nodes in the tree that are
currently at >=1% frequency and propagates the delta frequency (or
logistic growth) values of these internal nodes to their lower-frequency
children. Also, separates the logic for annotating the current frequency
of each node such that tips have their own frequency annotations. These
changes should help us identify clades of interest from both tree and
scatterplot views.
…calculations

Adds a configuration parameter, `recent_days_to_censor`, to the top-level
`frequencies` configuration group. When this value is defined, the max date used
for frequencies is today's date minus the number of days requested. The value
defaults to 0, for backward compatibility.
Increase the threshold for values in the logit calculations used for logistic
growth. This change has the effect of restricting the range of logistic growth
values such that small clades with a few recent samples do not appear to dominate
the signal of rapidly growing clades.

     args = parser.parse_args()
@@ -116,7 +115,7 @@ if __name__ == "__main__":
                 # transform (as when frequencies equal 0 or 1).
                 y_frequencies = logit_transform(
                     node.frequencies[first_pivot_index:],
-                    pc=1e-4
+                    pc=1e-2
                 )

                 # Fit linear regression to pivots and frequencies and use the
Adds config and Snakemake param for the frequency threshold parameter of
logistic growth calculations. Also, adds a separate logistic growth section to
the config, so these calculations can be parameterized differently from linear
delta frequency calculations. Sets minimum tips to 25, to make sure annotations
only occur for clades that are large enough to warrant inspection.
Parent clades of B.1.1.7 were consistently getting assigned higher values of delta frequency. Generally, I think it is too biased toward assigning high values to high-frequency clades, which is exactly what we don't want to target.
1. Increase min tips requirement
2. Increase delta pivots
3. Include max frequency cutoff
4. Include min frequency cutoff
…te logistic growth

The latest CI builds have fewer samples than previous CI builds, so the
new requirement that nodes have >=50 tips to calculate logistic growth
is never true for the CI data. This leads to a bug in the logistic
growth script where the root node tries to inherit its logistic growth
value from a parent it doesn't have. This commit adds a specific check
for the existence of a parent node before trying to pull from that node
and assigns a default missing value in the edge case where the parent
does not exist.
Instead of hardcoding the minimum frequency for clades to get logistic
growth calculated, expose this value as a parameter on the command line
and in the workflow configuration. This change should allow us to tune
parameters more easily.
Reduces the stringency of default logistic growth settings such that
values can be calculated for the smaller tree in the CI build and we
make sure the growth script is tested properly.
@huddlej (Contributor, Author) commented Apr 12, 2021

Also, I see that the merged PR is failing the CI tests. Should look into this as well.

@trvrb, I rebased this branch onto master so we're up to date with the new CI dataset on master. 2f875d6 fixes the CI bug and 41e60ea enables logistic growth for the CI build by making the internal node requirements less strict.

I'm sure this could use some further finessing, but I'm happy enough with things as they stand that I'd now be okay merging.
...
That said, a cutoff for only calculating logistic growth rates for clades with tips sampled in the last ~3 months seems like it could be a generally helpful thing. Worth experimenting with.

Do you want to continue experimenting in this branch or merge and then continue experimenting separately?

It seems like a higher minimum current frequency threshold (instead of a high minimum number of tips) would provide a way to get at the "clades with tips in the last 3 months" approach. 2130500 adds the minimum frequency as config parameter that we could tune for this purpose.

Only calculate logistic growth for clades with frequency greater than a threshold (set to 1e-06). This effectively only calculates logistic growth for clades sampled in the previous 3 months.
@trvrb (Member) commented Apr 13, 2021

Thanks John! I just slightly updated the logic that applies the minimum frequency cutoff so that it effectively only calculates logistic growth for clades with sampled tips in the previous three months.

I've updated the outputs here:

I think this is a definite improvement and there's no longer confusion about what it means to have positive logistic growth of a clearly dead clade. I'm quite confident we should merge this now.

@trvrb (Member) commented Apr 13, 2021

This did reveal one idiosyncrasy in how augur frequencies currently works. I'll describe it here, and it can become a separate issue if you agree that the behavior should be updated. Running the above results in situations like:

recent

where very recent tips are gray. These are generally viruses sampled in the past week. I looked into things and these viruses (like "Australia/NSW4424/2021" in global above) have a collection date after the --max-date handed to augur frequencies. This results in a frequencies vector in tip-frequencies.json that looks like:

 "frequencies": [
      0.0,
      0.0,
      0.0,
      ...
      0.0,
      0.0,
      0.0
]

with all zeros. This is not what I would have expected. I'd expect --max-date to define the final pivot in the frequencies vector, but I'd expect tips in the --metadata with collection dates after this date to still be counted. Same logic for --min-date.

I don't think this makes a big difference, however, but it was surprising to me and slightly shifts clade frequencies.

@trvrb merged commit 58df6f4 into master on Apr 13, 2021
@trvrb deleted the add-delta-frequency branch on April 13, 2021 03:59
@trvrb (Member) commented Apr 13, 2021

(I just realized I was supposed to update the change log as part of this PR, with this as a feature. I'll edit master instead. Sorry about that.)

@batson commented Apr 28, 2021

Love this new feature, @trvrb and @huddlej .

For use cases with time periods where the absolute number of samples is low and/or the variant is at very high or very low frequency, a weighted least-squares regression in logit space may be more appropriate. (The strategies you have taken in terms of thresholding may also address these issues, but it may be worth trying for your offline analyses @trvrb.)

The loss function would be

sum_i (f(x_i) - y_i)^2 / var(y_i | x_i)

where x = week, y = logit(frequency of variant in week), and var(y|x) is due to finite sample effects, rescaled by the squared derivative of the logit. This is implemented in statsmodels and sklearn.

This can help with time windows lacking good data and also with low/high probabilities, while still being as fast as linear regression.

Back-of-the-envelope:

Finite sample correction: if you observe k out of n, then the variance of k/n is

k(n-k)/(n*n*(n+1))

(If n is very low, you may want to use a beta(1/2,1/2) prior, which amounts to adding 0.5 to k and 1 to n.)

Jacobian of the logit: the logit rescales a frequency x to log(x) - log(1-x), with derivative 1/x + 1/(1-x). Plugging the observation k/n into this formula gives n/k + n/(n-k).

Putting this all together gives a variance in the logit observation of approximately

var(y|x) = k(n-k)/n * (1/k + 1/(n-k))^2

When k is small and n is large, this is about 1/k; when k = n/2, this is about 4/n.

The appropriate weights for the regression would be the reciprocals of those. So an error at small k would be weighted by about k, whereas an error at large k would be weighted by about n/4.
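
A minimal sketch of this weighted regression with statsmodels, using made-up weekly counts (k observed out of n sampled) purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

weeks = np.array([0, 1, 2, 3, 4])
k = np.array([3, 5, 9, 14, 20])          # variant counts per week (made up)
n = np.array([120, 110, 130, 125, 115])  # total samples per week (made up)

# Smoothed frequencies (adding 0.5 to k and 1 to n, per the beta(1/2,1/2) prior above) and their logits.
p = (k + 0.5) / (n + 1.0)
y = np.log(p / (1 - p))

# Back-of-the-envelope variance of the logit observation: var(k/n) * (d logit/dx)^2.
var_y = k * (n - k) / n * (1.0 / k + 1.0 / (n - k)) ** 2

# Weighted least squares with weights proportional to the inverse variance.
X = sm.add_constant(weeks)
fit = sm.WLS(y, X, weights=1.0 / var_y).fit()
print(fit.params)  # [intercept, logistic growth rate per week]
```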
