Use weighted sampling #1141

victorlin · 2024-08-14T18:00:49Z

Note

#1106 came first. This is a higher level summary written after some design discussions happened in that PR.

Context

There is much sampling bias in SARS-CoV-2 data. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.

Currently, this is approximated by sampling with various group_bys - a combination of geographic (division/country) and temporal (month/week) attributes - to define groups that are then uniformly sampled based on a target max_sequences.

The need for uniform sampling at the group level is an inherent limitation of augur filter. It has prompted workarounds in this workflow such as #1074.

Proposal

There is a proposal to remove the limitation of augur filter: nextstrain/augur#1318. The option to specify sampling weights could be directly used in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.

Case count data was also considered as a potential source of weights; however, it was determined that population data would be a better source. See discussion: #1106 (comment)

¹ weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries would still be under-sampled, resulting in fewer total sequences than requested by max_sequences. This is already the case with current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.

Progress

Allow weighted subsampling augur#1318
Use weighted sampling for Asia builds #1106
- I chose these builds to directly address the workaround applied in Update subsampling #1074
Fix Asia weighted sampling #1150
Use weighted sampling for other builds #1151
Figure out what to do with global 1m/2m/6m builds
- Use weighted sampling for recent samples in global builds #1161
- Revert "Use weighted sampling for recent samples in global builds" #1168


corneliusroemer · 2024-12-20T13:05:10Z

Reopening as the implementation didn't seem to work for at least global builds, see for context:

As implemented (prior to revert), the recent samples become obsolete as they become essentially subsets of the longer window builds. The intent of recent builds (1m/2m) was to have builds that include as many recent sequences as possible as those are the most interesting for spotting new developments.

I don't know how population weighting is implemented. Based on the result observed, it seems plausible to me that "max sequences" assumes that all countries contribute their full quota which is very much not true at a global scale and especially so for recent submissions.

This means that the effective sample will be much lower than the "max sequences", in contradistinction to the original augur filter meaning of "max sequences", where you would pretty much get max sequences no matter the grouping.

A temporary workaround, if one wanted to keep using the new population weighting feature, is to scale up the max sequences parameter to something like 15k to get an effective sample of 4k. This would be theoretically risky: if countries were to scale up sequencing, we could end up with more sequences than we really want (~5k) [this is unlikely to happen in practice, so it is more of a theoretical issue]. Another issue is that, as sequencing activity will likely decrease further over time, we would have to keep increasing max sequences, which is not ideal.
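To make the trade-off of this workaround concrete, here is a minimal sketch of the arithmetic. It is not how augur filter is implemented: it assumes fractional, proportional targets capped by availability, and the group names, populations, sequence counts, and the helper `required_max_sequences` are all hypothetical.

```python
def effective_sample(weights, available, max_sequences):
    """Sequences actually obtained when each group's weighted target
    is capped by the number of sequences available for that group."""
    total_weight = sum(weights.values())
    return sum(
        min(max_sequences * w / total_weight, available.get(group, 0))
        for group, w in weights.items()
    )


def required_max_sequences(weights, available, desired):
    """Smallest integer max_sequences whose effective sample reaches `desired`.

    Returns None if the input data cannot supply that many sequences.
    """
    reachable = sum(available.get(g, 0) for g, w in weights.items() if w > 0)
    if desired > reachable:
        return None
    lo, hi = 0, 1
    # Grow the upper bound until it is large enough.
    while effective_sample(weights, available, hi) < desired:
        hi *= 2
    # Binary search: effective_sample is non-decreasing in max_sequences.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if effective_sample(weights, available, mid) >= desired:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical numbers, for illustration only.
population = {"A": 1500, "B": 400, "C": 100}  # weights (millions of people)
available = {"A": 100, "B": 5000, "C": 3000}  # sequences in the input data

knob = required_max_sequences(population, available, desired=2000)
print("max_sequences needed for ~2000 sequences:", knob)

# The risk: if under-sampled group A later scales up sequencing,
# the inflated knob is suddenly honoured in full.
scaled_up = dict(available, A=10000)
overshoot = effective_sample(population, scaled_up, knob)
print("effective sample after A scales up:", overshoot)
```

With these made-up numbers the knob has to be set to 7600 to obtain about 2000 sequences, and the effective sample jumps to the full 7600 once group A is no longer under-sampled.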

After reviewing more of the prior PRs, I realize that the revert of #1161 doesn't fully restore the approach prior to new population weighted sampling for global builds. That would require reinstating the splitting of countries in Asia.

victorlin · 2025-01-02T20:05:14Z

> I don't know how population weighting is implemented. Based on the result observed, it seems plausible to me that "max sequences" assumes that all countries contribute their full quota which is very much not true at a global scale and especially so for recent submissions.

Population weighting divides the max sequences among countries per capita instead of the default equal weighting. The concept of "max sequences" for both weighted and uniform sampling is a limitation of augur filter – it doesn't take into consideration what's actually available in the input data (i.e. the problem is under-sampling which becomes more apparent with population weights when large countries do not contribute many samples).
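As a rough sketch of that arithmetic (a simplification under stated assumptions, not augur filter's actual code: targets are kept fractional, and the groups, populations, and sequence counts below are made up):

```python
def target_sizes(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights."""
    total_weight = sum(weights.values())
    return {g: max_sequences * w / total_weight for g, w in weights.items()}


def effective_sample(targets, available):
    """A group cannot contribute more sequences than the input contains."""
    return sum(min(t, available.get(g, 0)) for g, t in targets.items())


# Hypothetical numbers, for illustration only.
population = {"A": 1500, "B": 400, "C": 100}  # millions of people
available = {"A": 100, "B": 5000, "C": 3000}  # sequences in the input data
max_sequences = 3000

# Default behaviour: every group gets the same weight.
uniform = target_sizes({g: 1 for g in population}, max_sequences)
# Population weighting: targets are proportional to population.
weighted = target_sizes(population, max_sequences)

uniform_total = effective_sample(uniform, available)
weighted_total = effective_sample(weighted, available)

print("uniform targets:  ", uniform, "->", uniform_total)
print("weighted targets: ", weighted, "->", weighted_total)
```

In this toy case both schemes fall short of `max_sequences`, but the shortfall is much larger with population weights (850 vs. 2100 sequences) because most of the budget is assigned to the large, under-sampled group A.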

> After reviewing more of the prior PRs, I realize that the revert of #1161 doesn't fully restore the approach prior to new population weighted sampling for global builds. That would require reinstating the splitting of countries in Asia.

Continuing this in #1161 (comment), but I think it may be worth reconsidering whether to use weighted sampling at all for the 1m/2m focal samples or simply take whatever is available at the time. Weighted sampling only makes sense when there are enough samples and minimal under-sampling.

victorlin added the enhancement label Aug 14, 2024

victorlin self-assigned this Aug 14, 2024

This was referenced Aug 14, 2024

Use weighted sampling for Asia builds #1106

Merged

Improved subsampling support nextstrain/augur#1481

Open

Allow weighted subsampling nextstrain/augur#1318

Closed

victorlin mentioned this issue Aug 26, 2024

Use weighted sampling for other builds #1151

Merged

3 tasks

victorlin closed this as completed in #1151 Sep 30, 2024

victorlin reopened this Jan 2, 2025

victorlin mentioned this issue Jan 2, 2025

Revert "Use weighted sampling for recent samples in global builds" #1168

Merged
