Allow weighted subsampling #1318

victorlin · 2023-09-19T22:12:06Z

Context

Currently, --subsample-max-sequences effectively calculates a value for --sequences-per-group which applies to all groups specified by --group-by.

This behavior does not work for all scenarios. Example: There are 5 countries with vastly different population sizes. 60 sequences are requested for subsampling. The command would be:

augur filter \
  --group-by country \
  --subsample-max-sequences 60

This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes and not uniform. This limitation is what prompts higher-level subsampling workarounds such as nextstrain/ncov#1074.

Possible solution

Implement an option --subsample-weights, which reads a file that specifies weights per --group-by column. A simple example:

augur filter \
  --group-by country \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

With this information, a different number of sequences can be calculated per group.

A would have 60*1000/3000 = 20 sequences.
C would have 60*300/3000 = 6 sequences.
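
The arithmetic above can be sketched in a few lines of Python. This only illustrates the proposed calculation and is not augur's implementation; the function name `sequences_per_group` is hypothetical, and how fractional targets would be rounded is left open.

```python
def sequences_per_group(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Weights must sum to a positive number.")
    # Multiply before dividing so that exact cases stay exact.
    return {group: max_sequences * weight / total
            for group, weight in weights.items()}


# Country weights from weights.yaml above.
country_weights = {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600}
targets = sequences_per_group(country_weights, 60)
print(targets["A"], targets["C"])  # 20.0 6.0
```

The per-group targets add up to the requested 60 sequences whenever every weight is included in the denominator.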

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use --group-by country month while keeping weights.yaml as-is to have weighted country sampling for each time bin.

Or, a more complex example where time is also weighted:

augur filter \
  --group-by country month \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

# Get twice the amount of sequences from 2021 compared to 2020.
month:
    2020-01: 1
    2020-02: 1
    2020-03: 1
    # … all months in 2020 are weighted with 1
    2021-01: 2
    2021-02: 2
    2021-03: 2
    # … all months in 2021 are weighted with 2

Notes:

The file format is up for debate. At the least, it can be JSON or YAML, but not anything tabular (not enough dimensions to cover multiple group by columns).
This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.
Weights should be relative within each column.
(I think) as long as the weights are non-negative, the values can be multiplied across columns to get effective weighting for all combinations.
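
One way to see why multiplying across columns works, assuming the columns are weighted independently: write $w_c$ for a country weight, $w_m$ for a month weight, and $N$ for the requested total (symbols introduced here for illustration only). The denominator then factorizes, so each column can be normalized on its own:

```latex
w(c, m) = w_c \, w_m,
\qquad
\sum_{c} \sum_{m} w(c, m) = \Bigl( \sum_{c} w_c \Bigr) \Bigl( \sum_{m} w_m \Bigr),
\qquad
n(c, m) = N \cdot \frac{w_c}{\sum_{c'} w_{c'}} \cdot \frac{w_m}{\sum_{m'} w_{m'}}
```

With non-negative weights and a positive sum per column, every $n(c, m)$ is non-negative and the targets sum to $N$.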

victorlin · 2024-03-11T23:48:23Z

There have been many internal discussions on this feature. Contrary to the proposal in the issue description, it does seem reasonable to encode multi-dimensional weights in a CSV/TSV format, though it's likely that this type of file must be generated via a script.

country     month       weight
A           2020-01     N
A           2020-02     N
A           2020-03     N
…
B           2020-01     N
B           2020-02     N
B           2020-03     N
…

Some more notes:

The weights file should be mutually exclusive with --group-by (determined by weights file columns) and --sequences-per-group (calculated dynamically using weights).
The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.
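
A minimal sketch of the completeness check described above, assuming groups are represented as tuples of column values. The function name and the sample data are hypothetical and not part of augur.

```python
def check_weights_cover_data(observed_groups, weighted_groups):
    """Raise an error if any group present in the data has no weight."""
    missing = sorted(set(observed_groups) - set(weighted_groups))
    if missing:
        raise ValueError(f"No weight provided for groups: {missing}")


# Hypothetical (country, month) combinations found in the metadata.
observed = [("A", "2020-01"), ("A", "2020-02"), ("B", "2020-01")]
# Hypothetical rows of the weights file, keyed by the same columns.
weights = {("A", "2020-01"): 3, ("A", "2020-02"): 1}

try:
    check_weights_cover_data(observed, weights)
    error_message = None
except ValueError as error:
    error_message = str(error)
print(error_message)
```

Here the combination B / 2020-01 is present in the data but has no weight, so the check reports it.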

In the initial implementation, all cells of the weights file must have a value. In the future, this can be extended to allow partitioning of the data at different resolutions. Here's an example with geographically even sampling on two different resolutions:

country     division    weight
A                       <1/n_countries>
B                       <1/n_countries>
C                       <1/n_countries>
…
USA         WA          <1/n_countries * 1/n_divisions>
USA         WA          <1/n_countries * 1/n_divisions>
USA         WA          <1/n_countries * 1/n_divisions>
…
USA         OR          <1/n_countries * 1/n_divisions>
USA         OR          <1/n_countries * 1/n_divisions>
…

trvrb · 2024-03-21T20:19:48Z

Thanks for spelling things out in such detail @victorlin. A couple thoughts:

I really like the behavior in the original YAML version of being able to specify independent weights for column 1 (eg country) vs column 2 (eg month). The situations where we have an interaction effect between weights seem quite limited (I can't think of an immediate example in existing subsampling routines).

I could easily write this YAML file for ncov, while for the fully specified TSV example, I'd need a script that generates a large number of combinations (that I don't actually care about).

Note that you could still encode interaction terms in a YAML file, eg:

# Weight countries by population size.
country month:
    A 2020-01: 10
    B 2020-01: 10
    C 2020-01: 3
    D 2020-01: 1
    E 2020-01: 6
    A 2020-02: 10
    B 2020-02: 10
    C 2020-02: 3
    D 2020-02: 1
    E 2020-02: 6

Again, I believe that independent columns will cover >90% of use cases and then won't force people to write intermediate scripts if they have multiple columns they care about.

This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.

The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.

To avoid enforced verbosity I could also imagine assuming a weight of 1 for any missing entries. But raise a warning saying that missing values have been assumed to be 1.
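
A rough sketch of that fallback, using Python's standard `warnings` module. The function name `fill_missing_weights` and the example values are made up for illustration.

```python
import warnings


def fill_missing_weights(weights, observed_values, default=1):
    """Return a copy of weights with a default for values that lack one."""
    filled = dict(weights)
    missing = sorted(set(observed_values) - set(weights))
    for value in missing:
        filled[value] = default
    if missing:
        warnings.warn(
            f"Assuming a weight of {default} for missing values: {missing}"
        )
    return filled


# Capture the warning so the example can inspect it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    filled = fill_missing_weights({"A": 10, "B": 10}, ["A", "B", "C"])
```

Country C is given the default weight of 1, and a single warning names it.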

victorlin · 2024-03-28T02:50:08Z

I could easily write this YAML file for ncov

My speculative hesitation with YAML is that it'll be hard to translate from a source file e.g. case counts which are typically in TSV format (but I haven't actually tried). YAML would definitely be easier to manually define simple weighting logic such as "2x sequences from region A compared to B".
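
For what it's worth, here is a sketch of one possible translation, assuming a case-count TSV with one row per country and month. The column name `cases`, the function name, and the data are hypothetical. Summing per column gives independent weights shaped like the YAML proposal, at the cost of discarding any interaction between columns.

```python
import csv
import io
from collections import defaultdict


def column_weights_from_counts(tsv_text, columns, count_field="cases"):
    """Sum a count field per value of each column, independently per column."""
    weights = {column: defaultdict(int) for column in columns}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for column in columns:
            weights[column][row[column]] += int(row[count_field])
    return {column: dict(totals) for column, totals in weights.items()}


# Hypothetical case counts; a real file would be read from disk.
case_counts = (
    "country\tmonth\tcases\n"
    "A\t2020-01\t30\n"
    "A\t2020-02\t70\n"
    "B\t2020-01\t10\n"
    "B\t2020-02\t40\n"
)
weights = column_weights_from_counts(case_counts, ["country", "month"])
```

The resulting nested dictionary could then be dumped to YAML with whichever YAML library is available.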

I'd need a script that generates a large number of combinations (that I don't actually care about).

Good point. The combinations need to be programmatically generated somewhere along the lines. If providing weights as YAML, the subsampling tool would internally generate weights per group analogous to the TSV.

I think it'd be manageable to first implement the underlying logic and allow configuration via both YAML and TSV to get a feel for what works better under different scenarios.

To avoid enforced verbosity I could also imagine assuming a weight of 1 for any missing entries.

My (again speculative) concern is that there may be few cases in which 1 is a useful default, especially if weights are based on case counts or population size.

This seems like a small behavioral detail in which we'll only know what to do once we have an implementation to test against real world usage. We could start with errors to notice if enforced verbosity is overkill.

victorlin · 2024-03-28T02:57:31Z

After working on nextstrain/ncov@0fd6861 I've realized that in order to reduce the number of samples (i.e. calls to augur filter) in the workflow, augur filter will need the extended implementation that allows partitioning of the data at different resolutions. I don't see how the initial implementation will simplify the ncov workflow.

victorlin · 2024-03-28T18:01:01Z

Here's an idea: implement weighted subsampling as a part of augur subsample and configure it in the new YAML.

Using the currently proposed YAML as-is would look something like:

samples:
  north_america_6m:
    size: 4000

    weights:
      # Region weighting: 4:1 for North America to rest of world
      region:
        North America: 4
        # Africa: 1
        # Asia: 1
        # Europe: 1
        # …

      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        # 2020-01: 1
        # 2020-02: 1
        # …
        2024-02: 4
        2024-03: 4

Issues:

For region weighting, assuming a weight of 1 for missing entries, the weighting will change from 4:1 North America to rest of the world to 4:1 North America to every other region (i.e. 4:6 North America to rest of the world). Time weighting is similarly affected.
Time weighting is verbose and lacks ability to use relative dates.
For both region and time weighting, the column to group by for uniform sampling within each group is no longer encoded. In the current ncov workflow, this is encoded as different --group-by columns for individual samples.
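
The first issue can be made concrete with a small calculation. This is a sketch only: the placeholder region names and the count of six other regions are assumptions taken from the 4:6 figure above.

```python
def share_of_total(weights, group):
    """Fraction of the sample that a group receives under these weights."""
    return weights[group] / sum(weights.values())


# Intended weighting: North America vs. everything else combined.
intended = {"North America": 4, "rest of world": 1}

# What a default weight of 1 per missing region would produce instead.
# Six placeholder regions are assumed, matching the 4:6 ratio above.
defaulted = {"North America": 4}
defaulted.update({f"other region {i}": 1 for i in range(1, 7)})

print(share_of_total(intended, "North America"))   # 0.8
print(share_of_total(defaulted, "North America"))  # 0.4
```

North America's share drops from 80% to 40% of the sample, which is why a per-region default changes the meaning of the weights.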

Here's an alternative which addresses those issues:

samples:
  north_america_6m:
    size: 4000

    partitions:
      # Region weighting: 4:1 for North America to rest of world
      region:
      - query: region == 'North America'
        weight: 4
        uniform_sampling: division
      - query: region != 'North America'
        weight: 1
        uniform_sampling: country

      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
      - query: date >= 2M
        weight: 4
        uniform_sampling: week
      - query: date < 2M
        weight: 1
        uniform_sampling: month

victorlin · 2024-03-29T00:02:13Z

After thinking more along the lines of implementing this in augur subsample, I've realized there are two types of weighted sampling:

1. Weighted sampling between intermediate samples (e.g. 4:1 between North America vs. rest of the world).
2. Weighted sampling within an intermediate sample (e.g. dynamic sequences per group based on geo-temporal case counts).

I think these can be implemented separately, where (1) can be YAML-based and (2) can be TSV-based. I've added more detail and examples in the subsampling doc.

trvrb · 2024-04-12T19:41:46Z

Thanks for the thoughts @victorlin. I'll try to pull together a more cohesive thread for how I'd see this working for the ncov example. But broadly, I like the general idea of encoding weights independently between categories (country vs month for example) and assuming no interaction between categories. I.e., if you have a weight of 4 for North America and a weight of 1 for global context, and a weight of 4 for recent samples and a weight of 1 for older samples, then I'd assume a sampling weight of 4x4 = 16 for recent North America, 1x4 = 4 for recent global, 4x1 = 4 for older North America and 1x1 = 1 for older global.
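
The multiplication described here can be sketched as follows. The total of 4000 is borrowed from the `size` in the earlier YAML example purely for illustration, and the dictionary keys are informal labels rather than real metadata values.

```python
from itertools import product

# Independent per-category weights, as described above.
region_weights = {"North America": 4, "global": 1}
age_weights = {"recent": 4, "older": 1}

# The weight of each combination is the product of its category weights.
combined = {
    (region, age): region_weights[region] * age_weights[age]
    for region, age in product(region_weights, age_weights)
}

# Split an assumed total of 4000 sequences in proportion to those weights.
size = 4000
total_weight = sum(combined.values())
allocation = {
    group: size * weight / total_weight for group, weight in combined.items()
}
print(allocation[("North America", "recent")])  # 2560.0
```

The four combined weights sum to 25, so recent North America receives 16/25 of the sample and older global receives 1/25.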

victorlin added the enhancement New feature or request label Sep 19, 2023

trvrb mentioned this issue Mar 21, 2024

Generate subsampling config with a script nextstrain/ncov#1102

Draft

5 tasks

This was referenced Apr 12, 2024

filter: Split filtering and subsampling #1432

Draft

augur subsample command #635

Open

victorlin changed the title ~~filter: Allow weighted subsampling~~ Allow weighted subsampling Apr 18, 2024

victorlin mentioned this issue Apr 25, 2024

Implement weighted sampling #1454

Draft

4 tasks
