augur subsample command #762

jameshadfield · 2021-08-17T07:29:47Z

This implements a new augur command augur subsample. The code quality is not ready for merge - there are a number of todos, incomplete testing & comments, and some functionality has not yet been implemented. I’m putting it up here to allow discussion about the direction.

A brief description of the subsampling scheme as implemented in the Snakemake workflow in the ncov repo:
1. A build specifies a yaml-formatted subsampling scheme, with template strings which are filled in with build-specific info
2. A scheme consists of a number of sample definitions
3. Each sample definition specifies parameters which are translated into arguments to augur filter
4. Each sample definition may define a “priorities” sample, such that the current sample is focused on sequences included in the focal sample

The current implementation of augur subsample essentially follows this approach:
1. A subsampling scheme is provided, parsed, validated, and turned into a simple graph to indicate which samples rely on other samples having been computed (i.e. which are needed for priorities)
2. Each sample is computed by calling the run function of augur filter
3. If priorities need to be calculated for a sample to be computed, this is achieved by calling functions from (a new) augur/priorities.py module before step 2. The code here has been taken directly from the nextstrain/ncov repo.
4. The set of sequences to include in each sample is combined, and outputs written.

The YAML subsampling scheme provided to augur subsample can take two forms. We can use syntax which mostly maps one-to-one onto arguments used with augur filter, such as

focal:
  group-by: [year, month]
  sequences-per-group: 5

related:
  group-by: [country, year]
  sequences-per-group: 20
  exclude-where: ["division=Washington"]
  priorities:
    type: "proximity"
    focus: "focal"

Or we can use syntax more familiar to our ncov workflows:

focal:
  group_by: year month
  seq_per_group: 5

related:
  group_by: country year
  seq_per_group: 20
  exclude: "--exclude-where 'division= Washington'"
  priorities:
    type: "proximity"
    focus: "focal"

There’s an extensive todo list here, including the following points, however I wanted to discuss with others first.
* Currently a limited selection of subsampling parameters are allowed in schema definitions
* Instead of using augur filter’s run() function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the run() function writes data to disk which we immediately read in.
* Similarly, the functions in priority.py are directly taken from scripts in the ncov repo. These functions can be refactored to return data rather than writing to disk.
* Decide whether include / exclude files are defined by arguments to augur subsample or are set within the scheme definition, which would allow per-sample differences.
* Write tests!
* Allow a log file to be written explaining why strains were not included (or why they were!)
* Expand priorities schema definition so that sample definitions define the ignore_seqs etc
* VCF support
* For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in augur filter to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
* See code for more!
* (ncov workflow) allow default / “all” subsampling schemes
* instead of allowing augur subsample to interpret ncov-specific subsampling syntax, we could (should) expand the rule extract_subsampling_scheme to transform the scheme into more canonical, augur-filter style syntax (see above).

Testing
The subsample branch of ncov is working for two profiles (and probably broken for all others!):

snakemake --cores 4 --profile my_profiles/subsample/ -pf auspice/ncov_test.json

snakemake --cores 4 --profile nextstrain_profiles/nextstrain-ci -pf all

This commit adds the contents of two nCoV scripts, `get_distance_to_focal_set.py` and `priorities.py` from nextstrain/ncov at commit 093aa650a6e0b68704e68b8b73b979e768962a4f. This code will not work as-is, rather this commit serves to acknowledge the original authors of the code. The intention is to refactor this into functions which can be called from an upcoming `augur subsample` command. Co-authored-by: Richard Neher <richard.neher@unibas.ch> Co-authored-by: John Huddleston <huddlej@gmail.com> Co-authored-by: Emma Hodcroft <emmahodcroft@gmail.com>

This is in preparation for an `augur subsample` command to call these python functions directly, as opposed to scripts called from the command line. The work done by the functions is unchanged, thus they still don't return any information rather they write information to disk. Future work will change this so that information can flow back to the calling program without needing to be written to disk.

codecov · 2021-08-17T07:31:27Z

Codecov Report

Merging #762 (e40561c) into master (5c89570) will decrease coverage by 1.15%.
The diff coverage is 13.27%.

@@            Coverage Diff             @@
##           master     #762      +/-   ##
==========================================
- Coverage   33.66%   32.50%   -1.16%     
==========================================
  Files          41       43       +2     
  Lines        5882     6236     +354     
  Branches     1423     1479      +56     
==========================================
+ Hits         1980     2027      +47     
- Misses       3817     4124     +307     
  Partials       85       85

Impacted Files	Coverage Δ
augur/__init__.py	`31.25% <ø> (ø)`
augur/priorities.py	`12.58% <12.58%> (ø)`
augur/subsample.py	`13.74% <13.74%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c89570...e40561c. Read the comment docs.

jameshadfield · 2021-08-23T03:50:53Z

After discussion with @huddlej I am going to move this functionality into the ncov repo for further development and testing, before it makes its way back to augur!

jameshadfield and others added 3 commits August 17, 2021 13:16

WIP augur subsample command

e40561c

jameshadfield mentioned this pull request Aug 17, 2021

augur subsample command #635

Open

jameshadfield closed this Aug 23, 2021

jameshadfield mentioned this pull request Aug 23, 2021

Implement subsampling via a script nextstrain/ncov#711

Open

1 task

huddlej mentioned this pull request Dec 21, 2021

Move subsampling by proximity into augur nextstrain/ncov#816

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

augur subsample command #762

augur subsample command #762

jameshadfield commented Aug 17, 2021 •

edited

Loading

codecov bot commented Aug 17, 2021 •

edited

Loading

jameshadfield commented Aug 23, 2021

augur subsample command #762

augur subsample command #762

Conversation

jameshadfield commented Aug 17, 2021 • edited Loading

codecov bot commented Aug 17, 2021 • edited Loading

Codecov Report

jameshadfield commented Aug 23, 2021

jameshadfield commented Aug 17, 2021 •

edited

Loading

codecov bot commented Aug 17, 2021 •

edited

Loading