This implements a new augur command, `augur subsample`. The code quality is not ready for merge - there are a number of todos, incomplete testing & comments, and some functionality has not yet been implemented. I’m putting it up here to allow discussion about the direction.

A brief description of the subsampling scheme as implemented in the Snakemake workflow in the ncov repo:
1. A build specifies a yaml-formatted subsampling scheme, with template strings which are filled in with build-specific info
2. A scheme consists of a number of sample definitions
3. Each sample definition specifies parameters which are translated into arguments to `augur filter`
4. Each sample definition may define a “priorities” sample, such that the current sample is focused on sequences included in the focal sample
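To make steps 1 and 3 concrete, here is a minimal sketch of how one sample definition might be filled in and translated into `augur filter` arguments. The scheme keys (`group_by`, `max_sequences`, and so on), the helper names, and the use of `str.format` for template strings are assumptions for illustration, not the syntax used by ncov or this PR; only the `augur filter` flags on the right-hand side of the mapping are real. Priorities are deliberately left out.

```python
from typing import Any

# Hypothetical scheme keys (left) mapped to existing `augur filter` flags (right).
KEY_TO_FLAG = {
    "group_by": "--group-by",
    "max_sequences": "--subsample-max-sequences",
    "sequences_per_group": "--sequences-per-group",
    "min_date": "--min-date",
    "max_date": "--max-date",
    "query": "--query",
}


def fill_template(definition: dict[str, Any], **build_info: str) -> dict[str, Any]:
    """Fill template strings such as '{region}' with build-specific values.

    Assumption: templates use Python's str.format placeholder syntax.
    """
    return {
        key: value.format(**build_info) if isinstance(value, str) else value
        for key, value in definition.items()
    }


def sample_to_filter_args(definition: dict[str, Any]) -> list[str]:
    """Translate one (already filled-in) sample definition into CLI-style arguments."""
    args: list[str] = []
    for key, value in definition.items():
        if key not in KEY_TO_FLAG:
            # Reject parameters this sketch does not know how to translate.
            raise ValueError(f"Unsupported subsampling parameter: {key!r}")
        args.append(KEY_TO_FLAG[key])
        if isinstance(value, (list, tuple)):
            args.extend(str(item) for item in value)
        else:
            args.append(str(value))
    return args
```

For example, `sample_to_filter_args(fill_template({"max_sequences": 200, "query": "region == '{region}'"}, region="Europe"))` yields an argument list that could be appended to an `augur filter` invocation.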
The current implementation of `augur subsample` essentially follows this approach:
1. A subsampling scheme is provided, parsed, validated, and turned into a simple graph to indicate which samples rely on other samples having been computed (i.e. which are needed for priorities)
2. Each sample is computed by calling the `run` function of `augur filter`
3. If priorities need to be calculated for a sample to be computed, this is achieved by calling functions from (a new) `augur/priorities.py` module before step 2. The code here has been taken directly from the nextstrain/ncov repo.
4. The set of sequences to include in each sample is combined, and outputs written.
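The dependency graph in step 1 and the combination in step 4 can be sketched as follows. This is an illustration under assumptions rather than the code in this PR: the scheme layout (a dict of sample names, each optionally holding `priorities: {focus: <sample name>}`) and both function names are hypothetical. It relies on the standard-library `graphlib` module (Python 3.9+), whose `static_order()` raises `CycleError` if samples depend on each other circularly.

```python
from graphlib import TopologicalSorter


def computation_order(scheme: dict[str, dict]) -> list[str]:
    """Order sample names so every focal sample is computed before its dependants."""
    graph: dict[str, set[str]] = {}
    for name, definition in scheme.items():
        # Hypothetical layout: {"priorities": {"focus": "<other sample name>"}}
        focus = definition.get("priorities", {}).get("focus")
        if focus is not None and focus not in scheme:
            raise ValueError(f"Sample {name!r} depends on unknown sample {focus!r}")
        graph[name] = {focus} if focus is not None else set()
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


def combine_samples(strains_by_sample: dict[str, list[str]]) -> set[str]:
    """Union the strains selected by each sample into one inclusion set."""
    combined: set[str] = set()
    for strains in strains_by_sample.values():
        combined.update(strains)
    return combined
```

Samples with no entry under `priorities` have no predecessors, so they are the ones that could in principle be computed independently of each other.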
The YAML subsampling scheme provided to `augur subsample` can take two forms. We can use syntax which mostly maps one-to-one onto arguments used with `augur filter`, or we can use syntax more familiar to our ncov workflows.
There’s an extensive todo list here, including the following points, but I wanted to discuss with others first.
* Currently a limited selection of subsampling parameters are allowed in schema definitions
* Instead of using `augur filter`’s `run()` function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the `run()` function writes data to disk which we immediately read in.
* Similarly, the functions in `priority.py` are directly taken from scripts in the ncov repo. These functions can be refactored to return data rather than writing to disk.
* Decide whether include / exclude files are defined by arguments to `augur subsample` or are set within the scheme definition, which would allow per-sample differences.
* Write tests!
* Allow a log file to be written explaining why strains were not included (or why they were!)
* Expand priorities schema definition so that sample definitions define the `ignore_seqs` etc
* VCF support
* For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in `augur filter` to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
* See code for more!
* (ncov workflow) allow default / “all” subsampling schemes
* instead of allowing `augur subsample` to interpret ncov-specific subsampling syntax, we could (should) expand the `rule extract_subsampling_scheme` to transform the scheme into more canonical, augur-filter style syntax (see above).

Testing
The `subsample` branch of ncov is working for two profiles (and probably broken for all others!):