Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur subsample command #762

Closed
wants to merge 3 commits into from
Closed

augur subsample command #762

wants to merge 3 commits into from

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Aug 17, 2021

This implements a new augur command augur subsample. The code quality is not ready for merge - there are a number of todos, incomplete testing & comments, and some functionality has not yet been implemented. I’m putting it up here to allow discussion about the direction.

A brief description of the subsampling scheme as implemented in the Snakemake workflow in the ncov repo:
1. A build specifies a yaml-formatted subsampling scheme, with template strings which are filled in with build-specific info
2. A scheme consists of a number of sample definitions
3. Each sample definition specifies parameters which are translated into arguments to augur filter
4. Each sample definition may define a “priorities” sample, such that the current sample is focused on sequences included in the focal sample

The current implementation of augur subsample essentially follows this approach:
1. A subsampling scheme is provided, parsed, validated, and turned into a simple graph to indicate which samples rely on other samples having been computed (i.e. which are needed for priorities)
2. Each sample is computed by calling the run function of augur filter
3. If priorities need to be calculated for a sample to be computed, this is achieved by calling functions from (a new) augur/priorities.py module before step 2. The code here has been taken directly from the nextstrain/ncov repo.
4. The set of sequences to include in each sample is combined, and outputs written.

The YAML subsampling scheme provided to augur subsample can take two forms. We can use syntax which mostly maps one-to-one onto arguments used with augur filter, such as

focal:
  group-by: [year, month]
  sequences-per-group: 5

related:
  group-by: [country, year]
  sequences-per-group: 20
  exclude-where: ["division=Washington"]
  priorities:
    type: "proximity"
    focus: "focal"

Or we can use syntax more familiar to our ncov workflows:

focal:
  group_by: year month
  seq_per_group: 5

related:
  group_by: country year
  seq_per_group: 20
  exclude: "--exclude-where 'division= Washington'"
  priorities:
    type: "proximity"
    focus: "focal"

There’s an extensive todo list here, including the following points, however I wanted to discuss with others first.
* Currently a limited selection of subsampling parameters are allowed in schema definitions
* Instead of using augur filter’s run() function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the run() function writes data to disk which we immediately read in.
* Similarly, the functions in priority.py are directly taken from scripts in the ncov repo. These functions can be refactored to return data rather than writing to disk.
* Decide whether include / exclude files are defined by arguments to augur subsample or are set within the scheme definition, which would allow per-sample differences.
* Write tests!
* Allow a log file to be written explaining why strains were not included (or why they were!)
* Expand priorities schema definition so that sample definitions define the ignore_seqs etc
* VCF support
* For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in augur filter to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
* See code for more!
* (ncov workflow) allow default / “all” subsampling schemes
* instead of allowing augur subsample to interpret ncov-specific subsampling syntax, we could (should) expand the rule extract_subsampling_scheme to transform the scheme into more canonical, augur-filter style syntax (see above).

Testing
The subsample branch of ncov is working for two profiles (and probably broken for all others!):

snakemake --cores 4 --profile my_profiles/subsample/ -pf auspice/ncov_test.json
snakemake --cores 4 --profile nextstrain_profiles/nextstrain-ci -pf all

jameshadfield and others added 3 commits August 17, 2021 13:16
This commit adds the contents of two nCoV scripts,
`get_distance_to_focal_set.py` and `priorities.py` from
nextstrain/ncov at commit 093aa650a6e0b68704e68b8b73b979e768962a4f.
This code will not work as-is, rather this commit serves to acknowledge
the original authors of the code. The intention is to refactor this
into functions which can be called from an upcoming `augur subsample`
command.

Co-authored-by: Richard Neher <richard.neher@unibas.ch>
Co-authored-by: John Huddleston <huddlej@gmail.com>
Co-authored-by: Emma Hodcroft <emmahodcroft@gmail.com>
This is in preparation for an `augur subsample` command to call these
python functions directly, as opposed to scripts called from the
command line.

The work done by the functions is unchanged, thus they still don't
return any information rather they write information to disk. Future
work will change this so that information can flow back to the calling
program without needing to be written to disk.
@codecov
Copy link

codecov bot commented Aug 17, 2021

Codecov Report

Merging #762 (e40561c) into master (5c89570) will decrease coverage by 1.15%.
The diff coverage is 13.27%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #762      +/-   ##
==========================================
- Coverage   33.66%   32.50%   -1.16%     
==========================================
  Files          41       43       +2     
  Lines        5882     6236     +354     
  Branches     1423     1479      +56     
==========================================
+ Hits         1980     2027      +47     
- Misses       3817     4124     +307     
  Partials       85       85              
Impacted Files Coverage Δ
augur/__init__.py 31.25% <ø> (ø)
augur/priorities.py 12.58% <12.58%> (ø)
augur/subsample.py 13.74% <13.74%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c89570...e40561c. Read the comment docs.

@jameshadfield
Copy link
Member Author

After discussion with @huddlej I am going to move this functionality into the ncov repo for further development and testing, before it makes its way back to augur!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant