Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] augur mask only supports VCFs with a single "chromosome" entry #507

Open
huddlej opened this issue Mar 31, 2020 · 8 comments
Open
Labels
bug low priority Can be resolved after other higher priority issues moderate problem Requires an average amount of work

Comments

@huddlej
Copy link
Contributor

huddlej commented Mar 31, 2020

Current Behavior
augur mask was originally developed to support masking sites from VCFs whose variants were relative to a single chromosome reference of a pathogen (e.g., tuberculosis). As it is currently implemented, the mask command only reads in the first "chromosome" entry of the VCF and considers all masked sites as relative to that chromosome.

Although the mask command accepts a BED file to specify which sites should be masked, the chromosome entry of the BED file is ignored and only the start and end positions are used. This will silently cause unexpected results for any inputs that have more than one chromosome (e.g., a bacterial genome plus its plasmids or a segmented genome that has its variants stored in VCF).

Expected behavior
augur mask should fail loudly if the user provides a VCF with more than one chromosome with an error message explaining that only one chromosome is currently supported.

@huddlej huddlej added bug low priority Can be resolved after other higher priority issues moderate problem Requires an average amount of work labels Mar 31, 2020
@danielsoneg
Copy link

This is going to be a bit interesting - I'll tackle this after centralizing the "read_bed_file" logic from tree & mask, but two things pop to mind -

  1. right now in Tree we're reading masking sites from both files the same way - effectively, "give me back a list" - we'll have to change that behavior to differentiate between bed files and just straight lists

  2. More importantly, how does this work with FASTA files? Do the chromosome names map the same? Do we want to just make bed files exclusive to VCF files?

@emmahodcroft
Copy link
Member

emmahodcroft commented Apr 1, 2020

For Fasta files there is only ever one chromosome - the chromosomes are a VCF thing. For Fasta, supplying masking sites as BED files is not really standard, but useful in theory as it's a nice format. We can assume that if using Fasta files all the site apply to the sequence that's been passed in. (There's no way to differentiate different sections of a Fasta file).

Unfortunately I don't understand point one - sorry it's very hard to follow all the threads right now. Which two files are read the same way?

@danielsoneg
Copy link

@emmahodcroft per a conversation separately, we have two very similar functions for loading masking sites from a file, one in augur/mask.py and one in augur/tree.py, so we're going to merge those and centralize to augur/utils.py.

Currently, reading masking sites from files, whether .bed format or just a file with one site per line, returns the same format, a basic list of integers. To handle the chromosomes, we'll have to treat the BED format and the "site-per-line" format differently. It's doable, but it's a change from how we're currently doing it.

I'll propose a changeset for this once I've consolidated the current mask file loading code.

@emmahodcroft
Copy link
Member

Ah - they should still be able to return the same format though, I think!

There may be multiple chromosomes/whatever in the BED file, but we only support one 'chromosome' (or 'thing') in Nextstrain. They can only have one 'chromosome-thing' in the VCF file itself (how strictly we enforce this in the code, I don't remember - but it definitely won't work).

So, we just need to check that one of the ones in the BED file matches the CHROM column of the VCF file, and then just take that one - can convert to the same line list after that.

I hope I understood the gist of this correctly - and hope that maybe makes this less complicated!

@emmahodcroft
Copy link
Member

We may support more 'things' in future but this would be a big, big change in augur/auspice - and the phylogenetics as a whole - so this can come if we ever move in that direction (no plans at the moment).

@danielsoneg
Copy link

Ok, cool - if you and @huddlej are on the same page on this, I'm good with less work 😁

@huddlej
Copy link
Contributor Author

huddlej commented Apr 2, 2020

Yeah, if there were a "very low priority" tag, this issue would have it, but I wanted to document the edge case somewhere so it didn't get lost forever.

That said, I've updated the "expected behavior" to reflect that we should at least alert unsuspecting users if they pass in a VCF with multiple chromosomes instead of just processing it anyway.

Users coming from a genomics background will expect to be able to apply a BED file with multiple chromosomes to a VCF with multiple chromosomes. Similarly masking of FASTA files with BED files is a common pattern in genomics. Even though this isn’t applicable to Nextstrain right now, we should make our limitations about supporting one chromosome clearer.

@jameshadfield
Copy link
Member

Expected behavior
augur mask should fail loudly if the user provides a VCF with more than one chromosome with an error message explaining that only one chromosome is currently supported.

One way we can achieve this is to replace the get_chrom_name(vcf_file) function with our canonical vcf reading function from TreeTime. It'll be a little slower, but should help ensure the VCF file is valid for Augur's purposes. If we did this it wouldn't be hard to just drop sites from that data and round-trip it to TreeTime's write_vcf function and get rid of vcftools altogether.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug low priority Can be resolved after other higher priority issues moderate problem Requires an average amount of work
Projects
No open projects
Development

No branches or pull requests

4 participants