New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] augur mask only supports VCFs with a single "chromosome" entry #507
Comments
This is going to be a bit interesting - I'll tackle this after centralizing the "read_bed_file" logic from tree & mask, but two things pop to mind -
|
For Fasta files there is only ever one chromosome - the chromosomes are a VCF thing. For Fasta, supplying masking sites as BED files is not really standard, but useful in theory as it's a nice format. We can assume that if using Fasta files all the site apply to the sequence that's been passed in. (There's no way to differentiate different sections of a Fasta file). Unfortunately I don't understand point one - sorry it's very hard to follow all the threads right now. Which two files are read the same way? |
@emmahodcroft per a conversation separately, we have two very similar functions for loading masking sites from a file, one in Currently, reading masking sites from files, whether .bed format or just a file with one site per line, returns the same format, a basic list of integers. To handle the chromosomes, we'll have to treat the BED format and the "site-per-line" format differently. It's doable, but it's a change from how we're currently doing it. I'll propose a changeset for this once I've consolidated the current mask file loading code. |
Ah - they should still be able to return the same format though, I think! There may be multiple chromosomes/whatever in the BED file, but we only support one 'chromosome' (or 'thing') in Nextstrain. They can only have one 'chromosome-thing' in the VCF file itself (how strictly we enforce this in the code, I don't remember - but it definitely won't work). So, we just need to check that one of the ones in the BED file matches the CHROM column of the VCF file, and then just take that one - can convert to the same line list after that. I hope I understood the gist of this correctly - and hope that maybe makes this less complicated! |
We may support more 'things' in future but this would be a big, big change in |
Ok, cool - if you and @huddlej are on the same page on this, I'm good with less work 😁 |
Yeah, if there were a "very low priority" tag, this issue would have it, but I wanted to document the edge case somewhere so it didn't get lost forever. That said, I've updated the "expected behavior" to reflect that we should at least alert unsuspecting users if they pass in a VCF with multiple chromosomes instead of just processing it anyway. Users coming from a genomics background will expect to be able to apply a BED file with multiple chromosomes to a VCF with multiple chromosomes. Similarly masking of FASTA files with BED files is a common pattern in genomics. Even though this isn’t applicable to Nextstrain right now, we should make our limitations about supporting one chromosome clearer. |
One way we can achieve this is to replace the |
Current Behavior
augur mask was originally developed to support masking sites from VCFs whose variants were relative to a single chromosome reference of a pathogen (e.g., tuberculosis). As it is currently implemented, the mask command only reads in the first "chromosome" entry of the VCF and considers all masked sites as relative to that chromosome.
Although the mask command accepts a BED file to specify which sites should be masked, the chromosome entry of the BED file is ignored and only the start and end positions are used. This will silently cause unexpected results for any inputs that have more than one chromosome (e.g., a bacterial genome plus its plasmids or a segmented genome that has its variants stored in VCF).
Expected behavior
augur mask should fail loudly if the user provides a VCF with more than one chromosome with an error message explaining that only one chromosome is currently supported.
The text was updated successfully, but these errors were encountered: