Trim variant #22

Merged
merged 4 commits into from
Jun 26, 2012

martijnvermaat
Collaborator

I don't mean this to be pulled in as-is, but rather to start discussion.

The standard VCF convention is to place an indel at the left-most possible position, but some tools (e.g. samtools) add extra context to the right of the allele sequences. These common suffixes are undesirable when comparing variants, for example against variant databases.

The trim_common_suffix function removes these common suffixes, forcing the indel to the left.

For example, to get trimmed reference and alternate alleles for a VCF record:

>>> print record
Record(CHROM=chr1, POS=152195728, REF=ATTTTTTTTTTT, ALT=['ATTTTTTTTTT', 'ATTTTTTTTT'])
>>> trim_common_suffix(record.REF, *record.ALT)
['ATT', 'AT', 'A']

Or if you only want to work with the first alternate allele:

>>> trim_common_suffix(record.REF, record.ALT[0])
['AT', 'A']
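
For discussion, here is a minimal sketch of how such a function could behave. It is not necessarily the code in this pull request; the one assumption it makes beyond the examples above is that every allele keeps at least one base, so no sequence is ever trimmed to an empty string.

```python
def trim_common_suffix(*sequences):
    """Remove the longest suffix shared by all sequences.

    Assumption (illustrative): each sequence keeps at least one
    character, so the shortest allele is never trimmed to nothing.
    """
    if not sequences:
        return []
    # The shortest sequence limits how much can be trimmed.
    max_trim = min(len(seq) for seq in sequences) - 1
    trim = 0
    while trim < max_trim:
        # Collect the base at this distance from the end of every sequence.
        column = set(seq[-(trim + 1)] for seq in sequences)
        if len(column) != 1:
            break  # the sequences disagree here, stop trimming
        trim += 1
    return [seq[:len(seq) - trim] for seq in sequences]
```

With this sketch, a SNP such as ('A', 'G') is returned unchanged, because there is nothing to trim without emptying an allele.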

If such functionality should ever be included in PyVCF, a few ideas:

  • Per allele is difficult, since alleles in a record all share the same reference. It could be a .trimmed() method on Call objects, returning a pair of trimmed reference and trimmed variant.
  • Per record could be done with a flag in the Reader constructor, which would cause all records to be trimmed. Or provide separate fields next to the existing REF and ALT fields, containing the trimmed versions.

But I think trimming per allele is in practice most useful.
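
To illustrate why per-allele trimming gives different results from per-record trimming, here is a rough sketch. The helper names trim_pair and trimmed_per_allele are hypothetical and not part of PyVCF; the sketch again assumes each allele keeps at least one base.

```python
from os.path import commonprefix


def trim_pair(ref, alt):
    """Trim the suffix shared by one reference/alternate pair."""
    # The common suffix is the common prefix of the reversed strings.
    shared = len(commonprefix([ref[::-1], alt[::-1]]))
    # Keep at least one base in each allele (an assumption of this sketch).
    shared = min(shared, len(ref) - 1, len(alt) - 1)
    return ref[:len(ref) - shared], alt[:len(alt) - shared]


def trimmed_per_allele(ref, alts):
    """Return one (trimmed_ref, trimmed_alt) pair per alternate allele."""
    return [trim_pair(ref, alt) for alt in alts]
```

For the record shown earlier, this would yield ('AT', 'A') for the first alternate allele and ('ATT', 'A') for the second, so each allele gets its own trimmed reference rather than sharing one.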

A related step would be to truly left-align the indel as far as possible using a reference file, which may change the start position of the variant. The GATK LeftAlignVariants module and the freebayes --left-align-indels option do this.
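
As a rough sketch of what reference-based left-alignment involves (this is not how GATK or freebayes implement it, and the function name left_align is hypothetical), the following assumes the chromosome sequence is available as an in-memory string and that positions are 1-based as in VCF:

```python
def left_align(reference, pos, ref, alt):
    """Shift a single ref/alt pair as far left as the reference allows.

    reference: chromosome sequence as a string (assumed to fit in memory)
    pos: 1-based position of the first base of ref
    Returns the (pos, ref, alt) triple after left-alignment.
    """
    if ref == alt:
        return pos, ref, alt  # not a variant, nothing to align
    # Drop shared trailing bases; when an allele would become empty,
    # borrow the preceding reference base and move the position left.
    while ref[-1] == alt[-1]:
        if min(len(ref), len(alt)) == 1:
            if pos == 1:
                break  # no reference base to the left to borrow
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, keeping one base in each allele.
    while min(len(ref), len(alt)) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, with the reference 'GCACACAG', a deletion of one CA unit reported at position 5 as 'ACA' to 'A' would be moved to position 1 as 'GCA' to 'G'.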

Other opinions?

@brentp

brentp commented Feb 9, 2012

In the above, where you have the example of "only work[ing] with the first alternate allele", should that be

>>> trim_common_suffix(record.REF, record.ALT[0])

instead of `*record.ALT`?

Looks useful to me.

@martijnvermaat
Collaborator Author

Fixed, thanks for catching that.

@jamescasbon
Owner

Anyone have an opinion on this? Is it ready for merge?

This looks like an upstream problem, but I accept it is useful.

jamescasbon pushed a commit that referenced this pull request Jun 26, 2012
@jamescasbon jamescasbon merged commit 39f85fa into jamescasbon:master Jun 26, 2012
gotgenes pushed a commit to gotgenes/PyVCF that referenced this pull request May 13, 2014