New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle large structural variants in VCF better #847

Merged
merged 13 commits into from Feb 27, 2018

Conversation

Projects
None yet
5 participants
@cmdcolin
Contributor

cmdcolin commented Jan 27, 2017

As background, VCF only has one coordinate given, and then the length of the variant is normally calculated in jbrowse as the length of the sequence of the genotype.

For larger structural variants in a VCF, the sequence is not put into the genotype, and instead the meta information field contains the data encoded as "END=" (similar to a gff column 9), so this PR enhances the ability to parse out that field

This link shows some exampes of larger structural variants like deletions, copy number variants, etc.
http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/VCF%20(Variant%20Call%20Format)%20version%204.0/encoding-structural-variants/

@keiranmraine

This comment has been minimized.

Contributor

keiranmraine commented Jan 27, 2017

Be aware that that specification only covers very simple rearrangements. The TCGA VCF specification has more detail which has been used for larger projects like PanCancer.

Describing events that lead to cancer/somatic diseases is impossible in the 1000g version.

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Jan 27, 2017

Interesting! thanks for the link. true, this PR only covers some simple things, like structural variants that cover some region like deletion, copy number variant, transposon.

would be interesting to have more support of tcga vcf too (looks much more complex though!)

@keiranmraine

This comment has been minimized.

Contributor

keiranmraine commented Jan 27, 2017

We were discussing it here previously for the more complex types: elsiklab/sashimiplot#5

@rdhayes

This comment has been minimized.

Contributor

rdhayes commented Jan 27, 2017

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Mar 22, 2017

For the case that I was interested in addressing, this is actually probably a finished PR.

Maybe additional enhancements can be added later?

@cmdcolin cmdcolin removed the in progress label Mar 22, 2017

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Mar 27, 2017

Note that this PR also fixes another long standing bug that you can see in the volvox sample data

If you click on one of the variants in this region E.g. the "DEL:ME" feature then you will see that the genotype table is not displayed, and there is an error in the javascript console about failing to parse the alt alleles
http://jbrowse.org/code/JBrowse-1.12.1/?data=sample_data%2Fjson%2Fvolvox&loc=ctgA%3A2886..4686&tracks=DNA%2Cvolvox-sorted-vcf%2Cvolvox_vcf_test&highlight=

This PR fixes that too :)

Screenshot with bug, no genotype table on a structural variant feature

screenshot-jbrowse org-2017-03-27-10-42-46

Screenshot without bug, normal variant with genotypes

screenshot-localhost-2017-05-11-11-05-08

@rbuels

This comment has been minimized.

Collaborator

rbuels commented Jan 23, 2018

This looks pretty good, how did this get derailed?

@rbuels rbuels self-assigned this Jan 24, 2018

@rbuels rbuels added this to the 1.12.4 milestone Jan 24, 2018

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Jan 24, 2018

I'm not sure, I think code should be fine :) there were the a couple "rider" changes (note chopping newline, fixing alt allele handling when it is empty) but I think it should be ok. Could also add an automated test if interested

@rbuels rbuels modified the milestones: 1.12.4, 1.13.0 Feb 2, 2018

@nathandunn nathandunn changed the base branch from master to dev Feb 7, 2018

@wafflebot wafflebot bot added the in progress label Feb 15, 2018

@cmdcolin cmdcolin referenced this pull request Feb 15, 2018

Merged

Add more gVCF support #991

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Feb 27, 2018

@rbuels I added some tests for parsing the END field better (in previous iteration of this PR, the code just matched END=\d+ but that would match BAMEND=\d+ for example, a possible a rare failure of the parser) so now that is fixed by adding two regex matching cases (^END and ;END) with a unit test

It also now has a test for the removal of the newline at the end of the line. This had previously caused the name of the last genotype given by feature.get('genotypes') to contain a newline

Let me know if there are any changes requested :)

cmdcolin added some commits Feb 27, 2018

@rbuels

Looks pretty good, just need to remove that xit

it('reads big dbsnp', function() {
xit('reads big dbsnp', function() {

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

don't forget to remove this x

This comment has been minimized.

@cmdcolin

cmdcolin Feb 27, 2018

Contributor

I don't think this file exists anywhere, so it is unable to pass by itself? if it is large maybe it should not be added to repo after all

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

oh, yeah it doesn't exist. how were these tests passing running in Travis then??

This comment has been minimized.

@cmdcolin

cmdcolin Feb 27, 2018

Contributor

I don't think they were running currently index.html has vcf.spec.js commented this PR uncommented it but xit that test

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

oh god thanks for catching that

});
});

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

oh i love these tests 😍😍

@@ -95,7 +95,7 @@ var Feature = Util.fastDeclare(
}
// add a toString to it that serializes it to JSON without its metadata
bySample.toString = this._stringifySample;
//bySample.toString = this._stringifySample;

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

why remove the stringification code?

This comment has been minimized.

@cmdcolin

cmdcolin Feb 27, 2018

Contributor

The feature.get('genotypes') would also contain a toString element that you have to work around. The toString would appear in Object.keys(feature.get('genotypes')) and such and seems unnecessary?

@rbuels

This comment has been minimized.

Collaborator

rbuels commented Feb 27, 2018

@cmdcolin could you add a release notes blurb?

@rbuels rbuels modified the milestones: 1.13.0, 1.12.5 Feb 27, 2018

@cmdcolin

This comment has been minimized.

Contributor

cmdcolin commented Feb 27, 2018

The last commit updating the release notes and having the skip CI tag seemed to make the tests not run on the pr now but it should be ok

@@ -72,7 +72,7 @@
<script type="text/javascript" src="spec/TwoBit.spec.js"></script>
<script type="text/javascript" src="spec/SequenceTrack.spec.js"></script>
<!--<script type="text/javascript" src="spec/VCF.spec.js"></script>-->
<script type="text/javascript" src="spec/VCF.spec.js"></script>

This comment has been minimized.

@rbuels

rbuels Feb 27, 2018

Collaborator

😳

@rbuels rbuels merged commit f8883fc into dev Feb 27, 2018

@wafflebot wafflebot bot removed the in progress label Feb 27, 2018

@rbuels rbuels deleted the parse_cnv_end_vcf branch Feb 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment