New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use @gmod/vcf #1227

Merged
merged 37 commits into from Nov 6, 2018

Conversation

Projects
None yet
4 participants
@garrettjstevens
Copy link
Contributor

garrettjstevens commented Oct 6, 2018

Whew, finally ready I think. Here's some test data I've been using: data.tar.gz. That's got the same VCFs indexed with both tabix and tribble so you can compare them. There are a few changes and bugfixes. One change is with the way some feature descriptions are displayed:

Old

image

New

image

There were also some tribble VCFs that wouldn't load, and then feature callbacks (like the red-blue heterozygous/homozygous one in the Volvox data) didn't work with tribble. Both issues are fixed now.

I've tested on a bigger dbSNP VCF and didn't see any performance issues, but feel free to hammer it and see how it performs. There are probably a couple places I could optimize if I need to.

Resolves #1140, resolves #1199, resolves #1212.

garrettjstevens added some commits Sep 13, 2018

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Oct 8, 2018

I think this is looking really good! I found a potential bug with a 1000 genomes dataset for SV though

I tested out with http://s3.amazonaws.com/1000genomes/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz and found it gives an error displaying a feature e.g. in this region 1:67477101..67604500 (Error is cannot read property 'undefined' of undefined)

Also on some datasets the layout is a little weird and not optimal, especially noticeable if you zoom out. Here is the chrY data for human

screenshot-localhost-2018 10 08-17-20-51 2

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Oct 11, 2018

@cmdcolin I've fixed the 1000 genomes display error in vcf-js and pulled in the new version.

As for the the display issues, I think I'm going to try to figure out #1232 first and then come back, since fixing that might affect this.

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Oct 24, 2018

The layout problems are because I use the file offset from tabix-js as a uniqe ID for the feature, and it isn't consistently giving the same file offsets for each line of the VCF. I'll try to create a failing test case for tabix-js.

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Oct 29, 2018

With the SV 1KG data, each VCF line is about 10,250 characters long. Using the CRC32 of the line as the unique ID, it took one page about 43 seconds to load, and here is the call tree with ~90% of the time spent on the CRC32:
image

The same page using an ID that's just a string concatenation of a couple of the VCF fields only takes ~4 seconds to load. It looks like the longest operation in the CRC32 code was the string to bytes conversion.

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Oct 29, 2018

Gotcha. I guess with BAM it might be a little more optimized because you can know the exact byte boundaries of a feature and can simply pass a byte buffer straight to crc32 but tabix/vcf seems to just need to still parse text chunks of data so it doesn't necessarily receive those byte boundaries that can be used

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Oct 29, 2018

@cmdcolin I'm not getting an error with that small VCF you posted above:
image

I just pushed some commits that include pulling in the latest tabix-js release, so that might fix it on your end.

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Oct 29, 2018

I am not getting that error either now, not sure if I had yarn link @gmod/vcf or @gmod/tabix or something like this but it seems good. Sorry :)

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Nov 5, 2018

I tested the chrY from 1000 genomes and was getting "cannot read 'map' of null", I think due to the value of '.' on genotypes, made this test case here to check GMOD/vcf-js#1

garrettjstevens and others added some commits Nov 5, 2018

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Nov 6, 2018

Pushed a small change which fixes opening local files via Open track dialog

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Nov 6, 2018

There is one more small concern which is that the encoded characters e.g. %2C are not urldecoded, I think this only applies to the info field.

For what it's worth though the old VCF parser did not do this either, but the VCF spec does mention these should be decoded

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Nov 6, 2018

@cmdcolin thanks for the BlobFilehandleWrapper fix. You're right about the decoding, that probably needs to happen in vcf-js for INFO and FORMAT fields. I'll add it in there.

@garrettjstevens

This comment has been minimized.

Copy link
Contributor

garrettjstevens commented Nov 6, 2018

New vcf-js version addresses the decoding, and it's now incorporated here.

@cmdcolin

This comment has been minimized.

Copy link
Contributor

cmdcolin commented Nov 6, 2018

I think it looks good to me :) I think we can get it merged

@rbuels

This comment has been minimized.

Copy link
Collaborator

rbuels commented Nov 6, 2018

Looks great to me! Merging.

@rbuels rbuels merged commit c920f74 into dev Nov 6, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@wafflebot wafflebot bot removed the in progress label Nov 6, 2018

@garrettjstevens garrettjstevens deleted the 1199_npm_vcf branch Nov 6, 2018

@colindaven

This comment has been minimized.

Copy link
Contributor

colindaven commented Nov 7, 2018

Btw, if you're looking for additional VCF4.3 datasets the only tool I know which will generate them is octopus

conda install -c conda-forge -c defaults -c r -c bioconda octopus

eg

octopus --reference /lager2/rcug/seqres/HS/Homo_sapiens.GRCh38.dna.chromosome.MT.fa -o octopus2.vcf --bamout octopus2 --threads 32 --reads *.bam

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment