factor twobit store out into npm module #1146

Merged
merged 11 commits into from Aug 1, 2018

3 participants
@rbuels
Collaborator

rbuels commented Jul 28, 2018

out2.2bit now loads in about 9 seconds on my machine. Still not awesome, but better than it was. And the code is quite a bit cleaner now.

@rbuels rbuels added this to the 1.15.1 milestone Jul 28, 2018

@rbuels rbuels requested a review from cmdcolin Jul 28, 2018

@wafflebot wafflebot bot added the in progress label Jul 28, 2018

@rbuels

Collaborator Author

rbuels commented Jul 28, 2018

This should fix #1116

@cmdcolin

Contributor

cmdcolin commented Jul 28, 2018

I'm trying this out on a wheat 2bit file from Nathan Haigh, and it seems to have a problem where only alternating blocks get loaded. I also tried it on volvox.2bit and that seemed to work fine, so I'd have to do a little more debugging to find out why it is weird on this specific data. If you have that wheat genome, maybe try converting it to 2bit to test.

[Screenshot: screenshot-localhost-2018 07 28-14-11-51]

@rbuels

Collaborator Author

rbuels commented Jul 28, 2018

you talking about @nathanhaigh's 161010_Chinese_Spring_v1.0_pseudomolecules.fasta?

@cmdcolin

Contributor

cmdcolin commented Jul 28, 2018

yep :)

@nathanhaigh

Contributor

nathanhaigh commented Jul 28, 2018

@cmdcolin @rbuels According to the format description:

"A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format."

So I guess it won't support references that size?

@nathanhaigh

Contributor

nathanhaigh commented Jul 28, 2018

Though I also see that faToTwoBit has a -long option to overcome this limitation.

@rbuels

Collaborator Author

rbuels commented Jul 28, 2018

ah, a weird bug. fixed it in @gmod/twobit 1.0.2. GMOD/twobit-js@69ecf95 was the fix

@rbuels

Collaborator Author

rbuels commented Jul 28, 2018

@nathanhaigh your wheat sequence fit OK in a .2bit file, it only ended up being 3.5G. starting to max out the format a bit though. they need to publish an upgraded version of the format with bigger fields for the file offsets, at minimum.

@rbuels

Collaborator Author

rbuels commented Jul 28, 2018

@nathanhaigh

Contributor

nathanhaigh commented Jul 29, 2018

Oh...I thought the limit would likely be Gbp (number of nucleotides) not GB as in file size.
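
The Gbp versus GB distinction above can be checked with some rough arithmetic. The sketch below assumes 2 bits per base (4 bases per byte) and 32-bit file offsets, and it ignores the header, index, N-block and mask-block overhead, so the result is only a lower bound on the real file size. The names `packedDnaBytes`, `BASES_PER_BYTE` and `MAX_UINT32` are made up for illustration; the base count is the sum of the chromosome sizes that twoBitInfo reports for wheat.2bit later in this thread.

```javascript
// Rough size arithmetic for a .2bit file (lower bound only: header, index,
// N-block and mask-block records are ignored).
const BASES_PER_BYTE = 4;        // 2 bits per base
const MAX_UINT32 = 2 ** 32 - 1;  // largest value a 32-bit unsigned offset can hold

// Hypothetical helper: bytes needed to pack `totalBases` nucleotides.
function packedDnaBytes(totalBases) {
  return Math.ceil(totalBases / BASES_PER_BYTE);
}

// Sum of the chromosome sizes twoBitInfo prints for wheat.2bit in this thread.
const wheatBases = 14547261565;
const wheatBytes = packedDnaBytes(wheatBases);

console.log(`packed DNA: ${wheatBytes} bytes`);
console.log(`fits in 32-bit offsets: ${wheatBytes <= MAX_UINT32}`);
```

Under these assumptions roughly 14.5 Gbp packs into about 3.6 billion bytes, which is in line with the "3.5G" file mentioned above and still below the 32-bit offset ceiling.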

@rbuels

Collaborator Author

rbuels commented Jul 29, 2018

Alright I think it should be good now, could you try it again and confirm?

@@ -33,60 +27,15 @@ return declare([ SeqFeatureStore, DeferredFeaturesMixin], {
*/
constructor: function( args ) {

var blob = args.blob || new XHRBlob( this.resolveUrl( args.urlTemplate || 'data.2bit' ), { expectRanges: true } );


@cmdcolin

cmdcolin Jul 29, 2018

Contributor

any reason why expectRanges was removed?


@rbuels

rbuels Jul 29, 2018

Author Collaborator

Nope, fixed.

@cmdcolin

Contributor

cmdcolin commented Jul 29, 2018

I'm currently getting "Uncaught (in promise) start cannot be negative!" from RemoteBinaryFile.js when loading the wheat 2bit. I didn't use the -long argument, but as discussed above I don't think it is necessary. Here is my console log:

[Image: console log showing the errors]

@rbuels

Collaborator Author

rbuels commented Jul 29, 2018

I haven't been able to reproduce the "start cannot be negative" problem. Do you have a recipe for getting that error?

@cmdcolin

Contributor

cmdcolin commented Jul 29, 2018

Sorry for not giving all the details. The negative offset happens when "refSeqs" is also set to the 2bit file, e.g.

{
  "formatVersion": 1,
  "tracks": [
    {
      "category": "Reference sequence",
      "key": "Reference sequence",
      "label": "DNA",
      "seqType": "dna",
      "storeClass": "JBrowse/Store/Sequence/TwoBit",
      "type": "SequenceTrack",
      "urlTemplate": "seq/wheat.2bit",
      "useAsRefSeqStore": 1
    }
  ],
  "refSeqs": "seq/wheat.2bit"
}

I thought it might also happen with "Open sequence", since this initializes the store in a similar way, but that actually did work.

Again, volvox.2bit doesn't have the problem when refSeqs is set this way, but my wheat.2bit did (it was just generated with faToTwoBit chinese_spring.fa wheat.2bit).

@cmdcolin

Contributor

cmdcolin commented Jul 30, 2018

It looks like an int overflow somewhere, from a signed int maxing out at 2^31...

Wondering if maybe there is a field in the twobit-js code that is parsed as a signed int but should be unsigned

e.g.

diff --git a/src/twoBitFile.js b/src/twoBitFile.js
index 7f72555..9aace85 100644
--- a/src/twoBitFile.js
+++ b/src/twoBitFile.js
@@ -120,7 +120,7 @@ class TwoBitFile {
         }),
       record1: new Parser()
         .endianess(endianess)
-        .int32('dnaSize')
+        .uint32('dnaSize')
         .int32('nBlockCount'),
       record2: new Parser()
         .endianess(endianess)
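
To make the signed versus unsigned point concrete, here is a minimal sketch using only Node's built-in `Buffer`. It is not the twobit-js parser; the value 3000000000 is a hypothetical stand-in for any 32-bit field above 2^31, and the variable names are illustrative.

```javascript
// The same four bytes, decoded as signed and as unsigned 32-bit integers.
const buf = Buffer.alloc(4);

// Hypothetical field value larger than 2^31 (2147483648).
const stored = 3000000000;
buf.writeUInt32LE(stored, 0);

const asSigned = buf.readInt32LE(0);    // wraps around into negative territory
const asUnsigned = buf.readUInt32LE(0); // round-trips the stored value

console.log({ asSigned, asUnsigned });
```

A field read this way comes back negative once it passes 2^31, which would be consistent with the "start cannot be negative" error reported earlier if the affected field is a file offset.
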
@cmdcolin

Contributor

cmdcolin commented Jul 30, 2018

Here's one random tidbit, possibly related to the above (uint vs int)

If I do twoBitInfo wheat.2bit output.tab I get

chr1A   594102056
chr1B   689851870
chr1D   495453186
chr2A   780798557
chr2B   801256715
chr2D   651852609
chr3A   750843639
chr3B   830829764
chr3D   615552423
chr4A   744588157
chr4B   673617499
chr4D   509857067
chr5A   709773743
chr5B   713149757
chr5D   566080677
chr6A   618079260
chr6B   720988478
chr6D   473592718
chr7A   736706236
chr7B   750620385
chr7D   638686055
chrUn   480980714

If I make a small nodejs script

require('regenerator-runtime/runtime');

const { TwoBitFile } = require('@gmod/twobit')
const t = new TwoBitFile({
      path: require.resolve('./seq/out.2bit'),
})
t.getSequenceSizes().then((response) => {
    Object.keys(response).forEach(key => {
        console.log(key+"\t"+response[key]);
    });
})

Then I get

chr1A   594102056
chr1B   689851870
chr1D   495453186
chr2A   780798557
chr2B   801256715
chr2D   651852609
chr3A   750843639
chr3B   830829764
chr3D   615552423
chr4A   744588157
chr4B   673617499
chr4D   509857067
chr5A   709773743
chr5B   440477507
chr5D   22
chr6A   1919443717
chr6B   1661272064
chr6D   148846134
chr7A   627459121
chr7B   1093825128
chr7D   1661281852
chrUn   1919443717
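
The two listings first disagree at chr5B. As a rough cross-check of the overflow theory, the sketch below walks the twoBitInfo sizes and estimates where each sequence's packed data would start in the file, assuming 4 bases per byte and ignoring header, index, N-block and mask-block overhead. `firstPastInt32` is a hypothetical helper, not part of @gmod/twobit.

```javascript
// Chromosome sizes as reported by twoBitInfo above, in file order.
const sizes = [
  ['chr1A', 594102056], ['chr1B', 689851870], ['chr1D', 495453186],
  ['chr2A', 780798557], ['chr2B', 801256715], ['chr2D', 651852609],
  ['chr3A', 750843639], ['chr3B', 830829764], ['chr3D', 615552423],
  ['chr4A', 744588157], ['chr4B', 673617499], ['chr4D', 509857067],
  ['chr5A', 709773743], ['chr5B', 713149757], ['chr5D', 566080677],
  ['chr6A', 618079260], ['chr6B', 720988478], ['chr6D', 473592718],
  ['chr7A', 736706236], ['chr7B', 750620385], ['chr7D', 638686055],
  ['chrUn', 480980714],
];

const INT32_LIMIT = 2 ** 31; // first value a signed 32-bit int cannot hold

// Hypothetical helper: first sequence whose estimated start byte is >= 2^31.
function firstPastInt32(entries) {
  let basesSoFar = 0;
  for (const [name, size] of entries) {
    const approxStartByte = Math.floor(basesSoFar / 4);
    if (approxStartByte >= INT32_LIMIT) return { name, approxStartByte };
    basesSoFar += size;
  }
  return null;
}

const firstAffected = firstPastInt32(sizes);
console.log(firstAffected);
```

Under these assumptions the first sequence whose data starts beyond 2^31 bytes is chr5B, the same place where the sizes from the Node.js script start to go wrong.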

@rbuels

Collaborator Author

rbuels commented Aug 1, 2018

@cmdcolin alright upgrade deps and try again, the chrom sizes issues should be fixed. apparently all the ints they talk about in the spec are unsigned ints.

@cmdcolin

Contributor

cmdcolin commented Aug 1, 2018

Looking good!

@cmdcolin cmdcolin merged commit bbfadf1 into dev Aug 1, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@wafflebot wafflebot bot removed the in progress label Aug 1, 2018

@cmdcolin cmdcolin deleted the 1116_twobit branch Aug 14, 2018
