2bit file processing is slow if it contains many refseqs #1116

Closed
cmdcolin opened this Issue Jul 13, 2018 · 6 comments

cmdcolin commented Jul 13, 2018

The 2bit file parsing currently launches one Promise per refseq on the initial parse, which becomes very slow when the file contains many refseqs.
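
To illustrate the cost, here is a minimal sketch that contrasts allocating one Promise per refseq with wrapping the whole pass in a single Promise. Everything in it (`makeIndex`, `loadPerRefseq`, `loadBatched`, the `promisesCreated` counter) is hypothetical and is not JBrowse's actual code; it only shows how the number of Promises scales with the number of refseqs.

```javascript
// Counter so the sketch can show how many Promises each strategy allocates.
let promisesCreated = 0;

function trackedPromise(executor) {
  promisesCreated += 1;
  return new Promise(executor);
}

// Hypothetical index: one { name, offset } entry per refseq.
function makeIndex(count) {
  return Array.from({ length: count }, (_, i) => ({
    name: `seq${i}`,
    offset: i * 16,
  }));
}

// Strategy A: one Promise per refseq (the pattern described above).
function loadPerRefseq(index) {
  const pending = index.map(entry =>
    trackedPromise(resolve =>
      resolve({ name: entry.name, offset: entry.offset })
    )
  );
  return Promise.all(pending);
}

// Strategy B: a single Promise that walks the whole index in one loop.
function loadBatched(index) {
  return trackedPromise(resolve => {
    resolve(index.map(entry => ({ name: entry.name, offset: entry.offset })));
  });
}
```

With N refseqs, strategy A creates N tracked Promises before anything resolves, while strategy B creates one regardless of N.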

rbuels commented Jul 13, 2018

my 2c, i'd say this is more of a bug than a tech debt thing, cause it's a scalability issue

@rbuels rbuels added this to the 1.15.1 milestone Jul 13, 2018

cmdcolin commented Jul 13, 2018

Agree :)

rbuels commented Jul 18, 2018

implementation sketch:

  • break out as a node module and rewrite/refactor

@rbuels rbuels added the big task label Jul 18, 2018

@rbuels rbuels self-assigned this Jul 18, 2018

rbuels commented Jul 24, 2018

@cmdcolin do you have a problematic 2bit file I can test with?

@rbuels rbuels added the in progress label Jul 24, 2018

cmdcolin commented Jul 24, 2018

I forgot to mention that this happens when the 2bit file is used as the reference instead of refSeqs.json.

If you use prepare-refseqs.pl, it currently writes out a refSeqs.json generated from the 2bit file, which makes 2bit loading fast. The slow path is hit if you use "Open sequence file" or create a trackList.json like this:

{
   "formatVersion" : 1,
   "tracks" : [
      {
         "category" : "Reference sequence",
         "key" : "Reference sequence",
         "label" : "DNA",
         "seqType" : "dna",
         "storeClass" : "JBrowse/Store/Sequence/TwoBit",
         "type" : "SequenceTrack",
         "urlTemplate" : "seq/out.2bit",
         "useAsRefSeqStore" : 1
      }
   ],
   "refSeqs": "seq/out.2bit"
}

Here's a simple tar.gz of a data dir with this

data.tar.gz

It has 10000 refseqs and takes 30 seconds to load on my computer because of the huge number of readSequenceHeader calls.
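
For reference, a rough sketch of how the refseq names and lengths could be collected in one synchronous pass, following the UCSC 2bit layout (16-byte header, then an index of name/offset entries, with each sequence record starting with its `dnaSize`). The function name `readTwoBitIndex` is made up for this example and is not the JBrowse implementation or the fix that was eventually merged. It assumes a little-endian file that is already fully loaded into an `ArrayBuffer`.

```javascript
// Signature word at the start of every 2bit file.
const TWOBIT_SIGNATURE = 0x1a412743;

// Hypothetical helper: returns [{ name, offset, length }, ...] for all refseqs.
// Assumes little-endian byte order and the whole file in memory.
function readTwoBitIndex(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  if (view.getUint32(0, true) !== TWOBIT_SIGNATURE) {
    throw new Error("not a little-endian 2bit file");
  }

  // Header: signature, version, sequenceCount, reserved (4 bytes each).
  const sequenceCount = view.getUint32(8, true);
  const refSeqs = [];
  let pos = 16;

  for (let i = 0; i < sequenceCount; i++) {
    // Index entry: 1-byte name length, the name, 4-byte record offset.
    const nameSize = view.getUint8(pos);
    pos += 1;
    let name = "";
    for (let j = 0; j < nameSize; j++) {
      name += String.fromCharCode(view.getUint8(pos + j));
    }
    pos += nameSize;
    const offset = view.getUint32(pos, true);
    pos += 4;

    // The sequence record begins with dnaSize, the number of bases.
    const length = view.getUint32(offset, true);
    refSeqs.push({ name, offset, length });
  }
  return refSeqs;
}
```

This reads every header in a plain loop with no per-refseq Promise. For files too large to hold in memory, the same layout could be read with a small number of ranged requests instead, which is outside this sketch.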

cmdcolin commented Aug 1, 2018

Fixed via #1146! Nice work

@cmdcolin cmdcolin closed this Aug 1, 2018
