Consistent chunkSizeLimit issue #540

Closed
keiranmraine opened this Issue Nov 27, 2014 · 12 comments

@keiranmraine
Contributor

keiranmraine commented Nov 27, 2014

Hi,

I've found that I am consistently unable to view the alignments2 track for human chromosome 21. Although I'm aware that I can increase the chunkSizeLimit (and I have, to 15 MB), even now it is regularly exceeded, specifically on this chromosome.

I suspect that there is little that can be done; however, this seems to occur when there is virtually no data in the region. I've seen these hit 47 MB.

[screenshot: JBrowse error dialog, 2014-11-27 16:29:57]

Error: Too many BAM features. BAM chunk size 27,477,591 bytes exceeds chunkSizeLimit of 15,000,000. (_defaultMessage: "Too much data to show.")

Tested in 1.11.3.

I can share examples privately for testing purposes.

Regards,
Keiran
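(For anyone hitting the same error: the limit being raised here is a per-track setting on the BAM store in JBrowse. A minimal sketch of a trackList.json stanza; the label and urlTemplate values are illustrative placeholders, not from this report:)

```json
{
  "label": "alignments2",
  "storeClass": "JBrowse/Store/SeqFeature/BAM",
  "urlTemplate": "data/sample.bam",
  "type": "Alignments2",
  "chunkSizeLimit": 15000000
}
```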

@cmdcolin
Contributor

cmdcolin commented Nov 29, 2014

I would definitely like to check this out. I feel like I have also seen that happen on fairly sparse BAM files too.

@keiranmraine
Contributor

keiranmraine commented Dec 2, 2014

Do you have a Dropbox I could drop a dataset into?

@cmdcolin
Contributor

cmdcolin commented Dec 2, 2014

I sent you a link by email.

@vivekkrish
Contributor

vivekkrish commented Dec 2, 2014

Just an FYI, the iPlantCollaborative provides a free-to-use data store (http://www.iplantcollaborative.org/ci/data-store) with quite a lot of capacity (100 GB to start with).

One of the nice things about this data store is that BAM (and corresponding BAI) files stored there can be easily streamed to any genome browser of your choice (JBrowse, IGB, IGV, etc.). The service is also CORS compliant and will therefore work with a genome browser running on any server (documentation here: https://pods.iplantcollaborative.org/wiki/display/DEmanual/Sending+Genome+Files+to+the+Genome+Browser).

We routinely use this for hosting and sharing files with collaborators.

@keiranmraine
Contributor

keiranmraine commented Dec 2, 2014

@vivekkrish, thanks for this info, but unless it's possible to use authenticated access I can't use it: the data exhibiting this issue is human, and ethics requirements prevent me from putting it on publicly accessible resources.

@keiranmraine
Contributor

keiranmraine commented Dec 2, 2014

@cmdcolin I've dropped the files in. You only need chromosome 21 from the reference, but I put it all in (compressed) so that it would match the BAM header.

@vivekkrish
Contributor

vivekkrish commented Dec 3, 2014

@keiranmraine, the data is private to you, associated with your personal iPlant account. They have ACLs in place that let you send sharing invites for specific files/folders to other registered iPlant user accounts.

However, they also provide a nice capability to generate unique shareable URLs for individual files (BAM, GFF3, etc.) for streaming to genome browsers; these can be accessed only by the users with whom you have explicitly shared the link (similar to how it works in Google Drive and/or Dropbox).

@cmdcolin
Contributor

cmdcolin commented Dec 3, 2014

Hi Keiran, what I am seeing is that the "stats estimation" step results in too much data being downloaded for this file. The procedure for stats estimation in JBrowse seems to be: (1) pick a point somewhere in the middle of the chromosome; (2) download a 100 base pair range of data; (3) if there are not enough features, double the interval and retry. In this case, not enough data is sampled until the window reaches 3,276,800 base pairs, and at that point something exceeds chunkSizeLimit. So perhaps there should be a limit on the stats estimation.
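(The doubling procedure described above can be sketched as follows. `fetch(start, end)` is a hypothetical callback returning the features in a region; the function name and the 300-feature threshold are illustrative, not JBrowse's actual API:)

```python
def estimate_stats(fetch, ref_start, ref_end, min_features=300):
    """Sketch of the stats-estimation loop described above."""
    # start sampling part-way into the reference sequence
    sample_center = ref_start * 0.75 + ref_end * 0.25
    length = 100  # initial window: 100 bp
    while True:
        start = int(max(ref_start, sample_center - length / 2))
        end = int(min(ref_end, sample_center + length / 2))
        features = fetch(start, end)
        # stop when enough features were seen, or the window spans the refseq
        if len(features) >= min_features or end - start >= ref_end - ref_start:
            return len(features) / max(1, end - start)  # feature density
        length *= 2  # too sparse: double the interval and retry
```

On a sparse chromosome the window keeps doubling into the megabase range, which is exactly when the underlying BAM chunks blow past chunkSizeLimit.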

@cmdcolin
Contributor

cmdcolin commented Dec 3, 2014

Proposed patch to add some randomness and a retry limit...

diff --git a/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js b/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
index 137e3ad..b797033 100644
--- a/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
+++ b/src/JBrowse/Store/SeqFeature/GlobalStatsEstimationMixin.js
@@ -22,10 +22,21 @@ return declare( null, {
         var deferred = new Deferred();

         refseq = refseq || this.refSeq;
+        var retries = 0;
+        var sampleCenter = refseq.start*0.75 + refseq.end*0.25;

         var statsFromInterval = function( length, callback ) {
             var thisB = this;
-            var sampleCenter = refseq.start*0.75 + refseq.end*0.25;
+            var reset = false;
+            if( length>10000 ) {
+                length = 100;
+                retries++;
+                sampleCenter = Math.round(Math.random()*(refseq.end));
+            }
+            if( retries>10 ) {
+                callback.call( thisB, length,  null, "Failed to estimate stats" );
+                return;
+            }
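(The idea in the patch, window cap plus random re-sampling plus a retry limit, can be sketched outside JBrowse like this; `fetch` is again a hypothetical region-query callback and the thresholds mirror the numbers in the diff:)

```python
import random

def sample_with_retries(fetch, ref_start, ref_end,
                        min_features=300, max_window=10000, max_retries=10):
    """Sketch of the proposed fix: instead of growing the sampling window
    without bound, cap it and re-sample at a random position."""
    sample_center = ref_start * 0.75 + ref_end * 0.25
    length = 100
    retries = 0
    while retries <= max_retries:
        if length > max_window:
            # window got too big: restart small at a random position
            length = 100
            retries += 1
            sample_center = random.uniform(ref_start, ref_end)
        start = int(max(ref_start, sample_center - length / 2))
        end = int(min(ref_end, sample_center + length / 2))
        features = fetch(start, end)
        if len(features) >= min_features:
            return len(features) / (end - start)  # feature density
        length *= 2
    raise RuntimeError("Failed to estimate stats")
```

This bounds the largest request at max_window base pairs, so a sparse region can never trigger a multi-megabase fetch; genuinely empty files fail fast after max_retries attempts.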

@cmdcolin cmdcolin added the bug label Dec 3, 2014

@cmdcolin
Contributor

cmdcolin commented Dec 3, 2014

My patch may want to allow length>10000, since this is also called for GFF files, which may be less feature dense (the stats calculation waits until 300 features are found). Apparently the default limit in JBrowse is to stop once the whole chromosome has been read, but for BAM this seems prohibitively large and then causes these chunkSizeLimit errors.

I guess a lesson here might be to make this configurable as well (many of these values are hard coded).

@keiranmraine
Contributor

keiranmraine commented Dec 4, 2014

@cmdcolin, thanks for looking into this. Is there a time frame for the 1.11.6 point release? I'm especially in need of the XS read colouring fix for our production system.

@vivekkrish, thanks for the info, that's very useful. I'll have to confirm with our PI that he is happy with using this, but I expect it will be a very useful tool.

@cmdcolin cmdcolin added this to the 1.11.6 milestone Jan 23, 2015

@cmdcolin
Contributor

cmdcolin commented Feb 6, 2015

I'm not really sure my patch is the right approach; I think something has to be done to improve this more fundamentally.
