Avoid unnecessary re-reading of streams #4405

nnethercote · 2014-03-07T02:55:35Z

The first two patches in this pull request reduce the load time of http://www.minotnd.org/pdf/engineering/sanitary%20sewer/24x36_ntrunk.pdf (when loaded through a local web server rather than directly from file), from ~28 seconds to ~6 seconds, and reduce peak memory consumption from ~405 MiB to ~355 MiB.

The third patch doesn't have any effect on the load time/memory consumption of the files I tested, but it's certainly conceivable that there are files for which it would help. Getting all of a stream's bytes up-front is an anti-pattern that's worth avoiding in general.

yurydelendik · 2014-03-07T03:14:11Z

src/core/stream.js


    DecodeStream.call(this);
  }

  JpxStream.prototype = Object.create(DecodeStream.prototype);

+  JpxStream.prototype.__defineGetter__('bytes', function() {


Use:

    Object.defineProperty(JpxStream.prototype, 'bytes', {
      get: function JpxStream_bytes() {
        var bytes = this.stream.getBytes(this.length);
        return shadow(this, 'bytes', bytes);
      },
      configurable: true
    });

So if I understand correctly, that creates a getter called 'bytes', and the first time it's called the getter will be replaced with a read-only property? Is there an advantage to doing it that way compared to the way I've done it?
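To make the question above concrete, here is a minimal sketch of the lazy-getter pattern. It assumes `shadow()` is a helper that redefines the property on the instance as a read-only data property and returns the value; that definition, along with `FakeStream` and `LazyStream`, is a hypothetical stand-in written for illustration, not code taken from the patch.

```javascript
// Assumed helper: redefines `prop` on `obj` as a plain read-only data
// property holding `value`, then returns the value.
function shadow(obj, prop, value) {
  Object.defineProperty(obj, prop, {
    value: value,
    enumerable: true,
    configurable: true,
    writable: false
  });
  return value;
}

// Hypothetical stand-in for the underlying stream; counts getBytes() calls.
function FakeStream(data) {
  this.data = data;
  this.reads = 0;
}
FakeStream.prototype.getBytes = function FakeStream_getBytes(length) {
  this.reads++;
  return this.data.subarray(0, length);
};

// Hypothetical wrapper that defers reading until `bytes` is first used.
function LazyStream(stream, length) {
  this.stream = stream;
  this.length = length;
}
Object.defineProperty(LazyStream.prototype, 'bytes', {
  get: function LazyStream_bytes() {
    // Runs at most once per instance: shadow() installs an own property
    // that hides this prototype getter on later accesses.
    return shadow(this, 'bytes', this.stream.getBytes(this.length));
  },
  configurable: true
});

var source = new FakeStream(new Uint8Array([1, 2, 3, 4]));
var lazy = new LazyStream(source, 3);
var readsBeforeAccess = source.reads; // construction alone reads nothing
var first = lazy.bytes;               // triggers the single getBytes() call
var second = lazy.bytes;              // served by the shadowed own property
```

Under these assumptions the getter lives once on the prototype, the bytes are read only if something actually asks for them, and each instance caches its own result after the first access.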

nnethercote · 2014-03-07T05:24:57Z

I changed how the getters are defined.

Snuffleupagus · 2014-03-07T09:02:30Z

/botio test

pdfjsbot · 2014-03-07T09:02:32Z

From: Bot.io (Linux)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/1aaf341b354758c/output.txt

pdfjsbot · 2014-03-07T09:02:32Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/39942240785f43e/output.txt

pdfjsbot · 2014-03-07T09:28:51Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/1aaf341b354758c/output.txt

Total script time: 26.31 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2014-03-07T09:39:12Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/39942240785f43e/output.txt

Total script time: 36.65 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/39942240785f43e/reftest-analyzer.html#web=eq.log

yurydelendik · 2014-03-07T13:19:44Z

src/core/stream.js


    var b;
    while (codeSize < bits) {
-      if (typeof (b = bytes[bytesPos++]) == 'undefined')
+      if (typeof (b = str.getByte()) === 'undefined') {


getByte() returns -1 if a byte past EOF is requested

Hmm, yes. I'll fix, but I probably won't get to it until early next week.

I guess this means there aren't any tests with invalid FlateDecode streams in them :(

No, that just means we only have tests for valid/complete DEFLATE streams -- these have a valid EOF marker.

Right :) that's the same thing you just said, sorry
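The EOF issue raised in this thread can be illustrated with a small sketch. `ByteSource` and `readBits` are hypothetical names, and the error message is made up for the example; the only behaviour taken from the discussion is that `getByte()` returns -1 once the data is exhausted, so a `typeof b === 'undefined'` check would never detect the end of the stream.

```javascript
// Hypothetical minimal stream whose getByte() returns -1 past the end,
// matching the behaviour described in the review comment.
function ByteSource(bytes) {
  this.bytes = bytes;
  this.pos = 0;
}
ByteSource.prototype.getByte = function ByteSource_getByte() {
  if (this.pos >= this.bytes.length) {
    return -1;
  }
  return this.bytes[this.pos++];
};

// Collects `bits` bits (low-order byte first) and throws on truncated input.
function readBits(str, bits) {
  var codeBuf = 0;
  var codeSize = 0;
  var b;
  while (codeSize < bits) {
    // Comparing against -1 is what detects EOF here; a test for
    // `undefined` would never fire with this getByte().
    if ((b = str.getByte()) === -1) {
      throw new Error('Unexpected end of stream'); // hypothetical message
    }
    codeBuf |= b << codeSize;
    codeSize += 8;
  }
  return codeBuf & ((1 << bits) - 1);
}
```

With the `undefined` check, a truncated stream in this sketch would feed -1 into the bit buffer instead of raising an error, which is consistent with the observation that only complete streams were covered by the tests.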

timvandermeij · 2014-03-07T21:58:25Z

/botio-linux preview

pdfjsbot · 2014-03-07T21:58:26Z

From: Bot.io (Linux)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/b4d80886a9a7f03/output.txt

pdfjsbot · 2014-03-07T21:58:47Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/b4d80886a9a7f03/output.txt

Total script time: 0.35 mins

Published

Viewer: http://107.21.233.14:8877/b4d80886a9a7f03/web/viewer.html
B2G Viewer: http://107.21.233.14:8877/b4d80886a9a7f03/extensions/b2g/content/web/viewer.html
Extension: http://107.21.233.14:8877/b4d80886a9a7f03/extensions/firefox/pdf.js.xpi
Extension (AMO): http://107.21.233.14:8877/b4d80886a9a7f03/extensions/firefox/pdf.js.amo.xpi

timvandermeij · 2014-03-07T21:59:58Z

Awesome! Major difference indeed.

Snuffleupagus · 2014-03-08T12:39:21Z

src/core/stream.js


    DecodeStream.call(this);
  }

  JpegStream.prototype = Object.create(DecodeStream.prototype);

+  Object.defineProperty(JpegStream.prototype, 'bytes', {
+    get: function JpegStream_bytes() {
+      return shadow(this, 'bytes', this.stream.getBytes(this.length));


Add shadow to the list of globals to fix the lint errors.

nnethercote · 2014-03-08T13:00:45Z

I fixed the EOF/-1 confusion, and tested that the new code works by deliberately corrupting a FlateStream and checking that the error message was the same.

I also fixed the jshint warning.

Snuffleupagus · 2014-03-08T16:06:43Z

Besides the very nice performance improvement and memory reduction that this patch provides, it might also address #4084 at the same time.

/botio test

pdfjsbot · 2014-03-08T16:06:44Z

From: Bot.io (Linux)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/28348dd1476303a/output.txt

pdfjsbot · 2014-03-08T16:06:44Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/ebc37f1ad12b97a/output.txt

pdfjsbot · 2014-03-08T16:33:05Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/28348dd1476303a/output.txt

Total script time: 26.35 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2014-03-08T16:43:21Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/ebc37f1ad12b97a/output.txt

Total script time: 36.60 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/ebc37f1ad12b97a/reftest-analyzer.html#web=eq.log

brendandahl · 2014-03-10T20:41:45Z

src/core/chunked_stream.js

@@ -201,6 +201,7 @@ var ChunkedStream = (function ChunkedStreamClosure() {
        }
        return missingChunks;
      };
+      this.ensureRange(start, start + length);


I'd prefer a solution that does not use ensureRange. ensureRange throws an exception when any pieces are missing, so any function that calls makeSubStream must now handle this exception. Would it be enough to just fire off a request for all the missing chunks?

I think removing the ensureRange() call is a bad idea. For one, the subsequent patch isn't valid -- it causes the sewer PDF to hang.

The whole point of the patch is to avoid doing any work that might subsequently be thrown away. Your solution would be a partial fix -- depending on the load speed, those requests might be fulfilled by the time we're parsing. But they might not, so you could still end up doing wasted work.

There aren't that many calls to makeSubStream() anyway.

By checking if the data is all present before making a substream, we avoid cases where we parse part of a stream and then throw a MissingDataException part-way through, which forces us to later re-read the stream -- possibly multiple times. This is a sizeable performance win for some cases when file loading is slow (e.g. over the web).
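The fail-fast idea described here can be sketched as follows. This is a heavily simplified, hypothetical model: `SketchChunkedStream`, its chunk bookkeeping, and the shape of the returned substream are invented for illustration, and only the names `ensureRange`, `makeSubStream`, and `MissingDataException` come from the discussion.

```javascript
// Hypothetical exception type modelled on the MissingDataException named above.
function MissingDataException(begin, end) {
  this.begin = begin;
  this.end = end;
}

// Hypothetical, simplified chunked stream: it only tracks which fixed-size
// chunks have arrived so far.
function SketchChunkedStream(length, chunkSize) {
  this.length = length;
  this.chunkSize = chunkSize;
  this.loadedChunks = [];
}
SketchChunkedStream.prototype.onReceiveChunk = function (chunk) {
  this.loadedChunks[chunk] = true;
};
// Throws if any chunk overlapping [begin, end) has not been loaded yet.
SketchChunkedStream.prototype.ensureRange = function (begin, end) {
  if (begin >= end) {
    return;
  }
  var first = Math.floor(begin / this.chunkSize);
  var last = Math.floor((end - 1) / this.chunkSize);
  for (var chunk = first; chunk <= last; chunk++) {
    if (!this.loadedChunks[chunk]) {
      throw new MissingDataException(begin, end);
    }
  }
};
SketchChunkedStream.prototype.makeSubStream = function (start, length) {
  // Fail fast, before any caller starts parsing the substream.
  this.ensureRange(start, start + length);
  return { start: start, end: start + length };
};
```

In this sketch the exception is raised before any parsing work begins, so nothing has to be redone once the missing chunks arrive. The trade-off raised in the review still applies: every caller of `makeSubStream` has to be prepared for that exception.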

nnethercote · 2014-03-11T23:45:06Z

I moved the ensureRange() call earlier in the function, per bdahl's suggestion on IRC.

Snuffleupagus · 2014-03-12T09:55:21Z

/botio test

pdfjsbot · 2014-03-12T09:55:22Z

From: Bot.io (Linux)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/f5a835cafb19526/output.txt

pdfjsbot · 2014-03-12T09:55:22Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/ec2e68d99362d5d/output.txt

pdfjsbot · 2014-03-12T10:21:42Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/f5a835cafb19526/output.txt

Total script time: 26.34 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2014-03-12T10:31:55Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/ec2e68d99362d5d/output.txt

Total script time: 36.54 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/ec2e68d99362d5d/reftest-analyzer.html#web=eq.log

yurydelendik reviewed Mar 7, 2014

Snuffleupagus reviewed Mar 8, 2014

brendandahl reviewed Mar 10, 2014

nnethercote added 3 commits March 11, 2014 16:03

Don't get bytes eagerly when creating {Jpeg,Jpx,Jbig2}Stream objects.

d0253c8

This avoids lots of unnecessary work when such streams are referred to via fetch(), and so their bytes aren't subsequently read. This is a large performance win on some files.

Don't get bytes eagerly when creating FlateStream objects.

ea17749

brendandahl added a commit that referenced this pull request Mar 12, 2014

Merge pull request #4405 from nnethercote/avoid-re-reading-streams

c3ed71c

Avoid unnecessary re-reading of streams

brendandahl merged commit c3ed71c into mozilla:master Mar 12, 2014

Snuffleupagus mentioned this pull request Mar 13, 2014

Fix loading of fonts with empty font file (bug 866395) #4184

Closed

nnethercote deleted the avoid-re-reading-streams branch March 14, 2014 04:49
