Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid unnecessary re-reading of streams #4405

Merged
merged 3 commits into from
Mar 12, 2014

Conversation

nnethercote
Copy link
Contributor

The first two patches in this pull request reduce the load time of http://www.minotnd.org/pdf/engineering/sanitary%20sewer/24x36_ntrunk.pdf (when loaded through a local web server rather than directly from file), from ~28 seconds to ~6 seconds, and reduce peak memory consumption from ~405 MiB to ~355 MiB.

The third patch doesn't have any effect on the load time/memory consumption of the files I tested, but it's certainly conceivable that there are files for which it would help. Getting all of a stream's bytes up-front is an anti-pattern that's worth avoiding in general.


DecodeStream.call(this);
}

JpxStream.prototype = Object.create(DecodeStream.prototype);

JpxStream.prototype.__defineGetter__('bytes', function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use:

Object.defineProperty(JpxStream.prototype, 'bytes', {
  get: function JpxStream_bytes() {
     var bytes = this.stream.getBytes(this.length);
     return shadow(this, 'bytes', bytes);
  },
  configurable: true
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, that creates a getter called 'bytes', and the first time it's called the getter will be replaced with a read-only property? Is there an advantage doing it that way compared to the way I've done it?

@nnethercote
Copy link
Contributor Author

I changed how the getters are defined.

@Snuffleupagus
Copy link
Collaborator

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Mar 7, 2014

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/1aaf341b354758c/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 7, 2014

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/39942240785f43e/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 7, 2014

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/1aaf341b354758c/output.txt

Total script time: 26.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Mar 7, 2014

From: Bot.io (Windows)


Failed

Full output at http://107.22.172.223:8877/39942240785f43e/output.txt

Total script time: 36.65 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/39942240785f43e/reftest-analyzer.html#web=eq.log


var b;
while (codeSize < bits) {
if (typeof (b = bytes[bytesPos++]) == 'undefined')
if (typeof (b = str.getByte()) === 'undefined') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getByte() returns -1 if byte after eof is requested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes. I'll fix, but I probably won't get to it until early next week.

I guess this means there aren't any tests with invalid FlateDecode streams in them :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's just mean we have tests only for valid/complete DEFLATE streams -- these have valid EOF marked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right :) the same you just said, sorry

@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

pdfjsbot commented Mar 7, 2014

From: Bot.io (Linux)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/b4d80886a9a7f03/output.txt

@timvandermeij
Copy link
Contributor

Awesome! Major difference indeed.


DecodeStream.call(this);
}

JpegStream.prototype = Object.create(DecodeStream.prototype);

Object.defineProperty(JpegStream.prototype, 'bytes', {
get: function JpegStream_bytes() {
return shadow(this, 'bytes', this.stream.getBytes(this.length));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add shadow to the list of globals to fix the lint errors.

@nnethercote
Copy link
Contributor Author

I fixed the eof/-1 confusion, and tested the new code worked by deliberately corrupting a FlateStream and checking that the error message was the same.

I also fixed the jshint warning.

@Snuffleupagus
Copy link
Collaborator

Beside the very nice performance improvement and memory reduction that this patch provides, it might also address #4084 at the same time.

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Mar 8, 2014

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/28348dd1476303a/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 8, 2014

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/ebc37f1ad12b97a/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 8, 2014

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/28348dd1476303a/output.txt

Total script time: 26.35 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Mar 8, 2014

From: Bot.io (Windows)


Failed

Full output at http://107.22.172.223:8877/ebc37f1ad12b97a/output.txt

Total script time: 36.60 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/ebc37f1ad12b97a/reftest-analyzer.html#web=eq.log

@@ -201,6 +201,7 @@ var ChunkedStream = (function ChunkedStreamClosure() {
}
return missingChunks;
};
this.ensureRange(start, start + length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a solution that does not use ensureRange. EnsureRange throws an exception when any pieces are missing so any function that calls makeSubstream must now handle this exception. Would it be enough to just fire of a request for all the missing chunks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing the ensureRange() call is a bad idea. For one, the subsequent patch isn't valid -- it cause the sewer PDF to hang.

The whole point of the patch is to avoid doing any work that might subsequently be thrown away. Your solution would be a partial fix -- depending on the load speed, those requests might be fulfilled by the time we're parsing. But they might not, so you could still end up doing wasted work.

There aren't that many calls to makeSubStream() anyway.

By checking if the data is all present before making a substream, we avoid
cases where we parse part of a stream and then throw a MissingDataException
part-way through, which forces us to later re-read the stream -- possibly
multiple times. This is a sizeable performance win for some cases when file
loading is slow (e.g. over the web).
This avoids lots of unnecessary work when such streams are referred to via
fetch(), and so their bytes aren't subsequently read. This is a large
performance win on some files.
@nnethercote
Copy link
Contributor Author

I moved the ensureRange() call earlier in the function, per bdahl's suggestion on IRC.

@Snuffleupagus
Copy link
Collaborator

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/f5a835cafb19526/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/ec2e68d99362d5d/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/f5a835cafb19526/output.txt

Total script time: 26.34 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://107.22.172.223:8877/ec2e68d99362d5d/output.txt

Total script time: 36.54 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/ec2e68d99362d5d/reftest-analyzer.html#web=eq.log

brendandahl added a commit that referenced this pull request Mar 12, 2014
Avoid unnecessary re-reading of streams
@brendandahl brendandahl merged commit c3ed71c into mozilla:master Mar 12, 2014
@nnethercote nnethercote deleted the avoid-re-reading-streams branch March 14, 2014 04:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants