Check that the first page can be successfully loaded, to try and ascertain the validity of the XRef table (issue 7496, issue 10326) #10392

Snuffleupagus · 2018-12-29T11:45:19Z

For PDF documents with sufficiently broken XRef tables, it's usually quite obvious when you need to fall back to indexing the entire file. However, for certain kinds of corrupted PDF documents the XRef table will, for all intents and purposes, appear to be valid. It's not until you actually try to fetch various objects that things will start to break, which is the case in the referenced issues[1].

Since there's generally a real effort being made in PDF.js to load even corrupt PDF documents, this patch contains a suggested approach for doing a bit more validation of the XRef table during the initial document loading phase.

Here the choice is made to attempt to load the first page, as a basic sanity check of the validity of the XRef table. Please note that attempting to load a more-or-less arbitrarily chosen object, without any context of what it's supposed to be, isn't very useful, which is why this particular choice was made.
Obviously, just because the first page can be loaded successfully, that doesn't guarantee that the entire XRef table is valid; however, if even the first page fails to load you can be reasonably sure that the document is not valid[2].
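
To illustrate the idea, here is a simplified sketch; it is not the code in this patch. The toy document shape and the names `fetchObject`, `rebuildXRef` and `loadWithFirstPageCheck` are hypothetical and only model the "check the first page, otherwise re-index the file" flow described above.

```javascript
// Toy document: `xref` maps object numbers to byte offsets, and
// `objectsAtOffset` models what is actually stored at each offset.
// All names here are hypothetical, not PDF.js APIs.
function fetchObject(doc, num) {
  const offset = doc.xref.get(num);
  const obj = doc.objectsAtOffset.get(offset);
  // A real parser would compare against the "num gen obj" header at the offset.
  if (!obj || obj.num !== num) {
    throw new Error(`XRef entry for object ${num} is invalid`);
  }
  return obj;
}

// The slow fallback: index every object found in the file.
function rebuildXRef(doc) {
  const xref = new Map();
  for (const [offset, obj] of doc.objectsAtOffset) {
    xref.set(obj.num, offset);
  }
  return { ...doc, xref };
}

// Use the first page as a sanity check of the XRef table.
function loadWithFirstPageCheck(doc, firstPageNum) {
  try {
    fetchObject(doc, firstPageNum);
    return { doc, recovered: false };
  } catch (ex) {
    const rebuilt = rebuildXRef(doc);
    fetchObject(rebuilt, firstPageNum); // Still throws if the file is beyond repair.
    return { doc: rebuilt, recovered: true };
  }
}

// The table looks well-formed, but the offsets of objects 1 and 2 are swapped.
const broken = {
  xref: new Map([[1, 200], [2, 100]]),
  objectsAtOffset: new Map([
    [100, { num: 1, type: "Page" }],
    [200, { num: 2, type: "Font" }],
  ]),
};
const result = loadWithFirstPageCheck(broken, 1);
console.log(result.recovered, result.doc.xref.get(1));
```

Note that the table in the example passes any purely structural check; only the attempt to fetch a known object exposes the problem.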

Even though this patch won't cause any significant increase in the amount of parsing required during initial loading of the document[3], it will require loading more data upfront, which delays the resolution of the initial getDocument call.
Whether or not this is a problem depends very much on what you actually measure; please consider the following examples:

console.time('first');
getDocument(...).promise.then((pdfDocument) => {
  console.timeEnd('first');
});

console.time('second');
getDocument(...).promise.then((pdfDocument) => {
  pdfDocument.getPage(1).then((pdfPage) => { // Note: the API uses `pageNumber >= 1`, the Worker uses `pageIndex >= 0`.
    console.timeEnd('second');
  });
});

The first case is pretty much guaranteed to show a small regression; however, the second case won't be affected at all, since the Worker caches the result of getPage calls. Again, please remember that the second case is what matters for the standard PDF.js use-case, which is why I'm hoping that this patch is deemed acceptable.
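
As a rough illustration of why the second measurement is unaffected, here is a minimal sketch of promise-based caching. The `PageCache` class and its `loadPage` callback are hypothetical; this is not how the Worker is actually implemented, only the general technique.

```javascript
// Hypothetical worker-side cache; the names are illustrative, not PDF.js APIs.
class PageCache {
  constructor(loadPage) {
    this._loadPage = loadPage; // Performs the actual (expensive) lookup.
    this._promises = new Map(); // pageIndex -> Promise for that page.
  }

  getPage(pageIndex) {
    let promise = this._promises.get(pageIndex);
    if (!promise) {
      // The executor runs immediately; a thrown error becomes a rejection.
      promise = new Promise((resolve) => resolve(this._loadPage(pageIndex)));
      this._promises.set(pageIndex, promise);
    }
    return promise;
  }
}

let lookups = 0;
const cache = new PageCache((pageIndex) => {
  lookups++;
  return { pageIndex };
});

// The sanity check during document loading requests the first page...
const duringLoad = cache.getPage(0);
// ...and a later API call for page number 1 maps to the same `pageIndex`.
const fromApi = cache.getPage(0);
console.log(duringLoad === fromApi, lookups);
```

With a cache like this, the cost of finding the first page is paid once, during loading, rather than during the first getPage call.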

Fixes #7496.
Fixes #10326.

[1] In issue 7496, the problem is that the document is edited without the XRef table being correctly updated.
In issue 10326, the generator was sorting the XRef table according to the offsets rather than the objects.

[2] The idea of checking the first page in particular came from the "standard" use-case for the PDF.js library, i.e. the default viewer, where a failure to load the first page basically means that nothing will work; note how {BaseViewer, PDFThumbnailViewer}.setDocument depends completely on being able to fetch the first page.

[3] The only extra parsing is caused by, potentially, having to traverse part of the Pages tree to find the first page.
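
A small sketch of what "traversing part of the Pages tree" means, using plain objects in place of PDF dictionaries. The `findFirstPage` helper and the `label` property are hypothetical; only the `Type`, `Kids`, `Pages` and `Page` names correspond to the PDF format.

```javascript
// Depth-first descent that stops at the first leaf with `Type: "Page"`.
// Hypothetical helper, not the PDF.js implementation.
function findFirstPage(root) {
  const visited = new Set();

  function descend(node) {
    if (visited.has(node)) {
      throw new Error("Pages tree contains a cycle");
    }
    visited.add(node);
    if (node.Type === "Page") {
      return node;
    }
    for (const kid of node.Kids || []) {
      const page = descend(kid);
      if (page) {
        return page;
      }
    }
    return null; // Empty subtree; keep looking in the next sibling.
  }

  return { page: descend(root), nodesVisited: visited.size };
}

const tree = {
  Type: "Pages",
  Kids: [
    {
      Type: "Pages",
      Kids: [
        { Type: "Page", label: "p1" },
        { Type: "Page", label: "p2" },
      ],
    },
    { Type: "Pages", Kids: [{ Type: "Page", label: "p3" }] },
  ],
};

const found = findFirstPage(tree);
console.log(found.page.label, found.nodesVisited);
```

Only the nodes on the path down to the first leaf are touched; the siblings that follow are never visited.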


Snuffleupagus · 2018-12-29T11:50:12Z

/botio test

pdfjsbot · 2018-12-29T11:50:13Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/d4f3bfcf86218f7/output.txt

pdfjsbot · 2018-12-29T11:50:13Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/88677a390a21a8f/output.txt

pdfjsbot · 2018-12-29T12:07:58Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/d4f3bfcf86218f7/output.txt

Total script time: 17.74 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2018-12-29T12:13:36Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/88677a390a21a8f/output.txt

Total script time: 23.37 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

timvandermeij · 2018-12-29T13:18:49Z

/botio-linux preview

pdfjsbot · 2018-12-29T13:18:49Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/f11ca4aa30e621c/output.txt

pdfjsbot · 2018-12-29T13:20:29Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/f11ca4aa30e621c/output.txt

Total script time: 1.65 mins

Published

Viewer: http://54.67.70.0:8877/f11ca4aa30e621c/web/viewer.html

timvandermeij · 2018-12-29T14:13:52Z

Thank you! I also really like the readability improvements in loadDocument.

Snuffleupagus force-pushed the checkFirstPage branch from 10f7e1d to 60bcce1 December 29, 2018 11:47

timvandermeij added the core label Dec 29, 2018

timvandermeij merged commit e53877f into mozilla:master Dec 29, 2018

timvandermeij mentioned this pull request Dec 29, 2018

Error in opening PDF file #9934

Closed

Snuffleupagus deleted the checkFirstPage branch December 29, 2018 14:31

Snuffleupagus mentioned this pull request Nov 25, 2021

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

Merged
