
Do not transform jpeg RGB components #11966

Closed
wants to merge 519 commits into from

Conversation

@havocbcn (Contributor) commented Jun 4, 2020

Hello,

Fix issue #11931

The image is encoded using RGB, not YCbCr

See https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048

From ISO/IEC 10918-6:2013 (E), section 6.1 (http://www.itu.int/rec/T-REC-T.872-201206-I/en):

Images encoded with three components are assumed to be RGB data encoded as YCbCr unless the image contains an APP14 marker segment as specified in 6.5.3, in which case the colour encoding is considered either RGB or YCbCr according to the application data of the APP14 marker segment

The head.jpg file doesn't have an APP14 marker, so according to the spec the image should be assumed to be YCbCr, not RGB.

However, the second-best answer points to the implementation of a Java JPEG library: if the components are named R, G and B, the color space is RGB. That is the case here; the components are named exactly that way.

This pull request keeps the index in the components struct so that `_isColorConversionNeeded` does not convert if the components are named R, G and B.
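A minimal sketch of that check, assuming the decoder records each frame component's one-byte identifier from the SOF marker (the `index` property and helper name here are illustrative, not the exact pdf.js internals):

```
// Sketch: detect an RGB JPEG via its component identifiers. In the
// SOF marker every component carries a one-byte identifier, and some
// encoders use the ASCII codes for "R", "G" and "B" (0x52, 0x47,
// 0x42) to signal that no YCbCr transform was applied.
function isRgbByComponentIds(components) {
  return (
    components.length === 3 &&
    components[0].index === 0x52 && // "R"
    components[1].index === 0x47 && // "G"
    components[2].index === 0x42 // "B"
  );
}
// A check like _isColorConversionNeeded could then skip the
// YCbCr -> RGB transform whenever this returns true.
```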

The PDF is now shown correctly.

Snuffleupagus and others added 30 commits June 4, 2020 08:31
This is not only useful to have one format for consistency, but also to
be able to quickly search for colors for e.g., finding duplicates or
when tweaking the CSS for custom deployments.
…the PDF header is missing (bug 1606566)

As suggested in the bug we'll now always fallback to "other", which should be fine considering that that value is listed in the histogram definition; see https://searchfox.org/mozilla-central/rev/331f0c3b25089c9a16be65f4dc8c601aeaac8cc4/toolkit/components/telemetry/Histograms.json#9097-9106

Fixes https://bugzilla.mozilla.org/show_bug.cgi?id=1606566
…ding telemetry data from the viewer

When working on PR 11463 I couldn't help thinking that the `Array.prototype.some` callback function, used when determining the generator, was somewhat difficult to read with its partly unused and strangely named parameters.
…` (PR 11418 follow-up)

Currently *all* users of `pdfjs-dist` are forced to install the `webpack` and `worker-loader` packages, despite the fact that they are *only* relevant if the `webpack.js` file is being used (with a custom Webpack build).
This really doesn't seem great, especially since those packages are the only remaining dependencies in the `pdfjs-dist` library, and it thus seems more reasonable overall that Webpack users handle those dependencies themselves.

To prevent unnecessarily cryptic runtime failures, when people update to newer `pdfjs-dist` versions, the `webpack.js` file was updated to explicitly check for the existence of the `worker-loader` package and error otherwise.
Furthermore, note that `webpack` was only listed as a dependency because of the `worker-loader` package itself (see issue 9248).

Obviously these changes may not be seen as great by Webpack users who rely on `pdfjs-dist`, since it forces them to handle the dependencies themselves, however it should improve things considerably for "general" users of `pdfjs-dist` by not burdening them with unnecessary dependencies.
These sort of changes are also in line with other recent changes, see PR 11418, which removed built-in fake worker loader code for specific JS builders/bundlers/frameworks. This work was prompted not only by a desire to simplify/clean-up old code, but also to lessen future support burden since the PDF.js contributors cannot be assumed to be experts in various JS bundlers.
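A hedged sketch of what such an existence check could look like (the exact message and structure in `webpack.js` may differ):

```
// Sketch: fail early with a clear message if the peer dependency is
// missing, instead of letting the build fail in a cryptic way later.
try {
  // require.resolve() throws if the package cannot be found.
  require.resolve("worker-loader");
} catch (ex) {
  throw new Error(
    'The "worker-loader" package is required when using the ' +
      "pdfjs-dist Webpack entry point; please install it manually."
  );
}
```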
…-base, have a `.js` file extension

In order to eventually get rid of SystemJS and start using native `import`s instead, we'll need to provide "complete" file identifiers since otherwise there'll be MIME type errors when attempting to use `import`.
This rule is already enabled in mozilla-central, and helps avoid some confusing formatting[1], see https://searchfox.org/mozilla-central/rev/a92ed79b0bc746159fc31af1586adbfa9e45e264/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#209-210

Since some cases aren't entirely straightforward to convert, and some we probably don't want to change either[2], this patch is limited to the `web/` directory as a proof-of-concept.

Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-nested-ternary

---
[1] One example with *three* ternary statements: https://github.com/mozilla/pdf.js/blob/93aa613db7a874e6ed964f394241605d5586c142/src/core/fonts.js#L2623-L2630

[2] One example that should be whitelisted: https://github.com/mozilla/pdf.js/blob/93aa613db7a874e6ed964f394241605d5586c142/src/core/jbig2.js#L82-L92
… ensure correct text rendering (issue 11457)

This is currently the only possible way of addressing the issue, until https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/direction becomes generally available in browsers.

*Note:* This will also require manually updating https://mozilla.github.io/pdf.js/examples/#interactive-examples
…e DOM elements (issue 11499)

As described in the issue, having a DOM element with `id=page2` (or any other number) will automatically cause that element to become linkable through the URL hash. That's currently leading to some confusing and outright wrong behaviour, since it obviously only works for pages that have been loaded and rendered.

For PDF documents the only officially supported way to reference a particular page through the URL hash is using the `#page=2` format, which also works for all pages regardless of whether they're loaded or not.

As far as I can tell there's nothing in the PDF.js default viewer that actually depends on the page/thumbnail `id` at this point in time, hence why I believe that this removal ought to be safe.
Just as a precaution this patch adds an `aria-label` to the page canvas, similar to the thumbnail canvas/image, to at least keep this information in the DOM.
This particular JSDoc comment is fairly old and it also contains some now unrelated/confusing information.
The only way to *guarantee* that the PDF.js library works as expected is to correctly set the global `workerSrc`[1], hence giving the impression that the option isn't strictly necessary is thus incorrect.

---
[1] Since advertising the fallbackWorkerSrc functionality definitely seems like the *wrong* thing to do.
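For reference, correctly setting the global `workerSrc` looks like this (the worker path is just an example):

```
import * as pdfjsLib from "pdfjs-dist";

// Always point the library at a matching worker build; relying on any
// fallback behaviour is discouraged, as noted above.
pdfjsLib.GlobalWorkerOptions.workerSrc = "/build/pdf.worker.js";
```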
… to classes with static methods

This seems nicer overall, since it's now a bit clearer that the various build targets *extend* the default implementation.
… 844349 follow-up)

With https://bugzilla.mozilla.org/show_bug.cgi?id=844349 now being fixed in Firefox, the textLayer will now actually stay hidden as intended regardless of the browser settings.
Hence it should no longer be necessary to display the fallback bar, nor print a warning in the console, for documents which contain a textLayer.

Besides removing the `supportsDocumentColors` methods in the default viewer, we can also remove a now unused l10n string.
…n development mode and GENERIC builds

While only the `MOZCENTRAL` builds will actually do anything meaningful with the telemetry data, none of the code in question actually runs *at all* in e.g. development mode.[1]
This seems bad since it essentially means that this code is completely untested, despite being quite important for the built-in Firefox PDF viewer, and this thus ought to be fixed.

In this case, the explanation for the current state of the code is "for historical reasons". Before the viewer was split into the current components and before the pre-processor was improved, back when all code resided in the `web/viewer.js` file, the telemetry reporting was done with *direct* `FirefoxCom` calls. However, with the dummy `DefaultExternalServices.reportTelemetry` method there's nothing actually preventing telemetry reporting from being attempted in any type of build.

NOTE: By running this code in GENERIC builds as well, in addition to just locally, the *viewer* part of telemetry reporting becomes tested e.g. in preview builds too which should help with reviewing.

---
[1] When fixing bug 1606566, I had to edit the relevant `PDFJSDev` checks to be able to actually test the changes locally.
… bots

Note that this will still allow the FBF tests to run locally, and also on the bots when invoked with `test`/`browsertest` (to not lose all the FBF test-coverage), but will no longer prevent `makeref` from running successfully on the bots.
I just happened to notice that we've apparently never displayed the "CSS" tab [in the interactive examples](https://mozilla.github.io/pdf.js/examples/index.html#interactive-examples), which probably limits the number of users who'll notice the `direction: ltr;` CSS rule on the canvas :-)
…446 follow-up)

These two cases should have been whitelisted prior to re-formatting, or had their comments fixed afterwards; however, I unfortunately missed them because of the massive size of the diff.
…fontWeight` in the same way as with canvas (PR 6091 and 7839 follow-up)
This rule is already enabled in mozilla-central, and helps avoid some confusing formatting, see https://searchfox.org/mozilla-central/rev/9e45d74b956be046e5021a746b0c8912f1c27318/tools/lint/eslint/eslint-plugin-mozilla/lib/configs/recommended.js#209-210

With the recent introduction of Prettier some of the existing nested ternary statements became even more difficult to read, since any possibly helpful indentation was removed.
This particular ESLint rule wasn't entirely straightforward to enable, and I do recognize that there's a certain amount of subjectivity in the changes being made. Generally, the changes in this patch fall into three categories:
 - Cases where a value is only clamped to a certain range (the easiest ones to update).
 - Cases where the values involved are "simple", such as Numbers and Strings, which are re-factored to initialize the variable with the *default* value and only update it when necessary by using `if`/`else if` statements.
 - Cases with more complex and/or larger values, such as TypedArrays, which are re-factored to let the variable be (implicitly) undefined and where all values are then set through `if`/`else if`/`else` statements.

Please find additional details about the ESLint rule at https://eslint.org/docs/rules/no-nested-ternary
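An illustrative before/after for the second category; the variable name and values below are made up for the example:

```
const width = 320; // example input

// Before: a nested ternary that, once Prettier removes any helpful
// indentation, is hard to scan:
//   const sizeClass = width < 100 ? "small" : width < 500 ? "medium" : "large";

// After: initialize with the *default* value and only update it when
// necessary, using if/else if statements.
let sizeClass = "large";
if (width < 100) {
  sizeClass = "small";
} else if (width < 500) {
  sizeClass = "medium";
}
```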
Snuffleupagus and others added 28 commits June 4, 2020 08:31
Originally the `default_preferences.json` file was checked into the repository, and we thus needed to load it in non-PRODUCTION mode (which was originally done asynchronously using `XMLHttpRequest`). Over the years a lot has changed and the `default_preferences.json` file is now built, by the `gulp default_preferences` task, from the `web/app_options.js` file. Hence it's no longer necessary, in non-PRODUCTION mode, to use SystemJS here since we can simply use a standard `import` statement instead.

Note how e.g. `web/app.js` already imports from `web/app_options.js` in the same exact way that `web/preferences.js` now does, hence this patch will *not* result in any significant changes in the built/bundled viewer file.

This is another (small) part in trying to reduce usage of SystemJS, with the goal of hopefully getting rid of it completely. (I've started working on this, and doing so has identified a number of problem areas; this patch addresses one of them.)
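In practice the change amounts to replacing the asynchronous SystemJS load with a standard static import, roughly as follows (the SystemJS line shown is illustrative):

```
// Before (non-PRODUCTION mode), something along these lines:
//   const { AppOptions, OptionKind } = await SystemJS.import("./app_options.js");

// After: a plain static import, just like web/app.js already uses.
import { AppOptions, OptionKind } from "./app_options.js";
```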
… the browser (bug 1632644)

Apparently the old link format used in MOZCENTRAL-builds, with the blob URL separated from the filename by a `?` character, violates the specification; see https://bugzilla.mozilla.org/show_bug.cgi?id=1632644#c5

Obviously just removing the `?`-part of the URL would have worked, but that would also have meant that we'd no longer be able to provide the correct filename when the user attempts to download the opened PDF attachment.
To fix this we'll instead append the filename in the hash-part of the URL, which however required using a *custom* hash-parameter to avoid triggering the fallback "named destination" code-paths in the viewer.

Note that only changing the `web/pdf_attachment_viewer.js` file wasn't sufficient to fix the bug, and we also need to tweak the `webViewerInitialized` function in `web/app.js` since MOZCENTRAL-builds used to ignore *everything* in the URL hash.
This particular code is very old, but changing it *should* be completely safe given that the `PDFViewerApplication.setTitleUsingUrl` method has for some time now stored both the original URL (in `this.url`) as well as one without the hash (in `this.baseUrl`). The latter one is already used everywhere it matters, so this change seems fine to me.

This patch thus restores the original behaviour for PDF attachments in the MOZCENTRAL-build, by once again allowing them to be opened *directly* in the browser without downloading. (The fallback added in PR 11845 is obviously kept, since it seems generally useful to have.)
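A sketch of the resulting URL construction; `filename` as the hash-parameter name is an assumption here, the important part being that the viewer must not mistake it for a named destination:

```
// Sketch, assuming `content` (a TypedArray) and `filename` (a string)
// are given for the attachment.
const blobUrl = URL.createObjectURL(
  new Blob([content], { type: "application/pdf" })
);
// Carry the filename in the hash, via a custom parameter, instead of
// the spec-violating `?` separator.
const viewerUrl = blobUrl + "#filename=" + encodeURIComponent(filename);
```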
…URL` call in `web/pdf_document_properties.js` (PR 10114 follow-up)

Given that the `getPDFFileNameFromURL` helper function has a specific code-path for handling non-string inputs, this empty string fallback really isn't necessary at the call-site in `web/pdf_document_properties.js`.
…o accept an asynchronous function

As part of trying to reduce the usage of SystemJS in the development viewer, this patch is a necessary step that will allow removal of some `require` statements.

Currently this uses `SystemJS.import` in non-PRODUCTION mode, but it should be possible to replace those with standard *dynamic* `import` calls in the future.
…d `import`/`export` statements

As part of reducing our reliance on SystemJS in the development viewer, this patch replaces usage of `require` statements with modern standard `import`/`export` statements instead.

If we want to try and move forward with reducing usage of SystemJS, we don't have much choice but to make these kinds of changes (despite what prior test-results showed; however, I'm no longer able to reproduce the issues locally).
With these changes SystemJS is now only used, during development, on the worker-thread and in the unit/font-tests, since Firefox is currently missing support for worker modules; please see https://bugzilla.mozilla.org/show_bug.cgi?id=1247687

Hence all the JavaScript files in the `web/` and `src/display/` folders are now loaded *natively* by the browser (during development) using standard `import` statements/calls, thanks to a nice `import-maps` polyfill.

*Please note:* As soon as https://bugzilla.mozilla.org/show_bug.cgi?id=1247687 is fixed in Firefox, we should be able to remove all traces of SystemJS and thus finally be able to use every possible modern JavaScript feature.
…e, level (issue 11878)

Currently image resources, as opposed to e.g. font resources, are handled exclusively on a page-specific basis. Generally speaking this makes sense, since pages are separate from each other, however there are PDF documents where many (or even all) pages actually reference exactly the same image resources (through the XRef table). Hence, in some cases, we're decoding the *same* images over and over for every page, which is obviously slow and wastes both CPU and memory resources better used elsewhere.[1]

Obviously we cannot simply treat all image resources as-if they're used throughout the entire PDF document, since that would end up increasing memory usage too much.[2]
However, by introducing a `GlobalImageCache` in the worker we can track image resources that appear on more than one page. Hence we can switch image resources from being page-specific to being document-specific, once the image resource has been seen on more than a certain number of pages.

In many cases, such as e.g. the referenced issue, this patch will thus lead to reduced memory usage for image resources. Scrolling through all pages of the document, there's now only a few main-thread copies of the same image data, as opposed to one for each rendered page (i.e. there could theoretically be *twenty* copies of the image data).
While this obviously benefits both CPU and memory usage in this case, for *very* large image data this patch *may* possibly increase persistent main-thread memory usage a tiny bit. Thus to avoid negatively affecting memory usage too much in general, particularly on the main-thread, the `GlobalImageCache` will *only* cache a certain number of image resources at the document level and simply fall back to the default behaviour otherwise.

Unfortunately the asynchronous nature of the code, with ranged/streamed loading of data, actually makes all of this much more complicated than if all data could be assumed to be immediately available.[3]

*Please note:* The patch will lead to *small* movement in some existing test-cases, since we're now using the built-in PDF.js JPEG decoder more. This was done in order to simplify the overall implementation, especially on the main-thread, by limiting it to only the `OPS.paintImageXObject` operator.

---
[1] There's e.g. PDF documents that use the same image as background on all pages.

[2] Given that data stored in the `commonObjs`, on the main-thread, are only cleared manually through `PDFDocumentProxy.cleanup`. This as opposed to data stored in the `objs` of each page, which is automatically removed when the page is cleaned-up e.g. by being evicted from the cache in the default viewer.

[3] If the latter case were true, we could simply check for repeat images *before* parsing started and thus avoid handling *any* duplicate image resources.
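A much-simplified sketch of the promotion logic described above (the real class lives in `src/core/image_utils.js`; names and threshold values here are illustrative):

```
const NUM_PAGES_THRESHOLD = 2; // promote once seen on this many pages
const MAX_IMAGES_TO_CACHE = 10; // cap document-level memory usage

class GlobalImageCacheSketch {
  constructor() {
    this._refCache = new Map(); // imageRef -> { pageIndexSet, data }
  }

  // Record that `ref` was seen on `pageIndex` and decide whether the
  // image should now be cached at the document level.
  shouldCache(ref, pageIndex) {
    let entry = this._refCache.get(ref);
    if (!entry) {
      entry = { pageIndexSet: new Set(), data: null };
      this._refCache.set(ref, entry);
    }
    entry.pageIndexSet.add(pageIndex);

    const cachedCount = [...this._refCache.values()].filter(
      e => e.data !== null
    ).length;
    // Promote only images seen on "enough" pages, and never exceed the
    // global cap; otherwise callers fall back to page-level caching.
    return (
      entry.pageIndexSet.size >= NUM_PAGES_THRESHOLD &&
      cachedCount < MAX_IMAGES_TO_CACHE
    );
  }

  setData(ref, data) {
    const entry = this._refCache.get(ref);
    if (entry) {
      entry.data = data;
    }
  }
}
```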
… data (PR 11912 follow-up)

When "Cleanup" is triggered, you obviously need to remove all globally cached data on *both* the main- and worker-threads.
However, the current implementation of the `GlobalImageCache.clear` method also means that we lose *all* information about which images were cached, not just their data. This thus has the somewhat unfortunate side-effect of requiring images, which were previously known to be "global", to *again* have to reach `NUM_PAGES_THRESHOLD` before being cached again.

To avoid doing unnecessary parsing after "Cleanup", we can thus let `GlobalImageCache.clear` keep track of which images were cached while still removing their actual data. This should not have any significant impact on memory usage, since the only extra thing being kept is a `RefSetCache` (essentially an Object) with a couple of `Set`s containing only integers.
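Continuing the sketch above, the tweak amounts to a flag (the `onlyData` name is an assumption) that releases the decoded data while keeping the tiny page-tracking sets:

```
// Method body for the GlobalImageCacheSketch class shown earlier.
clear(onlyData = false) {
  for (const entry of this._refCache.values()) {
    entry.data = null; // always release the (large) decoded data
  }
  if (!onlyData) {
    // Also forget which images were "global"; after this, an image
    // must again reach NUM_PAGES_THRESHOLD before being re-cached.
    this._refCache.clear();
  }
}
```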
…n `src/core/jpg.js`

Currently some JPEG images are decoded by the built-in PDF.js decoder in `src/core/jpg.js`, while others attempt to use the browser JPEG decoder. This inconsistency seems unfortunate for a number of reasons:

 - It adds, compared to the other image formats supported in the PDF specification, a fair amount of code/complexity to the image handling in the PDF.js library.

 - The PDF specification supports JPEG images with features, e.g. certain ColorSpaces, that browsers are unable to decode natively. Hence, determining if a JPEG image is possible to decode natively in the browser requires a non-trivial amount of parsing. In particular, we're parsing (part of) the raw JPEG data to extract certain marker data and we also need to parse the ColorSpace for the JPEG image.

 - While some JPEG images may, for all intents and purposes, appear to be natively supported, there are still cases where the browser may fail to decode them. In order to support those cases, we've had to implement a fallback to the PDF.js JPEG decoder if there are any issues during the native decoding. This also means that it's no longer possible to simply send the JPEG image to the main-thread and continue parsing; you now need to actually wait for the main-thread to indicate success/failure first.
   In practice this means that there's a code-path where the worker-thread is forced to wait for the main-thread, while the reverse should *always* be the case.

 - The native decoding, for anything except the *simplest* of JPEG images, results in increased peak memory usage because there's a handful of short-lived copies of the JPEG data (see PR 11707).
Furthermore this also leads to data being *parsed* on the main-thread, rather than the worker-thread, which you usually want to avoid for e.g. performance and UI-responsiveness reasons.

 - Not all environments, e.g. Node.js, fully support native JPEG decoding. This has, historically, led to some issues and support requests.

 - Different browsers may use different JPEG decoders, possibly leading to images being rendered slightly differently depending on the platform/browser where the PDF.js library is used.

Originally the implementation in `src/core/jpg.js` was unable to handle all of the JPEG images in the test-suite, but over the last couple of years I've fixed (hopefully) all of those issues.
At this point in time, there are two kinds of failures with this patch:

 - Changes which are basically imperceptible to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder.
   This type of "failure" accounts for the *vast* majority of the total number of changes in the reference tests.

 - Changes where the JPEG images now look *ever so slightly* blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to a general deficiency in the `src/core/jpg.js` implementation, however I've discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually around 200% is enough).
   Basically if you disable [this downscaling in canvas.js](https://github.com/mozilla/pdf.js/blob/8fb82e939cf0c8618a4e775ff17fc96f726872b5/src/display/canvas.js#L2356-L2395), which is what happens when zooming in, the differences simply vanish!
   Hence I'm pretty satisfied that there's no significant problems with the `src/core/jpg.js` implementation, and the problems are rather tied to the general quality of the downscaling algorithm used. It could even be seen as a positive that *all* images now share the same downscaling behaviour, since this actually fixes one old bug; see issue 7041.
With the changes in the previous patch, this is now dead code which should thus be removed.
…ocument` parameters, since it's now unused in the API

With the changes in previous patches, the `disableCreateObjectURL` option/functionality is no longer used for anything in the API and/or in the Worker code.

Note however that there's some functionality, mainly related to file loading/downloading, in the GENERIC version of the default viewer which still depends on this option.
Hence the `disableCreateObjectURL` option (and related compatibility code) is moved into the viewer, see e.g. `web/app_options.js`, such that it's still available in the default viewer.
…task

With the changes made in the previous patch, the `web/app_options.js` file no longer depends on anything *except* files residing in the `web/` folder. Hence the `gulp default_preferences` task can now be further simplified and thus becomes even faster than before; see also PR 11724.
In the PDF file from the issue below, the fill alpha (`ca`) is set
before drawing the circles using the `setGState` operator. Doing so
causes the global alpha to be set on the canvas' context for the canvas
back-end, but this was not handled in the SVG back-end. This patch fixes
that by taking the fill opacity into account when drawing shading
patterns in the same way as done elsewhere so it is only included if the
value is non-default.

Fixes #11812.
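A hedged sketch of what the SVG back-end fix could look like (variable and element names are illustrative):

```
// Sketch: when filling with a shading pattern, honour the fill alpha
// set earlier via the setGState `ca` entry, but only emit the
// attribute when the value differs from the default of 1.
if (this.current.fillAlpha < 1) {
  rect.setAttributeNS(null, "fill-opacity", this.current.fillAlpha);
}
```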
By updating to the new major version of Acorn, we'll get support for newer ECMAScript features as they become available (although some features are currently also blocked by ESLint support and/or SystemJS usage).

Please see https://github.com/acornjs/acorn/releases/tag/7.2.0 for details.
…follow-up)

When converting this file to use standard `import`/`export` statements, I sorted the exports in the same order as the imports to simplify things.

However, looking at the list of `export`ed properties it probably doesn't hurt to add a couple of comments to clarify from where specifically the `export`s originated.
Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is *only* done by name (usually using a format such as e.g. "Im0", "Im1", ...).
However, in some PDF documents the `/XObject` dictionaries may contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table).

With these changes we'll now cache *local* images using both name and (where applicable) reference, thus improving re-usage of images resources even further.

This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file:
```
[
    {  "id": "bug857031",
       "file": "../web/pdfs/bug857031.pdf",
       "md5": "",
       "rounds": 250,
       "lastPage": 1,
       "type": "eq"
    }
]
```

which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, page, stat --
browser | page | stat         | Count | Baseline(ms) | Current(ms) | +/- |    %  | Result(P<.05)
------- | ---- | ------------ | ----- | ------------ | ----------- | --- | ----- | -------------
firefox | 0    | Overall      |   250 |         2749 |        2656 | -93 | -3.38 |        faster
firefox | 0    | Page Request |   250 |            3 |           4 |   1 | 50.14 |        slower
firefox | 0    | Rendering    |   250 |         2746 |        2652 | -94 | -3.44 |        faster
```

While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case.

In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely).

Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given the recent changes in this area, I purposely didn't want to complicate *this* patch more than necessary.
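A simplified sketch of the dual-keyed lookup (the real `LocalImageCache` is in `src/core/image_utils.js`; this is not its exact API):

```
class LocalImageCacheSketch {
  constructor() {
    this._nameMap = new Map(); // e.g. "Im0" -> decoded image data
    this._refMap = new Map(); // e.g. "42R" -> decoded image data
  }

  getByName(name) {
    return this._nameMap.get(name) || null;
  }

  getByRef(ref) {
    return this._refMap.get(ref.toString()) || null;
  }

  set(name, ref, data) {
    this._nameMap.set(name, data);
    if (ref) {
      // Distinct names ("Im0", "Im1", ...) resolving to the same XRef
      // reference now share one cached entry instead of being
      // re-parsed once per name.
      this._refMap.set(ref.toString(), data);
    }
  }
}
```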
…Content` method

It turns out that `getTextContent` suffers from *similar* problems with repeated images as `getOperatorList`; please see the previous patch.

While only `/XObject` resources of the `Form`-type will actually be *parsed* in `PartialEvaluator.getTextContent`, since those are the only ones that may contain text, we're still forced to fetch repeated image resources where the name differs (but not the reference).
Obviously it's less bad in this case, since we're not actually parsing `/XObject`s of e.g. the `Image`-type. However, you still want to avoid even fetching the data whenever possible, since `Stream`s are not cached on the `XRef` instance (given their potential size) and the lookup can thus be somewhat expensive in general.

To address these issues, we can simply replace the existing name-only caching in `PartialEvaluator.getTextContent` with a new cache backed by `LocalImageCache` instead.
…builder/builder.js` file

This functionality has been completely unused ever since PR 9629 (two years ago).
…der/builder.js` file

This functionality has been completely unused ever since PR 9566 (two years ago).
…obalImageCache` (PR 11912 follow-up)

Since *inline* images, i.e. those defined inside of `/Contents` streams, are by their very definition page-specific, it thus seems like a good idea to actually enforce that they won't accidentally end up in the `GlobalImageCache`.
…ination` method (PR 11644 follow-up)

This method has been printing a `deprecated` warning in two releases, hence it should hopefully be safe to remove now.
@havocbcn (Contributor, Author) commented Jun 4, 2020

Sorry, I was trying to squash; I will do this better in another PR.

@havocbcn closed this Jun 4, 2020