Export the "raw" toUnicode-data from `PartialEvaluator.preEvaluateFont` #13354

Snuffleupagus · 2021-05-08T07:50:02Z

Compared to other data-structures, such as e.g. Dicts, we're purposely not caching Streams on the XRef-instance.[1]
The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the same Stream-data requires re-parsing/re-initializing of a bunch of data; see XRef.fetch and related methods.

For the font-parsing in particular we're currently fetching the toUnicode-data, which is very often a Stream, in PartialEvaluator.preEvaluateFont and then again in PartialEvaluator.extractDataStructures soon afterwards.
By instead letting PartialEvaluator.preEvaluateFont export the "raw" toUnicode-data, we can avoid some unnecessary re-parsing/re-initializing when handling fonts.
Please note: In this particular case, given that PartialEvaluator.preEvaluateFont only accesses the "raw" toUnicode data, exporting a Stream should be safe.

[1] The reasons for this include:

Streams, especially DecodeStream-instances, can become very large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so.
Attempting to read from the same Stream-instance more than once won't work, unless it's reset in between, since using any method such as e.g. getBytes always starts at the current data position.
Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different getOperatorList calls. Hence reset-ing a cached Stream-instance isn't going to work in the general case.

pdfjsbot · 2021-05-08T07:53:09Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/b130e0f9786275a/output.txt

pdfjsbot · 2021-05-08T07:53:09Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/92dbee648649bb4/output.txt

pdfjsbot · 2021-05-08T08:19:02Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/b130e0f9786275a/output.txt

Total script time: 25.87 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/b130e0f9786275a/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-05-08T08:22:07Z

From: Bot.io (Windows)

Failed

Full output at http://3.101.106.178:8877/92dbee648649bb4/output.txt

Total script time: 28.97 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/92dbee648649bb4/reftest-analyzer.html#web=eq.log

…uateFont` Rather than re-fetching/re-parsing these properties immediately in `PartialEvaluator.translateFont`, we can simply export them instead. (Obviously the effect will be really tiny, but there is less parsing overall this way.)

…ont` Compared to other data-structures, such as e.g. `Dict`s, we're purposely *not* caching Streams on the `XRef`-instance.[1] The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the *same* Stream-data requires re-parsing/re-initializing of a bunch of data; see `XRef.fetch` and related methods. For the font-parsing in particular we're currently fetching the `toUnicode`-data, which is very often a Stream, in `PartialEvaluator.preEvaluateFont` and then *again* in `PartialEvaluator.extractDataStructures` soon afterwards. By instead letting `PartialEvaluator.preEvaluateFont` export the "raw" `toUnicode`-data, we can avoid *some* unnecessary re-parsing/re-initializing when handling fonts. *Please note:* In this particular case, given that `PartialEvaluator.preEvaluateFont` only accesses the "raw" `toUnicode` data, exporting a Stream should be safe. --- [1] The reasons for this include: - Streams, especially `DecodeStream`-instances, can become *very* large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so. - Attempting to read from the *same* Stream-instance more than once won't work, unless it's `reset` in between, since using any method such as e.g. `getBytes` always starts at the current data position. - Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream-instance isn't going to work in the general case.

pdfjsbot · 2021-05-08T11:27:27Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/24d2852924f5720/output.txt

pdfjsbot · 2021-05-08T11:27:28Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/04ec3280681c6fc/output.txt

pdfjsbot · 2021-05-08T11:53:13Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/24d2852924f5720/output.txt

Total script time: 25.75 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/24d2852924f5720/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-05-08T11:56:17Z

From: Bot.io (Windows)

Failed

Full output at http://3.101.106.178:8877/04ec3280681c6fc/output.txt

Total script time: 28.82 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/04ec3280681c6fc/reftest-analyzer.html#web=eq.log

timvandermeij · 2021-05-08T12:26:29Z

Looks good to me; thanks!

Snuffleupagus force-pushed the preEvaluateFont-toUnicode-export branch 2 times, most recently from 94df095 to 7435cb9 Compare May 8, 2021 09:12

timvandermeij added the core label May 8, 2021

timvandermeij assigned timvandermeij and unassigned timvandermeij May 8, 2021

timvandermeij added font-conversion and removed core labels May 8, 2021

Snuffleupagus added 2 commits May 8, 2021 12:02

Snuffleupagus force-pushed the preEvaluateFont-toUnicode-export branch from 7435cb9 to 6eef69d Compare May 8, 2021 10:04

Snuffleupagus marked this pull request as ready for review May 8, 2021 11:56

timvandermeij approved these changes May 8, 2021

View reviewed changes

timvandermeij merged commit 0ec945c into mozilla:master May 8, 2021

Snuffleupagus deleted the preEvaluateFont-toUnicode-export branch May 8, 2021 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export the "raw" toUnicode-data from `PartialEvaluator.preEvaluateFont` #13354

Export the "raw" toUnicode-data from `PartialEvaluator.preEvaluateFont` #13354

Snuffleupagus commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

pdfjsbot commented May 8, 2021

timvandermeij commented May 8, 2021

Export the "raw" toUnicode-data from PartialEvaluator.preEvaluateFont #13354

Export the "raw" toUnicode-data from PartialEvaluator.preEvaluateFont #13354

Conversation

Snuffleupagus commented May 8, 2021

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)

Failed

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)

Failed

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)

Failed

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)

Failed

timvandermeij commented May 8, 2021

Export the "raw" toUnicode-data from `PartialEvaluator.preEvaluateFont` #13354

Export the "raw" toUnicode-data from `PartialEvaluator.preEvaluateFont` #13354