Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export the "raw" toUnicode-data from PartialEvaluator.preEvaluateFont #13354

Merged

Conversation

Snuffleupagus
Copy link
Collaborator

Compared to other data-structures, such as e.g. Dicts, we're purposely not caching Streams on the XRef-instance.[1]
The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the same Stream-data requires re-parsing/re-initializing of a bunch of data; see XRef.fetch and related methods.

For the font-parsing in particular we're currently fetching the toUnicode-data, which is very often a Stream, in PartialEvaluator.preEvaluateFont and then again in PartialEvaluator.extractDataStructures soon afterwards.
By instead letting PartialEvaluator.preEvaluateFont export the "raw" toUnicode-data, we can avoid some unnecessary re-parsing/re-initializing when handling fonts.
Please note: In this particular case, given that PartialEvaluator.preEvaluateFont only accesses the "raw" toUnicode data, exporting a Stream should be safe.


[1] The reasons for this include:

  • Streams, especially DecodeStream-instances, can become very large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so.

  • Attempting to read from the same Stream-instance more than once won't work, unless it's reset in between, since using any method such as e.g. getBytes always starts at the current data position.

  • Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different getOperatorList calls. Hence reset-ing a cached Stream-instance isn't going to work in the general case.

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/b130e0f9786275a/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/92dbee648649bb4/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/b130e0f9786275a/output.txt

Total script time: 25.87 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/b130e0f9786275a/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)


Failed

Full output at http://3.101.106.178:8877/92dbee648649bb4/output.txt

Total script time: 28.97 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/92dbee648649bb4/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus force-pushed the preEvaluateFont-toUnicode-export branch 2 times, most recently from 94df095 to 7435cb9 Compare May 8, 2021 09:12
…uateFont`

Rather than re-fetching/re-parsing these properties immediately in `PartialEvaluator.translateFont`, we can simply export them instead. (Obviously the effect will be really tiny, but there is less parsing overall this way.)
…ont`

Compared to other data-structures, such as e.g. `Dict`s, we're purposely *not* caching Streams on the `XRef`-instance.[1]
The, somewhat unfortunate, effect of Streams not being cached is that repeatedly getting the *same* Stream-data requires re-parsing/re-initializing of a bunch of data; see `XRef.fetch` and related methods.

For the font-parsing in particular we're currently fetching the `toUnicode`-data, which is very often a Stream, in `PartialEvaluator.preEvaluateFont` and then *again* in `PartialEvaluator.extractDataStructures` soon afterwards.
By instead letting `PartialEvaluator.preEvaluateFont` export the "raw" `toUnicode`-data, we can avoid *some* unnecessary re-parsing/re-initializing when handling fonts.
*Please note:* In this particular case, given that `PartialEvaluator.preEvaluateFont` only accesses the "raw" `toUnicode` data, exporting a Stream should be safe.

---
[1] The reasons for this include:
 - Streams, especially `DecodeStream`-instances, can become *very* large once read. Hence caching them really isn't a good idea simply because of the (potential) memory impact of doing so.

 - Attempting to read from the *same* Stream-instance more than once won't work, unless it's `reset` in between, since using any method such as e.g. `getBytes` always starts at the current data position.

 - Given that parsing, even in the worker-thread, is now fairly asynchronous it's generally impossible to assert that any one Stream-instance isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream-instance isn't going to work in the general case.
@Snuffleupagus Snuffleupagus force-pushed the preEvaluateFont-toUnicode-export branch from 7435cb9 to 6eef69d Compare May 8, 2021 10:04
@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/24d2852924f5720/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/04ec3280681c6fc/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/24d2852924f5720/output.txt

Total script time: 25.75 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/24d2852924f5720/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented May 8, 2021

From: Bot.io (Windows)


Failed

Full output at http://3.101.106.178:8877/04ec3280681c6fc/output.txt

Total script time: 28.82 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://3.101.106.178:8877/04ec3280681c6fc/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus marked this pull request as ready for review May 8, 2021 11:56
@timvandermeij timvandermeij merged commit 0ec945c into mozilla:master May 8, 2021
@timvandermeij
Copy link
Contributor

Looks good to me; thanks!

@Snuffleupagus Snuffleupagus deleted the preEvaluateFont-toUnicode-export branch May 8, 2021 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants