Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[text selection] Add the whitespaces present in the pdf in the text chunk #14703

Merged
merged 1 commit into from
Mar 27, 2022

Conversation

calixteman
Copy link
Contributor

@calixteman calixteman commented Mar 21, 2022

  • it aims to fix issue Text search is not able to identify text, but working with other pdf viewers #14627;
  • the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces.
    But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream
    they weren't in the text chunks because they were too small. Hence we added some exceptions, for example,
    we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj.
    So basically, this patch removes the constraint to have the chars in the same Tj
    (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really
    too small (hence NOT_A_SPACE_FACTOR).

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/09da1adb41bcc23/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/fcc0c624e3999e4/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/09da1adb41bcc23/output.txt

Total script time: 23.68 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 20

Image differences available at: http://54.241.84.105:8877/09da1adb41bcc23/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/fcc0c624e3999e4/output.txt

Total script time: 26.57 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 6

Image differences available at: http://54.193.163.58:8877/fcc0c624e3999e4/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator

There seem to be a regression in issue6605-page1, for all platforms/browsers, since most of the spaces appear to be missing now.

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/68545675500c364/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/b10f5755e6f1605/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/68545675500c364/output.txt

Total script time: 0.72 mins

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/b10f5755e6f1605/output.txt

Total script time: 23.67 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 20

Image differences available at: http://54.241.84.105:8877/b10f5755e6f1605/reftest-analyzer.html#web=eq.log

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d9e6d4c86dda2b0/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1b1326bcb9ad933/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/d9e6d4c86dda2b0/output.txt

Total script time: 23.58 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 17
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/d9e6d4c86dda2b0/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/1b1326bcb9ad933/output.txt

Total script time: 26.27 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 6

Image differences available at: http://54.193.163.58:8877/1b1326bcb9ad933/reftest-analyzer.html#web=eq.log

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/199de5196ccc849/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/eb325a7abec3863/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/eb325a7abec3863/output.txt

Total script time: 23.50 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 19

Image differences available at: http://54.241.84.105:8877/eb325a7abec3863/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/199de5196ccc849/output.txt

Total script time: 27.02 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 5

Image differences available at: http://54.193.163.58:8877/199de5196ccc849/reftest-analyzer.html#web=eq.log

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, with the final comment addressed.

src/core/evaluator.js Outdated Show resolved Hide resolved
…hunk

- it aims to fix issue mozilla#14627;
- the basic idea of the recent text refactoring was to only consider the rendered visible whitespaces.
  But sometimes, the heuristics aren't correct and although some whitespaces are in the text stream
  they weren't in the text chunks because they were too small. Hence we added some exceptions, for example,
  we always add a whitespace when it is between two non-whitespace chars but only when in the same Tj.
  So basically, this patch removes the constraint to have the chars in the same Tj
  (in using a circular buffer to save the two last chars) but don't add a space when the visible space is really
  too small (hence `NOT_A_SPACE_FACTOR`).
@Snuffleupagus
Copy link
Collaborator

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/4930c9191c54fdb/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/430054015e0bfca/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/4930c9191c54fdb/output.txt

Total script time: 23.47 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 19

Image differences available at: http://54.241.84.105:8877/4930c9191c54fdb/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/430054015e0bfca/output.txt

Total script time: 26.74 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 5

Image differences available at: http://54.193.163.58:8877/430054015e0bfca/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus merged commit 0dd6bc9 into mozilla:master Mar 27, 2022
@Snuffleupagus
Copy link
Collaborator

/botio makeref

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/46f6159d0affc6f/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 1

Live output at: http://54.193.163.58:8877/ae0e782307050b0/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/46f6159d0affc6f/output.txt

Total script time: 20.62 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/ae0e782307050b0/output.txt

Total script time: 20.98 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Text search is not able to identify text, but working with other pdf viewers
3 participants