Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) #13261

Merged
merged 1 commit into from
Feb 3, 2022

Conversation

calixteman
Copy link
Contributor

@calixteman calixteman commented Apr 18, 2021

  • get original index in using a dichotomic seach instead of a linear one;
  • normalize the text in using NFD;
  • convert the query string into a RegExp;
  • replace whitespaces in the query with \s+;
  • handle hyphens at eol use to break a word;
  • add some \s* around punctuation signs

@calixteman calixteman marked this pull request as draft April 18, 2021 21:45
@calixteman
Copy link
Contributor Author

The PR is almost ready but highlights are wrong with RTL languages.

test/unit/pdf_find_controller_spec.js Show resolved Hide resolved
web/viewer.html Outdated Show resolved Hide resolved
web/viewer.html Outdated Show resolved Hide resolved
web/viewer.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Show resolved Hide resolved
@calixteman calixteman marked this pull request as ready for review May 1, 2021 17:13
@calixteman
Copy link
Contributor Author

For now it doesn't work with RTL languages and I'll do that in an other patch.

@calixteman calixteman force-pushed the diacritics1 branch 2 times, most recently from 7f7cfdf to daa4e64 Compare May 1, 2021 17:32
web/text_layer_builder.css Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
web/pdf_find_controller.js Outdated Show resolved Hide resolved
@calixteman calixteman force-pushed the diacritics1 branch 2 times, most recently from 2142145 to c22368f Compare May 13, 2021 16:08
@Snuffleupagus
Copy link
Collaborator

  • remove pdf_find_utils.js.

You probably want to remove this from the commit message (and PR description) now :-)


Given that this implementation uses a lot more, and more complex, regular expressions during both initial text-parsing and subsequent searching: What sort of performance impact, if any, does this patch have for larger and/or more complex documents?

For example, what about e.g. the pdf.pdf document (in the test-suite) or perhaps kjv.pdf (which is even longer)?

@calixteman
Copy link
Contributor Author

Good questions.
I used some perfomance.now calls at the beginning and at the end of functions normalize and _calculateRegExpMatch and I used file kjv.pdf (and set privacy.reduceTimerPrecision to false):
Normalization (ms): 69, 63, 78, 60,
Search for "a" (ms): 316, 296, 306, 253

Normalization (ms): 71, 63, 77, 77
Search for "and thou wast not spoiled;" (ms): 32, 32, 36, 33

And the same in master with normalize and _calculatePhraseMatch:
Normalization (ms): 25, 25, 22, 20,
Search for "a" (ms): 41, 41, 40, 40

Normalization (ms): 29, 29, 27, 20
Search for "and thou wast not spoiled;" (ms): 7, 7, 7, 7

From a user pov, I don't see any differences with both searches and so I think that this perf regression is acceptable.

I added some code to remove the diacritics stuff in the query regexp when there are no diacritics on the page (which is the case in kjv.pdf) and there are no significant difference in time for both searches, so in the search part we pay for the use of regexp instead of indexOf and not so much because of diacritics.
One big advantage of regexp over indexOf is that it's very simple to search for foo\s+bar (because the pdf can contain multiple spaces between foo and bar) and we don't have to re-normalize pageContent when matchDiacritics is turned on/off or to change pageContent case when caseSensitive is turned on/off.

And as usual, I'm open to any good idea to improve this.

@calixteman
Copy link
Contributor Author

@timvandermeij, @Snuffleupagus do you have any objections for landing that stuff ? or any idea to improve perf or whatever ?

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented May 22, 2021

do you have any objections for landing that stuff ?

I've been a little bit short on time to really review this properly, and have only looked briefly at the implementation, so it'd probably be a very good idea to actually do a "full" review before landing it since this is a significant change to the find implementation.

@timvandermeij
Copy link
Contributor

Same here. I think it would be good to, aside from the full review, wait until #13418 is done before merging since it's quite a change to the implementation and it would be good to get the release out first to avoid any risks. (We usually do this for significant changes, such as the text layer and struct tree PRs.)

@lixaotec

This comment has been minimized.

@timvandermeij timvandermeij removed their request for review June 13, 2021 13:14
@calixteman calixteman force-pushed the diacritics1 branch 2 times, most recently from a20b623 to d92e0b2 Compare October 3, 2021 15:09
@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/f79b098ad1ade97/output.txt

Total script time: 4.67 mins

Published

@Snuffleupagus
Copy link
Collaborator

/botio unittest

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/174cda3f5d3abe1/output.txt

@Snuffleupagus Snuffleupagus changed the title Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) [api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) Feb 3, 2022
@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt

Total script time: 3.15 mins

  • Unit Tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/174cda3f5d3abe1/output.txt

Total script time: 6.12 mins

  • Unit Tests: Passed

@Snuffleupagus
Copy link
Collaborator

/botio integrationtest

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt

Total script time: 4.01 mins

  • Integration Tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented Feb 3, 2022

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt

Total script time: 6.58 mins

  • Integration Tests: Passed

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For your information, we'd like to have this feature in the next nightly cycle (the soft freeze is next week) and we'll ask to QA to check that everything is fine.

OK, let's try landing this and see how it goes :-)
I've tried to do some manual testing, but that's obviously quite limited and given the scope/size of these changes we obviously need more widespread testing.

If the quality is not good enough we'll backout and it I'll work on it to improve what is needed.

Unless there's really big problems, it might be easier to fix in place (and uplift patches) as necessary since trying to back-out a PR this size could very quickly become difficult.

web/pdf_find_controller.js Outdated Show resolved Hide resolved
…ug 1651113)

  - get original index in using a dichotomic seach instead of a linear one;
  - normalize the text in using NFD;
  - convert the query string into a RegExp;
  - replace whitespaces in the query with \s+;
  - handle hyphens at eol use to break a word;
  - add some \s* around punctuation signs
@calixteman calixteman merged commit 8281e64 into mozilla:master Feb 3, 2022
Snuffleupagus added a commit to Snuffleupagus/pdf.js that referenced this pull request Feb 5, 2022
Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seem like a good idea to ensure that `PDFFindBar` in consistent with that.
Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format which means that currently the default viewer findbar looks inconsistent.

*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:
> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.
quaoaris pushed a commit to quaoaris/pdf.js that referenced this pull request Feb 8, 2022
Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seem like a good idea to ensure that `PDFFindBar` in consistent with that.
Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format which means that currently the default viewer findbar looks inconsistent.

*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:
> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.
@marco-c marco-c added this to Closed in PDF.js quality Feb 24, 2022
bh213 pushed a commit to bh213/pdf.js that referenced this pull request Jun 3, 2022
Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seem like a good idea to ensure that `PDFFindBar` in consistent with that.
Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format which means that currently the default viewer findbar looks inconsistent.

*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:
> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.
rousek pushed a commit to signosoft/pdf.js that referenced this pull request Aug 10, 2022
Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seem like a good idea to ensure that `PDFFindBar` in consistent with that.
Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format which means that currently the default viewer findbar looks inconsistent.

*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:
> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Closed
6 participants