[api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) #13261

calixteman · 2021-04-18T21:42:59Z

get the original index using a dichotomic (binary) search instead of a linear one;
normalize the text using NFD;
convert the query string into a RegExp;
replace whitespace in the query with \s+;
handle hyphens at end of line that are used to break a word;
add some \s* around punctuation signs.
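
The first two items can be illustrated with a minimal sketch. It is not the code from this PR: `normalizeWithDiffs` and `getOriginalIndex` are hypothetical names, and the sketch assumes each character decomposes into one base character plus combining marks (scripts whose NFD form expands into several base characters would need extra handling). The text is decomposed with NFD, combining marks are dropped, and a sparse, sorted list of index shifts is recorded so that a match position in the normalized text can be mapped back to the original text with a binary search.

```javascript
// Illustrative only: hypothetical helpers, not the PDF.js implementation.

// Decompose with NFD, drop combining marks, and record [normalizedIndex, shift]
// pairs each time the offset between original and normalized text changes.
function normalizeWithDiffs(text) {
  let normalized = "";
  const diffs = []; // sorted by normalizedIndex
  let shift = 0; // originalIndex - normalizedIndex
  let origIndex = 0;
  for (const ch of text) {
    // Keep only the non-mark part of the decomposed character.
    normalized += ch.normalize("NFD").replace(/\p{M}/gu, "");
    origIndex += ch.length;
    const newShift = origIndex - normalized.length;
    if (newShift !== shift) {
      shift = newShift;
      diffs.push([normalized.length, shift]);
    }
  }
  return { normalized, diffs };
}

// Binary search for the last recorded shift at or before `index`.
function getOriginalIndex(diffs, index) {
  let lo = 0;
  let hi = diffs.length - 1;
  let shift = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (diffs[mid][0] <= index) {
      shift = diffs[mid][1];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return index + shift;
}

// Usage: search the normalized text, then map the hit back to the original.
const page = normalizeWithDiffs("cafe\u0301 re\u0301sume\u0301");
const hit = page.normalized.indexOf("resume");
console.log(getOriginalIndex(page.diffs, hit));
```

Keeping only the positions where the shift changes keeps the list short on pages with few diacritics, which is what makes the binary search worthwhile compared with a linear scan.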

calixteman · 2021-04-18T21:46:24Z

The PR is almost ready but highlights are wrong with RTL languages.

test/unit/pdf_find_controller_spec.js

web/viewer.html

web/viewer.js

web/pdf_find_controller.js

calixteman · 2021-05-01T17:14:36Z

For now it doesn't work with RTL languages; I'll do that in another patch.

web/text_layer_builder.css

web/pdf_find_controller.js

Snuffleupagus · 2021-05-13T16:37:48Z

> remove pdf_find_utils.js.

You probably want to remove this from the commit message (and PR description) now :-)

Given that this implementation uses a lot more, and more complex, regular expressions during both initial text-parsing and subsequent searching: What sort of performance impact, if any, does this patch have for larger and/or more complex documents?

For example, what about e.g. the pdf.pdf document (in the test-suite) or perhaps kjv.pdf (which is even longer)?

calixteman · 2021-05-13T18:09:51Z

Good questions.
I used some performance.now calls at the beginning and at the end of the functions normalize and _calculateRegExpMatch, and I used the file kjv.pdf (with privacy.reduceTimerPrecision set to false):
Normalization (ms): 69, 63, 78, 60
Search for "a" (ms): 316, 296, 306, 253

Normalization (ms): 71, 63, 77, 77
Search for "and thou wast not spoiled;" (ms): 32, 32, 36, 33

And the same in master with normalize and _calculatePhraseMatch:
Normalization (ms): 25, 25, 22, 20
Search for "a" (ms): 41, 41, 40, 40

Normalization (ms): 29, 29, 27, 20
Search for "and thou wast not spoiled;" (ms): 7, 7, 7, 7
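
For readers who want to take similar measurements, here is a minimal sketch of a timing wrapper built on `performance.now()`. The helper name `timeRuns` and the sample workload are assumptions for illustration; the numbers above were obtained by instrumenting the viewer functions directly, not with this helper.

```javascript
// Illustrative only: `timeRuns` is a hypothetical helper.
// Run `fn` several times and return the elapsed time of each run in ms.
function timeRuns(fn, runs = 4) {
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    timings.push(performance.now() - start);
  }
  return timings;
}

// Usage: compare a plain indexOf scan with a global RegExp scan on the same text.
const sampleText = "and thou wast not spoiled; ".repeat(20000);
console.log("indexOf (ms):", timeRuns(() => sampleText.indexOf("zzz")));
console.log("RegExp (ms):", timeRuns(() => [...sampleText.matchAll(/a/g)].length));
```

In Firefox, timer precision is reduced by default, which is why the comment above mentions turning `privacy.reduceTimerPrecision` off before measuring.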

From a user's point of view, I don't see any difference between the two implementations for either search, so I think this perf regression is acceptable.

I added some code to remove the diacritics handling from the query regexp when there are no diacritics on the page (which is the case in kjv.pdf), and there is no significant difference in time for either search. So in the search part we pay for the use of regexp instead of indexOf, and not so much for the diacritics.
One big advantage of regexp over indexOf is that it's very simple to search for foo\s+bar (because the pdf can contain multiple spaces between foo and bar) and we don't have to re-normalize pageContent when matchDiacritics is turned on/off or to change pageContent case when caseSensitive is turned on/off.
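
A minimal sketch of the whitespace point, assuming a hypothetical helper `queryToRegExp` (it is not the function used in this PR): the query is escaped, each run of whitespace becomes `\s+`, and case sensitivity is handled with the `i` flag rather than by lower-casing the page content.

```javascript
// Illustrative only: `queryToRegExp` is a hypothetical helper.
function queryToRegExp(query, { caseSensitive = false } = {}) {
  const pattern = query
    .trim()
    // Escape characters that are special inside a regular expression.
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    // Let any run of whitespace in the query match any run in the text.
    .replace(/\s+/g, "\\s+");
  return new RegExp(pattern, caseSensitive ? "g" : "gi");
}

// Usage: the same page text is searched with both settings, unchanged.
const pageText = "Foo  bar and foo\nbar";
const anyCase = [...pageText.matchAll(queryToRegExp("foo bar"))];
const exactCase = [
  ...pageText.matchAll(queryToRegExp("foo bar", { caseSensitive: true })),
];
console.log(anyCase.map(m => m.index), exactCase.map(m => m.index));
```

With `indexOf`, the same behaviour would need a whitespace-collapsed and possibly lower-cased copy of the page text for each combination of options.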

And as usual, I'm open to any good idea to improve this.

calixteman · 2021-05-22T13:22:52Z

@timvandermeij, @Snuffleupagus do you have any objections for landing that stuff ? or any idea to improve perf or whatever ?

Snuffleupagus · 2021-05-22T13:26:39Z

> do you have any objections for landing that stuff ?

I've been a little bit short on time to really review this properly, and have only looked briefly at the implementation, so it'd probably be a very good idea to actually do a "full" review before landing it since this is a significant change to the find implementation.

timvandermeij · 2021-05-22T16:56:38Z

Same here. I think it would be good to, aside from the full review, wait until #13418 is done before merging since it's quite a change to the implementation and it would be good to get the release out first to avoid any risks. (We usually do this for significant changes, such as the text layer and struct tree PRs.)

pdfjsbot · 2022-02-03T13:55:00Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/f79b098ad1ade97/output.txt

Total script time: 4.67 mins

Published

Viewer: http://54.241.84.105:8877/f79b098ad1ade97/web/viewer.html
Viewer (legacy): http://54.241.84.105:8877/f79b098ad1ade97/legacy/web/viewer.html

Snuffleupagus · 2022-02-03T14:06:39Z

/botio unittest

pdfjsbot · 2022-02-03T14:06:40Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt

pdfjsbot · 2022-02-03T14:06:40Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/174cda3f5d3abe1/output.txt

pdfjsbot · 2022-02-03T14:09:50Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt

Total script time: 3.15 mins

Unit Tests: Passed

pdfjsbot · 2022-02-03T14:12:48Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/174cda3f5d3abe1/output.txt

Total script time: 6.12 mins

Unit Tests: Passed

Snuffleupagus · 2022-02-03T14:13:33Z

/botio integrationtest

pdfjsbot · 2022-02-03T14:13:34Z

From: Bot.io (Windows)

Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt

pdfjsbot · 2022-02-03T14:13:34Z

From: Bot.io (Linux m4)

Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt

pdfjsbot · 2022-02-03T14:17:36Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt

Total script time: 4.01 mins

Integration Tests: Passed

pdfjsbot · 2022-02-03T14:20:10Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt

Total script time: 6.58 mins

Integration Tests: Passed

Snuffleupagus

> For your information, we'd like to have this feature in the next nightly cycle (the soft freeze is next week) and we'll ask QA to check that everything is fine.

OK, let's try landing this and see how it goes :-)
I've tried to do some manual testing, but that's obviously quite limited and given the scope/size of these changes we obviously need more widespread testing.

> If the quality is not good enough we'll back out and I'll work on it to improve what is needed.

Unless there are really big problems, it might be easier to fix in place (and uplift patches) as necessary, since trying to back out a PR this size could very quickly become difficult.

web/pdf_find_controller.js

Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seems like a good idea to ensure that `PDFFindBar` is consistent with that. Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format, which means that currently the default viewer findbar looks inconsistent.

*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:

> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.

calixteman marked this pull request as draft April 18, 2021 21:45

calixteman force-pushed the diacritics1 branch from efb187d to decea62 Compare April 18, 2021 21:50

Snuffleupagus requested changes Apr 19, 2021

timvandermeij added the text-selection label Apr 20, 2021

calixteman force-pushed the diacritics1 branch from decea62 to 617e919 Compare May 1, 2021 17:11

calixteman marked this pull request as ready for review May 1, 2021 17:13

calixteman requested a review from timvandermeij May 1, 2021 17:14

calixteman force-pushed the diacritics1 branch 2 times, most recently from 7f7cfdf to daa4e64 Compare May 1, 2021 17:32

Snuffleupagus requested changes May 1, 2021

calixteman force-pushed the diacritics1 branch from daa4e64 to f20a6f0 Compare May 1, 2021 17:39

calixteman mentioned this pull request May 2, 2021

Search matches characters across new lines #2806

Closed

calixteman force-pushed the diacritics1 branch from f20a6f0 to 73ad20e Compare May 2, 2021 10:39

calixteman mentioned this pull request May 2, 2021

String.indexOf() cannot match phrases with variable whitespace #7355

Closed

Snuffleupagus requested changes May 3, 2021

calixteman force-pushed the diacritics1 branch 2 times, most recently from 2142145 to c22368f Compare May 13, 2021 16:08

calixteman force-pushed the diacritics1 branch from c22368f to 37c5404 Compare May 13, 2021 18:13

calixteman mentioned this pull request May 14, 2021

Can't search across lines in some PDFs (like the demo PDF) #4742

Closed

timvandermeij removed their request for review June 13, 2021 13:14

calixteman force-pushed the diacritics1 branch 2 times, most recently from a20b623 to d92e0b2 Compare October 3, 2021 15:09

Snuffleupagus changed the title ~~Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113)~~ [api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) Feb 3, 2022

Snuffleupagus approved these changes Feb 3, 2022

calixteman force-pushed the diacritics1 branch from 774b053 to 1f41028 Compare February 3, 2022 14:43

calixteman merged commit 8281e64 into mozilla:master Feb 3, 2022

Snuffleupagus added the viewer label Feb 3, 2022

Snuffleupagus mentioned this pull request Feb 3, 2022

Avoid the findResultsCount span taking up (vertical) space when hidden (PR 13261 follow-up) #14530

Merged

Snuffleupagus mentioned this pull request Feb 5, 2022

[GENERIC viewer] Use consistent casing, for the labels, in the findbar #14535

Merged

marco-c added this to Closed in PDF.js quality Feb 24, 2022

Snuffleupagus mentioned this pull request Feb 27, 2022

Time for a new release? #14613

Closed

hrynko mentioned this pull request Jul 17, 2022

[Feature] Advanced Highlighting hrynko/vue-pdf-embed#61

Open
