Improve fuzzy quote matching #2779

robertknight · 2020-12-01T13:07:20Z

~~Depends on #2814~~

This PR replaces the quote selector anchoring algorithm with a new one added in #2087. See notes from the original draft below for the rationale.

Changes in detail:

Change TextQuoteAnchor to use matchQuote rather than dom-anchor-text-quote to find matches in the document text
Change TextQuoteAnchor.toSelector to directly generate a selector rather than delegating to dom-anchor-text-quote
Improve the tests for TextQuoteAnchor. In particular, make sure to check the properties of objects returned by the various methods instead of just their type.
Replace unnecessary dom-anchor-text-quote usage in tests for anchoring/pdf.js.

One choice I want to note regarding the tests is that they mock matchQuote but not TextRange (used for text position <-> Range conversion). I think this is a good balance of making it easy to test the behaviors of TextQuoteAnchor without coupling the tests too much to implementation details.

Notes from the original draft PR:

On pages where there are many annotations and significant edits since those annotations were created, anchoring quote selectors can be really slow and lock up the browser for several seconds. See http://www.americanyawp.com/text/01-the-new-world/ for example and Anchoring is very slow on certain documents/URLs #189 (this issue also contains analysis of the problem). The new implementation has similar performance when all the quotes match exactly, but is much faster in the worst case.
The previous implementation has quite a low tolerance for mismatches between the current and originally anchored text before it gives up and considers an annotation to be "orphaned". The new implementation can tolerate more mismatches, allowing it to anchor quotes that previously orphaned.
It is difficult to describe how exactly the current "fuzzy" matching works and how fuzzy/approximate it allows matches to be. This means we can't easily explain to users/internally about why an incorrect match happened or fuzzy anchoring failed when we should have found a match. Finding the best match for an annotation in an edited document in all cases would require general intelligence. In lieu of that, I think the best approach is for us to try and have a straightforward algorithm that works well in the majority of cases and where we can easily describe how it works, what it's limitations are and tweak it as necessary.

As an illustration of the performance improvements, here is a comparison of the amount of time spend anchoring annotations in the Public group on http://www.americanyawp.com/text/01-the-new-world/.

With the previous implementation:

With the new implementation:

On this page, the old algorithm produced 329 matches and 137 orphans. The new implementation finds 403 matches and 63 orphans. Here is an example of an annotation that anchors with the new implementation but did not under the old one:

Note there are significant changes in the text, but the current text is close enough to the text that was originally annotated to consider it a match.

Note: Not all of the 403 matches are good. In particular both the old and new implementations are prone to finding incorrect matches for short terms (eg. for "mastodons" and "megafauna" in this document). The new implementation should make it easier to rectify that later by changing this logic though.

The technical changes in this PR are:

Implement a new fuzzy/approximate quote selecting matching algorithm in the matchQuote function in src/annotator/anchoring/match-quote.js
Change the TextQuoteAnchor class to use matchQuote to anchor text quotes, and TextRange to generate quote selectors

For details on the fuzzy string matching algorithm itself, see the approx-string-match README. The heart of the previous implementation came from the diff-match-patch library, which was used indirectly via dom-anchor-text-quote. The TL;DR is that this library uses a more recent algorithm which can tolerate a larger number of errors with reasonable efficiency. It is also easier to program with because it accepts arbitrarily long patterns, whereas diff-match-patch only allows patterns up to 32 characters and so dom-anchor-text-quote had to work around that.

lyzadanger · 2020-12-01T13:24:29Z

It is difficult to describe how exactly the current "fuzzy" matching works and how fuzzy/approximate it allows matches to be. This means we can't easily explain to users/internally about why an incorrect match happened or fuzzy anchoring failed when we should have found a match. Finding the best match for an annotation in an edited document in all cases would require general intelligence.

I find this the most compelling criterion—thanks for expressing it this way!

judell · 2020-12-01T16:34:48Z

This is a huge deal, bravo!

both the old and new implementations are prone to finding incorrect matches for short terms

Was going to ask about that. I wonder if the matcher could be tuned to be less tolerant when considering shorter terms?

I also wonder, as I have before, about telemetry that would report anchoring outcomes, and would enable us to study and tune the matching.

codecov · 2020-12-07T14:22:04Z

Codecov Report

Merging #2779 (be34723) into master (d2e9f19) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2779   +/-   ##
=======================================
  Coverage   97.77%   97.77%           
=======================================
  Files         204      204           
  Lines        7764     7766    +2     
  Branches     1718     1717    -1     
=======================================
+ Hits         7591     7593    +2     
  Misses        173      173

Impacted Files	Coverage Δ
src/annotator/anchoring/types.js	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2e9f19...be34723. Read the comment docs.

Use the new matching algorithm for anchoring text quote selectors. This is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - Use the `matchQuote` function to do find the best match for the quote in the text, replacing the `dom-anchor-text-quote` library. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new implementation also provides more control over the degree of mismatch between quote selector and document text that is allowed. The current settings provide higher recall (larger proportion of "correct" approximate matches found) than the previous implementation. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. Finally the new library is also smaller. The minified `annotator.bundle.js` size is reduced by 15% (25KB). - Change `TextQuoteAnchor.fromSelector(...)` to generate the selector directly rather than delegating to `dom-anchor-text-quote`. This gives us more control over how quote selectors are generated and more easily change factors such as the amount of context included.

Check the properties of the `TextQuoteAnchor` instance have expected values.

For consistency, and because it is useful/straightforward to do, all of the `TextQuoteAnchor` tests now mock `matchQuote` but not `TextRange`, except for one integration test that is labeled as such.

robertknight · 2020-12-11T15:22:27Z

src/annotator/anchoring/types.js

+    // This is a natural unit of meaning which enables displaying quotes in
+    // context even when the document is not available. We could use `Intl.Segmenter`
+    // for this when available.
+    const contextLen = 32;


The number 32 was chosen to match the behavior of dom-anchor-text-quote.

LMS007

Looks good!

robertknight · 2020-12-11T19:33:21Z

Thank-you!

robertknight force-pushed the faster-quote-anchoring branch from c00ae5f to f2f29f3 Compare December 7, 2020 14:18

robertknight force-pushed the faster-quote-anchoring branch from f2f29f3 to 07f58db Compare December 8, 2020 08:24

This was referenced Dec 9, 2020

Remove unnecessary requestAnimationFrame callbacks in Guest #2806

Merged

Anchoring issues at Hybrid Pedagogy hypothesis/support-legacy#164

Open

robertknight force-pushed the faster-quote-anchoring branch 2 times, most recently from aaecb3b to 10c016d Compare December 10, 2020 15:56

robertknight mentioned this pull request Dec 10, 2020

Add new fuzzy quote matching implementation #2814

Merged

robertknight changed the base branch from master to match-quote December 10, 2020 16:00

robertknight force-pushed the match-quote branch from 0920964 to 500e9bc Compare December 11, 2020 08:43

robertknight force-pushed the faster-quote-anchoring branch from 10c016d to df63575 Compare December 11, 2020 08:45

Base automatically changed from match-quote to master December 11, 2020 08:52

robertknight added 4 commits December 11, 2020 08:52

Remove dom-anchor-text-quote dependency in pdf-test.js

a20df40

Remove unused dom-anchor-text-quote dependency

bea1ba9

Improve TextQuoteAnchor.fromRange test

8dd3caf

Check the properties of the `TextQuoteAnchor` instance have expected values.

robertknight force-pushed the faster-quote-anchoring branch from df63575 to 8dd3caf Compare December 11, 2020 08:53

robertknight added 2 commits December 11, 2020 12:25

Improve TextQuoteAnchor JSDoc and remove obsolete lint suppression

2f36067

Improve TextQuoteAnchor tests

be34723

For consistency, and because it is useful/straightforward to do, all of the `TextQuoteAnchor` tests now mock `matchQuote` but not `TextRange`, except for one integration test that is labeled as such.

robertknight commented Dec 11, 2020

View reviewed changes

robertknight marked this pull request as ready for review December 11, 2020 15:22

robertknight requested a review from LMS007 December 11, 2020 15:22

LMS007 approved these changes Dec 11, 2020

View reviewed changes

robertknight merged commit f7e9ced into master Dec 11, 2020

robertknight deleted the faster-quote-anchoring branch December 11, 2020 19:33

Treora mentioned this pull request Dec 24, 2020

Fuzzy text quote matching apache/incubator-annotator#83

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve fuzzy quote matching #2779

Improve fuzzy quote matching #2779

robertknight commented Dec 1, 2020 •

edited

Loading

lyzadanger commented Dec 1, 2020

judell commented Dec 1, 2020

codecov bot commented Dec 7, 2020 •

edited

Loading

robertknight Dec 11, 2020

LMS007 left a comment

robertknight commented Dec 11, 2020

Improve fuzzy quote matching #2779

Improve fuzzy quote matching #2779

Conversation

robertknight commented Dec 1, 2020 • edited Loading

lyzadanger commented Dec 1, 2020

judell commented Dec 1, 2020

codecov bot commented Dec 7, 2020 • edited Loading

Codecov Report

robertknight Dec 11, 2020

Choose a reason for hiding this comment

LMS007 left a comment

Choose a reason for hiding this comment

robertknight commented Dec 11, 2020

robertknight commented Dec 1, 2020 •

edited

Loading

codecov bot commented Dec 7, 2020 •

edited

Loading