Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve fuzzy quote matching #2779

Merged
merged 6 commits into from
Dec 11, 2020
Merged

Improve fuzzy quote matching #2779

merged 6 commits into from
Dec 11, 2020

Conversation

robertknight
Copy link
Member

@robertknight robertknight commented Dec 1, 2020

Depends on #2814

This PR replaces the quote selector anchoring algorithm with a new one added in #2087. See notes from the original draft below for the rationale.

Changes in detail:

  • Change TextQuoteAnchor to use matchQuote rather than dom-anchor-text-quote to find matches in the document text
  • Change TextQuoteAnchor.toSelector to directly generate a selector rather than delegating to dom-anchor-text-quote
  • Improve the tests for TextQuoteAnchor. In particular, make sure to check the properties of objects returned by the various methods instead of just their type.
  • Replace unnecessary dom-anchor-text-quote usage in tests for anchoring/pdf.js.

One choice I want to note regarding the tests is that they mock matchQuote but not TextRange (used for text position <-> Range conversion). I think this is a good balance of making it easy to test the behaviors of TextQuoteAnchor without coupling the tests too much to implementation details.


Notes from the original draft PR:

  1. On pages where there are many annotations and significant edits since those annotations were created, anchoring quote selectors can be really slow and lock up the browser for several seconds. See http://www.americanyawp.com/text/01-the-new-world/ for example and Anchoring is very slow on certain documents/URLs #189 (this issue also contains analysis of the problem). The new implementation has similar performance when all the quotes match exactly, but is much faster in the worst case.
  2. The previous implementation has quite a low tolerance for mismatches between the current and originally anchored text before it gives up and considers an annotation to be "orphaned". The new implementation can tolerate more mismatches, allowing it to anchor quotes that previously orphaned.
  3. It is difficult to describe how exactly the current "fuzzy" matching works and how fuzzy/approximate it allows matches to be. This means we can't easily explain to users/internally about why an incorrect match happened or fuzzy anchoring failed when we should have found a match. Finding the best match for an annotation in an edited document in all cases would require general intelligence. In lieu of that, I think the best approach is for us to try and have a straightforward algorithm that works well in the majority of cases and where we can easily describe how it works, what it's limitations are and tweak it as necessary.

As an illustration of the performance improvements, here is a comparison of the amount of time spend anchoring annotations in the Public group on http://www.americanyawp.com/text/01-the-new-world/.

With the previous implementation:

DATQ quote matcher

With the new implementation:

Myers quote matcher

On this page, the old algorithm produced 329 matches and 137 orphans. The new implementation finds 403 matches and 63 orphans. Here is an example of an annotation that anchors with the new implementation but did not under the old one:

fuzzy-match

Note there are significant changes in the text, but the current text is close enough to the text that was originally annotated to consider it a match.

Note: Not all of the 403 matches are good. In particular both the old and new implementations are prone to finding incorrect matches for short terms (eg. for "mastodons" and "megafauna" in this document). The new implementation should make it easier to rectify that later by changing this logic though.


The technical changes in this PR are:

  • Implement a new fuzzy/approximate quote selecting matching algorithm in the matchQuote function in src/annotator/anchoring/match-quote.js
  • Change the TextQuoteAnchor class to use matchQuote to anchor text quotes, and TextRange to generate quote selectors

For details on the fuzzy string matching algorithm itself, see the approx-string-match README. The heart of the previous implementation came from the diff-match-patch library, which was used indirectly via dom-anchor-text-quote. The TL;DR is that this library uses a more recent algorithm which can tolerate a larger number of errors with reasonable efficiency. It is also easier to program with because it accepts arbitrarily long patterns, whereas diff-match-patch only allows patterns up to 32 characters and so dom-anchor-text-quote had to work around that.

@lyzadanger
Copy link
Contributor

It is difficult to describe how exactly the current "fuzzy" matching works and how fuzzy/approximate it allows matches to be. This means we can't easily explain to users/internally about why an incorrect match happened or fuzzy anchoring failed when we should have found a match. Finding the best match for an annotation in an edited document in all cases would require general intelligence.

I find this the most compelling criterion—thanks for expressing it this way!

@judell
Copy link
Contributor

judell commented Dec 1, 2020

This is a huge deal, bravo!

both the old and new implementations are prone to finding incorrect matches for short terms

Was going to ask about that. I wonder if the matcher could be tuned to be less tolerant when considering shorter terms?

I also wonder, as I have before, about telemetry that would report anchoring outcomes, and would enable us to study and tune the matching.

@codecov
Copy link

codecov bot commented Dec 7, 2020

Codecov Report

Merging #2779 (be34723) into master (d2e9f19) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2779   +/-   ##
=======================================
  Coverage   97.77%   97.77%           
=======================================
  Files         204      204           
  Lines        7764     7766    +2     
  Branches     1718     1717    -1     
=======================================
+ Hits         7591     7593    +2     
  Misses        173      173           
Impacted Files Coverage Δ
src/annotator/anchoring/types.js 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2e9f19...be34723. Read the comment docs.

Use the new matching algorithm for anchoring text quote selectors. This
is faster than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - Use the `matchQuote` function to do find the best match for the quote
   in the text, replacing the `dom-anchor-text-quote` library.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new implementation also provides more control over the degree of
   mismatch between quote selector and document text that is allowed.
   The current settings provide higher recall (larger proportion of
   "correct" approximate matches found) than the previous
   implementation. On http://www.americanyawp.com/text/01-the-new-world/
   for example the number of orphans dropped from 137 to 63.

   Finally the new library is also smaller. The minified `annotator.bundle.js`
   size is reduced by 15% (25KB).

 - Change `TextQuoteAnchor.fromSelector(...)` to generate the selector
   directly rather than delegating to `dom-anchor-text-quote`. This
   gives us more control over how quote selectors are generated and more
   easily change factors such as the amount of context included.
Check the properties of the `TextQuoteAnchor` instance have expected
values.
For consistency, and because it is useful/straightforward to do, all of the
`TextQuoteAnchor` tests now mock `matchQuote` but not `TextRange`,
except for one integration test that is labeled as such.
// This is a natural unit of meaning which enables displaying quotes in
// context even when the document is not available. We could use `Intl.Segmenter`
// for this when available.
const contextLen = 32;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number 32 was chosen to match the behavior of dom-anchor-text-quote.

@robertknight robertknight marked this pull request as ready for review December 11, 2020 15:22
Copy link
Contributor

@LMS007 LMS007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@robertknight
Copy link
Member Author

Thank-you!

@robertknight robertknight merged commit f7e9ced into master Dec 11, 2020
@robertknight robertknight deleted the faster-quote-anchoring branch December 11, 2020 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants