Improve fuzzy quote matching #2779
I find this the most compelling criterion. Thanks for expressing it this way!
This is a huge deal, bravo!
Was going to ask about that. I wonder if the matcher could be tuned to be less tolerant when considering shorter terms? I also wonder, as I have before, about telemetry that would report anchoring outcomes, and would enable us to study and tune the matching.
Codecov Report
@@ Coverage Diff @@
## master #2779 +/- ##
=======================================
Coverage 97.77% 97.77%
=======================================
Files 204 204
Lines 7764 7766 +2
Branches 1718 1717 -1
=======================================
+ Hits 7591 7593 +2
Misses 173 173
Use the new matching algorithm for anchoring text quote selectors. This is faster than the existing one when many quote selectors fail to match exactly, and it gives us more insight into and control over the fuzzy matching process.

- Use the `matchQuote` function to find the best match for the quote in the text, replacing the `dom-anchor-text-quote` library.

  This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where the content has changed significantly. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/, for example, the client spends a total of ~2.4 seconds running JS between starting the client and anchoring completing, compared to ~11 seconds with the previous implementation.

  The new implementation also provides more control over the degree of mismatch that is allowed between the quote selector and the document text. The current settings provide higher recall (a larger proportion of "correct" approximate matches found) than the previous implementation. On http://www.americanyawp.com/text/01-the-new-world/, for example, the number of orphans dropped from 137 to 63.

  Finally, the new library is also smaller. The minified `annotator.bundle.js` size is reduced by 15% (25KB).

- Change `TextQuoteAnchor.fromSelector(...)` to generate the selector directly rather than delegating to `dom-anchor-text-quote`. This gives us more control over how quote selectors are generated and makes it easier to change factors such as the amount of context included.
Check the properties of the `TextQuoteAnchor` instance have expected values.
For consistency, and because it is useful/straightforward to do, all of the `TextQuoteAnchor` tests now mock `matchQuote` but not `TextRange`, except for one integration test that is labeled as such.
// This is a natural unit of meaning which enables displaying quotes in
// context even when the document is not available. We could use `Intl.Segmenter`
// for this when available.
const contextLen = 32;
The number 32 was chosen to match the behavior of `dom-anchor-text-quote`.
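To illustrate what the context length controls, here is a minimal sketch of building a quote selector from character offsets in the document text. The helper name `quoteSelectorFromOffsets` and its signature are hypothetical and are not taken from this PR; only the `exact`/`prefix`/`suffix` shape of a text quote selector and the value 32 come from the discussion above.

```javascript
// Hypothetical helper for illustration only; not the client's actual code.
// Number of characters of context captured on each side of the quote.
const CONTEXT_LEN = 32;

/**
 * Build a text quote selector for the substring [start, end) of `text`,
 * capturing up to `contextLen` characters before and after the quote.
 */
function quoteSelectorFromOffsets(text, start, end, contextLen = CONTEXT_LEN) {
  if (start < 0 || end > text.length || start >= end) {
    throw new RangeError('Invalid quote offsets');
  }
  return {
    type: 'TextQuoteSelector',
    exact: text.slice(start, end),
    // Clamp so that quotes near the start or end of the text get
    // shorter context instead of failing.
    prefix: text.slice(Math.max(0, start - contextLen), start),
    suffix: text.slice(end, Math.min(text.length, end + contextLen)),
  };
}
```

Changing `contextLen` trades selector size against how much surrounding text is available to disambiguate repeated quotes.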
Looks good!
Thank-you!
Depends on #2814

This PR replaces the quote selector anchoring algorithm with a new one added in #2087. See notes from the original draft below for the rationale.

Changes in detail:

- Change `TextQuoteAnchor` to use `matchQuote` rather than `dom-anchor-text-quote` to find matches in the document text
- Change `TextQuoteAnchor.toSelector` to directly generate a selector rather than delegating to `dom-anchor-text-quote`
- Update the tests for `TextQuoteAnchor`. In particular, make sure to check the properties of objects returned by the various methods instead of just their type.
- Remove `dom-anchor-text-quote` usage in tests for `anchoring/pdf.js`.

One choice I want to note regarding the tests is that they mock `matchQuote` but not `TextRange` (used for text position <-> Range conversion). I think this is a good balance of making it easy to test the behaviors of `TextQuoteAnchor` without coupling the tests too much to implementation details.

Notes from the original draft PR:
As an illustration of the performance improvements, here is a comparison of the amount of time spent anchoring annotations in the Public group on http://www.americanyawp.com/text/01-the-new-world/.
With the previous implementation:
With the new implementation:
On this page, the old algorithm produced 329 matches and 137 orphans. The new implementation finds 403 matches and 63 orphans. Here is an example of an annotation that anchors with the new implementation but did not under the old one:
Note there are significant changes in the text, but the current text is close enough to the text that was originally annotated to consider it a match.
Note: Not all of the 403 matches are good. In particular, both the old and new implementations are prone to finding incorrect matches for short terms (e.g. for "mastodons" and "megafauna" in this document). The new implementation should make it easier to rectify that later by changing this logic, though.
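One way the short-term problem could be approached is by making the allowed number of errors depend on the quote length. The sketch below is purely illustrative: the function name `maxErrorsForQuote`, the `ratio` of 0.5, and the `minLengthForFuzzy` cutoff of 12 are all assumptions and are not settings taken from this PR.

```javascript
// Hypothetical tuning sketch; names and thresholds are assumptions.
/**
 * Return the maximum number of edit errors to tolerate when matching `quote`.
 * Short quotes must match exactly, longer quotes tolerate errors in
 * proportion to their length.
 */
function maxErrorsForQuote(quote, { ratio = 0.5, minLengthForFuzzy = 12 } = {}) {
  if (quote.length < minLengthForFuzzy) {
    // A single-word quote such as "mastodons" would only anchor exactly.
    return 0;
  }
  return Math.floor(quote.length * ratio);
}
```

With these assumed defaults, a 9-character term gets no tolerance while a 20-character quote may differ by up to 10 edits. Suitable values would need to be tuned against real anchoring outcomes.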
The technical changes in this PR are:

- Add a `matchQuote` function in `src/annotator/anchoring/match-quote.js`
- Change the `TextQuoteAnchor` class to use `matchQuote` to anchor text quotes, and `TextRange` to generate quote selectors

For details on the fuzzy string matching algorithm itself, see the approx-string-match README. The heart of the previous implementation came from the diff-match-patch library, which was used indirectly via `dom-anchor-text-quote`. The TL;DR is that approx-string-match uses a more recent algorithm which can tolerate a larger number of errors with reasonable efficiency. It is also easier to program with because it accepts arbitrarily long patterns, whereas diff-match-patch only allows patterns up to 32 characters, so `dom-anchor-text-quote` had to work around that limit.
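For readers unfamiliar with approximate substring search, here is a minimal sketch of the underlying problem: find where a pattern occurs in a text with at most a given number of edits. It uses a plain dynamic-programming formulation (Sellers' algorithm) as a stand-in. This is not the algorithm approx-string-match uses and is far slower on long texts; the function name `bestApproxMatch` is hypothetical.

```javascript
// Illustrative stand-in only; not the approx-string-match implementation.
/**
 * Find the best approximate occurrence of `pattern` in `text`.
 * Returns { end, errors } for the earliest lowest-error match ending at
 * text offset `end` (exclusive), or null if every match needs more than
 * `maxErrors` edits (insertions, deletions, substitutions).
 */
function bestApproxMatch(text, pattern, maxErrors) {
  const m = pattern.length;
  // prev[i] = fewest edits to match pattern[0..i) against some substring
  // of the text ending at the previous text position.
  let prev = Array.from({ length: m + 1 }, (_, i) => i);
  let best = null;

  for (let j = 1; j <= text.length; j++) {
    // The empty pattern prefix matches at any position with zero edits,
    // which is what lets a match start anywhere in the text.
    const curr = [0];
    for (let i = 1; i <= m; i++) {
      const cost = pattern[i - 1] === text[j - 1] ? 0 : 1;
      curr[i] = Math.min(
        prev[i - 1] + cost, // match or substitute
        prev[i] + 1, // extra character in the text
        curr[i - 1] + 1 // character missing from the text
      );
    }
    if (curr[m] <= maxErrors && (best === null || curr[m] < best.errors)) {
      best = { end: j, errors: curr[m] };
    }
    prev = curr;
  }
  return best;
}
```

This runs in time proportional to `text.length * pattern.length`, which is the kind of cost a more efficient library algorithm avoids. Like the library described above, it places no fixed limit on pattern length.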