Add new fuzzy quote matching implementation #2814

robertknight · 2020-12-10T15:58:50Z

This PR is the first part of #2779. It adds the new fuzzy quote matching algorithm and tests. The next PR will then switch quote matching in the client to use the new implementation.

Implement a `matchQuote` function which will be used to replace
`dom-anchor-text-quote` for finding the best match for annotation quotes
in the document text.

The new implementation is based on the `approx-string-match` library and
provides several improvements over the existing one:

- Better performance when there are many differences between the quote
  and closest document text
- It will be easier for us to tune the degree of mismatch allowed
  between the quote and document text and how candidate matches are
  ranked
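
To illustrate the kind of search described above, here is a minimal sketch of approximate substring matching. It is not the `approx-string-match` implementation: `approxSearchEnds` and `bestMatch` are hypothetical names, and the sketch assumes a plain dynamic-programming table that reports only end offsets and error counts.

```javascript
// Hypothetical stand-in for an approximate search routine (not the library's
// code). It reports the end offset and edit distance of every position in
// `text` where some substring ending there matches `pattern` with at most
// `maxErrors` edits.
function approxSearchEnds(text, pattern, maxErrors) {
  // col[i] = fewest edits needed to match pattern.slice(0, i) against some
  // substring of the text ending at the previous text position.
  let col = [];
  for (let i = 0; i <= pattern.length; i++) {
    col.push(i);
  }
  const matches = [];
  for (let j = 0; j < text.length; j++) {
    const next = [0]; // An empty pattern prefix matches anywhere for free.
    for (let i = 1; i <= pattern.length; i++) {
      const substitution = col[i - 1] + (pattern[i - 1] === text[j] ? 0 : 1);
      const skipPatternChar = next[i - 1] + 1;
      const skipTextChar = col[i] + 1;
      next.push(Math.min(substitution, skipPatternChar, skipTextChar));
    }
    col = next;
    if (col[pattern.length] <= maxErrors) {
      matches.push({ end: j + 1, errors: col[pattern.length] });
    }
  }
  return matches;
}

// Pick the candidate with the fewest errors, or null when nothing qualifies.
function bestMatch(text, quote, maxErrors) {
  const candidates = approxSearchEnds(text, quote, maxErrors);
  if (candidates.length === 0) {
    return null;
  }
  return candidates.reduce((best, m) => (m.errors < best.errors ? m : best));
}
```

For example, `bestMatch("hello world", "wrld", 1)` returns `{ end: 11, errors: 1 }`, while a quote with no close counterpart in the text returns `null`. Raising or lowering `maxErrors` is the tuning knob for how much mismatch is tolerated.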

codecov · 2020-12-10T15:59:48Z

Codecov Report

Merging #2814 (500e9bc) into master (dd3fd83) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2814      +/-   ##
==========================================
+ Coverage   97.75%   97.77%   +0.01%     
==========================================
  Files         203      204       +1     
  Lines        7720     7764      +44     
  Branches     1708     1718      +10     
==========================================
+ Hits         7547     7591      +44     
  Misses        173      173

| Impacted Files | Coverage Δ |
|---|---|
| src/annotator/anchoring/match-quote.js | `100.00% <100.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

LMS007

I gave this a look over last week and tested it out a bit. Obviously we get much better orphan matching out of the gate, and I don't see any red flags. The code seems straightforward (ignoring approx-string-match), and I can't give any reasons not to merge this now!

LMS007 · 2020-12-10T20:46:49Z

src/annotator/anchoring/match-quote.js

+    return 0.0;
+  }
+  const matches = search(text, str, str.length);
+  return 1 - matches[0].errors / str.length;


readability for me :)

Suggested change

-  return 1 - matches[0].errors / str.length;
+  return 1 - (matches[0].errors / str.length);

robertknight · 2020-12-11

I agree, although I did have to suppress Prettier's formatting.
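
For reference, a minimal sketch of the score calculation under discussion. `similarityScore` is a hypothetical name, and the zero-length guard is an assumption about what the `return 0.0;` branch protects against. Division binds tighter than subtraction in JavaScript, so the parentheses only help readability; a `// prettier-ignore` comment is one way to stop Prettier from removing them.

```javascript
// Hypothetical helper: turn an error count into a score between 0 and 1,
// where 1 means an exact match and 0 means every character differed.
function similarityScore(errors, length) {
  if (length === 0) {
    // Assumed guard: avoid dividing by zero for an empty string.
    return 0.0;
  }
  // prettier-ignore
  return 1 - (errors / length);
}

// The parenthesized and unparenthesized forms evaluate identically.
const sameResult = 1 - 2 / 8 === 1 - (2 / 8);
```

With these definitions, `similarityScore(0, 8)` is `1`, `similarityScore(2, 8)` is `0.75`, and `sameResult` is `true`.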

LMS007 · 2020-12-10T20:49:00Z

src/annotator/anchoring/match-quote.js

+  const matches = search(text, quote, maxErrors);
+
+  if (matches.length === 0) {
+    // All matches had more than `maxErrors` errors.


I could perhaps call this a "candidate" or "candidate match"

e.g.
// All candidates had more than `maxErrors` errors.

robertknight · 2020-12-11

Looking at this again, the comment is pretty superfluous, so I just removed it.

robertknight requested a review from LMS007 December 10, 2020 15:58

robertknight mentioned this pull request Dec 10, 2020

Improve fuzzy quote matching #2779

Merged

LMS007 approved these changes Dec 10, 2020

robertknight force-pushed the match-quote branch from 0920964 to 500e9bc Compare December 11, 2020 08:43

robertknight merged commit d2e9f19 into master Dec 11, 2020

robertknight deleted the match-quote branch December 11, 2020 08:52

This was referenced Dec 11, 2020

When only TextQuoteSelector is used, anchoring should use suffix to differentiate when prefix and exact are the same hypothesis/product-backlog#1022

Closed

Anchoring is very slow on certain documents/URLs #189

Closed

Treora mentioned this pull request Dec 24, 2020

Fuzzy text quote matching apache/incubator-annotator#83

Open
