Anchoring is very slow on certain documents/URLs #189

robertknight · 2017-01-16T23:03:28Z

Steps to reproduce

Go to https://www.nytimes.com (which has 27 public annotations at the time of writing)
Activate Hypothesis

Actual behaviour

The browser hangs for several seconds after annotations are fetched and before they are identified as Orphans. Firefox' profiler shows a large amount of time being spent in diff-match-patch's fuzzy string search.

Additional details

Tested with Hypothesis v0.51 on Firefox 53, Chrome 57.

robertknight · 2017-06-26T15:36:42Z

Another and much worse instance of this was reported today by @jeremydean on https://www.nap.edu/read/9853/chapter/3 where the public channel as ~350 annotations. At a quick glance, it looks like many of the orphans might be from other chapters of the same book (?).

Just in case the annotations change in future, I've posted a dump of them at https://gist.github.com/robertknight/108ba6ae921ad567040ac0ef637f94f5

robertknight · 2017-06-26T15:38:39Z

I've been doing some work on improving how fuzzy string matching works in the dom-anchor-text-quote lib, which could be of use here. This is a profile from the current diff-match-patch based quote anchoring:

Here is the profile from a version of dom-anchor-text-quote using a new approx string matching lib:

Implement a replacement for dom-anchor-text-quote which uses the approx-string-match library for fuzzy string matching rather than diff-match-patch. Compared to the previous implementation, this should offer several advantages: - approx-string-match is much faster, due to better algorithms. More importantly, it doesn't have the same issues with pathologically bad performance when quotes are not found (see [1]) - approx-string-match supports arbitrarily long patterns, so the logic built on top of it is simpler - approx-string-match is smaller (3KB vs 51 KB minified) - The degree of fuzziness allowed can be controlled directly. Based on some heuristic testing, this is currently set as 20% of the quote's length - Bringing this module "in house" gives us freedom to make changes to the matching process more easily. [1] hypothesis#189

This resolves issues of: - Poor performance when a page has many annotations that do not anchor - The suffix not being taken into account when there is a match for the prefix Fixes hypothesis#189 Fixes hypothesis#1022

lyzadanger · 2020-09-10T19:37:16Z

I've re-verified that performance on https://www.nap.edu/read/9853/chapter/3 with the current production extension is unacceptable. Once anchoring is complete, the sidebar is responsive and scrolling is smooth, but the anchoring process is painful.

Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. TODO: ADD NUMBERS HERE ON HOW MANY OF THE NEW MATCHES ARE REAL AND HOW MANY ARE NOT - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)

Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)

Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes hypothesis#189 Improves hypothesis#73 (annotations will still orphan, but more quickly)

Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)

robertknight · 2020-12-11T20:17:46Z

Fixed by #2814 (client release >= v1.617.0).

dwhly mentioned this issue Oct 31, 2017

Performance issues w 200 annotations on a page #556

Closed

lyzadanger changed the title ~~Anchoring is very slow on the nytimes.com homepage~~ Anchoring is very slow on certain documents/URLs Sep 10, 2020

lyzadanger added the performance label Oct 27, 2020

robertknight mentioned this issue Dec 1, 2020

Improve fuzzy quote matching #2779

Merged

robertknight closed this as completed Dec 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Anchoring is very slow on certain documents/URLs #189

Anchoring is very slow on certain documents/URLs #189

robertknight commented Jan 16, 2017

robertknight commented Jun 26, 2017

robertknight commented Jun 26, 2017

lyzadanger commented Sep 10, 2020

robertknight commented Dec 11, 2020

Anchoring is very slow on certain documents/URLs #189

Anchoring is very slow on certain documents/URLs #189

Comments

robertknight commented Jan 16, 2017

Steps to reproduce

Actual behaviour

Additional details

robertknight commented Jun 26, 2017

robertknight commented Jun 26, 2017

lyzadanger commented Sep 10, 2020

robertknight commented Dec 11, 2020