-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Anchoring is very slow on certain documents/URLs #189
Comments
Another and much worse instance of this was reported today by @jeremydean on https://www.nap.edu/read/9853/chapter/3 where the public channel as ~350 annotations. At a quick glance, it looks like many of the orphans might be from other chapters of the same book (?). Just in case the annotations change in future, I've posted a dump of them at https://gist.github.com/robertknight/108ba6ae921ad567040ac0ef637f94f5 |
I've been doing some work on improving how fuzzy string matching works in the dom-anchor-text-quote lib, which could be of use here. This is a profile from the current diff-match-patch based quote anchoring: Here is the profile from a version of dom-anchor-text-quote using a new approx string matching lib: |
Implement a replacement for dom-anchor-text-quote which uses the approx-string-match library for fuzzy string matching rather than diff-match-patch. Compared to the previous implementation, this should offer several advantages: - approx-string-match is much faster, due to better algorithms. More importantly, it doesn't have the same issues with pathologically bad performance when quotes are not found (see [1]) - approx-string-match supports arbitrarily long patterns, so the logic built on top of it is simpler - approx-string-match is smaller (3KB vs 51 KB minified) - The degree of fuzziness allowed can be controlled directly. Based on some heuristic testing, this is currently set as 20% of the quote's length - Bringing this module "in house" gives us freedom to make changes to the matching process more easily. [1] hypothesis#189
This resolves issues of: - Poor performance when a page has many annotations that do not anchor - The suffix not being taken into account when there is a match for the prefix Fixes hypothesis#189 Fixes hypothesis#1022
I've re-verified that performance on https://www.nap.edu/read/9853/chapter/3 with the current production extension is unacceptable. Once anchoring is complete, the sidebar is responsive and scrolling is smooth, but the anchoring process is painful. |
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. TODO: ADD NUMBERS HERE ON HOW MANY OF THE NEW MATCHES ARE REAL AND HOW MANY ARE NOT - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes hypothesis#189 Improves hypothesis#73 (annotations will still orphan, but more quickly)
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)
Introduce a new fuzzy quote anchoring implementation which is faster than the existing one when many quote selectors fail to exactly match and gives us more insight into and control over the fuzzy matching process. - It uses the `approx-string-match` library for approximate string matching ("fuzzy string search"). The previous implementation in `dom-anchor-text-quote` used Google's `diff-match-patch` library. `approx-string-match` is much faster when the match has many errors, including when there is no good match at all. This resolves a problem where the browser could become unresponsive for a significant period of time when anchoring large numbers of annotations (hundreds) on pages where there have been significant changes in the content. In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/ for example the client spends a total of ~2.4 seconds running JS in between starting the client and anchoring completing compared to ~11 seconds with the previous implementation. The new library is also smaller. The minified `annotator.bundle.js` size reduced by 15% (25KB). - As a result of the previous point, we can allow the fuzzy quote matcher to find more distant matches from the originally anchored text. On http://www.americanyawp.com/text/01-the-new-world/ for example the number of orphans dropped from 137 to 63. - The new implementation gives us more visibility into and control over the logic used to determine the "best" fuzzy match for a quote selector. This enables us to better answer internal/user questions about how the client handles content changes in a document, as well as to understand why quote matching produces the results in does and make changes if we need to. - We have more control over how quote selectors are generated. This will enable us to change the amount of context that is captured for example. Fixes #189 Improves #73 (annotations will still orphan, but more quickly)
Fixed by #2814 (client release >= v1.617.0). |
Steps to reproduce
Actual behaviour
The browser hangs for several seconds after annotations are fetched and before they are identified as Orphans. Firefox' profiler shows a large amount of time being spent in diff-match-patch's fuzzy string search.
Additional details
Tested with Hypothesis v0.51 on Firefox 53, Chrome 57.
The text was updated successfully, but these errors were encountered: