Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anchoring is very slow on certain documents/URLs #189

Closed
robertknight opened this issue Jan 16, 2017 · 4 comments
Closed

Anchoring is very slow on certain documents/URLs #189

robertknight opened this issue Jan 16, 2017 · 4 comments

Comments

@robertknight
Copy link
Member

Steps to reproduce

  1. Go to https://www.nytimes.com (which has 27 public annotations at the time of writing)
  2. Activate Hypothesis

Actual behaviour

The browser hangs for several seconds after annotations are fetched and before they are identified as Orphans. Firefox' profiler shows a large amount of time being spent in diff-match-patch's fuzzy string search.

Additional details

Tested with Hypothesis v0.51 on Firefox 53, Chrome 57.

@robertknight
Copy link
Member Author

Another and much worse instance of this was reported today by @jeremydean on https://www.nap.edu/read/9853/chapter/3 where the public channel as ~350 annotations. At a quick glance, it looks like many of the orphans might be from other chapters of the same book (?).

Just in case the annotations change in future, I've posted a dump of them at https://gist.github.com/robertknight/108ba6ae921ad567040ac0ef637f94f5

@robertknight
Copy link
Member Author

I've been doing some work on improving how fuzzy string matching works in the dom-anchor-text-quote lib, which could be of use here. This is a profile from the current diff-match-patch based quote anchoring:

bitap-string-match-profile

Here is the profile from a version of dom-anchor-text-quote using a new approx string matching lib:

approx-string-match-profile

robertknight added a commit to robertknight/client that referenced this issue Oct 30, 2019
Implement a replacement for dom-anchor-text-quote which uses the
approx-string-match library for fuzzy string matching rather than
diff-match-patch.

Compared to the previous implementation, this should offer several
advantages:

 - approx-string-match is much faster, due to better algorithms. More
   importantly, it doesn't have the same issues with pathologically bad
   performance when quotes are not found (see [1])

 - approx-string-match supports arbitrarily long patterns, so the logic
   built on top of it is simpler

 - approx-string-match is smaller (3KB vs 51 KB minified)

 - The degree of fuzziness allowed can be controlled directly.
   Based on some heuristic testing, this is currently set as 20% of the
   quote's length

 - Bringing this module "in house" gives us freedom to make changes to
   the matching process more easily.

[1] hypothesis#189
robertknight added a commit to robertknight/client that referenced this issue Oct 30, 2019
This resolves issues of:

 - Poor performance when a page has many annotations that do not anchor
 - The suffix not being taken into account when there is a match for the
   prefix

Fixes hypothesis#189
Fixes hypothesis#1022
@lyzadanger
Copy link
Contributor

I've re-verified that performance on https://www.nap.edu/read/9853/chapter/3 with the current production extension is unacceptable. Once anchoring is complete, the sidebar is responsive and scrolling is smooth, but the anchoring process is painful.

@lyzadanger lyzadanger changed the title Anchoring is very slow on the nytimes.com homepage Anchoring is very slow on certain documents/URLs Sep 10, 2020
robertknight added a commit that referenced this issue Dec 1, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

   TODO: ADD NUMBERS HERE ON HOW MANY OF THE NEW MATCHES ARE REAL AND
   HOW MANY ARE NOT

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes #189
Improves #73 (annotations will still orphan, but more quickly)
robertknight added a commit that referenced this issue Dec 1, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes #189
Improves #73 (annotations will still orphan, but more quickly)
robertknight added a commit to robertknight/client that referenced this issue Dec 5, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes hypothesis#189
Improves hypothesis#73 (annotations will still orphan, but more quickly)
robertknight added a commit that referenced this issue Dec 7, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes #189
Improves #73 (annotations will still orphan, but more quickly)
robertknight added a commit that referenced this issue Dec 8, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes #189
Improves #73 (annotations will still orphan, but more quickly)
robertknight added a commit that referenced this issue Dec 10, 2020
Introduce a new fuzzy quote anchoring implementation which is faster
than the existing one when many quote selectors fail to exactly match
and gives us more insight into and control over the fuzzy matching
process.

 - It uses the `approx-string-match` library for approximate string
   matching ("fuzzy string search"). The previous implementation in
   `dom-anchor-text-quote` used Google's `diff-match-patch` library.
   `approx-string-match` is much faster when the match has many errors,
   including when there is no good match at all.

   This resolves a problem where the browser could become unresponsive
   for a significant period of time when anchoring large numbers of
   annotations (hundreds) on pages where there have been significant
   changes in the content.

   In the "Public" group on http://www.americanyawp.com/text/01-the-new-world/
   for example the client spends a total of ~2.4 seconds running JS in between starting
   the client and anchoring completing compared to ~11 seconds with the
   previous implementation.

   The new library is also smaller. The minified `annotator.bundle.js`
   size reduced by 15% (25KB).

 - As a result of the previous point, we can allow the fuzzy quote
   matcher to find more distant matches from the originally anchored
   text.

   On http://www.americanyawp.com/text/01-the-new-world/ for example the
   number of orphans dropped from 137 to 63.

 - The new implementation gives us more visibility into and control over
   the logic used to determine the "best" fuzzy match for a quote
   selector. This enables us to better answer internal/user questions
   about how the client handles content changes in a document, as well
   as to understand why quote matching produces the results in does and
   make changes if we need to.

 - We have more control over how quote selectors are generated. This
   will enable us to change the amount of context that is captured for
   example.

Fixes #189
Improves #73 (annotations will still orphan, but more quickly)
@robertknight
Copy link
Member Author

Fixed by #2814 (client release >= v1.617.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants