Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotations immediately orphan when selecting text in a PDF that includes last character on page #1329

Closed
robertknight opened this issue Aug 27, 2019 · 2 comments · Fixed by #1449
Assignees

Comments

@robertknight
Copy link
Member

This issue was created as part of investigating hypothesis/lms#875.

Steps to reproduce

  1. Go to https://via.hypothes.is/https://arxiv.org/pdf/1908.04683v3.pdf, scroll down to page 5 and select the text beginning at the bottom of the page from "For our experiments" up to and including the page number "5" at the bottom of the page.
  2. Click "Annotate"

Expected result: The selected text is highlighted and a new annotation appears in the "Annotations" tab in the client
Actual result: The selected text is not highlighted and the annotation appears in the "Orphans" tab in the client

After reloading the page, the annotation continues to be an orphan instead of anchoring.

Notes

The issue occurs whenever a selection to annotate matches these criteria:

  1. The text is unique enough within the document that it does not fuzzily/roughly match any other text
  2. The selected text includes the last character in the page's text. More specifically, it needs to extend up to the end of the invisible div.textLayer element that PDF.js creates for each page to enable text selection.
@mkdir-washington-edu
Copy link
Contributor

Also occurs here: https://hypothesis.zendesk.com/agent/tickets/5926

@robertknight
Copy link
Member Author

The immediate problem here is an upstream issue in the dom-anchor-text-position library. In brief:

  1. When anchoring a quote within a PDF, the client fetches the text of each page and attempts to locate the quote within the text
  2. If successful, there are two possibilities: Either the page where the quote was found, in which case there will be a text layer <div> used for text selection, or it is not, in which case there will just be a placeholder for the text layer
  3. If there is a text layer, the quote is mapped to a text position selector within the text layer, and the text position selector is mapped to a DOM Range using dom-anchor-text-position. This is basically the same as if we were doing HTML anchoring.

There is a problem in (3) if the generated text position includes the final character of the text layer <div> as this triggers an exception in dom-anchor-text-quote.

Although the basic problem is not actually specific to PDF annotation, it essentially never happens for HTML annotation because there are always some characters in an HTML document beyond what the user can actually select. With PDF annotation however, it is actually possible to select the last character in the text layer.

I've submitted a PR and test cases upstream. Depending on how soon Randall is able to take a look and whether any further changes are needed, we'll either upgrade the library or temporarily use a fork.

tilgovi added a commit to tilgovi/dom-anchor-text-position that referenced this issue Oct 8, 2019
tilgovi added a commit to tilgovi/dom-anchor-text-position that referenced this issue Oct 12, 2019
robertknight added a commit that referenced this issue Oct 22, 2019
This currently fails to a bug on `dom-anchor-text-position`'s `toRange`
implementation.

See #1329
robertknight added a commit that referenced this issue Oct 22, 2019
Add a new implementation of conversion from text positions (that is,
offsets within an element's `textContent`) to DOM `Range`s along with
test cases.

This addresses an issue with the existing implementation of `toRange` in
the `dom-anchor-text-position` package where conversion fails when the
text position includes the end of the element's text.

Even if/when the issue is addressed upstream, I think it would be useful
to retain these test cases to guard against future regressions.

See #1329
robertknight added a commit that referenced this issue Oct 22, 2019
Change PDF anchoring to use the new text position => Range
implementation from the `src/annotator/anchoring/text-position` module
which fixes an issue when the last text in a PDF page is selected.

Fixes #1329
robertknight added a commit that referenced this issue Oct 22, 2019
Add a new implementation of conversion from text positions (that is,
offsets within an element's `textContent`) to DOM `Range`s along with
test cases.

This addresses an issue with the existing implementation of `toRange` in
the `dom-anchor-text-position` package where conversion fails when the
text position includes the end of the element's text.

Even if/when the issue is addressed upstream, I think it would be useful
to retain these test cases to guard against future regressions.

See #1329
robertknight added a commit that referenced this issue Oct 22, 2019
Change PDF anchoring to use the new text position => Range
implementation from the `src/annotator/anchoring/text-position` module
which fixes an issue when the last text in a PDF page is selected.

Fixes #1329
tilgovi added a commit to tilgovi/dom-anchor-text-position that referenced this issue Oct 28, 2019
tilgovi added a commit to tilgovi/dom-anchor-text-position that referenced this issue Oct 28, 2019
tilgovi added a commit to tilgovi/dom-anchor-text-position that referenced this issue Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants