TextLayer contains text that is not visible when the page is rendered #11509

Closed
ldenoue opened this issue Jan 13, 2020 · 3 comments

Comments


ldenoue commented Jan 13, 2020

Attach (recommended) or Link to PDF file here:
materials-12-00322.pdf

Configuration:

  • Web browser and its version: Chrome Version 80.0.3987.42 (Official Build) beta (64-bit)
  • Operating system and its version: MacBook Air version 10.14.4 (18E226)
  • PDF.js version: latest online
  • Is a browser extension: no

Steps to reproduce the problem:

  1. Load page 5 of the attached PDF.
  2. Notice that you can select text that is invisible to the eye (Preview also has this issue).

What is the expected behavior? (add screenshot)
The TextLayer should not contain text that is not visible on the page.

What went wrong? (add screenshot)
How can we detect that a ctx.fillText call draws nothing visible?
I checked several things, such as the fill color (black) and the text matrix (looks correct), but found no apparent way to determine whether a ctx.fillText call actually draws something.
If we could, we would be able to keep only the Unicode characters in the TextLayer that produce something on the screen.
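
One brute-force way to approach the question above is to snapshot the pixels of a region before and after the draw call and compare them. The following is only a minimal sketch under stated assumptions: `pixelsChanged` and `drawsSomething` are hypothetical helper names, not part of PDF.js, and the only real API used is the standard canvas method `ctx.getImageData`. It would also miss text painted in exactly the same color as the background, and reading pixels back for every text run is likely too slow for production use.

```javascript
// Hypothetical helper (not part of PDF.js): returns true if any RGBA byte
// differs between two pixel snapshots of the same region.
function pixelsChanged(before, after) {
  if (before.length !== after.length) {
    throw new Error("snapshots must cover the same region");
  }
  for (let i = 0; i < before.length; i++) {
    if (before[i] !== after[i]) {
      return true;
    }
  }
  return false;
}

// Hypothetical wrapper: runs `draw(ctx)` and reports whether the region
// (x, y, w, h) of the canvas changed as a result.
// Assumption: `ctx` is a CanvasRenderingContext2D, or any object exposing
// a compatible getImageData(x, y, w, h) that returns { data }.
function drawsSomething(ctx, draw, x, y, w, h) {
  const before = ctx.getImageData(x, y, w, h).data;
  draw(ctx);
  const after = ctx.getImageData(x, y, w, h).data;
  return pixelsChanged(before, after);
}
```

In a browser this could be called as `drawsSomething(ctx, c => c.fillText("abc", 10, 20), 0, 0, 100, 30)`, where the region is an assumed bounding box around the text.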

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

@Snuffleupagus
Collaborator

Duplicate of #9666; as explained in #9666 (comment) there's generally no way of fixing broken PDF files like this without also breaking valid cases.

@ldenoue
Author

ldenoue commented Jan 13, 2020

Do you understand why the text is not visible during rendering?
Is it caused by a matrix scaling?
Does the font contain empty glyphs?
There must be a way to determine whether the text is painted or not, and thus use this for the text layer.

@ldenoue
Author

ldenoue commented Jan 14, 2020

Digging a bit more into these examples, I noticed that the text ends with a clip.
So basically, characters are drawn normally on the canvas, but then are clipped out.

I modified pdf.js to render all text in red, and you'll see that the PDF contains extra text when I remove the call to this.pendingClip in the consumePath method.
(Three screenshots were attached here in the original issue.)

So one would need to record the locations of previously painted text and determine whether subsequent clips remove them.
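
The bookkeeping idea above could be sketched roughly as follows. This is an illustration under strong simplifying assumptions, not how PDF.js works: the clip is treated as a single axis-aligned rectangle in canvas coordinates (real PDF clipping paths can be arbitrary shapes), and `intersectRects`, `visibleTextItems`, and the `box` field on each item are hypothetical names introduced for this example.

```javascript
// Rectangles are { x, y, width, height } in canvas (device) coordinates.
// Returns the overlapping rectangle, or null if the two do not overlap.
function intersectRects(a, b) {
  const x1 = Math.max(a.x, b.x);
  const y1 = Math.max(a.y, b.y);
  const x2 = Math.min(a.x + a.width, b.x + b.width);
  const y2 = Math.min(a.y + a.height, b.y + b.height);
  if (x2 <= x1 || y2 <= y1) {
    return null; // empty or degenerate overlap
  }
  return { x: x1, y: y1, width: x2 - x1, height: y2 - y1 };
}

// Hypothetical filter: keeps only the recorded text items whose bounding
// box overlaps the clip rectangle at least partially.
// Assumption: each item carries a `box` rectangle recorded when it was painted.
function visibleTextItems(items, clipRect) {
  return items.filter(item => intersectRects(item.box, clipRect) !== null);
}
```

For example, with a clip of `{ x: 0, y: 0, width: 100, height: 100 }`, an item whose box starts at `x: 200` would be dropped, while one at `x: 10, y: 10` would be kept. Partially clipped text is kept here; a stricter policy could compare the overlap area against the box area.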

Probably not impossible, but most likely not a high priority for the PDF.js project.
