TextLayer not well aligned for Amharic fonts extracted from this PDF #11756

Closed
technicaltitch opened this issue Mar 27, 2020 · 1 comment


technicaltitch commented Mar 27, 2020

File:
Social Studies in Amharic Grade 5 Student Book.pdf

Configuration:

  • Web browser and its version: Chrome 79
  • Operating system and its version: macOS 10.14.6
  • PDF.js version: 9a437a1
  • Is a browser extension: No

Steps to reproduce the problem:

  1. Load the attached file
  2. Select non-justified lines by dragging down the left margin of the PDF. The selection is not aligned with the text displayed on the canvas. (Something similar happens when double-clicking to select a word, etc.) This is because the font is loaded from the PDF for the canvas but is not used on the TextLayer, which contains strings such as " ôÇ=^L© ÈV¡^c=Á© ]ùwK=¡ ¾ƒUIƒ T>’>e‚ " (the right character codes, but rendered in a standard font).
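
To make the suspected mechanism concrete, here is a minimal sketch of how a span drawn in a fallback font can drift away from the glyphs on the canvas even after the span is stretched to the same total width. It is a simplified model, not the actual pdf.js text layer code: the function name `boundaryDrift` and the per-glyph width arrays are assumptions made for illustration only.

```javascript
// Simplified model (an assumption, not pdf.js source): a text-layer span is
// stretched horizontally by one factor so that its total width in the
// fallback font equals the total width of the glyphs drawn on the canvas.
// Individual character boundaries can still disagree.
function boundaryDrift(pdfWidths, fallbackWidths) {
  if (pdfWidths.length !== fallbackWidths.length) {
    throw new Error("width arrays must have the same length");
  }
  const sum = (xs) => xs.reduce((a, b) => a + b, 0);
  const canvasTotal = sum(pdfWidths);
  const fallbackTotal = sum(fallbackWidths);
  // One scale factor for the whole span, comparable to a CSS scaleX().
  const scaleX = fallbackTotal > 0 ? canvasTotal / fallbackTotal : 1;

  let canvasX = 0;
  let spanX = 0;
  const drift = [];
  for (let i = 0; i < pdfWidths.length; i++) {
    canvasX += pdfWidths[i];
    spanX += fallbackWidths[i] * scaleX;
    // Positive: the selection edge lies to the right of the drawn glyph edge.
    drift.push(spanX - canvasX);
  }
  return { scaleX, drift };
}
```

In this model the last entry of `drift` is zero (up to rounding), because the span as a whole is stretched to fit. The intermediate entries are nonzero whenever the two fonts distribute their widths differently, which would show up as a selection edge that does not sit on a character boundary.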

What is the expected behavior? (add screenshot)
The selected area should cover the text displayed. I presume this means loading the fonts from the PDF into the TextLayer, and that this is skipped in order to improve performance (ref). If so, would a setting to turn off this optimization be possible? (I tried the textLayerMode and enhanceTextSelection settings, and also adding an extracted font-family codename to the TextLayer span CSS.)

I have generated a screenshot of expected behaviour by rendering using SVG, and deleting the TextLayer div:
Screenshot 2020-03-28 at 02 37 04

Going by the deeper shading, I suspect my diagnosis is nonsense and there are several spans overlaid. (Unfortunately, I have textbooks for the entire school curriculum and I'm trying to launch for lockdown study. I only have evenings, so I am stuck with these awful PDFs.)

I would love to extract the font though so I can extract and display text outside pdf.js, if possible. Any pointers very gratefully received - I'm still struggling to navigate the core.
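
As a possible starting point for working with the text outside the viewer, the sketch below lists each text run on a page together with its font identifier and position, using the public `getDocument`, `getPage`, and `getTextContent` calls. The helper names `describeTextItems` and `dumpPageText` are hypothetical, and the pdf.js module is passed in by the caller because the correct import path depends on the pdf.js version. This does not extract the embedded font itself, and for this PDF the strings would presumably be the same raw character codes that appear in the TextLayer.

```javascript
// Flatten a pdf.js-style text content object ({ items, styles }) into one
// plain record per text run. Hypothetical helper, for illustration only.
function describeTextItems(textContent) {
  return textContent.items.map((item) => {
    const style = textContent.styles[item.fontName] || {};
    return {
      str: item.str,                         // text as decoded by pdf.js
      fontName: item.fontName,               // internal font identifier
      fontFamily: style.fontFamily || null,  // fallback CSS family, if any
      x: item.transform[4],                  // translation part of the matrix
      y: item.transform[5],
    };
  });
}

// Hypothetical helper: `pdfjsLib` is the pdf.js module supplied by the
// caller, and `data` is the PDF file contents as a typed array.
async function dumpPageText(pdfjsLib, data, pageNumber) {
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const page = await doc.getPage(pageNumber);
  return describeTextItems(await page.getTextContent());
}
```

Grouping the returned records by `fontName` shows which runs share an embedded font, which may help when working out a mapping from these character codes to real Amharic code points.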

What went wrong? (add screenshot)
As you can see from the first screenshot, the selected areas do not match the text areas:

Screenshot 2020-03-28 at 02 11 52

In the second screenshot you can faintly see the actual characters on the TextLayer spans (I made the span color black):

Screenshot 2020-03-28 at 02 12 33

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):
I am running the latest code from Git in Node.

technicaltitch changed the title from "TextLayer not well aligned for Amharic fonts extracted from the PDF" to "TextLayer not well aligned for Amharic fonts extracted from this PDF" on Mar 28, 2020
Snuffleupagus (Collaborator) commented

Fixed by PR #13424, at least as far as the alignment is concerned. Copying the text still only produces "garbage", but that can be reproduced in Adobe Reader as well.
