Text selection is wrong: missing lines #2344

Open
yurivict opened this issue Nov 4, 2012 · 14 comments

@yurivict
commented Nov 4, 2012

Consider this document: http://lists.dragonflybsd.org/pipermail/users/attachments/20121010/7996ff88/attachment-0002.pdf
Go to page 4.
Select the text from the word "Hardware" to the word "Goal".
Large portions of the text, e.g. "- 2x Xeon X5650", don't get selected at all.

There are a few other bugs about selection, such as these:
#2097
#1994
#951

But I didn't immediately get the impression that these are the same issue.

@gigaherz

Contributor

commented Nov 4, 2012

It seems that the text selection orders the selection pieces differently from how they appear visually. It may be possible to fix the selection in this case by sorting the contents by the Y coordinate first and the X coordinate second, but that would cause issues with any multi-column text.

For reference, Sumatra PDF behaves exactly the same way as pdf.js in this document. I'd say the issue is invalid because it's a problem with the PDF, not the selection algorithm.
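
To make the trade-off of a Y-then-X sort concrete, here is a minimal Python sketch (not pdf.js code). The `Fragment` type, the coordinates, and the `sort_visually` helper are all hypothetical, and `y` is assumed to be measured from the top of the page. Under those assumptions, the sort repairs a single column whose pieces were written out of order, but it interleaves the lines of two side-by-side columns.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment: its text and top-left position."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top of the page (an assumption)


def sort_visually(fragments):
    """Order fragments by Y first, then by X."""
    return sorted(fragments, key=lambda f: (f.y, f.x))


# A single column whose fragments were written to the file out of order.
single = [
    Fragment("Goal", 10, 30),
    Fragment("Hardware", 10, 10),
    Fragment("- 2x Xeon X5650", 10, 20),
]
print([f.text for f in sort_visually(single)])
# ['Hardware', '- 2x Xeon X5650', 'Goal']

# Two columns: the same sort interleaves left and right lines.
two_columns = [
    Fragment("left 1", 10, 10),
    Fragment("left 2", 10, 20),
    Fragment("right 1", 200, 10),
    Fragment("right 2", 200, 20),
]
print([f.text for f in sort_visually(two_columns)])
# ['left 1', 'right 1', 'left 2', 'right 2']
```

The second result is the multi-column problem: a reader would expect both left-column lines before the right-column lines.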

@Snuffleupagus

Contributor

commented Nov 4, 2012

Even Adobe Reader (11.0.0) behaves the same way as pdf.js does.

@yurivict

Author

commented Nov 4, 2012

I confirm that Adobe Reader also has this issue.
okular-0.14.3, however, selects the text properly.

Visual appearance is what matters for a PDF, not the order in which the text is written into the PDF file itself.
I would say, this is a valid issue, even though the adobe implementation also suffers from it.

@gigaherz

Contributor

commented Nov 4, 2012

If you have any idea how to deduce the right order when the page may include multiple columns, text in tables, figure titles, etc., then feel free to propose it, or even implement it and open a pull request.

@yurivict

Author

commented Nov 4, 2012

Ok, keep it open.

@yurivict

Author

commented Nov 5, 2012

So there are many text fragments all over the page. They may be in multiple columns, in tables, or in figure titles.
Sort all such text fragments by Y and then by X. Assume they don't overlap. You then have an array of arrays: the outer array is ordered by Y, and each subarray is ordered by X.
A selection from point (X1,Y1) to point (X2,Y2) should pick all qualifying text fragments and display them as selected.

As for possible text rotations, that is another story. But this can also be accommodated.
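
The proposal above can be sketched roughly as follows, in Python rather than pdf.js code. Everything here is an assumption made for illustration: the `Fragment` type, the helper names `build_rows` and `select`, the coordinates, the convention that `y` grows downward, and the idea that fragments on the same line share exactly the same `y`. "Qualifying" is interpreted as: every fragment on rows strictly between the two points, plus fragments at or after X1 on the first row and at or before X2 on the last row. The two points are taken to coincide with fragment origins.

```python
from itertools import groupby
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment: its text and top-left position."""
    text: str
    x: float
    y: float  # assumed to grow downward from the top of the page


def build_rows(fragments):
    """Return rows from top to bottom; each row is sorted by X."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return [list(row) for _, row in groupby(ordered, key=lambda f: f.y)]


def select(rows, start, end):
    """Pick fragments between two (x, y) points, row by row."""
    # Normalise so that (x1, y1) is the earlier point in row order.
    (x1, y1), (x2, y2) = sorted([start, end], key=lambda p: (p[1], p[0]))
    picked = []
    for row in rows:
        y = row[0].y
        if y < y1 or y > y2:
            continue  # row lies entirely outside the selection
        for frag in row:
            if y == y1 and frag.x < x1:
                continue  # before the start point on the first row
            if y == y2 and frag.x > x2:
                continue  # after the end point on the last row
            picked.append(frag)
    return picked


# Fragments deliberately listed out of visual order.
fragments = [
    Fragment("Goal", 10, 30),
    Fragment("title", 10, 0),
    Fragment("Hardware", 10, 10),
    Fragment("- 2x Xeon X5650", 20, 20),
    Fragment("footer", 10, 90),
]
rows = build_rows(fragments)
selected = select(rows, (10, 10), (10, 30))
print([f.text for f in selected])
# ['Hardware', '- 2x Xeon X5650', 'Goal']
```

This sketch handles only the simple single-column case. It does nothing about multiple columns, tables, right-to-left text, or rotation, and real fragments on one line rarely share an exact `y`, so a tolerance would be needed.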

@yurydelendik

Contributor

commented Nov 5, 2012

There is a reason why other readers do not fix this: reading order. Text layout can be complex. It can be right-to-left, or top-to-bottom (even English text can be written this way). For text in tables it is almost impossible to guess how it should be read: it can be ordered by rows or by columns. Usually the reading order gives a good text order for copy-and-paste or search, unless the file was generated by some awful PDF generator. (PDF/A is attempting to enforce the right order.)

If somebody is going to implement what was discussed here, there should be a mode to choose between the original order and the "new/artificial" one.

@yurydelendik

Contributor

commented Nov 5, 2012

Generated by Calc / LibreOffice 3.6. That explains a lot.

@yurydelendik

Contributor

commented Nov 5, 2012

"I would say, this is a valid issue, even though the adobe implementation also suffers from it."

It's an issue with the PDF format, not just with Adobe's implementation. It's too subjective to judge whether it's valid or invalid.

Unless somebody is going to assign this issue to himself/herself, it will be closed. @yurivict do you want to work on it?

@yurivict

Author

commented Nov 5, 2012

I disagree with this attitude: "since the problem has some complex, difficult-to-solve cases, we won't solve the problem at all."
Apparently the situation can be improved, and relatively simple cases can be processed gracefully. Okular (through the poppler library) demonstrates it in this case.

I will work on it when I have time, so keep it open and assign it to me.

@retornam

Member

commented Jan 16, 2013

This might be related.
STR (steps to reproduce):

  1. Open http://www.igvita.com/slides/2012/webperf-crash-course.pdf using Nightly 21.0a1 (2013-01-16).
  2. Select any text.
  3. Copy it and paste the text into another window.
  4. Notice that the pasted text is not what you selected.

See the screencast here: http://screencast.com/t/vJrn4kBh

@jjmason

commented Jan 21, 2013

This is a long-standing issue for PDFs. I ran into it a while back while working on a project that needed to extract text from PDFs, and if I have time I could look into adding some support for it to this project.

For what it's worth, the right search terms for finding literature on this are "pdf text extraction".

@wsloand

commented Mar 10, 2013

If you go to figure 13 in the default PDF shown in http://mozilla.github.com/pdf.js/web/viewer.html, select the 0 on the fourth row under "loops", and copy it, you'll get an "s" out (in Chrome 25.0.1364.97). The text within paragraphs seems to line up better. Is it possible to fix this?

@gigaherz

Contributor

commented Mar 10, 2013

The text layer does not match the text in that table because it uses "special" spacing values.

To see what I mean, open: http://mozilla.github.com/pdf.js/web/viewer.html#textLayer=visible
It will show the text layer content in a visible way. You select the "s" because that's what the text layer has under it. I don't know if it can be fixed; I'm just explaining the issue.

7 participants