Search matches characters across new lines #2806

Closed
xavier114fch opened this issue Feb 26, 2013 · 6 comments · Fixed by #13261

Comments

@xavier114fch
Contributor

Just observing this with the default tracemonkey paper.

  1. Use the Find tool and enter "na" in the search box. Check the "Highlight all" box.
  2. Scroll down to page 2 of the paper, and see the 3rd paragraph. It highlights the "na" in "In a", which is across a new line.
  3. Change the search query to "andt" and you can see it highlights "andt" in "and the".

It seems some sort of check is needed to ensure that search does not return these results.
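
A minimal sketch of how such a false match can arise, assuming the searchable page text is built by concatenating the text of each line with no separator. The line strings and the `contains` helper below are made up for illustration; they are not taken from the paper or from PDF.js.

```javascript
// Hypothetical lines of page text: "In" ends one line, "a" starts the next.
const lines = ["In", "a typical program"];

// Joining with no separator erases the line break from the searchable string.
const joinedWithoutSeparator = lines.join("");
// Joining with a single space keeps the two words apart.
const joinedWithSpace = lines.join(" ");

// Case-insensitive substring check (illustrative helper, not a PDF.js API).
function contains(text, query) {
  return text.toLowerCase().includes(query.toLowerCase());
}

console.log(contains(joinedWithoutSeparator, "na")); // true
console.log(contains(joinedWithSpace, "na")); // false
```

With no separator, "In" and "a" merge into "Ina", so the query "na" matches across the line break.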

@Quicksaver

That is not the only case where the search returns wrong matches or fails to find the correct matches.

The PDFFindController.pageContents object, which PDFJS's find method uses to return search matches, often doesn't reflect exactly what a page's text content actually is.

In my add-on, FindBar Tweak, I use the same object in the same way to retrieve find results from pdf documents. Through its Find in All Tabs feature, the limitations of this object are very apparent:

  • In every pdf page there are no newline characters, ever!
  • Space characters are often omitted from this object, which greatly influences search results. http://www.selab.isti.cnr.it/ws-mate/example.pdf is especially problematic: there are absolutely no spaces in this specific document!

At the very least, the space characters should more closely reflect the actual document. Newline characters aren't as critical, since they never appear in a search query anyway; in my opinion, a single space character in place of each newline would be a close enough solution.
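
A rough sketch of the "single space in place of newlines" idea, under stated assumptions: each text run is represented as an object with a `str` field and a `y` baseline position, and a change in `y` is treated as a line break. These field names and the `buildPageContent` helper are hypothetical and do not describe how PDF.js actually builds `pageContents`.

```javascript
// Hypothetical text runs: `str` is the text, `y` is the baseline position.
// The shape is an assumption for illustration, not the PDF.js data model.
const items = [
  { str: "and", y: 700 },
  { str: "the trace", y: 688 },
  { str: " is", y: 688 },
];

function buildPageContent(textItems) {
  let content = "";
  let previousY = null;
  for (const item of textItems) {
    const startsNewLine = previousY !== null && item.y !== previousY;
    // Insert one space at a line break unless whitespace is already there.
    if (startsNewLine && !/\s$/.test(content) && !/^\s/.test(item.str)) {
      content += " ";
    }
    content += item.str;
    previousY = item.y;
  }
  return content;
}

const pageContent = buildPageContent(items);
console.log(pageContent); // "and the trace is"
console.log(pageContent.includes("andt")); // false
```

Comparing positions exactly is a simplification; real documents would likely need a tolerance for superscripts, subscripts, and slightly misaligned runs.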

I would love to try and help fix this myself, but I looked into the code that builds this object and I don't think I have enough knowledge of PDF.JS's rendering mechanism to be able to help with this... However, I hope my add-on can somehow help with debugging this, since it can show the contents of this object in an easy and direct way without having to manually search all over the pdf pages to know if the object's text contents are accurate.

@piotrex
Contributor

piotrex commented Oct 26, 2013

Changing the block <div> elements to inline <span> elements in the text layer, as per #2989, will probably improve this. (At the very least, text copied across lines should then contain spaces.)

@jazzy-em
Contributor

jazzy-em commented Feb 5, 2015

+1
The problem still occurs.
informationis
Search results are incorrect because of this problem.
Will this issue be considered?

@jazzy-em
Contributor

A similar problem is solved in #5783.

@murilo-ramos

I know this is an old thread, but does anyone know if this issue was solved?
I'm seeing the same behavior, where the search finds 'words' across new lines. I tested with versions 1.9.426 (stable) and 2.0.550 (pre-release), and the issue persists in both.

@calixteman
Contributor

It's partially fixed thanks to af4dc55, which helps to detect end-of-line (EOL).
PR #13261 should definitively fix the problem.
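
For illustration only, here is one way an end-of-line signal could be used by a search, assuming each text run carries a boolean `hasEOL` flag next to its `str`. The item shape and both helpers are assumptions made for this sketch; it does not reproduce what af4dc55 or #13261 actually do.

```javascript
// Hypothetical text runs: `hasEOL` marks a run that ends its line (assumed shape).
const items = [
  { str: "In", hasEOL: true },
  { str: "a typical", hasEOL: false },
  { str: " program", hasEOL: true },
];

// Keep line breaks in the searchable text instead of dropping them.
function joinWithLineBreaks(textItems) {
  return textItems.map((item) => item.str + (item.hasEOL ? "\n" : "")).join("");
}

// Return the start index of every occurrence of `query` in `text`.
function findAll(text, query) {
  if (query === "") {
    return [];
  }
  const matches = [];
  let index = text.indexOf(query);
  while (index !== -1) {
    matches.push(index);
    index = text.indexOf(query, index + 1);
  }
  return matches;
}

const text = joinWithLineBreaks(items);
console.log(findAll(text, "na")); // []
console.log(findAll(text.replace(/\n/g, " "), "In a")); // [0]
```

Keeping the break in the text stops "na" from matching across lines, while mapping each break to a space still lets a phrase such as "In a" be found across the line boundary.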
