Search matches characters across new lines #2806

xavier114fch · 2013-02-26T01:57:20Z

Just observing this with the default tracemonkey paper.

Use the Find tool and enter "na" in the search box. Check the "Highlight all" box.
Scroll down to page 2 of the paper, and see the 3rd paragraph. It highlights the "na" in "In a", which is across a new line.
Change the search query to "andt" and you can see it highlights "andt" in "and the".

It seems some check is needed to ensure that search does not return such results.
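
The behavior reported above can be reproduced outside the viewer with a minimal sketch. It assumes, without confirmation from the report itself, that the searchable page text is built by concatenating each line's text with no separator. The sample lines are made up, and `joinWithoutSeparators` and `findAll` are hypothetical helpers, not PDF.js APIs.

```javascript
// Made-up lines of one paragraph, as they might sit on separate rows of a page.
const lines = ["In", "a typical loop, and", "the trace is recorded"];

// Assumption: the searchable string is built by joining lines with nothing in between.
function joinWithoutSeparators(textLines) {
  return textLines.join("");
}

// Return the start index of every case-insensitive occurrence of `query` in `text`.
function findAll(text, query) {
  const haystack = text.toLowerCase();
  const needle = query.toLowerCase();
  const matches = [];
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    matches.push(index);
    index = haystack.indexOf(needle, index + 1);
  }
  return matches;
}

const pageText = joinWithoutSeparators(lines);
console.log(pageText);                   // the line break between "In" and "a" is gone
console.log(findAll(pageText, "na"));    // non-empty: "na" is found inside "Ina"
console.log(findAll(pageText, "andt"));  // non-empty: "andt" is found inside "andthe"
```

Joining the same lines with a single space instead (`lines.join(" ")`) makes both queries return no matches in this sample.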

Quicksaver · 2013-10-18T11:53:34Z

That is not the only case where the search returns wrong matches or fails to find the correct matches.

The PDFFindController.pageContents object, which PDF.js's find method uses to return search matches, often doesn't exactly reflect a page's actual text content.

In my add-on, FindBar Tweak, I use the same object in the same way to retrieve find results from pdf documents. Through its Find in All Tabs feature, the limitations of this object are very apparent:

In every PDF page there are no newline characters, ever!
Space characters are often omitted from this object, which greatly affects search results. http://www.selab.isti.cnr.it/ws-mate/example.pdf is especially problematic: there are absolutely no spaces in this specific document!

At the very least, the space characters should more closely reflect the actual document. Newline characters aren't as critical, since they never appear in a search query anyway; in my opinion, a single space character in place of each newline would be a close enough solution.
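
The suggestion above, a single space wherever a line ends, can be sketched roughly as follows. The item shape `{ str, hasEOL }` is an assumption made for illustration and should be checked against the actual text-content format. `buildPageContent` is a hypothetical helper, not the code PDF.js uses to fill `pageContents`.

```javascript
// Hypothetical text items: `str` is a text run, `hasEOL` marks the last run on a line.
// This shape is an assumption for illustration only.
const items = [
  { str: "In", hasEOL: true },
  { str: "a typical loop, ", hasEOL: false },
  { str: "and", hasEOL: true },
  { str: "the trace is recorded", hasEOL: true },
];

// Build a searchable string, emitting one space in place of each line break.
function buildPageContent(textItems) {
  let joined = "";
  for (const item of textItems) {
    joined += item.str;
    if (item.hasEOL) {
      joined += " ";
    }
  }
  // Collapse whitespace runs so a trailing space followed by a line break counts once.
  return joined.replace(/\s+/g, " ").trim();
}

const content = buildPageContent(items);
console.log(content);
console.log(content.includes("na"));       // false: "In a" now keeps its space
console.log(content.includes("andt"));     // false: "and the" now keeps its space
console.log(content.includes("and the"));  // true: the phrase still matches across the line break
```

One limitation of this approach: a word hyphenated across two lines would gain a spurious space, so it trades one class of wrong matches for a smaller one.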

I would love to try to help fix this myself, but I looked into the code that builds this object and I don't think I know enough about PDF.js's rendering mechanism to be able to help. However, I hope my add-on can help with debugging, since it shows the contents of this object easily and directly, without having to manually search all over the PDF pages to check whether the object's text contents are accurate.

piotrex · 2013-10-26T22:10:31Z

Changing the block <div> elements to inline <span> elements in the text layer, as per #2989, will probably improve this. (At least, text copied across lines should then contain spaces.)

jazzy-em · 2015-02-05T11:59:05Z

+1
The problem still exists.

Search results are incorrect because of it.
Will this issue be considered?

jazzy-em · 2015-03-11T10:58:50Z

A similar problem is solved in #5783.

murilo-ramos · 2018-08-17T14:53:15Z

I know this is an old thread, but does anyone know whether this issue was solved?
I'm seeing the same behavior, where the search finds 'words' across new lines. I tested with versions 1.9.426 (stable) and 2.0.550 (pre-release), and the issue persists.

calixteman · 2021-05-02T10:15:54Z

It's partially fixed thanks to af4dc55, which helps to detect EOL.
PR #13261 should definitely fix the problem.

Quicksaver mentioned this issue May 30, 2013

Some improvments Find in All Tabs Quicksaver/FindBar-Tweak#33

Closed

Snuffleupagus linked a pull request Feb 3, 2022 that will close this issue

[api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) #13261

Merged

calixteman closed this as completed in #13261 Feb 3, 2022
