OCR text is not searchable by multiple-word phrase inside pdf viewer #57

chris001 · 2017-02-07T00:43:07Z

Inside a PDF viewer (Acrobat Reader, or pdf.js in the browser), you cannot search for a phrase of multiple words. The phrase matches nothing, even when it is in the document.

Bug report / Feature request

Expected Behavior

When the document contains, for example, "Breakfast menu", and you click the search icon (magnifying glass) and enter the text "breakfast menu", the search should find the matching text.

Current Behavior

It only matches one word. For example, it matches "breakfast", or it matches "menu". If you search for the two words together, it finds no match, even when the two words are clearly next to each other on the same line in the document.

Possible Solution

Possibly take a look at the parameters or settings for tesseract-ocr and see whether it can be made to join words that are on the same line into one continuous line of text.
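To illustrate the idea, here is a minimal Python sketch of what "connecting words on the same line" could look like. It is not how tesseract-ocr or ocrmypdf work internally: the `Word` type, the `group_into_lines` and `phrase_found` helpers, and the `y_tolerance` value are all hypothetical names made up for this example. It only shows why a phrase search fails against separate one-word text elements and succeeds once the words are joined with spaces.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its position on the page (hypothetical model)."""
    text: str
    x: float  # left edge of the word
    y: float  # baseline of the word


def group_into_lines(words, y_tolerance=3.0):
    """Join words whose baselines lie within y_tolerance into one text line."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # Compare against the first word of the current line.
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and separate them by spaces.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


def phrase_found(text_elements, phrase):
    """Case-insensitive phrase search over a list of text elements."""
    needle = " ".join(phrase.lower().split())
    return any(needle in element.lower() for element in text_elements)


words = [
    Word("menu", 120.0, 50.5),
    Word("Breakfast", 20.0, 50.0),
    Word("Coffee", 20.0, 80.0),
]

# Each word as its own text element: the phrase never matches.
separate = [w.text for w in words]
print(phrase_found(separate, "breakfast menu"))

# Words joined per line: the phrase matches.
joined = group_into_lines(words)
print(joined)
print(phrase_found(joined, "breakfast menu"))
```

Whether a given viewer matches a phrase also depends on how that viewer reconstructs text from the PDF, so this models only the text layer, not the viewer's search.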

Steps to Reproduce (for bugs)

Upload a scan of a page of text in PDF format.
Run the OCR on it.
Open the _OCR.pdf version of the PDF file, which contains the recognized text.
Click the magnifying glass and enter two adjacent words from the same line. The search fails to find the two words together. It finds only one word at a time.

Context

Searching for only one word at a time is awkward and time-consuming.

Your Environment

OCR version used: Latest
Browser Name and version: Latest Firefox.
Operating System and version (desktop or mobile): Windows 10, Linux Debian 8.
ownCloud/nextcloud version: (see ownCloud admin page or version.php) Latest NC.
PHP version: 7.0
Database version: 5.6 (MySQL/MariaDB)
Are you using encryption: No.

Log File Content (nextcloud/owncloud.log of the "data"-directory)


janis91 · 2017-02-07T20:57:09Z

Actually, I hadn't noticed this before. The problem is that this is how ocrmypdf works, and I can't change that behavior. Maybe you can head over to the ocrmypdf GitHub issues and ask whether there is another solution. But I assume it won't be possible as long as ocrmypdf does not put the text elements together in one text box behind the image in the PDF.

As this isn't a bug and ocrmypdf behaves like this, I will close this issue.

janis91 self-assigned this Feb 7, 2017

janis91 closed this as completed Feb 7, 2017
