OCR text is not searchable by multiple-word phrase inside pdf viewer #57

Closed
chris001 opened this issue Feb 7, 2017 · 1 comment

chris001 commented Feb 7, 2017

In a PDF viewer (Acrobat Reader, or pdf.js in the browser), you cannot search for a multi-word phrase. The phrase matches nothing even when it appears in the document.

Bug report / Feature request

Expected Behavior

When the document contains, for example, "Breakfast menu", clicking the search icon (magnifying glass) and entering "breakfast menu" should find the text.

Current Behavior

It only matches one word. For example, it matches "breakfast", or it matches "menu". If you search for two words, it fails to find a match, even when the two words are clearly next to each other on the same line in the document!

Possible Solution

Possibly take a look at the parameters or settings for tesseract-ocr and see whether it can be made to join words that are on the same line into one continuous line of text.
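
To illustrate the idea only: the following is a minimal Python sketch of joining per-word results into lines. It is not how tesseract-ocr or ocrmypdf work internally, and it uses none of their APIs. The `Word` type, its `x`/`y` fields, and the `y_tolerance` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR result for one word (not a tesseract type)."""
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical position of the word's baseline


def join_words_into_lines(words, y_tolerance=2.0):
    """Group words with nearly equal baselines, then join each group
    left to right with single spaces."""
    lines = []  # each entry is a list of Word objects on one line
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # Compare against the first word of the most recent line.
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


words = [
    Word("menu", 120.0, 50.5),
    Word("Breakfast", 40.0, 50.0),
    Word("Coffee", 40.0, 80.0),
]
lines = join_words_into_lines(words)
```

With the words stored as one string per line, a search for "breakfast menu" has a continuous piece of text to match against.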

Steps to Reproduce (for bugs)

  1. Upload a scan of a page of text in PDF format.
  2. Run OCR on it.
  3. Open the _OCR.pdf version of the file, which contains the recognized text.
  4. Click the magnifying glass and enter two adjacent words from the same line. The search fails to find the two words together. It finds only one word at a time.

Context

Searching for only one word at a time is awkward and time consuming.

Your Environment

  • OCR version used: Latest
  • Browser name and version: Latest Firefox
  • Operating system and version (desktop or mobile): Windows 10, Linux Debian 8
  • ownCloud/Nextcloud version: Latest Nextcloud
  • PHP version: 7.0
  • Database version: 5.6, MySQL/MariaDB
  • Are you using encryption: No

Log File Content (nextcloud/owncloud.log of the "data"-directory)

janis91 self-assigned this Feb 7, 2017

janis91 (Owner) commented Feb 7, 2017

Actually, I hadn't noticed this before. But the problem is that this is how ocrmypdf works, and I can't change this behavior. Maybe you can head over to the ocrmypdf GitHub issues and ask whether there is any other solution for this. But I assume it won't be possible as long as ocrmypdf does not put the text elements together in one text box behind the page image in the PDF.
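
A rough sketch of why separate text elements would break phrase search, under the simplifying assumption that a viewer only matches within a single text element. Real viewers such as Acrobat Reader or pdf.js have their own text extraction logic, which this does not reproduce; `phrase_found` and the sample lists are hypothetical.

```python
def phrase_found(text_elements, phrase):
    """Return True if the phrase occurs inside any single text element.

    Assumption: matching never spans two elements, which is a
    simplified model and not the behavior of any specific viewer.
    """
    needle = phrase.lower()
    return any(needle in element.lower() for element in text_elements)


# One text element per word, as described for the current output.
per_word = ["Breakfast", "menu"]
# One text element per line, as the issue requests.
per_line = ["Breakfast menu"]

single_word_hit = phrase_found(per_word, "breakfast")
phrase_hit_per_word = phrase_found(per_word, "breakfast menu")
phrase_hit_per_line = phrase_found(per_line, "breakfast menu")
```

In this model each single word is found, but the two-word phrase matches only when both words live in the same text element.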

As this isn't a bug and ocrmypdf behaves like this, I will close this issue.

janis91 closed this as completed Feb 7, 2017