Auto correct image rotation (-180, -90, 0, +90) #4

Closed
OCRmyPDF-issuebot opened this issue Sep 14, 2015 · 5 comments
@OCRmyPDF-issuebot

Issue by fritz-hh
Wed Jan 8 22:05:16 2014
Originally opened as fritz-hh/OCRmyPDF#46


@OCRmyPDF-issuebot
Author

Comment by fritz-hh
Sun Jan 12 19:21:54 2014


It seems that orientation detection will be supported in the next version of the Tesseract command-line interface:
http://code.google.com/p/tesseract-ocr/issues/detail?id=955

@OCRmyPDF-issuebot
Author

Comment by eloops
Wed Dec 10 12:53:52 2014


I have been testing with v3.04 (compiled from git source). With -psm 0 it reports the orientation along with a confidence and an integer, but that means you have to run tesseract-ocr over the page twice (first for orientation, then for OCR).

In -psm 1 mode it adds a 'textangle ###' attribute to the tags in the hocr file, so at the moment I am using the following to detect the rotation and correct it, after hocrTransform.py is called:

# Code removed

$curOCRedPDFRotated translates to a *.ocred.rotated.pdf file so should still be caught by the gs concatenation.

Unfortunately this doesn't work: if I rotate the image after OCR (and orientation detection) but before calling hocrTransform.py, the image is not rotated correctly (it retains its original dimensions) and the OCR'ed text is overlaid sideways.

If I rotate the image after the PDF is generated, it doesn't rotate correctly and/or the OCR'ed text is correct but not laid out correctly.

So it looks like the only way to do it properly is to call tesseract-ocr twice: once to determine the orientation (rotating the image if necessary), and a second time to perform the OCR.
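
For anyone following along, here is a minimal Python sketch of the decision step between the two passes. It is not the removed code and not what OCRmyPDF does. The sample text and its field names ("Orientation in degrees", "Orientation confidence") are assumptions about what the OSD pass prints and may differ between Tesseract versions. The function names, the confidence threshold, and the direction convention (the reported value is how far the page is currently rotated) are hypothetical too, so check them against your own build.

```python
import re

# Hypothetical sample of OSD-style output. The field names are an assumption
# and were not captured from a real Tesseract run.
SAMPLE_OSD = """\
Orientation in degrees: 90
Orientation confidence: 15.33
"""


def parse_osd(text):
    """Return (degrees, confidence) parsed from OSD-style text, or None."""
    deg = re.search(r"Orientation in degrees:\s*(\d+)", text)
    conf = re.search(r"Orientation confidence:\s*([\d.]+)", text)
    if deg is None or conf is None:
        return None
    return int(deg.group(1)), float(conf.group(1))


def correction_angle(degrees, confidence, min_confidence=2.0):
    """Angle (0, 90, 180 or 270) to rotate the image by before the OCR pass.

    Assumes `degrees` is how far the page is currently rotated, so the
    correction is the inverse rotation. Below the (arbitrary) confidence
    threshold the page is left untouched.
    """
    if confidence < min_confidence:
        return 0
    return (360 - degrees) % 360
```

The returned angle would then be handed to whatever tool rotates the image before the second tesseract run.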

Edit:
Removed code. It really doesn't work. I kludged up an extra step that runs tesseract in -psm 0 mode over the .pnm file and then has convert rotate the image before passing it back to tesseract for OCR. (I use GraphicsMagick convert; I'll test both that and econvert to see what the speed difference is.) I don't think the extra pass added much overhead, although it would be nice to only have to do one pass.

@OCRmyPDF-issuebot
Author

Comment by eloops
Tue Sep 8 15:06:41 2015


I ported this to a Node library (here); part of that was implementing auto-rotation. I added a prototype that finds the page's overall rotation by taking the most frequent textangle value in the hOCR. Also, by walking the DOM to the <span> elements with class ocr_line and reading their textangle, I could correct the orientation when writing the words to the canvas. I'm still not sure whether it's faster to run a separate -psm 0 (OSD only) pass followed by -psm 6 for the OCR text, or a single -psm 1 pass (get everything).
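
The "most frequent textangle" idea can be sketched in Python roughly as follows. This is an illustration only, not the Node library's code. The function name and sample markup are hypothetical. It assumes each ocr_line <span> carries its properties in the title attribute, that class appears before title, and that a line without a textangle entry is upright (angle 0).

```python
import re
from collections import Counter

# Matches an ocr_line span and captures the contents of its title attribute.
# Assumes the class attribute appears before the title attribute.
LINE_RE = re.compile(
    r"""<span[^>]*?class=['"]ocr_line['"][^>]*?title=(['"])(.*?)\1""",
    re.DOTALL,
)
ANGLE_RE = re.compile(r"textangle\s+(-?\d+(?:\.\d+)?)")


def dominant_textangle(hocr):
    """Return the textangle shared by the most ocr_line spans (0.0 if none)."""
    counts = Counter()
    for match in LINE_RE.finditer(hocr):
        angle = ANGLE_RE.search(match.group(2))
        # Lines with no textangle entry are counted as upright.
        counts[float(angle.group(1)) if angle else 0.0] += 1
    if not counts:
        return 0.0
    return counts.most_common(1)[0][0]


# Hypothetical hOCR fragment: two rotated lines and one upright line.
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title="bbox 10 10 50 400; textangle 90">a</span>
<span class='ocr_line' id='line_1_2' title="bbox 60 10 100 400; textangle 90">b</span>
<span class='ocr_line' id='line_1_3' title="bbox 10 500 400 540">c</span>
"""
```

Counting upright lines as 0 keeps a page with a few rotated captions from being treated as rotated overall. A real implementation would probably use a proper HTML parser rather than regular expressions.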

@jbarlow83
Collaborator

Implemented in v4

@jbarlow83
Collaborator

@eloops - fyi now implemented
