Auto correct image rotation (-180, -90, 0, +90) #4

OCRmyPDF-issuebot · 2015-09-14T01:15:32Z

Issue by fritz-hh
Wed Jan 8 22:05:16 2014
Originally opened as fritz-hh/OCRmyPDF#46

OCRmyPDF-issuebot · 2015-09-14T01:15:32Z

Comment by fritz-hh
Sun Jan 12 19:21:54 2014

It seems that orientation detection will be supported in the next version of the Tesseract command-line interface:
http://code.google.com/p/tesseract-ocr/issues/detail?id=955

OCRmyPDF-issuebot · 2015-09-14T01:15:33Z

Comment by eloops
Wed Dec 10 12:53:52 2014

I have been testing with Tesseract v3.04 (compiled from git source). With -psm 0 it reports the orientation along with a confidence value and an integer, but that means running tesseract-ocr over each page twice (first for orientation, then for OCR).

In -psm 1 mode it adds a 'textangle ###' attribute to the tags in the hocr file, so at the moment I am using the following to detect the rotation and correct it, after hocrTransform.py is called:

# Code removed

$curOCRedPDFRotated translates to a *.ocred.rotated.pdf file so should still be caught by the gs concatenation.

Unfortunately this doesn't work. If I rotate the image after OCR (and orientation detection) but before calling hocrTransform.py, the image is not rotated correctly (it retains its original dimensions) and the OCR'ed text is overlaid sideways.

If I rotate the image after the PDF is generated, either the image doesn't rotate correctly, or the OCR'ed text is correct but not laid out correctly.

So it looks like the only way to do it properly is to call tesseract-ocr twice: once to determine the orientation, then, after rotating the image if necessary, a second time to perform the OCR.

Edit:
Removed code. It really doesn't work. I kludged up an extra step that runs tesseract in -psm 0 mode over the .pnm file and then uses convert to rotate the image before passing it back to tesseract for OCR'ing. (I use GraphicsMagick's convert; I'll test both that and ImageMagick's, and also econvert, to see what speed difference there is.) I don't think the second pass added much to it, although it would be nice to only have to do one pass.
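
For anyone following along, the two-pass flow described above could be sketched roughly as below. This is not the removed code. It assumes the OSD pass prints a line of the form "Orientation in degrees: N" to stdout or stderr, which varies between Tesseract versions, and that GraphicsMagick is installed as `gm`. The names `parse_osd_degrees` and `ocr_with_rotation` are hypothetical.

```python
import re
import subprocess

# Assumed format of the OSD report line; check it against your Tesseract build.
OSD_DEGREES_RE = re.compile(r"Orientation in degrees:\s*(\d+)")


def parse_osd_degrees(osd_text):
    """Return the reported orientation in degrees, or None if it is absent."""
    match = OSD_DEGREES_RE.search(osd_text)
    return int(match.group(1)) % 360 if match else None


def ocr_with_rotation(image, rotated_image, output_base):
    """Pass 1: detect orientation. Rotate. Pass 2: OCR the rotated image."""
    # -psm 0 is the OSD-only mode used in this thread (newer builds spell it --psm).
    osd = subprocess.run(
        ["tesseract", image, "stdout", "-psm", "0"],
        capture_output=True,
        text=True,
    )
    # Fall back to "no rotation" when nothing could be parsed.
    degrees = parse_osd_degrees(osd.stdout + osd.stderr) or 0

    # Assumption: the angle can be applied as reported. Depending on the
    # Tesseract version it may need to be negated; verify on a known sample.
    subprocess.run(
        ["gm", "convert", image, "-rotate", str(degrees), rotated_image],
        check=True,
    )

    # Second pass: regular OCR with hOCR output on the corrected image.
    subprocess.run(["tesseract", rotated_image, output_base, "hocr"], check=True)
    return degrees
```

Only `parse_osd_degrees` is independent of the installed tools, so it is the part worth testing first against the actual output of your Tesseract build.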

OCRmyPDF-issuebot · 2015-09-14T01:15:34Z

Comment by eloops
Tue Sep 8 15:06:41 2015

I ported this to a Node library (here); part of that was implementing auto-rotation. I added a prototype that finds the general rotation (by finding the most frequent textangle in the hOCR). Also, by climbing up/down the DOM to the ocr_line class <span> elements and grabbing the textangle, I could correct the rotation when writing the words to the canvas. I'm still not sure whether it's faster to do a separate -psm 0 (OSD only) followed by -psm 6 for the OCR text, or just -psm 1 (get everything).
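
The "general rotation" idea can be sketched in Python as follows. This is not the Node prototype itself. It assumes each ocr_line element carries its angle in the title attribute as `textangle N`, and that a line without that property is unrotated. `LineAngleCollector` and `dominant_textangle` are hypothetical names.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Matches e.g. "textangle 90" inside an hOCR title attribute.
TEXTANGLE_RE = re.compile(r"textangle\s+(-?\d+(?:\.\d+)?)")


class LineAngleCollector(HTMLParser):
    """Collect one angle per element whose class list contains ocr_line."""

    def __init__(self):
        super().__init__()
        self.angles = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocr_line" not in (attributes.get("class") or "").split():
            return
        match = TEXTANGLE_RE.search(attributes.get("title") or "")
        # Assumption: a line with no textangle property is at 0 degrees.
        angle = round(float(match.group(1))) % 360 if match else 0
        self.angles.append(angle)


def dominant_textangle(hocr):
    """Return the most frequent line angle in the hOCR text (0 if no lines)."""
    collector = LineAngleCollector()
    collector.feed(hocr)
    if not collector.angles:
        return 0
    return Counter(collector.angles).most_common(1)[0][0]
```

Taking the most frequent angle rather than an average keeps a few stray vertical lines (margin notes, for example) from skewing the page-level result.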

jbarlow83 · 2016-02-17T09:25:53Z

Implemented in v4

jbarlow83 · 2016-02-17T09:26:16Z

@eloops - fyi now implemented

OCRmyPDF-issuebot added this to the v3.x milestone Sep 14, 2015

OCRmyPDF-issuebot added the enhancement label Sep 14, 2015

jbarlow83 closed this as completed Feb 17, 2016

Jmuccigr mentioned this issue Jul 1, 2016

2 columns only sometimes recognized #77

Closed
