Pagesegmode #40

tuxasus · 2016-01-09T17:05:11Z

Hey,

is there a way to define the pagesegmode for the tesseract OCR?
(https://tesseract-ocr.googlecode.com/git/doc/tesseract.1.html)

Thank you very much
tuxasus

jbarlow83 · 2016-01-10T02:22:02Z

No official way, but you can try (ab)using the --tesseract-config argument which forwards one argument at a time to tesseract.

e.g. for a single text line
ocrmypdf [other options] --tesseract-config '--psm' --tesseract-config '7'

I'm not sure if I'd implement this since most PDF images have a text page, not a line or word.
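
As an illustration of forwarding "one argument at a time", here is a minimal Python sketch of how repeated --tesseract-config values could end up as separate tokens on the tesseract command line. The function name build_tesseract_argv, the file names, and the position of the forwarded tokens are all assumptions made for this example. This is not OCRmyPDF's actual code.

```python
import shlex


def build_tesseract_argv(image, output_base, forwarded):
    """Return a flat argv list for a tesseract call.

    `forwarded` holds one string per use of --tesseract-config
    (hypothetical helper, for illustration only).
    """
    argv = ["tesseract", image, output_base]
    for token in forwarded:
        # Each forwarded item must be one plain string; a nested list
        # or other object cannot be passed on as a process argument.
        if not isinstance(token, str):
            raise TypeError("each forwarded token must be a single string")
        argv.append(token)
    return argv


# '--psm' and '7' arrive as two separate tokens, one per option use.
argv = build_tesseract_argv("page.png", "page", ["--psm", "7"])
print(shlex.join(argv))
```

Keeping the command as a flat list of strings, rather than one joined string, means no shell quoting is needed when it is handed to a process launcher.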

tuxasus · 2016-01-10T14:51:36Z

Hey, thanks for your help!
With the command above I get this error:
ocrmypdf: error: argument --tesseract-config: expected one argument
With the command
ocrmypdf [other options] --tesseract-config '--psm 4'
it generates a conversion error instead:

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.main.repair_pdf' 

    [{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}]
Completed Task = 'ocrmypdf.main.repair_pdf' 
Task enters queue = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.generate_postscript_stub' 
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.pdf, /tmp/com.github.ocrmypdf.muvt83ir/000002.ocr.page.pdf)
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.page.pdf, /tmp/com.github.ocrmypdf.muvt83ir/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.skip_page' 
Uptodate Task = 'ocrmypdf.main.skip_page'


WARNING:
        In Task 'ocrmypdf.main.skip_page':
        No jobs were run because no file names matched.
        Please make sure that the regular expression is correctly specified. 

    Rendering 000001.ocr.page.pdf with png16m
Completed Task = 'ocrmypdf.main.generate_postscript_stub' 
    Rendering 000002.ocr.page.pdf with pngmono
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.preprocess_deskew' 
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000001.pp-deskew.png)
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew' 
Task enters queue = 'ocrmypdf.main.preprocess_clean' 
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.muvt83ir/000001.pp-clean.png)
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.pp-deskew.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean' 
Task enters queue = 'ocrmypdf.main.select_image_for_pdf' 
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr' 
    os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.image)
Completed Task = 'ocrmypdf.main.select_image_for_pdf' 




Original exceptions:

    Exception #1
      'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
       Task = def ocrmypdf.main.ocr_tesseract_hocr(...):
       Job  = [.../com.github.ocrmypdf.muvt83ir/000001.pp-clean.png -> .../com.github.ocrmypdf.muvt83ir/000001.hocr, <ocrmypdf.main.WrappedLogger>, [{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/home/florian/Downloads/OCRmyPDF-3.1/ocrmypdf/main.py", line 560, in ocr_tesseract_hocr
        universal_newlines=True)
      File "/usr/lib/python3.4/subprocess.py", line 848, in __init__
        restore_signals, start_new_session)
      File "/usr/lib/python3.4/subprocess.py", line 1384, in _execute_child
        restore_signals, start_new_session, preexec_fn)
    TypeError: Can't convert 'list' object to str implicitly


    Exception #2
      'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
       Task = def ocrmypdf.main.ocr_tesseract_hocr(...):
       Job  = [.../com.github.ocrmypdf.muvt83ir/000002.pp-clean.png -> .../com.github.ocrmypdf.muvt83ir/000002.hocr, <ocrmypdf.main.WrappedLogger>, [{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/home/florian/Downloads/OCRmyPDF-3.1/ocrmypdf/main.py", line 560, in ocr_tesseract_hocr
        universal_newlines=True)
      File "/usr/lib/python3.4/subprocess.py", line 848, in __init__
        restore_signals, start_new_session)
      File "/usr/lib/python3.4/subprocess.py", line 1384, in _execute_child
        restore_signals, start_new_session, preexec_fn)
    TypeError: Can't convert 'list' object to str implicitly

It's not really about a one-line PDF or a one-word PDF. My problem is the automatic column detection, which ruins my OCR (the page is a mix of two-column and one-column text).
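
The "expected one argument" message has the format Python's argparse uses, and argparse treats a value that starts with a dash and contains no space as another option rather than as a value. The sketch below reproduces that behaviour with a stand-in parser. That OCRmyPDF 3.1 defines --tesseract-config with action="append" is an assumption, and the names make_parser and try_parse are hypothetical.

```python
import argparse


def make_parser():
    # Stand-in for the real parser; only the one option is modelled.
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    parser.add_argument("--tesseract-config", action="append", default=[])
    return parser


def try_parse(argv):
    """Return the collected values, or None if argparse rejects argv."""
    try:
        return make_parser().parse_args(argv).tesseract_config
    except SystemExit:
        # argparse prints its error to stderr and exits on bad input.
        return None


# '--psm' looks like an unknown option, so the preceding flag has no value.
separate = try_parse(["--tesseract-config", "--psm", "--tesseract-config", "7"])

# A value containing a space is accepted as a plain value.
with_space = try_parse(["--tesseract-config", "--psm 4"])

# The '=' form attaches the value explicitly, leading dash included.
with_equals = try_parse(["--tesseract-config=--psm", "--tesseract-config=7"])

print(separate, with_space, with_equals)
```

Under these assumptions, the '=' form is the one that gets a dash-prefixed token through the parser as a single forwarded argument. Whether the downstream tesseract call then accepts it is a separate question that this sketch does not cover.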

jbarlow83 · 2016-01-12T01:24:05Z

Implemented in commit 8d323ae.

jbarlow83 · 2016-02-05T08:42:46Z

Officially released in v3.2

jbarlow83 closed this as completed Feb 5, 2016
