hocr import / export #453

aalmir · 2019-11-12T12:03:50Z

Describe the issue
If you want to create a perfect OCR, 100% correct text, you need some editing function.
For example "gImageReader" gives some basic editing function (but has some other missing features).

Expected behavior
An option to export and import hOCR.
After export, I can make changes to the hOCR and import it for PDF creation.

adrianbroher · 2019-11-12T12:34:05Z

What a coincidence. I wanted to request the same feature today.

This feature request has some overlap with #177.

jbarlow83 · 2019-11-25T08:53:13Z

I think this would be great but there's a lot to do to make it work, especially to support after the fact editing.

tukusejssirs · 2021-01-12T19:32:42Z

All I would need is an option to merge an hOCR HTML file with a PDF file.

On the Internet, I read about hocr2pdf (from ExactImage), which seems not to be developed anymore (and I can’t build it on Fedora 31 because the libagg library, AntiGrain Geometry, is missing).

I also found hocr-pdf from hocr-tools, which I could install from the Fedora repos, but it does not work (it outputs an empty file with only some kind of header).

Finally, gImageReader does output a PDF with hOCR data, but it does not work with 800+ pages (200+ MiB): it creates a corrupted/damaged PDF (at least that’s what Evince and Adobe Acrobat Reader say).

aalmir · 2021-01-12T19:39:26Z

Have you reported this to the gImageReader repo?

jbarlow83 · 2021-01-12T19:49:17Z

ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs. What it does not have is a convenient way to run its post-processing on a set of edited hOCR files.

tukusejssirs · 2021-01-12T20:10:12Z

> ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs.

Thanks, @jbarlow83, for the reply. I haven’t ever used ocrmypdf; could you guide me on how to accomplish this, please? 😃

What I want is to merge an hOCR HTML file (generated by gImageReader and tesseract, with manual correction via gIR and a text editor) into a working PDF file merged by ImageMagick. Preferably in the terminal; how exactly does not matter. 😃

@aalmir, not exactly. I reported it in this comment and marginally in manisandro/gImageReader#480.

jbarlow83 · 2021-01-12T20:19:47Z

@tukusejssirs The relevant code is in hocrtransform.py. See python -m ocrmypdf.hocrtransform --help.

tukusejssirs · 2021-01-12T20:43:37Z

@jbarlow83, thanks! Is there any way to input either multiple images or a pre-created PDF file? hocrtransform.py has an -i option, but it looks like it accepts only one image file; however, I have 800+ images (or pages in a PDF) and a single hOCR file containing all the text data (note: not all pages have text; some are blank and are therefore not mentioned in the hOCR file).

@aalmir, the issue is here → manisandro/gImageReader#486.

jbarlow83 · 2021-01-12T20:56:42Z

No, it doesn't have that ability, but you could split the hOCR and run a loop.
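The "split the hOCR" step could look roughly like the sketch below. It assumes each page is an element with class="ocr_page" directly under <body>, as in Tesseract-style hOCR; the helper names (local_name, find_body, split_hocr_pages) are made up for illustration and are not part of ocrmypdf. Pages that are absent from the hOCR (blank pages) simply produce no output here, so matching each result to its image, for example via the page id, is left to the caller.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix.
ET.register_namespace("", XHTML_NS)


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so plain and XHTML tags both match."""
    return tag.rsplit("}", 1)[-1]


def find_body(root: ET.Element) -> ET.Element:
    """Return the <body> child of the document root."""
    return next(el for el in root if local_name(el.tag) == "body")


def split_hocr_pages(hocr_text: str) -> list[str]:
    """Return one complete hOCR document per ocr_page element."""
    root = ET.fromstring(hocr_text)
    pages = [el for el in find_body(root) if el.get("class") == "ocr_page"]
    documents = []
    for page in pages:
        single = copy.deepcopy(root)      # keeps the <head> metadata
        body = find_body(single)
        for child in list(body):          # drop every page ...
            body.remove(child)
        body.append(copy.deepcopy(page))  # ... then add back just this one
        documents.append(ET.tostring(single, encoding="unicode"))
    return documents


# Tiny stand-in for a multi-page hOCR file; page 2 is "blank" and missing.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>demo</title></head>
<body>
<div class="ocr_page" id="page_1" title="bbox 0 0 100 100"><span class="ocrx_word">first</span></div>
<div class="ocr_page" id="page_3" title="bbox 0 0 100 100"><span class="ocrx_word">third</span></div>
</body>
</html>"""

per_page = split_hocr_pages(SAMPLE)
for number, document in enumerate(per_page, start=1):
    print(number, len(document), "characters")
```

Each returned string could then be written to its own file and passed, together with the matching image, to one python -m ocrmypdf.hocrtransform call per page.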

tukusejssirs · 2021-01-15T12:48:15Z

@jbarlow83, I have just tried to merge hOCR data and an image into a PDF, but it failed (see below). Am I missing some kind of module or something?

Here are the test files.

Note that although I want to output the interword spaces, I don’t want them where I have commented out the whitespace between spans, because I don’t want a space to be output after &shy; (soft hyphen). This is just a hack, as I could not find a better way to deal with hyphens.

python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf

/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Traceback (most recent call last):
  File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 382, in <module>
    hocr = HocrTransform(args.hocrfile, args.resolution)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 68, in __init__
    self.hocr = ElementTree.parse(hocrFileName)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity: line 45, column 113

jbarlow83 · 2021-01-15T19:34:58Z

It looks like the XML (024_hocr.html) is invalid, specifically at line 45.

tukusejssirs · 2021-01-15T20:21:48Z

@jbarlow83, do I understand correctly that you are saying a soft hyphen (&shy;) is invalid in an hOCR file? As I understand it, the hOCR specification 1.2 says otherwise:

> 5.6. Hyphenation
>
> Soft hyphens must be represented using the HTML &shy; entity.

jbarlow83 · 2021-01-15T20:38:12Z

ocrmypdf.hocrtransform is only capable of parsing the subset of hOCR generated by Tesseract.

For this specific case, you'll need to add a string like the following to the top of the hOCR file

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY shy '#x00AD'>
            ]>

Based on:
https://stackoverflow.com/questions/35591478/how-to-parse-html-with-entities-such-as-nbsp-using-builtin-library-elementtree

And U+00AD being the Unicode code point for soft hyphen.

jbarlow83 · 2021-01-15T20:39:28Z

(Note that doctype signature may actually be incorrect for hOCR; whatever the hOCR spec says is correct should be used.)
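An alternative to declaring entities in the DOCTYPE is to rewrite named HTML entities as numeric character references before the file reaches the XML parser. The sketch below is only an illustration of that idea: named_to_numeric is a hypothetical helper, not something ocrmypdf provides, and it relies on the standard library's html.entities.name2codepoint table. The five entities that XML itself predefines are left untouched.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities every XML parser already understands; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(markup: str) -> str:
    """Replace HTML named entities (&shy;) with numeric ones (&#xAD;)."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-native: keep as written
        return f"&#x{name2codepoint[name]:X};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


sample = "<p>co&shy;operate&thinsp;now &amp; later</p>"
converted = named_to_numeric(sample)
root = ET.fromstring(converted)  # parses without any DOCTYPE

print(converted)
print([f"U+{ord(ch):04X}" for ch in root.text if ord(ch) > 127])
```

The converted text would be saved to a new file and that file passed to hocrtransform, so the original hOCR stays unchanged.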

tukusejssirs · 2021-01-15T20:59:43Z

Thanks, @jbarlow83, that sort of worked.

I changed the doctype in the example document to the following:

<!DOCTYPE html [
	<!ENTITY shy '#x00AD'>
	<!ENTITY thinsp '#x2009'>
]>

I ran your script, which succeeded with two warnings (see below).

python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf

/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py:109: DeprecationWarning: This method will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.
  for child in element.getchildren():

After opening merged.pdf in a PDF reader, the hOCR text is there; however, the ENTITY definitions are inserted into the text literally, e.g. &shy; is replaced with #x00AD (6 characters) instead of a real soft hyphen.
Based on this comment to that SO question, I’ve even tried to change the definition to <!ENTITY shy '&shy;'>, but it failed the same way. Same with the doctype you used in your example.
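The literal "#x00AD" in the output is consistent with how XML entity values work: the quoted value is substituted as-is, so it only becomes a character if it is itself written as a character reference (with the leading & and trailing ;). A minimal sketch with the standard-library parser, using a made-up one-element document rather than a real hOCR file:

```python
import xml.etree.ElementTree as ET


def parse_with_entity(entity_value: str) -> str:
    """Parse a tiny document whose &shy; entity expands to entity_value."""
    doc = (
        f'<!DOCTYPE html [<!ENTITY shy "{entity_value}">]>'
        "<test>co&shy;op</test>"
    )
    return ET.fromstring(doc).text


literal = parse_with_entity("#x00AD")      # plain text, no & and ;
char_ref = parse_with_entity("&#x00AD;")   # a real character reference

print(repr(literal))   # the six characters "#x00AD" appear in the text
print(repr(char_ref))  # a single U+00AD soft hyphen appears instead
```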

jbarlow83 · 2021-01-15T21:11:39Z

Official definition is

<!ENTITY shy    CDATA "&#173;" -- soft hyphen = discretionary hyphen,
                                  U+00AD ISOnum -->

From: https://www.w3.org/TR/html4/sgml/entities.html

tukusejssirs · 2021-01-15T21:29:39Z

Thanks again, @jbarlow83, but CDATA causes an error using ocrmypdf.hocrtransform (same command as above).

Change the doctype to the following (the doctype itself does not matter):

<!DOCTYPE html [
	<!ENTITY shy    CDATA "&#173;" -- soft hyphen = discretionary hyphen, U+00AD ISOnum -->
	<!ENTITY thinsp "&#2009;">
]>

Run the ocrmypdf.hocrtransform command (same as above). This produces an error at the L2C17 character (the 2nd space before CDATA in the shy entity definition).

python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf

/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Traceback (most recent call last):
  File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 382, in <module>
    hocr = HocrTransform(args.hocrfile, args.resolution)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 68, in __init__
    self.hocr = ElementTree.parse(hocrFileName)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: syntax error: line 2, column 17

jbarlow83 · 2021-01-15T21:50:06Z

Remove the -- and the comments after it.

jbarlow83 · 2021-01-15T22:56:01Z

<!DOCTYPE html [
	<!ENTITY shy    CDATA "&#173;">
	<!ENTITY thinsp "&#2009;">
]>

tukusejssirs · 2021-01-15T23:10:58Z

@jbarlow83, I’ve already tried that; same result as I posted in my last comment.

It is quite interesting to me that your parser (or whatever reports that error) says the error occurs on a space (the second one before CDATA). Could there be an error in the parser?

jbarlow83 · 2021-01-15T23:32:05Z

The parser is just Python's widely used XML parser from the standard library (xml.etree.ElementTree). It's saying your modified input file is not valid XML anymore so it won't parse. It seems to work if you remove the CDATA bit.

import xml.etree.ElementTree as ET
x = ET.fromstring(
    """
    <!DOCTYPE html [
	<!ENTITY shy     "&#173;">
	<!ENTITY thinsp "&#2009;">
    ]>
<test>&shy;</test>"""
)
assert x.text == '\xad'

tukusejssirs · 2021-01-16T13:06:49Z

@jbarlow83, thanks, that helped for soft hyphens, but not for thin spaces. Thin spaces are replaced with ■ (U+25A0) for some reason. However, when I use your Python code (modified a bit; see below), it outputs U+07D9 (ߙ; Nko Letter Ra) for some reason. Note that I don’t code in Python at all.

test_html_entity.py

#!/bin/python

"""
Test HTML entities in ElementTree
"""
import xml.etree.ElementTree as ET
x = ET.fromstring(
	"""
	<!DOCTYPE html [
		<!ENTITY shy     "&#173;">
		<!ENTITY thinsp "&#2009;">
	]>
<test>&thinsp;</test>"""
)

if x.text == '\xe2\x80\x89':
	print('true')
else:
	print('false')


print('input  : [ ]', format(ord(' '), 'x'))
print('output : [' + x.text + ']', format(ord(x.text), 'x'))

Update: I’ve also tried to use a character reference for U+2009 in the entity definition instead of &#2009; (which is incorrect; my bad). Same result.
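The U+07D9 result follows from the difference between decimal and hexadecimal character references: &#2009; is decimal 2009, which is 0x7D9, while the thin space U+2009 is written &#x2009; (hex) or &#8201; (decimal). Separately, '\xe2\x80\x89' in a Python str is three characters (the UTF-8 bytes read as code points), so the comparison in the script above cannot match; '\u2009' is the one-character form. A small sketch, with entity names invented for the demonstration; it does not address why the PDF shows ■ for thin spaces.

```python
import xml.etree.ElementTree as ET

# Entity names below are arbitrary labels chosen for this demonstration.
DOC = """<!DOCTYPE html [
    <!ENTITY dec2009 "&#2009;">
    <!ENTITY hex2009 "&#x2009;">
    <!ENTITY dec8201 "&#8201;">
]>
<test><a>&dec2009;</a><b>&hex2009;</b><c>&dec8201;</c></test>"""

root = ET.fromstring(DOC)
code_points = {child.tag: ord(child.text) for child in root}
for tag, code_point in code_points.items():
    print(tag, f"U+{code_point:04X}")

# The UTF-8 encoding of U+2009 is three bytes, but the str is one character.
thin_space = "\u2009"
print(thin_space.encode("utf-8"), len(thin_space), len("\xe2\x80\x89"))
```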

jbarlow83 · 2021-01-17T21:58:51Z

What you're trying to do likely requires software development beyond this one issue. You'll need to find someone who can do that coding for you.

tukusejssirs · 2021-01-17T22:09:37Z

Okay, thanks anyway for the help. 😃

rmast · 2021-11-07T20:04:54Z

> If you want to create a perfect OCR, 100% correct text, you need some editing function. For example "gImageReader" gives some basic editing function (but has some other missing features).

I don't know whether gscan2pdf uses gImageReader for it, but gscan2pdf lets me visually edit the OCR'ed text, based on the hOCR, and shows the confidence visually with colors.

jaysonlarose · 2022-07-13T03:25:28Z

Apologies if this is already answered, but I've been digging around for a few days and haven't found a definitive answer for this:

What's the best/easiest/recommended way to

1. get OCRmyPDF to process something¹ up to the point where it's performed OCR but hasn't yet merged the OCR text into a PDF, and output the optimized artifacts as well as the OCR data
2. get OCRmyPDF to take the data created from the previous step and merge everything together into a finished PDF

In other words, use OCRmyPDF in a workflow where a human can come in and hand-correct (and quite possibly version control) the OCR data before performing the merge? I've been playing around with running something like img2pdf *.png | ocrmypdf - -k --tesseract-config hocr --deskew --clean ../output.pdf, but it outputs a lot of stuff that isn't really necessary to the workflow and tends to get quite large (6.1G worth of source files generates 55G worth of artifacts). That, and I'm not quite sure how to get OCRmyPDF to pick the process back up afterwards.

I'm fine with digging around in the code if necessary... in fact, that's probably where I'm going to be going after I finish writing this. I just figured I'd ask first in case someone has the answers close at hand.

Thanks,
--Jays

¹ I read somewhere that OCRmyPDF is happiest with the output of img2pdf <image files> |², but am happy to change that if it's not the case.
² My current workflow consists of scanning a document as .png files, collating them so that they sort ASCIIbetically, and then running img2pdf *.png | ocrmypdf - --deskew --clean output.pdf.

jbarlow83 · 2022-07-13T08:58:55Z

There are no great options for hand-correction, because I tend to see it as a problem that a command-line program is not well equipped to solve. I suppose I (or you) could insert a plugin hook that allows a custom renderer at src/ocrmypdf/_sync.py:216. Then you could intercept the .hocr file and make edits to it before proceeding with the standard hocrtransform. The pages would come at you out of order, because ocrmypdf processes pages in parallel.

rmast · 2022-07-13T09:53:55Z

You might be interested in the flow of GScan2pdf, which offers a scanned text-edit feature: https://images.pling.com/img/00/00/49/49/34/1230285/43c303619853d3cf57804eb721a0abac19d8.png However I don’t believe it supports PDF/A or MRC compression(plugin).

jaysonlarose · 2022-07-13T10:11:17Z

@rmast Checking it out now. It's going to be chewing on the 214 MiB 324-page PDF I just fed it for a while, but what it's showing while it processes each page is making me optimistic. Well, besides looking a little bit like a ransom letter, but I can understand that considering how each word seems to be in its own bounding box, and that's just kind of how things are.

jaysonlarose · 2022-07-13T11:11:07Z

GScan2pdf almost works. It gets up to saving page 35 of 324, and then it just sits there. The UI is still responsive, it claims to be saving, but it's just... sitting there..

jaysonlarose · 2022-07-15T05:18:04Z

@jbarlow83 I just tossed you $200USD via OpenCollective in hopes that I can shamelessly bribe you to add some method for hocr export/import.

I just spent the last couple of hours learning about postfix operators, and it's a rabbit hole I really don't want to have to go down.

jaysonlarose · 2022-07-15T05:18:53Z

See? I just said "postfix" when I meant "PDF".

jbarlow83 · 2022-07-18T07:59:24Z

@jaysonlarose I do appreciate the generous contribution and I'll try to think if there's something clever/efficient/reasonable for a CLI app.

rmast · 2022-10-11T08:17:46Z

I’m even surprised it reads an existing PDF. Usually I cut PDFs into loose pictures to process them, for example with pdfimages -tiff. As GScan2PDF will not apply JBIG2, there’s no need to put all of them in one PDF at the end to get one big dictionary; you’ll need more steps to get it compressed. PDFsam will split or merge PDFs for you.

lumalav · 2023-09-02T23:06:15Z

@jaysonlarose, I needed something exactly like what you are describing, so I forked the repo. In my version, ocrmypdf can be used in two additional workflows:

1. ONLY OCR is performed, with the ability to output the hOCR file, and no merge occurs at the end (--pdf-renderer hocr --ocr-only --hocr-out /some/path). NOTE: the OCR image is output as well, with the same name as the hOCR file but with the .png extension. Also, the hOCR file is modified in place after being copied, so that the source image it references is the one that was output (/html/body/div[@class='ocr_page']).
2. OCR is performed, but the merge is done using an hOCR file supplied as input, and the normal pipeline continues (--pdf-renderer hocr --hocr-in /some/path).

However, I don't think I'll be opening a pull request because of the way option 2 works. It needs to perform OCR again in case optimizations were done in the middle (--deskew or --rotation), so that the hOCR being passed in matches the optimizations and the OCR image generated at that moment. In theory, you need to pass the same set of options (--rotate-pages, --clean, --deskew) if you are planning to run these flows separately... Additionally, option 2 loses the sidecar, because the sidecar is created only if hOCR is generated in the regular flow or the --ocr-only flow, so in theory you should still have the original sidecar with you.

I'll be waiting for my 200 bucks :)

commits
fork

dansbandit · 2023-09-26T00:28:00Z

@jaysonlarose

I had the same problem as you. I wanted to edit the OCR before merging it into a PDF. This is what I did:

* ran ocrmypdf with `-k` and `--pdf-renderer hocr`
* edit hocr-files to your liking
* move all .png and .hocr to folder (make sure that they have exact same filename i.e. 00001.png and 00001.hocr)
* `git clone https://github.com/rescribe/bookpipeline`
* `cd bookpipeline/cmd/pdfbook`
* `go build pdfbook`
* `./pdfbook folder output.pdf`
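Since the steps above depend on every .png having an identically named .hocr, a quick check before running pdfbook may save a failed run. This is a minimal sketch using only the standard library; unmatched_pages is a hypothetical helper and the temporary folder merely stands in for the real one.

```python
import tempfile
from pathlib import Path


def unmatched_pages(folder: Path) -> tuple[set[str], set[str]]:
    """Return (images without hOCR, hOCR without image), compared by stem."""
    images = {path.stem for path in folder.glob("*.png")}
    hocrs = {path.stem for path in folder.glob("*.hocr")}
    return images - hocrs, hocrs - images


# Demonstration on a throwaway folder with one deliberately missing file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("00001.png", "00001.hocr", "00002.png"):
        (folder / name).touch()
    missing_hocr, missing_image = unmatched_pages(folder)

print("images without hOCR:", sorted(missing_hocr))
print("hOCR without image:", sorted(missing_image))
```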

jbarlow83 · 2023-11-12T08:52:35Z

Now implemented as an experimental API

endolith · 2024-02-17T22:50:56Z

> @jaysonlarose
>
> I had the same problem as you. I wanted to edit the OCR before merging it to a PDF. This is what i did.
>
> * ran ocrmypdf with `-k` and `--pdf-renderer hocr`
> * edit hocr-files to your liking
> * move all .png and .hocr to folder (make sure that they have exact same filename i.e. 00001.png and 00001.hocr)
> * `git clone https://github.com/rescribe/bookpipeline`
> * `cd bookpipeline/cmd/pdfbook`
> * `go build pdfbook`
> * `./pdfbook folder output.pdf`

Why do we need a separate tool when ocrmypdf already does this step internally?

#1254

jbarlow83 added the enhancement label Nov 25, 2019

jbarlow83 closed this as completed Nov 12, 2023
