hocr import / export #453

Closed
aalmir opened this issue Nov 12, 2019 · 38 comments

Comments

@aalmir

aalmir commented Nov 12, 2019

Describe the issue
If you want to create a perfect OCR, 100% correct text, you need some editing function.
For example "gImageReader" gives some basic editing function (but has some other missing features).

Expected behavior
An option to export and import hocr.
After export, I can make changes to the hOCR and import it again for PDF creation.

@adrianbroher

What a coincidence. I wanted to request the same feature today.

This feature request has some overlap with #177.

@jbarlow83
Collaborator

I think this would be great but there's a lot to do to make it work, especially to support after the fact editing.

@tukusejssirs

All I would need is an option to merge an hOCR HTML file with a PDF file.

On the Internet, I read about hocr2pdf (from ExactImage), which seems to be no longer developed (and I can’t build it on Fedora 31 because the libagg library → AntiGrain Geometry is missing).

I also found hocr-pdf from hocr-tools, which I could install from the Fedora repos, but it does not work (it outputs an empty file with only some kind of header).

Finally, gImageReader does output a PDF with hOCR data, but it does not work with 800+ pages (200+ MiB): it creates a corrupted/damaged PDF (at least that’s what Evince and Adobe Acrobat Reader say).

@aalmir
Author

aalmir commented Jan 12, 2021

Have you reported this to the gImageReader repo?

@jbarlow83
Collaborator

ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs. What it does not have is a convenient way to run its post-processing on a set of edited hOCR files.

@tukusejssirs

ocrmypdf already has the ability to merge hOCR HTML into PDF through its public APIs.

Thanks, @jbarlow83, for the reply. I haven’t ever used ocrmypdf; could you guide me on how to accomplish this, please? 😃

What I want is to merge an hOCR HTML file (generated by gImageReader and tesseract, then corrected manually in gIR and a text editor) into a working PDF file merged by ImageMagick. Preferably in a terminal; how exactly does not matter. 😃

@aalmir, not exactly. I reported it in this comment and marginally in manisandro/gImageReader#480.

@jbarlow83
Collaborator

@tukusejssirs The relevant code is in hocrtransform.py. See python -m ocrmypdf.hocrtransform --help.

@tukusejssirs

@jbarlow83, thanks! Is there any way to input either multiple images or a pre-created PDF file? hocrtransform.py has an -i option, but it looks like it accepts only one image file. However, I have 800+ images (or pages in a PDF) and a single hOCR file containing all the text data (note: not all pages have text; some are blank and therefore are not mentioned in the hOCR file).

@aalmir, the issue is here → manisandro/gImageReader#486.

@jbarlow83
Collaborator

No, it doesn't have that ability, but you could split the hOCR and run a loop.
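
A minimal sketch of the "split the hOCR" half of that suggestion, using only the Python standard library. The helper names `local_name` and `split_hocr_pages` are hypothetical, and the sketch assumes that every `ocr_page` element is a direct child of `<body>` and that the file parses as XML (see the entity discussion further down the thread). ElementTree does not keep the DOCTYPE, so it is dropped from the per-page output.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def split_hocr_pages(hocr_text):
    """Return one hOCR document string per ocr_page element.

    Assumption: the ocr_page elements are direct children of <body>.
    """
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(hocr_text)
    body = next(el for el in root.iter() if local_name(el.tag) == "body")
    pages = [el for el in body if "ocr_page" in el.get("class", "").split()]

    documents = []
    for page in pages:
        # Copy the whole document (head, metadata), then keep a single page.
        doc = copy.deepcopy(root)
        doc_body = next(el for el in doc.iter() if local_name(el.tag) == "body")
        for child in list(doc_body):
            doc_body.remove(child)
        doc_body.append(copy.deepcopy(page))
        documents.append(ET.tostring(doc, encoding="unicode"))
    return documents
```

Each returned string could then be written to its own file and passed, together with the matching image, to one run of the `python -m ocrmypdf.hocrtransform -i <image> <hocr> <pdf>` command shown in this thread. Blank pages that have no `ocr_page` entry would still need to be matched to their images by some other means.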

@tukusejssirs

@jbarlow83, I have just tried to merge hOCR data and an image into a PDF, but it failed (see below). Am I missing some kind of module or something?

Here are the test files.

Note that although I want to output the interword spaces, I don’t want them where I have commented out the whitespace between spans. This is because I don’t want a space to be output after &shy; (soft hyphen). This is just a hack, as I could not find a better way to deal with hyphens.

python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf
/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Traceback (most recent call last):
  File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 382, in <module>
    hocr = HocrTransform(args.hocrfile, args.resolution)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 68, in __init__
    self.hocr = ElementTree.parse(hocrFileName)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity: line 45, column 113

@jbarlow83
Collaborator

It looks like the XML (024_hocr.html) is invalid, specifically at line 45.

@tukusejssirs

@jbarlow83, do I understand correctly that you are saying a soft hyphen (&shy;) is invalid in an hOCR file? As I understand it, the hOCR specification 1.2 says otherwise:

5.6. Hyphenation

Soft hyphens must be represented using the HTML &shy; entity.

@jbarlow83
Collaborator

ocrmypdf.hocrtransform is only capable of parsing the subset of hOCR generated by Tesseract.

For this specific case, you'll need to add a string like the following to the top of the hOCR file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY shy '#x00AD'>
            ]>

Based on:
https://stackoverflow.com/questions/35591478/how-to-parse-html-with-entities-such-as-nbsp-using-builtin-library-elementtree

And U+00AD being the Unicode code point for soft hyphen.
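
As an alternative sketch to declaring entities in the DOCTYPE: an XML parser only predefines `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`, which is why `&shy;` is reported as an undefined entity. The hypothetical helper `numeric_entities` below rewrites HTML named entities as numeric character references before parsing. It uses the standard library table `html.entities.name2codepoint` and is not part of ocrmypdf.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# These five are predefined in XML and must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def numeric_entities(text):
    """Replace HTML named entities (e.g. &shy;) with numeric references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names as they are
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


root = ET.fromstring(numeric_entities("<p>co&shy;op&thinsp;x &amp; y</p>"))
print(repr(root.text))
```

The converted text would have to be written back to the hOCR file before running hocrtransform, since the command reads the file directly.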

@jbarlow83
Collaborator

(Note that the doctype signature may actually be incorrect for hOCR; whatever the hOCR spec says is correct should be used.)

@tukusejssirs

tukusejssirs commented Jan 15, 2021

Thanks, @jbarlow83, that sort of worked.

  1. I changed the doctype in the example document to the following:
<!DOCTYPE html [
	<!ENTITY shy '#x00AD'>
	<!ENTITY thinsp '#x2009'>
]>
  2. I ran your script, which succeeded with two warnings (see below).
python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf
/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py:109: DeprecationWarning: This method will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.
  for child in element.getchildren():
  3. After opening merged.pdf in a PDF reader, the hOCR text is there; however, the ENTITY definitions make those characters literal in the text, e.g. &shy; is replaced with #x00AD (6 characters) instead of a real soft hyphen.

  4. Based on this comment to that SO question, I’ve even tried to change the definition to <!ENTITY shy '&amp;shy;'>, but it failed the same way. Same with the doctype you used in your example.
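
The literal `#x00AD` in the output is consistent with how entity values work in XML: the quoted value is the replacement text, so it has to contain a character reference (`&#x00AD;`) rather than a bare code point. A small sketch with the standard library parser, using a made-up `<t>` element, shows the difference:

```python
import xml.etree.ElementTree as ET

# Entity value is plain text: the parser substitutes it verbatim.
literal = ET.fromstring('<!DOCTYPE html [<!ENTITY shy "#x00AD">]><t>&shy;</t>')

# Entity value is a character reference: the parser resolves it to U+00AD.
charref = ET.fromstring('<!DOCTYPE html [<!ENTITY shy "&#x00AD;">]><t>&shy;</t>')

print(repr(literal.text))
print(repr(charref.text))
```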

@jbarlow83
Collaborator

Official definition is

<!ENTITY shy    CDATA "&#173;" -- soft hyphen = discretionary hyphen,
                                  U+00AD ISOnum -->

From: https://www.w3.org/TR/html4/sgml/entities.html

@tukusejssirs

Thanks again, @jbarlow83, but CDATA causes an error using ocrmypdf.hocrtransform (same command as above).

  1. Change the doctype to the following (the doctype itself does not matter):
<!DOCTYPE html [
	<!ENTITY shy    CDATA "&#173;" -- soft hyphen = discretionary hyphen, U+00AD ISOnum -->
	<!ENTITY thinsp "&#2009;">
]>
  2. Run the ocrmypdf.hocrtransform command (same as above). This produces an error at line 2, column 17 (the second space before CDATA in the shy entity definition).
python -m ocrmypdf.hocrtransform -i 024_mr2004.tif --interword-spaces 024_hocr.html merged.pdf
/usr/lib64/python3.8/runpy.py:127: RuntimeWarning: 'ocrmypdf.hocrtransform' found in sys.modules after import of package 'ocrmypdf', but prior to execution of 'ocrmypdf.hocrtransform'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Traceback (most recent call last):
  File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 382, in <module>
    hocr = HocrTransform(args.hocrfile, args.resolution)
  File "/usr/lib/python3.8/site-packages/ocrmypdf/hocrtransform.py", line 68, in __init__
    self.hocr = ElementTree.parse(hocrFileName)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: syntax error: line 2, column 17

@jbarlow83
Collaborator

jbarlow83 commented Jan 15, 2021 via email

@jbarlow83
Collaborator

<!DOCTYPE html [
	<!ENTITY shy    CDATA "&#173;">
	<!ENTITY thinsp "&#2009;">
]>

@tukusejssirs

@jbarlow83, I’ve already tried that; same result as I posted in my last comment.

It is quite interesting to me that your parser (or whatever reports that error) says the error occurs on a space (the second one before CDATA). Couldn’t there be an error in the parser?

@jbarlow83
Collaborator

The parser is just Python's widely used XML parser from the standard library (xml.etree.ElementTree). It's saying your modified input file is not valid XML anymore so it won't parse. It seems to work if you remove the CDATA bit.

import xml.etree.ElementTree as ET
x = ET.fromstring(
    """
    <!DOCTYPE html [
	<!ENTITY shy     "&#173;">
	<!ENTITY thinsp "&#2009;">
    ]>
<test>&shy;</test>"""
)
assert x.text == '\xad'

@tukusejssirs

tukusejssirs commented Jan 16, 2021

@jbarlow83, thanks, that helped for soft hyphens, but not for thin spaces. Thin spaces are replaced with ■ (U+25A0) for some reason. However, when I use your Python code (modified a bit; see below), it outputs U+07D9 (ߙ; Nko Letter Ra) for some reason. Note that I don’t code in Python at all.

test_html_entity.py
#!/bin/python

"""
Test HTML entities in ElementTree
"""
import xml.etree.ElementTree as ET
x = ET.fromstring(
	"""
	<!DOCTYPE html [
		<!ENTITY shy     "&#173;">
		<!ENTITY thinsp "&#2009;">
	]>
<test>&thinsp;</test>"""
)

if x.text == '\xe2\x80\x89':
	print('true')
else:
	print('false')


print('input  : [ ]', format(ord(' '), 'x'))
print('output : [' + x.text + ']', format(ord(x.text), 'x'))

Update: I’ve also tried to use &#8201; in the entity definition instead of &#2009; (which is incorrect; my bad). Same result.
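
A short sketch of the decimal versus hexadecimal point, assuming only the standard library parser. `&#2009;` is a decimal reference, and 2009 decimal is 0x7D9, which explains the U+07D9 output. The thin space U+2009 is `&#8201;` in decimal or `&#x2009;` in hexadecimal. The test script above also compares against `'\xe2\x80\x89'`, which in Python 3 is a three-character string and not the thin space; the UTF-8 bytes only match after encoding. The helper `resolve` is hypothetical, and the ■ seen in the PDF is a separate matter that this sketch does not address.

```python
import xml.etree.ElementTree as ET

DOC = '<!DOCTYPE html [<!ENTITY thinsp "{ref}">]><test>&thinsp;</test>'


def resolve(ref):
    """Parse a tiny document whose thinsp entity is defined as `ref`."""
    return ET.fromstring(DOC.format(ref=ref)).text


for ref in ("&#2009;", "&#8201;", "&#x2009;"):
    # Print the code point each definition resolves to, in hex.
    print(ref, format(ord(resolve(ref)), "x"))

# Compare against the code point, or encode before comparing bytes.
print(resolve("&#8201;") == "\u2009")
print(resolve("&#8201;").encode("utf-8") == b"\xe2\x80\x89")
```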

@jbarlow83
Collaborator

What you're trying to do likely requires software development beyond this one issue. You'll need to find someone who can do that coding for you.

@tukusejssirs

Okay, thanks anyway for the help. 😃

@rmast

rmast commented Nov 7, 2021

If you want to create a perfect OCR, 100% correct text, you need some editing function. For example "gImageReader" gives some basic editing function (but has some other missing features).

I don't know whether gscan2pdf uses gImageReader for it, but gscan2pdf lets me visually edit the OCR'ed text, based on the hOCR, and shows the confidence visually with colors.

@jaysonlarose

Apologies if this is already answered, but I've been digging around for a few days and haven't found a definitive answer for this:

What's the best/easiest/recommended way to

  • get OCRmyPDF to process something [1] up to the point where it's performed OCR but hasn't yet merged the OCR text into a PDF, and output the optimized artifacts as well as the OCR data
  • get OCRmyPDF to take the data created from the previous step and merge everything together into a finished PDF

In other words, use OCRmyPDF in a workflow where a human can come in and hand-correct (and quite possibly version control) the OCR data before performing the merge? I've been playing around with running something like img2pdf *.png | ocrmypdf - -k --tesseract-config hocr --deskew --clean ../output.pdf, but it outputs a lot of stuff that isn't really necessary to the workflow and tends to get quite large (6.1G worth of source files generates 55G worth of artifacts). That, and I'm not quite sure how to get OCRmyPDF to pick the process back up afterwards.

I'm fine with digging around in the code if necessary... in fact, that's probably where I'm going to be going after I finish writing this. I just figured I'd ask first in case someone has the answers close at hand.

Thanks,
--Jays

Footnotes

  1. (I read somewhere that OCRmyPDF is happiest with the output of img2pdf <image files> | [2] but am happy to change that if it's not the case)

  2. My current workflow consists of scanning a document as .png files, collating them so that they sort ASCIIbetically, and then running img2pdf *.png | ocrmypdf - --deskew --clean output.pdf.

@jbarlow83
Collaborator

jbarlow83 commented Jul 13, 2022

There are no great options for hand-correction because I tend to see it as a problem that a command line program is not well equipped to solve. I suppose I (or you) could insert a plugin hook that allows a custom renderer at src/ocrmypdf/_sync.py:216. Then you could intercept the .hocr file and make edits to it before proceeding with the standard hocrtransform. The pages would come at you out of order, because ocrmypdf processes pages in parallel.

@rmast

rmast commented Jul 13, 2022 via email

@jaysonlarose

@rmast Checking it out now. It's going to be chewing on the 214 MiB 324-page PDF I just fed it for a while, but what it's showing while it processes each page is making me optimistic. Well, besides looking a little bit like a ransom letter, but I can understand that considering how each word seems to be in its own bounding box, and that's just kind of how things are.

@jaysonlarose

GScan2pdf almost works. It gets up to saving page 35 of 324, and then it just sits there. The UI is still responsive, it claims to be saving, but it's just... sitting there..

@jaysonlarose

@jbarlow83 I just tossed you $200USD via OpenCollective in hopes that I can shamelessly bribe you to add some method for hocr export/import.

I just spent the last couple of hours learning about postfix operators, and it's a rabbit hole I really don't want to have to go down.

@jaysonlarose

See? I just said "postfix" when I meant "PDF".

@jbarlow83
Collaborator

@jaysonlarose I do appreciate the generous contribution and I'll try to think if there's something clever/efficient/reasonable for a CLI app.

@rmast

rmast commented Oct 11, 2022 via email

@lumalav

lumalav commented Sep 2, 2023

@jaysonlarose, I needed something exactly like what you are describing, so I forked the repo. In my version, ocrmypdf can be used in two additional workflows:

  1. OCR is ONLY performed, with the ability to output the hOCR file, and no merge occurs at the end (--pdf-renderer hocr --ocr-only --hocr-out /some/path). NOTE: the OCR image is output as well, with the same name as the hOCR file but with the .png extension. Also, the hOCR file is modified in place after being copied so that the source image reflects the one that was output (/html/body/div[@class='ocr_page']).
  2. OCR is performed, but the merge is done using an hOCR file supplied as input, and the normal pipeline continues (--pdf-renderer hocr --hocr-in /some/path).

However, I don't think I'll be opening a pull request because of the way option 2 works. It needs to perform OCR again in case optimizations were done in the middle (--deskew or --rotation), so that the hOCR being passed in matches the optimizations and the OCR image generated at that moment. In theory, you need to pass the same set of options (--rotate-pages, --clean, --deskew) if you are planning to perform these flows separately... Additionally, option 2 loses the sidecar because it is created only if hOCR is generated in the regular flow or the --ocr-only flow, so in theory you should already have the original sidecar.

I'll be waiting for my 200 bucks :)

commits
fork

@dansbandit

@jaysonlarose

I had the same problem as you. I wanted to edit the OCR before merging it into a PDF. This is what I did.

  • ran ocrmypdf with -k and --pdf-renderer hocr
  • edited the hocr files to my liking
  • moved all .png and .hocr files to a folder (making sure that each pair has exactly the same filename, e.g. 00001.png and 00001.hocr)
  • git clone https://github.com/rescribe/bookpipeline
  • cd bookpipeline/cmd/pdfbook
  • go build pdfbook
  • ./pdfbook folder output.pdf
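
Since the steps above depend on every image having an hOCR file with the same base name, a quick check before running the merge may save a failed run. This is a minimal sketch with a hypothetical helper name, `unpaired_stems`, and it assumes the `.png` and `.hocr` extensions from the list above:

```python
from pathlib import Path


def unpaired_stems(folder):
    """Return base names that have a .png or a .hocr file, but not both."""
    folder = Path(folder)
    images = {p.stem for p in folder.glob("*.png")}
    hocrs = {p.stem for p in folder.glob("*.hocr")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(images ^ hocrs)
```

An empty list means every page has both files; any name returned points at a missing image or a missing hOCR file.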

@jbarlow83
Collaborator

Now implemented as an experimental API

@endolith
Contributor

endolith commented Feb 17, 2024

@jaysonlarose

I had the same problem as you. I wanted to edit the OCR before merging it to a PDF. This is what i did.

* ran ocrmypdf with `-k` and `--pdf-renderer hocr`

* edit hocr-files to your liking

* move all .png and .hocr to folder (make sure that they have exact same filename i.e. 00001.png and 00001.hocr)

* `git clone https://github.com/rescribe/bookpipeline`

* `cd bookpipeline/cmd/pdfbook`

* `go build pdfbook`

* `./pdfbook folder output.pdf`

Why do we need a separate tool when ocrmypdf already does this step internally?

#1254
