possibility to interrupt execution to edit hocr files #907

femifrak · 2022-02-03T13:09:14Z

The wish to edit recognized text before it is converted to PDF has come up several times here (e.g. #91).

If it were possible to temporarily pause ocrmypdf as soon as (some of) the hocr files have been created (e.g. with ocrmypdf -k --tesseract-config hocr), one could edit the hocr files before they are processed further. Once editing is finished, one could manually tell ocrmypdf to continue with the rest of the process.

Would you consider including such an interruption option?

Maybe such an option could not be incorporated in a user-friendly way, but for testing feasibility or as a temporary workaround, could you please point me to the file and location where such a break could be implemented?

(As hocr files are plain text files, recurring errors could easily be corrected, e.g. with regex.)
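To illustrate the regex idea, here is a minimal sketch. It assumes the hocr file has already been read into a string, and the helper name `fix_hocr_text` is made up for this example. It only touches text between tags, so attributes such as `title='bbox ...'` are left alone.

```python
import re


def fix_hocr_text(hocr: str, pattern: str, replacement: str) -> str:
    """Apply a regex substitution to text between tags only, leaving markup untouched."""
    compiled = re.compile(pattern)

    def _fix(match: re.Match) -> str:
        # match.group(1) is the text found between a closing '>' and the next '<'
        return ">" + compiled.sub(replacement, match.group(1)) + "<"

    return re.sub(r">([^<>]+)<", _fix, hocr)


# Hypothetical one-word sample in the style of a hocr word span
sample = "<span class='ocrx_word' title='bbox 1 2 3 4'>connnon</span>"
fixed = fix_hocr_text(sample, r"nnn", "mm")
print(fixed)
```

A plain `re.sub` over the whole file would also hit attribute values, which is why the sketch restricts itself to text nodes.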

femifrak · 2022-02-03T19:26:50Z

I changed my mind because it is difficult to proofread and edit the hocr files when there are many of them. Even in the case where you can define and correct the errors with regular expressions, it would be cumbersome to manually interrupt and resume the ocrmypdf process. Instead, this would be my wish:

In hocrtransform.py there is a dictionary that is used to replace ligatures. Could this list be moved to a separate file that users can edit? That way one could flexibly correct frequent, repetitive recognition errors. I tried this as a test, but had only partial success, because the replacements are only applied when calling hocrtransform.py directly, not when using ocrmypdf. (Obviously I didn't understand the code correctly.)

My specific problem is the separators in old texts, which were written not with "-", as is common today, but with "⸗" or "—", and are transcribed that way by Tesseract. Unfortunately, PDF readers only find hyphenated words if they are split with today's usual separator. Replacing them with "-" would solve this problem. Thanks!
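The replacement I have in mind could look roughly like this. It is a sketch only: the names `SEPARATOR_MAP` and `normalize_separators` are invented for illustration and are not taken from hocrtransform.py. It maps single characters to single characters, in the same spirit as the ligature dictionary.

```python
# Hypothetical single-character replacement table (not the actual hocrtransform.py dict)
SEPARATOR_MAP = str.maketrans({
    "\u2e17": "-",  # DOUBLE OBLIQUE HYPHEN, common in old prints
    "\u2014": "-",  # EM DASH
})


def normalize_separators(text: str) -> str:
    """Replace archaic separators with the plain hyphen-minus."""
    return text.translate(SEPARATOR_MAP)


print(normalize_separators("Garten\u2e17haus"))
```

Keeping the table in a user-editable file would then only require loading it into the dictionary passed to `str.maketrans`.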

jbarlow83 · 2022-02-04T00:58:39Z

There isn't a good way to allow hocr editing without some heavy adjustments. It's probably more manageable now there's an API and plugin interface but not all the right bits are exposed for this.

I'd suggest just editing hocrtransform to deal with the separators using what is already done for ligatures. What to do with these is not necessarily universal, e.g. maybe other users really need all of the archaic separators.

femifrak · 2022-02-04T06:52:46Z

Thanks for your explanation. I have 3 points:

1.) Is it right, that the mentioned replacements are only applied in case of hocr renderer?

2.) Furthermore I am not sure whether I clearly explained what my problems are. I now have a solution for the separators. Really nice!! However, there is often another kind of problem:
The process of ocr often creates the same type of recognition errors like "nnn" instead of "mm". If there was a list containing the wrong words and their corrected counterparts, a replacement of the wrong word in the hocr file would solve this issue (imho). This should not be problematic since the geometric word lengths are similar. Using the dict in hocrtransform is not possible here since only strings of length 1 are accepted. I have the feeling that this would be desired quite often and would really enhance your super program.

3.) I am not sure whether comments of closed issues reach you. Could you please give me a short note so that I don't need to open a new issue even if you have no reply? Thanks a lot :)
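To make point 2 more concrete, a word-level exchange list could be sketched as follows. This is an illustration only, not something ocrmypdf or hocrtransform.py provides: the function name `apply_exchange_list` and the simplified sample markup are hypothetical, and it assumes the words appear as elements with class `ocrx_word`, as in Tesseract's hocr output.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def apply_exchange_list(hocr_xml: str, exchange: Mapping[str, str]) -> str:
    """Replace whole words in ocrx_word elements according to an exchange list."""
    root = ET.fromstring(hocr_xml)
    for elem in root.iter():
        # Check the class attribute only, so a namespaced tag name does not matter
        if elem.get("class") == "ocrx_word" and elem.text in exchange:
            elem.text = exchange[elem.text]
    return ET.tostring(root, encoding="unicode")


# Simplified, hypothetical hocr fragment (a real file has more structure)
sample = (
    "<div class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 0 0 40 10'>Sunnner</span> "
    "<span class='ocrx_word' title='bbox 45 0 80 10'>time</span>"
    "</div>"
)
result = apply_exchange_list(sample, {"Sunnner": "Summer"})
print(result)
```

Unlike the single-character dict, the keys here can be strings of any length, and the bounding boxes stay untouched because only the text of each word element changes.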

jbarlow83 · 2022-02-04T07:50:20Z

> 1.) Is it right, that the mentioned replacements are only applied in case of hocr renderer?

Correct. I don't know if the Tesseract renderer produces ligatures.

> 2.) Furthermore I am not sure whether I clearly explained what my problems are. I now have a solution for the separators. Really nice!! However, there is often another kind of problem: The process of ocr often creates the same type of recognition errors like "nnn" instead of "mm". If there was a list containing the wrong words and their corrected counterparts, a replacement of the wrong word in the hocr file would solve this issue (imho). This should not be problematic since the geometric word lengths are similar. Using the dict in hocrtransform is not possible here since only strings of length 1 are accepted. I have the feeling that this would be desired quite often and would really enhance your super program.

I agree in principle but the right place for these issues to be fixed is in Tesseract itself. I haven't checked if Tesseract still outputs these ligatures - that's a holdover from the earliest versions.

I can see that Tesseract doesn't necessarily know whether to output ligatures or normalize text, and that decision will vary with both use case and language.

> 3.) I am not sure whether comments of closed issues reach you. Could you please give me a short note so that I don't need to open a new issue even if you have no reply? Thanks a lot :)

Yes, I get notified and I believe that is GitHub's default. In the past I let issues stay open but I've been trying to close them more quickly lately so that open issues remain a list of things I should act on. Hopefully people don't take this the wrong way. Closing mainly means no further action is needed from me (I don't need to investigate or fix an issue). Also I'm very busy with regular work.

femifrak · 2022-02-04T11:26:31Z

Thanks for the answer! I'm amazed at the immense amount of work you do on the side. Awesome. Thank you so much!!!
I can totally understand and don't expect a (quick) answer.

You may not know this site, but you might like it when your workload gets too heavy. (Better a temporary, announced period of inactivity than a burnout.)

Regarding my wish for an exchange list, I totally agree with you that this should be done in Tesseract. However, I don't mean ligatures. I mean recognition errors. (The ligatures dict in hocrtransform.py was just a convenient way for me to edit your code and to explain what I mean.) There will never (!) come a time when Tesseract reaches a recognition accuracy of 100%, so an exchange list would always be a useful feature. It would be a fantastic and unique feature for an open source OCR software!!

My idea was to externally create a list of unique words from an ocr'ed pdf, run a spell checker on it, put wrong and corrected words side by side and run ocrmypdf again including this exchange list. At the end there would be a significant increase in accuracy.
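Roughly, building such a list could look like the sketch below. Everything in it is an assumption for illustration: `build_exchange_list` is a made-up name, the `known_words` set stands in for a real spell checker's dictionary, and the suggestions come from the standard library's `difflib` rather than from an actual spell checker.

```python
import re
from difflib import get_close_matches
from typing import Dict, Set


def build_exchange_list(text: str, known_words: Set[str]) -> Dict[str, str]:
    """Map each unknown word in text to its closest known word, if there is one."""
    vocabulary = sorted(known_words)
    exchange: Dict[str, str] = {}
    # Unique words, letters only (Unicode-aware), in a stable order
    for word in sorted(set(re.findall(r"[^\W\d_]+", text))):
        if word.lower() in known_words:
            continue
        # Stand-in for a spell checker: take the closest dictionary word, if any
        candidates = get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
        if candidates:
            exchange[word] = candidates[0]
    return exchange


# Hypothetical OCR output and a tiny stand-in dictionary
ocr_text = "The sunnner was warm"
dictionary = {"the", "summer", "was", "warm"}
exchange = build_exchange_list(ocr_text, dictionary)
print(exchange)
```

The resulting list would still need a manual check before use, since a close match is not always the right correction, and the sketch returns suggestions in lower case.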

Could I persuade you to reconsider the idea and create such a list?

BTW: I think the existence of ligatures depends on the language model used by Tesseract. For my part, I don't see any.

jbarlow83 closed this as completed Feb 4, 2022
