possibility to interrupt execution to edit hocr files #907
Comments
I changed my mind because it is difficult to proofread and edit the hocr files when there are many of them. Even where the errors can be identified and corrected with regular expressions, it would be cumbersome to manually interrupt and resume the ocrmypdf process. Instead, this would be my wish: in hocrtransform.py there is a dictionary that is used to replace ligatures. Could this list be moved to a separate file that users can edit? Then one could flexibly correct frequent, repetitive recognition errors. I have tried this on a test basis, but have only had partial success, because these replacements are only applied when calling hocrtransform.py directly, not when using ocrmypdf. (Obviously I didn't understand the code correctly.) My specific problem is separators in old texts, which were written not with a "-", as is common today, but with "⸗" or "—", and are transcribed that way by Tesseract. Unfortunately, PDF readers only find hyphenated words if they are separated with today's usual separator. A replacement with "-" would solve this problem. Thanks!
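As a rough illustration of the replacement described above, here is a minimal sketch that maps the archaic separators to a plain "-". It is independent of ocrmypdf's own code: the names `SEPARATOR_MAP` and `normalize_separators` are hypothetical, and where such a step would hook into ocrmypdf is not shown.

```python
# Hypothetical mapping, not taken from hocrtransform.py.
# U+2E17 is the double oblique hyphen, U+2014 is the em dash.
SEPARATOR_MAP = str.maketrans({"\u2e17": "-", "\u2014": "-"})


def normalize_separators(text: str) -> str:
    """Replace archaic separators with the plain hyphen-minus."""
    return text.translate(SEPARATOR_MAP)


if __name__ == "__main__":
    print(normalize_separators("Buch\u2e17staben"))
```

Whether every user wants this normalization is a separate question, so a mapping like this would probably have to be opt-in.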
There isn't a good way to allow hocr editing without some heavy adjustments. It's probably more manageable now that there's an API and plugin interface, but not all the right bits are exposed for this. I'd suggest just editing hocrtransform to deal with the separators using what is already done for ligatures. What to do with these is not necessarily universal, e.g. maybe other users really need all of the archaic separators.
Thanks for your explanation. I have 3 points: 1.) Is it right that the mentioned replacements are only applied when the hocr renderer is used? 2.) Furthermore, I am not sure whether I explained my problems clearly. I now have a solution for the separators. Really nice!! However, there is often another kind of problem: 3.) I am not sure whether comments on closed issues reach you. Could you please give me a short note, even if you have no reply, so that I don't need to open a new issue? Thanks a lot :)
Correct. I don't know if the Tesseract renderer produces ligatures.
I agree in principle, but the right place for these issues to be fixed is in Tesseract itself. I haven't checked if Tesseract still outputs these ligatures - that's a holdover from the earliest versions. I can see that Tesseract doesn't necessarily know whether to output ligatures or normalize text, and that decision will vary with both use case and language.
Yes, I get notified, and I believe that is GitHub's default. In the past I let issues stay open, but I've been trying to close them more quickly lately so that open issues remain a list of things I should act on. Hopefully people don't take this the wrong way. Closing mainly means no further action is needed from me (I don't need to investigate or fix an issue). Also, I'm very busy with regular work.
Thanks for the answer! I am amazed anyway by the immense work that you do on the side. Awesome. Thank you so much!!! When your workload gets too heavy, you may not know this site, but you might like it. (Better a temporary announced inactivity than a burnout.) Regarding my wish for an exchange list, I totally agree with you that this should be done in Tesseract. However, I don't mean ligatures. I mean recognition errors. (The ligatures dict in hocrtransform.py was just a convenient way for me to edit your code and to explain what I mean.) There will never (!) come a time when Tesseract reaches a recognition accuracy of 100%, so an exchange list would always be a useful feature. It would be a fantastic and unique feature for open source OCR software!! My idea was to externally create a list of unique words from an OCR'ed PDF, run a spell checker on it, put wrong and corrected words side by side, and run ocrmypdf again including this exchange list. In the end there would be a significant increase in accuracy. Could I persuade you to reconsider the idea and create such a list? BTW: I think the existence of ligatures depends on the language model used by Tesseract. I, for my part, don't see any.
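The exchange-list workflow described above could look roughly like the sketch below. It only works on plain text and makes no claim about how ocrmypdf would consume such a list. The tab-separated file format and the function names (`unique_words`, `parse_exchange_list`, `apply_exchange_list`) are assumptions made for illustration, and the spell-checking step is left to an external tool.

```python
import re


def unique_words(text):
    """Collect the distinct words of OCR output, e.g. to feed a spell checker."""
    return sorted(set(re.findall(r"\w+", text)))


def parse_exchange_list(lines):
    """Parse 'wrong<TAB>corrected' lines; blank lines and '#' comments are skipped."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        wrong, _, corrected = line.partition("\t")
        if corrected:
            pairs[wrong] = corrected
    return pairs


def apply_exchange_list(text, pairs):
    """Replace whole words only, so 'tbe' does not touch 'tbey'."""
    if not pairs:
        return text
    # Longest entries first so longer words win over their prefixes.
    words = sorted(pairs, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: pairs[m.group(0)], text)


if __name__ == "__main__":
    pairs = parse_exchange_list(["# wrong<TAB>corrected", "tbe\tthe", "wbich\twhich"])
    print(apply_exchange_list("tbe house wbich tbey saw", pairs))
```

Matching on whole words keeps the list from rewriting substrings of other words, at the cost of missing errors that span a word boundary.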
The desire to edit recognized text before it is converted to PDF has come up several times here (e.g. #91).
If there were a possibility to temporarily interrupt the execution of ocrmypdf as soon as (some of) the hocr files are created (e.g. with
ocrmypdf -k --tesseract-config hocr
) one could edit the hocr files before they are processed. When editing is finished, one could manually trigger ocrmypdf to continue with the remaining process. Would you consider including such an interruption option?
Maybe such a possibility would not be user-friendly to incorporate, but for testing the feasibility or as a temporary workaround, could you please guide me as to where, and in which file, such a break could be implemented?
(As hocr files are pure text files, repeating errors could easily be corrected e.g. with regex.)
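A minimal sketch of such a regex correction is shown below. It assumes the usual hOCR layout in which each recognized word is the text content of a `<span class='ocrx_word' ...>` element, and it only rewrites text between tags so that attributes such as `title` (bounding boxes, confidences) stay untouched. The function name `correct_hocr_text` and the sample line are hypothetical. A real HTML parser would be more robust than this regex.

```python
import re

# Matches the text between one tag's closing '>' and the next tag's opening '<'.
TEXT_NODE = re.compile(r">([^<>]+)<")


def correct_hocr_text(hocr: str, pattern: str, replacement: str) -> str:
    """Apply a regex substitution to text nodes only, leaving tags as they are."""
    rule = re.compile(pattern)

    def fix(match):
        return ">" + rule.sub(replacement, match.group(1)) + "<"

    return TEXT_NODE.sub(fix, hocr)


if __name__ == "__main__":
    sample = (
        "<span class='ocrx_word' title='bbox 10 10 90 30; x_wconf 91'>"
        "Buch\u2e17</span>"
    )
    print(correct_hocr_text(sample, "\u2e17", "-"))
```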