How to use a timeout for gs? #1010

Closed
svenha opened this issue Aug 30, 2022 · 13 comments

Comments

@svenha

svenha commented Aug 30, 2022

I am batch processing PDF files. Some files lead to a gs job that never seems to end (I gave up after 2 hours of CPU time). I guess I am looking for a counterpart of the --tesseract-timeout option. (Ubuntu 22.04, with packaged ocrmypdf.)

@jbarlow83
Collaborator

We can do tesseract-timeout because it's still possible to produce a functional, mostly OCRed PDF if Tesseract fails on certain pages. But Ghostscript is a one-shot - it has to run to completion or we don't get a usable PDF. (Or in some cases, we can't produce the images Ghostscript needs.)

For Ghostscript, if it fails to run to completion, we can't produce a functional PDF at all.

--output-type pdf avoids using Ghostscript for output and may be enough.

I assume it's a private file you won't share with me, but if you run ocrmypdf with --keep-temporary-files and verbose you should be able to see the command that hangs and locate the exact files that triggered this. If this produces a hang independent of ocrmypdf, you could submit the issue to Ghostscript's bug tracker and get it fixed. I'm assuming it's probably their issue, but it's possible there's a deadlock or something in ocrmypdf that makes it look like Ghostscript's fault.

@svenha
Author

svenha commented Aug 30, 2022

Thanks for the quick help and many hints. The gs call that hangs is for a single page:

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=2 -dLastPage=2 -r12106.537530x12106.537530 -o - -sstdout=%stderr -dAutoRotatePages=/None -f /tmp/ocrmypdf.io.zcxgo0d2/origin.pdf

I reduced my options to just one (--redo-ocr) to make them minimal.
I will send you a good example pdf; it is a train time table, not very secret :-)

@jbarlow83
Collaborator

-r12106.537530x12106.537530

That's the problem... ocrmypdf picked too high a rendering resolution for the file for some reason. It tries to pick a resolution that will capture all details in the file. This is not a Ghostscript problem.
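
To get a feel for why a resolution like this makes gs appear to hang, here is a rough back-of-the-envelope sketch. It assumes an A4 page, which is only a guess since the real page size of the file is not stated in this thread; the DPI value is the one from the -r argument above, and png16m is a 24-bit RGB device.

```python
# Rough estimate of the raster size Ghostscript is asked to produce.
# ASSUMPTION: an A4 page (8.27 x 11.69 in); the actual page size is unknown.
DPI = 12106.537530        # from the -r argument of the hanging gs command
PAGE_WIDTH_IN = 8.27      # assumed page width in inches
PAGE_HEIGHT_IN = 11.69    # assumed page height in inches
BYTES_PER_PIXEL = 3       # png16m renders 24-bit RGB

width_px = round(PAGE_WIDTH_IN * DPI)
height_px = round(PAGE_HEIGHT_IN * DPI)
raw_bytes = width_px * height_px * BYTES_PER_PIXEL

print(f"{width_px} x {height_px} px, about {raw_bytes / 1e9:.0f} GB uncompressed")
```

Under that assumption the single page would be more than 100,000 pixels wide and tens of gigabytes uncompressed, so the job may be making progress but is unlikely to finish in any reasonable time.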

@svenha
Author

svenha commented Aug 30, 2022

train_time_table.pdf.gpg.zip

Encrypted for @jbarlow83 as documented in the Wiki.

@svenha
Author

svenha commented Aug 31, 2022

Would a maximum for the rendering resolution be a solution?
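
As a minimal sketch of what such a cap could look like: the function name `clamp_dpi` and the 600 DPI ceiling are hypothetical choices for illustration only. This is not taken from ocrmypdf and is not a description of how the issue was eventually fixed.

```python
MAX_DPI = 600.0  # hypothetical ceiling, chosen only for illustration


def clamp_dpi(xres: float, yres: float, max_dpi: float = MAX_DPI) -> tuple[float, float]:
    """Scale both resolutions by the same factor so that neither exceeds max_dpi."""
    largest = max(xres, yres)
    if largest <= max_dpi:
        return xres, yres  # already within the limit, leave untouched
    scale = max_dpi / largest
    # Scaling both axes equally preserves the aspect ratio of the rendering.
    return xres * scale, yres * scale
```

With this, the resolution from the hanging command would be reduced to roughly 600 x 600, while an ordinary 300 DPI page would pass through unchanged.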

@hrst

hrst commented Sep 6, 2022

I'm having exactly the same problem.

In my case:
-r13505.527644x13505.527644

gs hangs indefinitely, very high CPU usage ensues (100%).

Is there any command line parameter that can fix this? I'm fine with setting a limit on the DPI or just setting a fixed value. --oversample 300 did not help. --output-type pdf did not help either; gs is still run.

Log excerpt:
DEBUG - Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r13505.527644x13505.527644', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.wrcn_5cl/origin.pdf']

@hrst

hrst commented Nov 11, 2022

@jbarlow83 Sorry for bothering you, but is there anything that can be done to prevent this from happening?

@jbarlow83
Collaborator

jbarlow83 commented Nov 25, 2022

I'm afraid the thing preventing anything from happening on this issue is that I'm too busy with other projects and a comprehensive resolution is not trivial. I am hoping to have time in late December. In the meantime if anyone wants to attempt a PR I'd be happy to help with that.

@hrst

hrst commented Apr 8, 2023

@jbarlow83 Just made a contribution on Open Collective. It is not much and I do not expect anything, but should you ever have the time and nerves, it would be very helpful if you could take a look at this. Thanks a lot in any case for the awesome project!

@hrst

hrst commented Sep 11, 2023

@jbarlow83 Any chance you could look into this? I sadly had to stop using ocrmypdf as it would put the server to 100% CPU with no way to prevent it from happening. Really any sort of fix (even if it cannot OCR the document and just exits) would be awesome!

@svenha
Author

svenha commented Sep 12, 2023

@hrst you can use the timeout command as a rough workaround under Linux if you want to avoid running an ocrmypdf job forever:

timeout --preserve-status 300s ocrmypdf ...
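
When ocrmypdf is launched from a script rather than a shell, a similar workaround can be sketched with Python's standard library. The helper name `run_with_timeout` is hypothetical. The sketch is POSIX-only and, unlike the timeout command above (which sends SIGTERM by default), it sends SIGKILL to the whole process group so that child processes such as gs are not left running.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed after `seconds`."""
    # start_new_session=True puts the child in its own session and process
    # group (POSIX only), so any helpers it spawns can be killed together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # The child is the group leader, so its PID is also the group ID.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return None
```

A call might look like `run_with_timeout(["ocrmypdf", "input.pdf", "output.pdf"], 300)`, where the file names are placeholders. A `None` result means the job was cut off and no output should be expected.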

@hrst

hrst commented Sep 14, 2023

@svenha Thanks for the tip. I had considered this, but the problem is that it is hard to determine a good value for the timeout. I could retrieve the number of pages first and then set the timeout dynamically, but I had hoped this would get resolved at some point. I might just use the timeout and not use ocrmypdf at all for documents with a large number of pages. However, if I remember correctly, ocrmypdf still used 100% CPU even with the --jobs parameter, but I have to try that.
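
The page-count idea could be sketched roughly as follows. The function name `timeout_for` and both constants are hypothetical and would need tuning for the actual hardware and documents; how the page count is obtained (for example with a PDF library) is left out here.

```python
def timeout_for(page_count: int, base_s: float = 60.0, per_page_s: float = 30.0) -> float:
    """Return a timeout in seconds that grows linearly with the page count.

    base_s and per_page_s are illustrative guesses, not measured values.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    # Fixed startup allowance plus a per-page budget.
    return base_s + per_page_s * page_count
```

The result could then be passed to the timeout command or to a script-level timeout. As noted in this thread, any such budget remains a guess: a large document on a slow machine may exceed it even when nothing is wrong.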

jbarlow83 added a commit that referenced this issue Sep 21, 2023
To address #1010 and other issues.
@jbarlow83
Collaborator

Underlying issue fixed in v15 - I probably won't add a timeout on Ghostscript itself because it's difficult to say what a reasonable completion time is. Large documents on slow computers might fail when nothing was wrong.
