Increase OCR timeout #10

Closed
opensemanticsearch opened this issue Jun 11, 2019 · 5 comments

@opensemanticsearch
Owner

opensemanticsearch commented Jun 11, 2019

Tika's default OCR timeout of 120 seconds is not enough when multiple documents or images are being OCRed in parallel. This leads to Tika OCR timeouts and therefore to a Tika exception for the full document(s).

@Mandalka
Collaborator

Mandalka commented Jun 13, 2019

The build script of the Debian package now extracts the OCR config org/apache/tika/parser/ocr/TesseractOCRConfig.properties from the Tika Server JAR, changes the timeout setting, and writes the changed config back into the Tika Server JAR of the package, adding or overwriting the existing entry.
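For anyone who wants to reproduce that step outside the Debian build, here is a minimal sketch of the same idea in Python. It is not the actual build script. The property key `timeout` and the function names `set_property` and `patch_jar_timeout` are assumptions made for illustration; check the properties file shipped in your Tika Server JAR for the real key name.

```python
import shutil
import tempfile
import zipfile

# Path of the OCR config inside the Tika Server JAR (taken from the comment above).
CONFIG_PATH = "org/apache/tika/parser/ocr/TesseractOCRConfig.properties"


def set_property(text, key, value):
    """Return properties text with key set to value, appending it if absent."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        is_comment = line.lstrip().startswith("#")
        if "=" in line and not is_comment and line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def patch_jar_timeout(jar_path, timeout_seconds, key="timeout"):
    """Rewrite the JAR so that CONFIG_PATH carries the new timeout value.

    The key name "timeout" is an assumption, not confirmed by this thread.
    """
    with tempfile.NamedTemporaryFile(suffix=".jar", delete=False) as tmp:
        tmp_path = tmp.name
    found = False
    with zipfile.ZipFile(jar_path) as src, \
            zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CONFIG_PATH:
                # Java properties files are ISO-8859-1 by convention.
                text = data.decode("iso-8859-1")
                data = set_property(text, key, timeout_seconds).encode("iso-8859-1")
                found = True
            dst.writestr(item, data)
        if not found:
            # Add the config if the JAR did not contain it yet.
            dst.writestr(CONFIG_PATH, set_property("", key, timeout_seconds))
    shutil.move(tmp_path, jar_path)
```

A call such as `patch_jar_timeout("tika-server.jar", 600)` (the file name is a placeholder) rewrites the archive in place, so keep a copy of the original JAR.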

@rmazzine

rmazzine commented Oct 2, 2019

If anybody needs to modify the timeout via REST, just add the header "X-Tika-OCRTimeout: 200" to the request for a timeout of 200 seconds.

Example:

curl -T file_to_ocr.jpg localhost:9998/tika --header "X-Tika-OCRTimeout: 200"
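The same request can be sketched from Python with only the standard library, which may be handy inside an ETL script. The function names `build_ocr_request` and `extract_text` are made up for this example, the explicit `Content-Type` and the extra 30 seconds of client-side timeout are assumptions of this sketch, and the URL is simply the one from the curl call above.

```python
import urllib.request


def build_ocr_request(data, ocr_timeout_seconds, url="http://localhost:9998/tika"):
    """Build a PUT request (like `curl -T`) carrying a per-request OCR timeout."""
    request = urllib.request.Request(url, data=data, method="PUT")
    # Assumption: send a neutral content type so Tika detects the format itself;
    # without this, urllib would default to a form-encoded content type.
    request.add_header("Content-Type", "application/octet-stream")
    # Header from the comment above; the value is in seconds.
    request.add_header("X-Tika-OCRTimeout", str(ocr_timeout_seconds))
    return request


def extract_text(path, ocr_timeout_seconds=200):
    """Send a file to Tika Server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_ocr_request(handle.read(), ocr_timeout_seconds)
    # Wait a little longer on the client than the server-side OCR timeout
    # (the 30 second margin is an arbitrary choice for this sketch).
    with urllib.request.urlopen(request, timeout=ocr_timeout_seconds + 30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a Tika Server running locally, `extract_text("file_to_ocr.jpg")` would correspond to the curl command above.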

@opensemanticsearch
Owner Author

Thanks for the tip. I will add that to the ETL plugin for the case that someone uses Tika on another server or installation instead of our preconfigured Tika deb package.

@Mandalka
Collaborator

Open Semantic ETL now sets the timeout for Tika Server using the X-Tika-OCRTimeout header.

@MparkG

MparkG commented Jan 27, 2022

I am seeing this pop up now; it's for the fake tika server:

java[1950828]: ERROR [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher Timeout task PARSE, millis elapsed 300091, timeoutMillis 300000, file id b'World History.pdf'consider increasing the allowable time with the <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header
Jan 27 22:22:34 mgp java[1950828]: WARN  [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher forked process observed TIMEOUT and is shutting down.
Jan 27 22:22:34 mgp java[1950828]: INFO  [Thread-22] 22:22:34,214 org.apache.tika.server.core.ServerStatusWatcher Shutting down forked process with status: TIMEOUT
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Connection to Tika server (will retry in 120 seconds) failed. Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Retrying to connect to Tika server in 120 second(s).
Jan 27 22:22:34 mgp java[1929662]: INFO  [pool-2-thread-1] 22:22:34,678 org.apache.tika.server.core.TikaServerWatchDog forked process exited with exit value 3
Jan 27 22:22:36 mgp java[1961770]: INFO  [main] 22:22:36,867 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.2.1 server
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,014 org.apache.tika.server.core.TikaServerProcess Using custom config: /etc/tika/tika-config-fakecache.xml
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,897 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9999/
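
The first log line above names the X-Tika-Timeout-Millis header as one way to raise the overall parse limit (here 300000 ms). A minimal sketch of building that header per request follows. The helper name `task_timeout_headers` is hypothetical, and whether a given Tika Server version and configuration honors the header is an assumption that this thread does not confirm.

```python
import urllib.request


def task_timeout_headers(task_timeout_seconds):
    """Return request headers asking Tika Server for a longer task timeout.

    The header name comes from the log message above; the value is in
    milliseconds, so seconds are converted here.
    """
    if task_timeout_seconds <= 0:
        raise ValueError("timeout must be positive")
    millis = int(task_timeout_seconds * 1000)
    return {"X-Tika-Timeout-Millis": str(millis)}


# Example: ask for 10 minutes instead of the 300000 ms seen in the log.
# The port is the publish address from the log; the /tika path is an assumption.
request = urllib.request.Request(
    "http://localhost:9999/tika",
    data=b"",  # placeholder body; a real call would send the file bytes
    method="PUT",
    headers=task_timeout_headers(600),
)
```

Nothing is sent until the request is passed to `urllib.request.urlopen`, so the snippet is safe to run without a server.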
