Proposal: Add Directory Watching Functionality #465

ianalexander · 2019-12-22T19:54:47Z

Following the contributing guidelines, I am opening a new issue to discuss potentially adding new functionality to OCRmyPDF.

I want to propose adding a small script that would enable directory watching. The goal of this would be to:

- Add a script that watches a directory and runs OCRmyPDF on the files that appear in it.
- Make the script usable primarily in Docker, but also standalone.

I'm thinking the high level design would look like:

- Use Python's watchdog for the actual monitoring functionality.
- Write a small script in misc/ (same place webserver.py is) that would execute watchdog with appropriate arguments.
- Minimum 3 configuration options: input directory, output directory, and OCRmyPDF arguments. Would most likely use environment variables for configuration (which will enable reuse for Docker or standalone).
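The design above could be sketched roughly as follows. This is only an illustration of the proposal, not the script from the eventual PR: the environment variable names (`OCR_INPUT_DIRECTORY`, `OCR_OUTPUT_DIRECTORY`, `OCR_ARGUMENTS`), their defaults, and the helper names are all assumptions made for the example. It calls the `ocrmypdf` CLI and assumes watchdog is installed.

```python
import os
import shlex
from pathlib import Path


def load_config(env=os.environ):
    """Read the three proposed settings from environment variables.

    The variable names and defaults are hypothetical; the proposal
    does not name them.
    """
    return {
        "input_dir": Path(env.get("OCR_INPUT_DIRECTORY", "/input")),
        "output_dir": Path(env.get("OCR_OUTPUT_DIRECTORY", "/output")),
        # Split a string such as "--deskew -l eng" into a list of arguments.
        "ocrmypdf_args": shlex.split(env.get("OCR_ARGUMENTS", "")),
    }


def output_path_for(src, output_dir):
    """Map a new input file to a same-named file in the output directory."""
    return Path(output_dir) / Path(src).name


def watch(config):
    """Block until interrupted, running the ocrmypdf CLI on each new PDF."""
    import subprocess
    import time

    # Imported lazily so the helpers above work without watchdog installed.
    from watchdog.events import PatternMatchingEventHandler
    from watchdog.observers import Observer

    class PdfHandler(PatternMatchingEventHandler):
        def on_created(self, event):
            # Caveat: the file may still be being written when this fires.
            dst = output_path_for(event.src_path, config["output_dir"])
            cmd = ["ocrmypdf", *config["ocrmypdf_args"], event.src_path, str(dst)]
            subprocess.run(cmd, check=False)

    handler = PdfHandler(patterns=["*.pdf"], ignore_directories=True)
    observer = Observer()
    observer.schedule(handler, str(config["input_dir"]), recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling `watch(load_config())` would start the watcher. Keeping configuration in environment variables means the same call works inside a container (`docker run -e ...`) or from a plain shell.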

I think two things worth getting your feedback on are:

- This would require adding a new dependency (watchdog). If you prefer a non-Python dependency, something like Facebook's watchman could be used.
- I think this belongs in the official repo instead of a separate one, because many of the images I have seen on Docker Hub are poorly implemented and not updated. This would ensure people are using the latest version of OCRmyPDF, etc.

Also just to be clear, I would be writing the PR for this :) What do you think?

jbarlow83 · 2019-12-22T23:36:00Z

I agree that this is important enough functionality to have a standard way to do this.

I do have some documentation about using watchdog...
https://ocrmypdf.readthedocs.io/en/latest/batch.html#hot-watched-folders

...but I think it would be useful to expand this documentation to cover many more cases, put some scripts into misc/ as you propose, rewrite that script to use the ocrmypdf API instead of CLI, and write some tests against it so it stays functional and we catch regressions.

I think it's better to teach people how to set up common configurations than to try to embed this in ocrmypdf. Some people have complex renaming rules: what someone might want to happen to a new file in a watched folder could easily need to be fully programmable and Turing complete.

I haven't done it for a while but my past experience was that Docker wasn't always reliable at signalling file system events with external volumes, especially if a network share is involved.

ianalexander · 2019-12-23T00:34:57Z

> I haven't done it for a while but my past experience was that Docker wasn't always reliable at signalling file system events with external volumes

This is helpful; I didn't know about this. After a little googling, it looks like this is only an issue on Windows, because CIFS does not support inotify events. I'll whip up a little proof of concept to test.
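Where file system events do not reach the container, one common fallback is to poll the directory instead of relying on inotify. watchdog ships a polling observer (`watchdog.observers.polling.PollingObserver`) for this purpose; the sketch below shows the same idea using only the standard library. The function names and the 5-second interval are assumptions made for the example.

```python
import time
from pathlib import Path


def scan_pdfs(directory):
    """Return the set of PDF files currently present in the directory."""
    return {
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def poll_new_pdfs(directory, seen):
    """Compare the directory against the previous scan.

    Returns (new_files, current_files) so the caller can pass
    current_files back in as `seen` on the next poll.
    """
    current = scan_pdfs(directory)
    return sorted(current - seen), current


def poll_forever(directory, on_new, interval=5.0):
    """Call on_new(path) for each PDF that appears after startup."""
    seen = scan_pdfs(directory)
    while True:
        time.sleep(interval)
        new, seen = poll_new_pdfs(directory, seen)
        for path in new:
            on_new(path)
```

Polling costs a directory listing per interval and reacts more slowly than events, but it does not depend on the volume driver forwarding notifications.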

Also, just to be sure: you're suggesting that we use the actual Python API, as opposed to calling subprocess.run() like in webserver.py?

jbarlow83 · 2019-12-23T01:42:12Z

Yes. While webserver.py is just a toy implementation, I definitely wouldn't want to run ocrmypdf in a process that's also supposed to be serving web pages. The proper way to do it would be some sort of message queue where OCR workers can be isolated, possibly in Docker containers or on different hardware.
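For readers comparing the two approaches discussed here, a minimal side-by-side sketch follows. The helper names are hypothetical; `ocrmypdf.ocr()` is the library's Python entry point, and the keyword options passed to it (for example `deskew=True`) are assumed to mirror the CLI flags.

```python
import subprocess


def cli_command(input_pdf, output_pdf, extra_args=()):
    """Build the argument list for invoking the ocrmypdf CLI."""
    return ["ocrmypdf", *extra_args, str(input_pdf), str(output_pdf)]


def ocr_via_cli(input_pdf, output_pdf, extra_args=()):
    """Spawn ocrmypdf as a child process and return its exit code."""
    cmd = cli_command(input_pdf, output_pdf, extra_args)
    return subprocess.run(cmd, check=False).returncode


def ocr_via_api(input_pdf, output_pdf, **options):
    """Run OCR in the current process through the Python API."""
    # Imported lazily so the CLI helpers work without the package installed.
    import ocrmypdf

    return ocrmypdf.ocr(input_pdf, output_pdf, **options)
```

The CLI route isolates each job in its own process at the cost of parsing exit codes; the API route gives Python exceptions and return values but runs the work inside the calling process.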

ianalexander · 2019-12-23T04:25:21Z

PR opened. I'll close this for now and we can move discussion there.

ianalexander mentioned this issue Dec 23, 2019

Add watched directory functionality #466

Closed

ianalexander closed this as completed Dec 23, 2019