Proposal: Add Directory Watching Functionality #465

Closed
ianalexander opened this issue Dec 22, 2019 · 4 comments

Comments

@ianalexander
Contributor

Following the contributing guidelines, I am opening a new issue to discuss potentially adding new functionality to OCRmyPDF.

I want to propose adding a small script that would enable directory watching. The goals would be:

  • Add a script that watches a directory and runs OCRmyPDF on the files that appear in it.
  • Make the script usable primarily in Docker, but also standalone.

I'm thinking the high level design would look like:

  • Use Python's watchdog for the actual monitoring functionality.
  • Write a small script in misc/ (same place webserver.py is) that would execute watchdog with appropriate arguments.
  • Minimum 3 configuration options: input directory, output directory, and OCRmyPDF arguments. Would most likely use environment variables for configuration (which will enable reuse for Docker or standalone).
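As a rough illustration of the third point, the three options could be read from environment variables along these lines. This is only a sketch: the variable names (OCR_INPUT_DIRECTORY, OCR_OUTPUT_DIRECTORY, OCR_ARGUMENTS), the default paths, and the helper names are hypothetical and not taken from any existing script.

```python
import os
import shlex
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class WatcherConfig:
    input_dir: Path
    output_dir: Path
    ocr_args: List[str]


def load_config(env=None) -> WatcherConfig:
    """Build the configuration from environment variables.

    All variable names and defaults here are assumptions for illustration.
    """
    env = os.environ if env is None else env
    return WatcherConfig(
        input_dir=Path(env.get("OCR_INPUT_DIRECTORY", "/input")),
        output_dir=Path(env.get("OCR_OUTPUT_DIRECTORY", "/output")),
        # shlex.split honours shell-style quoting, e.g. "--deskew -l 'eng+deu'"
        ocr_args=shlex.split(env.get("OCR_ARGUMENTS", "")),
    )


def output_path_for(config: WatcherConfig, input_file) -> Path:
    """Place the result in the output directory under the same file name."""
    return config.output_dir / Path(input_file).name
```

Because everything comes from the environment, the same script could be configured with `docker run -e ...` or with plain shell exports when run standalone.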

I think two things worth getting your feedback on are:

  1. This would require adding a new dependency (watchdog). If you prefer a non-Python dependency, something like Facebook's watchman could be used.
  2. I think this belongs in the official repo instead of a separate one, because many of the images I have seen on Docker Hub are poorly implemented and not updated. This would ensure people are using the latest version of OCRmyPDF, etc.

Also just to be clear, I would be writing the PR for this :) What do you think?

@jbarlow83
Collaborator

I agree that this is important enough functionality to have a standard way to do this.

I do have some documentation about using watchdog...
https://ocrmypdf.readthedocs.io/en/latest/batch.html#hot-watched-folders

...but I think it would be useful to expand this documentation to cover many more cases, put some scripts into misc/ as you propose, rewrite that script to use the ocrmypdf API instead of CLI, and write some tests against it so it stays functional and we catch regressions.
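A minimal sketch of what "use the ocrmypdf API instead of CLI" might look like when combined with watchdog is below. The function names (is_pdf, run_ocr, watch) are hypothetical, and the third-party imports are done lazily inside the functions so the small helper can be exercised without either package installed. Note that a creation event can fire before a file has finished being written, so a real script would probably need to wait or retry.

```python
from pathlib import Path


def is_pdf(path) -> bool:
    """True for paths ending in .pdf, case-insensitively."""
    return Path(path).suffix.lower() == ".pdf"


def run_ocr(input_file, output_dir, **ocr_options):
    """OCR one file through the Python API rather than shelling out to the CLI."""
    import ocrmypdf  # lazy import: only needed when OCR actually runs

    output_file = Path(output_dir) / Path(input_file).name
    # ocr_options are passed through as keyword arguments, e.g. deskew=True
    ocrmypdf.ocr(input_file, output_file, **ocr_options)
    return output_file


def watch(input_dir, output_dir, **ocr_options):
    """Block until interrupted, running OCR on each PDF created in input_dir."""
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class PdfHandler(FileSystemEventHandler):
        def on_created(self, event):
            # Caveat: the file may still be mid-copy when this event arrives.
            if not event.is_directory and is_pdf(event.src_path):
                run_ocr(event.src_path, output_dir, **ocr_options)

    observer = Observer()
    observer.schedule(PdfHandler(), str(input_dir), recursive=False)
    observer.start()
    try:
        while observer.is_alive():
            observer.join(timeout=1)
    except KeyboardInterrupt:
        pass
    finally:
        observer.stop()
        observer.join()
```

Keeping the file filter and the OCR call as plain functions is one way to make the script testable without a running observer, which fits the goal of catching regressions.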

I think it's better to teach people how to set up common configurations than to try to embed this in ocrmypdf. Some people have complex renaming rules - what someone might want to happen to a new file in a watched folder could easily need to be fully programmable and Turing complete.

I haven't done it for a while but my past experience was that Docker wasn't always reliable at signalling file system events with external volumes, especially if a network share is involved.
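Where file system events do not arrive reliably, one common workaround is to poll the directory instead (watchdog also ships a PollingObserver in watchdog.observers.polling for this purpose). The following is a stdlib-only sketch of the idea, not something from the existing documentation: the function name poll_once is hypothetical, and "size unchanged between two scans" is an assumed heuristic for "the file has finished copying".

```python
from pathlib import Path


def poll_once(directory, last_sizes, processed):
    """Scan the directory once.

    Returns PDFs whose size is unchanged since the previous scan and that
    have not been returned before. last_sizes (dict) and processed (set)
    are updated in place and should be reused across calls.
    """
    ready = []
    for path in sorted(Path(directory).glob("*.pdf")):
        if path in processed:
            continue
        size = path.stat().st_size
        if last_sizes.get(path) == size:
            # Same size on two consecutive scans: assume the copy is done.
            processed.add(path)
            ready.append(path)
        last_sizes[path] = size
    return ready
```

A caller would invoke this in a loop with a sleep between scans and hand each returned path to the OCR step. Polling costs more than event notification but does not depend on the volume driver forwarding events.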

@ianalexander
Contributor Author

> I haven't done it for a while but my past experience was that Docker wasn't always reliable at signalling file system events with external volumes

This is helpful, I didn't know about this. After a little googling, it looks like this is only an issue on Windows, due to CIFS not supporting inotify events. I'll whip up a little proof of concept to test.

Also, just to be sure: you're suggesting that we use the actual Python API, as opposed to calling subprocess.run() like in webserver.py?

@jbarlow83
Collaborator

Yes. While webserver.py is just a toy implementation, I definitely wouldn't want to run ocrmypdf in a process that's also supposed to be serving web pages. The proper way to do it would be some sort of message queue where OCR workers can be isolated, possibly in Docker containers or on different hardware.
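To make the queue idea concrete, here is a small in-process stand-in. It uses queue.Queue and threads only to show the shape of the pattern (a producer enqueues paths, workers pull and process them); it does not provide the isolation described above, which would need an external message broker and separate worker processes or containers. The names worker and run_jobs are hypothetical, and process is any callable standing in for the OCR step.

```python
import queue
import threading


def worker(jobs, results, process):
    """Pull jobs until a None sentinel arrives, recording result or error."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut this worker down
            break
        try:
            results.append((job, process(job)))
        except Exception as exc:  # keep the worker alive on a failed job
            results.append((job, exc))


def run_jobs(paths, process, n_workers=2):
    """Enqueue every path, let n_workers drain the queue, return the results."""
    jobs = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(jobs, results, process))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for path in paths:
        jobs.put(path)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

The point of the design is that whatever accepts files (a web handler, a directory watcher) only enqueues work and returns immediately, while the slow OCR step runs elsewhere.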

@ianalexander
Contributor Author

PR opened. I'll close this for now and we can move discussion there.
