Add watched directory functionality #466

Closed
wants to merge 2 commits into from

Conversation

ianalexander
Contributor

This PR is a follow-up to issue #465.

It adds watched directory functionality to OCRmyPDF, with the following changes:

  1. watcher.py: This script uses Python's watchdog package to watch a folder for new files and immediately sends them to OCRmyPDF. Results are stored in a separate output folder. Optionally, output files may be organized into subfolders by year and month. All of this is configurable through environment variables.
  2. Dockerfile: The Docker image build now installs the new requirements (listed in requirements/watcher.txt), so the image can be launched directly with the watcher.
  3. docs: I've updated the existing watched folders section to reference the Docker image with the watcher.py script, including usage examples.
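
For readers who want a feel for the year/month option described in item 1, here is a minimal sketch of how the output path could be derived. It is not the code from this PR: only OCR_OUTPUT_DIRECTORY_YEAR_MONTH appears in the description, while OCR_INPUT_DIRECTORY, OCR_OUTPUT_DIRECTORY, the /input and /output defaults, and the helper names read_config and output_path_for are assumptions made for illustration.

```python
import os
from datetime import datetime
from pathlib import Path


def read_config(env=os.environ):
    """Read watcher settings from environment variables.

    OCR_OUTPUT_DIRECTORY_YEAR_MONTH comes from the PR description; the two
    directory variable names and their defaults are assumptions.
    """
    flag = env.get("OCR_OUTPUT_DIRECTORY_YEAR_MONTH", "0")
    return {
        "input_dir": Path(env.get("OCR_INPUT_DIRECTORY", "/input")),
        "output_dir": Path(env.get("OCR_OUTPUT_DIRECTORY", "/output")),
        "year_month": flag.strip().lower() in ("1", "true", "yes"),
    }


def output_path_for(input_file, output_dir, year_month, now=None):
    """Return where the OCR result for input_file would be written.

    With year_month enabled, the file lands in <output_dir>/<YYYY>/<MM>/.
    The `now` parameter exists only so the date can be fixed in tests.
    """
    target_dir = Path(output_dir)
    if year_month:
        now = now or datetime.now()
        target_dir = target_dir / f"{now.year}" / f"{now.month:02d}"
    return target_dir / Path(input_file).name
```

A real watcher would still need to create the target folder (for example with Path.mkdir(parents=True, exist_ok=True)) before handing the input and output paths to OCRmyPDF; that part is left out here.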

Here is how you test this:

git clone
cd OCRmyPDF
docker build -f .docker/Dockerfile .
docker run \
   -v $(pwd)/tmp/incoming/:/input \
   -v $(pwd)/tmp/:/output \
   -e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
   -it --entrypoint python3 \
   <image id> \
   watcher.py

After this, drop a PDF into ./tmp/incoming/. The output should look something like:

Starting OCRmyPDF watcher with config:
Input Directory: /input
Output Directory: /output
Output Directory Year & Month: True
New file: /input/input.pdf.
Attempting to OCRmyPDF to: /output/2019/12/input.pdf
Scan: <snipped>
OCR: <snipped>
...

Afterwards, verify in ./tmp/ that the output file was OCR'ed correctly.

What do you think?

@ianalexander
Contributor Author

Good feedback. I've updated the code to use pathlib instead of os and included the old content in the docs.
