Watched folder bug fixes, new flags, and docs updates. #479

ianalexander · 2020-01-20T08:23:03Z

This pull request is a followup to #466.

This PR fixes 2 bugs in watcher.py and adds additional functionality.

Bug fixes:

Fix double-fire on create and modify events: the HandleObserverEvent fires on both create and modified events. Depending on how the file is written to disk, this may cause 2 events to be fired, OCR'ing the PDF twice. This fix listens only for the create event and only fires once.
Fix FileInput error when reading from docker volume: Non-deterministicly, watchdog may fire a create even when a file is not fully written to disk on a docker volume. We fix this by watching the file size every second and waiting for new data to stop being written to disk. This solution works 100% reliabily my testing, both docker and non-docker.

New functonality:

ON_SUCCESS_DELETE: Added a flag to delete the input file when ocrmypdf returns with a 0 (success) error code
DESKEW: Added a flag to enable or disable DESKEW
Docs update with PYTHONBUFFERED: By default, python3 buffers output to STDOUT. This prevents us from seeing what watcher.py is doing in the docker logs. I added an explaination to the docs that show how to enable this and see the python output in the logs.

ianalexander · 2020-01-20T18:47:50Z

Added an additional commit to upgrade logging from print to the actual python logging facility. I also added an env var POLL_NEW_FILE_SECONDS that controls how frequently watchdog should check that a new_file has been fully written to disk, regarding bugfix number 2. This may be useful for people in different (slower) network environments.

jbarlow83

Concerned about the file size = 0 problem...

jbarlow83 · 2020-01-28T20:59:01Z

misc/watcher.py

 PATTERNS = ['*.pdf']

+logging.basicConfig(level=LOGLEVEL)


Use ocrmypdf.configure_logging in the ifmain block

jbarlow83 · 2020-01-28T21:03:36Z

misc/watcher.py

+    # watchdog event before the file is actually fully on disk, causing
+    # pikepdf to fail.
+    current_size = None
+    while current_size != new_file.stat().st_size:


This is defeated by opening a Python terminal and running
f = open('file.pdf', 'wb') in the input directory. In other words if a program open a file and has any sizable wait before it begins writing then, we'll trigger on a file size of zero. Or during writing a delay of more than 1 second is entirely possible on a slow network. In some file systems even opening a folder with >1000 files can see this kind of delay.

This is really something watchdog should figure out for us....

Also consider the following:

Samba client opens new file for writing.

Samba client writes some bytes.

Watcher enters loop waiting for file size to change.

Samba client crashes/goes offline without finishing.

Watcher is stuck indefinitely waiting for that one file, even as other files are changed.

jbarlow83 · 2020-01-28T21:04:26Z

misc/watcher.py



 class HandleObserverEvent(PatternMatchingEventHandler):
    def on_any_event(self, event):
-        if event.event_type in ['created', 'modified']:
+        if event.event_type in ['created']:


I think if I modify a file in the input folder, I want it to be reprocessed.

jbarlow83 · 2020-01-28T21:09:52Z

docs/batch.rst

@@ -224,6 +227,9 @@ convert it to a OCRed PDF in ``/output/``. The parameters to this image are:
    "``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
    "``-v <path to store results>:/output``", "This is where OCRed files will be stored"
    "``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "This will place files in the output in {output}/{year}/{month}/{filename}"
+    "``-e OCR_ON_SUCCESS_DELETE=1``", "This will delete the input file if the exit code is 0 (OK)"
+    "``-e OCR_DESKEW=1``", "This will enable deskew for crooked PDFs"


I kind of agree with having an option to deskew baked into the script since it's one of the most desirable to have, but there are tons of options in ocrmypdf and we don't want to set up environment vars for each. I think the idea should be that you take this script and customize it.

Shouldn't there be something like inwebservice.py where we could pass all arguments to the main program by calling it through

cmd_args = [arg for arg in os.getenv('OCR_OPTIONS')] ocrmypdf_args = ["ocrmypdf", *cmd_args, file_path, output_path] proc = run(ocrmypdf_args, stdout=PIPE, stderr=PIPE, encoding="utf-8")

jbarlow83 · 2020-02-10T09:13:16Z

This PR still provides improvements so I've merged most of the changes, with some changes of my own. I'll keep this PR open for now while we debate other improvements.

ianalexander added 2 commits January 20, 2020 00:20

Watched folder bug fixes, new flags, and docs updates.

b7f38e9

Update logging and env var extensibility

3eab161

jbarlow83 reviewed Jan 28, 2020

View reviewed changes

svenihoney mentioned this pull request Feb 4, 2020

Create a watcher based on shell script #486

Closed

jbarlow83 merged commit 3eab161 into ocrmypdf:master Feb 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Watched folder bug fixes, new flags, and docs updates. #479

Watched folder bug fixes, new flags, and docs updates. #479

ianalexander commented Jan 20, 2020

ianalexander commented Jan 20, 2020

jbarlow83 left a comment

jbarlow83 Jan 28, 2020

jbarlow83 Jan 28, 2020

jbarlow83 Jan 28, 2020

jbarlow83 Jan 28, 2020

jbarlow83 Jan 28, 2020

svenihoney Feb 2, 2020

jbarlow83 commented Feb 10, 2020

Watched folder bug fixes, new flags, and docs updates. #479

Watched folder bug fixes, new flags, and docs updates. #479

Conversation

ianalexander commented Jan 20, 2020

ianalexander commented Jan 20, 2020

jbarlow83 left a comment

Choose a reason for hiding this comment

jbarlow83 Jan 28, 2020

Choose a reason for hiding this comment

jbarlow83 Jan 28, 2020

Choose a reason for hiding this comment

jbarlow83 Jan 28, 2020

Choose a reason for hiding this comment

jbarlow83 Jan 28, 2020

Choose a reason for hiding this comment

jbarlow83 Jan 28, 2020

Choose a reason for hiding this comment

svenihoney Feb 2, 2020

Choose a reason for hiding this comment

jbarlow83 commented Feb 10, 2020