Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watched folder bug fixes, new flags, and docs updates. #479

Merged
merged 2 commits into from Feb 10, 2020

Conversation

ianalexander
Copy link
Contributor

This pull request is a followup to #466.

This PR fixes 2 bugs in watcher.py and adds additional functionality.

Bug fixes:

  1. Fix double-fire on create and modify events: the HandleObserverEvent fires on both create and modified events. Depending on how the file is written to disk, this may cause 2 events to be fired, OCR'ing the PDF twice. This fix listens only for the create event and only fires once.
  2. Fix FileInput error when reading from docker volume: Non-deterministicly, watchdog may fire a create even when a file is not fully written to disk on a docker volume. We fix this by watching the file size every second and waiting for new data to stop being written to disk. This solution works 100% reliabily my testing, both docker and non-docker.

New functonality:

  1. ON_SUCCESS_DELETE: Added a flag to delete the input file when ocrmypdf returns with a 0 (success) error code
  2. DESKEW: Added a flag to enable or disable DESKEW
  3. Docs update with PYTHONBUFFERED: By default, python3 buffers output to STDOUT. This prevents us from seeing what watcher.py is doing in the docker logs. I added an explaination to the docs that show how to enable this and see the python output in the logs.

@ianalexander
Copy link
Contributor Author

Added an additional commit to upgrade logging from print to the actual python logging facility. I also added an env var POLL_NEW_FILE_SECONDS that controls how frequently watchdog should check that a new_file has been fully written to disk, regarding bugfix number 2. This may be useful for people in different (slower) network environments.

Copy link
Collaborator

@jbarlow83 jbarlow83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerned about the file size = 0 problem...

PATTERNS = ['*.pdf']

logging.basicConfig(level=LOGLEVEL)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ocrmypdf.configure_logging in the ifmain block

# watchdog event before the file is actually fully on disk, causing
# pikepdf to fail.
current_size = None
while current_size != new_file.stat().st_size:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defeated by opening a Python terminal and running
f = open('file.pdf', 'wb') in the input directory. In other words if a program open a file and has any sizable wait before it begins writing then, we'll trigger on a file size of zero. Or during writing a delay of more than 1 second is entirely possible on a slow network. In some file systems even opening a folder with >1000 files can see this kind of delay.

This is really something watchdog should figure out for us....

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also consider the following:

  • Samba client opens new file for writing.
  • Samba client writes some bytes.
  • Watcher enters loop waiting for file size to change.
  • Samba client crashes/goes offline without finishing.
  • Watcher is stuck indefinitely waiting for that one file, even as other files are changed.



class HandleObserverEvent(PatternMatchingEventHandler):
def on_any_event(self, event):
if event.event_type in ['created', 'modified']:
if event.event_type in ['created']:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if I modify a file in the input folder, I want it to be reprocessed.

@@ -224,6 +227,9 @@ convert it to a OCRed PDF in ``/output/``. The parameters to this image are:
"``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed"
"``-v <path to store results>:/output``", "This is where OCRed files will be stored"
"``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "This will place files in the output in {output}/{year}/{month}/{filename}"
"``-e OCR_ON_SUCCESS_DELETE=1``", "This will delete the input file if the exit code is 0 (OK)"
"``-e OCR_DESKEW=1``", "This will enable deskew for crooked PDFs"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of agree with having an option to deskew baked into the script since it's one of the most desirable to have, but there are tons of options in ocrmypdf and we don't want to set up environment vars for each. I think the idea should be that you take this script and customize it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be something like inwebservice.py where we could pass all arguments to the main program by calling it through

    cmd_args = [arg for arg in os.getenv('OCR_OPTIONS')]
    ocrmypdf_args = ["ocrmypdf", *cmd_args, file_path, output_path]
    proc = run(ocrmypdf_args, stdout=PIPE, stderr=PIPE, encoding="utf-8")

@jbarlow83
Copy link
Collaborator

This PR still provides improvements so I've merged most of the changes, with some changes of my own. I'll keep this PR open for now while we debate other improvements.

@jbarlow83 jbarlow83 merged commit 3eab161 into ocrmypdf:master Feb 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants