New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Watched folder bug fixes, new flags, and docs updates. #479
Conversation
Added an additional commit to upgrade logging from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concerned about the file size = 0 problem...
PATTERNS = ['*.pdf'] | ||
|
||
logging.basicConfig(level=LOGLEVEL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ocrmypdf.configure_logging
in the ifmain block
# watchdog event before the file is actually fully on disk, causing | ||
# pikepdf to fail. | ||
current_size = None | ||
while current_size != new_file.stat().st_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is defeated by opening a Python terminal and running
f = open('file.pdf', 'wb')
in the input directory. In other words if a program open a file and has any sizable wait before it begins writing then, we'll trigger on a file size of zero. Or during writing a delay of more than 1 second is entirely possible on a slow network. In some file systems even opening a folder with >1000 files can see this kind of delay.
This is really something watchdog should figure out for us....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also consider the following:
- Samba client opens new file for writing.
- Samba client writes some bytes.
- Watcher enters loop waiting for file size to change.
- Samba client crashes/goes offline without finishing.
- Watcher is stuck indefinitely waiting for that one file, even as other files are changed.
|
||
|
||
class HandleObserverEvent(PatternMatchingEventHandler): | ||
def on_any_event(self, event): | ||
if event.event_type in ['created', 'modified']: | ||
if event.event_type in ['created']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if I modify a file in the input folder, I want it to be reprocessed.
@@ -224,6 +227,9 @@ convert it to a OCRed PDF in ``/output/``. The parameters to this image are: | |||
"``-v <path to files to convert>:/input``", "Files placed in this location will be OCRed" | |||
"``-v <path to store results>:/output``", "This is where OCRed files will be stored" | |||
"``-e OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1``", "This will place files in the output in {output}/{year}/{month}/{filename}" | |||
"``-e OCR_ON_SUCCESS_DELETE=1``", "This will delete the input file if the exit code is 0 (OK)" | |||
"``-e OCR_DESKEW=1``", "This will enable deskew for crooked PDFs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of agree with having an option to deskew baked into the script since it's one of the most desirable to have, but there are tons of options in ocrmypdf and we don't want to set up environment vars for each. I think the idea should be that you take this script and customize it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be something like inwebservice.py
where we could pass all arguments to the main program by calling it through
cmd_args = [arg for arg in os.getenv('OCR_OPTIONS')]
ocrmypdf_args = ["ocrmypdf", *cmd_args, file_path, output_path]
proc = run(ocrmypdf_args, stdout=PIPE, stderr=PIPE, encoding="utf-8")
This PR still provides improvements so I've merged most of the changes, with some changes of my own. I'll keep this PR open for now while we debate other improvements. |
This pull request is a followup to #466.
This PR fixes 2 bugs in
watcher.py
and adds additional functionality.Bug fixes:
Fix double-fire on create and modify events
: theHandleObserverEvent
fires on bothcreate
andmodified
events. Depending on how the file is written to disk, this may cause 2 events to be fired, OCR'ing the PDF twice. This fix listens only for thecreate
event and only fires once.Fix FileInput error when reading from docker volume
: Non-deterministicly, watchdog may fire acreate
even when a file is not fully written to disk on a docker volume. We fix this by watching the file size every second and waiting for new data to stop being written to disk. This solution works 100% reliabily my testing, both docker and non-docker.New functonality:
ON_SUCCESS_DELETE
: Added a flag to delete the input file when ocrmypdf returns with a 0 (success) error codeDESKEW
: Added a flag to enable or disable DESKEWDocs update with PYTHONBUFFERED
: By default, python3 buffers output to STDOUT. This prevents us from seeing whatwatcher.py
is doing in the docker logs. I added an explaination to the docs that show how to enable this and see the python output in the logs.