Batch processing with OCRmyPDF using Synology NAS (DS216+II) and docker #180

Closed
Enantiomerie opened this issue Jul 31, 2017 · 10 comments

Comments

@Enantiomerie

Enantiomerie commented Jul 31, 2017

Thank you for your great work. After 9 months of trial and error (sorry, I am a newbie) I managed to adapt the script you already provided so that it uses only the tools provided by Synology.
Basically, this should work on any platform that uses Docker.

Now you can scan your document to a NAS folder (1st argument). The script creates an OCRed PDF and moves both the original PDF and the OCRed PDF to an archive folder (2nd argument).

The biggest obstacle was the permission concept.
I hope this script works for you (Synology +) Docker people.
Feel free to optimize the script.

diskstation_chron_ocrmypdf

#!/bin/env python3

# script needs 2 arguments
# 1. source dir with *.pdf - default is location of script
# 2. move dir where *.pdf and *_OCR.pdf are moved to

import logging
import os
import subprocess
import sys
import time
import shutil

script_dir = os.path.dirname(os.path.realpath(__file__))
timestamp = time.strftime("%Y-%m-%d-%H%M_")
log_file = script_dir + '/' + timestamp + 'ocrmypdf.log'
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s', filename=log_file, filemode='w')

if len(sys.argv) > 1:
    start_dir = sys.argv[1]
else:
    start_dir = '.'

for dir_name, subdirs, file_list in os.walk(start_dir):
    logging.info('\n')
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            file_noext = os.path.splitext(filename)[0]
            timestamp_OCR = time.strftime("%Y-%m-%d-%H%M_OCR_")
            filename_OCR = timestamp_OCR + file_noext + '.pdf'
            docker_mount = dir_name + ':/home/docker'
            # Build the docker command for PDF processing.
            # The DiskStation needs a user:group docker:docker. Find the uid:gid of
            # your DiskStation's docker:docker with `id docker` and use it in the -u flag.
            # docker:docker also needs read/write rights on the source dir.
            # The script is run as root via cron.
            cmd = ['docker', 'run', '--rm', '-v', docker_mount, '-u="1030:65538"', 'jbarlow83/ocrmypdf', , '--deskew' , filename, filename_OCR]
            logging.info(cmd)
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
            result = proc.stdout.read()
            logging.info(result)
            full_path_OCR = dir_name + '/' + filename_OCR
            os.chmod(full_path_OCR, 0o666)
            os.chmod(full_path, 0o666)
            full_path_OCR_archive = sys.argv[2]
            full_path_archive = sys.argv[2] + '/no_ocr'
            shutil.move(full_path_OCR,full_path_OCR_archive)
            shutil.move(full_path, full_path_archive)
logging.info('Finished.\n')
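
One detail the script above leaves implicit: it moves files into the archive folder (2nd argument) and into a `no_ocr` subfolder of it. If those folders do not exist, `shutil.move` treats the destination as a new file name rather than a directory. A minimal sketch of a guard, assuming a hypothetical helper named `prepare_archive_dirs` (not part of the original script):

```python
import os


def prepare_archive_dirs(archive_dir):
    """Create the archive folder and its 'no_ocr' subfolder if they are missing.

    Hypothetical helper for illustration; the folder layout mirrors the
    script above (OCRed PDFs in archive_dir, originals in archive_dir/no_ocr).
    """
    no_ocr_dir = os.path.join(archive_dir, 'no_ocr')
    # makedirs also creates archive_dir; exist_ok=True makes repeat runs safe.
    os.makedirs(no_ocr_dir, exist_ok=True)
    return archive_dir, no_ocr_dir
```

Calling this once before the `os.walk` loop would be enough. On a DiskStation the created folders would still need suitable permissions for the docker user.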
@jbarlow83
Collaborator

According to the Docker documentation, it looks like you can do

docker run .... --user="docker:docker" ...

to avoid hardcoding uids.

@jbarlow83
Collaborator

This would also require an x86 based Synology.

I'll make a note in docs that some people may need to specify a user for docker.

@Enantiomerie
Author

I tested -u="docker:docker" but it didn't work, while -u="uid:gid" worked perfectly fine.
It seems the mapping from docker:docker to uid:gid failed, but I am not expert enough to figure out why.
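
A likely reason for the difference: Docker resolves user and group names against the passwd and group files inside the container image, not those of the host, so a host-only account such as docker:docker cannot be found by name. One way to avoid hardcoding the numbers is to look them up on the host and pass them numerically. This is a sketch only; `docker_user_arg` is a hypothetical helper, and it assumes the account and group exist on the machine running the script:

```python
import grp
import pwd


def docker_user_arg(user_name, group_name):
    """Return 'uid:gid' for a host user and group, for use with docker's -u flag.

    Hypothetical helper; raises KeyError if the user or group is unknown
    on the host.
    """
    uid = pwd.getpwnam(user_name).pw_uid
    gid = grp.getgrnam(group_name).gr_gid
    return '{}:{}'.format(uid, gid)
```

In the script, `docker_user_arg('docker', 'docker')` could then replace the literal uid:gid string.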

The x86-based Synology models are the only ones that offer Docker via the internal app store.

@Enantiomerie
Author

I used my Synology DS216 for the test, so other Linux distributions might work fine.

@jbarlow83
Collaborator

Added this script to the documentation, along with notes on implementation.

@tuxflo

tuxflo commented Oct 10, 2017

Two things (since I tried to do something similar for QNAP NAS systems):
1: You should switch from cron to incron, which can watch a directory for changes.
2: You should check and wait for the files to be finished before processing them. Some scanners (like an HP multifunction printer) generate multiple open/close events before the complete file has been transmitted to the network share.
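
The second point could be approximated by polling the file size until it stops changing. This is only a heuristic sketch, not something from the original script: `wait_until_stable` and its parameters are hypothetical, and a scanner that pauses longer than the polling window would still fool it.

```python
import os
import time


def wait_until_stable(path, interval=2.0, checks=3, timeout=120.0):
    """Return True once the file is non-empty and its size has stayed the
    same for `checks` consecutive polls; return False on timeout.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file missing or not readable yet
        if size > 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            # Size changed (or file is empty/missing): start counting again.
            stable = 0
            last_size = size
        time.sleep(interval)
    return False
```

The script could call this on each PDF before starting the docker command and skip the file for this run if it returns False.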

@mckar

mckar commented Nov 26, 2017

I can't get this working on my DS918plus with the latest updates. I created the group docker and the user docker in DSM, then checked the IDs over SSH:
root@DiskStation:/volume1/docker/ocrmypdf/in# id docker
uid=1030(docker) gid=100(users) groups=100(users),65536(docker)
The user and group have read/write access to /volume1/docker.

Log output is:
2017-11-26 15:19:21,959
2017-11-26 15:19:21,960 /volume1/docker/ocrmypdf/in
2017-11-26 15:19:21,960 ['docker', 'run', '--rm', '-v', '/volume1/docker/ocrmypdf/in:/home/docker', '-u="1030:65536"', 'jbarlow83/ocrmypdf', '--deskew', 'Test.pdf', '2017-11-26-1519_OCR_Test.pdf']
2017-11-26 15:19:24,786 b'docker: Error response from daemon: linux spec user: unable to find user "1030: no matching entries in passwd file.\n'

Is the script still working on your diskstation?

@Atredis76

Hi.
I think there is an error in your code.

This is the log output on my DS918+:
File "/volume1/script/1.py", line 41 cmd = ['docker', 'run', '--rm', '-v', docker_mount, '-u="1030:65538"', 'jbarlow83/ocrmypdf', , '--deskew' , filename, filename_OCR] ^ SyntaxError: invalid syntax

When I delete the space and comma after 'jbarlow83/ocrmypdf' in your code, it starts without error messages.
My change:

cmd = ['docker', 'run', '--rm', '-v', docker_mount, '-u="1030:65538"', 'jbarlow83/ocrmypdf', '--deskew' , filename, filename_OCR]

Now your log file says:
2017-11-26 17:38:28,926 Finished.

But nothing is done.

This is how I start the script in the Task Scheduler:

python /volume1/script/1.py document_inbox_ocrscript archive
A manual start with the docker command doesn't work either.

The uid:gid is changed to match my environment.
The user and the group docker have access to the same directories as shown in your description above, and the same directories exist on my DS.

Maybe you have a solution for this.

@mckar
Have you created the input and output folders and given them read and write permissions?

@mckar

mckar commented Nov 26, 2017

Hi,
I also removed this part of the cmd. I think your script command does not work; you need to give the full paths, like:
/volume1/script/ocrmypdf.py /volume1/docker/ocrmypdf/in /volume1/docker/ocrmypdf/in/archive
If you see "Finished" in the log file and no files were processed, then it did not find any PDFs.
I ran the "docker run ..." command in an SSH terminal as root and it worked, although I can't remember whether I did it with "-u" or without, because I did a lot of testing over the last few hours. Maybe it has something to do with being started from a Python script.
Markus
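
The "Finished with nothing done" symptom is consistent with how `os.walk` behaves: by default it silently yields nothing for a directory it cannot find, so a relative path resolved against the wrong working directory produces no error. A sketch of a stricter lookup, assuming a hypothetical helper `find_pdfs` that is not part of the original script:

```python
import os


def find_pdfs(start_dir):
    """Return absolute paths of all PDFs below start_dir.

    Hypothetical helper: resolves start_dir to an absolute path, fails loudly
    if it is not a directory, and matches the extension case-insensitively.
    """
    start_dir = os.path.abspath(start_dir)
    if not os.path.isdir(start_dir):
        raise NotADirectoryError(start_dir)
    pdfs = []
    for dir_name, _subdirs, file_list in os.walk(start_dir):
        for filename in sorted(file_list):
            if os.path.splitext(filename)[1].lower() == '.pdf':
                pdfs.append(os.path.join(dir_name, filename))
    return pdfs
```

Logging the resolved directory and the number of PDFs found would make a wrong path visible in the log file right away.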

@Atredis76

Atredis76 commented Nov 26, 2017

OK, perfect.
The full path was the solution.
Then I got the same error message.
Delete the quotation marks around uid:gid.
Wrong: -u="1030:65538"
Right: -u=1030:65538

Now it is doing something.
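
Why the quotes matter: when `subprocess` is given a list, no shell is involved, so the quotation marks are not stripped and reach docker as part of the user name. That matches the `unable to find user "1030` error earlier in this thread. A sketch of the command with both fixes applied (stray comma removed, no embedded quotes); `build_docker_cmd` is a hypothetical helper, and the uid:gid value must be replaced with your own:

```python
def build_docker_cmd(mount_dir, uid_gid, in_name, out_name):
    """Build the docker argument list used by the script, one item per argument.

    Hypothetical helper; uid_gid is a plain string such as '1030:65538',
    without any quotation marks.
    """
    return [
        'docker', 'run', '--rm',
        '-v', mount_dir + ':/home/docker',
        '-u', uid_gid,  # flag and value as separate list items
        'jbarlow83/ocrmypdf', '--deskew',
        in_name, out_name,
    ]
```

The resulting list can be passed to `subprocess.Popen` exactly as in the original script.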

So easy, and it took 2 days to figure out.

One question: what is the marker for German? I used -L deu, but it doesn't recognize deu.

So I'll search for it.
