[Question]: How to pass --skip-text to watcher.py in docker container? #1180

Closed
dolorosus opened this issue Oct 26, 2023 · 2 comments

@dolorosus

What were you trying to do?

I tried to pass --skip-text to ocrmypdf via watcher.py.

   environment:
      - OCR_OUTPUT_DIRECTORY_YEAR_MONTH=0
      - OCR_ON_SUCCESS_ARCHIVE=1
      - OCR_DESKEW=1 
      - PYTHONUNBUFFERED=1 
      - OCR_USE_POLLING=1
      - OCR_JSON_SETTINGS='{"skip-text":true}'

But after a fresh build of the Docker container, I now get:

2023-10-26T21:21:54.356524247Z ╭─ Error ──────────────────────────────────────────────────────────────────────╮
2023-10-26T21:21:54.356642450Z │ Invalid value for '--ocr-json-settings': ''{"skip-text":true}'': No such     │
2023-10-26T21:21:54.356668043Z │ file or directory                                                            │
2023-10-26T21:21:54.356689895Z ╰──────────────────────────────────────────────────────────────────────────────╯
2023-10-26T21:21:56.791233943Z Usage: watcher.py [OPTIONS] [INPUT_DIR] [OUTPUT_DIR] [ARCHIVE_DIR]
2023-10-26T21:21:56.793087098Z Try 'watcher.py --help' for help.

in the log file.

How can I pass the skip-text option to ocrmypdf?

Where are you installing from?

Docker container

What operating system are you working on?

Linux

@jbarlow83
Collaborator

ocr_json_settings is currently set up to read a file rather than parse JSON from a string.
I suppose it would be worth having the code try both - but for now, you need to create a separate JSON file.
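
A minimal sketch of the "try both" idea, assuming a hypothetical helper named `load_json_settings` (it is not part of watcher.py): it treats the value as a file path if such a file exists, and otherwise tries to parse it as inline JSON. Stripping one pair of surrounding single quotes is also an assumption, added because the error message above shows the quotes from the compose file arriving as part of the value.

```python
import json
from pathlib import Path


def load_json_settings(value: str) -> dict:
    """Hypothetical helper: accept a path to a JSON file or an inline JSON string."""
    # Compose list-style entries (- KEY='...') pass the quotes through
    # literally, so drop one matching pair of single quotes (assumption).
    if len(value) >= 2 and value[0] == value[-1] == "'":
        value = value[1:-1]

    text = value
    try:
        path = Path(value)
        if path.is_file():
            # The value names an existing file: read the JSON from it.
            text = path.read_text(encoding="utf-8")
    except OSError:
        # The string is not usable as a file name; treat it as inline JSON.
        pass

    settings = json.loads(text)  # raises json.JSONDecodeError on bad input
    if not isinstance(settings, dict):
        raise ValueError("JSON settings must be an object mapping options to values")
    return settings
```

With a helper like this, both `OCR_JSON_SETTINGS={"skip-text":true}` and `OCR_JSON_SETTINGS=/path/to/settings.json` would be accepted; until something similar exists in watcher.py, only the file form works.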

@dolorosus
Author

Sorry for any inconvenience, but could you go into a little more detail?
I created a JSON file with this content:

{
    "skip-text": true
}

and pointed OCR_JSON_SETTINGS to it, but I got this result:

2023-10-27T09:28:13.625344293Z ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2023-10-27T09:28:13.625497773Z │ /app/watcher.py:268 in main                                                  │
2023-10-27T09:28:13.625542329Z │                                                                              │
2023-10-27T09:28:13.625565532Z │   265 │   │   ),                                                             │
2023-10-27T09:28:13.625587421Z │   266 │   │   manage_root_logger=True,                                       │
2023-10-27T09:28:13.625608921Z │   267 │   )                                                                  │
2023-10-27T09:28:13.625629772Z │ ❱ 268 │   log.setLevel(loglevel)                                             │
2023-10-27T09:28:13.625650791Z │   269 │   log.info(                                                          │
2023-10-27T09:28:13.625671550Z │   270 │   │   f"Starting OCRmyPDF watcher with config:\n"                    │
2023-10-27T09:28:13.625692976Z │   271 │   │   f"Input Directory: {input_dir}\n"                              │
2023-10-27T09:28:13.625714124Z │                                                                              │
2023-10-27T09:28:13.625736013Z │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
2023-10-27T09:28:13.625759790Z │ │           archive_dir = PosixPath('/processed')                          │ │
2023-10-27T09:28:13.625781623Z │ │                deskew = True                                             │ │
2023-10-27T09:28:13.625803142Z │ │             input_dir = PosixPath('/input')                              │ │
2023-10-27T09:28:13.625824327Z │ │              loglevel = <LoggingLevelEnum.INFO: 'INFO'>                  │ │
2023-10-27T09:28:13.625847623Z │ │     ocr_json_settings = <_io.TextIOWrapper name='/archive/ocrmypdf.json' │ │
2023-10-27T09:28:13.625868993Z │ │                         mode='r' encoding='UTF-8'>                       │ │
2023-10-27T09:28:13.625890141Z │ │    on_success_archive = True                                             │ │
2023-10-27T09:28:13.625910863Z │ │     on_success_delete = False                                            │ │
2023-10-27T09:28:13.625941697Z │ │            output_dir = PosixPath('/output')                             │ │
2023-10-27T09:28:13.625964900Z │ │ output_dir_year_month = False                                            │ │
2023-10-27T09:28:13.625985937Z │ │              patterns = '*.pdf,*.PDF'                                    │ │
2023-10-27T09:28:13.626006437Z │ │ poll_new_file_seconds = 1                                                │ │
2023-10-27T09:28:13.626027159Z │ │  retries_loading_file = 5                                                │ │
2023-10-27T09:28:13.626115696Z │ │           use_polling = True                                             │ │
2023-10-27T09:28:13.626149844Z │ ╰──────────────────────────────────────────────────────────────────────────╯ │
2023-10-27T09:28:13.626267232Z │ /usr/lib/python3.10/logging/__init__.py:1452 in setLevel                     │
2023-10-27T09:28:13.626289435Z │                                                                              │
2023-10-27T09:28:13.626311676Z │   1449 │   │   """                                                           │
2023-10-27T09:28:13.626333435Z │   1450 │   │   Set the logging level of this logger.  level must be an int o │
2023-10-27T09:28:13.626355379Z │   1451 │   │   """                                                           │
2023-10-27T09:28:13.626377546Z │ ❱ 1452 │   │   self.level = _checkLevel(level)                               │
2023-10-27T09:28:13.626398675Z │   1453 │   │   self.manager._clear_cache()                                   │
2023-10-27T09:28:13.626419805Z │   1454 │                                                                     │
2023-10-27T09:28:13.626440990Z │   1455 │   def debug(self, msg, *args, **kwargs):                            │
2023-10-27T09:28:13.626462490Z │                                                                              │
2023-10-27T09:28:13.626483768Z │ ╭───────────────── locals ──────────────────╮                                │
2023-10-27T09:28:13.626505749Z │ │ level = <LoggingLevelEnum.INFO: 'INFO'>   │                                │
2023-10-27T09:28:13.626528415Z │ │  self = <Logger ocrmypdf-watcher (DEBUG)> │                                │
2023-10-27T09:28:13.626560952Z │ ╰───────────────────────────────────────────╯                                │
2023-10-27T09:28:13.626585878Z │                                                                              │
2023-10-27T09:28:13.626607119Z │ /usr/lib/python3.10/logging/__init__.py:201 in _checkLevel                   │
2023-10-27T09:28:13.626628656Z │                                                                              │
2023-10-27T09:28:13.626650026Z │    198 │   │   │   raise ValueError("Unknown level: %r" % level)             │
2023-10-27T09:28:13.626671470Z │    199 │   │   rv = _nameToLevel[level]                                      │
2023-10-27T09:28:13.626693155Z │    200 │   else:                                                             │
2023-10-27T09:28:13.626714155Z │ ❱  201 │   │   raise TypeError("Level not an integer or a valid string: %r"  │
2023-10-27T09:28:13.626781710Z │    202 │   │   │   │   │   │   % (level,))                                   │
2023-10-27T09:28:13.626812655Z │    203 │   return rv                                                         │
2023-10-27T09:28:13.626834580Z │    204                                                                       │
2023-10-27T09:28:13.626855914Z │                                                                              │
2023-10-27T09:28:13.626877136Z │ ╭──────────────── locals ─────────────────╮                                  │
2023-10-27T09:28:13.626898802Z │ │ level = <LoggingLevelEnum.INFO: 'INFO'> │                                  │
2023-10-27T09:28:13.626920376Z │ ╰─────────────────────────────────────────╯                                  │
2023-10-27T09:28:13.626944246Z ╰──────────────────────────────────────────────────────────────────────────────╯
2023-10-27T09:28:13.626967450Z TypeError: Level not an integer or a valid string: <LoggingLevelEnum.INFO: 
2023-10-27T09:28:13.626988654Z 'INFO'>

What am I doing wrong?
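
The failure in the traceback can be reproduced in isolation. The sketch below assumes a stand-in enum named `LoggingLevelEnum` (the real definition in watcher.py is not shown in this thread) and suggests that `Logger.setLevel` accepts an integer or a level-name string but rejects an enum member, whereas passing the member's `.value` works. Note that `ocr_json_settings` in the locals is already an open file on `/archive/ocrmypdf.json`, so the JSON file itself appears to have been found.

```python
import logging
from enum import Enum


class LoggingLevelEnum(str, Enum):
    """Stand-in for the enum seen in the traceback; its real definition is assumed."""

    DEBUG = "DEBUG"
    INFO = "INFO"


log = logging.getLogger("ocrmypdf-watcher-demo")
loglevel = LoggingLevelEnum.INFO

try:
    # Mirrors watcher.py line 268 from the traceback: the enum member itself
    # is handed to setLevel.
    log.setLevel(loglevel)
    enum_accepted = True
except TypeError:
    enum_accepted = False

# Passing the underlying string ("INFO") is a level name that logging knows.
log.setLevel(loglevel.value)
```

If this matches what watcher.py does, the error would be independent of the JSON settings file, and a change along the lines of `log.setLevel(loglevel.value)` would be one possible fix.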
