[BUG] ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([Decimal('337.51312')]),) #2894

Closed
hzxie opened this issue Mar 15, 2023 · 7 comments
Labels: bug, unconfirmed

hzxie commented Mar 15, 2023

Description

Uploading a particular PDF file fails with the error ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([Decimal('337.51312')]),).

Steps to reproduce

  1. Download the PDF file here.
  2. Upload to Paperless-ngx.
  3. The same error message appears.

Webserver logs

[2023-03-15 16:26:57,834] [ERROR] [paperless.consumer] Error while consuming document 2112.05504.pdf: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([Decimal('337.51312')]),)
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 321, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 332, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 378, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo
    return PdfInfo(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 932, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 709, in _pdf_pageinfo_concurrent
    executor(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 87, in __call__
    self._execute(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
    result = future.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 666, in _pdf_pageinfo_sync
    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 746, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 792, in _gather_pageinfo
    for info in _process_content_streams(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 594, in _process_content_streams
    yield from _find_form_xobject_images(pdf, container, contentsinfo)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 541, in _find_form_xobject_images
    yield from _process_content_streams(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 586, in _process_content_streams
    contentsinfo = _interpret_contents(container, initial_shorthand)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents
    ctm = PdfMatrix(operands) @ ctm
  File "/usr/local/lib/python3.9/site-packages/pikepdf/models/matrix.py", line 56, in __init__
    raise ValueError('invalid arguments: ' + repr(args))
ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([Decimal('337.51312')]),)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 385, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 379, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}") from e
documents.parsers.ParseError: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([Decimal('337.51312')]),)
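
For context, the last frames of the traceback appear to involve the PDF cm (concatenate matrix) operator, which takes six numeric operands. Here the matrix constructor received a single operand, 337.51312. The following is a minimal Python sketch of that kind of check, not the actual pikepdf or OCRmyPDF code; the helper name cm_operands_to_matrix is hypothetical.

```python
from decimal import Decimal


def cm_operands_to_matrix(operands):
    """Validate the operands of a PDF `cm` operator (hypothetical helper).

    The operator takes six numbers a b c d e f, which describe the matrix
        [a b 0]
        [c d 0]
        [e f 1]
    Any other operand count is rejected with ValueError.
    """
    if len(operands) != 6:
        raise ValueError("invalid arguments: " + repr(operands))
    a, b, c, d, e, f = (Decimal(str(v)) for v in operands)
    return ((a, b, Decimal(0)), (c, d, Decimal(0)), (e, f, Decimal(1)))


# A well-formed operand list is accepted.
identity = cm_operands_to_matrix([1, 0, 0, 1, 0, 0])

# A single operand, as in the log above, is rejected.
try:
    cm_operands_to_matrix([Decimal("337.51312")])
except ValueError as exc:
    message = str(exc)
```

If this reading is right, the content stream of the uploaded file contains a cm instruction with too few numbers, which would explain why stricter parsers reject it.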


Browser logs

N/a

Paperless-ngx version

1.13.0

Host OS

Arch Linux x86_64

Installation method

Docker - official image

Browser

Chrome 111.0.5563.64 (Official Build) (arm64)

Configuration changes

No response

Other

No response

hzxie added the bug and unconfirmed labels Mar 15, 2023
hzxie (Author) commented Mar 15, 2023

I tried the workaround from #2394, running qpdf --replace-input input-file-name.pdf, but it does not work for me.

stumpylog (Member) commented

The issue is pretty clearly with the input PDF or pikepdf and not with paperless, so there's nothing we'll be able to do. I would suggest seeing if upstream pikepdf is able to fix it: https://github.com/pikepdf/pikepdf/issues

stumpylog closed this as not planned Mar 15, 2023
hzxie (Author) commented Mar 16, 2023

@stumpylog
Would it be feasible to handle the exception instead of failing with an error? That would be better than the document not being importable into Paperless-ngx at all.

stumpylog (Member) commented

Not really. It's a very generic exception, not specific to pikepdf. The failure also happens pretty early in processing, from what I can tell, so there wouldn't be any document text to extract.

You can work around it (probably) by setting skip_noarchive for this file. pdftotext complains about badly formatted numbers, but it appears to not be as strict as pikepdf.
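
One way to check whether a more lenient reader can still get text out of such a file is to run pdftotext directly. The sketch below is only an illustration: it assumes poppler's pdftotext is on PATH, the helper name extract_text_lenient is hypothetical, and it is not how Paperless-ngx itself implements skip_noarchive.

```python
import shutil
import subprocess


def extract_text_lenient(pdf_path):
    """Return the text pdftotext extracts, or None if it is unavailable or fails."""
    if shutil.which("pdftotext") is None:
        return None  # pdftotext is not installed
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # pdftotext could not read the file
    return result.stdout
```

Note that skip_noarchive is a value of the PAPERLESS_OCR_MODE setting, which applies to the whole instance rather than to a single file.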

hzxie (Author) commented Mar 17, 2023

Instead of raising an exception and not saving the file, you could consider logging a warning and saving the file anyway.

stumpylog (Member) commented

As mentioned above, the exception is not raised by our code and is a generic exception, so it is basically impossible to determine what went wrong, and we can't decide when it would be safe not to error out.
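
To illustrate the point about generic exceptions, here is a small Python sketch in which two unrelated failures both surface as ValueError. All function names are hypothetical and none come from Paperless-ngx, OCRmyPDF, or pikepdf.

```python
def parse_matrix(operands):
    """Reject operand lists that do not have exactly six entries."""
    if len(operands) != 6:
        raise ValueError("invalid arguments: " + repr(operands))
    return tuple(operands)


def parse_page_number(text):
    """int() raises ValueError for non-numeric text."""
    return int(text)


def classify(func, arg):
    """Report only what a caller can learn from the exception type."""
    try:
        func(arg)
    except ValueError as exc:
        return type(exc).__name__
    return "ok"


matrix_failure = classify(parse_matrix, [337.51312])
number_failure = classify(parse_page_number, "abc")
```

Both calls yield the same exception type, so a handler that catches ValueError cannot tell a malformed matrix from any other bad value without matching on the message text, which is brittle.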

github-actions (Contributor) commented

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

github-actions bot locked this as resolved and limited conversation to collaborators Apr 17, 2023