[Feature]: Make Ghostscript Colour Conversion Configurable #1143

Closed
marcules opened this issue Aug 27, 2023 · 1 comment

marcules commented Aug 27, 2023

Describe the proposed feature

Hello,

I'm using OCRmyPDF through the DMS https://github.com/paperless-ngx/paperless-ngx/ and found some colour issues with specific PDFs. In particular, bank statements generated by DKB AG are rendered with the wrong colours after being converted to PDF/A with OCRmyPDF.

I've seen this issue already reported in pre-fork paperless-ng: jonaswinkler/paperless-ng#1248
There, the suggestion was to switch the output type from PDF/A to PDF, and the theory was that ocrmypdf, or rather unpaper, was producing undesired PDF artefacts: jonaswinkler/paperless-ng#1490

I dug a bit deeper and found that this is not an issue with unpaper or ocrmypdf at all, but rather with PDF.js (and Chrome's built-in PDF renderer) and the ICC profiles generated by Ghostscript.
mozilla/pdf.js#2856
mozilla/pdf.js#7905
mozilla/pdf.js#9940
possibly https://bugs.chromium.org/p/chromium/issues/detail?id=401118


Both the original bank statement and the ocrmypdf converted bank statements look fine if I open them on my PC with Evince, but the converted one has the aforementioned issues within PDF.js and Chromium.

Original Bank Statement opened in pdfbox PDFDebugger containing "Pantone" and "HKS" coloured objects:
[Screenshot: 2023-08-27_16-10]

OCRmyPDF converted Bank Statement (with -sColorConversionStrategy=LeaveColorUnchanged) containing both those coloured objects, but as converted ICC profiles:
[Screenshot: 2023-08-27_16-19]

Rendering error with PDF.js (cannot interpret the colours with ICC profiles and renders the light blue HKS colour as red and the dark blue Pantone colour as yellow):
[Screenshot: 2023-08-27_16-15]


I've also found some rendering bugs that could be related to this change, 327df5c (#679), possibly via an ocrmypdf library upgrade:
paperless-ngx/paperless-ngx#4004
paperless-ngx/paperless-ngx#4056
paperless-ngx/paperless-ngx#3933

(Probably unrelated)


When I manually set that parameter back to RGB, the generated PDF/A document renders fine in PDF.js and Chromium. I tried setting the environment variable GS_OPTIONS (https://ghostscript.com/docs/9.54.0/Use.htm#Environment_variables) for Ghostscript directly, but because ocrmypdf calls the subprocess with the option explicitly set to LeaveColorUnchanged, I'm unable to override it.
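
Why the environment variable has no effect can be modelled in a few lines. This is a simplified sketch, not OCRmyPDF's or Ghostscript's actual code. It assumes that Ghostscript processes `GS_OPTIONS` before the real command-line arguments (as the linked documentation states) and that a later `-s` switch overrides an earlier one, which matches the behaviour described above. The helper name `effective_switch` is hypothetical.

```python
import shlex


def effective_switch(name, gs_options_env, cli_args):
    """Return the value of -s<name>=... that ends up in effect.

    Model (assumption): options from GS_OPTIONS are processed first,
    then the explicit command-line arguments; the last definition wins.
    """
    prefix = f"-s{name}="
    value = None
    for arg in shlex.split(gs_options_env) + list(cli_args):
        if arg.startswith(prefix):
            value = arg[len(prefix):]
    return value


# What the user tries to set through the environment.
env = "-sColorConversionStrategy=RGB"

# What the caller passes explicitly (abbreviated, illustrative argument list).
cli = [
    "-dBATCH",
    "-dNOPAUSE",
    "-sDEVICE=pdfwrite",
    "-sColorConversionStrategy=LeaveColorUnchanged",
]

print(effective_switch("ColorConversionStrategy", env, cli))
```

Under this model the explicit argument always shadows the environment setting, so the value has to be changed where the argument list is built.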

paperless-ngx already has an option for passing additional parameters to ocrmypdf through the environment variable PAPERLESS_OCR_USER_ARGS (https://docs.paperless-ngx.com/configuration/#ocr), so I was wondering if we could create a similar option, ghostscript_args, like the one that already exists for unpaper (unpaper_args).
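
A minimal sketch of what such an option could look like where the Ghostscript arguments are assembled. Everything on the Python side is hypothetical: `build_gs_pdfa_args`, its parameters, and the `extra_args` pass-through are illustrative names, not OCRmyPDF's API. Only the Ghostscript switches themselves (`-sDEVICE=pdfwrite`, `-dPDFA`, `-sColorConversionStrategy`, `-sOutputFile`) are real, and the list is deliberately incomplete for an actual PDF/A conversion.

```python
# Values accepted by Ghostscript's pdfwrite device for this switch.
ALLOWED_STRATEGIES = {
    "LeaveColorUnchanged",
    "RGB",
    "CMYK",
    "Gray",
    "UseDeviceIndependentColor",
}


def build_gs_pdfa_args(
    input_file,
    output_file,
    color_conversion_strategy="LeaveColorUnchanged",
    extra_args=(),
):
    """Build an illustrative Ghostscript argument list (hypothetical helper).

    The strategy is a parameter instead of a hard-coded constant, and
    extra_args models a ghostscript_args-style pass-through.
    """
    if color_conversion_strategy not in ALLOWED_STRATEGIES:
        raise ValueError(
            f"unsupported colour conversion strategy: {color_conversion_strategy!r}"
        )
    return [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFA=2",
        f"-sColorConversionStrategy={color_conversion_strategy}",
        *extra_args,
        f"-sOutputFile={output_file}",
        input_file,
    ]


# Default keeps the current behaviour; callers can opt in to RGB.
default_args = build_gs_pdfa_args("in.pdf", "out.pdf")
rgb_args = build_gs_pdfa_args("in.pdf", "out.pdf", color_conversion_strategy="RGB")
```

Keeping `LeaveColorUnchanged` as the default would leave existing behaviour untouched, while a dedicated, validated parameter is easier to document than a free-form argument list.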

As a workaround for now, I have manually edited the strategy, but it would be cool if this were something that could be passed in from the outside as an argument.

@jbarlow83
Collaborator

Added in v15
