Describe the proposed feature
Hello,
I'm using OCRmyPDF through the DMS https://github.com/paperless-ngx/paperless-ngx/ and found some issues with colouring in specific PDFs. In particular, the bank statements that are generated through DKB AG are rendered with the wrong colours when converted to PDF/A with OCRmyPDF.
I've seen this issue already reported in pre-fork paperless-ng: jonaswinkler/paperless-ng#1248
There, the suggestion was to switch the output type to PDF instead of PDF/A, and the theory was that ocrmypdf, or rather unpaper, was producing some undesired PDF artefacts: jonaswinkler/paperless-ng#1490
I dug a bit deeper and found out that this is not an issue with unpaper or ocrmypdf at all, but rather with PDF.js (and Chrome's built-in PDF renderer) and the ghostscript-generated ICC profiles themselves.
mozilla/pdf.js#2856
mozilla/pdf.js#7905
mozilla/pdf.js#9940
possibly https://bugs.chromium.org/p/chromium/issues/detail?id=401118
Both the original bank statement and the ocrmypdf converted bank statements look fine if I open them on my PC with Evince, but the converted one has the aforementioned issues within PDF.js and Chromium.
Original Bank Statement opened in pdfbox PDFDebugger containing "Pantone" and "HKS" coloured objects:
OCRmyPDF converted Bank Statement (with -sColorConversionStrategy=LeaveColorUnchanged) containing both those coloured objects, but as converted ICC profiles:
Rendering error with PDF.js (cannot interpret the colours with ICC profiles and renders the light blue HKS colour as red and the dark blue Pantone colour as yellow):
I've also found some rendering bugs that could be related to this change: 327df5c (#679), possibly related to an ocrmypdf library upgrade:
paperless-ngx/paperless-ngx#4004
paperless-ngx/paperless-ngx#4056
paperless-ngx/paperless-ngx#3933
(Probably unrelated)
When I set that parameter manually back to RGB, the generated PDF/A document is rendered fine with PDF.js and Chromium. I tried setting the environment variable GS_OPTIONS (https://ghostscript.com/docs/9.54.0/Use.htm#Environment_variables) for ghostscript directly, but as ocrmypdf is calling the subprocess with the attribute explicitly set to LeaveColorUnchanged, I'm unable to override it.
As paperless-ngx already has options for providing additional parameters to ocrmypdf (https://docs.paperless-ngx.com/configuration/#ocr) through an environment variable PAPERLESS_OCR_USER_ARGS, I was wondering if we could create a similar option ghostscript_args, as already exists for unpaper (unpaper_args).
As a workaround for now, I have manually edited the strategy, but it would be great if this could be passed from the outside as an argument.
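To make the request a bit more concrete, here is a rough sketch of how a user-supplied string could be merged into the switches passed to ghostscript. This is not ocrmypdf's actual code: `DEFAULT_GS_ARGS`, `merge_ghostscript_args` and the `ghostscript_args` option itself are hypothetical names, and the real command line contains many more switches than the one shown.

```python
import shlex

# Hypothetical default, mirroring the switch mentioned in this issue.
DEFAULT_GS_ARGS = ["-sColorConversionStrategy=LeaveColorUnchanged"]


def _switch_name(arg: str) -> str:
    """Return the part of a switch before '=', e.g. '-sColorConversionStrategy'."""
    return arg.split("=", 1)[0]


def merge_ghostscript_args(defaults: list[str], user_args: str) -> list[str]:
    """Merge user-supplied switches into the default switches.

    A user switch with the same name as a default replaces it in place;
    any other user switch is appended after the defaults.
    """
    overrides = {_switch_name(a): a for a in shlex.split(user_args)}
    merged = [overrides.pop(_switch_name(d), d) for d in defaults]
    merged.extend(overrides.values())
    return merged


if __name__ == "__main__":
    # The value a user might pass via the proposed (hypothetical) option.
    print(merge_ghostscript_args(DEFAULT_GS_ARGS, "-sColorConversionStrategy=RGB"))
```

Replacing the default in place, rather than just appending the user value, avoids passing the same switch to ghostscript twice and relying on which occurrence wins.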