[BUG] PAPERLESS_OCR_CLEAN=clean-final sometimes introduces artifacts / [Feature request] Reprocess documents with different settings #1490

denilsonsa · 2021-12-14T23:45:40Z

Describe the bug

Using PAPERLESS_OCR_CLEAN=clean-final gives good results most of the time, but for some documents it can introduce ugly artifacts. (See examples at the bottom.)

To Reproduce

Steps to reproduce the behavior:

1. Scan a document that happens to trigger this bug. It could be one with a light blue background. YMMV.
2. Make sure PAPERLESS_OCR_CLEAN=clean-final is set.
3. Import the document into paperless-ng.
4. Observe how the document looks ugly, with very distracting large white rectangles.

Expected behavior

I expected two kinds of behaviors:

1. Never generate ugly artifacts.
   - I understand this is near-impossible, and this is more a bug in unpaper than in paperless.
2. Whenever unpaper screws up, provide a way to re-process a document using a different set of parameters.
   - This is something paperless-ng can provide.
   - It could be some UI that allows re-processing (re-archiving) a document, with the option of changing some settings (e.g. most of the OCR settings).
   - For bonus points, paperless could even (pre-)process the document with two different settings (clean and clean-final) and let the user choose which one should be kept for archival.
   - For a quicker solution (easier to implement), just allow the document_archiver administration command to accept new configuration values. (Or… does it already accept them? If yes, then we need documentation on how to run it with values different from those in docker-compose.env.)

Current workaround

Since all configuration settings are hard-coded into a configuration file and can't be changed on-the-fly, we have to go through many steps to fix ugly documents:

1. Notice the document looks ugly. Take note of the document id. (It's in the URL.)
2. Edit docker-compose.env, change PAPERLESS_OCR_CLEAN=clean-final to PAPERLESS_OCR_CLEAN=clean.
3. Restart/rebuild the running docker container: docker-compose up.
4. Wait until it is up and running. It can take a couple of minutes.
5. Ask paperless-ng to reprocess the document: docker-compose exec webserver document_archiver --overwrite --document DOCUMENT_ID_HERE
6. Wait several more minutes.
7. Reload the document on the paperless-ng interface on the web browser. Observe it looks good now.
8. Edit docker-compose.env, reverting the change from step 2.
9. Restart/rebuild the running docker container again: docker-compose up -d

Those are too many steps and it takes too long. :-/

Environment

Installation method: docker-compose
Version: Paperless-ng 1.5.0
Host OS of the machine running paperless: Gentoo Linux for Arm 64, on a Raspberry Pi 4

Related issues:

- [BUG] Certain PDF have changed colors #1248 → I believe reprocessing those documents with different settings could fix the color issue. So, it's somewhat similar to this issue over here.
- [Other] Rerun OCR on uploaded documents #1474 → Re-running the OCR step with a different language would be very convenient, and a very common use-case for people dealing with documents in multiple languages.

Screenshots

"Cleaned up" document, after using unpaper with PAPERLESS_OCR_CLEAN=clean-final:

Original document, exactly as it was scanned, or if PAPERLESS_OCR_CLEAN=clean:


denilsonsa · 2021-12-15T00:22:43Z

Easier workaround

I discovered I can set environment variables at the docker-compose exec command:

docker-compose exec -e PAPERLESS_OCR_CLEAN=clean webserver document_archiver --overwrite --document DOC_ID_HERE

This is way simpler and doesn't require restarting the server. I just wish it was documented somewhere.
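To make the one-liner above easier to reuse, it can be wrapped in a small shell function. This is only a sketch: the function name `reprocess_doc` and the `DRY_RUN` switch are made up for illustration, and it assumes the service is called `webserver` as in the command above.

```shell
#!/bin/sh
# Hypothetical helper around the docker-compose exec workaround.
# reprocess_doc and DRY_RUN are illustrative names, not part of paperless-ng.
reprocess_doc() {
    doc_id="$1"
    clean_mode="${2:-clean}"   # one-off value for PAPERLESS_OCR_CLEAN

    # Accept only a numeric document id (the one visible in the URL).
    case "$doc_id" in
        ''|*[!0-9]*)
            echo "usage: reprocess_doc DOC_ID [CLEAN_MODE]" >&2
            return 2
            ;;
    esac

    # With DRY_RUN set, only print the command instead of running it.
    ${DRY_RUN:+echo} docker-compose exec \
        -e "PAPERLESS_OCR_CLEAN=$clean_mode" \
        webserver document_archiver --overwrite --document "$doc_id"
}
```

For example, `DRY_RUN=1 reprocess_doc 123` prints the command that would be run for document 123, and `reprocess_doc 123 clean-final` re-archives it with clean-final instead.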

denilsonsa · 2021-12-18T21:44:34Z

Note: I believe that (re-)running the document_archiver will overwrite any text that I had manually written into the "Content" tab/textarea, but I can't double-check that right now.

If that's the case, that's very unfortunate. I'd like to be able to reprocess/recreate the archived version of the PDF in order to get rid of graphical glitches, but I may or may not want to overwrite my manually-fixed OCR-ed content text.

sprnza · 2022-01-15T19:27:40Z

Hi! Thanks for the instructions. I've tried to fix my issue described in #1474, but it seems the document's OCR language remains as it was set during upload, so the Content is not being fixed.

christf · 2022-01-31T20:10:08Z

As another workaround, you could use a pre-processing hook as described in the docs and process every inbound document so that a second copy, produced with unpaper, is written to the input directory. Then paperless sees both documents, processes both, and you get to choose. The downside is that you always have to choose.

I do not know the implications of using the API to post a document and the preprocessing / postprocessing scripts but that may be something to try :)
And I am not sure how to avoid endless loops just yet but that may come up when building the script.
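One possible way to avoid the endless loop is to mark the files the hook writes and skip anything that already carries the mark. The sketch below shows only that guard. Everything in it is an assumption for illustration: the names `MARKER`, `CONSUME_DIR`, `make_variant` and `pre_consume` are made up, the default path is a guess at the consume directory inside the container, and `make_variant` is a plain `cp` placeholder that would have to be replaced by a real unpaper-based conversion. It also assumes the hook receives the document path as its first argument.

```shell
#!/bin/sh
# Sketch of a pre-consume hook with a loop guard. All names are hypothetical.
MARKER="${MARKER:-.variant}"                               # marks files written by this hook
CONSUME_DIR="${CONSUME_DIR:-/usr/src/paperless/consume}"   # assumed consume directory

# Placeholder: replace with a real conversion (e.g. an unpaper-based pipeline).
make_variant() {
    cp -- "$1" "$2"
}

pre_consume() {
    src="$1"
    base=$(basename -- "$src")

    # Split "name.ext" into stem and extension; handle files without a dot.
    case "$base" in
        *.*) stem="${base%.*}"; ext=".${base##*.}" ;;
        *)   stem="$base";      ext="" ;;
    esac

    # Loop guard: a file that already carries the marker was produced by
    # this hook, so do not create yet another copy of it.
    case "$stem" in
        *"$MARKER") return 0 ;;
    esac

    make_variant "$src" "$CONSUME_DIR/$stem$MARKER$ext"
}

# When installed as a hook, the document path arrives as the first argument.
if [ "$#" -gt 0 ]; then
    pre_consume "$1"
fi
```

With this guard, `scan.pdf` yields `scan.variant.pdf` in the consume directory, and when the hook later runs on `scan.variant.pdf` it does nothing. Note that an unmodified copy may simply be rejected as a duplicate, so the placeholder really does need to produce different output.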

marcules mentioned this issue Aug 27, 2023

[Feature]: Make Ghostscript Colour Conversion Configurable ocrmypdf/OCRmyPDF#1143

Closed
