This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

[BUG] PAPERLESS_OCR_CLEAN=clean-final sometimes introduces artifacts / [Feature request] Reprocess documents with different settings #1490

Open
denilsonsa opened this issue Dec 14, 2021 · 4 comments

Comments

@denilsonsa

Describe the bug

Using PAPERLESS_OCR_CLEAN=clean-final gives good results most of the time, but for some documents it can introduce ugly artifacts. (See examples at the bottom.)

To Reproduce

Steps to reproduce the behavior:

  1. Scan a document that happens to trigger this bug. It could be one with a light blue background. YMMV.
  2. Make sure PAPERLESS_OCR_CLEAN=clean-final is set.
  3. Import the document into paperless-ng.
  4. Observe how the document looks ugly, with very distracting large white rectangles.

Expected behavior

I expected two kinds of behaviors:

  1. Never generate ugly artifacts.
    • I understand this is near-impossible, and that this is more a bug in unpaper than in paperless.
  2. Whenever unpaper screws up, provide a way to re-process a document using a different set of parameters.
    • This is something paperless-ng can provide.
    • It could be some UI that allows re-processing (re-archiving) a document, with the option of changing some settings (e.g. most of the OCR settings).
    • For bonus points paperless could even (pre-)process the document in two different settings (clean and clean-final) and let the user choose which one should be kept for archival.
    • For a quicker solution (easier to implement), just allow the document_archiver administration command to accept new configuration values. (Or… does it already accept? If yes, then we need documentation on how to run it with different values than those from docker-compose.env.)

Current workaround

Since all configuration settings are hard-coded into a configuration file and can't be changed on-the-fly, we have to go through many steps to fix ugly documents:

  1. Notice the document looks ugly. Take note of the document id. (It's in the URL.)
  2. Edit docker-compose.env, change PAPERLESS_OCR_CLEAN=clean-final to PAPERLESS_OCR_CLEAN=clean.
  3. Restart/rebuild the running docker container: docker-compose up.
  4. Wait until it is up and running. It can take a couple of minutes.
  5. Ask paperless-ng to reprocess the document: docker-compose exec webserver document_archiver --overwrite --document DOCUMENT_ID_HERE
  6. Wait several more minutes.
  7. Reload the document on the paperless-ng interface on the web browser. Observe it looks good now.
  8. Edit docker-compose.env, reverting the change from step 2.
  9. Restart/rebuild the running docker container again: docker-compose up -d

Those are too many steps and it takes too long. :-/

Environment

  • Installation method: docker-compose
  • Version: Paperless-ng 1.5.0
  • Host OS of the machine running paperless: Gentoo Linux for Arm 64, on a Raspberry Pi 4

Related issues:

  • [BUG] Certain PDF have changed colors #1248 → I believe reprocessing those documents with different settings could fix the color issue. So, it's somewhat similar to this issue over here.
  • [Other] Rerun OCR on uploaded documents #1474 → Re-running the OCR step with a different language would be very convenient, and it is a very common use case for people dealing with documents in multiple languages.

Screenshots

"Cleaned up" document, after using unpaper with PAPERLESS_OCR_CLEAN=clean-final:
Document after passing through unpaper

Original document, exactly as it was scanned, or if PAPERLESS_OCR_CLEAN=clean:
Original document (directly from the scanner)

@denilsonsa
Author

Easier workaround

I discovered I can set environment variables at the docker-compose exec command:

docker-compose exec -e PAPERLESS_OCR_CLEAN=clean webserver document_archiver --overwrite --document DOC_ID_HERE

This is way simpler and doesn't require restarting the server. I just wish it was documented somewhere.
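
A minimal sketch of wrapping the command above in a shell function, so the override does not have to be retyped each time. The function name `reprocess_doc` and the `DRY_RUN` switch are hypothetical, made up for illustration. The accepted modes (`clean`, `clean-final`, `none`) are the values I understand the paperless-ng configuration docs to list for PAPERLESS_OCR_CLEAN, and `webserver` is assumed to be the service name, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: re-archive one document with a one-off
# PAPERLESS_OCR_CLEAN value, without editing docker-compose.env.
# Set DRY_RUN=1 to print the command instead of running it.
reprocess_doc() {
    doc_id="$1"
    clean_mode="${2:-clean}"

    # The document id must be a non-empty string of digits.
    case "$doc_id" in
        ''|*[!0-9]*)
            echo "usage: reprocess_doc DOC_ID [clean|clean-final|none]" >&2
            return 2
            ;;
    esac

    # Assumption: these are the only valid modes for PAPERLESS_OCR_CLEAN.
    case "$clean_mode" in
        clean|clean-final|none) ;;
        *)
            echo "unknown clean mode: $clean_mode" >&2
            return 2
            ;;
    esac

    # Build the same command as above; "webserver" is the assumed service name.
    set -- docker-compose exec -e "PAPERLESS_OCR_CLEAN=$clean_mode" webserver \
        document_archiver --overwrite --document "$doc_id"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 reprocess_doc 123 clean` only prints the command that would be run, which is a cheap way to check it before waiting several minutes for the archiver.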

@denilsonsa
Author

Note: I believe that (re-)running the document_archiver will overwrite any text that I had manually written into the "Content" tab/textarea, but I can't double-check that right now.

If that's the case, that's very unfortunate. I'd like to be able to reprocess/recreate the archived version of the PDF in order to get rid of graphical glitches, but I may or may not want to overwrite my manually-fixed OCR-ed content text.

@sprnza

sprnza commented Jan 15, 2022

Hi! Thanks for the instructions. I tried this to fix my issue described in #1474, but it seems the document's OCR language remains as it was set at upload time, so the Content is not being fixed.

@christf

christf commented Jan 31, 2022

As another workaround, you could use a pre-processing hook as described in the docs: for every inbound document, create a copy that has been run through unpaper and write it to the input directory. Paperless then sees and prepares both documents, and you get to choose. However, you always have to choose as well.

I do not know the implications of using the API to post a document and the preprocessing / postprocessing scripts but that may be something to try :)
And I am not sure how to avoid endless loops just yet but that may come up when building the script.
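
One possible way to avoid the endless loop, as an untested sketch: mark the generated copy in its filename and have the hook skip anything that already carries the marker. Everything here is an assumption made for illustration: the `.alt` marker, the function names, the `CONSUME_DIR` variable, and the idea that the hook receives the document path as its first argument. The actual unpaper step is left as a placeholder (a plain copy), since unpaper works on images rather than PDFs.

```shell
#!/bin/sh
# Sketch of a pre-consumption hook that queues a second copy of each
# inbound document, while skipping copies it created itself.
MARKER="${MARKER:-.alt}"   # hypothetical marker inserted before the extension

# Succeeds if the file name already carries the marker (e.g. invoice.alt.pdf).
is_marked_copy() {
    case "$(basename "$1")" in
        *"$MARKER".*) return 0 ;;
        *) return 1 ;;
    esac
}

# invoice.pdf -> invoice.alt.pdf (assumes the file name has an extension)
alt_copy_name() {
    base=$(basename "$1")
    printf '%s%s.%s\n' "${base%.*}" "$MARKER" "${base##*.}"
}

# Write the marked copy into the consumption directory, unless the
# source is itself a marked copy (this is the loop guard).
queue_alt_copy() {
    src="$1"
    consume_dir="$2"
    if is_marked_copy "$src"; then
        return 0
    fi
    # Placeholder: a real hook would produce the unpaper-processed
    # version here instead of a plain copy.
    cp -- "$src" "$consume_dir/$(alt_copy_name "$src")"
}

# Assumption: the hook is called with the document path as $1.
if [ -n "${1:-}" ]; then
    queue_alt_copy "$1" "${CONSUME_DIR:-$(dirname "$1")}"
fi
```

The guard relies only on the file name, so a user-supplied document that happens to be named like `something.alt.pdf` would not get a second copy. A different marker can be chosen through `MARKER` if that is a concern.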
