[BUG] Sanity check returns "Document xxx has no content." #1041

WalterWampe · 2022-05-28T07:27:29Z

Description

My sanity checker returned several lines of "Document xxx has no content."

Is there any way to find the Document via this number and how can I fix?

Do I need to be worried?

Steps to reproduce

go to docker-compose folder
run command docker-compose exec webserver document_sanity_checker

Webserver logs

user@dockerhost:~/Docker-Composes/paperless-ng$ docker-compose exec webserver document_sanity_checker
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1542/1542 [03:40<00:00,  7.00it/s]
[2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] Document 823 has no content.
[2022-05-28 09:21:54,843] [INFO] [paperless.sanity_checker] Document 86 has no content.
[2022-05-28 09:21:54,857] [INFO] [paperless.sanity_checker] Document 85 has no content.
[2022-05-28 09:21:54,873] [INFO] [paperless.sanity_checker] Document 84 has no content.
[2022-05-28 09:21:54,888] [INFO] [paperless.sanity_checker] Document 525 has no content.
[2022-05-28 09:21:54,901] [INFO] [paperless.sanity_checker] Document 103 has no content.
[2022-05-28 09:21:54,922] [INFO] [paperless.sanity_checker] Document 511 has no content.
[2022-05-28 09:21:54,933] [INFO] [paperless.sanity_checker] Document 242 has no content.
[2022-05-28 09:21:54,948] [INFO] [paperless.sanity_checker] Document 230 has no content.
[2022-05-28 09:21:54,962] [INFO] [paperless.sanity_checker] Document 510 has no content.
[2022-05-28 09:21:54,973] [INFO] [paperless.sanity_checker] Document 310 has no content.
[2022-05-28 09:21:54,985] [INFO] [paperless.sanity_checker] Document 566 has no content.
[2022-05-28 09:21:55,001] [INFO] [paperless.sanity_checker] Document 262 has no content.

Paperless-ngx version

1.7.1

Host OS

Ubuntu 20.04.3 LTS x86_64

Installation method

Docker

Browser

No response

Configuration changes

No response

Other

No response


sukisoft · 2022-05-30T04:43:16Z

Well, I stumbled upon this a few weeks ago and looked into it. The checker was completely right: the mentioned documents had no text in their content tab.

The number is the unique database ID of the document. You can go to your running instance, append the number to "/documents/", and you should be able to view the document.

stumpylog · 2022-05-30T19:48:24Z

The URL would look something like this: http://localhost:8800/documents/823. You'd just need to replace localhost with your own host or domain to view the document.
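For a longer list of reported IDs, the lookup can be scripted. The following is a minimal sketch, not part of Paperless-ngx: it assumes the log lines look like the ones in the report above and that the web UI is reachable at the example address http://localhost:8800. The names `NO_CONTENT` and `document_urls` are made up for illustration.

```python
import re

# Matches sanity checker lines such as:
# [2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] Document 823 has no content.
NO_CONTENT = re.compile(r"Document (\d+) has no content\.")


def document_urls(log_text, base_url="http://localhost:8800"):
    """Return one web UI URL per document ID reported as having no content.

    base_url is an assumption; replace it with the address of your instance.
    """
    ids = [int(match.group(1)) for match in NO_CONTENT.finditer(log_text)]
    return [f"{base_url.rstrip('/')}/documents/{doc_id}" for doc_id in ids]


if __name__ == "__main__":
    sample = (
        "[2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] "
        "Document 823 has no content.\n"
        "[2022-05-28 09:21:54,843] [INFO] [paperless.sanity_checker] "
        "Document 86 has no content.\n"
    )
    for url in document_urls(sample):
        print(url)
```

Saving the checker output to a file and passing its text to `document_urls` gives a list of links that can be opened one by one.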

As to why there is no text, there are plenty of possibilities: no text in the document or images, OCR failed (PDFs are surprisingly not very standard), etc. Currently, there isn't a way to redo OCR on a document.

I do agree that outputting the primary key isn't very user-friendly. When I can, I'll look at including the title and/or path.

WalterWampe · 2022-05-31T07:09:58Z

Oh I see, thanks for the help. I figured it out, and it is as you described: for example, a picture ended up in my system, which obviously does not have any content.

Thanks again for the kind help!

tooomm · 2022-06-01T13:13:41Z

My sanity checker returned several lines of "Document xxx has no content."

I figured it out, and it is as you described: for example, a picture ended up in my system, which obviously does not have any content.

So "no content" actually means "no OCR data"? To a newer user without context, this is not obvious: "no content" reads as "empty file".
Mentioning OCR in the message, for example, would help a lot here.

As to why there is no text, there are plenty of possibilities: no text in the document or images, OCR failed (PDFs are surprisingly not very standard), etc. Currently, there isn't a way to redo OCR on a document.

I'm wondering how one would catch OCR errors or other import issues.
Is there some warning or useful hint, other than what is hidden in the logs?

I was totally unaware that documents are directly accessible via a URL! That's awesome.
That doesn't seem to be documented? At least I couldn't find it via the search.

stumpylog · 2022-06-01T14:19:25Z

Unless the OCR problem is an error which ocrmypdf can't work around, it will only output the issues to the log. Actual show-stopping issues would be reported to the web UI if uploading, otherwise still to the log if using the consume folder.

As for the URL, it's just a URL, there's not something to document there that I see. Usually, if someone is caring about a document via primary key, they'd be in the API, which is documented.
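As a rough illustration of looking a document up by primary key through the API, here is a minimal sketch using only the Python standard library. It assumes token authentication and a `/api/documents/<id>/` endpoint whose JSON response carries a `content` field; check the API documentation of your version before relying on either. The function names, the base URL, and the token are placeholders.

```python
import json
import urllib.request


def build_document_request(base_url, doc_id, token):
    """Build (but do not send) a GET request for one document by primary key."""
    url = f"{base_url.rstrip('/')}/api/documents/{doc_id}/"
    headers = {
        "Authorization": f"Token {token}",  # assumes token authentication
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_document(base_url, doc_id, token):
    """Send the request and return the decoded JSON object."""
    request = build_document_request(base_url, doc_id, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def has_no_content(document):
    """True when the 'content' field (assumed name) is missing or blank."""
    return not (document.get("content") or "").strip()


if __name__ == "__main__":
    # Placeholder values; this only works against a running instance.
    doc = fetch_document("http://localhost:8800", 823, "your-api-token")
    print(doc.get("title"), "has no content" if has_no_content(doc) else "has content")
```

Splitting request construction from sending keeps the URL and header logic easy to inspect without a running instance.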

shamoon · 2022-06-02T18:18:20Z

Closed in branch by 04db521

tooomm · 2022-06-05T14:58:10Z

As for the URL, it's just a URL, there's not something to document there that I see.

It can be quite a useful feature for some, but many people just don't know that it can be used like that.

Somebody in the discussions wanted to link documents to a task in their to-do app, for example. The direct link feature is really nice there, instead of having to download and attach a file. It also works when appending /preview to receive the file directly and not in the Paperless UI.

Keep in mind that not all Paperless users are tech guys.

github-actions · 2023-04-15T10:01:54Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

WalterWampe added bug Bug report or a Bug-fix unconfirmed labels May 28, 2022

stumpylog mentioned this issue May 31, 2022

Bugfix: Better sanity check messages #1049

Merged

10 tasks

stumpylog linked a pull request Jun 1, 2022 that will close this issue

Bugfix: Better sanity check messages #1049

Merged

10 tasks

stumpylog added backend and removed unconfirmed labels Jun 1, 2022

shamoon closed this as completed Jun 2, 2022

github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023
