[BUG] Sanity check returns "Document xxx has no content." #1041

Closed
WalterWampe opened this issue May 28, 2022 · 8 comments · Fixed by #1049
Labels
backend bug Bug report or a Bug-fix

Comments

@WalterWampe

Description

My sanity checker returned several lines of "Document xxx has no content."

Is there any way to find the document via this number, and how can I fix it?

Do I need to be worried?

Steps to reproduce

Go to the docker-compose folder.
Run the command: docker-compose exec webserver document_sanity_checker

Webserver logs

user@dockerhost:~/Docker-Composes/paperless-ng$ docker-compose exec webserver document_sanity_checker
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1542/1542 [03:40<00:00,  7.00it/s]
[2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] Document 823 has no content.
[2022-05-28 09:21:54,843] [INFO] [paperless.sanity_checker] Document 86 has no content.
[2022-05-28 09:21:54,857] [INFO] [paperless.sanity_checker] Document 85 has no content.
[2022-05-28 09:21:54,873] [INFO] [paperless.sanity_checker] Document 84 has no content.
[2022-05-28 09:21:54,888] [INFO] [paperless.sanity_checker] Document 525 has no content.
[2022-05-28 09:21:54,901] [INFO] [paperless.sanity_checker] Document 103 has no content.
[2022-05-28 09:21:54,922] [INFO] [paperless.sanity_checker] Document 511 has no content.
[2022-05-28 09:21:54,933] [INFO] [paperless.sanity_checker] Document 242 has no content.
[2022-05-28 09:21:54,948] [INFO] [paperless.sanity_checker] Document 230 has no content.
[2022-05-28 09:21:54,962] [INFO] [paperless.sanity_checker] Document 510 has no content.
[2022-05-28 09:21:54,973] [INFO] [paperless.sanity_checker] Document 310 has no content.
[2022-05-28 09:21:54,985] [INFO] [paperless.sanity_checker] Document 566 has no content.
[2022-05-28 09:21:55,001] [INFO] [paperless.sanity_checker] Document 262 has no content.

Paperless-ngx version

1.7.1

Host OS

Ubuntu 20.04.3 LTS x86_64

Installation method

Docker

Browser

No response

Configuration changes

No response

Other

No response

WalterWampe added the bug and unconfirmed labels on May 28, 2022
@sukisoft

Well, I stumbled upon this a few weeks ago and looked into it. The checker was completely right: the mentioned documents had no text in their content tab.

The number is the unique database ID of the document. You can go to your running instance, append "/documents/" followed by that number to its address, and you should be able to view the document.

@stumpylog
Member

The URL would look something like this: http://localhost:8800/documents/823. You'd just need to replace localhost (and the port) with your own host or domain to view the document.
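To turn the sanity checker's output into a list of links, the IDs can be pulled out of the log lines and joined with the instance address. The following is a minimal Python sketch under stated assumptions: `BASE_URL` and the helper name `urls_for_empty_documents` are made up for illustration, and the base URL has to be adjusted to the actual host and port.

```python
import re

# Assumed base URL of the Paperless instance; adjust host, port, or domain.
BASE_URL = "http://localhost:8800"

# Matches log lines such as:
# [2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] Document 823 has no content.
NO_CONTENT = re.compile(r"Document (\d+) has no content\.")


def urls_for_empty_documents(log_text, base_url=BASE_URL):
    """Return one document URL per 'has no content' line found in log_text."""
    base = base_url.rstrip("/")
    return [f"{base}/documents/{doc_id}" for doc_id in NO_CONTENT.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "[2022-05-28 09:21:54,773] [INFO] [paperless.sanity_checker] "
        "Document 823 has no content.\n"
        "[2022-05-28 09:21:54,843] [INFO] [paperless.sanity_checker] "
        "Document 86 has no content.\n"
    )
    for url in urls_for_empty_documents(sample):
        print(url)
```

The checker output can be saved to a file and passed to the helper as a string; lines that do not match the pattern are ignored.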

As to why there is no text, there are plenty of possibilities: no text in the document or images, OCR failed (PDFs are surprisingly not very standard), etc. Currently, there isn't a way to re-do OCR on a document.

I do agree that outputting only the primary key isn't very user-friendly. When I can, I'll look at including the title and/or path.
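As an illustration of what a friendlier message could contain, here is a hypothetical Python sketch. It is not the project's code: the `Doc` record and `describe_empty_documents` are invented for the example, and it assumes only that a document counts as having "no content" when its extracted text is empty.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document record."""
    pk: int        # database primary key, the number shown in the log
    title: str
    content: str   # text extracted from the document (e.g. by OCR)


def describe_empty_documents(docs):
    """Build one message per document whose extracted text is empty."""
    messages = []
    for doc in docs:
        # Whitespace-only text is treated as empty here; that is a choice
        # made for this sketch, not something confirmed by the thread.
        if not doc.content.strip():
            messages.append(
                f'Document {doc.pk} ("{doc.title}") has no text content; '
                "OCR may have found nothing."
            )
    return messages
```

Including the title next to the primary key lets a reader recognise the document without opening it, and mentioning OCR makes clear that the file itself is not necessarily empty.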

@WalterWampe
Author

Oh I see, thanks for the help. I figured it out, and it is as you described: for example, a picture ended up in my system, which obviously does not have any content.

Thanks again for the kind help!

@tooomm
Contributor

tooomm commented Jun 1, 2022

> My sanity checker returned several lines of "Document xxx has no content."

> I figured it out, and it is as you described: for example, a picture ended up in my system, which obviously does not have any content.

So "no content" actually means "no OCR data"? To a newer user without context, this is not obvious: "no content" reads as "empty file".
Adding e.g. "OCR" to the message would help a lot here.


> As to why there is no text, there are plenty of possibilities: no text in the document or images, OCR failed (PDFs are surprisingly not very standard), etc. Currently, there isn't a way to re-do OCR on a document.

I'm wondering how one would catch OCR errors or other import issues?
Is there some warning or useful hint, other than what is hidden in the logs?


I was totally unaware that documents are directly accessible via a URL! That's awesome.
That seems not to be documented? At least I couldn't find it via the search.

@stumpylog
Member

Unless the OCR problem is an error which ocrmypdf can't work around, it will only output the issues to the log. Actual show-stopping issues would be reported to the web UI if uploading; otherwise, when using the consume folder, they still only go to the log.

As for the URL, it's just a URL; I don't see anything to document there. Usually, if someone cares about a document by its primary key, they'd be using the API, which is documented.
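For readers who do want to look a document up by primary key through the API, a minimal Python sketch follows. It assumes the REST endpoint takes the form `/api/documents/<id>/` and that token authentication uses an `Authorization: Token ...` header; both should be checked against the API documentation. The base URL, the token value, and the helper names are placeholders.

```python
import json
import urllib.request


def build_document_request(base_url, doc_id, token):
    """Prepare a GET request for one document's metadata (assumed endpoint)."""
    url = f"{base_url.rstrip('/')}/api/documents/{doc_id}/"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Token {token}",  # assumed token auth scheme
            "Accept": "application/json",
        },
    )


def fetch_document(base_url, doc_id, token):
    """Send the request and decode the JSON body; needs a reachable instance."""
    request = build_document_request(base_url, doc_id, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the URL and header logic easy to inspect without a running instance.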

stumpylog linked a pull request on Jun 1, 2022 that will close this issue
@shamoon
Member

shamoon commented Jun 2, 2022

Closed in branch by 04db521

shamoon closed this as completed on Jun 2, 2022
@tooomm
Contributor

tooomm commented Jun 5, 2022

> As for the URL, it's just a URL; I don't see anything to document there.

It can be quite a useful feature for some, but many people just don't know that it can be used like that.

Somebody in the discussions wanted to link documents to a task in their to-do app, for example. The direct link feature is really nice there, instead of having to download and attach a file. It also works when appending /preview to receive the file directly rather than in the Paperless UI.

Keep in mind that not all Paperless users are tech guys.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

github-actions bot locked as resolved and limited conversation to collaborators on Apr 15, 2023
5 participants