[BUG] Unsupported mime type application/octet-stream #776

Closed
invinciberry opened this issue Apr 20, 2022 · 12 comments
@invinciberry

Description

Getting "Unsupported mime type application/octet-stream" when ingesting a normal PDF file:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 70, in consume_file
    document = Consumer().try_consume_file(
  File "/usr/src/paperless/src/documents/consumer.py", line 211, in try_consume_file
    self._fail(MESSAGE_UNSUPPORTED_TYPE, f"Unsupported mime type {mime_type}")
  File "/usr/src/paperless/src/documents/consumer.py", line 69, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}")
documents.consumer.ConsumerError: 110Tapestry2021TaxBill.pdf: Unsupported mime type application/octet-stream

Expected behavior

Other PDFs work fine.

Steps to reproduce

When uploading certain PDFs

Webserver logs

No response

Screenshots

No response

Paperless-ngx version

1.6

Host OS

Unraid

Installation method

Docker

Browser

Chrome

Configuration changes

No response

Other

No response

@invinciberry invinciberry added the bug and unconfirmed labels Apr 20, 2022
@shamoon
Member

shamoon commented Apr 20, 2022

Have you tried re-creating that PDF? I'm not sure application/octet-stream is a correct file type for a PDF.

See jonaswinkler/paperless-ng#906 also jonaswinkler/paperless-ng#291 --> jonaswinkler/paperless-ng#201

@stumpylog
Member

Normally, a PDF would be detected as application/pdf

I'd be curious to see what file --mime-type against this PDF produces both inside the container and outside. It's probably the same issue as those linked issues, some junk in the file header.
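For anyone who wants to check for "junk in the file header" without relying on libmagic, here is a minimal Python sketch. It does not reproduce libmagic's detection rules; it only looks for the %PDF- marker near the start of the file. The function name describe_pdf_header, its return strings, and the 1024-byte window are assumptions made for illustration.

```python
PDF_MARKER = b"%PDF-"


def describe_pdf_header(path, window=1024):
    """Report where the %PDF- marker sits in the first `window` bytes.

    Returns one of:
      "empty"     - the file has no content at all
      "ok"        - the marker is the very first thing in the file
      "junk:N"    - the marker was found after N leading bytes
      "no-marker" - no marker inside the window
    """
    with open(path, "rb") as fh:
        head = fh.read(window)
    if not head:
        return "empty"
    offset = head.find(PDF_MARKER)
    if offset == 0:
        return "ok"
    if offset > 0:
        return f"junk:{offset}"
    return "no-marker"
```

A result of "junk:N" would match the leading-garbage case from the linked issues, while "empty" would point to a file that had not been written yet when it was picked up.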

@ErrorSource

I have a similar problem:

Jul  8 10:45:27 ares python3[312409]: 10:45:27 [Q] ERROR Failed [Taufzeugnis Anna.pdf] - Taufzeugnis Anna.pdf: Unsupported mime type inode/x-empty : Traceback (most recent call last):
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-data/.local/lib/python3.10/site-packages/django_q/cluster.py", line 432, in worker
Jul  8 10:45:27 ares python3[312409]:     res = f(*task["args"], **task["kwargs"])
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/tasks.py", line 298, in consume_file
Jul  8 10:45:27 ares python3[312409]:     document = Consumer().try_consume_file(
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/consumer.py", line 225, in try_consume_file
Jul  8 10:45:27 ares python3[312409]:     self._fail(MESSAGE_UNSUPPORTED_TYPE, f"Unsupported mime type {mime_type}")
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/consumer.py", line 81, in _fail
Jul  8 10:45:27 ares python3[312409]:     raise ConsumerError(f"{self.filename}: {log_message or message}")
Jul  8 10:45:27 ares python3[312409]: documents.consumer.ConsumerError: Taufzeugnis Anna.pdf: Unsupported mime type inode/x-empty

file --mime-type /opt/paperless-data/media/documents/originals/.../Taufzeugnis\ Anna.pdf
/opt/paperless-data/media/documents/originals/.../Taufzeugnis Anna.pdf: application/pdf

It's (I think) a valid PDF, created by FineReader with OCR. If I examine the raw file in vi, I can't find the strings "inode" or "x-empty" anywhere. Where does this come from?

@shamoon shamoon removed the stale label Jul 10, 2022
@stumpylog
Member

Revisiting this, it seems likely there are 2 things happening here.

For the mime type to be inode/x-empty, the file has to be empty. Perhaps a scanner is creating an empty file first, before actually writing to it. To fix this, you could use polling with a larger timeout. There will also be some additional fixes and options for delaying consumption in the upcoming release.

As for application/octet-stream, the file is being detected as a generic binary file. I don't think there's much to do about it besides fixing the file itself. You could try running it through qpdf; something like qpdf [infile] [outfile] might be enough to fix it, as the file will be rewritten. qpdf is available in the image at its most recent version.
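To illustrate the empty-file case: the idea behind delaying consumption is to wait until the file is non-empty and has stopped growing before handing it over. The sketch below is not Paperless's implementation; the function name wait_until_written and its settle, timeout, and poll parameters are hypothetical, and it could sit in whatever script drops files into the consume folder.

```python
import os
import time


def wait_until_written(path, settle=1.0, timeout=30.0, poll=0.25):
    """Return True once `path` is non-empty and its size has not changed
    for `settle` seconds; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    last_size = None
    unchanged_since = time.monotonic()
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = None  # file is missing or not readable yet
        now = time.monotonic()
        if size != last_size:
            # Size changed (or first look): restart the settle timer.
            last_size = size
            unchanged_since = now
        elif size and now - unchanged_since >= settle:
            return True
        time.sleep(poll)
    return False
```

A file that stays at zero bytes never satisfies the check, so the function returns False after the timeout instead of passing an empty file along.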

@SamuelBolduc

I'm encountering this error with all account statement PDFs my bank produces. After tinkering with it for a while, I found out that for some reason they have extra data at the beginning of the file, before %PDF-1.3. What I do now is I open the file in Vim and I just erase those characters. Here's the beginning of a file with this issue:

¬í^@^Eur^@^B[B¬ó^Wø^F^HTà^B^@^@xp^@^@-^L%PDF-1.3
%<81><96>½Ý^M
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/Outlines 3 0 R
>>
endobj
4 0 obj
<</Length 86/BitsPerComponent 1/Width 1790/ImageMask true/Height 64/Filter /FlateDecode/Subtype /Image/Type /XObject/Decode [1 0]>>stream

Paperless will error out when trying to consume this file. However, if I simply erase everything before %PDF-1.3 in Vim, Paperless consumes it as expected, without errors. Here's the same extract from that file, corrected so it can be parsed:

%PDF-1.3
%<81><96>½Ý^M
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/Outlines 3 0 R
>>
endobj
4 0 obj
<</Length 86/BitsPerComponent 1/Width 1790/ImageMask true/Height 64/Filter /FlateDecode/Subtype /Image/Type /XObject/Decode [1 0]>>stream

I'm not sure it's exactly the same problem, but I thought this might help. A possible solution could be to look ahead a few hundred characters when hitting this error, to handle the case where a small amount of extra data has been added at the beginning of the file. Although, as I understand it, this might depend entirely on an external library (libmagic?), in which case there's nothing that could easily be done about it.
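The manual Vim fix described above can be sketched in a few lines of Python. This is only an illustration of the look-ahead idea, not something Paperless or libmagic does; strip_leading_junk and write_cleaned_copy are hypothetical names, and the 1024-byte window is an assumption (some PDF readers are known to tolerate a header anywhere in the first 1024 bytes, which is why that value is used here).

```python
PDF_MARKER = b"%PDF-"


def strip_leading_junk(data: bytes, window: int = 1024) -> bytes:
    """Drop any bytes that precede the first %PDF- marker.

    Only the first `window` bytes are searched; a ValueError is raised
    when no marker starts inside that range, so non-PDF data is never
    silently passed through.
    """
    offset = data.find(PDF_MARKER, 0, window + len(PDF_MARKER))
    if offset < 0:
        raise ValueError("no %PDF- marker near the start of the data")
    return data[offset:]


def write_cleaned_copy(src: str, dst: str) -> int:
    """Write a cleaned copy of `src` to `dst`; return the bytes removed."""
    with open(src, "rb") as fh:
        original = fh.read()
    cleaned = strip_leading_junk(original)
    with open(dst, "wb") as fh:
        fh.write(cleaned)
    return len(original) - len(cleaned)
```

Writing to a separate destination keeps the original bank statement untouched in case the cleaned copy turns out to be unusable.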

@stumpylog
Member

stumpylog commented Aug 2, 2022

Yes, python-magic is an interface to libmagic. The leading bunch of chars is exactly what was previously encountered. I don't see anything in its documentation for skipping data (which shouldn't be there anyway). Wouldn't it be nice if things followed specifications?

I suspect qpdf would be able to handle it and produce a more valid PDF. It's something which can be set up with a pre-consume hook to replace the original file.

@shamoon shamoon added the stale label Sep 1, 2022
@stale stale bot closed this as completed Sep 8, 2022
@jgillula

jgillula commented Oct 3, 2022

Following @stumpylog's recommendation, I tried to set up a preconsume script to handle this. However, paperless does the mime-type checking (and fails out) before the preconsume script is run. See lines 272-291 of consumer.py

Is this the expected behavior? Or should the preconsume script be allowed to run before mime-type checking?

@auberginepop

qpdf [infile] [outfile] as suggested by stumpylog worked for me.

@andybali

@auberginepop
I have the same problem, but I am not expert enough. Can you explain in more detail where and how I have to implement that, please? :-)

@auberginepop

You need to have qpdf installed. I assume you are running Linux in which case use whatever method you normally use to install things. For example, sudo apt install qpdf on Ubuntu or yay -S qpdf on Arch/Manjaro.
From a terminal you need to run the qpdf tool on the file that will not import, e.g. qpdf problemfile.pdf problemfile2.pdf. That will create a new file called problemfile2.pdf which should now import.
Does that make sense?
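If there are many files to fix, the same qpdf call can be looped over a folder. The following is a rough Python sketch, assuming qpdf is installed and on the PATH (with Docker, that means running it inside the container). The names qpdf_command and rewrite_all and the folder layout are hypothetical. qpdf's manual describes exit status 3 as "succeeded with warnings", so the sketch counts it as success; treat that as an assumption to check against your qpdf version.

```python
import subprocess
from pathlib import Path


def qpdf_command(src, dst):
    """Build the plain `qpdf infile outfile` call used in this thread."""
    return ["qpdf", str(src), str(dst)]


def rewrite_all(folder, out_folder, run=subprocess.run):
    """Run qpdf on every *.pdf in `folder`, writing results to `out_folder`.

    Returns a dict mapping each file name to True when qpdf reported
    success (exit status 0) or success with warnings (exit status 3).
    `run` is injectable so the loop can be exercised without qpdf.
    """
    out = Path(out_folder)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(Path(folder).glob("*.pdf")):
        proc = run(qpdf_command(src, out / src.name))
        results[src.name] = proc.returncode in (0, 3)
    return results
```

Calling rewrite_all("broken", "fixed") would leave the originals in place and put the rewritten copies in a separate folder, from which they can be uploaded again. Files that are completely empty will still fail, since there is nothing for qpdf to rewrite.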

@goldjunge91

goldjunge91 commented Feb 1, 2023

Thanks for that. Do I need to do this if my paperless-ngx is running in a Docker container?

I figured it out: I have to run this in the container. But I have over 50 files in total, all invoices from Amazon, and 16 of them fail with Unsupported mime type inode/x-empty.

@github-actions
Contributor

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023