[BUG] Unsupported mime type application/octet-stream #776

invinciberry · 2022-04-20T16:41:50Z

Description

Getting "Unsupported mime type application/octet-stream : Traceback (most recent call last)" when ingesting a normal PDF file.
  File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 70, in consume_file
    document = Consumer().try_consume_file(
  File "/usr/src/paperless/src/documents/consumer.py", line 211, in try_consume_file
    self._fail(MESSAGE_UNSUPPORTED_TYPE, f"Unsupported mime type {mime_type}")
  File "/usr/src/paperless/src/documents/consumer.py", line 69, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}")
documents.consumer.ConsumerError: 110Tapestry2021TaxBill.pdf: Unsupported mime type application/octet-stream

Expected behavior

Other PDFs work fine.

Steps to reproduce

When uploading certain PDFs

Webserver logs

No response

Screenshots

No response

Paperless-ngx version

1.6

Host OS

Unraid

Installation method

Docker

Browser

Chrome

Configuration changes

No response

Other

No response

shamoon · 2022-04-20T17:18:30Z

Have you tried re-creating that PDF? I'm not sure application/octet-stream is a correct file type for a PDF.

See jonaswinkler/paperless-ng#906, and also jonaswinkler/paperless-ng#291, which leads to jonaswinkler/paperless-ng#201.

stumpylog · 2022-04-20T18:02:15Z

Normally, a PDF would be detected as application/pdf

I'd be curious to see what file --mime-type against this PDF produces both inside the container and outside. It's probably the same issue as those linked issues, some junk in the file header.
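
As a rough complement to running file --mime-type, the first bytes of the file can also be inspected directly. The sketch below uses only the Python standard library. The helper name describe_header is hypothetical (it is not part of Paperless-ngx or libmagic), and the 1024-byte search window is an assumption, not something this thread specifies.

```python
PDF_MAGIC = b"%PDF-"
SEARCH_WINDOW = 1024  # assumption: only the first 1024 bytes are examined


def describe_header(path):
    """Return a short description of what the start of the file looks like."""
    with open(path, "rb") as fh:
        head = fh.read(SEARCH_WINDOW)
    if not head:
        return "empty file"
    offset = head.find(PDF_MAGIC)
    if offset == 0:
        return "starts with %PDF-"
    if offset > 0:
        # Something was written in front of the PDF header.
        return f"{offset} unexpected bytes before %PDF-"
    return "no %PDF- marker found"
```

Running this both inside and outside the container on the same file should give the same answer. If it does not, the file is probably being changed or truncated on its way in.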

ErrorSource · 2022-07-08T09:19:58Z

I have a similar problem:

Jul  8 10:45:27 ares python3[312409]: 10:45:27 [Q] ERROR Failed [Taufzeugnis Anna.pdf] - Taufzeugnis Anna.pdf: Unsupported mime type inode/x-empty : Traceback (most recent call last):
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-data/.local/lib/python3.10/site-packages/django_q/cluster.py", line 432, in worker
Jul  8 10:45:27 ares python3[312409]:     res = f(*task["args"], **task["kwargs"])
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/tasks.py", line 298, in consume_file
Jul  8 10:45:27 ares python3[312409]:     document = Consumer().try_consume_file(
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/consumer.py", line 225, in try_consume_file
Jul  8 10:45:27 ares python3[312409]:     self._fail(MESSAGE_UNSUPPORTED_TYPE, f"Unsupported mime type {mime_type}")
Jul  8 10:45:27 ares python3[312409]:   File "/opt/paperless-ngx-1.7.1/src/documents/consumer.py", line 81, in _fail
Jul  8 10:45:27 ares python3[312409]:     raise ConsumerError(f"{self.filename}: {log_message or message}")
Jul  8 10:45:27 ares python3[312409]: documents.consumer.ConsumerError: Taufzeugnis Anna.pdf: Unsupported mime type inode/x-empty

file --mime-type /opt/paperless-data/media/documents/originals/.../Taufzeugnis\ Anna.pdf 
/opt/paperless-data/media/documents/originals/.../Taufzeugnis Anna.pdf: application/pdf

It's what I think is a valid PDF, created by FineReader with OCR recognition. If I examine the raw file (via vi), I can't find the string "inode" or "x-empty" anywhere. Where does this come from?

stumpylog · 2022-07-19T20:29:11Z

Revisiting this, it seems likely there are 2 things happening here.

1. To get a mime type of inode/x-empty, the file needs to be empty. Perhaps a scanner is creating an empty file first, before actually writing to it. To fix this, you could utilize polling, with a larger timeout. There will also be some additional fixes and options for delaying consumption in the upcoming release.

2. As for application/octet-stream, the file is being detected as a generic binary file. I don't think there's much to do about it, besides fixing the file itself. You could attempt to run it through qpdf; something like qpdf [infile] [outfile] might be enough to fix it, as the file will be re-written. qpdf is available in the image at its most recent version.
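
For the empty-file case, the idea of delaying consumption until the writer has finished can be sketched as below. This is only an illustration of the general technique, not how Paperless-ngx implements polling. The function name wait_until_stable and its parameters are hypothetical.

```python
import os
import time


def wait_until_stable(path, interval=1.0, checks=3, timeout=60.0):
    """Return True once the file is non-empty and its size has stopped changing.

    The size must be unchanged for `checks` consecutive polls.
    Returns False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file does not exist (yet)
        if size > 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0  # still empty, missing, or still growing
        last_size = size
        time.sleep(interval)
    return False
```

A file that stays at zero bytes never counts as stable here, so a scanner that creates the file first and fills it later would not be picked up too early.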

SamuelBolduc · 2022-08-02T14:27:31Z

I'm encountering this error with all account statement PDFs my bank produces. After tinkering with it for a while, I found out that for some reason they have extra data at the beginning of the file, before %PDF-1.3. What I do now is I open the file in Vim and I just erase those characters. Here's the beginning of a file with this issue:

¬í^@^Eur^@^B[B¬ó^Wø^F^HTà^B^@^@xp^@^@-^L%PDF-1.3
%<81><96>½Ý^M
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/Outlines 3 0 R
>>
endobj
4 0 obj
<</Length 86/BitsPerComponent 1/Width 1790/ImageMask true/Height 64/Filter /FlateDecode/Subtype /Image/Type /XObject/Decode [1 0]>>stream

Paperless will error out when trying to consume this file. However, if I simply erase everything before %PDF-1.3 in Vim, Paperless consumes it as expected, without errors. Here's the same extract from that file, corrected so it can be parsed:

%PDF-1.3
%<81><96>½Ý^M
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/Outlines 3 0 R
>>
endobj
4 0 obj
<</Length 86/BitsPerComponent 1/Width 1790/ImageMask true/Height 64/Filter /FlateDecode/Subtype /Image/Type /XObject/Decode [1 0]>>stream

I'm not sure it's the same problem exactly, but I thought this might help. A possible solution could be to look forward a few hundred characters when hitting this error to solve the case where a small amount of extra data is added at the beginning of the file? Although as I understand, this might be entirely dependent on an external library (libmagic?), so in that case there's nothing that could be done easily about it.
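
The manual Vim fix described above, and the "look forward a few hundred characters" idea, could be sketched roughly like this. It is a standalone illustration, not something Paperless-ngx or libmagic does. The name strip_leading_junk and the 1024-byte window are assumptions.

```python
PDF_MAGIC = b"%PDF-"


def strip_leading_junk(data, window=1024):
    """Return `data` starting at the first %PDF- marker found within `window` bytes.

    Raises ValueError if no marker is found in that window.
    """
    offset = data.find(PDF_MAGIC, 0, window)
    if offset < 0:
        raise ValueError(f"no %PDF- marker in the first {window} bytes")
    # Drop everything in front of the header; a clean file is returned unchanged.
    return data[offset:]
```

To use it, read the problem file in binary mode, pass the bytes through strip_leading_junk, and write the result to a new file rather than overwriting the original. Whether the internal byte offsets of the PDF still line up afterwards depends on how the file was produced, so the result is worth opening in a viewer before importing it.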

stumpylog · 2022-08-02T14:50:35Z

Yes, python-magic is an interface to libmagic. The leading bunch of chars is exactly what was previously encountered. I don't see anything in its documentation for skipping data (which shouldn't be there anyway). Wouldn't it be nice if things followed specifications?

I suspect qpdf would be able to handle it and produce a more valid PDF. It's something which can be set up with a pre-consume hook to replace the original file.

jgillula · 2022-10-03T15:31:45Z

Following @stumpylog's recommendation, I tried to set up a preconsume script to handle this. However, paperless does the mime-type checking (and fails out) before the preconsume script is run. See lines 272-291 of consumer.py

Is this the expected behavior? Or should the preconsume script be allowed to run before mime-type checking?

auberginepop · 2022-11-30T08:20:34Z

qpdf [infile] [outfile] as suggested by stumpylog worked for me.

andybali · 2022-12-22T07:49:32Z

@auberginepop
I have the same problem, but I am not expert enough. Can you explain in more detail where and how I have to apply that, please? :-)


auberginepop · 2022-12-23T10:41:35Z


You need to have qpdf installed. I assume you are running Linux in which case use whatever method you normally use to install things. For example, sudo apt install qpdf on Ubuntu or yay -S qpdf on Arch/Manjaro.
From a terminal you need to run the qpdf tool on the file that will not import, e.g. qpdf problemfile.pdf problemfile2.pdf. That will create a new file called problemfile2.pdf which should now import.
Does that make sense?
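
For anyone with more than a handful of files, the same qpdf call can be wrapped in a small script. The sketch below assumes qpdf is installed and on PATH. The function names are hypothetical, and only the plain qpdf infile outfile form from this thread is used.

```python
import shutil
import subprocess


def qpdf_command(infile, outfile):
    """Build the argument list for `qpdf infile outfile`."""
    return ["qpdf", str(infile), str(outfile)]


def rewrite_with_qpdf(infile, outfile):
    """Rewrite `infile` to `outfile` with qpdf and return qpdf's exit code."""
    if shutil.which("qpdf") is None:
        raise FileNotFoundError("qpdf is not installed or not on PATH")
    # No shell is involved, so file names with spaces need no extra quoting.
    result = subprocess.run(qpdf_command(infile, outfile))
    return result.returncode
```

Calling rewrite_with_qpdf("problemfile.pdf", "problemfile2.pdf") mirrors the command above. A non-zero exit code does not necessarily mean no output was written, since qpdf may also report warnings that way, so check whether the new file exists and opens. An empty input file (the inode/x-empty case) cannot be repaired like this, because there is nothing to rewrite.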

goldjunge91 · 2023-02-01T17:29:03Z


Thanks for that. Do I need to do this if my Paperless-ngx is running in a Docker container?

I figured it out: I have to run this in the container. But I have over 50 files in total, all invoices from Amazon, and 16 of them have the issue Unsupported mime type inode/x-empty.

github-actions · 2023-04-15T02:16:27Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

invinciberry added the bug and unconfirmed labels Apr 20, 2022

qcasey added backend and removed unconfirmed labels Apr 20, 2022

shamoon added unconfirmed stale labels Jul 3, 2022

stale bot removed the stale label Jul 3, 2022

shamoon added stale and removed unconfirmed labels Jul 3, 2022

shamoon removed the stale label Jul 10, 2022

stumpylog added the cant-reproduce label Jul 29, 2022

shamoon added the stale label Sep 1, 2022

stale bot closed this as completed Sep 8, 2022

github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2023
