tesseract: [FIX] pdf pre-processing #468

bosd · 2023-02-10T20:12:20Z

Before this PR, the mimetype of the input file was detected from its extension.
That works fine for CLI usage.

It fails when invoice2data is used as a library and a tempfile/stream is passed as input, because no extension is available.
This makes it impossible to parse PDF files when using invoice2data as a library, since tesseract cannot handle PDF files as input.

After this PR, the mimetype is also detected when using invoice2data as a library.

@m3nu, @rmilecki I'm considering this a hotfix, so I'm fast-tracking this one.

detect the mimetype of the input, also when using invoice2data as a library

rmilecki · 2023-02-18T11:15:57Z

Thanks for handling this.

bosd · 2023-02-18T11:22:51Z

It turns out this did not really make a difference. The guess_type function also looks at the file extension
instead of running low-level comparisons on the content.
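
For illustration, here is a minimal sketch of that difference, using only the standard library. `mimetypes.guess_type` maps a name to a type, so an extensionless temp name yields nothing. The `looks_like_pdf` helper is hypothetical (it is not part of invoice2data) and assumes the `%PDF-` signature sits at the very start of the data, which some real-world files do not honor.

```python
import mimetypes


def looks_like_pdf(data: bytes) -> bool:
    """Hypothetical content check: PDF files normally begin with '%PDF-'."""
    return data.startswith(b"%PDF-")


# guess_type only inspects the name, never the file content.
by_name_with_ext = mimetypes.guess_type("invoice.pdf")[0]
by_name_without_ext = mimetypes.guess_type("tmpab12cd")[0]

# A content check works regardless of how the file is named.
sample = b"%PDF-1.4\n"
is_pdf = looks_like_pdf(sample)
```

A content check like this would be one way to recognize PDFs coming from streams, but it is not what this PR implements.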

The actual problem was in the program calling invoice2data:
it was passing a tempfile without an extension.
Forcing that program to always pass the .pdf extension fixed it.
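
A rough sketch of that caller-side workaround, assuming the calling program writes the incoming bytes to a temporary file before handing the path to invoice2data. The `write_temp_pdf` helper is hypothetical and only shows how the `.pdf` suffix makes extension-based detection work.

```python
import mimetypes
import os
import tempfile


def write_temp_pdf(data: bytes) -> str:
    """Hypothetical helper: store the bytes in a temp file whose name ends in .pdf."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        return tmp.name


path = write_temp_pdf(b"%PDF-1.4\n")
guessed = mimetypes.guess_type(path)[0]  # detection by extension now succeeds
os.remove(path)  # delete=False means the caller is responsible for cleanup
```

The resulting path can then be passed to invoice2data like any regular file on disk.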

tesseract: [FIX] pdf pre-processing

97cc826

detect the mimetype of the input, also when using invoice2data as a library

bosd mentioned this pull request Feb 10, 2023

using tesseract4 option when use as a python library? #326

Closed

bosd added priority:high type:bug labels Feb 10, 2023

bosd merged commit 26aa28a into invoice-x:master Feb 11, 2023

bosd deleted the fix-tesseract-stream branch February 11, 2023 07:31
