ENH: Improve PDFium text extraction #11

mqq-marek · 2023-10-29T15:46:18Z

Several additional changes:

ENH: Add PDFium image extraction
ROB: Make opening/parsing the cache file more robust
MAINT: Update deprecated pydantic API
MAINT: Add pdfrw to main.in

Reset cache.json when reading it fails

Add page labels for pdfium
Add post-processing for pdfium
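As an illustration of the cache robustness change, here is a minimal sketch of how a reset-on-failure read could look. It is not the code from this PR: the function name `load_cache` and the choice of an empty dict as the reset value are assumptions made for this example.

```python
import json
from pathlib import Path


def load_cache(path: Path) -> dict:
    """Return the cached data; reset the file if it is missing or unreadable.

    Hypothetical helper, not taken from the PR.
    """
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError, UnicodeDecodeError):
        # Missing file, invalid JSON, or undecodable bytes.
        data = None
    if not isinstance(data, dict):
        # Reset the cache so the next run starts from a clean file.
        data = {}
        path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this approach a truncated or corrupted cache.json costs one cold run instead of aborting the benchmark.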

MartinThoma · 2023-10-31T21:45:10Z

Good work! Thank you for updating the PR 🤗

mara004 · 2023-11-18T22:10:38Z

Just came across this, thanks for the addition!
I'd expect the benchmark results to be great with JPEG or JP2, but poor with other formats because of limitations in pdfium's public API, and especially poor with CCITT or JBIG2. Also note that this code doesn't take alpha masks into account (along with some other finicky details).

Adding pikepdf would also be nice, see #4. Programmatically, it's by far the best PDF image extractor I'm aware of.

marek-kubowicz added 5 commits October 29, 2023 12:40

Add missing dependency to pdfrw

aae47c5

pydantic compatibility changes

8460d60

add tika_get_text function with timeout support

cbda0fc

reset cache.json while fail on read

Add pdfium style hyphen processing

89a886b

Add page labels for pdfium
Add post-processing for pdfium

Add pdfium image extraction

53532c7

MartinThoma changed the title ~~Pdfium test updates~~ ENH: Improve PDFium text extraction Oct 31, 2023

MartinThoma merged commit 24c51dd into py-pdf:main Oct 31, 2023

MartinThoma mentioned this pull request Oct 31, 2023

Refresh updates v2 #10

Closed
