Use pypdfium2's new range-based text extractor #5

mara004 · 2022-10-07T13:36:54Z

get_text() was boundary-based, which is not that suited for the use case of just extracting all text of a page. I believe the new get_text_range() function might both yield better results and be more performant.

This can be merged once pypdfium2 3.3 is released.
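For illustration, here is a minimal sketch of what range-based extraction of every page might look like. It assumes pypdfium2 3.3 or later is installed and that the helper classes still expose `PdfDocument`, `get_page()`, `get_textpage()` and `close()` as in that release. The function name `extract_all_text` is hypothetical and is not the benchmark's actual code.

```python
def extract_all_text(pdf_path):
    """Return a list with the full text of each page (sketch, not the benchmark code)."""
    # Imported lazily so the module can be loaded without pypdfium2 present.
    import pypdfium2 as pdfium  # third-party; assumed to be version 3.3 or later

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf.get_page(index)
            textpage = page.get_textpage()
            try:
                # With default arguments, get_text_range() covers the whole
                # page, so no bounding box has to be supplied.
                texts.append(textpage.get_text_range())
            finally:
                # Close children before their parents.
                textpage.close()
                page.close()
    finally:
        pdf.close()
    return texts
```

Calling `"\n".join(extract_all_text("2201.00214.pdf"))` would then give the text of the whole document, assuming that file exists locally.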


MartinThoma · 2022-10-09T09:47:47Z

https://pypi.org/project/pypdfium2/#history - I'm super curious to see the new results :-)

mara004 · 2022-10-10T19:00:06Z

FYI, v3.3 was released today

MartinThoma · 2022-10-11T05:09:37Z

Overall, it's faster. However, Pdfium became a lot slower for https://arxiv.org/abs/2201.00214

MartinThoma · 2022-10-11T05:10:18Z

The quality didn't really change:

MartinThoma · 2022-10-11T05:15:07Z

Thank you for the update @mara004 🙏

As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:
https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L170-L195

mara004 · 2022-10-11T10:33:05Z

Thanks for the new results. I didn't expect it to be really different, given that it's still the same base. get_text_range() is just a bit nicer internally.

The slowdown on the first document appears suspicious to me, however. If I run `time pypdfium2 extract-text "2201.00214.pdf" --strategy range` multiple times, I get

real    0m2.919s
user    0m0.730s
sys     0m0.206s

for the first run, and something like

real    0m0.676s
user    0m0.750s
sys     0m0.162s

for all following runs.

Is it possible that there's some external explanation for the slowdown and that (py)pdfium is not at fault here?
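To separate a one-off cold start from the steady-state cost, the same command can be timed several times in a row. The following is a rough sketch using only the standard library. The helper name `time_runs` is hypothetical, and wall-clock times measured this way include process start-up, so they are only comparable with each other.

```python
import subprocess
import sys
import time


def time_runs(cmd, repeats=3):
    """Run cmd `repeats` times and return the wall-clock duration of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard stdout so that terminal output speed does not skew the timing.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


# Example call, assuming the pypdfium2 CLI and the PDF are available locally:
# time_runs(["pypdfium2", "extract-text", "2201.00214.pdf", "--strategy", "range"])
```

If only the first entry is large, the extra time is more likely due to caching effects (file system, bytecode) than to the extraction itself.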

mara004 · 2022-10-11T10:45:51Z

> As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:
> https://github.com/py-pdf/benchmarks/blob/main/benchmark.py#L170-L195

In principle, yes. There are some functions in the raw API for this (FPDFImageObj_GetBitmap(), FPDFImageObj_GetRenderedBitmap(), FPDFImageObj_GetImageDataDecoded(), FPDFImageObj_GetImageDataRaw(), ...).

The problem is: if we use one of the FPDFImageObj_*Bitmap() functions, the result is only suited for displaying, not saving. Otherwise we would have to re-encode the pixel data, which would be disadvantageous in terms of performance / compression / quality. So I'm not sure how to best create a support model for this.

There are also functions to get the raw data, filters and metadata, but this approach would be quite complicated, and I'm not even sure if the information provided by PDFium is sufficient (e.g. alpha masks, ICC profiles?).
(As I stated in #4, pikepdf is able to smartly handle the raw data and "reconstruct" the original image in many cases.)

I've discussed this topic with a user already (there's an open issue about it).

mara004 · 2022-10-11T13:18:58Z

For reference, even with the previous extraction strategy, there was already a slowdown from 0.7s to 1.7s with update 1ce8729. Given the time results I shared above, 0.7s would be more plausible.
Is it possible that this is related to Python bytecode compilation on first run or something?

mara004 · 2022-11-17T16:47:22Z

> As an unrelated side-note: Is Pdfium also able to extract images? We now also have this in the benchmark:

@MartinThoma pypdfium2 now has a function for this in the devel branch (PdfImage.extract()), but it's not as good as it theoretically could be, since PDFium does not expose all required information, as I hinted above.
(I wrote an essay about that in PDFium's bug tracker: https://crbug.com/pdfium/1930)

MartinThoma merged commit 2417507 into py-pdf:main Oct 11, 2022

mara004 deleted the patch-1 branch October 11, 2022 10:45
