[Discussion] PyPDF2 ❤️ pdfrw #232

Open
MartinThoma opened this issue Apr 9, 2022 · 2 comments


@MartinThoma

I recently became the maintainer of PyPDF2. While it will take me some time to increase test coverage, merge/close PRs, and deal with GitHub tickets (issues), I'm looking forward to a new major release in which I can deprecate some parts.

One thing I'd like to do is change the interface. It starts with simple things like replacing reader.getNumPages() with len(reader), changing the camelCase method names to snake_case, and adding type annotations.
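A minimal sketch of how such a rename could be phased in with a deprecation warning. The `Reader` class below is a hypothetical stand-in written for illustration, not PyPDF2's actual implementation; only the names `getNumPages()` and `len(reader)` come from the text above.

```python
import warnings


class Reader:
    """Hypothetical stand-in for a PDF reader; not PyPDF2's real class."""

    def __init__(self, pages):
        self._pages = list(pages)

    def __len__(self):
        # New-style interface: len(reader) returns the page count.
        return len(self._pages)

    def getNumPages(self):
        # Old camelCase name kept as a thin wrapper that warns callers.
        warnings.warn(
            "getNumPages() is deprecated; use len(reader) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return len(self)


reader = Reader(["page 1", "page 2", "page 3"])
print(len(reader))
```

Keeping the old name as a warning wrapper for one release cycle lets existing callers keep working while they migrate.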

I was wondering how big the difference between pdfrw and PyPDF2 is, and how the two are related. Some things look super similar. Does pdfrw have its origins as a fork of PyPDF2?

Maybe it's possible to join efforts (I could imagine merging pdfrw into PyPDF2, creating a shared "base library" for both projects, or sharing parts of the test suite between both projects).

@pmaupin
Owner

pmaupin commented Apr 9, 2022 via email

@abubelinha

abubelinha commented Feb 20, 2023

  1. pdfrw is much faster than pypdf

I can confirm this.
I was using pypdf to extract some pages of a big pdf to create smaller files.

When I found this issue I wanted to check the speed difference, and followed this pdfrw example in order to compare them:
https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py

I adapted it to take a list of page numbers and extract those pages from a given PDF with 340 pages (EDIT: 58 MB scanned book, 175 KB/page on average).
I noticed the following:

  • The speed difference was huge: pdfrw was many times faster than pypdf in my first tests. Then I realized that the gap grows with the number of extracted pages: pdfrw's run time is barely affected by increasing the number of pages, whereas pypdf's run time (in seconds) increases a lot.
--------------------------------------------------
4 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
N only affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9]
pdfrw: 1478.83 KB output size, took 1.61 seconds
pypdf: 1486.26 KB output size, took 6.684 seconds
pypdf_time / pdfrw_time = 4.15 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1479.90 KB output size, took 1.146 seconds
pypdf: 2972.23 KB output size, took 13.163 seconds
pypdf_time / pdfrw_time = 11.49 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1480.97 KB output size, took 1.127 seconds
pypdf: 4458.29 KB output size, took 19.644 seconds
pypdf_time / pdfrw_time = 17.43 ratio
--------------------------------------------------
NOW 8 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
Different pages number only affects pdfrw output size, but not its speed
Pages number (no matter they are repeated or different) affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2774.33 KB output size, took 1.691 seconds
pypdf: 2790.06 KB output size, took 13.073 seconds
pypdf_time / pdfrw_time = 7.73 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2776.51 KB output size, took 1.181 seconds
pypdf: 5580.01 KB output size, took 26.387 seconds
pypdf_time / pdfrw_time = 22.34 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2778.69 KB output size, took 1.171 seconds
pypdf: 8369.96 KB output size, took 39.936 seconds
pypdf_time / pdfrw_time = 34.1 ratio
--------------------------------------------------
  • Memory consumption is much higher with pypdf; e.g., if I use N=5, it eats all my RAM (whereas pdfrw is not affected at all).
  • Also, the PDF files generated by pdfrw are much smaller: in particular, repeated pages do not increase pdfrw's output size at all, whereas with pypdf they multiply both the output file size and the script's run time.
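
A rough sketch of how a comparison like the one above could be set up. This is not the script actually used for these numbers; the function names and the assumption of 1-based page numbers are mine. The pdfrw part follows the `subset.py` example linked earlier, and the pypdf part uses its `PdfReader`/`PdfWriter` classes.

```python
import os
import time


def subset_with_pdfrw(src, dst, page_numbers):
    # Imported lazily so the harness loads even if pdfrw is not installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(src).pages
    writer = PdfWriter(dst)
    for number in page_numbers:
        writer.addpage(pages[number - 1])  # assumes 1-based page numbers
    writer.write()


def subset_with_pypdf(src, dst, page_numbers):
    # Imported lazily so the harness loads even if pypdf is not installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for number in page_numbers:
        writer.add_page(reader.pages[number - 1])  # assumes 1-based page numbers
    with open(dst, "wb") as handle:
        writer.write(handle)


def measure(subset_func, src, dst, page_numbers):
    """Return (elapsed seconds, output size in KB) for one subset run."""
    start = time.perf_counter()
    subset_func(src, dst, page_numbers)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(dst) / 1024


def repeated(page_numbers, n):
    """Repeat the page list n times, as in the N=1, 2, 3 runs."""
    return list(page_numbers) * n
```

For example, `measure(subset_with_pypdf, "book.pdf", "out_pypdf.pdf", repeated([4, 6, 8, 9], 2))` would time the N=2 case, where `book.pdf` and `out_pypdf.pdf` are placeholder file names.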

@MartinThoma the difference is so amazing that I'd say there is something wrong with pypdf's memory usage.
I tested this on Windows 7 with Python 3.8.

Regards
@abubelinha

EDIT: a recent version of the test is now posted at py-pdf/benchmarks#7
