Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Check duplicate objects in writer._sweep_indirect_references #207

Merged
merged 2 commits into from Jun 28, 2022

Conversation

Hatell
Copy link
Contributor

@Hatell Hatell commented Jun 17, 2015

PdfFileWriter._sweepIndirectReferences checks if object is already in pdf.

This check will reduce file-size significantly when merge pdfs and there are same objects, for example images.

@MartinThoma MartinThoma added the PdfWriter The PdfWriter component is affected label Apr 6, 2022
@py-pdf py-pdf deleted a comment from claird Apr 16, 2022
@MartinThoma
Copy link
Member

Hey! It's been 7 years, but your PR was not forgotten!

Would you mind to either re-do the PR or to fix the merge conflicts?

If not, I can take care of it and give credit to you via the Github co-authored-by feature

@MartinThoma MartinThoma added the needs-change The PR/issue cannot be handled as issue and needs to be improved label Apr 16, 2022
@Hatell
Copy link
Contributor Author

Hatell commented Apr 19, 2022

I can take a look this soon.

@Hatell
Copy link
Contributor Author

Hatell commented Apr 19, 2022

Now I think this is ready to testing.

I done one quick test and verified that merged PDF is smaller and looks same.

@Hatell
Copy link
Contributor Author

Hatell commented Apr 19, 2022

Now tests should pass.

@MartinThoma
Copy link
Member

I'm very sorry that I didn't merge it 🙈 There were just too many open issues / PRs - i completely missed that one 🙈

@MartinThoma
Copy link
Member

Would you mind doing the changes again with the latest main branch?

I would completely understand if you don't want to do that. In that case I would suggest that I take care of it and give the credit to you in the commit message. With Githubs "co-authored-by" feature you would also be seen as an author of the commit.

@MartinThoma MartinThoma changed the title Check duplicate objects when sweepIndirectReferences PI: Check duplicate objects when sweepIndirectReferences Jun 25, 2022
@MartinThoma MartinThoma added nf-performance Non-functional change: Performance needs-rebase This PR cannot be merged as the main branch is too different. You need to rebase or merge main. and removed needs-change The PR/issue cannot be handled as issue and needs to be improved labels Jun 25, 2022
@Hatell
Copy link
Contributor Author

Hatell commented Jun 27, 2022

I take a look.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jun 27, 2022
@codecov
Copy link

codecov bot commented Jun 27, 2022

Codecov Report

Merging #207 (16393f5) into main (53efd73) will decrease coverage by 0.00%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##             main     #207      +/-   ##
==========================================
- Coverage   89.60%   89.59%   -0.01%     
==========================================
  Files          24       24              
  Lines        4425     4441      +16     
  Branches      914      917       +3     
==========================================
+ Hits         3965     3979      +14     
  Misses        314      314              
- Partials      146      148       +2     
Impacted Files Coverage Δ
PyPDF2/_writer.py 88.70% <75.00%> (-0.20%) ⬇️
PyPDF2/generic.py 91.35% <100.00%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53efd73...16393f5. Read the comment docs.

@Hatell Hatell force-pushed the master branch 3 times, most recently from 4eb187e to 9fab63a Compare June 27, 2022 11:01
@MartinThoma MartinThoma removed the needs-rebase This PR cannot be merged as the main branch is too different. You need to rebase or merge main. label Jun 27, 2022
@Hatell
Copy link
Contributor Author

Hatell commented Jun 27, 2022

I think it should work now.

I did few tests and it worked well.

@MartinThoma MartinThoma changed the title PI: Check duplicate objects when sweepIndirectReferences PI: Check duplicate objects when _sweep_indirect_references Jun 27, 2022
@MartinThoma MartinThoma changed the title PI: Check duplicate objects when _sweep_indirect_references PI: Check duplicate objects in writer._sweep_indirect_references Jun 27, 2022
@MartinThoma MartinThoma self-requested a review June 27, 2022 19:36
Copy link
Member

@MartinThoma MartinThoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@MartinThoma
Copy link
Member

@pubpub-zz @MasterOdin What do you think?

@MartinThoma
Copy link
Member

@Hatell If nobody has doubts on that one, I'll merge + release latest on Sunday evening.

@MartinThoma
Copy link
Member

MartinThoma commented Jun 28, 2022

I'm running a test over thousands of PDFs with essentially this code:

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader(path)
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.add_metadata(reader.metadata)

Then I check the file size before and after. When the difference is over 1 MB, I print it for manual inspection. Here is what I found so far: (old size in bytes vs new size in bytes: difference)

@MartinThoma
Copy link
Member

MartinThoma commented Jun 28, 2022

Only by this PR

The size reduction also already happens with the current main, but this PR improves it. However, there were a couple of files that were not having a 1 MB size reduction with the current main:

@MartinThoma
Copy link
Member

MartinThoma commented Jun 28, 2022

@Hatell Using the approach above, a few files failed that were working before:

@Hatell
Copy link
Contributor Author

Hatell commented Jun 28, 2022

Thanks for tests. I check what happens in those files.

Copy link
Member

@MasterOdin MasterOdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like the usage of self._idnum_hash and extern_map share a slightly similar purpose, that could we get rid of the former altogether? With this change, could it be that cases where we would avoid the if surrounding if statement of the change we'll now enter as we've not written to extern_map for those deep self-referential objects.

PyPDF2/_writer.py Outdated Show resolved Hide resolved
@Hatell
Copy link
Contributor Author

Hatell commented Jun 28, 2022

Now those NoneType object has no attribute is handled.

@MartinThoma MartinThoma merged commit ed832b3 into py-pdf:main Jun 28, 2022
@MartinThoma
Copy link
Member

Thank you for your contribution! It will be part of the next release 🎉 (I'm uncertain when that will be, but likely on 3rd of July)

MartinThoma added a commit that referenced this pull request Jun 30, 2022
New Features (ENH):
-  Add writer.pdf_header property (getter and setter) (#1038)

Performance Improvements (PI):
-  Remove b_ call in FloatObject.write_to_stream (#1044)
-  Check duplicate objects in writer._sweep_indirect_references (#207)

Documentation (DOC):
-  How to surppress exceptions/warnings/log messages (#1037)
-  Remove hyphen from lossless (#1041)
-  Compression of content streams (#1040)
-  Fix inconsistent variable names in add-watermark.md (#1039)
-  File size reduction
-  Add CHANGELOG to the rendered docs (#1023)

Maintenance (MAINT):
-  Handle XML error when reading XmpInformation (#1030)
-  Deduplicate Code / add mutmut config (#1022)

Code Style (STY):
-  Use unnecessary one-line function / class attribute (#1043)
-  Docstring formatting (#1033)

Full Changelog: 2.4.0...2.4.1
MartinThoma added a commit that referenced this pull request Jul 5, 2022
Caused-by: #207

Why it wasn't detected by the tests: We don't have any tests that check
for the correct result of a merge. We just check for exceptions

Closes: #1062
MartinThoma added a commit that referenced this pull request Jul 5, 2022
Caused-by: #207

Why it wasn't detected by the tests: We don't have any tests that check
for the correct result of a merge. We just check for exceptions

How we prevent it in future: Unit test was added

Risk of the fix:
- We will have bigger file sizes again as #207 was effectively reverted
- We will need to adjust this test if we change the way we write PDFs

Closes: #1062
MartinThoma added a commit that referenced this pull request Jul 5, 2022
Caused-by: #207

Why it wasn't detected by the tests: We don't have any tests that check
for the correct result of a merge. We just check for exceptions

How we prevent it in future: Unit test was added

Risk of the fix:
- We will have bigger file sizes again as #207 was effectively reverted
- We will need to adjust this test if we change the way we write PDFs

Closes: #1062
MartinThoma added a commit that referenced this pull request Jul 5, 2022
Caused-by: #207

Why it wasn't detected by the tests: We don't have any tests that check
for the correct result of a merge. We just check for exceptions

How we prevent it in future: Unit test was added

Risk of the fix:
- We will have bigger file sizes again as #207 was effectively reverted
- We will need to adjust this test if we change the way we write PDFs

Closes: #1062
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nf-performance Non-functional change: Performance PdfWriter The PdfWriter component is affected soon PRs that are almost ready to be merged, issues that get solved pretty soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants