Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Use iterative DFS in PdfWriter._sweep_indirect_references #1072

Merged
merged 20 commits into from Jul 10, 2022

Conversation

Hatell
Copy link
Contributor

@Hatell Hatell commented Jul 7, 2022

  • Recursive Depth-first search (DFS) was changed to iterative DFS
  • Removed PdfWriter.external_reference_map and calculate hash from every referred object and use that to detect duplicate objects.
  • In several cases, the warning "Unable to resolve .*, returning NullObject instead" is no longer necessary.
  • Bugfix: Recalculate all parents hashes when a dictionary or array object value changes

Closes #351
Closes #1036

@Hatell
Copy link
Contributor Author

Hatell commented Jul 7, 2022

Hm, some unknown reason py37 and py38 fails but py39 and py310 was ok.

@Hatell
Copy link
Contributor Author

Hatell commented Jul 8, 2022

There is two tests with issues:

  • test_sweep_recursion1
  • test_sweep_recursion2

I did research them and those tests "succeeded" because this _sweep_indirect_references hit recursionlimit. And it happens because PDF has a linked list over 1000 items.

@MartinThoma
Copy link
Member

Maybe we should work on getting #351 ready first so that we don't hit the recursion limit anymore?

@Hatell
Copy link
Contributor Author

Hatell commented Jul 8, 2022

This can be done in iterative algorithm like that.

@Hatell
Copy link
Contributor Author

Hatell commented Jul 8, 2022

Now this is transformed to iterative version.

Some tests needed update because warnings was not raised any more.

@codecov
Copy link

codecov bot commented Jul 8, 2022

Codecov Report

Merging #1072 (5b29862) into main (b42e0db) will increase coverage by 0.07%.
The diff coverage is 94.33%.

@@            Coverage Diff             @@
##             main    #1072      +/-   ##
==========================================
+ Coverage   91.50%   91.57%   +0.07%     
==========================================
  Files          24       24              
  Lines        4530     4524       -6     
  Branches      927      926       -1     
==========================================
- Hits         4145     4143       -2     
+ Misses        245      241       -4     
  Partials      140      140              
Impacted Files Coverage Δ
PyPDF2/_writer.py 89.04% <93.61%> (-0.02%) ⬇️
PyPDF2/generic.py 91.61% <100.00%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b42e0db...5b29862. Read the comment docs.

@MartinThoma
Copy link
Member

Wow, this is amazing @Hatell ! Thank you 🙏 🤗

I will review it today, but it might take to the evening :-)

@Hatell
Copy link
Contributor Author

Hatell commented Jul 9, 2022

This is now ready for testing.

Main changes is:

  • Recursive DFS was changed to this DFS_iterative algorithm.
  • Removed extern_map and calculate hash from every referred object and use that to detect duplicate objects.

One fix need to be done to recalculate all parents hash if dictionary or array object value changes.

Harry Karvonen added 3 commits July 9, 2022 11:45
If data is changed then update of keys is done all parents.
Added checks to tests to verify that all keys in _idnum_hash is valid.
@Hatell
Copy link
Contributor Author

Hatell commented Jul 9, 2022

I think I solved this issue for recalculating hashes when updating a dictionary or array object.

@MartinThoma MartinThoma changed the title Refactor: writer._sweep_indirect_references: _idnum_hash and extern_map PI: Use iterative DFS in PdfWriter._sweep_indirect_references Jul 9, 2022
@MartinThoma
Copy link
Member

Thank you so much for all the effort @Hatell !

I've adjusted the title of the PR and the first message of it. I will use them for the squash commit to represent all of the changes done here. Feel free to adjust if you think there should be something added / adjusted.

@MartinThoma MartinThoma added nf-performance Non-functional change: Performance PdfWriter The PdfWriter component is affected labels Jul 9, 2022
@MartinThoma
Copy link
Member

MartinThoma commented Jul 9, 2022

If you want, you can also remove the

TODO: This test looks like an infinite loop.

in test_merger.py

PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
PyPDF2/_writer.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

I'm currently letting a bigger text run through. So far, it looks good. I'm still a tiny bit worried as this is such a core part of PyPDF2 😅

@Hatell
Copy link
Contributor Author

Hatell commented Jul 10, 2022

Great and thanks for help.

@MartinThoma MartinThoma merged commit 1e4c2c9 into py-pdf:main Jul 10, 2022
@MartinThoma
Copy link
Member

Thank you for your contribution ❤️ I'll make a release in a couple of hours

MartinThoma added a commit that referenced this pull request Jul 10, 2022
New Features (ENH):
-  Add PageObject._get_fonts (#1083)
-  Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
-  Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
-  Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
-  Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
-  Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
-  Stamps and watermarks (#1082)
-  OCR vs PDF text extraction (#1081)
-  Python Version support
-  Formatting of CHANGELOG

Developer Experience (DEV):
-  Cache downloaded files (#1070)
-  Speed-up for CI (#1069)

Maintenance (MAINT):
-  Set page.rotate(angle: int) (#1092)
-  Issue #416 was fixed by #1015 (#1078)

Testing (TST):
-  Image extraction (#1080)
-  Image extraction (#1077)

Code Style (STY):
-  Apply black
-  Typo in Changelog

Full Changelog: 2.4.2...2.4.3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nf-performance Non-functional change: Performance PdfWriter The PdfWriter component is affected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor writer._sweep_indirect_references: _idnum_hash and extern_map
2 participants