PI: Use iterative DFS in PdfWriter._sweep_indirect_references #1072

Hatell · 2022-07-07T12:28:31Z

The recursive depth-first search (DFS) was replaced with an iterative DFS.
Removed PdfWriter.external_reference_map; a hash is now calculated for every referenced object and used to detect duplicate objects.
In several cases, the warning "Unable to resolve .*, returning NullObject instead" is no longer necessary.
Bugfix: Recalculate the hashes of all parents when a dictionary or array object value changes.
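
For readers unfamiliar with the technique, here is a minimal sketch of an iterative DFS with an explicit stack. It is not the code from this PR: the function name `iterative_dfs` and the dict-based graph are assumptions made for illustration, with keys standing in for object numbers and values for the indirect references each object holds.

```python
from typing import Dict, Hashable, Iterable, List


def iterative_dfs(
    graph: Dict[Hashable, Iterable[Hashable]], start: Hashable
) -> List[Hashable]:
    """Return the nodes reachable from ``start`` in depth-first preorder."""
    visited = set()
    order = []
    stack = [start]  # an explicit stack replaces the interpreter call stack
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # cycles and shared references end up here
        visited.add(node)
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(list(graph.get(node, ()))))
    return order


# Hypothetical object graph with a shared child (4) and a cycle (3 -> 1).
graph = {1: [2, 3], 2: [4], 3: [4, 1], 4: []}
print(iterative_dfs(graph, 1))  # [1, 2, 4, 3]

# A chain far longer than the default recursion limit is not a problem.
chain = {i: [i + 1] for i in range(5000)}
chain[5000] = []
print(len(iterative_dfs(chain, 0)))  # 5001
```

Because the pending nodes live in a regular list, the traversal depth is bounded by available memory rather than by the interpreter's recursion limit.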

Closes #351
Closes #1036

The xref indexes have been updated.

Hatell · 2022-07-07T13:08:01Z

Hm, for some unknown reason py37 and py38 fail, but py39 and py310 pass.

Hatell · 2022-07-08T05:46:52Z

There are two tests with issues:

test_sweep_recursion1
test_sweep_recursion2

I investigated them, and those tests only "succeeded" because _sweep_indirect_references hit the recursion limit. That happens because the PDF contains a linked list of over 1000 items.
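
To illustrate how a long linked list triggers this, here is a small sketch under stated assumptions: the dict-based nodes and the helpers `build_chain` and `recursive_length` are hypothetical stand-ins, not PyPDF2 code. A function that recurses once per node fails as soon as the chain is longer than the interpreter's recursion limit.

```python
import sys


def recursive_length(node):
    """Count the nodes of a linked list by recursing once per node."""
    if node is None:
        return 0
    return 1 + recursive_length(node["next"])


def build_chain(n):
    """Build a singly linked list of n dict nodes (stand-in for chained PDF objects)."""
    head = None
    for _ in range(n):
        head = {"next": head}
    return head


limit = sys.getrecursionlimit()  # 1000 by default in CPython
short_chain = build_chain(100)
long_chain = build_chain(limit + 500)

print(recursive_length(short_chain))  # 100

try:
    recursive_length(long_chain)
    hit_limit = False
except RecursionError:
    hit_limit = True
print(hit_limit)  # True
```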

MartinThoma · 2022-07-08T05:48:48Z

Maybe we should work on getting #351 ready first so that we don't hit the recursion limit anymore?

Hatell · 2022-07-08T05:56:11Z

This can be done with an iterative algorithm like that.

Hatell · 2022-07-08T11:37:31Z

This has now been converted to the iterative version.

Some tests needed an update because the warnings are no longer raised.

codecov · 2022-07-08T11:44:30Z

Codecov Report

Merging #1072 (5b29862) into main (b42e0db) will increase coverage by 0.07%.
The diff coverage is 94.33%.

@@            Coverage Diff             @@
##             main    #1072      +/-   ##
==========================================
+ Coverage   91.50%   91.57%   +0.07%     
==========================================
  Files          24       24              
  Lines        4530     4524       -6     
  Branches      927      926       -1     
==========================================
- Hits         4145     4143       -2     
+ Misses        245      241       -4     
  Partials      140      140

Impacted Files	Coverage Δ
PyPDF2/_writer.py	`89.04% <93.61%> (-0.02%)`	⬇️
PyPDF2/generic.py	`91.61% <100.00%> (+0.35%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

MartinThoma · 2022-07-09T07:57:07Z

Wow, this is amazing @Hatell ! Thank you 🙏 🤗

I will review it today, but it might take until the evening :-)

Hatell · 2022-07-09T08:13:04Z

This is now ready for testing.

The main changes are:

The recursive DFS was changed to this DFS_iterative algorithm.
Removed extern_map; a hash is now calculated for every referenced object and used to detect duplicate objects.
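
The duplicate detection idea can be sketched as follows. This is only an illustration under assumptions: the `ObjectStore` class, its `id_by_hash` map, and the choice of SHA-256 are hypothetical and not taken from PyPDF2. Each object's content is hashed, and an object whose hash is already known gets the existing id instead of a new one.

```python
import hashlib
from typing import Dict, List


class ObjectStore:
    """Toy store that assigns one id per distinct content (illustration only)."""

    def __init__(self) -> None:
        self.objects: List[bytes] = []
        self.id_by_hash: Dict[bytes, int] = {}  # content hash -> object id

    def add(self, data: bytes) -> int:
        digest = hashlib.sha256(data).digest()
        existing = self.id_by_hash.get(digest)
        if existing is not None:
            return existing  # duplicate content: reuse the earlier id
        self.objects.append(data)
        idnum = len(self.objects)  # ids start at 1, like PDF object numbers
        self.id_by_hash[digest] = idnum
        return idnum


store = ObjectStore()
a = store.add(b"<< /Type /Font /BaseFont /Helvetica >>")
b = store.add(b"<< /Type /Page >>")
c = store.add(b"<< /Type /Font /BaseFont /Helvetica >>")
print(a, b, c)  # 1 2 1
print(len(store.objects))  # 2
```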

One fix was needed: recalculating the hashes of all parents if a dictionary or array object value changes.

If data is changed, the keys are updated for all parents. Added checks to the tests to verify that all keys in _idnum_hash are valid.
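
A rough sketch of why the parents need rehashing, assuming a container's hash is derived from its own data plus the hashes of its children. The `Node` class, `rehash_upwards`, and `registry_is_valid` are hypothetical names for illustration and do not mirror the actual PyPDF2 implementation.

```python
import hashlib
from typing import Dict, List, Optional


class Node:
    """Toy container whose hash depends on its children's cached hashes."""

    def __init__(self, data: bytes, parent: Optional["Node"] = None) -> None:
        self.data = data
        self.parent = parent
        self.children: List["Node"] = []
        self.key = b""  # cached hash, set by rehash_upwards
        if parent is not None:
            parent.children.append(self)

    def compute_hash(self) -> bytes:
        h = hashlib.sha256(self.data)
        for child in self.children:
            h.update(child.key)
        return h.digest()


def rehash_upwards(registry: Dict[bytes, Node], node: Node) -> None:
    """Refresh the registry key of ``node`` and of every ancestor."""
    current: Optional[Node] = node
    while current is not None:
        registry.pop(current.key, None)  # drop the stale key
        current.key = current.compute_hash()
        registry[current.key] = current
        current = current.parent


def registry_is_valid(registry: Dict[bytes, Node]) -> bool:
    """Check that every key still matches the hash of the object it maps to."""
    return all(node.compute_hash() == key for key, node in registry.items())


registry: Dict[bytes, Node] = {}
root = Node(b"root")
mid = Node(b"mid", root)
leaf = Node(b"leaf", mid)
rehash_upwards(registry, leaf)

old_root_key = root.key
leaf.data = b"changed"
print(registry_is_valid(registry))  # False: the leaf's key is stale
rehash_upwards(registry, leaf)
print(registry_is_valid(registry))  # True
print(root.key != old_root_key)  # True: the change propagated to the root
```

The final validity check plays the same role as the test assertion described above: after any change, every key in the map should equal the freshly computed hash of its object.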

Hatell · 2022-07-09T11:23:50Z

I think I solved this issue for recalculating hashes when updating a dictionary or array object.

MartinThoma · 2022-07-09T12:29:50Z

Thank you so much for all the effort @Hatell !

I've adjusted the title of the PR and the first message of it. I will use them for the squash commit to represent all of the changes done here. Feel free to adjust if you think there should be something added / adjusted.

MartinThoma · 2022-07-09T12:32:47Z

If you want, you can also remove the

TODO: This test looks like an infinite loop.

in test_merger.py

PyPDF2/_writer.py

MartinThoma · 2022-07-09T20:54:18Z

I'm currently letting a bigger test run through. So far, it looks good. I'm still a tiny bit worried as this is such a core part of PyPDF2 😅

Hatell · 2022-07-10T07:36:01Z

Great, and thanks for the help.

MartinThoma · 2022-07-10T12:07:26Z

Thank you for your contribution ❤️ I'll make a release in a couple of hours

New Features (ENH):
- Add PageObject._get_fonts (#1083)
- Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
- Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
- Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
- Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
- Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
- Stamps and watermarks (#1082)
- OCR vs PDF text extraction (#1081)
- Python Version support
- Formatting of CHANGELOG

Developer Experience (DEV):
- Cache downloaded files (#1070)
- Speed-up for CI (#1069)

Maintenance (MAINT):
- Set page.rotate(angle: int) (#1092)
- Issue #416 was fixed by #1015 (#1078)

Testing (TST):
- Image extraction (#1080)
- Image extraction (#1077)

Code Style (STY):
- Apply black
- Typo in Changelog

Full Changelog: 2.4.2...2.4.3

Harry Karvonen and others added 9 commits July 5, 2022 15:57

Ensure indirect sweep to handle all objects.

190479c

Update PyPDF2/_writer.py

b6cc70e

StreamObject.hash_value_data get DictionaryObject data as well.

1ec14d1

Added test to merge PDF with missing objects.

36aa111

Handle case when PdfWriter._sweep_indirect_references returns None.

b17baef

Refactored extern_map.

c699945

IndirectObject.__repr__ added pdf object identifier.

a195ab3

Check if indirect reference is internal and add object if it external

434a4fd

tests: test_workflow/test_merge_output updated merged PDF

5688668

xref indexes has updated.

Hatell mentioned this pull request Jul 7, 2022

Ensure indirect sweep to handle all objects. #1064

Closed

Harry Karvonen added 2 commits July 8, 2022 14:17

Refactored indirect sweep from recursive to iterative.

7ba13f2

Updated tests because iterative sweep.

ab180db

Fixed type definitions.

7fcb550

Added test to test stream object handling in dictionary object.

d7ff64b

Hatell force-pushed the IS-1036 branch from 5ffe9f2 to d7ff64b Compare July 8, 2022 13:01

Updated test_write_dict_stream_object to check object structure.

d3511df

Harry Karvonen added 3 commits July 9, 2022 11:45

Update _idnum_hash keys if data is changed.

f333ade

If data is changed then update of keys is done all parents. Added checks to tests to verify that all keys in _idnum_hash is valid.

test_merge_output updated result.

2a2d234

Typing fixes.

41f78cf

Hatell force-pushed the IS-1036 branch from 9eef42d to 41f78cf Compare July 9, 2022 09:22

Removed old check.

aa49716

MartinThoma changed the title ~~Refactor: writer._sweep_indirect_references: _idnum_hash and extern_map~~ PI: Use iterative DFS in PdfWriter._sweep_indirect_references Jul 9, 2022

MartinThoma added the nf-performance (Non-functional change: Performance) and PdfWriter (The PdfWriter component is affected) labels Jul 9, 2022