Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Use __slots__ #1524

Closed
wants to merge 11 commits into from
Closed

PI: Use __slots__ #1524

wants to merge 11 commits into from

Conversation

MartinThoma
Copy link
Member

@MartinThoma MartinThoma commented Dec 31, 2022

More information

An awesome explanation on StackOverflow

How much is it worth?

It's about a 10x improvement in performance

Before:
bench-1

After:
bench-2

Text extraction speed according to the benchmark is not affected at all 🥲

@MartinThoma
Copy link
Member Author

I'm running into https://stackoverflow.com/q/10857515/562769

@MartinThoma MartinThoma marked this pull request as draft December 31, 2022 10:19
@MartinThoma
Copy link
Member Author

@pubpub-zz Is indirect_reference actually a property of PdfObjectProtocol / PdfObject? I've only seen it on Page and Destination

@pubpub-zz
Copy link
Collaborator

Quite. 99% of the time
This property is only initiated for objects attached to a pdfreader or a PdfWriter. The cloning process checks if this property exists to do some specific actions.

@MartinThoma
Copy link
Member Author

MartinThoma commented Jan 4, 2023

There are two places where we assign indirect_reference:

PdfReader:

    def cache_indirect_object(
        self, generation: int, idnum: int, obj: Optional[PdfObject]
    ) -> Optional[PdfObject]:
        if (generation, idnum) in self.resolved_objects:
            msg = f"Overwriting cache for {generation} {idnum}"
            if self.strict:
                raise PdfReadError(msg)
            logger_warning(msg, __name__)
        self.resolved_objects[(generation, idnum)] = obj
        if obj is not None:
            obj.indirect_reference = IndirectObject(idnum, generation, self)
        return obj

PdfWriter:

    def _add_object(self, obj: PdfObject) -> IndirectObject:
        if hasattr(obj, "indirect_reference") and obj.indirect_reference.pdf == self:  # type: ignore
            return obj.indirect_reference  # type: ignore
        self._objects.append(obj)
        obj.indirect_reference = IndirectObject(len(self._objects), 0, self)
        return obj.indirect_reference

Those cause problems with the slots (potentially only for mypy)

@codecov
Copy link

codecov bot commented Jan 4, 2023

Codecov Report

Base: 91.83% // Head: 91.83% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (ded82f6) compared to base (ea598dd).
Patch coverage: 90.90% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1524   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files          33       33           
  Lines        6073     6077    +4     
  Branches     1193     1193           
=======================================
+ Hits         5577     5581    +4     
  Misses        321      321           
  Partials      175      175           
Impacted Files Coverage Δ
pypdf/generic/_data_structures.py 89.97% <0.00%> (ø)
pypdf/_protocols.py 80.00% <100.00%> (ø)
pypdf/_reader.py 90.36% <100.00%> (ø)
pypdf/_writer.py 83.75% <100.00%> (ø)
pypdf/generic/_base.py 99.64% <100.00%> (+<0.01%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@MartinThoma MartinThoma marked this pull request as ready for review January 4, 2023 10:46
@MartinThoma
Copy link
Member Author

@pubpub-zz The interesting question is if FloatObject and BooleanObject ever get an attribute dynamically assigned. I only see dynamic attribute assignment for indirect_reference . I also only see it happening in

  • PdfReader.cache_indirect_object
  • PdfWriter._add_object
  • DictionaryObject._clone

I guess all of them (or at least one) could assign indirect_reference to a float object, right?

@MartinThoma MartinThoma changed the title PERF: Use __slots__ PI: Use __slots__ Jan 6, 2023
@pubpub-zz
Copy link
Collaborator

Although unusual the indirect_reference can point to an FloatObject and NumberObject.

@MartinThoma
Copy link
Member Author

Ok, then I'll close it. We need to find another way (or find a way to add indirect_reference to the object while having multiple inheritance)

@MartinThoma MartinThoma closed this Jan 7, 2023
@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Jan 13, 2023
@MartinThoma MartinThoma reopened this Jan 24, 2023
@MartinThoma
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nf-performance Non-functional change: Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants