Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memoize the zipfile namelist when it's safe to do so. #33

Closed
wants to merge 1 commit into from

Conversation

benjyw
Copy link

@benjyw benjyw commented Jan 23, 2020

If the zipfile is in 'r' mode, we know it can't change under
us, so it's safe to lazily memoize the namelist (augmented with
the implied directories).

We propagate the memo to all Path objects created from
some existing Path object, for maximum effect.

We memoize the namelist as a set, to make lookups into it efficient.
This brings into focus the fact that iterdir's iteration order is arbitrary.

This was actually true before - the order we were getting in
practice was "all the explicit names, in zipfile insertion order,
followed by all the implicit dirs, in zipfile insertion order of
the descendants from which we inferred them). This isn't a
particularly intuitive or useful order for end users. However
pathlib's iterdir, which this one emulates, also does not guarantee
anything about order (it appears to in practice on MacOS, but not
on Linux). So it seems proper to double-down on this lack of order.

To allow for ordering when needed, this change adds ordering semantics
for Path objects, to make it easy to sort them if you need to.
The tests now use this sorting.

As a result of this change, in 'r' mode, exists() and joinpath()
are constant time instead of linear time. They are still linear
in other modes.

Existing tests have been extended to also run on on-disk zipfiles opened
in 'r' mode, and a new test has been added, to verify the performance fix.

If the zipfile is in 'r' mode, we know it can't change under
us, so it's safe to lazily memoize the namelist (augmented with
the implied directories).

We memoize it as a set, to make lookups into it efficient.
This emphasizes that iterdir's iteration order is arbitrary.

This was actually true before - the order we were getting in
practice was "all the explicit names, in zipfile insertion order,
followed by all the implicit dirs, in zipfile insertion order of
the descendants from which we inferred them). This isn't a
particularly intuitive or useful order for end users. However
pathlib's iterdir, which this one emulates, also does not guarantee
anything about order (it appears to in practice on MacOS, but not
on Linux). So it seems right to double-down on this lack of order.

This change adds ordering semantics for Path objects, to make it
easy to sort them if you need to. The tests now use this sorting.

As a result of this change, in 'r' mode, exists() and joinpath()
are constant time instead of linear time.  They are still linear
in other modes.
@jaraco
Copy link
Owner

jaraco commented Jan 24, 2020

I feel pretty good about this change. I have some ideas about how to tweak it a bit, but that doesn't have to block adoption. Having the unit test to prevent regression is really great, so thank you for that.

jaraco added a commit that referenced this pull request Jan 25, 2020
@jaraco
Copy link
Owner

jaraco commented Jan 25, 2020

Please have a look at #34, where I've taken inspiration from this contribution and refactored it into a design I'm much more comfortable accepting.

@jaraco jaraco closed this Jan 25, 2020
jaraco added a commit that referenced this pull request Jan 25, 2020
jaraco added a commit that referenced this pull request Jan 30, 2021
* Use `extend-ignore` in flake8 config

This option allows to add extra ignored rules to the default list
instead of replacing it.

The default exclusions are: E121, E123, E126, E226, E24, E704,
W503 and W504.

Fixes #28.

Refs:
* https://github.com/pypa/setuptools/pull/2486/files#r541943356
* https://flake8.pycqa.org/en/latest/user/options.html#cmdoption-flake8-extend-ignore
*
https://flake8.pycqa.org/en/latest/user/options.html#cmdoption-flake8-ignore

* Enable complexity limit. Fixes jaraco/skeleton#34.

* Replace pep517.build with build (#37)

* Replace pep517.build with build

Resolves #30

* Prefer simple usage

Co-authored-by: Jason R. Coombs <jaraco@jaraco.com>

* Use license_files instead of license_file in meta (#35)

Singular `license_file` is deprecated since wheel v0.32.0.

Refs:
* https://wheel.readthedocs.io/en/stable/news.html
* https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file

Co-authored-by: Jason R. Coombs <jaraco@jaraco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants