Memoize the zipfile namelist when it's safe to do so. #33
If the zipfile is in 'r' mode, we know it can't change under
us, so it's safe to lazily memoize the namelist (augmented with
the implied directories).
We propagate the memo to all Path objects created from
some existing Path object, for maximum effect.
We memoize the namelist as a set, to make lookups into it efficient.
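
A minimal sketch of the idea described above, not the code from this change. `implied_dirs` and `NameCache` are hypothetical names introduced only for illustration; the sketch assumes the memo is a set combining `ZipFile.namelist()` with the directories implied by the entries, computed once when the archive is in 'r' mode and recomputed on every call otherwise.

```python
import io
import posixpath
import zipfile


def implied_dirs(names):
    """Return directories (with trailing '/') implied by entries but not listed."""
    names = list(names)
    seen = set(names)
    implied = []
    for name in names:
        parent = posixpath.dirname(name.rstrip("/"))
        while parent:
            directory = parent + "/"
            if directory not in seen:
                seen.add(directory)
                implied.append(directory)
            parent = posixpath.dirname(parent)
    return implied


class NameCache:
    """Hypothetical helper: lazily memoizes the augmented namelist as a set."""

    def __init__(self, zf):
        self.zf = zf
        self._memo = None

    def _compute(self):
        raw = self.zf.namelist()
        return set(raw) | set(implied_dirs(raw))

    def names(self):
        if self.zf.mode != "r":
            # The archive may still gain entries, so nothing is cached.
            return self._compute()
        if self._memo is None:
            # First lookup in 'r' mode: compute once and keep the result.
            self._memo = self._compute()
        return self._memo
```

Handing the same `NameCache` instance to every Path derived from an existing one would be one way to propagate the memo, so the set is built at most once per archive.
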
This brings into focus the fact that iterdir's iteration order is arbitrary.
This was actually true before: the order we were getting in
practice was "all the explicit names, in zipfile insertion order,
followed by all the implicit dirs, in zipfile insertion order of
the descendants from which we inferred them". This isn't a
particularly intuitive or useful order for end users. However,
pathlib's iterdir, which this one emulates, also does not guarantee
anything about order (it appears to in practice on macOS, but not
on Linux). So it seems proper to double down on this lack of order.
To allow for ordering when needed, this change adds ordering semantics
for Path objects, to make it easy to sort them if you need to.
The tests now use this sorting.
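
As an illustration of what ordering semantics make possible, here is a sketch using a hypothetical stand-in class, `SketchPath`. It assumes paths compare by their location string inside the archive; the description above does not say which key the real comparison uses.

```python
import functools


@functools.total_ordering
class SketchPath:
    """Hypothetical stand-in for a zip Path; compares by its location string."""

    def __init__(self, at):
        self.at = at

    def __eq__(self, other):
        if not isinstance(other, SketchPath):
            return NotImplemented
        return self.at == other.at

    def __lt__(self, other):
        if not isinstance(other, SketchPath):
            return NotImplemented
        return self.at < other.at

    def __hash__(self):
        return hash(self.at)

    def __repr__(self):
        return f"SketchPath({self.at!r})"


# An iterdir-like result arrives in arbitrary order (here, from a set).
children = {SketchPath("g/"), SketchPath("a.txt"), SketchPath("b/")}

# With ordering defined, callers that need a stable order can simply sort.
ordered = sorted(children)
```
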
As a result of this change, in 'r' mode, exists() and joinpath()
are constant time instead of linear time. They are still linear
in other modes.
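
For reference, these are the two operations in question, shown with the standard library's `zipfile.Path`, which exposes the same method names. This is only a usage sketch on an in-memory archive with made-up entry names; whether a given implementation memoizes the lookup depends on its version.

```python
import io
import zipfile

# Build a small archive in memory; only the file entry is stored explicitly,
# so "pkg/" and "pkg/data/" are implied directories.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/data/values.txt", "1 2 3")

# Reopen in 'r' mode, where the contents cannot change under us.
archive = zipfile.ZipFile(buf, "r")
root = zipfile.Path(archive)

# joinpath() and exists() both consult the archive's set of names.
data_dir = root.joinpath("pkg").joinpath("data")
found = data_dir.exists()
is_dir = data_dir.is_dir()
missing = root.joinpath("nope.txt").exists()
```
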
Existing tests have been extended to also run on on-disk zipfiles opened
in 'r' mode, and a new test has been added, to verify the performance fix.