Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-42306: Performance optimizations #76

Merged
merged 28 commits into from Jan 23, 2024
Merged

DM-42306: Performance optimizations #76

merged 28 commits into from Jan 23, 2024

Conversation

timj
Copy link
Member

@timj timj commented Jan 16, 2024

  • No longer call os.path.isdir when constructing a schemeless URI.
  • Simplify relativeToPathRoot and getExtension to remove Path constructor.
  • Add tri-state for dirLike and forceDirectory to allow someone to declare that a URI is definitely file-like or else say that they don't know.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Copy link

codecov bot commented Jan 16, 2024

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (651d55e) 86.53% compared to head (1d37903) 86.84%.

Files Patch % Lines
python/lsst/resources/_resourcePath.py 86.36% 3 Missing and 3 partials ⚠️
python/lsst/resources/schemeless.py 93.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
+ Coverage   86.53%   86.84%   +0.31%     
==========================================
  Files          27       27              
  Lines        4202     4248      +46     
  Branches      850      858       +8     
==========================================
+ Hits         3636     3689      +53     
+ Misses        445      438       -7     
  Partials      121      121              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return self.dirLike or os.path.isdir(self.ospath)
if self.dirLike is None:
# Cache state for next time.
self.dirLike = os.path.isdir(self.ospath)
Copy link
Member Author

@timj timj Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourcePath is nominally immutable but this cache can be updated from None to bool. This will presumably need locking around it and in packageresource and schemeless implementations.

I am also wondering if this and isTemporary and isLocal should be made read-only getters with _dirLike etc used internally.

Copy link
Contributor

@dhirving dhirving left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general thrust of this change, but I'm concerned about some of the details.

Unless I'm misunderstanding things, this seems to break some of the code paths for schemeless paths when the user does not explicitly specify forceDirectory, in the cases where schemeless is used to represent "some path that will be later bound to a root other than the current working directory".

It also appears to add many new calls to stat for File and Schemeless paths where previously no I/O was done, if the user doesn't explicitly specify forceDirectory.

python/lsst/resources/schemeless.py Show resolved Hide resolved
python/lsst/resources/schemeless.py Outdated Show resolved Hide resolved
python/lsst/resources/_resourcePath.py Outdated Show resolved Hide resolved
Comment on lines 247 to 257
if not root_uri.isdir():
raise ValueError(
f"Root URI ({root}) was not a directory so can not be joined with"
f" path {parsed.path!r}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception will always go off if root_uri was created with forceDirectory=None and and is schemeless in the "relative path not relevant to the local filesystem" sense. As above, I wonder if this test should be root_uri.dirLike is False.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

root_uri is required to be a directory. This code is going to trigger a stat call in rare case when root is a schemeless URI itself. All other cases dirLike will only be a boolean. Again it becomes an issue of "assume it's a directory because what other choice is there" vs "is that directory a real directory relative to cwd".

Comment on lines 433 to 448
if self.isdir():
# This is a directory so must return itself and the empty string.
return self, ""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined with the change to schemeless.isdir, doesn't this addition break the schemeless forceDirectory=None case for relative directory paths not present in the current working directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a/b it will return a/b for the directory if both a and b happen to exist relative to cwd and b is a directory. This does mean that you can get a different answer based on what directory you happen to be sitting in for schemeless. Changing semantics such that people are required to provide the trailing slash for a directory is going to break things a little as described above with zsh dropping / on autocomplete on the command line.

os.path and pathlib assume things based on a trailing slash that they sometimes drop in other scenarios.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline we have agreed to leave split() calling isdir() and to deprecate split, dirname, and basename on another ticket. pathlib.Path has a consistent approach of parent and name and we will switch to that and if someone wants to know if it is a directory they call isdir themselves.

python/lsst/resources/_resourcePath.py Outdated Show resolved Hide resolved
python/lsst/resources/_resourcePath.py Outdated Show resolved Hide resolved
python/lsst/resources/_resourcePath.py Show resolved Hide resolved
tests/test_schemeless.py Show resolved Hide resolved
python/lsst/resources/_resourcePath.py Outdated Show resolved Hide resolved
@timj timj force-pushed the tickets/DM-42306 branch 4 times, most recently from 58d8ad3 to 6688fa3 Compare January 19, 2024 22:07

Notes
-----
Equivalent to `os.path.split` where head preserves the URI
components.
"""
if self.dirLike:
Copy link
Member Author

@timj timj Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem remaining is this (which you can reproduce from the resources directory):

>>> from lsst.resources import ResourcePath
>>> r = ResourcePath("python")
>>> r
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/python")
>>> r.parent()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")
>>> r.dirname()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")
>>> r.isdir()
True
>>> r.dirname()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/python")
>>> r.parent()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")

so dirname() gives the right answer for the absolute file URI only when it's allowed to check the file system to be sure and the answer changes depending on whether isdir has been called or not.

If you added the / to the file path or did forceDirectory it would all work obviously.

Copy link
Contributor

@dhirving dhirving Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider removing dirname() entirely (or making it an alias for parent). I just looked and outside of Butler the only thing using it is lsst.alert.packet.schema:from_uri, which only uses it for remote URLs and would get identical results from parent(). I haven't looked but I strongly suspect the Butler cases (other than the ButlerConfig thing) would be just as happy using parent() as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view dirname is just "parent, but confusing."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.path.dirname is clearly ridiculous given it changes what it does based on whether it thinks it's a directory or not. Path.parent doesn't care.

@timj timj force-pushed the tickets/DM-42306 branch 5 times, most recently from 59681a3 to 7a80091 Compare January 22, 2024 21:57
@timj timj requested a review from dhirving January 22, 2024 22:19
@timj
Copy link
Member Author

timj commented Jan 22, 2024

@dhirving I think this is ready for review again. I fixed the butler issue (it was related to the updatedFile changes). I think the main outcomes are:

  • Need to deprecated dirname, split, and basename
  • Need to deprecate updatedFile deciding to do a join if it sees a directory.

Copy link
Contributor

@dhirving dhirving left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

This would only work if the file pointed to by the schemeless URI
happened to exist. Given that a schemeless URI could be relative
to any location, we should not be checking relative to cwd unless
we are really asked to see if this is a directory or to read it.
For schemeless URIs we do not know whether something is a directory
or not without checking the file system but we should not do that
until we are asked to find out. Add a None state to the dirLike
property to indicate that directory status is not yet known.
This is only relevant for files since S3 resources rely on
the trailing slash.
The code was trying very hard not to assume that this was
simply removing the leading "/" but in every case all it
ends up doing is removing the leading "/". Make the code
do that directly.
The default of None means that the code will try to work it out
for itself.
pathlib did not require that we restrict it to just the file
part of the path.
The pathlib suffixes property works well but the pathlib
constructor is very slow only for it to be discarded
immediately. Therefore write our own implementation.
Otherwise the fallback code path to update the cache won't
trigger.
Previously it had not been modified to understand forceDirectory=None.
Also, we do not cache the result if the resource is not present.
Now always appends without trying to replace a trailing file
component. Complains if the URI refers to a file already.
Only overrides if the state is not known.
Document split can use the file system.
Also document that the current behavior should not be relied upon
for directories and will likely change in the future.
Stop it caring one way or the other.
@timj timj merged commit bee90fe into main Jan 23, 2024
17 checks passed
@timj timj deleted the tickets/DM-42306 branch January 23, 2024 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants