New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-42306: Performance optimizations #76
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #76 +/- ##
==========================================
+ Coverage 86.53% 86.84% +0.31%
==========================================
Files 27 27
Lines 4202 4248 +46
Branches 850 858 +8
==========================================
+ Hits 3636 3689 +53
+ Misses 445 438 -7
Partials 121 121 ☔ View full report in Codecov by Sentry. |
0699b18
to
7f9ff4a
Compare
6296a1d
to
234c1af
Compare
return self.dirLike or os.path.isdir(self.ospath) | ||
if self.dirLike is None: | ||
# Cache state for next time. | ||
self.dirLike = os.path.isdir(self.ospath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResourcePath
is nominally immutable but this cache can be updated from None
to bool
. This will presumably need locking around it and in packageresource and schemeless implementations.
I am also wondering if this and isTemporary
and isLocal
should be made read-only getters with _dirLike
etc used internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general thrust of this change, but I'm concerned about some of the details.
Unless I'm misunderstanding things, this seems to break some of the code paths for schemeless paths when the user does not explicitly specify forceDirectory, in the cases where schemeless is used to represent "some path that will be later bound to a root other than the current working directory".
It also appears to add many new calls to stat
for File and Schemeless paths where previously no I/O was done, if the user doesn't explicitly specify forceDirectory.
if not root_uri.isdir(): | ||
raise ValueError( | ||
f"Root URI ({root}) was not a directory so can not be joined with" | ||
f" path {parsed.path!r}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception will always go off if root_uri was created with forceDirectory=None and and is schemeless in the "relative path not relevant to the local filesystem" sense. As above, I wonder if this test should be root_uri.dirLike is False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
root_uri is required to be a directory. This code is going to trigger a stat call in rare case when root is a schemeless URI itself. All other cases dirLike will only be a boolean. Again it becomes an issue of "assume it's a directory because what other choice is there" vs "is that directory a real directory relative to cwd".
if self.isdir(): | ||
# This is a directory so must return itself and the empty string. | ||
return self, "" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combined with the change to schemeless.isdir, doesn't this addition break the schemeless forceDirectory=None case for relative directory paths not present in the current working directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a/b
it will return a/b
for the directory if both a
and b
happen to exist relative to cwd and b
is a directory. This does mean that you can get a different answer based on what directory you happen to be sitting in for schemeless. Changing semantics such that people are required to provide the trailing slash for a directory is going to break things a little as described above with zsh dropping /
on autocomplete on the command line.
os.path and pathlib assume things based on a trailing slash that they sometimes drop in other scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offline we have agreed to leave split()
calling isdir()
and to deprecate split
, dirname
, and basename
on another ticket. pathlib.Path
has a consistent approach of parent
and name
and we will switch to that and if someone wants to know if it is a directory they call isdir
themselves.
58d8ad3
to
6688fa3
Compare
|
||
Notes | ||
----- | ||
Equivalent to `os.path.split` where head preserves the URI | ||
components. | ||
""" | ||
if self.dirLike: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem remaining is this (which you can reproduce from the resources directory):
>>> from lsst.resources import ResourcePath
>>> r = ResourcePath("python")
>>> r
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/python")
>>> r.parent()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")
>>> r.dirname()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")
>>> r.isdir()
True
>>> r.dirname()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/python")
>>> r.parent()
ResourcePath("file:///Users/timj/work/lsstsw/build/resources/")
so dirname()
gives the right answer for the absolute file URI only when it's allowed to check the file system to be sure and the answer changes depending on whether isdir
has been called or not.
If you added the /
to the file path or did forceDirectory
it would all work obviously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might consider removing dirname() entirely (or making it an alias for parent). I just looked and outside of Butler the only thing using it is lsst.alert.packet.schema:from_uri, which only uses it for remote URLs and would get identical results from parent(). I haven't looked but I strongly suspect the Butler cases (other than the ButlerConfig thing) would be just as happy using parent() as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view dirname is just "parent, but confusing."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.path.dirname
is clearly ridiculous given it changes what it does based on whether it thinks it's a directory or not. Path.parent
doesn't care.
59681a3
to
7a80091
Compare
@dhirving I think this is ready for review again. I fixed the butler issue (it was related to the updatedFile changes). I think the main outcomes are:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
This would only work if the file pointed to by the schemeless URI happened to exist. Given that a schemeless URI could be relative to any location, we should not be checking relative to cwd unless we are really asked to see if this is a directory or to read it.
For schemeless URIs we do not know whether something is a directory or not without checking the file system but we should not do that until we are asked to find out. Add a None state to the dirLike property to indicate that directory status is not yet known. This is only relevant for files since S3 resources rely on the trailing slash.
The code was trying very hard not to assume that this was simply removing the leading "/" but in every case all it ends up doing is removing the leading "/". Make the code do that directly.
The default of None means that the code will try to work it out for itself.
pathlib did not require that we restrict it to just the file part of the path.
The pathlib suffixes property works well but the pathlib constructor is very slow only for it to be discarded immediately. Therefore write our own implementation.
Otherwise the fallback code path to update the cache won't trigger.
Previously it had not been modified to understand forceDirectory=None. Also, we do not cache the result if the resource is not present.
Now always appends without trying to replace a trailing file component. Complains if the URI refers to a file already.
Only overrides if the state is not known.
Document split can use the file system.
Also document that the current behavior should not be relied upon for directories and will likely change in the future.
Stop it caring one way or the other.
Those no longer exist.
7a80091
to
1d37903
Compare
Checklist
doc/changes