Optimization: job/project initialization. #451
…oading the state point lazily.
Since the new code path does not cover that path anymore.
…glotzerlab/signac into feature/lazy-statepoint-loading
…cases that would change public APIs).
@@ -1196,16 +1214,18 @@ def write_statepoints(self, statepoints=None, fn=None, indent=2):
         with open(fn, "w") as file:
             file.write(json.dumps(tmp, indent=indent))
 
-    def _register(self, job):
+    def _register(self, _id, statepoint):
I changed this method to reduce overhead. The id and state point are already known in all the cases where `project._register` is called, but the previous implementation had to explicitly perform a conversion back from the synced state point to a plain dict. That accounted for some of the slowdown between `master` and `feature/lazy-statepoint-loading` in #239 when opening jobs by state point.
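To illustrate the difference in call shape, here is a minimal sketch, not the actual signac code. `SyncedStatePoint`, `SketchJob`, `SketchProject`, `_register_old`, and `_sp_cache` are hypothetical stand-ins, and the sketch assumes that registration amounts to storing a plain-dict state point in a per-project cache keyed by job id.

```python
import copy


class SyncedStatePoint:
    """Hypothetical stand-in for a state point wrapper synced to disk."""

    def __init__(self, data):
        self._data = dict(data)

    def to_plain_dict(self):
        # Deep copy to mimic the cost of converting back to a plain dict.
        return copy.deepcopy(self._data)


class SketchJob:
    """Hypothetical job holding an id and a synced state point."""

    def __init__(self, _id, statepoint):
        self.id = _id
        self.statepoint = SyncedStatePoint(statepoint)


class SketchProject:
    def __init__(self):
        # Assumed cache: job id -> plain-dict state point.
        self._sp_cache = {}

    def _register_old(self, job):
        # Old call shape: only the job is passed in, so the synced state
        # point has to be converted back to a plain dict on every call.
        self._sp_cache[job.id] = job.statepoint.to_plain_dict()

    def _register(self, _id, statepoint):
        # New call shape: the caller already has both values, so they are
        # stored directly with no conversion step.
        self._sp_cache[_id] = statepoint


project = SketchProject()
project._register("abc123", {"a": 1})
project._register_old(SketchJob("def456", {"b": 2}))
```

Both calls leave the same kind of entry in the cache; the new signature simply skips the round trip through the wrapper object.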
I've studied the diffs (less the surrounding code) and my main concern is the naming issue. See my inline comments.
Will
Not necessary -
@csadorf @tommy-waltmann @klywang I would appreciate a second review if you are able. @cbkerr Thank you for the review.
Looks good! 👍 Sorry for the delay.
Description
This is a follow-up PR to #239, but it may be possible to adopt this set of changes separately from #239 if that PR stalls. I recommend that reviewers focus on #239 first and then handle this PR second.
This PR refactors the `__init__` methods of the `Job` and `Project` classes so that less work is done during initialization, instead deferring (and often caching) that work in a property. This requires no API breaking changes, because the existing methods and properties were enough to handle this change.
One tricky part of this is that I added a "mutation hook" to the `Project` class. In 99% of users' workflows, the `Project` config (which contains information about the workspace directory name, project name, etc.) does not need to be modified once the project is instantiated. While in-memory config mutation is allowed, mutating the configuration of a `Project` instance has been deprecated since version 1.3. However, there is a significant performance penalty to allowing this mutation: every job that is opened must determine its workspace directory by first checking the project's workspace directory and joining its own id. That requires an access of the project configuration (and more importantly, various path manipulations) for every job, and it shows up quite obviously when profiling iteration over a large number of jobs. To solve this, I added a `_mutate_hook` method that will reset the values of any cached properties that are derived from the project configuration if the configuration is edited in memory.
I have verified that the existing tests should cover these changes, particularly the changes to the `Project` cached properties and config mutation hook.
Motivation and Context
This makes the "lazy state point loading" in #239 even lazier, and thus extremely fast for many common operations (e.g. in signac-flow).
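The deferred initialization and the mutation hook described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: `SketchConfig`, `SketchProject`, `SketchJob`, the `workspace_dir` key, and the `computations` counter are all hypothetical names; only the idea of a `_mutate_hook` that resets cached, config-derived values comes from the description.

```python
import os


class SketchConfig(dict):
    """Hypothetical config mapping that notifies its owner on mutation."""

    def __init__(self, *args, mutate_hook=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._mutate_hook = mutate_hook

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if self._mutate_hook is not None:
            self._mutate_hook()


class SketchProject:
    def __init__(self, root, config):
        self._root = root
        self._config = SketchConfig(config, mutate_hook=self._mutate_hook)
        self._wd = None  # cached workspace path, computed on first access
        self.computations = 0  # counts path computations, for illustration

    def _mutate_hook(self):
        # Reset every cached value that is derived from the configuration.
        self._wd = None

    @property
    def config(self):
        return self._config

    @property
    def workspace(self):
        if self._wd is None:
            self.computations += 1
            self._wd = os.path.join(self._root, self._config["workspace_dir"])
        return self._wd


class SketchJob:
    def __init__(self, project, _id):
        # Cheap initialization: store references only, compute nothing.
        self._project = project
        self._id = _id
        self._wd = None

    @property
    def workspace(self):
        # Computed lazily, reusing the project's cached workspace path.
        if self._wd is None:
            self._wd = os.path.join(self._project.workspace, self._id)
        return self._wd


project = SketchProject("/data/proj", {"workspace_dir": "workspace"})
jobs = [SketchJob(project, f"{i:04d}") for i in range(1000)]
paths = [job.workspace for job in jobs]  # project path is computed once

project.config["workspace_dir"] = "ws2"  # triggers the hook, clears cache
new_job = SketchJob(project, "abcd")
```

In this sketch the project-level path manipulation runs once for all 1000 jobs instead of once per job, and editing the config in memory invalidates the cache so that jobs opened afterwards see the new directory. Jobs that already cached their own path keep the old value, which is a simplification of the sketch.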
[job for job in pr]; N=30,000
[job.sp() for job in pr]; N=30,000
Types of Changes
1: The change breaks (or has the potential to break) existing functionality.