Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: job/project initialization. #451

Merged
merged 48 commits into from
Jan 11, 2021
Merged

Conversation

bdice
Copy link
Member

@bdice bdice commented Dec 29, 2020

Description

This is a follow-up PR to #239, but it may be possible to adopt this set of changes separately from #239 if that PR stalls. I recommend that reviewers focus on #239 first and then handle this PR second.

This PR refactors the __init__ methods of the Job and Project classes so that less work is done during initialization, instead deferring (and often caching) that work in a property. This requires no API breaking changes, because the existing methods and properties were enough to handle this change.

One tricky part of this is that I added a "mutation hook" to the Project class. In 99% of users' workflows, the Project config (which contains information about the workspace directory name, project name, etc.) does not need to be modified once the project is instantiated. While in-memory config mutation is allowed, mutating the configuration of a Project instance has been deprecated since version 1.3. However, there is a significant performance penalty to allowing this mutation: every job that is opened must determine its workspace directory by first checking the project's workspace directory and joining its own id. That requires an access of the project configuration (and more importantly, various path manipulations) for every job, and it shows up quite obviously when profiling iteration over a large number of jobs. To solve this, I added a _mutate_hook method that will reset the values of any cached properties that are derived from the project configuration if the configuration is edited in memory.

I have verified that the existing tests should cover these changes, particularly the changes to the Project cached properties and config mutation hook.

Motivation and Context

This makes the "lazy state point loading" in #239 even lazier, and thus extremely fast for many common operations (e.g. in signac-flow).

Test (best of 3, seconds) master (426d2ed) feature/lazy-statepoint-loading (9a12405) feature/lazy-job-init (32fef57)
[job for job in pr]; N=30,000 1.014 0.471 (2.15x) 0.141 (7.19x)
[job.sp() for job in pr]; N=30,000 1.442 2.208 (0.653x) 1.381 (1.04x)

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

bdice and others added 30 commits October 31, 2019 18:55
Since the new code path does not cover that path anymore.
@bdice bdice added this to the v1.6.0 milestone Dec 29, 2020
@bdice bdice changed the base branch from master to feature/lazy-statepoint-loading December 29, 2020 09:21
signac/contrib/job.py Outdated Show resolved Hide resolved
signac/contrib/job.py Show resolved Hide resolved
@@ -1196,16 +1214,18 @@ def write_statepoints(self, statepoints=None, fn=None, indent=2):
with open(fn, "w") as file:
file.write(json.dumps(tmp, indent=indent))

def _register(self, job):
def _register(self, _id, statepoint):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this method to reduce overhead. The id and state point are already known in all the cases where project._register is called, but the previous implementation had to explicitly perform a conversion back from the synced state point to a plain dict. That accounted for some of the slowdown between master and feature/lazy-statepoint-loading in #239 when opening jobs by state point.

signac/contrib/job.py Show resolved Hide resolved
@bdice bdice requested review from cbkerr and klywang December 29, 2020 09:35
@bdice
Copy link
Member Author

bdice commented Dec 29, 2020

I'm tagging @klywang and @cbkerr as reviewers because they expressed specific interest in performance-related improvements.

@bdice bdice mentioned this pull request Dec 30, 2020
12 tasks
Base automatically changed from feature/lazy-statepoint-loading to master December 30, 2020 18:43
Copy link
Member

@cbkerr cbkerr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've studied the diffs (less the surrounding code) and my main concern is the naming issue. See my inline comments.

signac/contrib/project.py Outdated Show resolved Hide resolved
signac/contrib/project.py Show resolved Hide resolved
signac/contrib/project.py Outdated Show resolved Hide resolved
signac/contrib/project.py Show resolved Hide resolved
@cbkerr
Copy link
Member

cbkerr commented Dec 30, 2020

Will _mutate_hook be deprecated in version 2.0 when mutating the configuration of a Project will be deprecated? If so, should we add a flag to deprecate it then too?

@bdice
Copy link
Member Author

bdice commented Dec 30, 2020

Will _mutate_hook be deprecated in version 2.0 when mutating the configuration of a Project will be deprecated? If so, should we add a flag to deprecate it then too?

Not necessary - _ProjectConfig is a private class and _invalidate_config_cache is a private function. We don't have to worry about deprecating/changing private APIs because users should not rely on them. The existing deprecation warning shown when _ProjectConfig is mutated is sufficient to indicate that the behavior will change.

@bdice bdice mentioned this pull request Jan 7, 2021
12 tasks
@bdice
Copy link
Member Author

bdice commented Jan 11, 2021

@csadorf @tommy-waltmann @klywang I would appreciate a second review if you are able. @cbkerr Thank you for the review.

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍 Sorry for the delay.

@bdice bdice merged commit ae72fe1 into master Jan 11, 2021
@bdice bdice deleted the feature/lazy-job-init branch January 11, 2021 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refactor Code refactoring
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants