Optimization: job/project initialization. #451

bdice · 2020-12-29T09:08:22Z

Description

This is a follow-up PR to #239, but it may be possible to adopt this set of changes separately from #239 if that PR stalls. I recommend that reviewers focus on #239 first and then handle this PR second.

This PR refactors the __init__ methods of the Job and Project classes so that less work is done during initialization, instead deferring (and often caching) that work in a property. This requires no API breaking changes, because the existing methods and properties were enough to handle this change.
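The deferral pattern described above can be illustrated with a minimal sketch. `LazyJob`, its attributes, and the `wd_computations` counter are hypothetical names invented for this example; this is not the actual `Job` implementation in signac.

```python
import os


class LazyJob:
    """Hypothetical stand-in for a job; not signac's actual Job class."""

    def __init__(self, workspace, job_id):
        # Only store cheap references; no path manipulation happens here.
        self._workspace = workspace
        self._id = job_id
        self._wd = None  # computed on first access
        self.wd_computations = 0  # counter kept for illustration only

    @property
    def workspace(self):
        # Defer the path join until it is needed, then cache the result.
        if self._wd is None:
            self._wd = os.path.join(self._workspace, self._id)
            self.wd_computations += 1
        return self._wd


job = LazyJob("workspace", "abc123")
assert job.wd_computations == 0  # nothing was computed during __init__
job.workspace
job.workspace
assert job.wd_computations == 1  # computed once, then served from the cache
```

Because the public attribute stays a property, callers see the same interface whether the value is computed eagerly or lazily, which is why this kind of change need not break an API.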

One tricky part of this is that I added a "mutation hook" to the Project class. In 99% of users' workflows, the Project config (which contains information about the workspace directory name, project name, etc.) does not need to be modified once the project is instantiated. While in-memory config mutation is allowed, mutating the configuration of a Project instance has been deprecated since version 1.3. However, there is a significant performance penalty to allowing this mutation: every job that is opened must determine its workspace directory by first checking the project's workspace directory and joining its own id. That requires an access of the project configuration (and more importantly, various path manipulations) for every job, and it shows up quite obviously when profiling iteration over a large number of jobs. To solve this, I added a _mutate_hook method that will reset the values of any cached properties that are derived from the project configuration if the configuration is edited in memory.
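A rough sketch of the mutation-hook idea follows. Only the name `_mutate_hook` comes from this PR; `HookedConfig`, `SketchProject`, the `workspace_dir` key, and the hook wiring are assumptions made for illustration and may differ from the real implementation.

```python
import os


class HookedConfig(dict):
    """Hypothetical config dict that calls a hook whenever a key is set."""

    def __init__(self, *args, mutate_hook=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._hook = mutate_hook

    def __setitem__(self, key, value):
        # Note: dict.update() and setdefault() bypass __setitem__, so a
        # complete implementation would need to cover those methods too.
        super().__setitem__(key, value)
        if self._hook is not None:
            self._hook()


class SketchProject:
    """Hypothetical stand-in for a project; not signac's Project class."""

    def __init__(self, root, config):
        self._root = root
        self._wd = None
        self._config = HookedConfig(config, mutate_hook=self._mutate_hook)

    def _mutate_hook(self):
        # Reset every cached value that is derived from the configuration.
        self._wd = None

    @property
    def config(self):
        return self._config

    @property
    def workspace(self):
        # Path manipulation runs only when the cache is empty.
        if self._wd is None:
            self._wd = os.path.join(self._root, self._config["workspace_dir"])
        return self._wd


project = SketchProject("project_root", {"workspace_dir": "workspace"})
before = project.workspace
project.config["workspace_dir"] = "scratch"  # triggers the hook
after = project.workspace
```

In this sketch the common case (no mutation) pays for the path join once per project rather than once per job, while an in-memory edit still yields a correct value on the next access.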

I have verified that the existing tests cover these changes, particularly the changes to the Project cached properties and config mutation hook.

Motivation and Context

This makes the "lazy state point loading" in #239 even lazier, and thus extremely fast for many common operations (e.g. in signac-flow).

| Test (best of 3, seconds) | master (`426d2ed`) | feature/lazy-statepoint-loading (`9a12405`) | feature/lazy-job-init (`32fef57`) |
| --- | --- | --- | --- |
| `[job for job in pr]`; N=30,000 | 1.014 | 0.471 (2.15x) | 0.141 (7.19x) |
| `[job.sp() for job in pr]`; N=30,000 | 1.442 | 2.208 (0.653x) | 1.381 (1.04x) |
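The PR does not say how these timings were taken. One plausible way to get a "best of 3" number and the parenthesized ratios is sketched below; the use of `timeit`, the helper names `best_of` and `speedup`, and the stand-in workload are all assumptions.

```python
import timeit


def best_of(func, repeat=3):
    """Return the fastest wall-clock time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time; above 1 means faster."""
    return baseline_seconds / candidate_seconds


# Stand-in workload; in the PR the iterable was a project with N=30,000 jobs.
items = list(range(30_000))
elapsed = best_of(lambda: [item for item in items])

# Reproduce one ratio from the table: master vs. feature/lazy-job-init.
ratio = round(speedup(1.014, 0.141), 2)
```

Read this way, the 0.653x entry means that branch was slower than master for that test, not faster.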

Types of Changes

Documentation update
Bug fix
New feature
Breaking change¹

¹The change breaks (or has the potential to break) existing functionality.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
My code follows the code style guideline of this project.
The changes introduced by this pull request are covered by existing or newly introduced tests.

If necessary:

I have updated the API documentation as part of the package doc-strings.
I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

signac/contrib/job.py

bdice · 2020-12-29T09:27:13Z

signac/contrib/project.py

@@ -1196,16 +1214,18 @@ def write_statepoints(self, statepoints=None, fn=None, indent=2):
        with open(fn, "w") as file:
            file.write(json.dumps(tmp, indent=indent))

-    def _register(self, job):
+    def _register(self, _id, statepoint):


I changed this method to reduce overhead. The id and state point are already known in all the cases where project._register is called, but the previous implementation had to explicitly perform a conversion back from the synced state point to a plain dict. That accounted for some of the slowdown between master and feature/lazy-statepoint-loading in #239 when opening jobs by state point.
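The difference between the two signatures might look roughly like the sketch below. Only the signature `_register(self, _id, statepoint)` and the name `_sp_cache` appear in this PR; the class `SketchCache`, the method names, and the method bodies are guesses for illustration.

```python
class SketchCache:
    """Hypothetical stand-in for the project's state point cache."""

    def __init__(self):
        self._sp_cache = {}

    def register_from_job(self, job):
        # Old shape: takes the job object and converts its synced state
        # point back to a plain dict before caching it.
        self._sp_cache[job.id] = dict(job.statepoint)

    def register(self, _id, statepoint):
        # New shape: the caller already holds the id and a plain dict,
        # so no conversion is needed.
        self._sp_cache[_id] = statepoint


cache = SketchCache()
cache.register("abc123", {"a": 1})
```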

signac/contrib/job.py

bdice · 2020-12-29T09:35:26Z

I'm tagging @klywang and @cbkerr as reviewers because they expressed specific interest in performance-related improvements.

cbkerr

I've studied the diffs (less the surrounding code) and my main concern is the naming issue. See my inline comments.

signac/contrib/project.py

cbkerr · 2020-12-30T21:24:39Z

Will _mutate_hook be deprecated in version 2.0 when mutating the configuration of a Project will be deprecated? If so, should we add a flag to deprecate it then too?

bdice · 2020-12-30T21:29:15Z

> Will _mutate_hook be deprecated in version 2.0 when mutating the configuration of a Project will be deprecated? If so, should we add a flag to deprecate it then too?

Not necessary - _ProjectConfig is a private class and _invalidate_config_cache is a private function. We don't have to worry about deprecating/changing private APIs because users should not rely on them. The existing deprecation warning shown when _ProjectConfig is mutated is sufficient to indicate that the behavior will change.

bdice · 2021-01-11T03:40:29Z

@csadorf @tommy-waltmann @klywang I would appreciate a second review if you are able. @cbkerr Thank you for the review.

csadorf

Looks good! 👍 Sorry for the delay.

bdice and others added 30 commits October 31, 2019 18:55

Refactor uses of job._statepoint to job.statepoint.

8b14b15

Implement lazy loading of statepoints.

24d3b8c

Use cheaper condition for determining if the job id exists.

6c57f11

Fix flake8.

f4a31e8

Validate that the state point manifest hash matches the job id when loading the state point lazily.

737170a

Merge branch 'master' into feature/lazy-statepoint-loading

82683b7

Added changelog line.

a40c22a

Add explicit benchmark for iterating and loading state point.

51b6c26

Since the new code path does not cover that path anymore.

Merge branch 'master' into feature/lazy-statepoint-loading

9bcceaf

Use _sp_cache to cache list of known valid job ids.

c51b413

Merge branch 'feature/lazy-statepoint-loading' of https://github.com/glotzerlab/signac into feature/lazy-statepoint-loading

9824ec7

Deregister jobs from the statepoint cache when they are removed from the project.

67ba8da

Fix issues with registration/deregistration.

be044f0

Merge remote-tracking branch 'origin/master' into feature/lazy-statepoint-loading2

88c5369

Update comment.

5c4e284

Merge remote-tracking branch 'origin/master' into feature/lazy-statepoint-loading2

bf9108e

Remove deregistration logic.

52643e0

Add comment explaining behavior of Project._sp_cache.

6c5dcd2

Update docstrings and deprecated methods.

0e7d6c3

Update changelog.

10926f9

Revise implementation details and docstrings.

ce5ce6e

Update docstrings to use state point.

0b65502

Update comment.

206928c

Update _register method.

80acc1b

Use job.id instead of job._id and job_id instead of jobid (excluding cases that would change public APIs).

3f93f14

Use descriptive variable names and comments.

5304fff

Use descriptive variable names.

ebad980

Improve cache registration behavior and comments.

eed20a4

Remove extra blank lines.

76b5933

Merge branch 'master' into feature/lazy-statepoint-loading

baf36e5

bdice added this to the v1.6.0 milestone Dec 29, 2020

bdice changed the base branch from master to feature/lazy-statepoint-loading December 29, 2020 09:21

bdice commented Dec 29, 2020

Update comment.

e7a32dd

bdice requested review from cbkerr and klywang December 29, 2020 09:35

bdice added 3 commits December 29, 2020 13:43

Add test of error when opening without a state point or job id.

6494150

Merge branch 'master' into feature/lazy-statepoint-loading

347b99e

Merge remote-tracking branch 'origin/feature/lazy-statepoint-loading' into feature/lazy-job-init

9e875cb

bdice mentioned this pull request Dec 30, 2020

Lazy statepoint loading #239

Merged

12 tasks

Base automatically changed from feature/lazy-statepoint-loading to master December 30, 2020 18:43

bdice added 2 commits December 30, 2020 12:52

Merge remote-tracking branch 'origin/master' into feature/lazy-job-init

2c88537

Fix mistake in merge.

3a88e45

cbkerr requested changes Dec 30, 2020

bdice added 2 commits December 30, 2020 13:21

Rename _mutate_hook to _invalidate_config_cache.

99a4fd3

Return None instead of passing and implicitly returning None.

ee6a0aa

cbkerr reviewed Dec 30, 2020

cbkerr approved these changes Dec 30, 2020

bdice mentioned this pull request Dec 31, 2020

Proposal: Unify dict classes and improve buffering and synchronization #249

Closed

bdice added 2 commits January 2, 2021 14:25

Assume state point is valid if id is provided.

e3c608c

Merge remote-tracking branch 'origin/master' into feature/lazy-job-init

efcb817

bdice mentioned this pull request Jan 7, 2021

Fix: don't delete invalid statepoints #467

Merged

12 tasks

Merge branch 'master' into feature/lazy-job-init

7b07063

csadorf approved these changes Jan 11, 2021

bdice merged commit ae72fe1 into master Jan 11, 2021

bdice deleted the feature/lazy-job-init branch January 11, 2021 16:25
