Lazy statepoint loading #239
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #239      +/-   ##
==========================================
+ Coverage   76.98%   77.11%   +0.12%
==========================================
  Files          42       42
  Lines        5704     5732      +28
  Branches     1112     1117       +5
==========================================
+ Hits         4391     4420      +29
+ Misses       1029     1027       -2
- Partials      284      285       +1
```
Continue to review full report at Codecov.
There is a lot of refactoring that I think would be possible if this feature were accepted. Specifically, we have deprecated […]. If statepoints were loaded lazily, then it would be possible to remove much of the statepoint-reading logic from the […].
@glotzerlab/signac-maintainers I would like some conceptual feedback on this PR's stated benefits and costs. The code is not necessarily "ready for review", since I don't intend to merge it until the concepts are improved and other possible solutions are discussed. I think that something like this lazy-load-by-id approach, or an automatically/aggressively updated statepoint cache, should be considered to improve performance.
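To make the lazy-load-by-id idea concrete, here is a minimal usage sketch. This assumes the behavior proposed in this PR, and the id below is simply the example id that appears later in this thread:

```python
import signac

project = signac.get_project()

# Opening a job handle by id alone: with lazy loading, this should not need
# to read the statepoint manifest from the workspace yet.
job = project.open_job(id="9bfd29df07674bc4aa960cf661b5acd2")

# The manifest is only read (or the project-level cache consulted) on first
# access of the state point.
print(job.sp())
```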
Before we delve more deeply into any conceptual discussion or possibly even approaches for implementation, I think it would be crucial to identify the exact use cases and bottlenecks that we want to address. Looking at the benchmarks, it appears that this PR is already improving performance significantly, albeit not by an order of magnitude. Is the use case that this PR is supposed to speed up already covered by our benchmarks?
@csadorf The benchmarks […], since the new code path does not cover that path anymore.
According to the benchmarks, it appears that loading each state point is somewhat slower (1.5-2x). @bdice Maybe you can do a little bit of profiling? I'm willing to move forward regardless, but I think we should take some care not to slow down signac in a very relevant use case.
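One straightforward way to do that kind of profiling (a generic sketch, not the benchmark script used in this repository) is to run a statepoint-fetching loop under cProfile:

```python
import cProfile
import pstats

import signac

project = signac.get_project()

def fetch_statepoints():
    # The use case under discussion: iterate over all jobs and read each
    # state point from disk or the cache.
    for job in project:
        job.sp()

cProfile.run("fetch_statepoints()", "statepoint_profile")
pstats.Stats("statepoint_profile").sort_stats("cumulative").print_stats(20)
```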
I have done some profiling. The primary difference in cost is an error check that is needed in the lazy code but not in the original code path. Suppose we try to open a job by an id like […] (signac/signac/contrib/project.py, line 366 in cfc41de). The path to the statepoint does not exist, raising a KeyError (signac/signac/contrib/project.py, line 721 in cfc41de). The lazy branch does not load the statepoint in […] (signac/signac/contrib/project.py, lines 369 to 371 in 51b6c26). So essentially this means that the lazy branch has to check whether a job workspace exists AND load the statepoint, while the old code only had to load (and potentially fail to load) the statepoint. Another small improvement that could be made someday in the future (not in this PR): […]
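To illustrate the extra cost described above, here is a simplified sketch of the difference between the two paths. The helper functions are hypothetical; the real logic lives in signac/contrib/project.py:

```python
import json
import os

def load_statepoint_eager(workspace, job_id):
    # Old path: just try to read the manifest; a nonexistent job surfaces as
    # a failed open of the manifest file.
    fn = os.path.join(workspace, job_id, "signac_statepoint.json")
    with open(fn) as fp:
        return json.load(fp)

def load_statepoint_lazy(workspace, job_id):
    # Lazy path (simplified): the job handle is created without reading the
    # manifest, so a separate existence check is needed in addition to the
    # eventual manifest read.
    job_dir = os.path.join(workspace, job_id)
    if not os.path.isdir(job_dir):
        raise KeyError(job_id)
    with open(os.path.join(job_dir, "signac_statepoint.json")) as fp:
        return json.load(fp)
```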
I'd like to merge #243 prior to this PR.
@bdice Please rebase on/merge master so that we can run the latest benchmark script with this branch.
Another thing I just noticed -- the master branch doesn't ensure that statepoints and ids correspond (that is, there is no corruption check) when opening jobs. I thought this was the expected behavior, so I made sure that this check was performed when loading lazily. This has performance implications (around 5% of runtime for a simple statepoint-fetching loop, […]). On master, the script below never triggers a […]:

```python
import signac
import json

pr = signac.init_project('test')

# Clear all jobs
for job in pr:
    job.remove()

job = pr.open_job({'a': 0}).init()

# Corrupt the job statepoint/id mapping
with open(job.fn('signac_statepoint.json'), 'w') as sp:
    json.dump({'a': 1}, sp)

print('Existing job handle:', job._id, job.sp())
id_, sp_ = job._id, job.sp()
del job

# Re-initialize the project so that we skip the project._sp_cache
del pr
pr = signac.init_project('test')

job2 = pr.open_job(id=id_)
print('New job handle by id:', job2._id, job2.sp())

job3 = pr.open_job(sp_)
print('New job handle by sp:', job3._id, job3.sp())
```

master output:

```
Existing job handle after corruption: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
New job handle by id: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 1}
New job handle by sp: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
```

lazy output:

```
Existing job handle: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
Traceback (most recent call last):
  File "/home/bdice/code/signac/signac/contrib/job.py", line 380, in _check_manifest
    assert calc_id(manifest) == self._id
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 24, in <module>
    print('New job handle by id:', job2._id, job2.sp())
  File "/home/bdice/code/signac/signac/contrib/job.py", line 243, in sp
    return self.statepoint
  File "/home/bdice/code/signac/signac/contrib/job.py", line 231, in statepoint
    statepoint = self._check_manifest()
  File "/home/bdice/code/signac/signac/contrib/job.py", line 382, in _check_manifest
    raise JobsCorruptedError([self._id])
signac.contrib.errors.JobsCorruptedError: ['9bfd29df07674bc4aa960cf661b5acd2']
```
I see, that would explain the slowdown. Do you think that a ~3x slowdown is justified for that increased security?
signac/contrib/project.py
```diff
     statepoint = self._get_statepoint_from_workspace(job_id)
     # Update the project's state point cache from this cache miss
     self._sp_cache[job_id] = statepoint
 except KeyError as error:
     # Fall back to a file containing all state points because the state
     # point could not be read from the job workspace.
     try:
-        statepoint = self.read_statepoints(fn=fn)[job_id]
+        statepoints = self.read_statepoints(fn=fn)
+        # Update the project's state point cache
+        self._sp_cache.update(statepoints)
+        statepoint = statepoints[job_id]
```
A more aggressive solution would be to update the internal state point cache inside the methods _get_statepoint_from_workspace and read_statepoints. I tested this "always read to cache" behavior and all of our existing tests pass. I think this would be valid: in every case where we currently call those methods, we manually update the cache with the results. Moving that "always read to cache" logic into the implementation of those methods makes sense to me, but I would like a second opinion. That can also be done in a follow-up PR.
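For illustration, here is a self-contained sketch of that "always read to cache" pattern. This is not the actual Project implementation; the class name, file layout, and method bodies are simplified stand-ins:

```python
import json
import os

class CachingStatepointReader:
    """Hypothetical illustration of updating the cache inside the readers."""

    def __init__(self, workspace):
        self._workspace = workspace
        self._sp_cache = {}

    def _get_statepoint_from_workspace(self, job_id):
        # Every successful read from a job workspace also populates the cache,
        # so callers no longer have to update it manually.
        fn_manifest = os.path.join(self._workspace, job_id, "signac_statepoint.json")
        with open(fn_manifest) as manifest:
            statepoint = json.load(manifest)
        self._sp_cache[job_id] = statepoint
        return statepoint

    def read_statepoints(self, fn):
        # Likewise, reading a bulk state point file refreshes the whole cache.
        with open(fn) as fp:
            statepoints = json.load(fp)
        self._sp_cache.update(statepoints)
        return statepoints
```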
Since this PR seems almost done, my hunch would be to leave the aggressive solution for a follow-up PR. But I'm still understanding this PR and can comment better when I finish my review.
I imagine this would be something to implement in a different PR, but why doesn't […]?
@klywang The first comment was a misunderstanding on my part about the expectations of the data model. The best explanation is here from Simon: […]

This explains the […] behavior. In an earlier form of this PR, I tried to out-smart that by trying to know whether job ids were "in the project" without actually checking the disk (by checking if the job id is in the state point cache). However, this was a bad idea for guaranteeing consistency and removed some important guarantees of the current design. Other processes working on the same data space could easily invalidate the cache, so we must resort to checking the disk to know which jobs are and aren't in the project.
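A small sketch of that consistency problem, with hypothetical names and paths, just to illustrate why the in-memory cache alone cannot answer membership questions:

```python
import os

# Process A cached this mapping earlier...
sp_cache = {"9bfd29df07674bc4aa960cf661b5acd2": {"a": 0}}

# ...but another process may have removed or added job directories since
# then, so the cache can be stale. The only reliable membership test is to
# check the workspace on disk.
def job_in_project(workspace, job_id):
    return os.path.isdir(os.path.join(workspace, job_id))
```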
Thank you for the clarification!
Thanks for tagging me @bdice; I’ve learned a lot by starting to review this!
Here are some initial questions.
I’ve numbered my comments so you can respond inline easily.
In contrib/job.py
- Is there anything stopping the initialization of a corrupted Job instance by providing a statepoint and _id such that calc_id(statepoint) != _id? The only related check I see for this is in the project.open_job() method:

  ```python
  if (statepoint is None) == (id is None):
      raise ValueError("Either statepoint or id must be provided, but not both.")
  ```

  But still, one of the lines calls Job() with both statepoint and _id arguments.

- Possible idea to try to fix this, replacing the line self._id = calc_id(self.statepoint()) if _id is None else _id:

  ```python
  # Set id.
  if _id is not None:
      if statepoint is None:
          # The id must have been provided
          self._id = _id
      else:
          # Make sure the provided _id and statepoint are consistent
          if calc_id(statepoint) != _id:
              raise JobsCorruptedError([_id])  # or maybe ValueError
          else:
              self._id = _id
  else:
      # The id is computed from the state point if not provided.
      self._id = calc_id(self.statepoint())
  ```
- I see that self.statepoint() calls _check_manifest(), which looks at self.id… but what is self.id before it is set when the Job is instanced? Is this a problem when initializing the Job?

- In the code self._id = calc_id(self.statepoint()) if _id is None else _id, do we use the getter method rather than the provided statepoint argument to the __init__ method in case the job has been initialized before? It seems like an extra system call, because the getter method calls job._check_manifest().

- What’s the difference between the errors raised in job._read_manifest()? In the docstring, both say the same thing: “If an error occurs while reading/parsing the state point manifest.” The solution here might be to not raise a JobsCorruptedError when catching a ValueError?

- The docstring describing the JobsCorruptedError in _check_manifest sounds like a combination of OSError and JobsCorruptedError: “If the manifest hash is not equal to the job id, or if an error occurs while reading/parsing the state point manifest.” Something is inconsistent between this bullet and the previous bullet. Maybe the docstring for _check_manifest should just say “If the manifest hash is not equal to the job id.”, because _check_manifest() calls _read_manifest(), which would raise the OSError.
In contrib/project.py
- This is slightly related to the confusion over opening a job by an id not in the project. Is the following true?: If you open a job by an id that doesn’t match any existing job using fail_job = project.open_job('f'*32), the only error will be raised when you try to access fail_job.statepoint() (because it calls _check_manifest()), which would be after opening the job? I think we need to document this better or have it fail earlier.
Thanks for the review, @cbkerr! Responses inline below.
We guard against issues like this in a few ways: firstly, users are not supposed to call the […].
There's a bit of nuance here: if opening by state point, we must go through the round-trip of encoding/decoding the state point before we can calculate the id. The argument to […]. We choose to trust the caller of the Job constructor that […].
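As a hedged illustration of why that round-trip matters (this is plain JSON behavior, not signac-specific code): encoding and decoding can change Python types, so the id must be computed from the decoded form that will actually end up in the manifest on disk.

```python
import json

statepoint = {"a": (1, 2), 3: "b"}             # tuple value, integer key
decoded = json.loads(json.dumps(statepoint))   # what the manifest would hold
print(decoded)                                 # {'a': [1, 2], '3': 'b'}

# Hashing `statepoint` directly could therefore disagree with hashing
# `decoded`, which is why the id is calculated only after the round-trip.
```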
This proposed case is not reachable in the code (but it may be possible to improve the code). The line […]. Would this be clearer?

```python
if statepoint is None and _id is None:
    raise ValueError("Either statepoint or _id must be provided.")
elif statepoint is not None:
    # Set state point if provided
    self._statepoint = SyncedAttrDict(statepoint, parent=_sp_save_hook(self))
    # Set id. The id is computed from the state point if not provided.
    self._id = calc_id(self.statepoint()) if _id is None else _id
else:
    # State point will be loaded lazily
    self._statepoint = None
    self._id = _id
```
Ah, this is confusing. These errors are triggered by different causes, but I couldn't figure out how to explain the difference concisely. The error-raising behavior should be kept the same as before (so it's not a breaking change), but the documentation could be improved. Here's an explanation of potential causes: […]
Edited. You're correct that this should only say "If the manifest hash is not equal to the job id."
No, that's not a problem in the current implementation. If you open a job by an (invalid) id, it will check […].
Great! 👍
Thank you for your detailed answers @bdice!
After addressing these small comment/docs things, it looks good to me!
- New doc in the Jobs class looks good.

- Ah yes, and that conditional structure you suggest is more clear! The new structure can also address @csadorf’s comment by moving the line self._project._register(self) inside the proper condition. EDIT: I see that these changes are part of Optimization: job/project initialization. #451

- Let’s add a comment at the line I was confused about (self._id = calc_id(self.statepoint()) if _id is None else _id) to remind us of this. Maybe something like:

  # Simply returns the (properly JSON encoded) statepoint if provided already (no _check_manifest)

- Thanks for the explanation. To fix the documentation of the kinds of errors, I think we just need to separate reading and parsing, like (see the docstring sketch after this list):

  JobsCorruptedError: If an error occurs while parsing the state point manifest.
  OSError: If an error occurs while reading the state point manifest.

  It would also help to have a comment to remind that JSONDecodeError is a subclass of ValueError at the line where we catch it.
project.py

- Got it, so basically, if the user manually (outside of signac) creates a directory 'f'*32 and then opens the job 'f'*32, they’ve created their own trouble.
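Here is a sketch of how that separation of reading and parsing errors could read in a numpydoc-style docstring. It assumes the _read_manifest method discussed above, with the wording taken from the suggestion in this thread:

```python
def _read_manifest(self):
    """Read and parse the state point manifest file.

    Raises
    ------
    OSError
        If an error occurs while reading the state point manifest.
    JobsCorruptedError
        If an error occurs while parsing the state point manifest.
    """
```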
Description
Changes behavior of Job to load its statepoint lazily when opened by id.

Motivation and Context
Implementation of #238.
Types of Changes
The change breaks (or has the potential to break) existing functionality.