
Lazy statepoint loading #239

Merged: 37 commits from feature/lazy-statepoint-loading into master on Dec 30, 2020
Conversation

@bdice (Member) commented Oct 31, 2019

Description

Changes behavior of Job to load its statepoint lazily, when opened by id.

Motivation and Context

Implementation of #238.

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@codecov codecov bot commented Oct 31, 2019

Codecov Report

Merging #239 (ecdc7dc) into master (c1cf74e) will increase coverage by 0.12%.
The diff coverage is 82.53%.


@@            Coverage Diff             @@
##           master     #239      +/-   ##
==========================================
+ Coverage   76.98%   77.11%   +0.12%     
==========================================
  Files          42       42              
  Lines        5704     5732      +28     
  Branches     1112     1117       +5     
==========================================
+ Hits         4391     4420      +29     
+ Misses       1029     1027       -2     
- Partials      284      285       +1     
Impacted Files Coverage Δ
signac/contrib/project.py 86.77% <77.41%> (-0.02%) ⬇️
signac/contrib/job.py 91.14% <87.50%> (+1.54%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bdice (Member, Author) commented Nov 1, 2019

There is a lot of refactoring that I think would be possible if this feature were accepted. Specifically, we have deprecated Project.get_statepoint(self, jobid, fn=None) for v1.3 (to be removed in v2.0) but we are still actively using it in Project.open_job().

If statepoints were loaded lazily, then it would be possible to remove much of the statepoint-reading logic from the Project class and leave more of that to the Job, which could be opened by id alone. Project-level caching of statepoints is valuable, and this would only apply to cases that miss the cache. For those cases, we currently have methods like Project._get_statepoint_from_workspace(self, jobid). I think this would allow for some nice simplification of the core of the Project / Job codebase while potentially improving performance in other places where lazy statepoint loads would benefit the internals.
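To make the idea concrete, here is a minimal sketch of the lazy pattern, assuming signac's manifest filename; the class and attribute names are illustrative, not the PR's actual code:

```python
import json
import os


class LazyJob:
    """Illustrative job handle that defers loading its state point."""

    def __init__(self, workspace, _id, statepoint=None):
        self._workspace = workspace
        self._id = _id
        self._statepoint = statepoint  # None means "not loaded yet"

    @property
    def statepoint(self):
        if self._statepoint is None:
            # First access: read the manifest from the job's workspace directory.
            fn = os.path.join(self._workspace, self._id, "signac_statepoint.json")
            with open(fn) as f:
                self._statepoint = json.load(f)
        return self._statepoint
```

A job opened by id alone would then never touch the disk until its state point is actually needed.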

@bdice (Member, Author) commented Nov 16, 2019

@glotzerlab/signac-maintainers I would like some conceptual feedback on this PR's stated benefits/costs, though the code is not necessarily "ready for review" since I don't intend to merge it until the concepts are improved and other possible solutions are discussed. I think that something like this lazy-load-by-id, or an automatically/aggressively updated statepoint cache, should be considered to improve performance.

@csadorf (Contributor) commented Nov 18, 2019

> @glotzerlab/signac-maintainers I would like some conceptual feedback on this PR's stated benefits/costs, though the code is not necessarily "ready for review" since I don't intend to merge it until the concepts are improved and other possible solutions are discussed. I think that something like this lazy-load-by-id, or an automatically/aggressively updated statepoint cache, should be considered to improve performance.

Before we delve more deeply into any conceptual discussion, or possibly even into approaches for implementation, I think it would be crucial to identify the exact use cases and bottlenecks that we want to address. Looking at the benchmarks, it appears that this PR already improves performance significantly, albeit not by an order of magnitude. Is the use case that this PR is supposed to speed up already covered by our benchmarks?

@bdice (Member, Author) commented Nov 18, 2019

@csadorf The benchmarks `iterate` and `iterate_single_pass` are using the new code path, since `list(project)` opens jobs by id. This may be sufficiently covered by the benchmarks as-is, unless you wanted something else.
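For reference, the access pattern those benchmarks exercise looks roughly like this (a usage sketch; with lazy loading, no manifest file is read until the state point is touched):

```python
import signac

project = signac.get_project()
for job in project:  # each job is opened by id
    _ = job.id       # cheap: no state point file is read
    _ = job.sp()     # first access triggers the manifest read
```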

@bdice bdice marked this pull request as ready for review November 18, 2019 19:19
@bdice bdice requested review from csadorf and a team as code owners November 18, 2019 19:19
@ghost ghost removed their request for review November 18, 2019 19:19
@bdice bdice self-assigned this Nov 18, 2019
@bdice bdice changed the title [Experimental]: Lazy statepoint loading Lazy statepoint loading Nov 18, 2019
@bdice bdice added the enhancement New feature or request label Nov 18, 2019
@csadorf (Contributor) commented Nov 18, 2019

According to the benchmarks, it appears that loading each state point is somewhat slower (1.5-2x). @bdice Maybe you can do a little bit of profiling? I'm willing to move forward regardless, but I think we should take some care to not slow down signac in a very relevant use case.

@bdice (Member, Author) commented Nov 18, 2019

@csadorf

Attachment: profiles.zip

I have done some profiling. The primary difference in cost is an error check that is needed in the lazy code but not the original code path.

Suppose we try to open a job id like 'ffffffffffffffffffffffffffffffff' that does not exist in the workspace. The current master branch fails when initializing the job by statepoint, in the call to self.get_statepoint(id):

```python
return self.Job(project=self, statepoint=self.get_statepoint(id), _id=id)
```

The path to the statepoint does not exist, so this raises a KeyError:

```python
raise KeyError(jobid)
```

The lazy branch does not load the statepoint in Job.__init__, so we must check whether the job id exists in the workspace directory to know if we're fetching an already-existing job (the job must already exist when opening by id):

```python
elif not self._contains_job_id(id):
    raise KeyError(id)
return self.Job(project=self, statepoint=None, _id=id)
```

So essentially the lazy branch has to check whether a job workspace exists AND load the statepoint, while the old code only had to load (and potentially fail to load) the statepoint.
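For context, the existence check conceptually amounts to a single directory test, along these lines (a sketch; the real Project._contains_job_id may differ in its details):

```python
import os


def _contains_job_id(workspace, job_id):
    """Return True if a workspace directory exists for this job id."""
    return os.path.isdir(os.path.join(workspace, job_id))
```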

Another small improvement that could be made someday in the future (not in this PR):
project._wd and project._rd should be cached and only updated if project.config is modified. Getting the project workspace directory accounts for ~13% of the runtime in both the old and new code paths. In contrast to the suggestions in #81, it would be ideal for project.config to be immutable, which would make this easier.
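A sketch of that proposed caching, with illustrative attribute names rather than signac's actual internals:

```python
class CachedWorkspace:
    """Cache the workspace path; recompute only when the config changes."""

    def __init__(self, config):
        self._config = config
        self._wd = None  # cached workspace directory

    @property
    def workspace_dir(self):
        if self._wd is None:
            self._wd = self._config["workspace_dir"]
        return self._wd

    def replace_config(self, config):
        # Any config change invalidates the cached path.
        self._config = config
        self._wd = None
```

With an immutable config, the invalidation step disappears entirely, which is the appeal of the suggestion above.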

@csadorf csadorf added the blocked Dependent on something else label Nov 19, 2019
@csadorf (Contributor) commented Nov 19, 2019

I'd like to merge #243 prior to this PR.

@csadorf (Contributor) commented Nov 19, 2019

@bdice Please rebase on/merge master so that we can run the latest benchmark script with this branch.

@csadorf csadorf removed the blocked Dependent on something else label Nov 19, 2019
@bdice (Member, Author) commented Nov 19, 2019

Another thing I just noticed -- the master branch doesn't ensure that statepoints and ids correspond (that is, there is no corruption check) when opening jobs. I thought this was the expected behavior, so I made sure that this check was performed when loading lazily. This has performance implications (around 5% of runtime for a simple statepoint fetching loop, [job.sp() for job in project]), which also partly explains the slower speed of the lazy branch.

On master, this never triggers a JobsCorruptedError:

```python
import signac
import json

pr = signac.init_project('test')

# Clear all jobs
for job in pr:
    job.remove()

job = pr.open_job({'a': 0}).init()

# Corrupt the job statepoint/id mapping
with open(job.fn('signac_statepoint.json'), 'w') as sp:
    json.dump({'a': 1}, sp)

print('Existing job handle:', job._id, job.sp())
id_, sp_ = job._id, job.sp()
del job

# Re-initialize the project so that we skip the project._sp_cache
del pr
pr = signac.init_project('test')

job2 = pr.open_job(id=id_)
print('New job handle by id:', job2._id, job2.sp())

job3 = pr.open_job(sp_)
print('New job handle by sp:', job3._id, job3.sp())
```

master output:

```
Existing job handle after corruption: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
New job handle by id: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 1}
New job handle by sp: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
```

lazy output:

```
Existing job handle: 9bfd29df07674bc4aa960cf661b5acd2 {'a': 0}
Traceback (most recent call last):
  File "/home/bdice/code/signac/signac/contrib/job.py", line 380, in _check_manifest
    assert calc_id(manifest) == self._id
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 24, in <module>
    print('New job handle by id:', job2._id, job2.sp())
  File "/home/bdice/code/signac/signac/contrib/job.py", line 243, in sp
    return self.statepoint
  File "/home/bdice/code/signac/signac/contrib/job.py", line 231, in statepoint
    statepoint = self._check_manifest()
  File "/home/bdice/code/signac/signac/contrib/job.py", line 382, in _check_manifest
    raise JobsCorruptedError([self._id])
signac.contrib.errors.JobsCorruptedError: ['9bfd29df07674bc4aa960cf661b5acd2']
```
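Reconstructed from the traceback, the lazy-branch validation is roughly the following (a sketch based on the error output, not the verbatim source; calc_id and JobsCorruptedError are signac internals):

```python
def _check_manifest(self):
    """Read the state point manifest and verify its hash matches the job id."""
    statepoint = self._read_manifest()
    if calc_id(statepoint) != self._id:
        # The workspace directory name no longer matches the hash of its manifest.
        raise JobsCorruptedError([self._id])
    return statepoint
```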

@csadorf (Contributor) commented Nov 19, 2019

I see, that would explain the slowdown. Do you think that a ~3x slowdown is justified for that increased security?

Comment on lines 1280 to 1290:

```diff
     statepoint = self._get_statepoint_from_workspace(job_id)
     # Update the project's state point cache from this cache miss
     self._sp_cache[job_id] = statepoint
 except KeyError as error:
     # Fall back to a file containing all state points because the state
     # point could not be read from the job workspace.
     try:
-        statepoint = self.read_statepoints(fn=fn)[job_id]
+        statepoints = self.read_statepoints(fn=fn)
+        # Update the project's state point cache
+        self._sp_cache.update(statepoints)
+        statepoint = statepoints[job_id]
```
@bdice (Member, Author) commented Dec 29, 2020

A more aggressive solution would be to update the internal state point cache inside the methods _get_statepoint_from_workspace and read_statepoints. I tested this "always read to cache" behavior and all of our existing tests pass. I think this would be valid: in every case where we currently call those methods, we manually update the cache with the results. Moving that "always read to cache" logic into the implementation of those methods makes sense to me, but I would like a second opinion. That can also be done in a follow-up PR.
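A sketch of the "always read to cache" idea; the helper name _read_job_manifest is hypothetical and stands in for the actual read logic:

```python
def _get_statepoint_from_workspace(self, job_id):
    """Read a state point from the workspace and register it in the cache."""
    statepoint = self._read_job_manifest(job_id)  # hypothetical helper
    # The cache update moves inside the method instead of every call site.
    self._sp_cache[job_id] = statepoint
    return statepoint
```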

A maintainer (Member) replied:

Since this PR seems almost done, my hunch would be to leave the aggressive solution for a follow-up PR. But I'm still understanding this PR and can comment better when I finish my review.

@bdice (Member, Author) commented Dec 29, 2020

@csadorf @klywang @cbkerr This PR is ready for review. I will update the PR description with an overview of the design and some representative benchmarks. I'm tagging Kelly and Corwin as reviewers because they expressed specific interest in performance-related improvements.

@klywang (Contributor) commented Dec 29, 2020

> If a job is already in the statepoint cache and is removed from the project with job.remove(), it remains in the statepoint cache, and it is still possible to open the job by statepoint or id. This bug also occurs in the master branch, so I don't think this branch makes anything worse.

I imagine this would be something to implement in a different PR, but why doesn't job.remove() clear the cache after removing the job?

@bdice (Member, Author) commented Dec 29, 2020

> If a job is already in the statepoint cache and is removed from the project with job.remove(), it remains in the statepoint cache, and it is still possible to open the job by statepoint or id. This bug also occurs in the master branch, so I don't think this branch makes anything worse.
>
> I imagine this would be something to implement in a different PR, but why doesn't job.remove() clear the cache after removing the job?

@klywang The first comment was a misunderstanding on my part about the expectations of the data model. The best explanation is here from Simon:

> No, it's fine to be able to open by id, even if there is no job in the project if the state point happens to be in some lookup table like the cache. It's not ok to open some arbitrary job by id where we wouldn't know the state point.

This explains the job.remove() behavior, which should not remove items from the cache. Once a job is registered, the state point cache will track the id: statepoint mapping in memory until the end of the session. This means that a job can be initialized, removed, and re-opened by id. The project's state point cache is simply a lookup table of id: statepoint and does not imply that all job ids in the cache actually exist on disk. The way we determine `job in project` is to check for the existence of a directory with the job's id.

In an earlier form of this PR, I tried to out-smart that by trying to know if job ids were "in the project" without actually checking the disk (by checking if the job id is in the state point cache). However, this was a bad idea for guaranteeing consistency and removed some important guarantees of the current design. Other processes working on the same data space could easily invalidate the cache, so we must resort to checking the disk for knowing what jobs are/aren't in the project.
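In code terms, the membership semantics described above come down to something like this (a conceptual sketch, not the verbatim Project method):

```python
def __contains__(self, job):
    # Membership is defined by the existence of the job's workspace
    # directory on disk, never by presence in the state point cache.
    return self._contains_job_id(job.id)
```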

@klywang (Contributor) commented Dec 29, 2020

> @klywang The first comment was a misunderstanding on my part about the expectations of the data model. The best explanation is here from Simon:
>
> > No, it's fine to be able to open by id, even if there is no job in the project if the state point happens to be in some lookup table like the cache. It's not ok to open some arbitrary job by id where we wouldn't know the state point.
>
> This explains the job.remove() behavior, which should not remove items from the cache. Once a job is registered, the state point cache will track the id: statepoint mapping in memory until the end of the session. This means that a job can be initialized, removed, and re-opened by id. The project's state point cache is simply a lookup table of id: statepoint and does not imply that all job ids in the cache actually exist on disk. The way we determine `job in project` is to check for the existence of a directory with the job's id.

Thank you for the clarification!

@bdice bdice requested a review from klywang December 29, 2020 19:53
@bdice bdice self-assigned this Dec 29, 2020
@bdice bdice added this to the v1.6.0 milestone Dec 29, 2020
@bdice bdice removed the blocked Dependent on something else label Dec 29, 2020
@cbkerr (Member) left a comment

Thanks for tagging me @bdice; I’ve learned a lot by starting to review this!

Here are some initial questions.

I’ve numbered my comments so you can respond inline easily.

In contrib/job.py

  1. Is there anything stopping the initialization of a corrupted Job instance by providing a statepoint and _id such that calc_id(statepoint) != _id? The only related check I see for this is in the project.open_job() method:

     ```python
     if (statepoint is None) == (id is None):
         raise ValueError("Either statepoint or id must be provided, but not both.")
     ```

     But still one of the lines calls Job() with both statepoint and _id arguments.

     • Possible idea to try to fix this, to replace the line self._id = calc_id(self.statepoint()) if _id is None else _id:

     ```python
     # Set id.
     if _id is not None:
         if statepoint is None:
             # The id must have been provided.
             self._id = _id
         else:
             # Make sure the provided _id and statepoint are consistent.
             if calc_id(statepoint) != _id:
                 raise JobsCorruptedError([_id])  # or maybe ValueError
             else:
                 self._id = _id
     else:
         # The id is computed from the state point if not provided.
         self._id = calc_id(self.statepoint())
     ```
  2. I see that self.statepoint() calls _check_manifest(), which looks at self.id…but what is self.id before it is set when the Job is instanced? Is this a problem when initializing the Job?

  3. In the code self._id = calc_id(self.statepoint()), do we use the getter method rather than the provided statepoint argument to the __init__ method in case the job has been initialized before? It seems like an extra system call because the getter method calls job._check_manifest().

  4. What’s the difference between the errors raised in job._read_manifest()? In the docstring, both say the same thing: “If an error occurs while reading/parsing the state point manifest.” The solution here might be to not raise a JobsCorruptedError when catching a ValueError?

  5. The docstring describing the JobsCorruptedError in _check_manifest sounds like a combination of OSError and JobsCorruptedError:

     > If the manifest hash is not equal to the job id, or if an error occurs while reading/parsing the state point manifest.

     Something is inconsistent between this bullet and the previous one. Maybe the docstring for _check_manifest should just say “If the manifest hash is not equal to the job id.” because _check_manifest() calls _read_manifest(), which would raise the OSError.

In contrib/project.py

  1. This is slightly related to the confusion over opening a job by an id not in the project. Is the following true?: If you open a job by an id that doesn’t match any existing job using

     ```python
     fail_job = project.open_job(id='f' * 32)
     ```

     the only error will be when you try to access fail_job.statepoint() (because it calls _check_manifest()), which would be after opening the job? I think we need to document this better or have it fail earlier.

@bdice (Member, Author) commented Dec 30, 2020

Thanks for the review, @cbkerr! Responses inline below.

> 1. Is there anything stopping the initialization of a corrupted Job instance by providing a `statepoint` and `_id` such that `calc_id(statepoint) != _id`? The only related check I see for this is in the `project.open_job()` method:
>
>    ```python
>    if (statepoint is None) == (id is None):
>        raise ValueError("Either statepoint or id must be provided, but not both.")
>    ```

We guard against issues like this in a few ways. Firstly, users are not supposed to call the Job constructor directly; they are expected to get jobs through open_job or by iterating/searching over a project. The only case in open_job where the Job is directly constructed with both id and statepoint is when opening by id and finding a state point in the project's state point cache. Here, we do assume that the state point cache is a valid source of truth, because the state point cache data always comes from other presumed-valid sources of truth, such as a cache on disk or an instantiated Job that has registered its id and state point with the project (once it has been lazily loaded and validated).

>    But still one of the lines calls `Job()` with both `statepoint` and `_id` arguments.
>
>    * Possible idea to try to fix this, to replace the line `self._id = calc_id(self.statepoint()) if _id is None else _id`:
>
>      ```python
>      # Set id.
>      if _id is not None:
>          if statepoint is None:
>              # The id must have been provided.
>              self._id = _id
>          else:
>              # Make sure the provided _id and statepoint are consistent.
>              if calc_id(statepoint) != _id:
>                  raise JobsCorruptedError([_id])  # or maybe ValueError
>              else:
>                  self._id = _id
>      else:
>          # The id is computed from the state point if not provided.
>          self._id = calc_id(self.statepoint())
>      ```

There's a bit of nuance here: if opening by state point, we must go through the round-trip of encoding/decoding the state point before we can calculate the id. The argument to calc_id must be JSON-encodable using the default encoder, while the provided statepoint argument may need to be altered (e.g. converting NumPy arrays to lists) by the custom JSON encoder class used by SyncedAttrDict and related classes.
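To illustrate the round-trip, here CustomJSONEncoder is a stand-in for the encoder used by SyncedAttrDict, not its actual name:

```python
import json

import numpy


class CustomJSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, numpy.ndarray):
            return o.tolist()  # convert NumPy arrays to plain lists
        return super().default(o)


raw_sp = {"a": numpy.array([1, 2, 3])}
# json.dumps(raw_sp) with the default encoder would raise a TypeError here.
encoded = json.dumps(raw_sp, cls=CustomJSONEncoder)
plain_sp = json.loads(encoded)  # {'a': [1, 2, 3]}: safe input for calc_id
```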

We choose to trust the caller of the Job constructor that calc_id(statepoint) == id, because we can internally guarantee that in open_job and that's the only place the constructor is called. There is already a note in the docs of the Job class to suggest this and I will expand that documentation to address this specifically.

> 2. I see that `self.statepoint()` calls `_check_manifest()`, which looks at `self.id`…but what is `self.id` before it is set when the Job is instanced? Is this a problem when initializing the Job?

This proposed case is not reachable in the code (but it may be possible to improve the code). The line self._id = calc_id(self.statepoint()) if _id is None else _id only triggers self.statepoint() if _id is None. However, because of the first check requiring id-or-statepoint, _id is None is only possible if a state point is provided. Thus self.statepoint does not execute _check_manifest and simply returns self._statepoint.

Would this be clearer?

```python
if statepoint is None and _id is None:
    raise ValueError("Either statepoint or _id must be provided.")
elif statepoint is not None:
    # Set state point if provided.
    self._statepoint = SyncedAttrDict(statepoint, parent=_sp_save_hook(self))
    # Set id. The id is computed from the state point if not provided.
    self._id = calc_id(self.statepoint()) if _id is None else _id
else:
    # State point will be loaded lazily.
    self._statepoint = None
    self._id = _id
```
> 3. In the code `self._id = calc_id(self.statepoint())`, do we use the getter method rather than the provided `statepoint` argument to the `__init__` method in case the job has been initialized before? It seems like an extra system call because the getter method calls `job._check_manifest()`.

Ah, this is confusing. self.statepoint() accesses the property self.statepoint and then uses its __call__ method, which returns a "plain dict" instead of a synced dict. It looks like a getter but it's not. As mentioned above, the Job.__init__ method cannot trigger Job._check_manifest by its construction. Only opening by id and lazy-loading the state point can trigger that method. For efficiency, all state points that are pulled from the project's in-memory cache or a disk cache are trusted to be correct (and not corrupted) on-disk. The Project.check() method can be used to explicitly verify all state points on disk.

> 4. What’s the difference between the errors raised in `job._read_manifest()`? In the docstring, both say the same thing: “If an error occurs while reading/parsing the state point manifest.” The solution here might be to not raise a JobsCorruptedError when catching a ValueError?

These errors are triggered by different causes, but I couldn't figure out how to explain the difference concisely. The error-raising behavior should be kept the same as before (so it's not a breaking change), but the documentation could be improved. Here's an explanation of the potential causes: OSError is difficult to test and likely results from something like hardware failure or permissions issues, while JobsCorruptedError is caused by json.loads raising a JSONDecodeError (a subclass of ValueError; we catch ValueError for generality and because it saves us from importing that specific error class). A file that can be read but isn't valid JSON will raise a ValueError, which is caught and re-raised as a JobsCorruptedError. A file that can't be read will raise an OSError. I welcome your thoughts on ways to improve the docs here.
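A sketch of that error mapping, reconstructed from the description above (the attribute name _statepoint_filename is hypothetical):

```python
def _read_manifest(self):
    """Read and parse the state point manifest file."""
    with open(self._statepoint_filename, "rb") as f:  # may raise OSError
        blob = f.read()
    try:
        return json.loads(blob)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError: unparseable
        # contents are reported as corruption rather than as a read failure.
        raise JobsCorruptedError([self._id])
```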

> 5. The docstring describing the JobsCorruptedError in `_check_manifest` sounds like a combination of OSError and JobsCorruptedError:
>
>    > If the manifest hash is not equal to the job id, or if an error occurs while reading/parsing the state point manifest.
>
>    Something is inconsistent between this bullet and the previous one. Maybe the docstring for `_check_manifest` should just say “If the manifest hash is not equal to the job id.” because `_check_manifest()` calls `_read_manifest()`, which would raise the OSError.

Edited. You're correct that this should only say "If the manifest hash is not equal to the job id."

In contrib/project.py

> 1. This is slightly related to the [confusion over opening a job by an id not in the project](https://github.com/glotzerlab/signac/pull/239/#issuecomment-556069852). Is the following true?: If you open a job by an id that doesn’t match any existing job using
>
>    ```python
>    fail_job = project.open_job(id='f' * 32)
>    ```
>
>    the only error will be when you try to access `fail_job.statepoint()` (because it calls `_check_manifest()`), which would be **after** opening the job? I think we need to document this better or have it fail earlier.

No, that's not a problem in the current implementation. If you open by an (invalid) id, it will check Project._contains_job_id which ensures that a directory exists in the workspace and raises KeyError if not. However, it does not verify that a state point file exists in that directory. This means that an empty directory ffffffffffffffffffffffffffffffff could be opened as a job, but would raise an error when its state point is accessed. I believe this is a desired and acceptable outcome. If that happens, a user has broken an invariant of the signac data model (in this case, "all job workspace directories must contain a manifest and be named according to the hash of that manifest"). We can use that invariant as an assumption, which is what enables us to do lazy-loading in the first place. We haven't formalized our data model in a document but it might be a good idea (currently, our implementation is the data model specification).
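Per that explanation, the expected behavior can be summarized with a short usage sketch:

```python
import signac

project = signac.get_project()
try:
    job = project.open_job(id="f" * 32)  # no such workspace directory
except KeyError:
    print("Unknown ids fail fast at open_job().")
# An empty directory named like a job id would open fine, however, and
# only raise an error once the state point is actually accessed.
```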

@bdice bdice requested a review from cbkerr December 30, 2020 08:04
@csadorf (Contributor) approved these changes

Great! 👍

@cbkerr (Member) left a comment

Thank you for your detailed answers @bdice!

After addressing these small comment/docs things, it looks good to me!

  1. The new doc in the Job class looks good.

  2. Ah yes, and the conditional structure you suggest is clearer! The new structure can also address @csadorf’s comment by moving the line self._project._register(self) inside the proper condition.
     EDIT: I see that these changes are part of #451 (Optimization: job/project initialization).

  3. Let’s add a comment at the line I was confused about (self._id = calc_id(self.statepoint()) if _id is None else _id) to remind us of this. Maybe something like:

     ```python
     # Simply returns the (properly JSON-encoded) statepoint if provided already (no _check_manifest).
     ```

  4. Thanks for the explanation. To fix the documentation of the kinds of errors, I think we just need to separate reading and parsing, like:

     JobsCorruptedError
         If an error occurs while parsing the state point manifest.
     OSError
         If an error occurs while reading the state point manifest.

     It would also help to have a comment reminding that JSONDecodeError is a subclass of ValueError at the line where we catch it.

project.py

  1. Got it, so basically, if the user manually (outside of signac) creates a directory 'f'*32 and then opens the job 'f'*32, they’ve created their own trouble.

@bdice (Member, Author) commented Dec 30, 2020

Thanks again, @csadorf and @cbkerr. I have addressed your suggestions in this PR or #451.

@bdice bdice merged commit fd14f48 into master Dec 30, 2020
@bdice bdice deleted the feature/lazy-statepoint-loading branch December 30, 2020 18:43
@bdice bdice mentioned this pull request Jan 7, 2021
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

Proposal: Lazy statepoint loading