Proposal for a snapshot feature #69

Closed
csadorf opened this issue Sep 22, 2017 · 3 comments
csadorf commented Sep 22, 2017

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


About

This is a proposal for the implementation of a feature that would allow users to generate (shallow/deep) snapshots of one or more job workspace directories and other additional directories.

The requirement to be able to create snapshots arises from the need for improved provenance management, where users would like to diligently track the exact path the data took from its source to its current state. Snapshots are a crucial part of such provenance management.

The snapshot directory structure

Snapshots are directories that contain

  • All non-hidden files that are within the selected job's workspace directories.
  • The project document.
  • Any additional path that the user manually specifies, e.g., to make certain source code directories part of the snapshot.

A snapshot directory would have the following structure:

./workspace/<job-id_0>/
./workspace/<job-id_1>/
# ...
./signac_project_document.json
./<additional_relative_path_1>
./<additional_relative_path_2>
# ...

The name of the workspace directory would always be workspace regardless of its actual path.
We use this convention to ensure that signac is able to translate to the actual workspace directory on restore.
This makes it illegal to add additional paths named 'workspace'.
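
The path translation described above could be sketched as follows. This is only an illustration under stated assumptions: the helper names `project_to_snapshot` and `snapshot_to_project` are hypothetical and not part of signac, all paths are taken relative to the project root, and a `ValueError` is assumed as the exception type for the reserved name.

```python
from pathlib import PurePosixPath

# Fixed name of the workspace directory inside every snapshot.
SNAPSHOT_WORKSPACE = PurePosixPath("workspace")


def project_to_snapshot(path, workspace_dir):
    """Map a project-relative path to its location inside a snapshot."""
    path, workspace_dir = PurePosixPath(path), PurePosixPath(workspace_dir)
    try:
        # Files inside the actual workspace always land under 'workspace'.
        return SNAPSHOT_WORKSPACE / path.relative_to(workspace_dir)
    except ValueError:
        pass  # Not inside the workspace: treat it as an additional path.
    if path.parts[:1] == SNAPSHOT_WORKSPACE.parts:
        # An additional path named 'workspace' would collide with the
        # translated workspace directory, so it is rejected.
        raise ValueError("'workspace' is reserved inside snapshots")
    return path


def snapshot_to_project(path, workspace_dir):
    """Map a snapshot-relative path back to the project on restore."""
    path = PurePosixPath(path)
    try:
        return PurePosixPath(workspace_dir) / path.relative_to(SNAPSHOT_WORKSPACE)
    except ValueError:
        return path  # Additional paths keep their relative location.
```

With an actual workspace at `data/ws`, the file `data/ws/<job-id>/out.txt` would be stored as `workspace/<job-id>/out.txt` and mapped back on restore.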

Proposed API

Creating snapshots

Snapshots would be created at the project level.

The snapshot method would have the following signature:

Project.snapshot(self,
    # Only selected jobs are part of the snapshot.
    selection=None,
    # Paths to other directories to be included with the snapshot.
    include=None,
    # Do not include the project document if False.
    include_document=True,
    # Create a deep copy if True.
    deep=False,
    # Do not validate the id of the snapshot if False.
    validate=True,
    ): snapshot_id [int]

  • The optional selection argument is a list of jobs or a list of job ids and defaults to all jobs.
  • The optional include argument is a path (or a list of paths) that shall be included with the snapshot. All paths are evaluated with respect to the project root directory. Providing an absolute path raises an exception.
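
A minimal sketch of how the include argument might be normalized, assuming a hypothetical helper `normalize_include` and assuming `ValueError` as the exception type (the proposal only states that an exception is raised):

```python
import os


def normalize_include(include):
    """Return `include` as a list of normalized, project-relative paths."""
    if include is None:
        return []
    # Accept a single path as well as a list of paths.
    if isinstance(include, (str, os.PathLike)):
        include = [include]
    paths = []
    for path in include:
        path = os.fspath(path)
        if os.path.isabs(path):
            # Assumption: the exception type is not fixed by the proposal.
            raise ValueError(f"absolute paths are not allowed: {path!r}")
        paths.append(os.path.normpath(path))
    return paths
```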

Listing snapshots

Project.find_snapshots(self,
    selection=None,
    touched=None,
    ): snapshots_ids [list(int)]

  • The optional selection argument is a list of jobs or a list of job ids and defaults to all jobs.
  • The touched argument would allow users to provide a file name pattern and to only list snapshots that modified files matching the pattern.

This function lists the ids of all snapshots that overlap with the optional selection, in reverse chronological order.
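
The filtering and ordering could look roughly like the sketch below. The in-memory `SnapshotRecord` type and its fields are assumptions made for illustration; the proposal does not specify how snapshot metadata would be stored.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class SnapshotRecord:
    # Hypothetical metadata record; not part of the proposal.
    snapshot_id: int
    timestamp: float
    job_ids: frozenset
    modified_files: tuple = ()


def find_snapshots(records, selection=None, touched=None):
    """Return matching snapshot ids, newest first."""
    selected = None if selection is None else set(selection)
    matches = []
    for record in records:
        # Keep only snapshots that share at least one job with the selection.
        if selected is not None and not (record.job_ids & selected):
            continue
        # Keep only snapshots that modified a file matching the pattern.
        if touched is not None and not any(
            fnmatch(name, touched) for name in record.modified_files
        ):
            continue
        matches.append(record)
    matches.sort(key=lambda record: record.timestamp, reverse=True)
    return [record.snapshot_id for record in matches]
```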

Restoring snapshots

Project.restore(self,
    # The number of restores to go back in time (defaults to 0).
    index=None,
    # One can provide the id directly (but not both)
    _id=None,
    ): snapshot_id [int]

The restore function restores the most recent (or the n-th most recent) snapshot in chronological order; the index is looked up and translated into a specific snapshot id.
Alternatively, the user can provide the snapshot id directly, but not both an index and an id.
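
The lookup could be sketched as below. The helper `resolve_snapshot_id` is hypothetical, and it assumes that index 0 refers to the most recent snapshot and that the list of ids is already ordered newest first; the exception types are likewise assumptions.

```python
def resolve_snapshot_id(snapshot_ids, index=None, _id=None):
    """Translate an index or an explicit id into a snapshot id.

    `snapshot_ids` must be ordered newest first.
    """
    if index is not None and _id is not None:
        raise ValueError("provide either index or _id, not both")
    if _id is not None:
        if _id not in snapshot_ids:
            raise KeyError(f"unknown snapshot id: {_id:x}")
        return _id
    index = 0 if index is None else index
    if not 0 <= index < len(snapshot_ids):
        raise IndexError(f"no snapshot at index {index}")
    return snapshot_ids[index]
```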

Example workflow

The following listing illustrates some typical examples for usage:

# Snapshot the whole project
project.snapshot()

# Snapshot a specific job
project.snapshot(selection=[a_job])

# Restore the project to last snapshotted state
project.restore()

# Restore the project to the 3rd last state
project.restore(3)

# Restore the project to a specific snapshot id:
project.restore(_id=int('abcdefef...', 16))

Implementation Details

Algorithm for the generation of snapshots

The snapshot algorithm would consist of the following steps:

  1. Record the current timestamp and index all files that should be part of the snapshot.
  2. A deterministic snapshot id is calculated based on the hash value of each file's content, similar to a git commit id. Race conditions are addressed by following the algorithm outlined here. If a snapshot with the exact same id already exists, no snapshot is created.
  3. All files that are part of the snapshot are hard linked into a temporary directory $root/.signac/snapshots/_tmp_<snapshot-id>, taking previous snapshots into account to avoid unnecessary copies, unless the user explicitly requests a deep copy. An exception is raised if the temporary directory already exists.
  4. The snapshot id is optionally validated over the linked files and the temporary directory is moved to $root/.signac/snapshots/<snapshot-id>.
  5. Symbolic links are created with the following names (using the timestamp recorded in step 1):
    • $root/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
    • $workspace/<jobid_0>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
    • $workspace/<jobid_1>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
    • ...
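
A simplified sketch of steps 1 to 4, written under several assumptions that go beyond the proposal: it snapshots every non-hidden file below a single root directory rather than a job selection, uses SHA-1 over relative paths and file contents as the id, names snapshot directories by the hexadecimal id, and omits deep copies, reuse of previous snapshots, and the timestamped symbolic links of step 5. All function names are hypothetical.

```python
import hashlib
import os
import shutil


def index_files(root):
    """Return the sorted relative paths of all non-hidden files below root."""
    files = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories (this also skips .signac itself).
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.startswith("."):
                files.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(files)


def compute_snapshot_id(root, files):
    """Derive a deterministic integer id from file paths and contents."""
    digest = hashlib.sha1()
    for rel in files:
        digest.update(rel.encode("utf-8") + b"\0")
        with open(os.path.join(root, rel), "rb") as fh:
            digest.update(hashlib.sha1(fh.read()).digest())
    return int(digest.hexdigest(), 16)


def create_snapshot(root, validate=True):
    files = index_files(root)
    snapshot_id = compute_snapshot_id(root, files)
    snapshots = os.path.join(root, ".signac", "snapshots")
    final = os.path.join(snapshots, format(snapshot_id, "x"))
    if os.path.exists(final):
        return snapshot_id  # An identical snapshot already exists.
    os.makedirs(snapshots, exist_ok=True)
    tmp = os.path.join(snapshots, "_tmp_" + format(snapshot_id, "x"))
    os.mkdir(tmp)  # Raises FileExistsError if the directory already exists.
    for rel in files:
        target = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(os.path.join(root, rel), target)  # Hard link, no data copy.
    if validate and compute_snapshot_id(tmp, files) != snapshot_id:
        shutil.rmtree(tmp)
        raise RuntimeError("files changed while the snapshot was created")
    os.rename(tmp, final)
    return snapshot_id
```

Because the files are hard linked, the snapshot shares storage with the originals; it only preserves the old content if later writes replace files instead of modifying them in place.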

Possible Race Conditions

In general it would not be safe to modify the job's workspace data during the creation of a snapshot.
This is a list of possible race conditions and their impact:

Creating two snapshots at the same time

Of two simultaneous attempts to create a snapshot with the same id, the second would fail because the temporary directory already exists. The creation of directories is an atomic operation.

  • Impact: None
  • Solution: not required
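
The reliance on atomic directory creation can be illustrated with a short sketch; the helper name `try_claim` is hypothetical. `os.mkdir` either creates the directory or raises `FileExistsError`, so only one of two competing callers can succeed.

```python
import os


def try_claim(path):
    """Return True if this caller created the directory, False otherwise."""
    try:
        os.mkdir(path)
    except FileExistsError:
        # Another snapshot attempt already owns the temporary directory.
        return False
    return True
```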

Modification of files during the creation of snapshots

The final snapshot id is calculated after the snapshot has been created, so the snapshot id would be consistent. However, the snapshot may contain files at different states.

  • Possible impact: Inconsistent snapshot.
  • Possible solution: Validate the final snapshot id.

Addition or removal of files during the creation of snapshots

  • Possible Impact: Inconsistent snapshot.
  • Possible solution: None.

csadorf commented Sep 22, 2017

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@PaulDodd @vyasr @bdice I would appreciate your input.


csadorf commented Sep 28, 2017

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


Might be nice to have more granular restores, i.e. restoring just specific jobs from a certain commit. Agree with the general proposal though.


csadorf commented Jan 26, 2018

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


We are currently evaluating whether the use of 3rd party tools/libraries might be a better solution.

csadorf pushed a commit that referenced this issue Feb 4, 2019
Enable transparently gzipped collections

* Merged in feature/enable-gzip-use-binary-mode (pull request #69)

    Feature/enable gzip use binary mode

    * Refactor the collection io streams to support binary modes.

        This streamlines operating on zipped binary files.

        Use io.BytesIO() file for compressed collection construction.

    * Implement unit test to check compression ratio of compressed collections.

    * Implement reading/writing of compressed files for collections.

    * Add unit tests to check buffer size.

    * Fix issue where collection is not flushed to file after construction.

        This issue is related to implementation internals and does not affect
        the public API.

    Approved-by: Eric Harper <harperic@umich.edu>
    Approved-by: Carl Simon Adorf <csadorf@umich.edu>

* Update changelog, contributors

* Update docstring, reorder args

Approved-by: Vyas Ramasubramani <vramasub@umich.edu>
Approved-by: Carl Simon Adorf <csadorf@umich.edu>

mikemhenry removed the major label Feb 21, 2019
b-butler closed this as completed Jul 5, 2019