Proposal for a snapshot feature #69

csadorf · 2017-09-22T23:24:20Z

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).

About

This is a proposal for the implementation of a feature that would allow users to generate (shallow/deep) snapshots of one or more job workspace directories and other additional directories.

The requirement to be able to create snapshots arises from the need for improved provenance management, where users would like to diligently track the exact path the data took from its source to its current state. Snapshots are a crucial part of such provenance management.

The snapshot directory structure

Snapshots are directories that contain:

- All non-hidden files that are within the selected jobs' workspace directories.
- The project document.
- Any additional path that the user manually specifies, e.g., to make certain source code directories part of the snapshot.

A snapshot directory would have the following structure:

./workspace/<job-id_0>/
./workspace/<job-id_1>/
# ...
./signac_project_document.json
./<additional_relative_path_1>
./<additional_relative_path_2>
# ...

The name of the workspace directory would always be workspace regardless of its actual path.
We use this convention to ensure that signac is able to translate to the actual workspace directory on restore.
This makes it illegal to add additional paths named 'workspace'.
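
The mapping from actual paths to the snapshot layout could be sketched as follows. This is only an illustration under assumptions: the helper name `snapshot_layout` and its arguments are hypothetical, the project document is assumed to be stored as `signac_project_document.json` in the project root, and `ValueError` is an arbitrary choice of exception.

```python
import os


def snapshot_layout(root, workspace, job_ids, include=()):
    """Map source paths to their location inside a snapshot directory.

    Hypothetical helper for illustration, not part of signac.
    """
    layout = {}
    for job_id in job_ids:
        # The workspace is always stored as 'workspace', whatever its real path is.
        layout[os.path.join(workspace, job_id)] = os.path.join("workspace", job_id)
    # Assumption: the project document lives in the project root under this name.
    document = "signac_project_document.json"
    layout[os.path.join(root, document)] = document
    for path in include:
        rel = os.path.normpath(path)
        # 'workspace' is reserved, so no additional path may start with it.
        if rel.split(os.sep)[0] == "workspace":
            raise ValueError("'workspace' is a reserved name within snapshots.")
        layout[os.path.join(root, rel)] = rel
    return layout
```

On restore, the same mapping could be inverted to translate `workspace/<job-id>` back to the actual workspace directory.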

Proposed API

Creating snapshots

Snapshots would be created on the project-level.

The snapshot method would have the following signature:

Project.snapshot(self,
    # Only selected jobs are part of the snapshot.
    selection=None,
    # Paths to other directories to be included with the snapshot.
    include=None,
    # Do not include the project document if False.
    include_document=True,
    # Create a deep copy if True.
    deep=False, 
    # Do not validate the id of the snapshot if False.
    validate=True,
   ): snapshot_id [int]

The optional selection argument is a list of jobs or a list of job ids and defaults to all jobs.
The optional include argument is a path (or a list of paths) that shall be included with the snapshot. All paths are evaluated with respect to the project root directory. Providing an absolute path raises an exception.
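
A minimal sketch of how the include argument might be normalized, assuming a hypothetical helper `normalize_include` and assuming that the raised exception is a `ValueError` (the proposal does not specify the exception type):

```python
import os


def normalize_include(include):
    """Return the include argument as a list of normalized relative paths.

    Hypothetical helper for illustration, not part of signac.
    """
    if include is None:
        return []
    if isinstance(include, str):
        # A single path is treated like a list with one entry.
        include = [include]
    paths = []
    for path in include:
        if os.path.isabs(path):
            # Assumption: ValueError; the proposal only says "an exception".
            raise ValueError("Absolute paths are not allowed: {}".format(path))
        paths.append(os.path.normpath(path))
    return paths
```

The returned paths would then be joined with the project root directory when the files are indexed.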

Listing snapshots

Project.find_snapshots(self,
    selection=None,
    touched=None,
    ): snapshots_ids [list(int)]

The optional selection argument is a list of jobs or a list of job ids and defaults to all jobs.
The touched argument would allow users to provide a file name pattern and to only list snapshots that modified files matching the pattern.

This function lists the ids of all snapshots that overlap with the optional selection, in reverse chronological order.
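
The filtering and ordering could look roughly like this. The sketch assumes that some index of snapshot records is available; the record keys (`id`, `timestamp`, `job_ids`, `modified`) are hypothetical, and matching the touched pattern with `fnmatch` is one possible interpretation.

```python
from fnmatch import fnmatch


def find_snapshots(records, selection=None, touched=None):
    """List snapshot ids, newest first, that overlap with the selection.

    `records` is a hypothetical list of dicts with the keys 'id',
    'timestamp', 'job_ids' and 'modified' (names of modified files).
    """
    matches = []
    for record in records:
        # Skip snapshots that share no job with the selection.
        if selection is not None and not set(selection) & set(record["job_ids"]):
            continue
        # Skip snapshots that modified no file matching the pattern.
        if touched is not None and not any(
            fnmatch(name, touched) for name in record["modified"]
        ):
            continue
        matches.append(record)
    # Reverse chronological order: most recent snapshot first.
    matches.sort(key=lambda record: record["timestamp"], reverse=True)
    return [record["id"] for record in matches]
```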

Restoring snapshots

Project.restore(self,
    # The number of snapshots to go back in time (defaults to 0).
    index=None,
    # One can provide the id directly (but not both)
    _id=None,
    ): snapshot_id [int]

The restore function is designed to restore the last (n'th last) snapshot chronologically, which is then looked up and translated into a specific snapshot id.
Alternatively, the user can provide the snapshot id directly.
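
The lookup from index to snapshot id might be sketched like this. The helper `resolve_snapshot_id` and the `history` list (snapshot ids ordered from newest to oldest) are hypothetical, and the sketch assumes that index 0 refers to the most recent snapshot:

```python
def resolve_snapshot_id(history, index=None, _id=None):
    """Translate the arguments of restore() into a specific snapshot id.

    `history` is a hypothetical list of snapshot ids, newest first.
    """
    if index is not None and _id is not None:
        raise ValueError("Provide either index or _id, but not both.")
    if _id is not None:
        if _id not in history:
            raise KeyError(_id)
        return _id
    # Assumption: index 0 (the default) is the most recent snapshot.
    return history[0 if index is None else index]
```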

Example workflow

The following listing illustrates some typical usage examples:

# Snapshot the whole project
project.snapshot()

# Snapshot a specific job
project.snapshot(selection=[a_job])

# Restore the project to last snapshotted state
project.restore()

# Restore the project to the 3rd last state
project.restore(3)

# Restore the project to a specific snapshot id:
project.restore(_id=int('abcdefef...', 16))

Implementation Details

Algorithm for the generation of snapshots

The snapshot algorithm would consist of the following steps:

1. Record the current timestamp and index all files that should be part of the snapshot.
2. A deterministic snapshot id is calculated based on the hash value of each file's content, similar to a git commit id. Race conditions are addressed by following the algorithm outlined here. If a snapshot with the exact same id already exists, no snapshot is created.
3. All files that are part of the snapshot are hard linked into a temporary directory $root/.signac/snapshots/_tmp_<snapshot-id>, taking previous snapshots into account to avoid unnecessary copies, unless the user explicitly requests a deep copy. An exception is raised if the temporary directory already exists.
4. The snapshot id is optionally validated over the linked files and the temporary directory is moved to $root/.signac/snapshots/<snapshot-id>.
5. Symbolic links are created with the following names (using the timestamp recorded in step 1):
- $root/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
- $workspace/<jobid_0>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
- $workspace/<jobid_1>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f
- ...
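
A rough sketch of the id calculation, hard linking, validation, and final move is given below. It rests on several assumptions that are not part of the proposal: SHA-1 as the hash function, the relative file path being hashed together with the content, a hexadecimal directory name, and the helper names `compute_snapshot_id` and `create_snapshot`. Reuse of previous snapshots, deep copies, and the timestamped symbolic links are left out.

```python
import hashlib
import os


def compute_snapshot_id(root, relpaths):
    """Hash the relative path and content of every file (assumption: SHA-1)."""
    digest = hashlib.sha1()
    for rel in sorted(relpaths):
        digest.update(rel.encode("utf-8") + b"\0")
        with open(os.path.join(root, rel), "rb") as file:
            for chunk in iter(lambda: file.read(65536), b""):
                digest.update(chunk)
    return int(digest.hexdigest(), 16)


def create_snapshot(root, relpaths, snapshots_dir, validate=True):
    """Hard link the indexed files into a new snapshot directory."""
    snapshot_id = compute_snapshot_id(root, relpaths)
    final = os.path.join(snapshots_dir, format(snapshot_id, "x"))
    if os.path.isdir(final):
        return snapshot_id  # An identical snapshot already exists.
    os.makedirs(snapshots_dir, exist_ok=True)
    tmp = os.path.join(snapshots_dir, "_tmp_" + format(snapshot_id, "x"))
    os.mkdir(tmp)  # Raises FileExistsError if the directory already exists.
    for rel in relpaths:
        destination = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        os.link(os.path.join(root, rel), destination)
    # Optionally validate the id over the linked files before publishing.
    if validate and compute_snapshot_id(tmp, relpaths) != snapshot_id:
        raise RuntimeError("Files changed while the snapshot was created.")
    os.rename(tmp, final)
    return snapshot_id
```

Note that hard links require the snapshot directory and the source files to reside on the same file system.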

Possible Race Conditions

In general it would not be safe to modify the job's workspace data during the creation of a snapshot.
This is a list of possible race conditions and their impact:

Creating two snapshots at the same time

An attempt to create two identical snapshots at the same time would fail for one of them, because the temporary directory already exists. The creation of directories is an atomic operation.

Impact: None
Solution: not required
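
The reliance on atomic directory creation can be illustrated with a small sketch. The helper name `claim_temporary_directory` is hypothetical; `os.mkdir` raises `FileExistsError` if the directory already exists, so only one of two competing callers can succeed:

```python
import os


def claim_temporary_directory(path):
    """Return True if this caller created the directory, False otherwise."""
    try:
        # os.mkdir either creates the directory or fails; it never does both.
        os.mkdir(path)
    except FileExistsError:
        return False
    return True
```

A process that receives False would abort its snapshot attempt rather than write into a directory owned by another process.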

Modification of files during the creation of snapshots

The final snapshot id is calculated after the snapshot has been created, so the snapshot id would be consistent. However, the snapshot may contain files at different states.

Possible impact: Inconsistent snapshot.
Possible solution: Validate the final snapshot id.

Addition or removal of files during the creation of snapshots

Possible impact: Inconsistent snapshot.
Possible solution: None.


csadorf · 2017-09-22T23:28:51Z

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).

@PaulDodd @vyasr @bdice I would appreciate your input.

csadorf · 2017-09-28T17:51:37Z

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).

Might be nice to have more granular restores, i.e. restoring just specific jobs from a certain commit. Agree with the general proposal though.

csadorf · 2018-01-26T19:12:54Z

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).

We are currently evaluating whether the use of 3rd party tools/libraries might be a better solution.

Enable transparently gzipped collections
* Merged in feature/enable-gzip-use-binary-mode (pull request #69)
  Feature/enable gzip use binary mode
* Refactor the collection io streams to support binary modes. This streamlines operating on zipped binary files. Use io.BytesIO() file for compressed collection construction.
* Implement unit test to check compression ratio of compressed collections.
* Implement reading/writing of compressed files for collections.
* Add unit tests to check buffer size.
* Fix issue where collection is not flushed to file after construction. This issue is related to implementation internals and does not affect the public API.
  Approved-by: Eric Harper <harperic@umich.edu>
  Approved-by: Carl Simon Adorf <csadorf@umich.edu>
* Update changelog, contributors
* Update docstring, reorder args
  Approved-by: Vyas Ramasubramani <vramasub@umich.edu>
  Approved-by: Carl Simon Adorf <csadorf@umich.edu>

csadorf added major labels Jan 30, 2019

mikemhenry removed the major label Feb 21, 2019

csadorf removed Project labels Jul 5, 2019

b-butler closed this as completed Jul 5, 2019
