Proposal for a snapshot feature #69
Comments
csadorf pushed a commit that referenced this issue on Feb 4, 2019:

Enable transparently gzipped collections

* Merged in feature/enable-gzip-use-binary-mode (pull request #69)
* Feature/enable gzip use binary mode
* Refactor the collection io streams to support binary modes. This streamlines operating on zipped binary files. Use io.BytesIO() file for compressed collection construction.
* Implement unit test to check compression ratio of compressed collections.
* Implement reading/writing of compressed files for collections.
* Add unit tests to check buffer size.
* Fix issue where collection is not flushed to file after construction. This issue is related to implementation internals and does not affect the public API.
  Approved-by: Eric Harper <harperic@umich.edu>
  Approved-by: Carl Simon Adorf <csadorf@umich.edu>
* Update changelog, contributors
* Update docstring, reorder args
  Approved-by: Vyas Ramasubramani <vramasub@umich.edu>
  Approved-by: Carl Simon Adorf <csadorf@umich.edu>
Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).
About
This is a proposal for the implementation of a feature that would allow users to generate (shallow or deep) snapshots of one or more job workspace directories and of additional directories.
The requirement to be able to create snapshots arises from the need for improved provenance management, where users would like to diligently track the exact path the data took from its source to its current state. Snapshots are a crucial part of such provenance management.
The snapshot directory structure
Snapshots are directories that contain
A snapshot directory would have the following structure:
The name of the workspace directory would always be `workspace` regardless of its actual path. We use this convention to ensure that signac is able to translate to the actual workspace directory on restore.
This makes it illegal to add additional paths named 'workspace'.
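A minimal sketch of how the reserved name could be enforced when a user passes additional directories. The function name `check_additional_paths` and the assumption that an additional path is stored under its base name are illustrative only and not part of the proposal.

```python
import os

RESERVED_NAME = "workspace"  # name reserved for the workspace directory inside a snapshot


def check_additional_paths(paths):
    """Raise ValueError if any additional path would be stored as 'workspace'.

    Hypothetical helper: assumes that an additional path is stored in the
    snapshot under its base name.
    """
    for path in paths:
        # normpath strips trailing separators, so 'foo/workspace/' is caught too
        name = os.path.basename(os.path.normpath(path))
        if name == RESERVED_NAME:
            raise ValueError(
                f"Illegal additional path {path!r}: the name {RESERVED_NAME!r} is reserved."
            )
    return list(paths)
```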
Proposed API
Creating snapshots
Snapshots would be created at the project level.
The snapshot method would have the following signature:
Listing snapshots
* The `selection` argument is a list of jobs or a list of job ids and defaults to all jobs.
* The `touched` argument would allow users to provide a file name pattern and to only list snapshots that modified files matching the pattern.

This function lists all snapshot ids in reversed chronological order that overlap with the optional selection.
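The filtering described here can be sketched as follows. The in-memory index (a list of dicts with the keys `id`, `time`, `jobs` and `files`) and the function name `list_snapshots` are assumptions made for illustration; the proposal does not specify how snapshot metadata is stored, and this sketch accepts job ids only.

```python
from datetime import datetime
from fnmatch import fnmatch


def list_snapshots(snapshots, selection=None, touched=None):
    """Return snapshot ids, newest first, that overlap with the selection.

    `snapshots` is a hypothetical index: a list of dicts with the keys
    'id', 'time' (datetime), 'jobs' (set of job ids) and 'files'
    (names of the files modified by that snapshot).
    """
    selected = None if selection is None else set(selection)
    result = []
    for snap in sorted(snapshots, key=lambda s: s["time"], reverse=True):
        # Skip snapshots that share no job with the selection.
        if selected is not None and not (snap["jobs"] & selected):
            continue
        # Skip snapshots that modified no file matching the pattern.
        if touched is not None and not any(fnmatch(f, touched) for f in snap["files"]):
            continue
        result.append(snap["id"])
    return result


index = [
    {"id": "s1", "time": datetime(2019, 1, 1), "jobs": {"a", "b"}, "files": ["init.gsd"]},
    {"id": "s2", "time": datetime(2019, 1, 2), "jobs": {"b"}, "files": ["dump.log"]},
    {"id": "s3", "time": datetime(2019, 1, 3), "jobs": {"c"}, "files": ["final.gsd"]},
]
```

With this index, `list_snapshots(index, selection=["b"])` would consider only `s1` and `s2`, and `touched="*.gsd"` would narrow the result further to snapshots that modified a matching file.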
Restoring snapshots
By default, the restore function restores the most recent snapshot (or the n-th most recent one); this chronological position is looked up and translated into a specific snapshot id.
Alternatively, the user can provide the snapshot id directly.
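The lookup could work roughly like this. The name `resolve_snapshot_id`, its parameters, and the convention that `n=1` denotes the most recent snapshot are assumptions for illustration, not the proposed signature.

```python
def resolve_snapshot_id(snapshot_ids, n=1, snapshot_id=None):
    """Translate 'n-th last snapshot' into a concrete snapshot id.

    `snapshot_ids` is assumed to be ordered newest first, for example the
    result of a listing in reversed chronological order. An explicitly
    given `snapshot_id` takes precedence over `n`.
    """
    if snapshot_id is not None:
        if snapshot_id not in snapshot_ids:
            raise KeyError(f"Unknown snapshot id: {snapshot_id!r}")
        return snapshot_id
    if not 1 <= n <= len(snapshot_ids):
        raise IndexError(f"There is no snapshot number {n} counting from the most recent.")
    return snapshot_ids[n - 1]
```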
Example workflow
The following listing illustrates some typical examples for usage:
Implementation Details
Algorithm for the generation of snapshots
The snapshot algorithm would consist of the following steps:
* `$root/.signac/snapshots/_tmp_<snapshot-id>` taking previous snapshots into account to avoid unnecessary copies, unless the user explicitly requests a deep copy. An exception is raised if the temporary directory already exists.
* `$root/.signac/snapshots/<snapshot-id>`.
* `$root/.signac/snapshots/%Y-%m-%dT%H%M%S.%f`
* `$workspace/<jobid_0>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f`
* `$workspace/<jobid_1>/.signac/snapshots/%Y-%m-%dT%H%M%S.%f`
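To make the timestamped path pattern concrete, the following sketch expands it for a given point in time. Only the format string and the directory layout are taken from the paths above; the function name `timestamp_entries` and its parameters are hypothetical, and the sketch makes no claim about what kind of file system entry (link, file, or directory) is created at these locations.

```python
import os
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H%M%S.%f"  # format taken from the paths listed above


def timestamp_entries(root, workspace, job_ids, when):
    """Build the timestamped paths for the project and for each job."""
    name = when.strftime(TIMESTAMP_FORMAT)
    entries = [os.path.join(root, ".signac", "snapshots", name)]
    for job_id in job_ids:
        entries.append(os.path.join(workspace, job_id, ".signac", "snapshots", name))
    return entries


example_name = datetime(2019, 2, 4, 13, 5, 9, 123456).strftime(TIMESTAMP_FORMAT)
# example_name == "2019-02-04T130509.123456"
```

Because the format sorts lexicographically in chronological order, a plain sorted directory listing would already yield the snapshots oldest first.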
Possible Race Conditions
In general it would not be safe to modify the job's workspace data during the creation of a snapshot.
This is a list of possible race conditions and their impact:
Creating two snapshots at the same time
An attempt to create two snapshots at the same time would fail, because the second attempt would find that the temporary directory already exists. The creation of directories is an atomic operation.
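This behavior can be illustrated with `os.mkdir`, which raises `FileExistsError` when the directory is already present. The helper name `claim_temporary_directory` is hypothetical; only the `_tmp_<snapshot-id>` naming comes from the proposal.

```python
import os
import tempfile


def claim_temporary_directory(snapshots_dir, snapshot_id):
    """Create the temporary snapshot directory or fail if it already exists."""
    path = os.path.join(snapshots_dir, "_tmp_" + snapshot_id)
    # os.mkdir either creates the directory or raises FileExistsError;
    # there is no window in which two callers can both succeed.
    os.mkdir(path)
    return path


with tempfile.TemporaryDirectory() as snapshots_dir:
    first = claim_temporary_directory(snapshots_dir, "abc")
    try:
        claim_temporary_directory(snapshots_dir, "abc")
        second_failed = False
    except FileExistsError:
        second_failed = True  # the concurrent attempt is rejected
```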
Modification of files during the creation of snapshots
The final snapshot id is calculated after the snapshot has been created, so the snapshot id would be consistent. However, the snapshot may contain files at different states.
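The proposal does not state how the snapshot id is calculated. One way to obtain an id that is consistent with the stored data, assumed here purely for illustration, is to hash the relative paths and contents of the finished snapshot directory; the function name `calc_snapshot_id` and the choice of SHA-256 are not taken from the proposal.

```python
import hashlib
import os
import tempfile


def calc_snapshot_id(snapshot_dir):
    """Hash relative file paths and file contents in a deterministic order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(snapshot_dir):
        dirnames.sort()  # fix the traversal order of subdirectories
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            digest.update(os.path.relpath(path, snapshot_dir).encode("utf-8"))
            with open(path, "rb") as file:
                digest.update(file.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.txt"), "w") as file:
        file.write("state A")
    id_before = calc_snapshot_id(tmp)
    id_again = calc_snapshot_id(tmp)  # unchanged data yields the same id
    with open(os.path.join(tmp, "data.txt"), "w") as file:
        file.write("state B")
    id_after = calc_snapshot_id(tmp)  # modified data yields a different id
```

Under this assumption the id always describes what was actually copied, even if the copy captured different files at different points in time.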
Addition or removal of files during the creation of snapshots