Versioning for input/output files.
When working on a version-controlled project, we often use or produce
artifacts (configuration files, logs, measurements, figures, etc.)
that correspond to a particular version of the project but are not
part of it (i.e. they are not tracked by the VCS). After a few
executions, it quickly becomes difficult to keep track of which
version of the project consumed or generated which files. vio
addresses this by letting a user create a snapshot of the unversioned
files after a program has executed, and by storing this snapshot in
association with the latest revision of the project.
git clone https://project.git
cd project
# work, work, work
git add -u
git commit -m "I worked hard and implemented many things"
# parametrize execution
echo "my configs for a particular execution" > params.conf
# execute and generate some results
./program -c params.conf > execution.out
# commit anything that is not being tracked by git. In this
# particular case, files params.conf and execution.out
vio commit -m "the result of my hard work"

In a nutshell, vio:
- Finds all files that are not tracked by the VCS.
- Creates a dataset of all unversioned files.
- Puts the dataset in a storage backend, associating it with an
  execution ID (commit_id + timestamp).
- Provides versioning semantics for datasets, allowing users to compare distinct versions.
- Stores metadata for datasets, allowing users to annotate and contextualize them for future introspection.
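
The first three steps can be sketched with plain git and tar. This is a minimal sketch under stated assumptions, not vio's implementation: the function names make_execution_id, pack_dataset and snapshot_untracked are hypothetical, as are the choice of a gzip-compressed tarball as the dataset format and of "+" as the separator inside the execution ID. It also assumes a tar that accepts --null together with --files-from (GNU tar and bsdtar do).

```shell
# Hypothetical helpers; the names are illustrative and not part of vio.

# Combine a commit ID ($1) and a timestamp ($2) into an execution ID.
make_execution_id() {
    printf '%s+%s\n' "$1" "$2"
}

# Read NUL-separated paths on stdin and pack them into the tarball $1.
pack_dataset() {
    tar --null --files-from=- -czf "$1"
}

# Snapshot the untracked files of the git repository in the current
# directory into the directory $1, then print the execution ID.
# $1 should live outside the working tree so that the archive is not
# itself listed as an untracked file.
snapshot_untracked() {
    store_dir=$1
    commit_id=$(git rev-parse HEAD) || return 1
    exec_id=$(make_execution_id "$commit_id" "$(date -u +%Y%m%dT%H%M%SZ)")
    # --others lists every file git does not track (ignored files included).
    git ls-files --others -z | pack_dataset "$store_dir/$exec_id.tar.gz" || return 1
    printf '%s\n' "$exec_id"
}
```

Calling snapshot_untracked with a directory outside the repository prints the new execution ID. Passing --exclude-standard to git ls-files would additionally skip files matched by .gitignore, which may or may not be what a given project wants.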
vio's "database" has the following schema:
commit_id | execution_id | vio_commit_message | files | metadata |
commit_id corresponds to the version in the VCS, while execution_id
is a timestamp taken at the moment the snapshot is created.
vio_commit_message is the message given to vio commit -m.
files is the working-directory snapshot of all unversioned files.
Lastly, metadata is a collection of key-value pairs.
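
To make the schema concrete, the sketch below appends one record per snapshot to a tab-separated index file. The flat-file layout, the add_record name, and the comma-joined key=value encoding of metadata are assumptions made for illustration; vio's actual storage format may differ.

```shell
# Hypothetical flat-file index, one tab-separated record per snapshot.
# Usage: add_record INDEX COMMIT_ID EXECUTION_ID MESSAGE FILES [KEY=VALUE...]
add_record() {
    index=$1 commit_id=$2 execution_id=$3 message=$4 files=$5
    shift 5
    # Join the remaining KEY=VALUE arguments with commas.
    metadata=$(IFS=,; printf '%s' "$*")
    # Columns: commit_id, execution_id, vio_commit_message, files, metadata.
    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$commit_id" "$execution_id" "$message" "$files" "$metadata" >> "$index"
}
```

In this sketch the files column holds the name of the archive that contains the snapshot rather than the files themselves, and fields are assumed not to contain tab characters.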
git-lfs allows large files to be included in a git repo. The
main difference between vio and git-lfs is that vio lets you
associate multiple datasets (or filesystem snapshots) with a single
version of the git repo, while git-lfs can only associate a single
one. In other words, the relationship between git commits and commits
in the storage backend is one-to-one for git-lfs and one-to-many
for vio.
Given the above, vio can use git-lfs as a storage backend, in the
same way that it uses the git backend.
Other tools, such as git-annex, also fall into this category.
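
The one-to-many relationship described above can be illustrated with a lookup that returns every execution recorded for a given git commit. This is a sketch only: it assumes a hypothetical tab-separated index whose first two columns are commit_id and execution_id, and executions_for_commit is not a vio command.

```shell
# Hypothetical lookup: print the execution IDs recorded for one commit.
# Usage: executions_for_commit INDEX COMMIT_ID
executions_for_commit() {
    # Appending "" forces a string comparison, so an all-digit commit
    # prefix is never compared as a number.
    awk -F '\t' -v commit="$2" '($1 "") == (commit "") { print $2 }' "$1"
}
```

With two snapshots taken at the same commit, the function prints two execution IDs for that commit, which is the case a one-to-one mapping cannot represent.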
TODO
TODO
Some use cases that this tool is aimed at solving: