
Idea: model shovel more closely on git #20

Open
calvingiles opened this issue Jul 14, 2017 · 4 comments

Comments

@calvingiles
Contributor

Git LFS has some nice properties, but doesn't really map well to large datasets used for analysis. A git model of checking in all resources is good for reproducibility, but it is nice to separate the data from the code.

A proposed future direction for shovel is to support `shovel <git command>`, where shovel intercepts some git commands and swaps a handful of behaviours out. These could likely be implemented with git hooks, so it may be possible to install the hooks once and then use git directly.

One benefit of the shovel model over LFS is that it lets you version datasets separately from a git repo and share them across multiple repos. In that sense, the git hooks would need to inspect the state of the filesystem and run shovel's dig and bury steps as part of the hooks.

These thoughts are very undeveloped.

@calvingiles
Contributor Author

It looks like the git attributes smudge and clean filters can be used for managing the files (https://git-scm.com/docs/gitattributes), and a pre-push hook would be suitable for actually uploading.
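As a rough sketch of how that wiring might look (entirely hypothetical — shovel has no `clean`/`smudge` subcommands today, the names are placeholders):

```shell
# Hypothetical setup: mark data files as shovel-managed in .gitattributes,
# e.g. with a line like:
#   data/** filter=shovel
# then register clean/smudge filters that swap file contents for a pointer:
git config filter.shovel.clean  "shovel clean %f"   # file -> MD5 pointer on git add
git config filter.shovel.smudge "shovel smudge %f"  # pointer -> file on checkout (dig)
git config filter.shovel.required true              # fail loudly if the filter is missing
```

This is the same mechanism git LFS uses; the pre-push hook would then handle the actual upload to the pit.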

@sjdenny

sjdenny commented Jul 14, 2017

You would still need a non-git interface, since you may have a project which isn't version-controlled. Usually, this won't be the case, but it'll happen sometimes. E.g. you want to use shovel to fetch some data for a quick analysis. So then you have two interfaces? Perhaps I'm misunderstanding.

@calvingiles
Contributor Author

calvingiles commented Jul 14, 2017

That makes sense. I was imagining shovel would stay the same, but it would be possible to set it up with hooks so the dig and bury commands are called for you. Unlike LFS, which tries to make it look like the files are in the repo, this would make it clear they are in a pit.

So, in addition to what exists already, inside a git repo:

```
cd data
shovel init .  # adds this dir to maybe repo-root/.shovel so the hooks know which directories are under shovel control
git add .      # clean calculates the MD5 of the file and writes the interesting data into a .shovel file, for example
git commit     # if shovel has a local cache (which it may in the future), a pre-commit hook copies the files there with the MD5 as the key
git push       # the pre-push hook ensures the files have been uploaded to the S3 pit
```

Or something. Probably worth getting a lot of inspiration from LFS.
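The pre-commit caching step described above could look something like this (a minimal sketch — the function name and cache layout are assumptions, not existing shovel API):

```python
import hashlib
import shutil
from pathlib import Path


def cache_by_md5(path, cache_dir):
    """Copy a file into a local content-addressed cache, keyed by its MD5.

    Sketch of the hypothetical pre-commit hook behaviour: identical content
    is stored once, and the returned digest is what a .shovel pointer file
    would record.
    """
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / digest
    if not target.exists():  # content-addressed: skip copy if already cached
        shutil.copy2(path, target)
    return digest
```

The pre-push hook would then only need to upload cache entries whose keys are referenced by committed `.shovel` files but missing from the S3 pit.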

@calvingiles
Contributor Author

calvingiles commented Jul 14, 2017

The problem to solve here is that I currently add data/ to .gitignore so git doesn't try to check the files in. It would be preferable to have a good way to check whether my data is in sync - both with the pit, and with the version referenced in the code. So if peek checked the MD5 etc. (as it is intended to), then we are nearly there anyway. If the datasets got their config from a metadata file, rather than being hard-coded in the Python code etc., then shovel could always check for sync against the current code version, and bumping the version would show up in git status.
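That sync check could be sketched like this (assuming a hypothetical JSON metadata file mapping file paths to pinned MD5s — the format and function name are mine, not shovel's):

```python
import hashlib
import json
from pathlib import Path


def check_sync(metadata_path):
    """Compare local files against the versions pinned in a metadata file.

    Assumed (hypothetical) metadata format: {"data/train.csv": "<md5>", ...}.
    Returns the paths that are missing or whose content doesn't match,
    i.e. what would surface in a peek-style status check.
    """
    meta = json.loads(Path(metadata_path).read_text())
    stale = []
    for rel_path, expected_md5 in meta.items():
        p = Path(rel_path)
        if not p.exists() or hashlib.md5(p.read_bytes()).hexdigest() != expected_md5:
            stale.append(rel_path)
    return stale
```

Bumping a dataset version would then be a one-line diff to the metadata file, which git status shows naturally.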
