
Idea: model shovel more closely on git #20

Open
calvingiles opened this issue Jul 14, 2017 · 4 comments

Comments

@calvingiles
Contributor

Git LFS has some nice properties, but doesn't really map well to large datasets used for analysis. A git model of checking in all resources is good for reproducibility, but it is nice to separate the data from the code.

A proposed future direction for shovel is to support `shovel <git command>`, where shovel intercepts some git commands and swaps a handful of behaviours out. These could likely be implemented with git hooks, so it may be possible to install the hooks once and then use git directly.

One benefit of the shovel model over LFS is that it lets you version datasets separately from a git repo and share them across multiple repos. In that sense, the git hooks would need to inspect the state of the filesystem and run shovel's dig and bury steps as part of the hooks.

These thoughts are very undeveloped.

@calvingiles
Contributor Author

It looks like the git attributes smudge and clean filters can be used for managing the files (https://git-scm.com/docs/gitattributes), and a pre-push hook would be suitable for actually uploading.
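As a rough sketch of how that wiring might look (entirely hypothetical — shovel has no `clean`/`smudge` subcommands today, the names are placeholders):

```shell
# Hypothetical setup: mark data files as shovel-managed in .gitattributes,
# e.g. with a line like:
#   data/** filter=shovel
# then register clean/smudge filters that swap file contents for a pointer:
git config filter.shovel.clean  "shovel clean %f"   # file -> MD5 pointer on git add
git config filter.shovel.smudge "shovel smudge %f"  # pointer -> file on checkout (dig)
git config filter.shovel.required true              # fail loudly if the filter is missing
```

This is the same mechanism git LFS uses; the pre-push hook would then handle the actual upload to the pit.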

@sjdenny

sjdenny commented Jul 14, 2017

You would still need a non-git interface, since you may have a project which isn't version-controlled. Usually, this won't be the case, but it'll happen sometimes. E.g. you want to use shovel to fetch some data for a quick analysis. So then you have two interfaces? Perhaps I'm misunderstanding.

@calvingiles
Contributor Author

calvingiles commented Jul 14, 2017

That makes sense. I was imagining shovel would stay the same, but it would be possible to set it up with hooks so the dig and bury commands are called for you. Unlike LFS, which tries to make it look like the files are in the repo, this would make it clear they are in a pit.

So, in addition to what exists already, inside a git repo:

```
cd data
shovel init .  # adds this dir to maybe repo-root/.shovel so the hooks know which directories are under shovel control
git add .      # clean calculates the MD5 of the file and writes the interesting data into a .shovel file, for example
git commit     # if shovel has a local cache (which it may in the future), a pre-commit hook copies the files there with the MD5 as the key
git push       # the pre-push hook ensures the files have been uploaded to the S3 pit
```

Or something. Probably worth getting a lot of inspiration from LFS.
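The pre-commit caching step described above could look something like this (a minimal sketch — the function name and cache layout are assumptions, not existing shovel API):

```python
import hashlib
import shutil
from pathlib import Path


def cache_by_md5(path, cache_dir):
    """Copy a file into a local content-addressed cache, keyed by its MD5.

    Sketch of the hypothetical pre-commit hook behaviour: identical content
    is stored once, and the returned digest is what a .shovel pointer file
    would record.
    """
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / digest
    if not target.exists():  # content-addressed: skip copy if already cached
        shutil.copy2(path, target)
    return digest
```

The pre-push hook would then only need to upload cache entries whose keys are referenced by committed `.shovel` files but missing from the S3 pit.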

@calvingiles
Contributor Author

calvingiles commented Jul 14, 2017

The problem to solve here is that I currently add data/ to .gitignore so git doesn't try to check the files in. It would be preferable to have a good way to check whether my data is in sync - both with the pit, and with the version referenced in the code. So if peek checked the MD5 etc. (as it is intended to), then we are nearly there anyway. If the datasets got their config from a metadata file, rather than being hard-coded in the Python code etc., then shovel could always check for sync against the current code version, and bumping the version would show up in git status.
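That sync check could be sketched like this (assuming a hypothetical JSON metadata file mapping file paths to pinned MD5s — the format and function name are mine, not shovel's):

```python
import hashlib
import json
from pathlib import Path


def check_sync(metadata_path):
    """Compare local files against the versions pinned in a metadata file.

    Assumed (hypothetical) metadata format: {"data/train.csv": "<md5>", ...}.
    Returns the paths that are missing or whose content doesn't match,
    i.e. what would surface in a peek-style status check.
    """
    meta = json.loads(Path(metadata_path).read_text())
    stale = []
    for rel_path, expected_md5 in meta.items():
        p = Path(rel_path)
        if not p.exists() or hashlib.md5(p.read_bytes()).hexdigest() != expected_md5:
            stale.append(rel_path)
    return stale
```

Bumping a dataset version would then be a one-line diff to the metadata file, which git status shows naturally.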
