Moving binaries out of the repo

James Blackburn edited this page Jul 7, 2016 · 13 revisions

The problem

Binaries bloat the repo because git cannot store them compactly: they generally delta-compress poorly, so every time you update a binary, the repo grows by roughly that file's full size, whereas an updated text file costs little more than its diff. The repo gets larger and slower, and on a fresh clone users also have to pull down all those old binary versions (the primary problem according to TAZ).

TODO

General notes

  • Now that github closed down the Downloads button, we'd have to host binaries ourselves (with costs and uptime/maintenance concerns).
  • people would need internet to get OF to run. This basically moves OF away from the DVCS way, into a centralised structure. Also, it's probably not a good idea at workshops. Local cache/downloadable binary-zip-to-go would avoid this problem.
  • the problem with any git-external solution is that all our users would have to use/install it too, but maybe one of those is suitable for us, and easily integrable.
  • Space savings: See section for git-fat below

Approaches under consideration

Miscellaneous

  • core.bigFileThreshold is not relevant; it only affects the packfile process
  • sparse checkout, skip-worktree not relevant.

shallow repos/submodules

http://stackoverflow.com/questions/2144406/git-shallow-submodules. It's kinda hacky, by first shallow-cloning a repo, then submoduling it.

+

  • only method where the binary files could be hosted on Github, all other would need separate hosting (with probably large traffic)
  • only method which would not need an external tool

-

  • if we create a shallow submodule/repo (to save users' bandwidth on pulls) with e.g. git clone --depth 1, users cannot check out a past commit that is not contained in the shallow history: git checkout HEAD~3 fails, as does checking out e.g. 0071 (error: pathspec '0071' did not match any file(s) known to git). This basically breaks shallow submodules for what we want. The workaround would be to fetch more and more depth until the required commit is reached (how much is unknowable beforehand), at which point the space savings are gone, and this will probably be slow. Also, depth seems to be an ambiguous measure, depending on the current branch etc.
  • In fact, if we want people to be able to access the whole history (since outsourcing the binaries), we'd need a full repo for the submodule, which would mean we would save neither space nor bandwidth/pull size in the first place.
  • we are not solving the problem(!), we are shifting it. Instead of having one bloated repo, we have one slim (OF) and one bloated (binaries) repo.
  • does not accommodate binaries distributed within the repo (except with one submodule per binary). This will generate some interesting include path problems, I'm sure.
  • (there's no real pathway to transition away from this - if we start it we have to stick with it, otherwise the OF repo for the affected period becomes unusable. externally hosted files could be redirected etc.)

~

  • deepen a shallow repository with git fetch --depth=n
  • An analysis was done proving that "normal" submodules show no space or traffic savings
  • it's currently unclear how this affects PRs and merging. those are normally impossible when using shallow repos, but this still has to be confirmed, hopefully this can be contained/limited to the binaries repo.
  • keeping OF in sync with binaries probably only involves a git submodule update call or similar after a git checkout ( same for the other solutions, but mentioning it because that came up in the transcript)
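
The "fetch more and more depth" workaround from the list above could be sketched roughly like this. It is only an illustration: the function name deepen_until, the step size and the round limit are made up for this example, and it assumes a remote called origin and a git recent enough to support git fetch --deepen.

```shell
#!/usr/bin/env bash
# Sketch: deepen a shallow clone in steps until a wanted commit is available.
# deepen_until, step and max_rounds are hypothetical names for this example.
deepen_until() {
    local want="$1" step="${2:-50}" max_rounds="${3:-100}" round=0
    # cat-file -e succeeds only once the commit object exists locally
    while ! git cat-file -e "${want}^{commit}" 2>/dev/null; do
        # No "shallow" file means the history is already complete,
        # so the commit simply does not exist in this repo.
        if [ ! -f "$(git rev-parse --git-dir)/shallow" ]; then
            echo "history is complete, but $want was not found" >&2
            return 1
        fi
        if [ "$round" -ge "$max_rounds" ]; then
            echo "gave up after $round rounds" >&2
            return 1
        fi
        # Extend the shallow boundary by $step more commits
        git fetch --quiet --deepen="$step" origin || return 1
        round=$((round + 1))
    done
    echo "found $want after $round extra fetch(es)"
}
```

Called as deepen_until HEAD~3 inside a shallow clone, this keeps fetching until HEAD~3 resolves. It also makes the drawback visible: nobody knows beforehand how many rounds (and how much download) a given commit will cost.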

space savings

  • not very much according to this article (1/9th of size with shallow clone): [http://blogs.gnome.org/simos/2009/04/18/git-clones-vs-shallow-git-clones/]
  • OF shallow clone with depth 1: 913MB, .git: 179MB, (du -hs)
  • OF full clone: 1.3GB, .git: 597MB, addons 324MB, ofxOpenCV 265MB

git-annex

Use git-annex (http://git-annex.branchable.com/) to basically store metadata of the binaries, and host the binaries somewhere else.

+

  • actively developed
  • backends for different storage mechanisms (S3, directory, web, rsync, webdav,..)
  • runs on OSX, linux flavors
  • seems to be pretty versatile, large number of options, interface mimics git

-

  • no Windows support so far, but it is expected in about 6 months: [http://git-annex.branchable.com/design/assistant/]
  • can't be used on Github, so we'd need a full repo hosted elsewhere. Could use gitolite: http://git-annex.branchable.com/tips/using_gitolite_with_git-annex/
  • access controls/auth for mother repo unclear? I think non-existent

~

git-media

https://github.com/schacon/git-media Pretty similar to git-annex

+

  • fully cross-platform as far as i can tell

-

  • Authentication/read-only: I don't see how.
  • Apparently abandoned/unmaintained
  • seems like less thought went into it - less encompassing interface, much simpler use-case than git-annex apparently
  • much less granular controls

~

  • not sure if suitable for our purposes

artifact repositories

Use an artifact repository (e.g. Artifactory, Nexus, Archiva, Maven,...) to store/track the binaries, fetch on demand or as part of the build process, not as part of the version control process (although that may be an option, too).

+

  • this seems to be one of the more "proper" ways, artifact repos are made for storing binaries. seems most robust.
  • centrally host binaries somewhere, people fetch from there on demand
  • could also be used for hosting OF releases

-

  • another service we'd have to use and maybe pay for. probably can find one we can put on our own server(s).

~

  • maybe too "enterprise-y"
  • maybe zsync (see below) could be useful here

Maven

  • XML file for metadata
  • IDE plugins probably available
  • Integrates into build process? make?

write ourselves, go more bare-metal with git

Write some scripts/stuff for git to work with and locally cache externally hosted binaries. Leverage git magic (clean/smudge filters, binary gitattributes,etc) to accomplish something like git-annex and git-media do (this was kinda where those two came from)
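
To make the clean/smudge idea a bit more concrete, here is a minimal sketch of such a filter pair. Everything named here is hypothetical (the filter name ofbin, the script name of-binfilter, the pointer format, the cache location), and the part that would actually download missing binaries from external hosting is left out. It only shows the mechanism: clean swaps the file content for a small pointer on its way into git, smudge swaps it back on checkout.

```shell
#!/usr/bin/env bash
# Sketch of a clean/smudge filter pair; all names here are hypothetical.
# Assumed wiring in the repo:
#   .gitattributes:  *.a filter=ofbin
#   git config filter.ofbin.clean  "of-binfilter clean"
#   git config filter.ofbin.smudge "of-binfilter smudge"
OFBIN_CACHE="${OFBIN_CACHE:-$HOME/.cache/ofbin}"
OFBIN_MAGIC="ofbin-pointer"

# Print the SHA-256 of a file (sha256sum on Linux, shasum on OSX).
ofbin_hash() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d' ' -f1
    else
        shasum -a 256 "$1" | cut -d' ' -f1
    fi
}

# clean: file content on stdin -> small pointer on stdout.
# The real content is stored in the local cache under its hash.
ofbin_clean() {
    local tmp sum
    mkdir -p "$OFBIN_CACHE" || return 1
    tmp="$(mktemp "$OFBIN_CACHE/tmp.XXXXXX")" || return 1
    cat > "$tmp"
    sum="$(ofbin_hash "$tmp")"
    mv -f "$tmp" "$OFBIN_CACHE/$sum"
    printf '%s %s\n' "$OFBIN_MAGIC" "$sum"
}

# smudge: pointer on stdin -> real content on stdout (from the cache).
ofbin_smudge() {
    local tmp magic="" sum=""
    tmp="$(mktemp)" || return 1
    cat > "$tmp"
    # Only look at the first bytes; a pointer is a single short line.
    read -r magic sum < <(head -c 200 "$tmp") || true
    case "$sum" in
        ''|*[!0-9a-f]*) magic="" ;;   # not a plain hex hash: not a pointer
    esac
    if [ "$magic" = "$OFBIN_MAGIC" ] && [ -f "$OFBIN_CACHE/$sum" ]; then
        cat "$OFBIN_CACHE/$sum"
    else
        # Not a pointer, or content not cached: pass through unchanged.
        # A real tool would fetch the binary from external hosting here.
        cat "$tmp"
    fi
    rm -f "$tmp"
}

case "${1:-}" in
    clean)  ofbin_clean ;;
    smudge) ofbin_smudge ;;
esac
```

With this in place, git itself would only ever store the one-line pointers. The maintenance concern stays the same though: the download, upload and cache-cleanup parts are exactly the bits we would have to write and maintain ourselves.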

zsync could be a useful tool - like rsync over http, focused on distributing one file to many users, no server-side process necessary. Maybe if we go artifact-repo, then host binaries in ofSite, pull with zsync? sources are ~250KB

+

  • could maybe be kept as a script-only solution using only git, so no separate tool needed

-

  • development and maintenance effort, i'd rather use something which already exists than writing our own tool.

some other, yet unidentified solution.

Apparently there has been talk on the git mailing list to build some mechanism for handling the present problem: http://git.661346.n2.nabble.com/Fwd-Git-and-Large-Binaries-A-Proposed-Solution-td5948908.html, https://github.com/peff/git/wiki/SoC-2012-Ideas

I'm currently trying to reach the git devs about this. (Update: Never got a response)

+

  • This would be the best solution
  • no new tool would be needed
  • we could host on Github
  • it would be clean because it's within git.

-

  • Apparently it's a very difficult problem to do this cleanly. Some proposed GSoC projects never got any takers
  • AFAIK, no significant work on this has been done, so any feature would take a long time until it trickles down to everyone (i.e. Github, OF-devs, OF-users)

Binary files we should ignore

*.a, *.lib, *.dll, *.so*
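
As a small sketch, the patterns above could be added to .gitignore like this (the function name add_binary_ignores is made up for this example; it skips patterns that are already present):

```shell
#!/usr/bin/env bash
# Sketch: add the library patterns listed above to a .gitignore file.
# add_binary_ignores is a hypothetical helper name.
add_binary_ignores() {
    local ignore_file="${1:-.gitignore}" pattern
    touch "$ignore_file"
    # Make sure the file ends with a newline before appending to it
    if [ -n "$(tail -c 1 "$ignore_file")" ]; then
        printf '\n' >> "$ignore_file"
    fi
    for pattern in '*.a' '*.lib' '*.dll' '*.so*'; do
        # -x: match the whole line, -F: treat the pattern as a fixed string
        grep -qxF -- "$pattern" "$ignore_file" || printf '%s\n' "$pattern" >> "$ignore_file"
    done
}
```

Note that .gitignore only affects untracked files. Libraries which are already tracked stay tracked until they are removed from the index (e.g. with git rm --cached), and they remain in the history either way.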


Possible approaches to cleaning binaries out of git

I) Do we strip out the binaries from the history, or just remove them from HEAD and stop using them?

II) The former has some really ugly implications about changing public history, upstream repos breakage for users, github breakage, fresh starts etc., changed name for the OF repo. a whole bag of headaches.

III) how much space will stripping out the binaries really save?

IV) stripping out the binaries will conserve the repo's commit history and reduce the size of the repo (hopefully significantly), but it will change the SHAs and make the historic points unusable (because the binaries are stripped out). Maybe (big maybe) some really advanced git magic can convert the historical binaries to pointers to them in the external solution a)-f) to avoid that. git-annex can do that, actually.

V) simply no longer committing binaries will stop the growth of the problem, but will not reduce the current size of the repo. As per II-IV, it may still be the better option.

Strip out with git-annex

Apparently git-annex can do this: [http://git-annex.branchable.com/tips/How_to_retroactively_annex_a_file_already_in_a_git_repo/]:

    git filter-branch --tree-filter 'for FILE in file1 file2 file3;do if [ -f "$FILE" ] && [ ! -L "$FILE" ];then git rm --cached "$FILE";git annex add "$FILE";ln -sf `readlink "$FILE"|sed -e "s:^../../::"` "$FILE";fi;done' --tag-name-filter cat -- --all

Not as easy as it sounds, though. First you have to enumerate all the commits in the repo with git rev-list (~5400). Then use git show to find all the libraries (.dll, .a, .lib) in those commits. That's about 655 unique binary library locations. All that takes a couple of minutes. I (bilderbuchi) have a script for this.
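
The script itself is not reproduced on this page, so the following is only a guess at what such an enumeration could look like, not the script that was actually used. It uses git ls-tree instead of git show to list the files of each commit, and the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: collect every unique library path that ever existed in the repo.
# filter_library_paths and list_library_paths are hypothetical names.

# Keep only paths ending in one of the library extensions named above.
filter_library_paths() {
    grep -E '\.(dll|a|lib)$' | sort -u
}

# Walk all commits reachable from any ref and list their files.
# Run this from inside the repository.
list_library_paths() {
    git rev-list --all | while read -r commit; do
        git ls-tree -r --name-only "$commit"
    done | filter_library_paths
}
```

Running list_library_paths > library_paths.txt would produce a list that can take the place of file1 file2 file3 in the filter-branch command.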

Then you feed those library paths into the above command, and sit back. Basically it rewrites all the history, touching every commit, and replacing libraries with symlinks to libraries in a git-annex folder. On my dual-core, spinning-disk laptop, after about 1hr runtime I was through 435/5400 revisions, with about 400MB in that storage folder.

Then it tails off, presumably because the number of libraries becomes larger. After around 40hrs runtime I had only reached 2000/5600 commits, and gave up. Most of the time is apparently spent rewriting the git history for every commit, less on the annex part.

Probably a more powerful (single-core performance) computer with an SSD could make a substantial difference. Also, not all is clear yet, I currently have more objects in my storage folder than I expected.

Strip out with git-fat

A quick preliminary analysis with git-fat shows the following:

  • Base OF (freshly cloned) size (using du -hs): 1.4G
  • Tearing out all library files above 1M in size, over the whole repo history (process described in the Readme!) results in
    • slimmed_OF: 329M (without stripped library files)
    • git-fat-storage: 2.1G (which contains all the libraries)
    • slimmed-OF_pulled: 800MB without .git/fat which contains the binary objects (after pulling libraries of current commit)

This was all a pretty quick job, so treat the results with caution.
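
Independently of git-fat, a rough way to see which files above 1M exist anywhere in the history (and so to sanity-check numbers like the ones above) could look like the following sketch. The function names are made up for this example, and this is not the procedure from the git-fat Readme.

```shell
#!/usr/bin/env bash
# Sketch: list all blobs in the history above a size threshold, largest first.
# filter_big_blobs and list_big_blobs are hypothetical names.

# Input lines: "<type> <sha> <size-in-bytes> <path>"
# Keeps blobs of at least $1 bytes (default 1M), sorted by size, descending.
filter_big_blobs() {
    local min_bytes="${1:-1048576}"
    awk -v min="$min_bytes" '$1 == "blob" && $3 >= min' | sort -k3,3nr
}

# Enumerate every object reachable from any ref, together with its size.
# Run this from inside the repository.
list_big_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        filter_big_blobs "${1:-1048576}"
}
```

The sizes reported are uncompressed object sizes, so their sum is an upper bound rather than the exact amount of space a strip-out would save.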


Comments

Git LFS

Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.

Please put anything which does not fit into an edit above (e.g. discussion) in here...