Git Large File Storage #437

Closed
ghost opened this Issue Nov 17, 2011 · 86 comments

@ghost
ghost commented Nov 17, 2011

Hi,

I've been testing SparkleShare on Windows, which has been working fine until now, with the git repo on an Ubuntu server. My git repo including history is over 1 GB in size, but the actual current files are only about 100 MB (I deleted many files along the way). Is there a way to purge historical versions of deleted files and propagate this across all clients and the server?

Also, a quota system feature might be interesting to implement, i.e. if the total size including history goes above a specified threshold of n MB then SparkleShare should purge the oldest history until it goes below n.

Cheers

Spec

  1. Rebase history
  2. Quota for automatic rebase
  3. Force clients to move to new repo
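
A rough sketch of the quota check from item 2, assuming the size is measured with `du` on the `.git` directory. `repo_size_kib` and `over_quota` are hypothetical helper names, the 100 MB quota is an assumed value, and the purge step itself is left out.

```shell
#!/bin/sh
# Hypothetical helpers; neither exists in Git or SparkleShare.

# repo_size_kib DIR: print the disk usage of DIR/.git in KiB.
repo_size_kib() {
  du -sk "$1/.git" | awk '{ print $1 }'
}

# over_quota SIZE_KIB QUOTA_MIB: succeed when SIZE_KIB exceeds the quota.
over_quota() {
  [ "$1" -gt $(( $2 * 1024 )) ]
}

# Example: roughly 1 GB of history checked against an assumed 100 MB quota.
if over_quota 1048576 100; then
  echo "over quota: history should be purged"
fi
```

On a real folder the size would come from something like `over_quota "$(repo_size_kib /path/to/project)" 100`, where the path is a placeholder.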
@fire
fire commented Nov 17, 2011
@fire fire closed this Nov 17, 2011
@ghost
ghost commented Nov 17, 2011

... and I did search before posting :(

@hbons hbons reopened this Nov 17, 2011
@fire
fire commented Nov 17, 2011

@hbons Annex stuff starts getting complicated. I need chained list of dependencies.

@fire
fire commented Nov 17, 2011

@hbons as of now, you cannot purge history that is already committed into git. Therefore, it duplicates #385.

Unless you rebase.

@hbons
Owner
hbons commented Nov 17, 2011

@fire well, this is a feature request against current SparkleShare. The annex issue is a brainstorm for something we may use. We still need this issue if we don't use annex.

@fire
fire commented Nov 17, 2011

I will change the issue to mean rebase existing repos to remove old history then.

@hbons
Owner
hbons commented Mar 25, 2012

@gionn

"Hi,

IMHO disk space usage for the git repository could become a problem. There is a procedure to filter out and remove objects from git history: http://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/ (and someone packed it as a few nice scripts: https://github.com/cmaitchison/git_diet).

Wouldn't it be nice to implement even a simple "Permanently delete files removed more than X days ago" function?

Thanks!"

@vanto
vanto commented Mar 26, 2012

One more link: http://viget.com/extend/backup-your-database-in-git#comment-400537128. Not sure if it works, though.

@fab1an
fab1an commented Apr 10, 2012

This is going to be a huge problem. Git basically looks at the entire history for most of its operations, and doing a repack on a repo with several gigabytes can easily hog all of your CPU and memory. I have crashed servers doing git repack or git gc.

Why is the history even stored on the client? Why not just download the latest version?

I tried SparkleShare yesterday. I have a 5 GB repo full of ~100-200 MB files. SparkleShare on Mac hung regularly, and after the initial upload the repo had grown to over 10 GB even though the files alone are just 5 GB. This is unacceptable.

I think using git on the client-side is a design-flaw. It would make more sense to use rsync or something to copy the repo and use git on the server to store the history, kind of like DropBox does it with svn (as far as i know).

There are options for git to just clone a certain depth, maybe you can look into that. But i'm pretty sure you'll continue to have huge performance problems if you clone everything.

@hbons
Owner
hbons commented Apr 10, 2012

@fab1an the next version will do a --depth=1 clone.

I don't agree that having all the history is a bad thing in all cases; it depends on your use case.
Sure, I could have started creating something completely custom, but then we wouldn't have anything usable now. Git does everything I need it to, but it has some downsides.

You are free to contribute to the custom backend Rainbows though.

@rnorris7756

Another use case that I've come across where this is an issue is synchronizing encrypted file containers. I keep a repository with a small (~20MB) Truecrypt volume in it, and after some changes, the master repository on my server takes up over 100MB. This is after only a few updates to the file container.

For a use case like this, it's not only inefficient (space-wise), but also insecure, since it keeps multiple versions of an encrypted file, making it easier (although, still very hard) to break that encryption.

@hbons
Owner
hbons commented Apr 21, 2012

@rnorris7756 i have a branch that does encryption per file, so they are stored encrypted on the server. I'm not sure if that's your need, but it should be more efficient.

@hbons
Owner
hbons commented Apr 22, 2012

http://code.google.com/p/boar/ is interesting. It seems to have everything needed to work well with SparkleShare's backend system, only it doesn't seem to (yet) support SSH.

@Velrok
Velrok commented Apr 24, 2012

I haven't looked into it but maybe using bup might solve some problems? https://github.com/apenwarr/bup

@rnorris7756

Encryption on the server really isn't sufficient in my case. I'm encrypting the files locally to protect the information in there in the case that my computer gets stolen. I don't consider it to be a critical issue, since the file container is in its own repository that can be re-initialized easily. I just wanted to present a real case where history compaction (or no history) would be a convenient option.

@arthurlutz

+1 on this issue

@bomma
bomma commented Aug 17, 2012

I'm using the following script to remove history older than 30 days (which works fine for me, just set the timespec variable).
You could run this every week or so, or provide a button/menu item to run it manually.

the script:
http://pastebin.com/PXYFJSGL

The code is originally from gibak (https://github.com/pangloss/gibak), I fixed some small issues and removed what I didn't need.

@hbons
Owner
hbons commented Aug 17, 2012

@bomma interesting. Do you run this on just the client or both the client and server?

@bomma
bomma commented Aug 17, 2012

@hbons only on the clients

@hbons
Owner
hbons commented Aug 17, 2012

@bomma can you give some more details about how this works? It seems that it just removes objects from the database that aren't referenced in commits older than a particular date and doesn't remove the actual commits? Is it possible that it would create a conflict?

@hbons
Owner
hbons commented Aug 17, 2012

@bomma I just tried this script. It does seem to remove earlier commits and their objects, but it doesn't remove them from the pack file, so they still take up space:

$ git count-objects -v
count: 0
size: 0
in-pack: 82
packs: 1
size-pack: 1331
prune-packable: 0
garbage: 0
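
A small sketch for tracking that number before and after a cleanup run. `pack_size_kib` is a hypothetical helper name; `git count-objects -v` reports `size-pack` in KiB, and the report above is reproduced so the sketch runs without a repository.

```shell
#!/bin/sh
# pack_size_kib: read a `git count-objects -v` report on stdin and print
# the value of its "size-pack" line (disk space used by packs, in KiB).
# Hypothetical helper, not part of Git or SparkleShare.
pack_size_kib() {
  awk -F': ' '$1 == "size-pack" { print $2 }'
}

# The report quoted above.
sample_report='count: 0
size: 0
in-pack: 82
packs: 1
size-pack: 1331
prune-packable: 0
garbage: 0'

printf '%s\n' "$sample_report" | pack_size_kib
```

Inside a repository the same helper can be fed directly: `git count-objects -v | pack_size_kib`.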
@bomma
bomma commented Aug 19, 2012

@hbons The script removes history beyond a time limit and then tries to garbage collect unreachable objects. Git is very conservative in that respect though. In my experience, git gc doesn't (always) clean up recent unreachable files (no matter what you specify as time spec). I'm not a git-wizard so maybe somebody else could explain what exactly is going on.

Normally no conflicts should occur since we have no tags, branches and a linear history.

On the history cleaning:
The script originates from git-home-history (http://jean-francois.richard.name/ghh). I seem to get better results with this sequence of commands after filter-branch:

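git reset --soft # with no commit given this targets HEAD, so it is effectively a no-op
rm -rf .git/refs/original/ # drop the backup refs that filter-branch leaves behind
git reflog expire --all --expire=now --expire-unreachable=0 # expire all reflog entries so old commits become unreachable
git repack -ad # immediately repack into a single pack and delete the old packs
git gc --prune=now # prune unreachable loose objects regardless of their age
git push origin +master:master # force pushing the history changes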

I run this script if I know I removed a lot of files or big files, not as a cron job. I think someone with more knowledge of git should take a look at it first, but I seem to get decent results (well, better than nothing anyways).

Full script: http://pastebin.com/A6h2UdcC

This was referenced Oct 16, 2012
@Thadone
Thadone commented Oct 28, 2012

Guys, I just want to add that most SparkleShare users will use this program to replace Dropbox. So it is not an option to say that it's not designed to handle big files and so on. It will remain an issue until you find a solution. For example, I am using SS to sync between two Macs and back up my web development projects (5 GB), but my secret wish is to use it the same way for my graphics projects (100 GB). I am ready to pay for the possibility of hosting my own Dropbox (SparkleShare) on my own server with unlimited disk space. So good luck with that! :)

@hbons
Owner
hbons commented Oct 29, 2012

@Thadone it's already on the roadmap for 2.0.

@nottings

The git-bin idea is interesting, but I'd like to mention that the key benefit for me in using sparkleshare is the ability to keep everything in-house (i.e. not in the cloud). My understanding of git-bin is that large files are stored in the cloud on amazon... is this configurable such that these files can be stored on a local server instead? If not, then this instantly becomes unusable for my use-case and I'd imagine many others as well.

@hbons
Owner
hbons commented Dec 12, 2012

@nottings the plan is to write an SSH plugin, so the files are stored in the same location as the git repo.

@hbons
Owner
hbons commented Dec 14, 2012

i've adapted git-bin to use SSH and it nicely handles large binary files now. this completes step one. check it out at https://github.com/hbons/git-bin/tree/ssh

This was referenced Dec 16, 2012
@knocte
Contributor
knocte commented Jan 8, 2013

@hbons: quick question, by using an SSH backend for git-bin, does this mean that the use of this feature will be restricted to servers that have SSH access? What I mean is, in the case of the server being e.g. GitHub, it wouldn't work, right? (Because GitHub only exposes the git protocol via SSH, not SSH itself, right?)

@hbons
Owner
hbons commented Jan 8, 2013

@knocte you are right. you'll still be able to use Github, git-bin will only be enabled on "Own servers"

@knocte
Contributor
knocte commented Jan 9, 2013

I see.

BTW can we take a step back for a moment and think again why git-bin is needed?

I did some reading and it seems vanilla git is already good enough at managing binary files (both when diffing and when storing) provided git-gc is used. Some links:

http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git#comment-8212510

http://stackoverflow.com/questions/3601672/how-does-git-deal-with-binary-files#answer-3601728

@hbons
Owner
hbons commented Jan 9, 2013

@knocte with git-bin there doesn't have to be a local history of all the files. which means massive space savings. if chunks are needed locally they're fetched from the remote.

@erenoglu
erenoglu commented Jan 9, 2013

@hbons: is this ready to be tested or do we wait for a beta release of 2.0? Is there any guide how to test it?

@hbons
Owner
hbons commented Jan 9, 2013

@erenoglu the core is ready, i just need to hook it up to SparkleShare. you can try it manually without SparkleShare here: https://github.com/hbons/git-bin/

@knocte
Contributor
knocte commented Jan 10, 2013

@hbons, yes, but then the policy to not store previous versions of the files is different depending if the file is binary or not. Shouldn't there rather be a setting that is global to every kind of file (i.e. [ ] prune history older than __ months ).

@hbons
Owner
hbons commented Jan 10, 2013

@knocte this will be done for all the files. all the files will have their history on the remote regardless of type or size.

@knocte
Contributor
knocte commented Jan 10, 2013

@hbons ok, but then this would only make sense when the client is configured to not do --depth=1 already, right?

@hbons
Owner
hbons commented Jan 10, 2013

@knocte git-bin only stores the file metadata in git. since this will be small and compressable, we can get away with always doing a full clone.

for the presets like Github we'll still use clone --depth=1 by default.

@knocte
Contributor
knocte commented Jan 10, 2013

Does this mean that one could not be able to use Github without --depth=1 ? I still think it's a valid use case.

@hbons
Owner
hbons commented Jan 10, 2013

@knocte the checkbox in the dialog will remain in the dialog.

edit: oops, that was a bit of lolspeak :P.

@knocte
Contributor
knocte commented Jan 10, 2013

Ah mkay, glad that "by default" doesn't mean remove the checkbox, but just "enable the checkbox" by default :)

@knocte
Contributor
knocte commented Jan 10, 2013

BTW: re "regardless of type" -> then you need to rename the summary of this github issue to not contain the word "binary" ;)

@lkraav
lkraav commented May 26, 2013

Just saw 1.1 released. How are things going with this issue?

@hbons
Owner
hbons commented May 27, 2013

@lkraav it's pretty much done now, just needs some thorough testing and integration into the build system.

@tbrandstetter

Is it possible for us to build it manually? I'm eager to test the git-bin version.

@hbons
Owner
hbons commented Jun 5, 2013

@tbrandstetter on Linux it should work by toggling the use_git_bin boolean in the fetcher code. you also need git-bin.exe in your path (compiled from my fork). also make sure that SFTP is enabled on the remote SSH account.

careful though, it hasn't had much testing at all yet.

@lee-elenbaas

in many cases the binary files are autogenerated and are very similar to their previous version - for example in the case of keeping compiled .dll files.
in such cases i see a strong use case for storing the large files in chunks but under a folder in the original repo - or on a separate repo rather than use a new service for that.
this way if the only thing that changes is the .dll create time - only the package that contains that change gets uploaded, and stored.

are there any thoughts about adding git as a backend for git-bin?

@hbons
Owner
hbons commented Jul 1, 2013

@lee-elenbaas this is exactly what git-bin does, it chunks up large files and puts them in a separate folder in the existing git repo.

@lee-elenbaas

from reading it i got the impression that the chunks are placed in another service - and not inside the git repo

@hbons
Owner
hbons commented Jul 6, 2013

@lee-elenbaas it saves the chunks in the git folder on the server, but using SSH directly, instead of using the git-push command.

@lee-elenbaas

in that case it does not handle chunk history - and does not treat the chunks as objects inside the repository - so it is NOT the option i was referring to.
I want those chunks to be treated as objects inside the repo or another repo

@hbons
Owner
hbons commented Jul 7, 2013

@lee-elenbaas it does that. either i'm misunderstanding you or you should read its source.

@lkraav
lkraav commented Aug 12, 2013

Hola @hbons, based on the above, is use_git_bin still supposed to be Linux only? Is something significantly broken on other platforms, is there a point trying to set up test environments there?

@hbons
Owner
hbons commented Aug 12, 2013

@lkraav it's easiest to set up on Linux. i haven't tested on Mac or Windows yet, it will require bundling of git-bin.exe in some way.

@lkraav
lkraav commented Aug 12, 2013

OK. Is the implementation ready for a config.xml enable parameter, perhaps? I think you'd immediately receive a larger testing audience. Building the thing is a big barrier.

@hbons
Owner
hbons commented Aug 13, 2013

@lkraav i don't think it's ready yet. i haven't tested it myself completely yet.

@luxflux
luxflux commented Jun 11, 2014

Any news here? Would be cool to be able to handle big files with git-bin.

@creshal
creshal commented Jun 12, 2014

How does git-bin work when offline? We're looking into using SparkleShare to sync data between users who work offline regularly. Not having a local history prior to the last sync/x days/y revisions is not a problem, but the current revision of each file (plus all revisions since the last sync) must be kept locally in this case, or users suddenly run into "corrupted" files.

@hbons
Owner
hbons commented Jun 12, 2014

@creshal yes, that's the whole point of SparkleShare: to have local copies at all times.

@duelafn
duelafn commented Apr 10, 2015

Github now has a similar project that is using clean and smudge: https://git-lfs.github.com/. It may be worth considering supporting that (instead or additionally - I'm not sure what the disappearance of Mighty-M's account means for the future of git-bin). Also, if the empty file warning on your git-bin page is still a problem, there may be a solution in the git-lfs code.

@hbons
Owner
hbons commented Apr 10, 2015

@duelafn this definitely looks like the way to go. the clean and smudge filters are now used to encrypt/decrypt files when you have encryption enabled, so there might be a conflict in functionality here.

@rejon
rejon commented Feb 3, 2016

Hi, any progress on this? We have a project called http://newpalmyra.org where we are using git lfs, and need something simple and user friendly like sparkleshare so our artists don't have a technical heart attack.

Make sense?

How to get this feature done? Have manpower and/or cash.

@hbons
Owner
hbons commented Feb 3, 2016

@rejon GitLFS is something i want to look into. Can you email me?

@rejon
rejon commented Feb 3, 2016

Oops, don't have your email, I'm jon@fabricatorz.com

@mvgijssel

I'm also interested in the SparkleShare git-lfs integration. I'm unable to get it working. I've done the following:

  • set up a simple Github project, linked the project with SparkleShare
  • ran git lfs install in the project repository cloned by SparkleShare
  • ran git lfs track "*.txt"
  • added test-sparkleshare.txt
  • waited until the file was synchronised with Github

Opening the file on Github, https://github.com/mvgijssel/test-lfs/blob/master/test-sparkleshare.txt, only shows the file pointer but not the actual file. When adding another file manually after closing SparkleShare (git add / git commit / git push) it'll actually do the LFS syncing properly of the new and old file.

$ git push origin master
Git LFS: (2 of 2 files) 37 B / 37 B
Counting objects: 3, done.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 455 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To ssh://git@github.com/mvgijssel/test-lfs
c453f60..75854c8 master -> master

Is it possible that the pre-push isn't called by SparkleShare?
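
Two checks that can help narrow this down by hand, sketched under the assumption that `git lfs install` was run inside the clone (it installs a `pre-push` hook there) and that un-smudged files begin with the Git LFS pointer header. `is_lfs_pointer` and `has_pre_push_hook` are hypothetical helper names, and the `oid` and `size` values in the demo are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a clone by hand.

# is_lfs_pointer FILE: succeed when FILE starts with the Git LFS pointer header.
is_lfs_pointer() {
  [ "$(head -n 1 "$1")" = "version https://git-lfs.github.com/spec/v1" ]
}

# has_pre_push_hook REPO: succeed when REPO has an executable pre-push hook.
has_pre_push_hook() {
  [ -x "$1/.git/hooks/pre-push" ]
}

# Demo on throwaway files; the oid (64 zeros) is a placeholder, not a real digest.
tmp=$(mktemp -d)
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%064d\nsize 12\n' 0 > "$tmp/pointer.txt"
printf 'plain text\n' > "$tmp/plain.txt"

is_lfs_pointer "$tmp/pointer.txt" && echo "pointer.txt: LFS pointer"
is_lfs_pointer "$tmp/plain.txt" || echo "plain.txt: regular content"
```

If a tracked file on the remote is a pointer but the matching object was never uploaded, a hook that exists but is never run by the Git binary doing the push is one plausible explanation.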

@hbons
Owner
hbons commented Mar 14, 2016

@mvgijssel what version of SparkleShare are you running and on what OS? The pre-push hook was introduced in Git 1.8.2, so if it's running the bundled Git it may not be new enough?

@mvgijssel

From git-bash

$ git --version
git version 2.7.2.windows.1

SparkleShare version 1.5.0

@BarryThePenguin
Collaborator

I might be wrong, but I didn't think SparkleShare uses system git on windows?

@mvgijssel

If that's the case is it configurable? Like export GIT_BIN=...?

@BarryThePenguin
Collaborator

Should be safe enough to just replace the git files in %programfiles%\SparkleShare with a newer version of git

@mvgijssel

Yeah, so checking the version used by SparkleShare

C:\Program Files\SparkleShare\msysgit\bin>git.exe --version
git version 1.8.0.msysgit.0

I'll try to symlink the new Git in the SparkleShare directory.
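
The comparison against the 1.8.2 requirement mentioned earlier can be sketched like this. `git_version_number` and `version_at_least` are hypothetical helper names, and the sketch assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helpers; `sort -V` is assumed to be available.

# git_version_number: reduce "git version 1.8.0.msysgit.0" on stdin to "1.8.0".
git_version_number() {
  sed -E 's/^git version ([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# version_at_least HAVE WANT: succeed when dotted version HAVE >= WANT.
# The smaller of the two sorts first, so WANT must come out on top.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# The two version strings reported in this thread.
for reported in "git version 1.8.0.msysgit.0" "git version 2.7.2.windows.1"; do
  v=$(printf '%s\n' "$reported" | git_version_number)
  if version_at_least "$v" 1.8.2; then
    echo "$v: pre-push hook supported"
  else
    echo "$v: too old for the pre-push hook"
  fi
done
```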

@mvgijssel

Replacing the msysgit with a new git version (2.7.2.windows.1) resulted in the immediate crash of SparkleShare. Replacing the git executable in the msysgit/bin directory resulted in nothing being uploaded. I'll try to see if I have any success with Git 1.8.2.

@BarryThePenguin
Collaborator

Sure, that sucks... I'm pretty keen to get a newer version of Git into the Windows release. There was this recently: http://www.theregister.co.uk/2016/03/16/git_server_client_patch_now/

@rejon
rejon commented Mar 22, 2016

Hi, I'm willing to pay someone to fix this. Email me: jon@fabricatorz.com

@hbons hbons changed the title from Handle large files better with git-bin to Git Large File Storage May 28, 2016
@hbons
Owner
hbons commented May 28, 2016

There's been good progress on this feature now, but for git-lfs to work properly with SparkleShare we need proper support for passing SSH options to git-lfs: git-lfs/git-lfs#1142

I don't know Go, so if anyone wants to speed up getting large file support into SparkleShare, getting that git-lfs issue fixed would be a big help. :)
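
For anyone unfamiliar with the problem, here is a minimal sketch of how SSH options are commonly handed to Git itself. It assumes the `GIT_SSH_COMMAND` environment variable (available since Git 2.3); the thread does not say which mechanism SparkleShare uses, and `ssh_command_for` and the key path are hypothetical.

```shell
#!/bin/sh
# ssh_command_for KEYFILE: print an ssh command line pinned to one key.
# Hypothetical helper; assumes KEYFILE contains no spaces.
ssh_command_for() {
  printf 'ssh -i %s -o IdentitiesOnly=yes\n' "$1"
}

# Git reads GIT_SSH_COMMAND when it needs to open an SSH connection.
# The key path below is a placeholder.
GIT_SSH_COMMAND=$(ssh_command_for /home/user/.ssh/example_key)
export GIT_SSH_COMMAND
echo "$GIT_SSH_COMMAND"
```

Any helper that opens its own SSH connection has to honour the same setting, otherwise it connects with different credentials than the push it accompanies.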

@rejon
rejon commented May 28, 2016

That's good progress.

How much time would this take? We could potentially sponsor this work if someone is interested, ok?

Jon

@hbons
Owner
hbons commented May 28, 2016

@rejon I don't think it's very much work, it just needs some understanding of Go, Git, and SSH. When that issue is fixed and released, SparkleShare can properly call git-lfs with its authentication needs. After that I can make a preview release.

@pdf
Contributor
pdf commented May 29, 2016
@hbons
Owner
hbons commented May 29, 2016

@pdf awesome! I'll try to compile this and feed it to SparkleShare. :)

@hbons
Owner
hbons commented May 29, 2016 edited

IT'S WORKING 😈 evil laugh https://github.com/hbons/guadec-designs/blob/master/SparkleShare.txt

Thanks to @pdf's patch to git-lfs. Let's hope it gets merged and released soon.
@rejon I'll clean up my branch and push it so you can do testing.

@rejon
rejon commented May 29, 2016

Excellent news!!!

@pdf
Contributor
pdf commented Jun 3, 2016

Merged upstream.

@hbons
Owner
hbons commented Jul 8, 2016

LFS support has been merged into master now. :) Needs testers!

@hbons hbons closed this Jul 8, 2016
@glunardi
Contributor
glunardi commented Jul 8, 2016

Will try to get Collabora to roll this out internally so we can test it.
Thank you for adding support for LFS

@hbons
Owner
hbons commented Jul 9, 2016

@glunardi Thanks! I'd test it locally before deploying anywhere, I'm sure it's still broken in some way I've overlooked. :) On Linux you'll need to have Git LFS from master installed too.
