Handle large files better with git-bin #437

Open
ghost opened this Issue · 60 comments

17 participants

K. S. Ernest (iFire) Lee Hylke Bons Tammo van Lessen Fabian Zeindl Robert Waldemar arthurlutz Kevin Pinte Thadone Scott Nottingham Andres G. Aragoneses Emre Erenoglu Leho Kraav Thomas Brandstetter Lee Elenbaas Raffael Schmid Samuel Creshal
Deleted user

Hi,

I've been testing SparkleShare on Windows, which has been working fine until now, with the git repo on an Ubuntu server. My git repo including history is over 1 GB in size, but the actual current files are only about 100 MB (I deleted many files along the way). Is there a way to purge historical versions of deleted files and propagate this across all clients and the server?

Also, a quota system feature might be interesting to implement, i.e. if the total size including history goes above a specified threshold of n MB then SparkleShare should purge the oldest history until it goes below n.

Cheers
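The quota check proposed above could be driven by git's own accounting; a minimal, hypothetical sketch (the threshold and the prune action are assumptions, not existing SparkleShare behavior):

```shell
# Hypothetical quota check: compare the packed repository size (in KB, as
# reported by git) against a limit. Threshold and message are illustrative.
quota_kb=102400   # e.g. 100 MB
size_kb=$(git count-objects -v | awk '/size-pack/ {print $2}')
if [ "$size_kb" -gt "$quota_kb" ]; then
    echo "over quota: ${size_kb} KB packed > ${quota_kb} KB"
    # ...this is where a history prune would be triggered
fi
```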

Spec

  1. Rebase history
  2. Quota for automatic rebase
  3. Force clients to move to new repo
K. S. Ernest (iFire) Lee fire closed this
Deleted user

... and I did search before posting :(

Hylke Bons hbons reopened this
K. S. Ernest (iFire) Lee

@hbons Annex stuff starts getting complicated. I need a chained list of dependencies.

K. S. Ernest (iFire) Lee

@hbons as of now, you cannot purge history that is already committed into git. Therefore, it duplicates #385.

Unless you rebase.

Hylke Bons
Owner

@fire well this is a feature request against current SparkleShare. The annex issue is a brainstorm for something we may use. We still need this issue in case we don't use annex.

K. S. Ernest (iFire) Lee

I will change the issue to mean rebase existing repos to remove old history then.

Hylke Bons
Owner

@gionn

"Hi,

IMHO disk space usage for the git repository could became a problem. There is a procedure to filter out and remove objects from git history: http://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/ (and someone packed it as a few nice scripts: https://github.com/cmaitchison/git_diet).

Wouldn't it be nice to implement even a simple "permanently delete files removed more than X days ago" function?

Thanks!"
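The linked procedure boils down to rewriting every commit without the unwanted path and then expiring the old objects; a rough sketch (the path is illustrative, and every clone has to re-sync after the forced push):

```shell
# Rewrite all refs, dropping the unwanted path from every commit.
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch path/to/huge-file.bin' \
    --prune-empty -- --all
# Drop filter-branch's backup refs and old reflog entries, then prune.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git push origin --force --all   # every clone must re-sync after this
```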

Fabian Zeindl

This is going to be a huge problem. Git basically looks at the entire history for most of its operations, and doing a repack on a repo with several gigabytes can easily hog all of your CPU and memory. I have crashed servers doing git repack or git gc.

Why is the history even stored on the client? Why not just download the latest version?

I tried SparkleShare yesterday; I have a 5 GB repo full of ~100-200 MB files. SparkleShare on Mac hung regularly, and after the initial upload the repo was over 10 GB even though the files alone are just 5 GB. This is unacceptable.

I think using git on the client side is a design flaw. It would make more sense to use rsync or something to copy the repo, and use git on the server to store the history, kind of like Dropbox does it with svn (as far as I know).

There are options for git to clone only to a certain depth; maybe you can look into that. But I'm pretty sure you'll continue to have huge performance problems if you clone everything.
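The depth option mentioned here is git's shallow clone; a quick sketch (URL illustrative):

```shell
# Fetch only the newest commit instead of the full history (URL is illustrative).
git clone --depth 1 ssh://server/path/to/project.git
# A shallow clone can later be converted into a full one if ever needed:
git -C project fetch --unshallow
```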

Hylke Bons
Owner

@fab1an the next version will do a --depth=1 clone.

I don't agree that having all the history is a bad thing in all cases; it depends on your use case.
Sure, I could have started creating something completely custom, but then we wouldn't have anything usable now. Git does everything I need, but it has some downsides.

You are free to contribute to the custom backend Rainbows though.

Robert

Another use case that I've come across where this is an issue is synchronizing encrypted file containers. I keep a repository with a small (~20MB) Truecrypt volume in it, and after some changes, the master repository on my server takes up over 100MB. This is after only a few updates to the file container.

For a use case like this, it's not only inefficient (space-wise), but also insecure, since it keeps multiple versions of an encrypted file, making it easier (although, still very hard) to break that encryption.

Hylke Bons
Owner

@rnorris7756 I have a branch that does encryption per file, so files are stored encrypted on the server. I'm not sure if that's what you need, but it should be more efficient.

Hylke Bons
Owner

http://code.google.com/p/boar/ is interesting. It seems to have everything needed to work well with SparkleShare's backend system, only it doesn't seem to (yet) support SSH.

Waldemar

I haven't looked into it but maybe using bup might solve some problems? https://github.com/apenwarr/bup

Robert

Encryption on the server really isn't sufficient in my case. I'm encrypting the files locally to protect the information in there in the case that my computer gets stolen. I don't consider it to be a critical issue, since the file container is in its own repository that can be re-initialized easily. I just wanted to present a real case where history compaction (or no history) would be a convenient option.

arthurlutz

+1 on this issue

Kevin Pinte

I'm using the following script to remove history older than 30 days (which works fine for me, just set the timespec variable).
You could run this every week or so, or provide a button/menu item to run it manually.

the script:
http://pastebin.com/PXYFJSGL

The code is originally from gibak (https://github.com/pangloss/gibak), I fixed some small issues and removed what I didn't need.
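The linked script isn't reproduced here, but truncating history at a cutoff date can be sketched with stock git using a graft point (a sketch only, not the script itself):

```shell
# Sketch: pretend the newest commit older than the cutoff has no parents,
# then rewrite history with that graft baked in and collect the garbage.
timespec="30 days ago"
cutoff=$(git rev-list -1 --before="$timespec" HEAD)
echo "$cutoff" > .git/info/grafts      # graft: this commit now has no parents
git filter-branch -- --all             # rewrite all refs with the graft applied
rm -f .git/info/grafts
rm -rf .git/refs/original/             # drop filter-branch's backup refs
git reflog expire --expire=now --all
git gc --prune=now
```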

Hylke Bons
Owner

@bomma interesting. Do you run this on just the client or both the client and server?

Kevin Pinte

@hbons only on the clients

Hylke Bons
Owner

@bomma can you give some more details about how this works? It seems that it just removes objects from the database that are only referenced by commits older than a particular date, but doesn't remove the actual commits? Is it possible that it would create a conflict?

Hylke Bons
Owner

@bomma I just tried this script. It does seem to remove earlier commits and their objects, but it doesn't remove them from the pack file, so they still take up space:

$ git count-objects -v
count: 0
size: 0
in-pack: 82
packs: 1
size-pack: 1331
prune-packable: 0
garbage: 0
Kevin Pinte

@hbons The script removes history beyond a time limit and then tries to garbage-collect unreachable objects. Git is very conservative in that respect though. In my experience, git gc doesn't (always) clean up recent unreachable objects (no matter what timespec you specify). I'm not a git wizard, so maybe somebody else could explain what exactly is going on.

Normally no conflicts should occur, since we have no tags or branches and a linear history.

On the history cleaning:
The script originates from git-home-history (http://jean-francois.richard.name/ghh). I seem to get better results with this sequence of commands after filter-branch:

git reset --soft
rm -rf .git/refs/original/  # drop filter-branch's backup refs
git reflog expire --all --expire=now --expire-unreachable=0  # forget old reflog entries
git repack -ad  # immediately repack
git gc --prune=now  # prune unreachable objects right away
git push origin +master:master  # force-push the rewritten history

I run this script when I know I have removed a lot of files or big files, not as a cron job. I think someone with more knowledge of git should take a look at it first, but I seem to get decent results (well, better than nothing anyway).

Full script: http://pastebin.com/A6h2UdcC

Thadone

Guys, I just want to add that most SparkleShare users will use this program to replace Dropbox, so it's not an option to say it's not designed to handle big files, and so on. It will always be an issue until you find a solution. For example, I am using SparkleShare to sync between two Macs and back up my web development projects (5 GB), but my secret wish is to use it in the same way with my graphics projects (100 GB). I am ready to pay for the possibility to host my own Dropbox (SparkleShare) on my own server with unlimited disk space. So good luck on that! :)

Hylke Bons
Owner

@Thadone it's already on the roadmap for 2.0.

Scott Nottingham

The git-bin idea is interesting, but I'd like to mention that the key benefit for me in using SparkleShare is the ability to keep everything in-house (i.e. not in the cloud). My understanding of git-bin is that large files are stored in the cloud on Amazon... is this configurable such that these files can be stored on a local server instead? If not, then this instantly becomes unusable for my use case, and I'd imagine many others as well.

Hylke Bons
Owner

@nottings the plan is to write an SSH plugin, so the files are stored in the same location as the git repo.

Hylke Bons
Owner

I've adapted git-bin to use SSH and it nicely handles large binary files now. This completes step one. Check it out at https://github.com/hbons/git-bin/tree/ssh

Andres G. Aragoneses

@hbons: quick question: by using an SSH backend for git-bin, does this mean that the use of this feature will be restricted to servers that have SSH access? What I mean is, in the case of the server being e.g. GitHub, it wouldn't work, right? (Because GitHub only exposes the git protocol over SSH, not SSH itself, right?)

Hylke Bons
Owner

@knocte you are right. You'll still be able to use GitHub; git-bin will only be enabled on "own servers".

Andres G. Aragoneses

I see.

BTW can we take a step back for a moment and think again why git-bin is needed?

I did some reading and it seems vanilla git is already good enough at managing binary files (both when diffing and when storing) provided git-gc is used. Some links:

http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git#comment-8212510

http://stackoverflow.com/questions/3601672/how-does-git-deal-with-binary-files#answer-3601728

Hylke Bons
Owner

@knocte with git-bin there doesn't have to be a local history of all the files, which means massive space savings. If chunks are needed locally, they're fetched from the remote.

Emre Erenoglu

@hbons: is this ready to be tested or do we wait for a beta release of 2.0? Is there any guide how to test it?

Hylke Bons
Owner

@erenoglu the core is ready, I just need to hook it up to SparkleShare. You can try it manually without SparkleShare here: https://github.com/hbons/git-bin/

Andres G. Aragoneses

@hbons, yes, but then the policy of not storing previous versions of files is different depending on whether the file is binary or not. Shouldn't there rather be a setting that is global to every kind of file (e.g. [ ] prune history older than __ months)?

Hylke Bons
Owner

@knocte this will be done for all files. All files will have their history on the remote regardless of type or size.

Andres G. Aragoneses

@hbons ok, but then this would only make sense when the client is configured to not do --depth=1 already, right?

Hylke Bons
Owner

@knocte git-bin only stores the file metadata in git. Since this will be small and compressible, we can get away with always doing a full clone.

For the presets like GitHub we'll still use clone --depth=1 by default.

Andres G. Aragoneses

Does this mean that one would not be able to use GitHub without --depth=1? I still think it's a valid use case.

Hylke Bons
Owner

@knocte the checkbox in the dialog will remain in the dialog.

edit: oops, that was a bit of lolspeak :P.

Andres G. Aragoneses

Ah mkay, glad that "by default" doesn't mean remove the checkbox, but just "enable the checkbox" by default :)

Andres G. Aragoneses

BTW: re "regardless of type" -> then you need to rename the summary of this github issue to not contain the word "binary" ;)

Leho Kraav

Just saw 1.1 released. How are things going with this issue?

Hylke Bons
Owner

@lkraav it's pretty much done now, just needs some thorough testing and integration into the build system.

Thomas Brandstetter

Is it possible for us to build it manually? I'm eager to test the git-bin version.

Hylke Bons
Owner

@tbrandstetter on Linux it should work by toggling the use_git_bin boolean in the fetcher code. You also need git-bin.exe in your path (compiled from my fork). Also make sure that SFTP is enabled on the remote SSH account.

Careful though, it hasn't had much testing at all yet.

Lee Elenbaas

In many cases the binary files are autogenerated and are very similar to their previous version, for example in the case of keeping compiled .dll files.
In such cases I see a strong use case for storing the large files in chunks, but under a folder in the original repo, or in a separate repo, rather than using a new service for that.
This way, if the only thing that changes is the .dll creation time, only the package that contains that change gets uploaded and stored.

are there any thoughts about adding git as a backend for git-bin?

Hylke Bons
Owner

@lee-elenbaas this is exactly what git-bin does, it chunks up large files and puts them in a separate folder in the existing git repo.
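The chunking idea can be illustrated with plain shell (this is only an illustration of content-addressed chunking, not git-bin's actual code or storage layout):

```shell
# Split a large file into 1 MB pieces, name each piece by its hash, and
# keep a small manifest listing the pieces in order. An unchanged piece
# hashes to the same name, so it is stored and uploaded only once.
mkdir -p .bin
split -b 1m large.bin piece.
: > large.bin.manifest
for p in piece.*; do
    h=$(sha1sum "$p" | cut -d' ' -f1)
    mv "$p" ".bin/$h"                  # identical pieces collapse to one object
    echo "$h" >> large.bin.manifest
done
# the tiny manifest is what goes into git; the .bin/ objects sync separately
```

Reassembling the file is just concatenating the pieces listed in the manifest, in order.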

Lee Elenbaas
Hylke Bons
Owner

@lee-elenbaas it saves the chunks in the git folder on the server, but using SSH directly, instead of using the git-push command.

Lee Elenbaas
Hylke Bons
Owner

@lee it does that. Either I'm misunderstanding you or you should read its source.

Leho Kraav

Hola @hbons, based on the above, is use_git_bin still supposed to be Linux only? Is something significantly broken on other platforms, is there a point trying to set up test environments there?

Hylke Bons
Owner

@lkraav it's easiest to set up on Linux. I haven't tested on Mac or Windows yet; it will require bundling git-bin.exe in some way.

Leho Kraav

OK. Is the implementation ready for a config.xml enable parameter, perhaps? I think you'd immediately receive a larger testing audience. Building the thing is a big barrier.

Hylke Bons
Owner

@lkraav I don't think it's ready yet; I haven't completely tested it myself.

Raffael Schmid

Any news here? Would be cool to be able to handle big files with git-bin.

Samuel Creshal

How does git-bin work when offline? We're looking into using SparkleShare to sync data between users who work offline regularly. Not having local history prior to the last sync/x days/y revisions is not a problem, but the current revision of each file (plus all revisions since the last sync) must be kept locally in this case, or users suddenly run into "corrupted" files.

Hylke Bons
Owner

@creshal yes, that's the whole point of SparkleShare: to have local copies at all times.
