I've been testing SparkleShare on Windows, which has been working fine until now, with the git repo on an Ubuntu server. My git repo including history is over 1 GB in size, but the actual current files are only about 100 MB (I deleted many files along the way). Is there a way to purge historical versions of deleted files and propagate this across all clients and the server?
Also, a quota system feature might be interesting to implement, i.e. if the total size including history goes above a specified threshold of n MB then SparkleShare should purge the oldest history until it goes below n.
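A minimal sketch of the check such a quota feature would need, assuming the history lives in the repository's .git directory and that du is a good enough measure of its size. The function name over_quota and its arguments are hypothetical, not part of SparkleShare, and the purge step itself is left out.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when the .git directory of
# the repository in $1 uses more than $2 megabytes on disk.
over_quota() {
    repo_dir=$1
    limit_mb=$2
    # du -sk prints the size in kilobytes, a tab, then the path.
    used_kb=$(du -sk "$repo_dir/.git" | cut -f1)
    [ "$used_kb" -gt $((limit_mb * 1024)) ]
}
```

It could be called as `over_quota ~/SparkleShare/project 500 && echo "history exceeds quota"`, where the path and the limit are placeholders.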
... and I did search before posting :(
@hbons Annex stuff starts getting complicated. I need a chained list of dependencies.
@hbons as of now, you cannot purge history* that is already committed into git. Therefore, it duplicates #385.
* Unless you rebase.
@fire well, this is a feature request against current SparkleShare. The annex issue is a brainstorm for something we may use. We still need this issue if we won't use annex.
I will change the issue to mean rebase existing repos to remove old history then.
IMHO disk space usage for the git repository could become a problem. There is a procedure to filter out and remove objects from git history: http://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/ (and someone packed it as a few nice scripts: https://github.com/cmaitchison/git_diet).
Wouldn't it be nice to implement even a simple "Permanently delete files removed more than X days ago" function?
One more link: http://viget.com/extend/backup-your-database-in-git#comment-400537128. Not sure if it works, though.
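One common way to drop a path from history with stock git is filter-branch. The sketch below is not taken from the linked pages; the function name purge_path and the PURGE_PATH variable are made up for illustration. It rewrites commit IDs, so it only helps if every client and the server then switch to the rewritten history.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every ref of the repository in $1 so that
# the path in $2 no longer appears in any commit.
purge_path() {
    (
        cd "$1" || exit 1
        # Newer git versions print a long warning and pause before running
        # filter-branch; FILTER_BRANCH_SQUELCH_WARNING suppresses that.
        # PURGE_PATH is passed through the environment so the path does
        # not have to be quoted into the filter string.
        FILTER_BRANCH_SQUELCH_WARNING=1 PURGE_PATH=$2 \
        git filter-branch --force --prune-empty \
            --index-filter 'git rm -r --cached --ignore-unmatch --quiet -- "$PURGE_PATH"' \
            -- --all
    )
}
```

The old objects still take up space afterwards until the backup refs under .git/refs/original/ and the reflog are cleared and the repository is repacked.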
This is going to be a huge problem. Git basically looks at the entire history for most of its operations, and doing a repack on a repo with several gigabytes can easily hog all of your CPU and your memory. I have crashed servers doing git repack or git gc.
Why is the history even stored on the client? Why not just download the latest version?
I tried SparkleShare yesterday. I have a 5 GB repo full of ~100-200 MB files. SparkleShare on Mac hung regularly, and after the initial upload the repo was over 10 GB even though the files alone are just 5 GB. This is unacceptable.
I think using git on the client side is a design flaw. It would make more sense to use rsync or something to copy the repo and use git on the server to store the history, kind of like Dropbox does it with svn (as far as I know).
There are options for git to just clone to a certain depth, maybe you can look into that. But I'm pretty sure you'll continue to have huge performance problems if you clone everything.
@fab1an the next version will do a --depth=1 clone.
I don't agree that having all the history is a bad thing in all cases; it depends on your use case.
Sure, I could have started creating something completely custom, but then we wouldn't have anything usable now. Git does everything I need it to, but it has some downsides.
You are free to contribute to the custom backend Rainbows though.
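For reference, the shallow clone mentioned above can be tried by hand. This is only a sketch of the plain git command, not of how SparkleShare invokes it; the function names are made up.

```shell
#!/bin/sh
# Clone only the newest commit of the repository at URL $1 into $2.
# For a repository on the local disk the URL must start with file://,
# otherwise git ignores --depth.
shallow_clone() {
    git clone --quiet --depth=1 "$1" "$2"
}

# Print how many commits are reachable from HEAD in the clone at $1.
commit_count() {
    git -C "$1" rev-list --count HEAD
}
```

Comparing `commit_count` on a full clone and on a shallow one, together with `git count-objects -v`, gives a rough idea of how much local history is being skipped.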
Another use case that I've come across where this is an issue is synchronizing encrypted file containers. I keep a repository with a small (~20MB) Truecrypt volume in it, and after some changes, the master repository on my server takes up over 100MB. This is after only a few updates to the file container.
For a use case like this, it's not only inefficient (space-wise), but also insecure, since it keeps multiple versions of an encrypted file, making it easier (although, still very hard) to break that encryption.
@rnorris7756 i have a branch that does encryption per file, so they are stored encrypted on the server. I'm not sure if that's your need, but it should be more efficient.
http://code.google.com/p/boar/ is interesting. It seems to have everything needed to work well with SparkleShare's backend system, only it doesn't seem to (yet) support SSH.
I haven't looked into it but maybe using bup might solve some problems? https://github.com/apenwarr/bup
Encryption on the server really isn't sufficient in my case. I'm encrypting the files locally to protect the information in there in the case that my computer gets stolen. I don't consider it to be a critical issue, since the file container is in its own repository that can be re-initialized easily. I just wanted to present a real case where history compaction (or no history) would be a convenient option.
+1 on this issue
Current plan: https://github.com/Mighty-M/git-bin
I'm using the following script to remove history older than 30 days (which works fine for me, just set the timespec variable).
You could run this every week or so, or provide a button/menu item to run it manually.
The code is originally from gibak (https://github.com/pangloss/gibak), I fixed some small issues and removed what I didn't need.
@bomma interesting. Do you run this on just the client or both the client and server?
@hbons only on the clients
@bomma can you give some more details about how this works? It seems that it just removes objects from the database that aren't referenced in commits older than a particular date, and does not remove the actual commits? Is it possible that it would create a conflict?
@bomma I just tried this script. It does seem to remove earlier commits and their objects, but it doesn't remove them from the pack file, so they still take up space:
$ git count-objects -v
@hbons The script removes history beyond a time limit and then tries to garbage collect unreachable objects. Git is very conservative in that respect though. In my experience, git gc doesn't (always) clean up recent unreachable files (no matter what you specify as time spec). I'm not a git-wizard so maybe somebody else could explain what exactly is going on.
Normally no conflicts should occur since we have no tags, branches and a linear history.
On the history cleaning:
The script originates from git-home-history (http://jean-francois.richard.name/ghh). I seem to get better results with this sequence of commands after filter-branch:
git reset --soft
rm -rf .git/refs/original/
git reflog expire --all --expire=now --expire-unreachable=0
git repack -ad # immediately repack
git gc --prune=now
git push origin +master:master # force pushing the history changes
I run this script if I know I removed a lot of files or big files, not as a cron job. I think someone with more knowledge of git should take a look at it first, but I seem to get decent results (well, better than nothing anyways).
Full script: http://pastebin.com/A6h2UdcC
Guys, I just want to add that most SparkleShare users will use this program to replace Dropbox. So it is not an option to say it's not designed to handle big files and so on. It will remain an issue until you find a solution. For example, I am using SS to sync between two Macs and back up my web development projects (5 GB), but my secret wish is to use it the same way with my graphics projects (100 GB). I am ready to pay for the possibility of hosting my own Dropbox (SparkleShare) on my own server with unlimited disk space. So good luck with that! :)
@Thadone it's already on the roadmap for 2.0.
The git-bin idea is interesting, but I'd like to mention that the key benefit for me in using sparkleshare is the ability to keep everything in-house (i.e. not in the cloud). My understanding of git-bin is that large files are stored in the cloud on amazon... is this configurable such that these files can be stored on a local server instead? If not, then this instantly becomes unusable for my use-case and I'd imagine many others as well.
@nottings the plan is to write an SSH plugin, so the files are stored in the same location as the git repo.
i've adapted git-bin to use SSH and it nicely handles large binary files now. this completes step one. check it out at https://github.com/hbons/git-bin/tree/ssh
@hbons: quick question, by using an SSH backend for git-bin, does this mean that the use of this feature will be restricted to servers that have SSH access? What I mean is, in the case of the server being e.g. GitHub, it wouldn't work, right? (Because GitHub only exposes the git protocol via SSH, not SSH itself, right?)
@knocte you are right. you'll still be able to use GitHub, git-bin will only be enabled on "Own servers"
BTW can we take a step back for a moment and think again why git-bin is needed?
I did some reading and it seems vanilla git is already good enough at managing binary files (both when diffing and when storing) provided git-gc is used. Some links:
@knocte with git-bin there doesn't have to be a local history of all the files. which means massive space savings. if chunks are needed locally they're fetched from the remote.
@hbons: is this ready to be tested or do we wait for a beta release of 2.0? Is there any guide how to test it?
@erenoglu the core is ready, i just need to hook it up to SparkleShare. you can try it manually without SparkleShare here: https://github.com/hbons/git-bin/
@hbons, yes, but then the policy to not store previous versions of the files differs depending on whether the file is binary or not. Shouldn't there rather be a setting that is global to every kind of file (i.e. [ ] prune history older than __ months)?
@knocte this will be done for all the files. all the files will have their history on the remote regardless of type or size.
@hbons ok, but then this would only make sense when the client is configured to not do --depth=1 already, right?
@knocte git-bin only stores the file metadata in git. since this will be small and compressible, we can get away with always doing a full clone.
for the presets like GitHub we'll still use clone --depth=1 by default.
Does this mean that one would not be able to use GitHub without --depth=1? I still think it's a valid use case.
@knocte the checkbox in the dialog will remain in the dialog.
edit: oops, that was a bit of lolspeak :P.
Ah mkay, glad that "by default" doesn't mean remove the checkbox, but just "enable the checkbox" by default :)
BTW: re "regardless of type" -> then you need to rename the summary of this github issue to not contain the word "binary" ;)
Just saw 1.1 released. How are things going with this issue?
@lkraav it's pretty much done now, just needs some thorough testing and integration into the build system.
Is it possible for us to build it manually? I'm eager to test the git-bin version.
@tbrandstetter on Linux it should work by toggling the use_git_bin boolean in the fetcher code. you also need git-bin.exe in your path (compiled from my fork). also make sure that SFTP is enabled on the remote SSH account.
careful though, it hasn't had much testing at all yet.
in many cases the binary files are autogenerated and are very similar to their previous version - for example in the case of keeping compiled .dll files.
in such cases i see a strong use case for storing the large files in chunks but under a folder in the original repo - or on a separate repo rather than use a new service for that.
this way if the only thing that changes is the .dll create time - only the package that contains that change gets uploaded, and stored.
are there any thoughts about adding git as a backend for git-bin?
@lee-elenbaas this is exactly what git-bin does, it chunks up large files and puts them in a separate folder in the existing git repo.
@lee-elenbaas it saves the chunks in the git folder on the server, but using SSH directly, instead of using the git-push command.
@lee it does that. either i'm misunderstanding you or you should read its source.
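To make the chunking idea concrete, here is a rough sketch. It is not git-bin's actual format or code: the function names, the fixed chunk size, and the manifest layout (one SHA-256 per line) are all assumptions, and it relies on the GNU coreutils tools split and sha256sum.

```shell
#!/bin/sh
# Hypothetical sketch: split file $1 into fixed-size chunks (default 1 MiB,
# override with $3), store each chunk in directory $2 under its SHA-256,
# and print the hashes in order. That list is the manifest.
chunk_file() {
    src=$1
    store=$2
    size=${3:-1048576}
    mkdir -p "$store"
    work=$(mktemp -d)
    # -a 6 gives long suffixes so large files do not run out of part names.
    split -a 6 -b "$size" "$src" "$work/part."
    for part in "$work"/part.*; do
        [ -e "$part" ] || continue      # empty input produces no parts
        hash=$(sha256sum "$part" | cut -d' ' -f1)
        # A chunk that is already stored is not copied again.
        [ -e "$store/$hash" ] || cp "$part" "$store/$hash"
        echo "$hash"
    done
    rm -rf "$work"
}

# Rebuild a file on stdout from manifest $1 and chunk directory $2.
restore_file() {
    while read -r hash; do
        cat "$2/$hash"
    done < "$1"
}
```

Only the small manifest would need to be versioned; a new version of a file adds just the chunks whose content changed. With fixed-size chunks that holds for in-place changes such as a timestamp field, but inserting bytes near the start shifts every later chunk.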
Hola @hbons, based on the above, is use_git_bin still supposed to be Linux only? Is something significantly broken on other platforms, is there a point trying to set up test environments there?
@lkraav it's easiest to set up on Linux. i haven't tested on Mac or Windows yet, it will require bundling of git-bin.exe in some way.
OK. Is the implementation ready for a config.xml enable parameter, perhaps? I think you'd immediately receive a larger testing audience. Building the thing is a big barrier.
@lkraav i don't think it's ready yet. i haven't tested it myself completely yet.
Any news here? Would be cool to be able to handle big files with git-bin.
How does git-bin work when offline? We're looking into using SparkleShare to sync data between users who work offline regularly. Not having a local history prior to the last sync/x days/y revisions is not a problem, but the current revision of each file (plus all since the last sync) must be kept locally in this case, or users suddenly run into "corrupted" files.
@creshal yes, that's the whole point of SparkleShare: to have local copies at all times.
Github now has a similar project that is using clean and smudge: https://git-lfs.github.com/. It may be worth considering supporting that (instead or additionally - I'm not sure what the disappearance of Mighty-M's account means for the future of git-bin). Also, if the empty file warning on your git-bin page is still a problem, there may be a solution in the git-lfs code.
@duelafn this definitely looks like the way to go. the clean and smudge filters are now used to encrypt/decrypt files when you have encryption enabled, so there might be a conflict in functionality here.
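The possible conflict can be seen in how git resolves attributes: a path ends up with a single value for the filter attribute, and a later matching line in .gitattributes overrides an earlier one. In the sketch below the filter name encrypt is a placeholder (the name SparkleShare really uses is not stated in this thread) and both function names are made up; the second attributes line is the kind of entry `git lfs track "*.psd"` writes.

```shell
#!/bin/sh
# Print the value of the "filter" attribute that git resolves for path $2
# inside the repository at $1.
filter_for() {
    git -C "$1" check-attr filter -- "$2" | sed 's/^.*: filter: //'
}

# Write a hypothetical .gitattributes into the repository at $1.
# "encrypt" stands in for an encryption filter; the lfs line matches
# what Git LFS adds for tracked patterns.
write_demo_attributes() {
    cat > "$1/.gitattributes" <<'EOF'
* filter=encrypt
*.psd filter=lfs diff=lfs merge=lfs -text
EOF
}
```

In a repository set up this way, `filter_for` reports encrypt for notes.txt but lfs for design.psd, so the .psd file would bypass the encryption filter unless one filter were written to wrap the other.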