APIs for git-gc and git-reflog #3247

Open
leoyanggit opened this issue Jun 23, 2015 · 27 comments

@leoyanggit

I'm wondering if I could implement some of the functionality of git-gc and git-reflog using the current public APIs. Do we have APIs of this kind? I'm basically looking for

git reflog expire --expire-unreachable=now --all
git gc --prune=now

If we don't have these APIs, is there a plan to add them?

@carlosmn
Member

git-gc is mostly policy-based and it's something you run regularly in the background as housekeeping. It is a poor candidate for a libgit2 feature.

Similarly for git-reflog expire. You can implement it via reference transactions, but there's not much to gain from using libgit2 for that.
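
For readers looking for a starting point: below is a minimal sketch of reflog expiry built on libgit2's public reflog API (git_reflog_read, git_reflog_drop, git_reflog_write), which exists today. It handles only a single reflog and abbreviates error handling; a full implementation would iterate every reference, as git reflog expire --all does.

#include <git2.h>
#include <time.h>

/* Drop all entries of one reflog older than `cutoff`, then persist it. */
static int expire_reflog(git_repository *repo, const char *refname, time_t cutoff)
{
    git_reflog *reflog = NULL;
    int error = git_reflog_read(&reflog, repo, refname);
    if (error < 0)
        return error;

    /* Index 0 is the newest entry; iterate from the oldest (highest
     * index) down so dropping never shifts indices we still need. */
    for (size_t i = git_reflog_entrycount(reflog); i > 0; i--) {
        const git_reflog_entry *entry = git_reflog_entry_byindex(reflog, i - 1);
        const git_signature *committer = git_reflog_entry_committer(entry);

        if (committer->when.time < cutoff)
            git_reflog_drop(reflog, i - 1, 1 /* rewrite previous entry */);
    }

    error = git_reflog_write(reflog);
    git_reflog_free(reflog);
    return error;
}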

@leoyanggit
Author

I'm actually not looking for the pack functionality of git-gc; I want the prune part. So I could live with just the ability to remove unreachable objects. Do we have this capability?

@carlosmn
Member

There is no functionality to delete objects or perform the repack necessary for that.

@leoyanggit
Author

Would you consider adding this ability? It's common to end up with unreachable objects in a repo, so we need a way to remove them when git-gc (or an equivalent) is not available on the target platform.

@carlosmn
Member

As I mentioned above, I don't think there is much of an advantage, if any, to having a libgit2-implemented repacker or the other housekeeping operations.

Not having git itself but having libgit2 on such a system is an edge case which I don't believe we should base our goals on.

@YuzhongHuangCS

Any updates? When libgit2 is used on a server, I wish it could do background garbage collection regularly to save disk space, so I think it would be very useful functionality. Currently I have to call git gc via the command line.

@carlosmn
Member

The goals of the project haven't changed, and neither have the tradeoffs. If one were to rewrite git-gc on top of libgit2, the best-case scenario is ending up with what we already had.

If you want regular maintenance on some repositories, use git gc; that's what it's there for.

@Xorcerer

Xorcerer commented Mar 5, 2019

I keep using libgit2 to maintain a repo, and the repo eventually runs into

could not open 'E:/Repos/CoreUXOperations/.git/HEAD': Too many open files

every time. Running git gc solves the issue.

@eaigner
Contributor

eaigner commented Apr 25, 2019

Question: wouldn't running a local clone of the repo in libgit2 and then replacing the .git folder on disk with the cloned one be equivalent to a garbage collect?

@pks-t
Member

pks-t commented Apr 26, 2019

@eaigner: while this would seem like an easy solution, it's unfortunately not equivalent to git-gc. git-gc has some important aspects that we'd definitely have to honor if we were to implement it in libgit2. Most importantly, git-gc will by default not purge unreachable objects unless they haven't been touched in at least n days. This not only keeps people from losing data when they inadvertently made a desired object unreachable, it also avoids race conditions with concurrently running tools. If one were to implement GC with a local clone (which is probably not what you want to imply but propose as a workaround only), then you wouldn't get those unreachable objects at all and thus could potentially lose them.

@eaigner
Contributor

eaigner commented Apr 26, 2019

@pks-t true, but it may be enough for some people's use cases. And as you know, I'm all about those half-assed stop-gap solutions 😆

@pks-t
Member

pks-t commented Apr 26, 2019

Haha. Well, yeah, it would be a possible way to implement it for users who know they have a tight grip on the environment. Most importantly, it needs to be known that no other process will access the repo at the same point in time, which is a guarantee that cannot be made in most environments, as users are always free to use e.g. an IDE (which may regularly execute git commands in a repo without the user doing anything) and the command line at the same time.
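
For completeness, here is a sketch of the clone-based workaround under the constraints just described (nothing else may touch the repository, and unreachable objects are lost immediately, with none of git-gc's grace period). The paths and the final directory swap are placeholders left to the caller.

#include <git2.h>

static int gc_by_clone(const char *repo_path, const char *tmp_path)
{
    git_repository *cloned = NULL;
    git_clone_options opts = GIT_CLONE_OPTIONS_INIT;
    int error;

    opts.bare = 1;
    /* GIT_CLONE_NO_LOCAL pushes the objects through the pack machinery
     * instead of copying/hardlinking the object database verbatim, so
     * the result is a single fresh pack without unreachable objects. */
    opts.local = GIT_CLONE_NO_LOCAL;

    if ((error = git_clone(&cloned, repo_path, tmp_path, &opts)) < 0)
        return error;

    git_repository_free(cloned);
    /* Swapping tmp_path into place (e.g. via rename(2)) must happen
     * while no other process has the repository open. */
    return 0;
}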

@kcsaul
Contributor

kcsaul commented Jun 19, 2019

@pks-t Any ideas how one would go about implementing garbage collection for a custom odb backend without cloning/rebuilding from scratch?

With my project, I'm using an OO database as the backend, so I've no way of using standard git tools to clean up any old garbage. My database environment has such things as single-user & offline modes, and there are other locking strategies I could use while online that would guarantee it's safe to run garbage collection.

Regarding the challenge you mentioned above about not deleting objects that haven't been touched in n days: if garbage collection were implemented by libgit2, perhaps this is something that could be left up to the odb backend to decide, i.e. pass n through to the new odb delete function that'd be needed, and have it return a result indicating whether it went ahead, to determine whether deletion of any dependent objects should also be attempted. In my case, I'd simply check whether the object was created more than n days ago, rather than worrying about updating the database to reflect when it was last accessed.

@tiennou
Contributor

tiennou commented Jun 20, 2019

The first step would be to decide what parts of gc you're after the most. I can say for sure that I don't like the idea of adding deletion capabilities to the ODB, though, but that's on me 😉. As we're currently concentrating on getting 1.0 ready, it's also somewhat out of scope at the moment. Anyways…

Reading what the manpage says, git-gc seems to depend on git-repack (no support), which uses git-pack-objects (some support). As part of the ODB backend API, we have writepack, which is used to handle pack writing in general, but AFAICS nothing that can trigger a repack. I'm pretty sure our low-level pack-wrangling functions, i.e. the packbuilder, could be used to do that; they only seem to be missing "knobs" for deltification, window size, etc.

In short, adding a repack method to the ODB backend struct would be a good first step toward a possible git_repository_gc system. Then it's about implementing it, both for the normal ODBs (using the packbuilder) and yours (SQL?).
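
To illustrate what the packbuilder can already do, here is a sketch of a "dumb" repack: walk everything reachable from refs and write one new pack. It uses only existing public APIs (git_revwalk, git_packbuilder), but note it neither deletes the old packs and loose objects nor exposes the delta/window knobs mentioned above.

#include <git2.h>

static int naive_repack(git_repository *repo, const char *pack_dir)
{
    git_revwalk *walk = NULL;
    git_packbuilder *pb = NULL;
    int error;

    if ((error = git_revwalk_new(&walk, repo)) < 0 ||
        (error = git_revwalk_push_glob(walk, "refs/*")) < 0 ||
        (error = git_packbuilder_new(&pb, repo)) < 0)
        goto done;

    /* Inserts every walked commit plus the trees and blobs it references. */
    if ((error = git_packbuilder_insert_walk(pb, walk)) < 0)
        goto done;

    /* Writes pack-<checksum>.pack and its .idx into pack_dir. */
    error = git_packbuilder_write(pb, pack_dir, 0, NULL, NULL);

done:
    git_packbuilder_free(pb);
    git_revwalk_free(walk);
    return error;
}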

@kcsaul
Contributor

kcsaul commented Jun 20, 2019

The main part I think I'd need is a common approach/algorithm for identifying all the unreachable objects and deleting them.

The backend in my case is another ODB (JADE), which fits in quite nicely with the pluggable backend approach provided by libgit2.

While it would be needed for cleaning up standard git file-based ODBs (which can just be cleaned up using git), I don't believe there's any need to provide support for repacking objects into compressed pack files for other custom ODB backends. In my case, the JADE database already packs the objects away into its own database structure. I suspect the same can be said for other kinds of databases that would be used as a backend, as each would store all the objects in the way that's most appropriate to it. I don't think it would make sense to pull objects out of other kinds of databases in order to archive them into a pack file stored outside of the database.

@tiennou
Contributor

tiennou commented Jun 20, 2019

Hmm, I see the issue: you don't have a loose/packed system with a custom ODB, and I'm too focused on that 😉. So there must be a way to ask for each backend's idea of which objects are unreachable (that could be handled as a public helper), and a way to ask for a list of objects to be deleted (I think that's where the transaction-y issues alluded to enter the picture). So we'd provide git_odb_gc(git_odb *, git_odb_gc_options *) for your use case, a garbage_collect(…) ODB method, and the aforementioned helper. I think implementing the helper is doable, repacking would be handled by our backends, and you'd have the entry point you need.
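
To make the proposal concrete, here is a hypothetical sketch of those entry points; none of this exists in libgit2, and all names are made up to mirror the comment above.

#include <git2.h>
#include <time.h>

/* Hypothetical: options controlling a GC run. */
typedef struct git_odb_gc_options {
    unsigned int version;
    time_t expire; /* only objects unreachable since before this may go */
} git_odb_gc_options;

/* Hypothetical: per-backend method receiving the set of unreachable ids
 * computed by the frontend helper; each backend decides what to delete. */
typedef int (*git_odb_backend_gc_cb)(git_odb_backend *backend,
                                     const git_oid *unreachable, size_t count,
                                     const git_odb_gc_options *opts);

/* Hypothetical: frontend entry point running reachability analysis and
 * dispatching the resulting set to every backend's gc callback. */
int git_odb_gc(git_odb *odb, const git_odb_gc_options *opts);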

@kcsaul
Contributor

kcsaul commented Jun 20, 2019

That sounds like the sort of thing I'd need, but wouldn't it need to be at the repository level, as the garbage collection would need to use its associated refdb to determine what can and can't be reached?

@pks-t
Member

pks-t commented Jun 20, 2019

That sounds like the sort of thing I'd need, but wouldn't it need to be at the repository level, as the garbage collection would need to use its associated refdb to determine what can and can't be reached?

Definitely. It could always be the case that some object A is part of ODB X but only reachable via an object B in another ODB Y. So asking X whether A is unreachable would return true, while it is in fact reachable from B.

@pks-t
Member

pks-t commented Jun 20, 2019

So what would be required is callbacks that enumerate all objects in an ODB. We'd then have to recreate the complete tree from all objects of each ODB and mark all the other objects as unreachable. We could then notify ODBs that this set of objects is unreachable. Each ODB can then decide for itself whether it wants to delete those objects.

This still ignores the issue of expiry times, though, and I'm not convinced that this can be handled at the ODB level only.
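
Here is a sketch of that mark-and-sweep shape using the existing git_odb_foreach. The oid_set container and its functions are hypothetical stand-ins, since libgit2 exposes no public set type, and the mark phase is assumed to have run already.

#include <git2.h>

struct oid_set; /* hypothetical set container */
extern int  oid_set_contains(struct oid_set *set, const git_oid *id); /* hypothetical */
extern void oid_set_add(struct oid_set *set, const git_oid *id);      /* hypothetical */

struct sweep_ctx {
    struct oid_set *reachable;   /* filled by the mark phase beforehand */
    struct oid_set *unreachable; /* output: candidates for deletion */
};

static int sweep_cb(const git_oid *id, void *payload)
{
    struct sweep_ctx *ctx = payload;

    if (!oid_set_contains(ctx->reachable, id))
        oid_set_add(ctx->unreachable, id);
    return 0; /* keep enumerating */
}

/* Enumerate every object in the ODB, recording those never marked reachable;
 * the result could then be handed to each backend to delete or ignore. */
static int find_unreachable(git_odb *odb, struct sweep_ctx *ctx)
{
    return git_odb_foreach(odb, sweep_cb, ctx);
}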

@kcsaul
Contributor

kcsaul commented Jun 20, 2019

There is already a 'foreach' callback on git_odb_backend which looks like it's intended to enumerate all objects in an ODB.

This still ignores the issue of expiry times, though, and I'm not convinced that this can be handled at ODB level, only.

While I wouldn't want to update an object when it's being read to track when it was last accessed, there's another approach that may make the overall gc operation more efficient, as it would avoid the overhead of building up and trimming a potentially very large list of objects that can/can't be reached to work out what can be deleted.

If the ODB backend provided functions to update/check a timestamp (which would initially be set on creation), it could be updated while recursively visiting all objects that can be reached. All objects in the ODB could then be enumerated to delete those whose timestamp hasn't been updated and has expired as a result.
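
As a rough sketch, the backend interface additions this scheme implies might look like the following; these callbacks are entirely hypothetical and exist nowhere in git_odb_backend today.

#include <git2.h>
#include <time.h>

/* Hypothetical: refresh an object's "last seen reachable" timestamp. */
typedef int (*git_odb_backend_touch_cb)(git_odb_backend *backend,
                                        const git_oid *id);

/* Hypothetical: delete every object whose timestamp was never refreshed
 * and is older than `expire_before`; the backend decides how to check. */
typedef int (*git_odb_backend_sweep_cb)(git_odb_backend *backend,
                                        time_t expire_before);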

@pks-t
Member

pks-t commented Jun 20, 2019

There is already a 'foreach' callback on git_odb_backend which looks like it's intended to enumerate all objects in an ODB.

Oops, yes. I accidentally looked at the wrong struct and was really confused that I didn't see any such callback.

If the ODB backend provided functions to update/check a timestamp (which would initially be set on creation), it could be updated while recursively visiting all objects that can be reached. All objects in the ODB could then be enumerated to delete those whose timestamp hasn't been updated and has expired as a result.

Yeah, that'd work. The downside, though, is that we'd batch-update all the objects whenever a GC is performed. Sounds like a lot of disk writes to me, and if you have e.g. a loose file backend it would be quite inefficient. It might be preferable in some situations, though, compared to building a complete graph.

@kcsaul
Contributor

kcsaul commented Jun 20, 2019

In my case, the disk writes wouldn't be a problem, but I can understand why the library can't make assumptions like that.

Now that we've identified a strategy that'd work in my case, perhaps all I really need is a way of visiting every reachable object. I'll be able to take care of the rest by setting an indicator elsewhere so I know to update the timestamp when an object is visited.

Is there anything in the library that could be used to just visit every reachable object?
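
For what it's worth, here is a sketch of such a visit with today's public APIs: walk all commits reachable from refs, then walk each commit's tree. visit() is a hypothetical callback standing in for "update the timestamp"; note this misses annotated tag objects and anything reachable only from the index or stash.

#include <git2.h>

extern void visit(const git_oid *id, void *payload); /* hypothetical */

static int visit_entry(const char *root, const git_tree_entry *entry, void *payload)
{
    (void)root;
    visit(git_tree_entry_id(entry), payload); /* blobs and subtrees alike */
    return 0; /* 0 = keep walking */
}

static int visit_reachable(git_repository *repo, void *payload)
{
    git_revwalk *walk = NULL;
    git_oid oid;
    int error;

    if ((error = git_revwalk_new(&walk, repo)) < 0)
        return error;
    if ((error = git_revwalk_push_glob(walk, "refs/*")) < 0) { /* branches, tags, notes, ... */
        git_revwalk_free(walk);
        return error;
    }

    while (git_revwalk_next(&oid, walk) == 0) {
        git_commit *commit = NULL;
        git_tree *tree = NULL;

        visit(&oid, payload); /* the commit object itself */

        if (git_commit_lookup(&commit, repo, &oid) == 0 &&
            git_commit_tree(&tree, commit) == 0) {
            visit(git_tree_id(tree), payload); /* the root tree */
            git_tree_walk(tree, GIT_TREEWALK_PRE, visit_entry, payload);
        }
        git_tree_free(tree);
        git_commit_free(commit);
    }

    git_revwalk_free(walk);
    return 0;
}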

@pks-t
Member

pks-t commented Jun 20, 2019 via email

@borkmann

borkmann commented Nov 13, 2019

@pks-t @ethomson @kcsaul @tiennou any progress on such API?

If not, can we re-open this issue?

I have a stand-alone application which does not assume that the git binary for git-repack is present, and I ran into the issue that, after running fine non-stop for several days, git_revwalk_next() with a prior git_revwalk_push_head() all of a sudden stopped working, to my surprise.

Turns out I ran into the default resource limit on open files deep down inside git_revwalk_next() when it preps the list, due to having too many pack files. Doing a plain git repack -a -d -f --depth=250 --window=250 shrinks the number of files dramatically and makes everything work again. For repos with a large number of commits, a git-repack API would be highly desirable.

I worked around the issue by bumping the ulimit ...

https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/l2md.git/commit/?id=ecdbb0c79d9274a04a1da9efd429c1c88fe9bd9a

... but that's not actually solving it, just mitigating it; hence a proper repacking API is needed in libgit2.

Are there any other options available to work around this that are already in upstream libgit2 but that I've been missing?
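
For reference, the ulimit workaround in the linked commit boils down to raising RLIMIT_NOFILE from inside the process (POSIX only). A sketch, which mitigates the symptom rather than fixing the underlying pack-file count:

#include <sys/resource.h>

static void bump_nofile_limit(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) == 0) {
        rlim.rlim_cur = rlim.rlim_max; /* raise soft limit to hard limit */
        setrlimit(RLIMIT_NOFILE, &rlim);
    }
}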

@tiennou
Contributor

tiennou commented Nov 14, 2019

I'm okay with reopening this issue, but there are no promises on work actually taking place 😉. The goal at the moment is to get as much stabilization done as possible in preparation for v1.0, so a "new feature" is pretty low priority, especially one with (IMHO) not enough preparatory design work. Adding more ways to have write/deletion problems, especially with objects, doesn't feel 1.0-ish to me 😜.

That said, "a repacker" sure is something that would be nice, but it's unclear how it's supposed to work against custom refdb implementations, and those are already in use… So I think the summary, to me, is: should packing be implemented as a new backend callback? Or be backend-agnostic, which makes it effectively a repository-level "system"? Should we invest time in improving our own refdb code, or should we concentrate on trying to port "what git already is", relibrarizing their refdb code? It should be theoretically possible to swap our implementation for git's anyway, and there was some interest in doing that a while back. I've looked a little at how much mismatch there would be, but not much more, because of the priority…

It should be possible to write a "dumb" version of the repacker with the code as it is now: load and "lock" objects touched by either being pointed to by a reference or being related to one such locked object, and then issue a final "purge/compact" action to the odb with the list of "untouched objects". This looks simple, but you have to make sure to keep tabs on the refdb while doing that, you have worktrees and alternates to handle consistently, and someone might be committing from another process entirely. I'm not sure how doable this is given all these constraints…

@pks-t
Member

pks-t commented Nov 28, 2019

I'm fine with reopening as it's still an issue. And as @tiennou said, an open issue is no promise that it's going to be implemented anytime soon ;) But yeah, having it closed may make it seem like libgit2 isn't willing to consider any GC/repacking implementation at all.

Adding more ways to have write/deletion problems, especially with objects, doesn't feel 1.0-ish to me.

Agreed, though I think it would be nice to have in the long term.

Should packing be implemented as a new backend callback ? Or be backend-agnostic, which makes it effectively a repository-level "system" ?

Tough question. I feel like having it backend-specific is going to scale a lot better than a backend-agnostic approach, especially as for some backends repacking would be essentially needless or impossible. If you've got a database backend, there is no need for you to repack, as the database handles that for you already. So having a generic layer above the backends that still does the repacking is going to perform a lot of work with zero benefit at all.

So I think the answer is backend-specific, but that again throws up the question of how to handle cross-backend repackings. The most obvious case here is loose/packed objects, which are two backends where repacking of loose objects results in a packed object. But how to handle that generically when there's e.g. a loose and a database backend is a tough question.

So I think any judgement here is premature. There definitely is a rather big discussion to be had on this topic to nail down the pros/cons of either approach and get the design right.

@pks-t pks-t reopened this Nov 28, 2019
@tiennou
Contributor

tiennou commented Nov 28, 2019

Yeah, to be frank, the main issue I see with the current design is that our loose/packed reference backend carries too much "knowledge" about how to do things. Case in point: any custom backend has to implement reflog handling, as both an "internal" and an "external" API (resp. itself, and the git_refdb layer), which IMO could have been better handled by the generic refdb code using a complete "transaction" (capture reference state, update value, update/create reflog, commit or rollback depending on a write-locked reference state check). This can be fixed, but not without changes to the git_refdb_backend interface. And then git_reference could benefit from a more transparent handling of atomicity by keeping track of the "read state" (via at least a "staleness" check for users, instead of having to handle a "maybe stale now" case each time they reuse a loaded reference, which is exposed via old_target OIDs or refnames). I do realize there was a removal of state-related things a few years back, but AFAIU I don't see how to make long-running users of any reference code not be impacted by staleness issues without us being aware of it; right now, care must be taken to pass those old_target values to get the correct code paths.

OTOH, it's the only working backend I know of, and it already has some conformance issues with git because of its "customizability". A big discussion indeed 😉.
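
For context, the public transaction API as it exists today already gives the lock/stage/commit flow; the argument above is that the generic refdb layer, not each backend, should drive reflog updates through this same mechanism. A minimal sketch of the current flow:

#include <git2.h>

static int update_ref_atomically(git_repository *repo, const char *refname,
                                 const git_oid *target, const git_signature *sig)
{
    git_transaction *tx = NULL;
    int error;

    if ((error = git_transaction_new(&tx, repo)) < 0)
        return error;

    /* A ref must be locked before it can be staged for update. */
    if ((error = git_transaction_lock_ref(tx, refname)) == 0 &&
        (error = git_transaction_set_target(tx, refname, target, sig,
                                            "update via transaction")) == 0)
        error = git_transaction_commit(tx);

    git_transaction_free(tx); /* releases locks; rolls back if not committed */
    return error;
}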

martinvonz added a commit to martinvonz/jj that referenced this issue Dec 25, 2020
Before this commit, running Git's GC in a Git repo backing a Jujube
repo would risk deleting the conflict data we store as blobs in the
Git repo. This commit fixes that by adding a Git note pointing to the
conflict blob.

I wasn't able to add a test case for this because libgit2 doesn't
support gc [1]. Just testing that the ref is there doesn't seem very
useful.

 [1] libgit2/libgit2#3247