Add a configurable limit to the max pack size that will be indexed #4721
Conversation
Yeah, I think that a configurable limit (via
PTAL, I've added a
include/git2/common.h (Outdated)
@@ -372,6 +374,18 @@ typedef enum {
 * > fail. (Using the FORCE flag to checkout will still overwrite
 * > these changes.)
 *
 * opts(GIT_OPT_GET_INDEXER_MAX_OBJECTS, size_t *out)
I think I'd rather name this GIT_OPT_{SET,GET}_PACK_MAX_OBJECTS. Mentioning the indexer here is simply leaking implementation details which are not of much interest to the general user.
 * > Get the maximum number of objects libgit2 will allow in a pack
 * > file when downloading a pack file from a remote. This can be
 * > used to limit maximum memory usage when fetching from an untrusted
 * > remote.
Can it really? A remote is still able to send a small number of huge objects, so I wonder whether a size limit would actually be more useful than a count limit.
That being said, should this really be a global value? It could be that somebody has two remotes, one of which they trust and another which they do not. Another scenario is packfiles generated locally, which have different trust levels than packfiles being downloaded. So I think making this an option of the indexer would be a lot saner. We'd have to provide a new way to instantiate an indexer for this, as git_indexer_new cannot be changed and does not have a git_indexer_opts parameter.
As best I can tell -- and you certainly know the code better than I do -- everything other than the index metadata is streamed directly from the remote to disk, and so does not consume memory. The number of objects (which sizes the entries and deltas vectors) is therefore the salient variable for memory (but not disk) usage.
Is there another type with an alternate constructor that takes an _opts type I can look at for prior art? In addition, the git_indexer is constructed by the pack backend right now; what would be the idiomatic way to plumb configuration all the way through from the remote or repository to that creation?
I don't have much prior experience with the indexer, so you're actually right: it will only store struct delta_infos and struct entrys. Those still result in quite a lot of memory being allocated when we have 2^32-1 objects, though.
I currently still lean towards accepting this PR with only the global option. In case the need arises to have different limits based on what the indexer is being used for, we can still add an options struct to make it configurable per indexer, with the default being the global option's value.
src/indexer.c (Outdated)
@@ -22,6 +22,8 @@

extern git_mutex git__mwindow_mutex;

size_t git_indexer__max_objects = (size_t)-1;
You should use UINT32_MAX here.
Why UINT32_MAX and not SIZE_MAX?
The original behavior capped at 2³², so UINT32_MAX would replicate that behavior. I'll push a change up shortly.
Got it, thanks.
This replicates the old behavior of limiting to 2³² objects by default.
This is needed so that google/oss-fuzz#1604 does not immediately OOM with a ~32 GB allocation.
The existing code enforces a hard limit of 2³² objects, which may be acceptable as a general default but is too large for the fuzzer's memory limits.
I'm very open to other approaches (should there be a configurable limit that the fuzzer sets?), but I'm throwing up a PR with the simplest fix I could find in order to start a discussion.