New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fetch negotiation broken, fetches way more than it needs #6532
Comments
I have a fix for this, will submit in the next few days, since I'm trying to fix an additional problem. |
This commit fixes the following issues: 1. In `git_revwalk__push_commit`, if `opts->insert_by_date` is true, `git_commit_list_insert_by_date` is called. However, by this point the commit wasn’t parsed yet, so the `time` field still has 0 as value. Solved by parsing the commit immediately if `opts->insert_by_date` is true. 2. In the same function, there was an error in the boolean logic. When `opts->insert_by_date` was true, the commit would be inserted twice, first “by date” (not really, due to the issue described above) and then in the first position again. Logic was corrected. 3. In `prepare_walk`, when processing `user_input` and building the `commits` list, the order was being inverted. Assuming both fixes above, this would mean we would start negotiation by the oldest reference, not the newest one. This was fixed by producing a `commits` list in the same order as `user_input`. The output list for the list of “have” statements during the negotiation is still not the same as core git’s because there is an optimization missing (excluding ancestors of commits known to be common between the client and the server) but this commit brings it much closer to core git’s. Specifically, the example on the libgit2#6532 issue description now fetches exactly the same objects than core git, and other examples I tested as well.
This commit fixes the following issues: 1. In `git_revwalk__push_commit`, if `opts->insert_by_date` is true, `git_commit_list_insert_by_date` is called. However, by this point the commit wasn’t parsed yet, so the `time` field still has 0 as value. Solved by parsing the commit immediately if `opts->insert_by_date` is true. 2. In the same function, there was an error in the boolean logic. When `opts->insert_by_date` was true, the commit would be inserted twice, first “by date” (not really, due to the issue described above) and then in the first position again. Logic was corrected. 3. In `prepare_walk`, when processing `user_input` and building the `commits` list, the order was being inverted. Assuming both fixes above, this would mean we would start negotiation by the oldest reference, not the newest one. This was fixed by producing a `commits` list in the same order as `user_input`. The output list for the list of “have” statements during the negotiation is still not the same as core git’s because there is an optimization missing (excluding ancestors of commits known to be common between the client and the server) but this commit brings it much closer to core git’s. Specifically, the example on the libgit2#6532 issue description now fetches exactly the same objects than core git, and other examples I tested as well.
Negotiation code updated to pretty much match core git’s behaviour (v0/1).
@ethomson @carlosmn I split this into two PRs. The first one is simple and can go immediately if you're OK with it. The second one is a bit more problematic: although it works, it may increase the probability of timeout-related errors since we don't recover an HTTP keep-alive connection if it's closed. Can you read the comments on the second PR (#6540) and let me know your thoughts and a bit of guidance? Thank you! 😊 |
Partial fix for #6532: insert-by-date order.
Sorry I haven't had much time to think about this lately. I feel like we're still a bit stuck on what to do around timeouts. I'd like to think a bit more about the various work we have in flight but I did go ahead and merge the first PR. Thanks for this. |
TLDR: The implementation of the negotiation protocol between client and server when fetching is broken in multiple ways. When a libgit2 client fetches from a server, this may cause the server to generate an extremely large pack containing mostly objects already present in the client.
Detailed, long version: During the fetch negotiation process, after receiving the full list of advertised references and respective OIDs from the server, the client sends a list of "want" commands with the requested OIDs, and "have" commands with a list of the commits it already has, in a way that minimized the amount of objects that need to be transfered.
According to core git's documentation, in a simplified way, the "haves" are generated by queuing the peeled commit OIDs of all the heads the client knows about, sorted by commit date (most recent at the head of the queue). Then, a loop is performed by popping the OID at the head of the queue, writing a "have" line for it, grabbing its parents, and inserting them on the queue, sorting them by commit date as well during the insertion. This loop is repeated at most 256 times, in batches of 20.
libgit2 is using the wrong order due to multiple bugs. Because of this, the OIDs used for the "have" commands are essentially arbitrary, and can be from commits years ago, instead of the most recent ones. When the server builds the packfile, it will assume those are the most recent the client has, and will send a huge amount of objects based on that wrong assumption.
Reproduction steps
Reproducing this reliably is a bit hard due to the arbitrary nature of the problem (diagnostics show the order being used is the alphabetical order of the reference names, not the commit dates). But I've managed to find a situation that is reliably reproduced.
Setup a bare clone of the libgit2 repository in a location you can manipulate the file system directly. I used a VM with an HTTP server so I could easily inspect the traffic, but YMMV.
Replace the
packed-refs
file with the following one. This will bring the repository to how it was around March 16, 2023. packed-refs-old.txtCreate two regular clones of the clone created in step 1.
In the clone created in step 1, replace
packed-refs
with the following version. This will add a few commits to two branches that were being worked on at the time,ethomson/nonblocking
andethomson/schannel-2
.packed-refs-new.txt
In one of the clones created on step 3, run core git command:
git fetch --all
. Note how much data it receives over the wire.In the other clone created on step 3, use libgit2 to run a fetch operation, with a callback that prints the amount of received data. Note that it's almost 10 MB.
The text was updated successfully, but these errors were encountered: