Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch negotiation broken, fetches way more than it needs #6532

Open
arroz opened this issue Mar 18, 2023 · 3 comments
Open

fetch negotiation broken, fetches way more than it needs #6532

arroz opened this issue Mar 18, 2023 · 3 comments

Comments

@arroz
Copy link
Contributor

arroz commented Mar 18, 2023

TLDR: The implementation of the negotiation protocol between client and server when fetching is broken in multiple ways. When a libgit2 client fetches from a server, this may cause the server to generate an extremely large pack containing mostly objects already present in the client.

Detailed, long version: During the fetch negotiation process, after receiving the full list of advertised references and respective OIDs from the server, the client sends a list of "want" commands with the requested OIDs, and "have" commands with a list of the commits it already has, in a way that minimized the amount of objects that need to be transfered.

According to core git's documentation, in a simplified way, the "haves" are generated by queuing the peeled commit OIDs of all the heads the client knows about, sorted by commit date (most recent at the head of the queue). Then, a loop is performed by popping the OID at the head of the queue, writing a "have" line for it, grabbing its parents, and inserting them on the queue, sorting them by commit date as well during the insertion. This loop is repeated at most 256 times, in batches of 20.

libgit2 is using the wrong order due to multiple bugs. Because of this, the OIDs used for the "have" commands are essentially arbitrary, and can be from commits years ago, instead of the most recent ones. When the server builds the packfile, it will assume those are the most recent the client has, and will send a huge amount of objects based on that wrong assumption.

Reproduction steps

Reproducing this reliably is a bit hard due to the arbitrary nature of the problem (diagnostics show the order being used is the alphabetical order of the reference names, not the commit dates). But I've managed to find a situation that is reliably reproduced.

  1. Setup a bare clone of the libgit2 repository in a location you can manipulate the file system directly. I used a VM with an HTTP server so I could easily inspect the traffic, but YMMV.

  2. Replace the packed-refs file with the following one. This will bring the repository to how it was around March 16, 2023. packed-refs-old.txt

  3. Create two regular clones of the clone created in step 1.

  4. In the clone created in step 1, replace packed-refs with the following version. This will add a few commits to two branches that were being worked on at the time, ethomson/nonblocking and ethomson/schannel-2.
    packed-refs-new.txt

  5. In one of the clones created on step 3, run core git command: git fetch --all. Note how much data it receives over the wire.

  6. In the other clone created on step 3, use libgit2 to run a fetch operation, with a callback that prints the amount of received data. Note that it's almost 10 MB.

@arroz
Copy link
Contributor Author

arroz commented Mar 18, 2023

I have a fix for this, will submit in the next few days, since I'm trying to fix an additional problem.

arroz added a commit to cavaquinho/libgit2 that referenced this issue Mar 19, 2023
This commit fixes the following issues:

1. In `git_revwalk__push_commit`, if `opts->insert_by_date` is true, `git_commit_list_insert_by_date` is called. However, by this point the commit wasn’t parsed yet, so the `time` field still has 0 as value. Solved by parsing the commit immediately if `opts->insert_by_date` is true.

2. In the same function, there was an error in the boolean logic. When `opts->insert_by_date` was true, the commit would be inserted twice, first “by date” (not really, due to the issue described above) and then in the first position again. Logic was corrected.

3. In `prepare_walk`, when processing `user_input` and building the `commits` list, the order was being inverted. Assuming both fixes above, this would mean we would start negotiation by the oldest reference, not the newest one. This was fixed by producing a `commits` list in the same order as `user_input`.

The output list for the list of “have” statements during the negotiation is still not the same as core git’s because there is an optimization missing (excluding ancestors of commits known to be common between the client and the server) but this commit brings it much closer to core git’s. Specifically, the example on the libgit2#6532 issue description now fetches exactly the same objects than core git, and other examples I tested as well.
arroz added a commit to cavaquinho/libgit2 that referenced this issue Mar 27, 2023
This commit fixes the following issues:

1. In `git_revwalk__push_commit`, if `opts->insert_by_date` is true, `git_commit_list_insert_by_date` is called. However, by this point the commit wasn’t parsed yet, so the `time` field still has 0 as value. Solved by parsing the commit immediately if `opts->insert_by_date` is true.

2. In the same function, there was an error in the boolean logic. When `opts->insert_by_date` was true, the commit would be inserted twice, first “by date” (not really, due to the issue described above) and then in the first position again. Logic was corrected.

3. In `prepare_walk`, when processing `user_input` and building the `commits` list, the order was being inverted. Assuming both fixes above, this would mean we would start negotiation by the oldest reference, not the newest one. This was fixed by producing a `commits` list in the same order as `user_input`.

The output list for the list of “have” statements during the negotiation is still not the same as core git’s because there is an optimization missing (excluding ancestors of commits known to be common between the client and the server) but this commit brings it much closer to core git’s. Specifically, the example on the libgit2#6532 issue description now fetches exactly the same objects than core git, and other examples I tested as well.
arroz added a commit to cavaquinho/libgit2 that referenced this issue Mar 27, 2023
Negotiation code updated to pretty much match core git’s behaviour (v0/1).
@arroz
Copy link
Contributor Author

arroz commented Mar 27, 2023

@ethomson @carlosmn I split this into two PRs. The first one is simple and can go immediately if you're OK with it. The second one is a bit more problematic: although it works, it may increase the probability of timeout-related errors since we don't recover an HTTP keep-alive connection if it's closed. Can you read the comments on the second PR (#6540) and let me know your thoughts and a bit of guidance? Thank you! 😊

ethomson added a commit that referenced this issue Jul 15, 2023
@ethomson
Copy link
Member

Sorry I haven't had much time to think about this lately. I feel like we're still a bit stuck on what to do around timeouts. I'd like to think a bit more about the various work we have in flight but I did go ahead and merge the first PR. Thanks for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants