Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix shallow git clone in docker-build #33704

Merged
merged 1 commit into from Aug 1, 2017

Conversation

ecnerwala
Copy link
Contributor

- What I did
Properly test if git servers are smart HTTP for cloning.
Fixes #33701, introduced by #12502

- How I did it
Change the Head request to a Get.

Requires update on https://github.com/docker/cli.

@ecnerwala
Copy link
Contributor Author

Some more context (pasted from #33701)

docker build github.com/moby/moby.git should only clone with depth 1. However, it's fetching the entire repository. This is because we only git clone --depth 1 if we think the server is "smart", by checking with a HEAD request for /info/refs?service=git-upload-pack (code here). However, github doesn't actually support HEAD requests (Git documentation never mentions HEAD requests).

See also: #12502 (comment), which is the reason that we need to check for smart-vs-dumb http to begin with.

@GordonTheTurtle GordonTheTurtle added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Jun 16, 2017
@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Jun 16, 2017
@@ -70,7 +70,7 @@ func fetchArgs(remoteURL *url.URL, ref string) []string {
shallow := true

if strings.HasPrefix(remoteURL.Scheme, "http") {
res, err := http.Head(fmt.Sprintf("%s/info/refs?service=git-upload-pack", remoteURL))
res, err := http.Get(fmt.Sprintf("%s/info/refs?service=git-upload-pack", remoteURL))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should call res.Body.Close() on successful request.

The GET request to get all remote refs is usually quite slow but it shouldn't matter here much as we are not consuming the body. I wouldn't mind actually having a whitelist with github here to avoid this request for that host.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a common approach is to perform a HEAD request, then fallback to GET if not supported by the server, not sure if that works for this, but would avoid special-casing GitHub

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HEAD and fallback sounds reasonable, but it doesn't solve the special case with GitHub, as GitHub doesn't support HEAD. In the interest of performance, we may just want to do no requests at all with GitHub.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid special casing any particular provider whenever we can so that a) all providers can benefit from improvements (be they performance, robustness, protocol compatibility) and b) we avoid building in reliance on the behaviour's of particular services as they happen to be configured today and miss out on future improvements etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, added the initial HEAD request.

@thaJeztah
Copy link
Member

Sorry for the back and forth; discussing with @tonistiigi - we were wondering if there's still servers around that don't support this option, and what kind of error would be returned by such a server.

Perhaps we can do a "happy path" and just assume it's not supported, and if it fails, fall back to non-shallow clone?

Would you be interested in investigating that, and see if that would be a viable option?

@thaJeztah
Copy link
Member

@ijc perhaps you know something about that?

@thaJeztah thaJeztah removed this from backlog in maintainers-session Jul 20, 2017
@ijc
Copy link
Contributor

ijc commented Jul 21, 2017

"that" is the prevalence of smart http servers which can support shallow cloning?

I'm afraid that other than knowing that the smart server has been around for ages I don't really know how widespread it is.

@thaJeztah
Copy link
Member

Basically considering to "just do a shallow clone" and assuming it is supported; if (e.g.) 99% of the servers out there support this, then the extra handling is resulting in extra overhead for 99% of the cases, and only used for 1%.

I spent some time to look into what makes a "dumb" server, and how git/docker works with that. Apparently, git should fallback to the "dumb" protocol, but I guess this doesn't work if --depth 1 is set;

https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols

If the server does not respond with a Git HTTP smart service, the Git client will try to fall back to the simpler “dumb” HTTP protocol. The Dumb protocol expects the bare Git repository to be served like normal files from the web server.

This is how to test a "dumb" server;

Build a dumb server image;

$ docker build -t dumb-git -<<EOF
FROM nginx:alpine

WORKDIR /usr/share/nginx/html/
RUN apk add --no-cache git
RUN git clone --bare https://github.com/thaJeztah/pgadmin4-docker.git \
 && cd pgadmin4-docker.git \
 && git update-server-info
EOF

Start the server

$ docker run -d --name gitty -p 80:80 dumb-git

Try to clone from this server;

With --depth 1

$ git clone http://localhost/pgadmin4-docker.git --depth 1

Cloning into 'pgadmin4-docker'...
fatal: dumb http transport does not support shallow capabilities

Without (dumb)

$ git clone http://localhost/pgadmin4-docker.git
Cloning into 'pgadmin4-docker'...

Next, I built a Docker CLI, but with the feature detection disabled (i.e., always --depth 1);

$ docker build -t foo http://192.168.65.2/pgadmin4-docker.git
unable to prepare context: unable to 'git clone' to temporary context directory: error fetching: fatal: dumb http transport does not support shallow capabilities
: exit status 128

So, possibly we can detect the fatal: dumb http transport does not support shallow capabilities error, and in that case fallback?

@ijc
Copy link
Contributor

ijc commented Jul 21, 2017

Unfortunately setting up a dumb server is super trivial (expose a directory via http(s) and try to remember to run git update-server-info in it, even if you forget that many things still just work) compared with setting up a smart one (which involves cgi), so even though the smart one has been around for ages I would expect there will still be relatively significant numbers of dumb ones, especially internal/private (e.g. company) ones which we have no visibility onto. IOW I'm afraid I doubt it is as many as 99% of servers.

Perhaps the smart one supports some sort of ?query=foo type syntax which the dumb one certainly doesn't and which could be used as a probe?

@thaJeztah
Copy link
Member

Well, I was thinking if the request itself could be used; it looks to fail directly if it's not supported, so if we catch the error, and fallback to cloning without --depth=1. Or would that be really bad?

@ijc
Copy link
Contributor

ijc commented Jul 21, 2017

My concern was that if the clone failed for some other reason we don't just assume it was due to the shallow clone, perhaps matching on fatal: dumb http transport does not support shallow capabilities but that might be fragile due to differences in different git versions or due to l18n?

@ecnerwala
Copy link
Contributor Author

@ijc I think the GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 is the standard probe to differentiate between smart and dumb servers (https://www.kernel.org/pub/software/scm/git/docs/technical/http-protocol.html).

@ijc
Copy link
Contributor

ijc commented Jul 24, 2017

@ecnerwala I agree. I think the code is correct but could do with some additional comments explaining why it is looking at the things it is looking at, in particular the aspect that $GIT_URL/info/refs?service=git-upload-pack is specified for both dumb and smart servers but with distinct Content-Types specified.

For GET vs HEAD I note that the spec says WRT `HEAD:

   The HEAD method is identical to GET except that the server MUST NOT
   send a message body in the response (i.e., the response terminates at
   the end of the header section).  The server SHOULD send the same
   header fields in response to a HEAD request as it would have sent if
   the request had been a GET, except that the payload header fields
   (Section 3.3) MAY be omitted. 

So the GH implementation (which rejects HEAD with 405 Method Not Allowed) is, I suppose, compliant since it only violates a SHOULD, although it does go rather against the spirit of things.

I think the patch could be improved by believing the checking content type on a 2xx response to the HEAD request and only falling back to a GET check if the HEAD gets a non-2xx response. i.e. if the HEAD results in a 200 OK with Content-Type != application/x-git-upload-pack-advertisement then there is no need to fallback to a GET in that case, we know it is a dumb server. I don't think that's an absolute requirement though, it just avoids a redundant check for a dumb server which does respond to the HEAD based probe.

@ecnerwala ecnerwala force-pushed the 33701-shallow-clone branch 2 times, most recently from c99053b to ac87593 Compare July 24, 2017 18:18
@ecnerwala
Copy link
Contributor Author

Alright, I made the change to check for a 200 OK response.

@ecnerwala ecnerwala force-pushed the 33701-shallow-clone branch 3 times, most recently from 3654877 to 7853907 Compare July 24, 2017 19:05
// Only smart-HTTP git servers support shallow clone

// URL to test for the service
serviceURL := fmt.Sprintf("%s/info/refs?service=git-upload-pack", remoteURL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can just be remoteURL + "/info/refs?service=git-upload-pack" rather than needing fmt.Sprintf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 if not needed, it's best to avoid Sprintf

@ijc
Copy link
Contributor

ijc commented Jul 25, 2017

You should probably use http.StatusOK rather than the hard coded 200 but otherwise LGTM (I left one minor nit as a comment).

I don't know enough about these things to say if you should also accept other (or all) 2xx results, I think we can leave it as you have it unless some expert says otherwise.

@ecnerwala
Copy link
Contributor Author

I originally wrote it to accept all 2xx results, but the git spec actually specifies that the client should validate that the response is 200 OK or 304 Not Modified (for cached entries), so I went with this instead. If anyone knows better, please let me know!

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some suggestions, but let me know what you think

args = append(args, "--depth", "1")
}

return append(args, "origin", ref)
}

func supportsShallowCheckout(remoteURL string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of putting inline comments for each step, a better approach would be to describe the global intent of this function (oh, should this be called supportsShallowClone (clone instead of checkout)? Something like;

// supportsShallowClone checks if a remote git repository supports
// shallow cloning. Shallow clone is supported over non-HTTP connections
// and HTTP connections to smart servers.

args = append(args, "--depth", "1")
}

return append(args, "origin", ref)
}

func supportsShallowCheckout(remoteURL string) bool {
// Check if the URL is an HTTP url
if urlutil.IsURL(remoteURL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an early return here prevents having to explain the HTTP <--> non HTTP twice (once here, and once before the return true, e.g.;

if !urlutil.IsURL(remoteURL) {
	// non-HTTP protocols always support shallow clones
	return true
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it in this form because I'm afraid we might eventually have special logic for other transport types. I think this really should look like a switch protocol statement with a default, but that doesn't quite work here.

contentTypeResponse := "application/x-git-upload-pack-advertisement"

// First try a HEAD request
res, err := http.Head(serviceURL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you tried to reduce complexity here by keeping the HEAD and GET parts separate, however doing so probably made it harder to grasp;

  • the contentTypeResponse variable is defined a lot further from where it's used
  • explaining things like "if x succeeds", "check content type" should not be needed; the code can be self-explanatory

If you combine the paths, a single explanation of what we're doing can be sufficient;

// try HEAD, fallback to GET if not supported
res, err := http.Head(serviceURL)
if err != nil {
	res, err = http.Get(serviceURL)
	if err != nil {
		return false
	}
	res.Body.Close()
}

After that, you'd just have to check the response code (should be self-explanatory), and the content-type (may need a comment), something like;

if res.StatusCode != http.StatusOK {
	return false
}
if res.Header.Get("Content-Type") != "application/x-git-upload-pack-advertisement" {
	// not a smart server
	return false
}
return true

I dropped the contentTypeResponse variable, in favour of a comment explaining we detected it's not a "smart" server. The function description itself already explained that that means "shallow clone" is not supported

Although this would've worked for the return;

return res.Header.Get("Content-Type") == "application/x-git-upload-pack-advertisement"

(mainly "taste"); I think it's less easy to grasp here, and having all returns false (early fail) makes it slightly more consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, definitely looks a bit cleaner. Thanks!

If the HEAD request fails, use a GET request to properly test if git
server is smart-http.

Signed-off-by: Andrew He <he.andrew.mail@gmail.com>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@yongtang
Copy link
Member

yongtang commented Aug 1, 2017

ping @tonistiigi

@moby moby deleted a comment from GordonTheTurtle Aug 1, 2017
Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yongtang
Copy link
Member

yongtang commented Aug 1, 2017

Thanks all for the review. The PR could be merged now as all Jenkins tests have passed as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants