Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git package install optimizations #38

Merged
merged 17 commits into from
Sep 30, 2019

Conversation

bgagnon
Copy link
Collaborator

@bgagnon bgagnon commented Jul 31, 2019

This introduces three distinct optimizations for install operations on Git packages (the only type supported in jb at this point):

  • using GitHub tarball archive download whenever possible
  • using shallow clone whenever possible
  • use sparse checkouts whenever possible

GitHub archive download

The most efficient case. If the repo is from https://github.com, download a tarball at the requested revision. The ETag header contains the resolved commit SHA1. When unpacking the tarball, we can skip anything that isn't under the SubDir path since it won't be needed.

This does not require authentication with GitHub. At least, not for public repositories. Perhaps this needs some adjustment to avoid breaking private repo compatibility.

Shallow clones

If the server supports it, perform a shallow clone (actually a fetch) of --depth 1 so only the objects for the requested revision are downloaded. Not all servers support this, and some do it only if the request is reasonable (ie. it won't cause too much load on its side). For this reason, recover from the error and do a full fetch (all revisions). This has been tested to work with GitLab and GitHub (in some cases, but the archive download fast track covers those).

This change replaces the previous git clone step with distinct git init, git remote add and git fetch steps. This has the side effect of not creating an extra working copy before the git checkout step (ie. equivalent to git clone -n).

Sparse checkouts

For cases where a sub directory is specified, there is no need to create a full working copy; the subdir path can act as a filter. Remember that there is no need to end up with a functional git repository since it will all be deleted by the time the package installer is done.

Benchmarks

On kube-prometheus:

before:

$ time jb install
real    0m32.044s
user    0m19.452s
sys     0m7.984s

after:

$ time jb install
real    0m7.921s
user    0m1.575s
sys     0m0.622s

YMMV.

Will happily accept any criticism on my Go style as I am still learning this language and its standard library. For a future improvement, I would love to make the package installs concurrent (or at least, the IO-bound operations like downloads).

There's an explicit git checkout command
issued moments later, so there's no need
to create a working copy during the clone.
If a SubDir is configured for the package,
everything but that directory will be thrown
away after the package is installed.
If the server supports it, fetch a specific
revision with --depth 1. Otherwise, fall back
to the normal fetch.

This replaces the previous "clone" operation. The bandwidth and time savings
can be significant depending on the history
of the repository (number of commits).
@bgagnon
Copy link
Collaborator Author

bgagnon commented Jul 31, 2019

@davidovich please have a look when you have a chance :)

Copy link
Collaborator

@sh0rez sh0rez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good to me so far! Some minor things to adress.

👍

cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Dir = dir
err = cmd.Run()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This err should be checked that it is actually a failed fetch and not something else.

pkg/git.go Show resolved Hide resolved
pkg/git.go Show resolved Hide resolved
pkg/git.go Outdated Show resolved Hide resolved
pkg/git.go Outdated Show resolved Hide resolved
pkg/git.go Outdated Show resolved Hide resolved
pkg/git.go Outdated Show resolved Hide resolved
pkg/git.go Outdated Show resolved Hide resolved
pkg/git.go Show resolved Hide resolved
archiveFilepath := fmt.Sprintf("%s.tar.gz", dir)

defer os.Remove(archiveFilepath)
commitSha, err := downloadGitHubArchive(archiveFilepath, archiveUrl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future providers (gitlab which has a similar archive functionality) I would fold the archive URL computation inside a downloadArchive function. This one could return a struct containing sha or version, depending on the capabilities of the service. (I believe gitlab does not support sha1 Etag).

return commitSha, nil
}

cmd := exec.CommandContext(ctx, "git", "init")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the git clone --depth 1 --branch [version] here instead of the git init dance ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find a way to combine shallow fetch and sparse checkout into a single git clone command, hence this init + fetch + checkout combo. Do you have an idea?

@metalmatze
Copy link
Contributor

These are awesome improvements!!! 🎉
I'm currently on vacation, but last week I've basically finished #36. Only one test is not passing for some reason.
Could you make sure to base your work on that PR? Once that one is merged, git will probably tell your about a few conflicts.

@brancz
Copy link
Contributor

brancz commented Sep 9, 2019

Sorry for missing this. I'm happy to review this once the above comments are addressed and rebased.

Very awesome work, thank you! 🙂

@olivierlemasle
Copy link
Contributor

This looks a great improvement! 🎉

@bgagnon Did you had time to address the comments and rebase?

Copy link
Collaborator Author

@bgagnon bgagnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged with master and addressed some review comments

@bgagnon
Copy link
Collaborator Author

bgagnon commented Sep 29, 2019

Hmm something went wrong in the merge; the new tmpDir is not handled properly for archives.

@bgagnon
Copy link
Collaborator Author

bgagnon commented Sep 29, 2019

Alright, better now. I notice the unit tests do not do any assertions on the contents of the installed package inside the vendor directory. That would make a good addition.

@brancz
Copy link
Contributor

brancz commented Sep 30, 2019

That would make a good addition.

Agreed.

As for now, this is in good shape I think. Let's move ahead with it! Thanks a lot for taking the time to do these very awesome improvements! 🎉

@brancz brancz merged commit 10e24cb into jsonnet-bundler:master Sep 30, 2019
@metalmatze
Copy link
Contributor

jb is sooooo fast now. Thanks so much! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants