New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inrepoconfig: optimize secondary clones #30400
inrepoconfig: optimize secondary clones #30400
Conversation
@listx: GitHub didn't allow me to request PR reviews from the following users: airbornepony. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold |
070db4b
to
b507e30
Compare
/retest |
859d799
to
760cf04
Compare
FTR the unit tests were previously failing because the version of git we use in CI is 2.30.2, which is older and does not exhibit the same behavior as I've seen in, for example 2.41. Notably in 2.30.2 the toplevel paths are not automatically populated in a sparse checkout. I've added a TODO comment for this because I don't think it's easy to update the version of git in CI (we'd have to update the bootstrap image, which uses debian and we know how long debian takes to update). Another solution would be to add a custom apt repo to get a newer git version for the bootstrap image, but the fix I've done here is simpler with virtually no practical difference. EDIT: Currently the git base image we use for some of our components (e.g. gerrit) uses version 2.32.0 which also exhibits the same "toplevel is not populated by default" behavior. |
prow/config/inrepoconfig.go
Outdated
// We only need ".prow" because the default "cone mode" of sparse | ||
// checkouts already include files at the toplevel (which would include | ||
// ".prow.yaml"). | ||
SparseCheckoutDirs: &[]string{".prow"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use inRepoConfigDirName instead of the literal. Also, does the TODO above apply, that you need the top-level file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
prow/git/v2/interactor_test.go
Outdated
{ | ||
name: "blank RepoOpts", | ||
dir: "/else", | ||
from: "/somewhere", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using strings "/dir" and "/from" improves readability of expected calls below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update to use "/mirrorclone" and "/secondaryclone" for even more readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
No functional change.
6270c39
to
91e0dcb
Compare
prow/git/v2/client_factory.go
Outdated
repoOpts := RepoOpts{ | ||
SparseCheckoutDirs: nil, | ||
ShareObjectsWithSourceRepo: false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually just RepoOpts{}, and it could be inlined in the call below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this originally to make things more explicit (to show the reader "hey, we're not using these options"). I'm OK with inlining it. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I've updated the PR to use Git 2.36+ semantics. So the unit test will fail in CI (because it's using Git 2.30.2). Once the CI image is updated (part of it is in #30409), we can retest and the unit test should pass again. |
cea211c
to
d44c183
Compare
I've made it use pre-Git-2.36 semantics, because we don't want to be blocked by the various image upgrade processes. We can do the Git upgrade sometime later when we bump our Git version in kubekins-e2e (CI) and prow components. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only have minor nits. LGTM.
@@ -59,6 +59,12 @@ func (a *clientFactoryAdapter) ClientFor(org, repo string) (RepoClient, error) { | |||
return &repoClientAdapter{Repo: r}, err | |||
} | |||
|
|||
// v1 clients do not support customizing pre-repo options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just always return an "unsupported" error from this function then. That way if it is accidentally used, the problem won't be subtle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try this already, but then we have to delete one of our tests because it will fail:
testDefaultProwYAMLGetter(localgit.New, t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to remove that test then, right? If that causes the test to fail that is because this is a breaking change and we don't support the prow YAML getter with the v1 client anymore.
So I think we either need to always return an error from the unsupported function and remove the obsolete test, or properly implement this in the v1 client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, technically the v1 client is supported for the prow YAML getter. But it will be less optimized than the v2 client, because the v1 client won't use the --sparse or --shared flags.
Granted, we are going to try to disable repo serialization altogether, so when we cross that bridge (soon), the v1 client will be basically unusable for the prow YAML getter. So I am OK with removing the obsolete test. Will update (err, this PR is merged now, so will have to do it as a follow-up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prow/git/v2/interactor.go
Outdated
cloneArgs = append(cloneArgs, from) | ||
cloneArgs = append(cloneArgs, i.dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: These could be combined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I.e., cloneArgs = append(cloneArgs, []string{from, i.dir}...)
, I think. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
prow/git/v2/client_factory.go
Outdated
// reachable from the root. If non-nil and non-empty, then those additional | ||
// directories from the root are also checked out (populated) in the working | ||
// tree, recursively. | ||
SparseCheckoutDirs *[]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We don't need to make this a pointer value. Slices can already be nil
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... true. I forgot. I.e.,
var foo []string // foo is nil
foo := []string{} // foo is an empty slice (not nil)
Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
d44c183
to
d4701fb
Compare
The only data we need from a clone inrepoconfig repo is the .prow contents (aside: we clone it in the first place so that we can merge to the base branch). The underlying Git client actually uses 2 clones --- first a mirror clone (git clone --mirror ...), and then a secondary local clone into a temporary folder (git clone ...). This secondary clone is the one we are optimizing here. This adds 2 optimizations: (1) Avoid populating object files with "--shared" flag. This way, the .git folder of the secondary clone can just have a pointer to the mirror clone. This saves a lot of disk space for large repos (not to mention that both the CPU time and disk usage for creating this clone is vastly reduced, especially when combined with the next optimization). (2) Avoid a full checkout with sparse-checkout (supported since Git 2.25). This way, we can avoid deflating Git blob objects into file paths on disk for everything except (a) files at the toplevel and (b) files specified by the "git sparse-checkout set <PATH> [...<PATH>]" command. The toplevel files are included as part of "cone mode" which is the recommended mode for sparse checkouts. In our case we specify the ".prow" folder (inRepoConfigDirName) which is the only path to make sure we get those inrepoconfig files checked out. Sadly, for Git versions 2.32 (which we currently use in our components), the toplevel files are *not* included by default. So we have to specify it separately, which is why we have inRepoConfigFileName also in SparseCheckoutDirs. In Git 2.41 (probably since 2.36) specifying inRepoConfigFileName will make Git exit with an error because Git will see that the `.prow.yaml` is a file, not a folder [1]. So before we upgrade our components to use Git 2.36+ we have to delete the inRepoConfigFileName from this list. [1] git/git@4ce5043
d4701fb
to
8e5361f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One remaining style nit that doesn't affect behavior. Lets get this merged.
/unhold
cloneArgs = append(cloneArgs, "--sparse") | ||
} | ||
|
||
cloneArgs = append(cloneArgs, []string{from, i.dir}...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cloneArgs = append(cloneArgs, []string{from, i.dir}...) | |
cloneArgs = append(cloneArgs, from, i.dir) |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cjwagner, listx The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ests Before 916ed2b (Merge pull request kubernetes#30400 from listx/inrepoconfig-optimizations, 2023-08-17), we were under the impression that the secondary clones were unavoidably large. And so, we created the InRepoConfigGitCache in 3098b88 (Merge pull request would never delete and recreate the secondary clone for every request --- instead we would reuse the existing secondary clones across requests if those requests were for the same repo. This resulted in the need to ensure that only one such request would touch the associated secondary clone at any given time, which meant that we had to serialize the use of these secondary clones. But because the inrepoconfig repos we clone are so cheap now, there's no need to try to reuse the same one during the lifetime of a process (this is the point of InRepoConfigGitCache). We just create and delete each time we need to service a request. In manual tests, we saw reductions in size from 3.1GB to just 19MB for the disk space used for a secondary clone for a large repo using inrepoconfigs (where the bulk of the 19MB came from other unrelated files in the toplevel folder of the repo, due to the way sparse checkouts work). Note that we already have a call to repo.Clean() inside prowYAMLGetter, so that will delete the repo after we read the inrepoconfig values. However, the main "gitcache" directory (holding the mirror clone) will still be around and will not be deleted.
…ests Before 916ed2b (Merge pull request kubernetes#30400 from listx/inrepoconfig-optimizations, 2023-08-17), we were under the impression that the secondary clones were unavoidably large. So we created the InRepoConfigGitCache in 3098b88 (Merge pull request kubernetes#23264 from cjwagner/inrepo-config-cache, 2021-08-18) to make it so we would never delete and recreate the secondary clone for every request --- instead we would reuse the existing secondary clones across requests if those requests were for the same repo. This resulted in the need to ensure that only one such request would touch the associated secondary clone at any given time, which meant that we had to serialize the use of these secondary clones. But because the inrepoconfig repos we clone are so cheap now, there's no need to try to reuse the same one during the lifetime of a process. We just create and delete each time we need to service a request. In manual tests, we saw reductions in size from 3.1GB to just 19MB for the disk space used for a secondary clone for a large repo using inrepoconfigs (where the bulk of the 19MB came from other unrelated files in the toplevel folder of the repo, due to the way sparse checkouts work). Note that we already have a call to repo.Clean() inside prowYAMLGetter, so that will delete the repo after we read the inrepoconfig values. However, the main "gitcache" directory (holding the mirror clone) will still be around and will not be deleted.
…ests Before 916ed2b (Merge pull request kubernetes#30400 from listx/inrepoconfig-optimizations, 2023-08-17), we were under the impression that the secondary clones of inrepoconfig repos were unavoidably large. So we created the InRepoConfigGitCache wrapper in 3098b88 (Merge pull request kubernetes#23264 from cjwagner/inrepo-config-cache, 2021-08-18) to make it so we would never delete and recreate the secondary clone for every request (the default behavior of the wrapped git v2 client); instead, we would reuse the existing secondary clones across requests if those requests were for the same repo. This resulted in the need to ensure that only one such request would touch the associated secondary clone at any given time, which meant that we had to serialize the use of these secondary clones. But because the secondary clones are so cheap now, there's no need to try to reuse the same one during the lifetime of a process. We just create and delete each time we need to service a request. In manual tests, we saw reductions in size from 3.1GB to just 19MB for the disk space used for a secondary clone for a large repo using inrepoconfigs (where the bulk of the 19MB came from other unrelated files in the toplevel folder of the repo, due to the way sparse checkouts work). Note that we already have a call to repo.Clean() inside prowYAMLGetter, so that will delete the repo after we read the inrepoconfig values. However, the main "gitcache" directory (holding the mirror clone) will still be around and will not be deleted.
…ests Before 916ed2b (Merge pull request kubernetes#30400 from listx/inrepoconfig-optimizations, 2023-08-17), we were under the impression that the secondary clones of inrepoconfig repos were unavoidably large. So we created the InRepoConfigGitCache wrapper in 3098b88 (Merge pull request kubernetes#23264 from cjwagner/inrepo-config-cache, 2021-08-18) to make it so we would never delete and recreate the secondary clone for every request (the default behavior of the wrapped git v2 client); instead, we would reuse the existing secondary clones across requests if those requests were for the same repo. This resulted in the need to ensure that only one such request would touch the associated secondary clone at any given time, which meant that we had to serialize the use of these secondary clones. It turns out that the secondary clone optimizations in 916ed2b make these clones cheaper on disk by multiple orders of magnitude. In manual tests, we saw reductions in size from 3.1GB to just 19MB for the disk space used for a secondary clone for a large repo using inrepoconfigs (where the bulk of the 19MB came from other unrelated files in the toplevel folder of the repo, due to the way sparse checkouts work). Because the secondary clones are so cheap now, there's no need to try to reuse the same one during the lifetime of a process. We just create and delete each time we need to service a request. Note that we already have a call to repo.Clean() inside prowYAMLGetter, so that will delete the repo after we read the inrepoconfig values. However, the main "gitcache" directory (holding the mirror clone) will still be around and will not be deleted.
In case anyone else is searching for it: This change broke the generated tests in our product, as we had a symlink from |
@xrstf Thanks for the report. I did not foresee that users used symlinks like that though, which is why I didn't make an announcement. WRT technical reasons behind the breakage, I think this is because when we do sparse checkouts, we only populate the I'll make an announcement in our docs page. |
See kubernetes-sigs/prow#65 for the announcement update to our docs. |
Thanks :-) I admit that our constellation with symlinks was kinda ... unique and we happily changed our repo structure to benefit from the sparse checkouts. |
The only data we need from a clone inrepoconfig repo is the .prow contents (aside: we clone it in the first place so that we can merge to the base branch).
The underlying Git client actually uses 2 clones --- first a mirror clone (git clone --mirror ...), and then a secondary local clone into a temporary folder (git clone ...). This secondary clone is the one we are optimizing here.
This adds 2 optimizations:
(1) Avoid populating object files with "--shared" flag. This way, the .git folder of the secondary clone can just have a pointer to the mirror clone. This saves a lot of disk space for large repos (not to mention that both the CPU time and disk usage for creating this clone is vastly reduced, especially when combined with the next optimization).
(2) Avoid a full checkout with sparse-checkout (supported since Git 2.25). This way, we can avoid deflating Git blob objects into file paths on disk for everything except (a) files at the toplevel and (b) files specified by the "git sparse-checkout set [...]" command. The toplevel files are included as part of "cone mode" which is the recommended mode for sparse checkouts. In our case we specify the ".prow" folder as the only path to make sure we get those inrepoconfig files checked out.
/cc @cjwagner @airbornepony @timwangmusic