Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inrepoconfig: optimize secondary clones #30400

Merged
merged 2 commits into from Aug 18, 2023

Conversation

listx
Copy link
Contributor

@listx listx commented Aug 17, 2023

The only data we need from a clone inrepoconfig repo is the .prow contents (aside: we clone it in the first place so that we can merge to the base branch).

The underlying Git client actually uses 2 clones --- first a mirror clone (git clone --mirror ...), and then a secondary local clone into a temporary folder (git clone ...). This secondary clone is the one we are optimizing here.

This adds 2 optimizations:

(1) Avoid populating object files with "--shared" flag. This way, the .git folder of the secondary clone can just have a pointer to the mirror clone. This saves a lot of disk space for large repos (not to mention that both the CPU time and disk usage for creating this clone is vastly reduced, especially when combined with the next optimization).

(2) Avoid a full checkout with sparse-checkout (supported since Git 2.25). This way, we can avoid deflating Git blob objects into file paths on disk for everything except (a) files at the toplevel and (b) files specified by the "git sparse-checkout set [...]" command. The toplevel files are included as part of "cone mode" which is the recommended mode for sparse checkouts. In our case we specify the ".prow" folder as the only path to make sure we get those inrepoconfig files checked out.

/cc @cjwagner @airbornepony @timwangmusic

@k8s-ci-robot
Copy link
Contributor

@listx: GitHub didn't allow me to request PR reviews from the following users: airbornepony.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

The only data we need from a clone inrepoconfig repo is the .prow contents (aside: we clone it in the first place so that we can merge to the base branch).

The underlying Git client actually uses 2 clones --- first a mirror clone (git clone --mirror ...), and then a secondary local clone into a temporary folder (git clone ...). This secondary clone is the one we are optimizing here.

This adds 2 optimizations:

(1) Avoid populating object files with "--shared" flag. This way, the .git folder of the secondary clone can just have a pointer to the mirror clone. This saves a lot of disk space for large repos (not to mention that both the CPU time and disk usage for creating this clone is vastly reduced, especially when combined with the next optimization).

(2) Avoid a full checkout with sparse-checkout (supported since Git 2.25). This way, we can avoid deflating Git blob objects into file paths on disk for everything except (a) files at the toplevel and (b) files specified by the "git sparse-checkout set [...]" command. The toplevel files are included as part of "cone mode" which is the recommended mode for sparse checkouts. In our case we specify the ".prow" folder as the only path to make sure we get those inrepoconfig files checked out.

/cc @cjwagner @airbornepony @timwangmusic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/prow Issues or PRs related to prow sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 17, 2023
@listx
Copy link
Contributor Author

listx commented Aug 17, 2023

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2023
@listx listx force-pushed the inrepoconfig-optimizations branch from 070db4b to b507e30 Compare August 17, 2023 03:54
@listx
Copy link
Contributor Author

listx commented Aug 17, 2023

/retest

@listx listx force-pushed the inrepoconfig-optimizations branch 3 times, most recently from 859d799 to 760cf04 Compare August 17, 2023 16:02
@listx
Copy link
Contributor Author

listx commented Aug 17, 2023

FTR the unit tests were previously failing because the version of git we use in CI is 2.30.2, which is older and does not exhibit the same behavior as I've seen in, for example 2.41. Notably in 2.30.2 the toplevel paths are not automatically populated in a sparse checkout. I've added a TODO comment for this because I don't think it's easy to update the version of git in CI (we'd have to update the bootstrap image, which uses debian and we know how long debian takes to update). Another solution would be to add a custom apt repo to get a newer git version for the bootstrap image, but the fix I've done here is simpler with virtually no practical difference.

EDIT: Currently the git base image we use for some of our components (e.g. gerrit) uses version 2.32.0 which also exhibits the same "toplevel is not populated by default" behavior.

// We only need ".prow" because the default "cone mode" of sparse
// checkouts already include files at the toplevel (which would include
// ".prow.yaml").
SparseCheckoutDirs: &[]string{".prow"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use inRepoConfigDirName instead of the literal. Also, does the TODO above apply, that you need the top-level file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
name: "blank RepoOpts",
dir: "/else",
from: "/somewhere",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings "/dir" and "/from" improves readability of expected calls below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update to use "/mirrorclone" and "/secondaryclone" for even more readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@listx listx force-pushed the inrepoconfig-optimizations branch 2 times, most recently from 6270c39 to 91e0dcb Compare August 17, 2023 18:36
repoOpts := RepoOpts{
SparseCheckoutDirs: nil,
ShareObjectsWithSourceRepo: false,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually just RepoOpts{}, and it could be inlined in the call below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this originally to make things more explicit (to show the reader "hey, we're not using these options"). I'm OK with inlining it. Will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@listx
Copy link
Contributor Author

listx commented Aug 17, 2023

I've updated the PR to use Git 2.36+ semantics. So the unit test will fail in CI (because it's using Git 2.30.2). Once the CI image is updated (part of it is in #30409), we can retest and the unit test should pass again.

@listx listx force-pushed the inrepoconfig-optimizations branch 2 times, most recently from cea211c to d44c183 Compare August 17, 2023 21:29
@listx
Copy link
Contributor Author

listx commented Aug 17, 2023

I've made it use pre-Git-2.36 semantics, because we don't want to be blocked by the various image upgrade processes. We can do the Git upgrade sometime later when we bump our Git version in kubekins-e2e (CI) and prow components.

Copy link
Member

@cjwagner cjwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only have minor nits. LGTM.

@@ -59,6 +59,12 @@ func (a *clientFactoryAdapter) ClientFor(org, repo string) (RepoClient, error) {
return &repoClientAdapter{Repo: r}, err
}

// v1 clients do not support customizing pre-repo options.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just always return an "unsupported" error from this function then. That way if it is accidentally used, the problem won't be subtle.

Copy link
Contributor Author

@listx listx Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try this already, but then we have to delete one of our tests because it will fail:

testDefaultProwYAMLGetter(localgit.New, t)
. Doing it the current way still lets that test pass (and I didn't think too deeply here but it felt cleaner to do this than to delete that test altogether).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to remove that test then, right? If that causes the test to fail that is because this is a breaking change and we don't support the prow YAML getter with the v1 client anymore.

So I think we either need to always return an error from the unsupported function and remove the obsolete test, or properly implement this in the v1 client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, technically the v1 client is supported for the prow YAML getter. But it will be less optimized than the v2 client, because the v1 client won't use the --sparse or --shared flags.

Granted, we are going to try to disable repo serialization altogether, so when we cross that bridge (soon), the v1 client will be basically unusable for the prow YAML getter. So I am OK with removing the obsolete test. Will update (err, this PR is merged now, so will have to do it as a follow-up).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That followup is part of #30428. It's kind of blocked on #30520 though.

Comment on lines 164 to 165
cloneArgs = append(cloneArgs, from)
cloneArgs = append(cloneArgs, i.dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These could be combined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I.e., cloneArgs = append(cloneArgs, []string{from, i.dir}...), I think. Will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// reachable from the root. If non-nil and non-empty, then those additional
// directories from the root are also checked out (populated) in the working
// tree, recursively.
SparseCheckoutDirs *[]string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We don't need to make this a pointer value. Slices can already be nil, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... true. I forgot. I.e.,

var foo []string // foo is nil
foo := []string{} // foo is an empty slice (not nil)

Will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2023
@listx listx force-pushed the inrepoconfig-optimizations branch from d44c183 to d4701fb Compare August 17, 2023 23:40
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2023
The only data we need from a clone inrepoconfig repo is the .prow
contents (aside: we clone it in the first place so that we can merge to
the base branch).

The underlying Git client actually uses 2 clones --- first a mirror
clone (git clone --mirror ...), and then a secondary local clone into a
temporary folder (git clone ...). This secondary clone is the one we are
optimizing here.

This adds 2 optimizations:

(1) Avoid populating object files with "--shared" flag. This way, the
.git folder of the secondary clone can just have a pointer to the mirror
clone. This saves a lot of disk space for large repos (not to mention
that both the CPU time and disk usage for creating this clone is vastly
reduced, especially when combined with the next optimization).

(2) Avoid a full checkout with sparse-checkout (supported since Git
2.25). This way, we can avoid deflating Git blob objects into file paths
on disk for everything except (a) files at the toplevel and (b) files
specified by the "git sparse-checkout set <PATH> [...<PATH>]" command.
The toplevel files are included as part of "cone mode" which is the
recommended mode for sparse checkouts. In our case we specify the
".prow" folder (inRepoConfigDirName) which is the only path to make sure
we get those inrepoconfig files checked out.

Sadly, for Git versions 2.32 (which we currently use in our components),
the toplevel files are *not* included by default. So we have to specify
it separately, which is why we have inRepoConfigFileName also in
SparseCheckoutDirs. In Git 2.41 (probably since 2.36) specifying
inRepoConfigFileName will make Git exit with an error because Git will
see that the `.prow.yaml` is a file, not a folder [1]. So before we
upgrade our components to use Git 2.36+ we have to delete the
inRepoConfigFileName from this list.

[1] git/git@4ce5043
@listx listx force-pushed the inrepoconfig-optimizations branch from d4701fb to 8e5361f Compare August 17, 2023 23:50
Copy link
Member

@cjwagner cjwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One remaining style nit that doesn't affect behavior. Lets get this merged.
/unhold

cloneArgs = append(cloneArgs, "--sparse")
}

cloneArgs = append(cloneArgs, []string{from, i.dir}...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cloneArgs = append(cloneArgs, []string{from, i.dir}...)
cloneArgs = append(cloneArgs, from, i.dir)

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Aug 18, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cjwagner, listx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 916ed2b into kubernetes:master Aug 18, 2023
6 of 7 checks passed
listx pushed a commit to listx/test-infra that referenced this pull request Aug 18, 2023
…ests

Before 916ed2b (Merge pull request kubernetes#30400 from
listx/inrepoconfig-optimizations, 2023-08-17), we were under the
impression that the secondary clones were unavoidably large. And so, we
created the InRepoConfigGitCache in 3098b88 (Merge pull request
would never delete and recreate the secondary clone for every request
--- instead we would reuse the existing secondary clones across requests
if those requests were for the same repo. This resulted in the need to
ensure that only one such request would touch the associated secondary
clone at any given time, which meant that we had to serialize the use of
these secondary clones.

But because the inrepoconfig repos we clone are so cheap now, there's no
need to try to reuse the same one during the lifetime of a process (this
is the point of InRepoConfigGitCache). We just create and delete each
time we need to service a request. In manual tests, we saw reductions in
size from 3.1GB to just 19MB for the disk space used for a secondary
clone for a large repo using inrepoconfigs (where the bulk of the 19MB
came from other unrelated files in the toplevel folder of the repo, due
to the way sparse checkouts work).

Note that we already have a call to repo.Clean() inside prowYAMLGetter,
so that will delete the repo after we read the inrepoconfig values.
However, the main "gitcache" directory (holding the mirror clone) will
still be around and will not be deleted.
listx pushed a commit to listx/test-infra that referenced this pull request Aug 18, 2023
…ests

Before 916ed2b (Merge pull request kubernetes#30400 from
listx/inrepoconfig-optimizations, 2023-08-17), we were under the
impression that the secondary clones were unavoidably large. So we
created the InRepoConfigGitCache in 3098b88 (Merge pull request
kubernetes#23264 from cjwagner/inrepo-config-cache, 2021-08-18) to make it so we
would never delete and recreate the secondary clone for every request
--- instead we would reuse the existing secondary clones across requests
if those requests were for the same repo. This resulted in the need to
ensure that only one such request would touch the associated secondary
clone at any given time, which meant that we had to serialize the use of
these secondary clones.

But because the inrepoconfig repos we clone are so cheap now, there's no
need to try to reuse the same one during the lifetime of a process. We
just create and delete each time we need to service a request. In manual
tests, we saw reductions in size from 3.1GB to just 19MB for the disk
space used for a secondary clone for a large repo using inrepoconfigs
(where the bulk of the 19MB came from other unrelated files in the
toplevel folder of the repo, due to the way sparse checkouts work).

Note that we already have a call to repo.Clean() inside prowYAMLGetter,
so that will delete the repo after we read the inrepoconfig values.
However, the main "gitcache" directory (holding the mirror clone) will
still be around and will not be deleted.
listx pushed a commit to listx/test-infra that referenced this pull request Aug 18, 2023
…ests

Before 916ed2b (Merge pull request kubernetes#30400 from
listx/inrepoconfig-optimizations, 2023-08-17), we were under the
impression that the secondary clones of inrepoconfig repos were
unavoidably large. So we created the InRepoConfigGitCache wrapper in
3098b88 (Merge pull request kubernetes#23264 from cjwagner/inrepo-config-cache,
2021-08-18) to make it so we would never delete and recreate the
secondary clone for every request (the default behavior of the wrapped
git v2 client); instead, we would reuse the existing secondary clones
across requests if those requests were for the same repo. This resulted
in the need to ensure that only one such request would touch the
associated secondary clone at any given time, which meant that we had to
serialize the use of these secondary clones.

But because the secondary clones are so cheap now, there's no
need to try to reuse the same one during the lifetime of a process. We
just create and delete each time we need to service a request. In manual
tests, we saw reductions in size from 3.1GB to just 19MB for the disk
space used for a secondary clone for a large repo using inrepoconfigs
(where the bulk of the 19MB came from other unrelated files in the
toplevel folder of the repo, due to the way sparse checkouts work).

Note that we already have a call to repo.Clean() inside prowYAMLGetter,
so that will delete the repo after we read the inrepoconfig values.
However, the main "gitcache" directory (holding the mirror clone) will
still be around and will not be deleted.
listx pushed a commit to listx/test-infra that referenced this pull request Aug 18, 2023
…ests

Before 916ed2b (Merge pull request kubernetes#30400 from
listx/inrepoconfig-optimizations, 2023-08-17), we were under the
impression that the secondary clones of inrepoconfig repos were
unavoidably large. So we created the InRepoConfigGitCache wrapper in
3098b88 (Merge pull request kubernetes#23264 from cjwagner/inrepo-config-cache,
2021-08-18) to make it so we would never delete and recreate the
secondary clone for every request (the default behavior of the wrapped
git v2 client); instead, we would reuse the existing secondary clones
across requests if those requests were for the same repo. This resulted
in the need to ensure that only one such request would touch the
associated secondary clone at any given time, which meant that we had to
serialize the use of these secondary clones.

It turns out that the secondary clone optimizations in 916ed2b make
these clones cheaper on disk by multiple orders of magnitude. In manual
tests, we saw reductions in size from 3.1GB to just 19MB for the disk
space used for a secondary clone for a large repo using inrepoconfigs
(where the bulk of the 19MB came from other unrelated files in the
toplevel folder of the repo, due to the way sparse checkouts work).

Because the secondary clones are so cheap now, there's no need to try to
reuse the same one during the lifetime of a process. We just create and
delete each time we need to service a request.

Note that we already have a call to repo.Clean() inside prowYAMLGetter,
so that will delete the repo after we read the inrepoconfig values.
However, the main "gitcache" directory (holding the mirror clone) will
still be around and will not be deleted.
@xrstf
Copy link
Contributor

xrstf commented Oct 20, 2023

In case anyone else is searching for it: This change broke the generated tests in our product, as we had a symlink from .prow/generated.yaml outside of the .prow/ directory. IMHO this could have been listed in the announcements as a breaking change.

@listx
Copy link
Contributor Author

listx commented Oct 20, 2023

@xrstf Thanks for the report. I did not foresee that users used symlinks like that though, which is why I didn't make an announcement.

WRT technical reasons behind the breakage, I think this is because when we do sparse checkouts, we only populate the .prow/ files and the files under the project root (which will include .prow.yaml). So even though the symlink will exist, its destination (if outside the .prow/ folder and not at the project root) won't be populated by the sparse checkout.

I'll make an announcement in our docs page.

@listx
Copy link
Contributor Author

listx commented Oct 20, 2023

See kubernetes-sigs/prow#65 for the announcement update to our docs.

@xrstf
Copy link
Contributor

xrstf commented Oct 21, 2023

Thanks :-) I admit that our constellation with symlinks was kinda ... unique and we happily changed our repo structure to benefit from the sparse checkouts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/prow Issues or PRs related to prow cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants