Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use http/2 for localhost webhook #122558

Merged
merged 1 commit into from Jan 4, 2024

Conversation

linxiulei
Copy link
Contributor

@linxiulei linxiulei commented Jan 2, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Previously we changed to http1 because of #82090. However, the fixed issue shouldn't affect cases where the webhook services are hosted in the same host but the fix did affect performance with using http1.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Changed the API server so that for admission webhooks that have a URL matching the hostname `localhost`, or a loopback IP address, the connection supports HTTP/2 where it can be negotiated.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 2, 2024
@linxiulei
Copy link
Contributor Author

/assign @liggitt
/cc @aojea

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 2, 2024
@@ -142,7 +143,9 @@ func (cm *ClientManager) HookClient(cc ClientConfig) (*rest.RESTClient, error) {
// Use http/1.1 instead of http/2.
// This is a workaround for http/2-enabled clients not load-balancing concurrent requests to multiple backends.
// See https://issue.k8s.io/75791 for details.
cfg.NextProtos = []string{"http/1.1"}
if isLocalHost(cc.URL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both service and URL-based versions go through this code path. We also are already parsing the URL below.

This would be clearer as a forceHTTP1 := true which we only set to false after parsing u, err := url.Parse(cc.URL) below and checking the hostname there for localhost / loopback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, @aojea you suggested what I initially did but @liggitt suggested otherwise. I have no strong opinion on either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @aojea was suggesting moving the NextProtos assignment down below to where we had already parsed the URL (do it unconditionally in the service branch, and conditionally in the URL branch)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel this is simpler since we still need to change it in two places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, why do we have such large diff? do we need to breaking it down in func (cm *ClientManager) hookClientConfig(cc ClientConfig) (*rest.Config, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, why do we have such large diff?

it was due to bringing several dependencies. I removed them since your suggestion doesn't need them.

do we need to breaking it down in func (cm *ClientManager) hookClientConfig(cc ClientConfig) (*rest.Config, error) {

yes, this refactoring would facilitate writing the unit test linxiulei@612c36d#diff-0afdfbb98099d9c00bfb801579318a05e3f226eb384d50561cef6886e7631046R1

@liggitt
Copy link
Member

liggitt commented Jan 2, 2024

this definitely needs a test and a release note

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 2, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 2, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 2, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 2, 2024
@linxiulei
Copy link
Contributor Author

One option for testing is to add the http1 behavior test as an e2e test in test/e2e/apimachinery/webhook.go and modify the existing integration test to validate the http2 behavior.

I took a hacky approach to mock the dns for resolving example.com. Will clean up by splitting them into smaller commits if needed.

Comment on lines 161 to 163
if forceHTTP1 {
cfg.NextProtos = []string{"http/1.1"}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see, for me is easier to read the code top down, and not have to remember that something is going to mutate it later ... it is also less fragile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong preference on either way. But changed as you suggested.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 3, 2024
@aojea
Copy link
Member

aojea commented Jan 3, 2024

some comment lefts but overall LGTM, I think we can assume http2 is prefered for localhost webhooks is relatively safe , as Services will not have a ClusterIP that has a localhost address, and webhooks that are listening on localhost are not supposed to require loadbalancing.

Signed-off-by: Eric Lin <exlin@google.com>
@aojea
Copy link
Member

aojea commented Jan 4, 2024

/lgtm
/approve

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5f986a1a4a11a8620b08a2fa8f4f3e7f632c401d

@@ -58,154 +65,191 @@ func TestWebhookLoadBalance(t *testing.T) {
t.Fatalf("Failed to build cert with error: %+v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I recall correctly, we had issues with this test flaking when it was first introduced. Did you run it a bunch of times in a row to check that this reworked version is stable?

Copy link
Contributor Author

@linxiulei linxiulei Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean tls.X509KeyPair() here or the whole test, which I'd guess the connection count is not always predictable.

Anyway, I ran more than a dozen times in a row and no failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, the whole test, just had to pick a random line to comment on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, @aojea suggested stress following https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-integration-tests-to-reproduce-flakes

stress ./admissionwebhook.test -test.run TestWebhookLoadBalance
5s: 0 runs so far, 0 failures
10s: 0 runs so far, 0 failures
15s: 0 runs so far, 0 failures
20s: 0 runs so far, 0 failures
25s: 0 runs so far, 0 failures
30s: 0 runs so far, 0 failures
35s: 3 runs so far, 0 failures
40s: 11 runs so far, 0 failures
45s: 16 runs so far, 0 failures
50s: 16 runs so far, 0 failures
55s: 16 runs so far, 0 failures
1m0s: 16 runs so far, 0 failures
1m5s: 20 runs so far, 0 failures
1m10s: 24 runs so far, 0 failures
1m15s: 31 runs so far, 0 failures
1m20s: 32 runs so far, 0 failures
1m25s: 32 runs so far, 0 failures
1m30s: 32 runs so far, 0 failures
1m35s: 36 runs so far, 0 failures
1m40s: 41 runs so far, 0 failures
1m45s: 46 runs so far, 0 failures
1m50s: 48 runs so far, 0 failures
1m55s: 48 runs so far, 0 failures
2m0s: 48 runs so far, 0 failures
2m5s: 48 runs so far, 0 failures
2m10s: 48 runs so far, 0 failures
2m15s: 50 runs so far, 0 failures
2m20s: 53 runs so far, 0 failures
2m25s: 57 runs so far, 0 failures
2m30s: 60 runs so far, 0 failures
2m35s: 63 runs so far, 0 failures
2m40s: 64 runs so far, 0 failures
2m45s: 64 runs so far, 0 failures
2m50s: 67 runs so far, 0 failures
2m55s: 71 runs so far, 0 failures
3m0s: 75 runs so far, 0 failures
3m5s: 78 runs so far, 0 failures
3m10s: 80 runs so far, 0 failures
3m15s: 80 runs so far, 0 failures
3m20s: 82 runs so far, 0 failures
3m25s: 84 runs so far, 0 failures
3m30s: 89 runs so far, 0 failures
3m35s: 92 runs so far, 0 failures
3m40s: 95 runs so far, 0 failures
3m45s: 96 runs so far, 0 failures
3m50s: 98 runs so far, 0 failures
3m55s: 99 runs so far, 0 failures
4m0s: 104 runs so far, 0 failures
4m5s: 108 runs so far, 0 failures
4m10s: 110 runs so far, 0 failures
4m15s: 112 runs so far, 0 failures

@liggitt
Copy link
Member

liggitt commented Jan 4, 2024

change lgtm, one question about making sure the updated integration test is non-flaky

@liggitt
Copy link
Member

liggitt commented Jan 4, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, liggitt, linxiulei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9bd0ef5 into kubernetes:master Jan 4, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Jan 4, 2024
@sftim
Copy link
Contributor

sftim commented Jan 9, 2024

Changelog suggestion

-The apiserver uses http/2 to communicate with admission webhooks specified with an `URL` of localhost or loopback IP addresses.
+Changed the API server so that for admission webhooks that have a URL matching the hostname
+`localhost`, or a loopback IP address, the connection supports HTTP/2 where it can be negotiated.

I presume that other webhooks, eg authz webhooks, use the legacy behavior. Have I got that right?

@linxiulei
Copy link
Contributor Author

Changelog changed as suggested.

I presume that other webhooks, eg authz webhooks, use the legacy behavior. Have I got that right?

I don't think this PR touched clients for other webhooks (i.e. only for clients of validating and mutating webhooks). But others might be more knowledgable than me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/dependency Issues or PRs related to dependency changes area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants