OCPBUGS-13946: do not use one second timeout when asserting a webhook connection #1510

p0lyn0mial · 2023-06-19T08:10:28Z

Previously, the dial timeout to a webhook was set to one second, which seems very aggressive and could cause failures that put the operator into a degraded state.

This PR reads the timeout value for a webhook from the spec, or uses a default value of 10 seconds if it wasn't specified.

p0lyn0mial · 2023-06-19T08:10:59Z

/assign @benluddy

openshift-ci-robot · 2023-06-19T08:12:40Z

@p0lyn0mial: This pull request references Jira Issue OCPBUGS-13946, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci · 2023-06-19T10:42:46Z

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-operator-single-node | `1156b3a` | link | false | `/test e2e-gcp-operator-single-node` |
| ci/prow/e2e-aws-operator-disruptive-single-node | `1156b3a` | link | false | `/test e2e-aws-operator-disruptive-single-node` |

benluddy · 2023-06-20T17:06:43Z

pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go

@@ -114,7 +121,7 @@ func (c *webhookSupportabilityController) assertConnect(ctx context.Context, web
 		case <-time.After(time.Duration(i) * time.Second):
 		}
 		dialer := &tls.Dialer{
-			NetDialer: &net.Dialer{Timeout: 1 * time.Second},
+			NetDialer: &net.Dialer{Timeout: timeout},


This timeout only covers the TCP connect. Since we're using a TLS dialer here, I expect we intend timeouts to cover the handshake too. May need to move the timeout into a context deadline.

p0lyn0mial · 2023-06-21

The timeout applies to the connection and the TLS handshake as a whole.
See https://github.com/golang/go/blob/master/src/crypto/tls/tls.go#L123

benluddy · 2023-06-20T17:07:21Z

pkg/operator/webhooksupportabilitycontroller/degraded_webhook_admission.go

@@ -27,6 +27,7 @@ func (c *webhookSupportabilityController) updateMutatingAdmissionWebhookConfigur
 				Name:                  webhook.Name,
 				CABundle:              webhook.ClientConfig.CABundle,
 				FailurePolicyIsIgnore: webhook.FailurePolicy != nil && *webhook.FailurePolicy == admissionregistrationv1.Ignore,
+				TimeoutSeconds:        webhook.TimeoutSeconds,


Does it make sense to perform defaulting here rather than as part of every check?

p0lyn0mial · 2023-06-21

The defaulting is cheap. In the future we might consider adding some logs.
It is also consistent with how the port is defaulted.

See cluster-kube-apiserver-operator/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go, line 104 in 1156b3a:

port = fmt.Sprintf("%d", *reference.Port)

benluddy · 2023-06-21

Right, I'm not worried about the cost; I think it would be a tidier separation of responsibility. Since we already translate the API object into an internal representation for use by the dial probes, it didn't make sense to me that the internal representation (i.e. webhookInfo) wasn't directly usable by the dial probe.

It's only my preference and I'm satisfied with consistency too.

benluddy · 2023-06-21T14:26:40Z

Is this a good time to add test coverage for dial timeouts?

p0lyn0mial · 2023-06-27T13:52:22Z

> Is this a good time to add test coverage for dial timeouts?

It turns out that a unit test for dial timeouts would be complicated: we would need to start a web server and a fake DNS server, and we would need to measure the duration of the test.

I was also thinking about refactoring the code just to validate that a timeout value is applied, but I don't see a huge gain here.

I would like to merge this PR without timeout coverage.

benluddy · 2023-06-27T17:56:28Z

/lgtm

openshift-ci · 2023-06-27T17:58:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, p0lyn0mial

Needs approval from an approver in each of these files:

~~OWNERS~~ [benluddy,p0lyn0mial]

openshift-ci-robot · 2023-06-27T19:29:46Z

/retest-required

Remaining retests: 0 against base HEAD 6270111 and 2 for PR HEAD 1156b3a in total

openshift-ci-robot · 2023-06-27T20:50:56Z

@p0lyn0mial: Jira Issue OCPBUGS-13946: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13946 has been moved to the MODIFIED state.

openshift-ci bot assigned benluddy Jun 19, 2023

p0lyn0mial mentioned this pull request Jun 6, 2023

OCPBUGS-13946: degraded_webhook.go x509: certificate signed by unknown authority #1503

Merged

openshift-ci bot requested review from sanchezl and soltysh June 19, 2023 08:11

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2023

p0lyn0mial changed the title ~~OCPBUGS-13946 do not use one second timeout when asserting a webhook connection~~ OCPBUGS-13946: do not use one second timeout when asserting a webhook connection Jun 19, 2023

openshift-ci bot requested a review from wangke19 June 19, 2023 08:12

openshift-ci-robot mentioned this pull request Jun 19, 2023

OCPBUGS-13946: report fast resync interval only when > 0 and < 60 openshift/library-go#1535

Merged

benluddy reviewed Jun 20, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2023

openshift-merge-robot merged commit 8b64249 into openshift:master Jun 27, 2023
13 of 15 checks passed

dgoodwin mentioned this pull request Jul 10, 2023

Revert "OCPBUGS-13946: do not use one second timeout when asserting a webhook connection" #1525

Closed
