OCPBUGS-13946: do not use one second timeout when asserting a webhook connection #1510
…sserting a webhook connection
Previously the dial timeout to a webhook was set to one second, which is very aggressive and can cause failures that put the operator into a degraded state. This PR reads the timeout value for a webhook from the spec, or uses a default of 10 seconds if none was specified.
/assign @benluddy
@p0lyn0mial: This pull request references Jira Issue OCPBUGS-13946, which is valid. 3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker.
@@ -114,7 +121,7 @@ func (c *webhookSupportabilityController) assertConnect(ctx context.Context, web
 		case <-time.After(time.Duration(i) * time.Second):
 		}
 		dialer := &tls.Dialer{
-			NetDialer: &net.Dialer{Timeout: 1 * time.Second},
+			NetDialer: &net.Dialer{Timeout: timeout},
This timeout only covers the TCP connect. Since we're using a TLS dialer here, I expect we intend timeouts to cover the handshake too. May need to move the timeout into a context deadline.
The timeout applies to connection and TLS handshake as a whole.
See https://github.com/golang/go/blob/master/src/crypto/tls/tls.go#L123
@@ -27,6 +27,7 @@ func (c *webhookSupportabilityController) updateMutatingAdmissionWebhookConfigur
 				Name:                  webhook.Name,
 				CABundle:              webhook.ClientConfig.CABundle,
 				FailurePolicyIsIgnore: webhook.FailurePolicy != nil && *webhook.FailurePolicy == admissionregistrationv1.Ignore,
+				TimeoutSeconds:        webhook.TimeoutSeconds,
Does it make sense to perform defaulting here rather than as part of every check?
The defaulting is cheap. In the future we might consider adding some logs.
In addition to that, it is consistent with defaulting the port. See cluster-kube-apiserver-operator/pkg/operator/webhooksupportabilitycontroller/degraded_webhook.go, line 104 in 1156b3a:
port = fmt.Sprintf("%d", *reference.Port)
Right, I'm not worried about the cost, I think it would be a tidier separation of responsibility. Since we already translate the API object into an internal representation for use by the dial probes, it didn't make sense to me that the internal representation (i.e. webhookInfo) wasn't directly usable by the dial probe.
It's only my preference and I'm satisfied with consistency too.
Is this a good time to add test coverage for dial timeouts?
It turns out that a unit test for dial timeouts would be complicated: we would need to start a web server and a fake DNS server, and we would need to measure the duration of the test. I was also thinking about refactoring the code just to validate that a timeout value is applied, but I don't see a huge gain here. I would like to merge this PR without timeout coverage.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: benluddy, p0lyn0mial
Merged 8b64249 into openshift:master
@p0lyn0mial: Jira Issue OCPBUGS-13946: All pull requests linked via external trackers have merged.
Jira Issue OCPBUGS-13946 has been moved to the MODIFIED state.
Previously the dial timeout to a webhook was set to one second, which is very aggressive and could cause failures that put the operator into a degraded state.
This PR reads the timeout value for a webhook from the spec, or uses a default of 10 seconds if none was specified.