OCPBUGS-13946: do not use one second timeout when asserting a webhook connection #1510

Merged

Conversation

p0lyn0mial
Contributor

@p0lyn0mial p0lyn0mial commented Jun 19, 2023

Previously, the dial timeout for a webhook was hard-coded to one second, which seems very aggressive and could cause connection failures that put the operator into a degraded state.

This PR reads the timeout value for a webhook from its spec, or uses a default value of 10 seconds if none was specified.

…sserting a webhook connection

previously the dial timeout to a webhook was set to one second
which seems to be very aggressive and can cause failures which can put the operator into degraded state.

This PR reads the timeout value for a webhook from the spec
or uses a default value of 10 seconds if it wasn't specified
@p0lyn0mial
Contributor Author

/assign @benluddy

@openshift-ci openshift-ci bot requested review from sanchezl and soltysh June 19, 2023 08:11
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2023
@p0lyn0mial p0lyn0mial changed the title OCPBUGS-13946 do not use one second timeout when asserting a webhook connection OCPBUGS-13946: do not use one second timeout when asserting a webhook connection Jun 19, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 19, 2023
@openshift-ci-robot

@p0lyn0mial: This pull request references Jira Issue OCPBUGS-13946, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from wangke19 June 19, 2023 08:12
@openshift-ci
Contributor

openshift-ci bot commented Jun 19, 2023

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-operator-single-node | 1156b3a | link | false | /test e2e-gcp-operator-single-node |
| ci/prow/e2e-aws-operator-disruptive-single-node | 1156b3a | link | false | /test e2e-aws-operator-disruptive-single-node |


@@ -114,7 +121,7 @@ func (c *webhookSupportabilityController) assertConnect(ctx context.Context, web
 		case <-time.After(time.Duration(i) * time.Second):
 		}
 		dialer := &tls.Dialer{
-			NetDialer: &net.Dialer{Timeout: 1 * time.Second},
+			NetDialer: &net.Dialer{Timeout: timeout},
Contributor

This timeout only covers the TCP connect. Since we're using a TLS dialer here, I expect we intend timeouts to cover the handshake too. May need to move the timeout into a context deadline.

Contributor Author

The timeout applies to connection and TLS handshake as a whole.
See https://github.com/golang/go/blob/master/src/crypto/tls/tls.go#L123

@@ -27,6 +27,7 @@ func (c *webhookSupportabilityController) updateMutatingAdmissionWebhookConfigur
 			Name:                  webhook.Name,
 			CABundle:              webhook.ClientConfig.CABundle,
 			FailurePolicyIsIgnore: webhook.FailurePolicy != nil && *webhook.FailurePolicy == admissionregistrationv1.Ignore,
+			TimeoutSeconds:        webhook.TimeoutSeconds,
Contributor

Does it make sense to perform defaulting here rather than as part of every check?

Contributor Author

The defaulting is cheap. In the future we might consider adding some logs.
In addition to that it is consistent with defaulting the port.

See

Contributor

Right, I'm not worried about the cost, I think it would be a tidier separation of responsibility. Since we already translate the API object into an internal representation for use by the dial probes, it didn't make sense to me that the internal representation (i.e. webhookInfo) wasn't directly usable by the dial probe.

It's only my preference and I'm satisfied with consistency too.

@benluddy
Contributor

Is this a good time to add test coverage for dial timeouts?

@p0lyn0mial
Contributor Author

Is this a good time to add test coverage for dial timeouts?

It turns out that a unit test for dial timeouts would be complicated: we would need to start a web server and a fake DNS server, and we would need to measure the duration of the test.

I was also thinking about refactoring the code just to validate that a timeout value is applied, but I don’t see a huge gain here.

I would like to merge this PR without timeout coverage.

@benluddy
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, p0lyn0mial

Needs approval from an approver in each of these files:
  • OWNERS [benluddy,p0lyn0mial]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6270111 and 2 for PR HEAD 1156b3a in total

@openshift-merge-robot openshift-merge-robot merged commit 8b64249 into openshift:master Jun 27, 2023
13 of 15 checks passed
@openshift-ci-robot

@p0lyn0mial: Jira Issue OCPBUGS-13946: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13946 has been moved to the MODIFIED state.


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

4 participants