Bug 1939054: Disable startup timeout for Spot MHC #830

Merged

JoelSpeed
Contributor

The spot termination handler MHC should only ever remove Machines that meet the spot termination condition. We do not want it to remove machines because they took too long to start up. That should be a user decision.

This PR adds the ability to disable the node startup timeout completely, and uses that for the spot termination MHC.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Mar 18, 2021
@openshift-ci-robot
Contributor

@JoelSpeed: This pull request references Bugzilla bug 1939054, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1939054: Disable startup timeout for Spot MHC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Mar 18, 2021
@JoelSpeed
Contributor Author

/bugzilla refresh

CC @n1r1 I think you might be interested in this

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Mar 18, 2021
@openshift-ci-robot
Contributor

@JoelSpeed: This pull request references Bugzilla bug 1939054, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @sunzhaohua2


@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Mar 18, 2021
@@ -104,7 +104,7 @@ spec:
type: string
timeout:
description: Expects an unsigned duration string of decimal numbers each with optional fraction and a unit suffix, eg "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

Contributor

@lobziik lobziik Mar 18, 2021

Suggest to extend this description and explain "-1" behaviour there.

Contributor Author

I'd rather keep this undocumented for now. Let's test it out with just our MHC and then negotiate with upstream to get feature parity. Once we have done that, we can document it.

@@ -629,12 +629,17 @@ func (t *target) needsRemediation(timeoutForMachineToHaveNode time.Duration) (bo
     if t.Machine.Status.LastUpdated == nil {
         return false, timeoutForMachineToHaveNode, nil
     }
-    if t.Machine.Status.LastUpdated.Add(timeoutForMachineToHaveNode).Before(now) {
+    if timeoutForMachineToHaveNode != time.Duration(-1) && t.Machine.Status.LastUpdated.Add(timeoutForMachineToHaveNode).Before(now) {
+        fmt.Printf("Timeout: %v", timeoutForMachineToHaveNode)

Contributor

Debug print?

@elmiko
Contributor

elmiko commented Mar 29, 2021

/retest

Contributor

@elmiko elmiko left a comment

this makes sense to me. while i generally agree with @lobziik about documenting this feature, i think we can make an exception given @JoelSpeed's reasoning in this case.
/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 29, 2021
@elmiko
Contributor

elmiko commented Mar 29, 2021

these test failures seem unrelated to this change, but i'm not 100% sure yet.

@elmiko
Contributor

elmiko commented Mar 29, 2021

digging into the must-gather on one of the failures, i see this:

Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, spec.containers[1].securityContext.privileged: Invalid value: true: Privileged containers are not allowed, spec.containers[1].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[1].securityContext.containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, spec.containers[2].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.containers[2].securityContext.containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

not exactly sure what this means yet.

@elmiko
Contributor

elmiko commented Mar 29, 2021

seems like we are hitting this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1913069

@JoelSpeed
Contributor Author

/retest

@JoelSpeed
Contributor Author

/hold There is something wrong with the regex here, this needs further investigation

MachineHealthCheck.machine.openshift.io "machine-api-termination-handler" is invalid: spec.nodeStartupTimeout: Invalid value: "-1": spec.nodeStartupTimeout in body should match '^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$'

The regex in this error doesn't match the one in the CRD manifest in the release image, so I'm not sure why this error is happening in the CVO logs.
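
To see why the API server rejects `-1` under the pattern quoted in the error, the pattern can be exercised directly. This is a minimal sketch using Go's `regexp` package; the pattern is copied from the error message above, while the `matchesTimeoutPattern` helper and the sample values other than `-1` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied verbatim from the validation error quoted above.
var nodeStartupTimeoutPattern = regexp.MustCompile(`^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)

// matchesTimeoutPattern is a hypothetical helper for illustration only.
// It reports whether value satisfies the quoted pattern.
func matchesTimeoutPattern(value string) bool {
	return nodeStartupTimeoutPattern.MatchString(value)
}

func main() {
	// "-1" has a sign and no unit, so it cannot match.
	// "0s" is unsigned and carries a unit, so it does.
	for _, v := range []string{"-1", "0s", "10m", "2h45m", "1.5h"} {
		fmt.Printf("%q matches: %v\n", v, matchesTimeoutPattern(v))
	}
}
```

Under this pattern a negative sentinel can never validate, whereas a zero value such as `0s` passes without any schema change.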

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 6, 2021
@elmiko
Contributor

elmiko commented Apr 13, 2021

/retest

@lobziik
Contributor

lobziik commented May 7, 2021

/retest

@@ -738,6 +748,12 @@ func (t *target) needsRemediation(timeoutForMachineToHaveNode time.Duration) (bo

     // the node has not been set yet
     if t.Node == nil {
+        if timeoutForMachineToHaveNode.Seconds() == disabledNodeStartupTimeout.Seconds() {
+            // Startup timeout is disabled so no need to go any further.
+            // No node yet to check conditions, can return early here.

Contributor

Might it be worth having a log message here?

Contributor Author

In the rest of this logic, we only log when the machine is actually being remediated. When we determine it's still healthy or possibly needs a check later, we don't log.

I'd be tempted to keep this consistent here.

@JoelSpeed
Contributor Author

JoelSpeed commented May 12, 2021

/hold cancel

We have been working with upstream, and I think we've agreed on this disable-via-zero-value feature.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2021
@lobziik
Contributor

lobziik commented May 12, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 12, 2021

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 5e45590 link /test e2e-libvirt
ci/prow/e2e-aws-disruptive 5e45590 link /test e2e-aws-disruptive
ci/prow/e2e-gcp-operator 5e45590 link /test e2e-gcp-operator

Full PR test history. Your PR dashboard.


@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 6afce91 into openshift:master May 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 12, 2021

@JoelSpeed: All pull requests linked via external trackers have merged:

Bugzilla bug 1939054 has been moved to the MODIFIED state.


@JoelSpeed JoelSpeed deleted the disable-startup-timeout branch May 13, 2021 14:06
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/severity-urgent: Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.

6 participants