New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1939054: Disable startup timeout for Spot MHC #830
Bug 1939054: Disable startup timeout for Spot MHC #830
Conversation
|
@JoelSpeed: This pull request references Bugzilla bug 1939054, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/bugzilla refresh CC @n1r1 I think you might be interested in this |
|
@JoelSpeed: This pull request references Bugzilla bug 1939054, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
| @@ -104,7 +104,7 @@ spec: | |||
| type: string | |||
| timeout: | |||
| description: Expects an unsigned duration string of decimal numbers each with optional fraction and a unit suffix, eg "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to extend this description and explain "-1" behaviour there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather keep this undocumented for now, let's test it out with just our MHC and then negotiate with upstream to get feature parity, once we have done that, we can document it
| @@ -629,12 +629,17 @@ func (t *target) needsRemediation(timeoutForMachineToHaveNode time.Duration) (bo | |||
| if t.Machine.Status.LastUpdated == nil { | |||
| return false, timeoutForMachineToHaveNode, nil | |||
| } | |||
| if t.Machine.Status.LastUpdated.Add(timeoutForMachineToHaveNode).Before(now) { | |||
| if timeoutForMachineToHaveNode != time.Duration(-1) && t.Machine.Status.LastUpdated.Add(timeoutForMachineToHaveNode).Before(now) { | |||
| fmt.Printf("Timeout: %v", timeoutForMachineToHaveNode) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print?
b332305
to
bd2257f
Compare
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes sense to me, and while i generally agree with @lobziik about the documenting of this feature i think we can make an exception given @JoelSpeed's reasoning in this case.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elmiko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
these test failures seem unrelated to this change, but i'm not 100% sure yet. |
|
digging in to the must-gather on one of the failures, i see this not exactly sure what this means yet. |
|
seems like we are hitting this bug, https://bugzilla.redhat.com/show_bug.cgi?id=1913069 |
|
/retest |
|
/hold There is something wrong with the regex here, this needs further investigation The regex doesn't match the regex that is in the CRD manifest in the release image, so not sure why this error is happening in the CVO logs |
|
/retest |
bd2257f
to
8a85fe1
Compare
8a85fe1
to
5e45590
Compare
|
/retest |
| @@ -738,6 +748,12 @@ func (t *target) needsRemediation(timeoutForMachineToHaveNode time.Duration) (bo | |||
|
|
|||
| // the node has not been set yet | |||
| if t.Node == nil { | |||
| if timeoutForMachineToHaveNode.Seconds() == disabledNodeStartupTimeout.Seconds() { | |||
| // Startup timeout is disabled so no need to go any further. | |||
| // No node yet to check conditions, can return early here. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it worth to have some log message here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the rest of this logic, we only log when the machine is actually being remediated. When we determine it's still healthy of possibly needs a check later, we don't log.
I'd be tempted here to keep this consistent here
|
/hold cancel We have been working with the upstream and I think we've agreed on this disable via zero value feature |
|
/lgtm |
|
@JoelSpeed: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@JoelSpeed: All pull requests linked via external trackers have merged: Bugzilla bug 1939054 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The spot termination handler MHC should only ever remove Machines that meet the spot termination condition. We do not want it to remove machines because they took too long to start up. That should be a user decision.
This PR adds the ability to disable the node startup timeout completely, and uses that for the spot termination MHC.