Timeout Unschedulable Migration Target Pods #6737
Conversation
/hold I want to write a functional test before this is merged.
Just a couple of comments; the most important one, in my opinion, is to name the variable to shed some light on the re-queuing mechanism.
// Make sure we check this again after some time
c.Queue.AddAfter(migrationKey, time.Second*time.Duration(c.unschedulableTimeoutSeconds-diffSeconds))
Maybe I'm just too thick, but I don't follow this c.unschedulableTimeoutSeconds-diffSeconds.
Would you assign it to a named variable to help people like me reason it out?
I changed this around a bit to try to help.
Thanks. Better now.
return nil
}

isUnschedulable := false
I don't think this variable is required, nor the control code to exit the function if the pod is not k8sv1.PodReasonUnschedulable.
You could move all the code in https://github.com/kubevirt/kubevirt/pull/6737/files#diff-1baf192bb9c4560d7d40f008b7ba5c680ebce669b8f1b7639445abd3e7269688R820-R844 into a couple of new functions (named didPodTimeOut(pod *v1.Pod) bool {...} and func (c *MigrationController) podTimedOut(vmi, pod) error {...}), then call them directly when the right condition is met, i.e. condition.Type == k8sv1.PodScheduled && condition.Status == k8sv1.ConditionFalse && condition.Reason == k8sv1.PodReasonUnschedulable.
Not a must though, I won't insist.
I refactored this a bit to hopefully make it easier to follow
@@ -584,6 +599,43 @@ var _ = Describe("Migration watcher", func() {
	shouldExpectMigrationSchedulingState(migration)
	controller.Execute()
})

table.DescribeTable("should handle pod stuck in unschedulable state", func(phase virtv1.VirtualMachineInstanceMigrationPhase, shouldTimeout bool, timeLapse int64) {
I'm not fond of defining the expected test state in the test's input parameters (i.e. the shouldTimeout boolean argument), but I don't see simpler alternatives.
This is a fairly common pattern we use with tables: pass in various inputs and declare what output is expected. It's a way to condense more tests into a single table.
/hold cancel I added a functional test

/retest
/assign @enp0s3
Thanks. I think it reads better now.
Migration takes more than 5 minutes if the VM has a CD-ROM.
We need to consider increasing the timeout value here until this bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=2014438
Both the source and target virt-launcher pods would be in the Running state during this long migration period of more than 5 minutes.
The timeout in this PR only gets triggered if the target pod is stuck in Pending due to scheduling issues for > 5 minutes. If the target pod is Running, this timeout will not occur.
/retest
Looks great to me.
/approve
@@ -103,6 +108,8 @@ func NewMigrationController(templateService services.TemplateService,
	migrationStartLock: &sync.Mutex{},
	clusterConfig: clusterConfig,
	statusUpdater: status.NewMigrationStatusUpdater(clientset),
	unschedulableTimeoutSeconds: defaultUnschedulableTimeoutSeconds,
Maybe not for this PR, but did you think about making this configurable?
In that case, we could then override this value in a migration policy, for example in a policy that targets workloads with a requirement for specific resources.
Yep, we definitely could make it configurable. I'll give this some thought. I'm doing a follow-up PR to this now where I have a "catch all" timeout for a target pod stuck in Pending for any unknown reason; I'll need the timeout to be configurable for that simply to be able to functionally test it.
/retest

Also for me. 👍 /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rmohr, vladikr.
/retest
@vladikr @rmohr take another look. I made a new change. We now observe two timeouts.
I created the 15m catch-all to account for all the unknown reasons why a pod will never transition successfully to a Running phase; this could be issues on the node with pulling the container image, or perhaps something new we can't predict.
/retest
Signed-off-by: David Vossel <davidvossel@gmail.com>
@davidvossel: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest

/lgtm

/cherry-pick release-0.44
@enp0s3: #6737 failed to apply on top of branch "release-0.44": In response to this: