Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix transitions of Requeued condition #2063

Merged

Conversation

mbobrovskyi
Copy link
Contributor

@mbobrovskyi mbobrovskyi commented Apr 25, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • fixed Requeued condition transition on activate/deactivate workload
  • fixed Requeued condition transition on WaitForPodsReady

Which issue(s) this PR fixes:

Fixes #2038

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix transitions of Requeued condition.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 25, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mbobrovskyi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 25, 2024
Copy link

netlify bot commented Apr 25, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit c1fa647
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66351cc7a09ac500085fc9df

@trasc
Copy link
Contributor

trasc commented Apr 25, 2024

/assign
/ok-to-test
/test all

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 25, 2024

// If a deactivated workload is re-activated we need to set the WorkloadRequeued condition to true if already unset.
if workload.IsUnsetRequeuedByDeactivation(&wl) {
workload.SetRequeuedCondition(&wl, "ReactivateWorkload", "The workload is reactivated")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an API constant for "ReactivateWorkload"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@alculquicondor
Copy link
Contributor

cc @mimowo

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 26, 2024
@mbobrovskyi mbobrovskyi marked this pull request as ready for review April 26, 2024 08:26
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2024
@mimowo
Copy link
Contributor

mimowo commented Apr 26, 2024

cc @mimowo

ack

@mbobrovskyi
Copy link
Contributor Author

/test pull-kueue-test-scheduling-perf-main

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, I'm yet wondering if we handle properly QueueStopped scenario.

Comment on lines 592 to 600
if cond == nil || cond.Status != metav1.ConditionFalse || cond.Reason != kueue.WorkloadEvictedByPodsReadyTimeout {
return nil, false
}
return cond, true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if cond == nil || cond.Status != metav1.ConditionFalse || cond.Reason != kueue.WorkloadEvictedByPodsReadyTimeout {
return nil, false
}
return cond, true
return cond != nil && cond.Status == metav1.ConditionFalse && cond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout

as above, much easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if condition, isUnset := workload.IsUnsetRequeuedByPodsReadyTimeout(&wl); condition != nil && isUnset {
var requeueAfter time.Duration
if wl.Status.RequeueState != nil && wl.Status.RequeueState.RequeueAt != nil {
requeueAfter = time.Until(wl.Status.RequeueState.RequeueAt.Time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute the time using the r.clock. should be available on the main branch. It allows us to use fakeClock for testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const (
// WorkloadRequeuedByReactivation indicates that the workload was requeued
// because spec.active is set to true.
WorkloadRequeuedByReactivation = "ReactivateWorkload"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WorkloadRequeuedByReactivation = "ReactivateWorkload"
WorkloadReactivated = "WorkloadReactivated"

It is a reason. It will be clear that it is Requeued from the condition type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -329,6 +329,12 @@ const (
WorkloadEvictedByDeactivation = "InactiveWorkload"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put together with other workload reasons

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 26, 2024
@mbobrovskyi mbobrovskyi force-pushed the fix/transitions-of-requeued-conditions branch from 2fb25cd to e1a6163 Compare April 26, 2024 14:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 26, 2024
@mbobrovskyi
Copy link
Contributor Author

/test pull-kueue-test-e2e-main-1-29

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 3, 2024
@mbobrovskyi mbobrovskyi force-pushed the fix/transitions-of-requeued-conditions branch from aab5766 to f21c166 Compare May 3, 2024 09:57
pkg/controller/core/workload_controller.go Outdated Show resolved Hide resolved
pkg/workload/workload.go Outdated Show resolved Hide resolved
@@ -1832,6 +1895,150 @@ var _ = ginkgo.Describe("Job controller interacting with scheduler", ginkgo.Orde
})
})

var _ = ginkgo.Describe("Job controller interacting with scheduler when waitForPodsReady enabled", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that there is high potential for flakiness here.

The ideal solution is to manipulate the passing of time in the test, if you can pass a fake time to every controller.

OTOH, we are not too concerned with what the scheduler is doing, so we could potentially remove it from the equation. What we are concerned mostly is the interaction between the job controller and the workload reconciler.

@mimowo do you have any suggestions on how to improve this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With current solution I think we can't manipulate the passing of time in the test. Inside worload_controller we are using another incapsulated waitForPodsReadyConfig.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With scheduler I think it's more realistic case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test itself is fine, but I also believe there is a potential for flakes in this assert "checking the workload is evicted", because historically 1s was not enough for some asserts, and we use BackoffBaseSeconds: 1.

I think we could pro-actively set the BackoffBaseSeconds to 2 or 3, but I'm also ok to wait until we observe flakes here.

@alculquicondor
Copy link
Contributor

Integration passed 1/5

/test pull-kueue-test-integration-main

@mbobrovskyi
Copy link
Contributor Author

Integration passed 2/5

/test pull-kueue-test-integration-main

@mbobrovskyi
Copy link
Contributor Author

Integration passed 3/5

/test pull-kueue-test-integration-main

@mbobrovskyi
Copy link
Contributor Author

Integration passed 4/5

/test pull-kueue-test-integration-main

@mbobrovskyi
Copy link
Contributor Author

Integration passed 5/5

/test pull-kueue-test-integration-main

@alculquicondor
Copy link
Contributor

/lgtm
/approve

If @mimowo has any suggestions, we can apply them in a follow up.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e5287e4c21b2efecc94d6bf8fd1dec28022ab3e2

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mbobrovskyi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2024
@k8s-ci-robot k8s-ci-robot merged commit b088365 into kubernetes-sigs:main May 3, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone May 3, 2024
if requeueAfter > 0 {
return reconcile.Result{RequeueAfter: requeueAfter}, nil
}
workload.SetRequeuedCondition(&wl, kueue.WorkloadBackoffFinished, "The workload backoff was finished", true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks inconsistent: that we don't clearwl.Status.RequeueState for PodsReady timeout, but we clear it on L168 in case of Reactivation. In particular, I'm wondering if this could lead to incrementing the RequeueState.Count here on the second requeue. Please verify and fix if this is the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how much attempts we have, to deactivate workload if we exceeded the limit here.

func (r *WorkloadReconciler) triggerDeactivationOrBackoffRequeue(ctx context.Context, wl *kueue.Workload) (bool, error) {
...
        // If requeuingBackoffLimitCount equals to null, the workloads is repeatedly and endless re-queued.
	requeuingCount := ptr.Deref(wl.Status.RequeueState.Count, 0) + 1
	if r.waitForPodsReady.requeuingBackoffLimitCount != nil && requeuingCount > *r.waitForPodsReady.requeuingBackoffLimitCount {
		wl.Spec.Active = ptr.To(false)
		if err := r.client.Update(ctx, wl); err != nil {
			return false, err
		}
		r.recorder.Eventf(wl, corev1.EventTypeNormal, kueue.WorkloadEvictedByDeactivation,
			"Deactivated Workload %q by reached re-queue backoffLimitCount", klog.KObj(wl))
		return true, nil
	}
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you mean to clear just RequeueAt field?

Copy link
Contributor

@mimowo mimowo May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how much attempts we have

Ah, got it, we need to preserve the count so that we can increment it if the PodsReady timeout is exceeded for the next time.

Or you mean to clear just RequeueAt field?

I guess the RequeueAt field is harmless, but might be confusing, since the workload is already requeued, so indeed, I would suggest to clear it.

@mbobrovskyi mbobrovskyi deleted the fix/transitions-of-requeued-conditions branch May 6, 2024 09:24
return ctrl.Result{}, err
}
// If stopped cluster queue is started we need to set the WorkloadRequeued condition to true.
if isDisabledRequeuedByClusterQueueStopped(&wl) && ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suspect it does not matter what was the StopPolicy that lead to the setting Requeued: false for ClusterQueueStopped. In particular, if the stopPolicy: HoldAndDrain, then I believe we would set the Requeued: false here. However, the extra check ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None would prevent us to set Requeued: true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think the implementation is correct (as discussed offline), we only set Requeue: true if stopPolicy: None.

@mimowo
Copy link
Contributor

mimowo commented May 6, 2024

If @mimowo has any suggestions, we can apply them in a follow up.

@mbobrovskyi please verify and address (if needed) in a follow up the following comments:

@mimowo
Copy link
Contributor

mimowo commented May 6, 2024

@mbobrovskyi thanks for handling the comments, so the only follow up would be to clear RequeueAt when we set Requeued: true.

@mbobrovskyi
Copy link
Contributor Author

mbobrovskyi commented May 6, 2024

Fixed it on #2143.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix transitions of Requeued condition
6 participants