
try to provide most recent copy of AW to enable fast deletion #558

Merged 5 commits into project-codeflare:main on Aug 10, 2023

Conversation

@asm582 (Member) commented Aug 8, 2023

Issue link: #477

What changes have been made

In the core AW update method, the intent is to provide, on a best-effort basis, the most recent copy of the AW so that changes such as deletion of the AW are reflected.

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@metalcycling (Collaborator) commented:

What do you mean by best effort?

@asm582 (Member, Author) commented Aug 8, 2023

> What do you mean by best effort?

Meaning: if an AW is submitted and ScheduleNext picks it up, it goes through all the dispatch logic; when the time comes for the API server update, that update fails, which causes the AW to be re-queued and only deleted later. So there is a delay of approximately one dispatch cycle before the AW is deleted.

@asm582 (Member, Author) commented Aug 8, 2023

The issue linked in the PR explains why this PR is needed. I submitted 1K AWs, deleted them with oc delete, and later resubmitted the same 1K AWs. It now takes a bit over 2 minutes to schedule the first AW from the resubmitted 1K, which I think is an improvement over the delays seen in the issue.

@asm582 requested a review from z103cb on August 9, 2023 03:26
@asm582 (Member, Author) commented Aug 9, 2023

It needs some more testing for quick deletions at scale. I reduced the retries and also made a choice not to retry when there is a conflict error from etcd; the bias is to trust etcd, especially in the case of deletions.
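For illustration, a classifier with that bias (fail fast on conflict or not-found errors, trusting etcd) could look roughly like the sketch below. This is not necessarily the repository's actual EtcdErrorClassifier; the use of the go-resiliency retrier interface and the Kubernetes apierrors helpers is an assumption about how such errors would be detected.

```go
package queuejob

import (
	"github.com/eapache/go-resiliency/retrier"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Hypothetical sketch of an error classifier for the retrier.
// A conflict or not-found error is not retried, on the assumption
// that etcd already holds the newer truth (e.g. the AW was deleted).
type EtcdErrorClassifier struct{}

func (c *EtcdErrorClassifier) Classify(err error) retrier.Action {
	switch {
	case err == nil:
		return retrier.Succeed
	case apierrors.IsConflict(err) || apierrors.IsNotFound(err):
		return retrier.Fail // trust etcd; do not retry on stale or deleted objects
	default:
		return retrier.Retry
	}
}
```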

@@ -932,7 +932,7 @@ func (qjm *XController) ScheduleNext() {
 // the appwrapper from being added in syncjob
 defer qjm.schedulingAWAtomicSet(nil)

-scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(10, 100*time.Millisecond), &EtcdErrorClassifier{})
+scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(1, 100*time.Millisecond), &EtcdErrorClassifier{})
Contributor:

If you are doing this, I think you are better off removing the retry logic from here and using it only when updating etcd.

Member Author:

For now, adding retry at the relevant spot is TBD.
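As an illustration of the suggestion above, retrying only around the API server update (rather than around all of ScheduleNext) might look roughly like this. The helper name updateWithRetry and the updateFn callback are hypothetical stand-ins, not identifiers from the codebase.

```go
import (
	"time"

	"github.com/eapache/go-resiliency/retrier"
)

// updateWithRetry is a hypothetical helper: it retries only the API
// server update with a short exponential backoff, leaving the rest of
// ScheduleNext free of retry logic. updateFn stands in for the real
// clientset Update call.
func updateWithRetry(updateFn func() error) error {
	r := retrier.New(retrier.ExponentialBackoff(3, 100*time.Millisecond), &EtcdErrorClassifier{})
	return r.Run(func() error {
		// On a conflict or not-found error the classifier fails fast:
		// the AW is assumed deleted (or newer in etcd), and the caller
		// abandons the dispatch attempt instead of retrying.
		return updateFn()
	})
}
```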

@z103cb (Contributor) left a review comment:

I think the changes to the ScheduleNext function need to go a little deeper. Reducing the number of retries is not the right approach. IMHO, for this PR to pass, the change will need to be reverted, or the retry needs to be moved to the individual etcd updates.

@z103cb (Contributor) commented Aug 9, 2023

I think you need to look more closely at the interactions between the worker, ScheduleNext, and UpdateQueueJob threads. I suspect that UpdateQueueJob is interfering with the others. I strongly recommend re-evaluating its purpose (IMHO it should be removed).

@asm582 (Member, Author) commented Aug 9, 2023

> I think the changes to the ScheduleNext function need to go a little deeper. Reducing the number of retries is not the right approach. IMHO, for this PR to pass, the change will need to be reverted, or the retry needs to be moved to the individual etcd updates.

Reducing or removing the retry seems like the correct approach when the ScheduleNext function is processing an AW and an external client deletes that same AW from etcd.

@asm582 (Member, Author) commented Aug 9, 2023

> I think you need to look more closely at the interactions between the worker, ScheduleNext, and UpdateQueueJob threads. I suspect that UpdateQueueJob is interfering with the others. I strongly recommend re-evaluating its purpose (IMHO it should be removed).

Thanks, I think the interaction between the two threads is an orthogonal issue. It is easy to turn off the UpdateQueueJob thread, but then we lose the ability to update the state of the AW once it is running.

@tardieu (Member) commented Aug 9, 2023

We should probably consider a more extensive revision of the handling of deletion. IMHO we should add a finalizer to the AppWrapper object on first encounter (and make sure the update went through). This finalizer then makes it possible to use the builtin deletion timestamp from Kubernetes as the trigger for deletion, not the object removal, the latter being unreliable. Updating the AppWrapper at deletion time, say to set the status to Terminating, then becomes entirely optional.
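For context, a minimal sketch of the finalizer-based approach being suggested, using only the generic metav1.Object accessors; the finalizer name and helper functions are illustrative, not identifiers from the MCAD codebase.

```go
package queuejob

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative finalizer name; not taken from the actual project.
const appWrapperFinalizer = "workload.codeflare.dev/finalizer"

// ensureFinalizer adds the finalizer on first encounter and reports
// whether the object needs to be updated on the API server.
func ensureFinalizer(obj metav1.Object) bool {
	for _, f := range obj.GetFinalizers() {
		if f == appWrapperFinalizer {
			return false
		}
	}
	obj.SetFinalizers(append(obj.GetFinalizers(), appWrapperFinalizer))
	return true
}

// isBeingDeleted uses the built-in deletion timestamp, rather than
// object removal, as the trigger for deletion handling.
func isBeingDeleted(obj metav1.Object) bool {
	return obj.GetDeletionTimestamp() != nil
}

// removeFinalizer clears the finalizer once cleanup is done, allowing
// Kubernetes to actually remove the object.
func removeFinalizer(obj metav1.Object) {
	finalizers := []string{}
	for _, f := range obj.GetFinalizers() {
		if f != appWrapperFinalizer {
			finalizers = append(finalizers, f)
		}
	}
	obj.SetFinalizers(finalizers)
}
```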

@astefanutti (Contributor) commented:

/lgtm

+1 for an overhaul of the deletion logic to stick to idiomatic Kubernetes as suggested by @tardieu.

This will also eventually be reworked when we migrate over to controller-runtime.

@asm582 (Member, Author) commented Aug 10, 2023

exp_upd.log
@sutaakar @Srihari1192 @astefanutti @anishasthana Please find attached the local logs where all test cases pass. Any comments on why we have a discrepancy, with the build passing while a few test cases fail?

@asm582 (Member, Author) commented Aug 10, 2023

Opened issue #564, but the question still remains open about the discrepancy with the tests passing locally. For now I have rerun the build.

@metalcycling (Collaborator) left a review comment:

LGTM

pkg/controller/queuejob/queuejob_controller_ex.go: review thread outdated and resolved.
@openshift-ci bot commented Aug 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metalcycling

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 008f603 into project-codeflare:main on Aug 10, 2023
2 checks passed