
try to provide most recent copy of AW to enable fast deletion #558

Merged 5 commits into project-codeflare:main on Aug 10, 2023

Conversation

@asm582 (Member) commented Aug 8, 2023

Issue link: #477

What changes have been made

In the core AW update method, the intent is to provide, on a best-effort basis, the most recent copy of the AW so that changes such as deletion of the AW are reflected.

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@metalcycling (Collaborator) commented:

What do you mean by best effort?

@asm582 (Member, Author) commented Aug 8, 2023

> What do you mean by best effort?

Meaning: if an AW is submitted and ScheduleNext picks it up, it goes through all the dispatch logic; when the time comes for the API server update, that update fails, which causes the AW to be re-queued and only deleted later. So there is a delay of approximately one dispatch cycle before the AW is deleted.

@asm582 (Member, Author) commented Aug 8, 2023

The issue linked in the PR explains why this PR is needed. I submitted 1K AWs, deleted them with oc delete, and later resubmitted the same 1K AWs. It now takes a bit over 2 minutes to schedule the first AW from the resubmitted 1K, which I think is an improvement over the delays seen in the issue.

@asm582 requested a review from z103cb on August 9, 2023 03:26
@asm582 (Member, Author) commented Aug 9, 2023

It needs some more testing for quick deletions at scale. I reduced the retries and also made a choice not to retry when there is a conflict error from etcd; the bias is to trust etcd, especially in the case of deletions.
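For illustration, a classifier with that bias (fail fast on conflict or not-found errors, trusting etcd) could look roughly like the sketch below. This is not necessarily the repository's actual EtcdErrorClassifier; the use of the go-resiliency retrier interface and the Kubernetes apierrors helpers is an assumption about how such errors would be detected.

```go
package queuejob

import (
	"github.com/eapache/go-resiliency/retrier"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Hypothetical sketch of an error classifier for the retrier.
// A conflict or not-found error is not retried, on the assumption
// that etcd already holds the newer truth (e.g. the AW was deleted).
type EtcdErrorClassifier struct{}

func (c *EtcdErrorClassifier) Classify(err error) retrier.Action {
	switch {
	case err == nil:
		return retrier.Succeed
	case apierrors.IsConflict(err) || apierrors.IsNotFound(err):
		return retrier.Fail // trust etcd; do not retry on stale or deleted objects
	default:
		return retrier.Retry
	}
}
```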

@@ -932,7 +932,7 @@ func (qjm *XController) ScheduleNext() {
 // the appwrapper from being added in syncjob
 defer qjm.schedulingAWAtomicSet(nil)

-scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(10, 100*time.Millisecond), &EtcdErrorClassifier{})
+scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(1, 100*time.Millisecond), &EtcdErrorClassifier{})
Contributor:

If you are doing this, I think you are better off removing the retry logic from here and using it only when updating etcd.

Member Author:

For now, adding retry at the relevant spot is TBD.
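As an illustration of the suggestion above, retrying only around the API server update (rather than around all of ScheduleNext) might look roughly like this. The helper name updateWithRetry and the updateFn callback are hypothetical stand-ins, not identifiers from the codebase.

```go
import (
	"time"

	"github.com/eapache/go-resiliency/retrier"
)

// updateWithRetry is a hypothetical helper: it retries only the API
// server update with a short exponential backoff, leaving the rest of
// ScheduleNext free of retry logic. updateFn stands in for the real
// clientset Update call.
func updateWithRetry(updateFn func() error) error {
	r := retrier.New(retrier.ExponentialBackoff(3, 100*time.Millisecond), &EtcdErrorClassifier{})
	return r.Run(func() error {
		// On a conflict or not-found error the classifier fails fast:
		// the AW is assumed deleted (or newer in etcd), and the caller
		// abandons the dispatch attempt instead of retrying.
		return updateFn()
	})
}
```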

@z103cb (Contributor) left a review comment:

I think the changes to the ScheduleNext function need to go a little deeper. Reducing the number of retries is not the right approach. IMHO, for this PR to pass, the change will need to be reverted, or the retry needs to be moved to the individual etcd updates.

@z103cb (Contributor) commented Aug 9, 2023

I think you need to look more closely at the interactions between the worker, ScheduleNext, and UpdateQueueJob threads. I suspect that UpdateQueueJob is interfering with the others. I strongly recommend re-evaluating its purpose (IMHO it should be removed).

@asm582 (Member, Author) commented Aug 9, 2023

> I think the changes to the ScheduleNext function need to go a little deeper. Reducing the number of retries is not the right approach. IMHO, for this PR to pass, the change will need to be reverted, or the retry needs to be moved to the individual etcd updates.

Reducing or removing the retry seems like the correct approach when the ScheduleNext function is processing an AW and an external client deletes that same AW from etcd.

@asm582 (Member, Author) commented Aug 9, 2023

> I think you need to look more closely at the interactions between the worker, ScheduleNext, and UpdateQueueJob threads. I suspect that UpdateQueueJob is interfering with the others. I strongly recommend re-evaluating its purpose (IMHO it should be removed).

Thanks, I think the interaction between the two threads is an orthogonal issue. It is easy to turn off the UpdateQueueJob thread, but then we lose the ability to update the state of the AW once it is running.

@tardieu (Member) commented Aug 9, 2023

We should probably consider a more extensive revision of the handling of deletion. IMHO we should add a finalizer to the AppWrapper object on first encounter (and make sure the update went through). This finalizer then makes it possible to use the builtin deletion timestamp from Kubernetes as the trigger for deletion, not the object removal, the latter being unreliable. Updating the AppWrapper at deletion time, say to set the status to Terminating, then becomes entirely optional.
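For context, a minimal sketch of the finalizer-based approach being suggested, using only the generic metav1.Object accessors; the finalizer name and helper functions are illustrative, not identifiers from the MCAD codebase.

```go
package queuejob

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative finalizer name; not taken from the actual project.
const appWrapperFinalizer = "workload.codeflare.dev/finalizer"

// ensureFinalizer adds the finalizer on first encounter and reports
// whether the object needs to be updated on the API server.
func ensureFinalizer(obj metav1.Object) bool {
	for _, f := range obj.GetFinalizers() {
		if f == appWrapperFinalizer {
			return false
		}
	}
	obj.SetFinalizers(append(obj.GetFinalizers(), appWrapperFinalizer))
	return true
}

// isBeingDeleted uses the built-in deletion timestamp, rather than
// object removal, as the trigger for deletion handling.
func isBeingDeleted(obj metav1.Object) bool {
	return obj.GetDeletionTimestamp() != nil
}

// removeFinalizer clears the finalizer once cleanup is done, allowing
// Kubernetes to actually remove the object.
func removeFinalizer(obj metav1.Object) {
	finalizers := []string{}
	for _, f := range obj.GetFinalizers() {
		if f != appWrapperFinalizer {
			finalizers = append(finalizers, f)
		}
	}
	obj.SetFinalizers(finalizers)
}
```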

@astefanutti (Contributor) commented:

/lgtm

+1 for an overhaul of the deletion logic to stick to idiomatic Kubernetes as suggested by @tardieu.

This will also eventually be reworked when we migrate over to controller-runtime.

@asm582 (Member, Author) commented Aug 10, 2023

exp_upd.log
@sutaakar @Srihari1192 @astefanutti @anishasthana Please find attached the local logs where all test cases pass. Any comments on why we have a discrepancy, with the build passing while a few test cases fail?

@asm582 (Member, Author) commented Aug 10, 2023

Opened issue #564, but the question still remains open about the discrepancy with the tests passing locally. For now I have rerun the build.

@metalcycling (Collaborator) left a review comment:

LGTM

pkg/controller/queuejob/queuejob_controller_ex.go: review thread outdated and resolved.
@openshift-ci bot commented Aug 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: metalcycling

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 008f603 into project-codeflare:main on Aug 10, 2023
2 checks passed