
DRA: scheduler event handlers via assume cache #124595

Open

pohly wants to merge 5 commits into master from the dra-scheduler-assume-cache-eventhandlers branch

Conversation

pohly
Contributor

@pohly pohly commented Apr 28, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Events that make pods schedulable were triggered by the informer cache, not the assume cache. For "claim was deallocated", this led to a small, unlikely race if a pod got scheduled and stopped so quickly that the informer cache never saw the "claim is allocated" state. The event handler now reacts to changes in the assume cache instead, because that cache is guaranteed to have received the "claim is allocated" state which caused some pod to not get scheduled: by definition, the cache must have listed some other claim as using resources needed for that pod.

Which issue(s) this PR fixes:

Fixes #123698

Does this PR introduce a user-facing change?

DRA: fix a small, unlikely race condition during pod scheduling

/assign @kerthcet

Do you have time to review?

/cc @towca

This is related to the work that you are doing for the cluster autoscaler.

This is a basic implementation of a first-in-first-out queue with unbounded
size. It's useful for cases where a channel with fixed size might deadlock.

The caller is responsible for locking.
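For illustration, a minimal sketch of such a queue might look like the following (names and element type are assumptions, not necessarily the PR's actual code; as noted above, callers must do their own locking):

// Queue is a basic first-in-first-out queue with unbounded size, useful
// where a fixed-size channel might deadlock. It is not thread-safe; the
// caller is responsible for locking.
type Queue[T any] struct {
	elements []T
}

// Push appends an element at the end of the queue.
func (q *Queue[T]) Push(element T) {
	q.elements = append(q.elements, element)
}

// Pop removes and returns the oldest element. The boolean is false when
// the queue is empty.
func (q *Queue[T]) Pop() (T, bool) {
	if len(q.elements) == 0 {
		var zero T
		return zero, false
	}
	element := q.elements[0]
	q.elements = q.elements[1:]
	return element, true
}

// Len returns the number of queued elements.
func (q *Queue[T]) Len() int {
	return len(q.elements)
}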
Step simplifies using WithStep: it creates a local scope in which the tCtx variable is the one that carries the step name.
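Roughly, the helper described here could be sketched like this (minimal stand-in types for illustration only; the real ktesting test context carries much more):

package sketch

import "context"

// TContext stands in for ktesting's test context in this sketch.
type TContext = context.Context

type stepKey struct{}

// WithStep returns a context annotated with a step name.
func WithStep(tCtx TContext, name string) TContext {
	return context.WithValue(tCtx, stepKey{}, name)
}

// Step runs cb with the step-scoped context. Because cb takes its own tCtx
// parameter, the only tCtx visible inside the callback is the one that
// carries the step name.
func Step(tCtx TContext, name string, cb func(tCtx TContext)) {
	cb(WithStep(tCtx, name))
}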
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 28, 2024
@k8s-ci-robot k8s-ci-robot requested a review from towca April 28, 2024 12:53
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 28, 2024
@pohly pohly changed the title DRA: schedule event handlers via assume cache DRA: scheduler event handlers via assume cache Apr 28, 2024
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024
@pohly pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 3cc6fe2 to 7d9abd5 Compare April 29, 2024 06:35
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please ask for approval from kerthcet. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This enables using the assume cache for cluster events.
@pohly pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 7d9abd5 to 2d66ba2 Compare April 29, 2024 08:59
@pohly
Contributor Author

pohly commented Apr 29, 2024

/retest

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Apr 29, 2024
This enables connecting the event handler for ResourceClaim to the assume
cache, which addresses a theoretical race condition.

It may also be useful for implementing the autoscaler support, because now
the autoscaler can modify the content of the cache.
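As an illustration of that last point, a consumer could locally override what the scheduler sees roughly like this (a sketch only: the helper name, field access, and import paths are assumptions; Assume is the assume cache's existing method for injecting a local copy of an object):

import (
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"

	"k8s.io/kubernetes/pkg/scheduler/util/assumecache"
)

// deallocateAssumed is a hypothetical helper: it makes the scheduler treat
// the claim as deallocated before the updated object has made it through
// the API server and the informer.
func deallocateAssumed(cache *assumecache.AssumeCache, claim *resourcev1alpha2.ResourceClaim) error {
	modified := claim.DeepCopy()
	modified.Status.Allocation = nil
	// With this PR, handlers registered on the cache are also notified
	// about such local changes, not only about informer updates.
	return cache.Assume(modified)
}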
@pohly pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 2d66ba2 to 0b0e8e3 Compare April 29, 2024 12:43
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage May 8, 2024
@haircommander haircommander moved this from Triage to Archive-it in SIG Node CI/Test Board May 8, 2024
@kerthcet
Member

Will take a look later, sorry I didn't notice the assignment.

Member

@kerthcet kerthcet left a comment


The eventQueue is a bit hard to understand, I may need more time on that.

@@ -447,8 +450,13 @@ func TestAddAllEventHandlers(t *testing.T) {

dynclient := dyfake.NewSimpleDynamicClient(scheme)
dynInformerFactory := dynamicinformer.NewDynamicSharedInformerFactory(dynclient, 0)
var resourceClaimCache *assumecache.AssumeCache

Member

It seems we didn't test DRA in this test because the feature gate is always disabled.

Contributor Author

Good catch. I added new test cases which exercise the informer creation more completely.

The feature gate check in eventhandlers.go is a bit redundant: the plugin itself never asks for any of these events when it is disabled.

@@ -701,6 +702,10 @@ type Handle interface {

SharedInformerFactory() informers.SharedInformerFactory

// ResourceClaimInfos returns an assume cache of ResourceClaim objects
// which gets populated by the shared informer factory.

Member

Suggested change:
-	// which gets populated by the shared informer factory.
+	// which gets populated by the shared informer factory and the dynamicResources plugin.

Member

The Handle interface is getting fatter and doesn't seem well designed.

Contributor Author

Added, with slightly different spelling.

}

objInfo := &objInfo{name: name, latestObj: obj, apiObj: obj}
if err = c.store.Update(objInfo); err != nil {
c.logger.Info("Error occurred while updating stored object", "err", err)
} else {
c.logger.V(10).Info("Adding object to assume cache", "description", c.description, "cacheKey", name, "assumeCache", obj)
for _, handler := range c.eventHandlers {

Member

It seems we're maintaining two levels of event handlers here? I don't quite get why we need this. I read the comment on eventQueue, but I still can't quite understand...

And do we really have several handlers? What I learned from the PR is that we only build the handler for ResourceClaim.

for gvk, at := range gvkMap {
	switch gvk {
	// ... other cases ...
	case framework.ResourceClaim:
		if utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
			// No need to wait for this cache to be
			// populated. If a claim is not yet in the
			// cache, scheduling will get retried once it
			// is.
			resourceClaimCache.AddEventHandler(buildEvtResHandler(at, framework.ResourceClaim, "ResourceClaim"))
		}
	// ... other cases ...
	}
}

Contributor Author

The chain of events is this now: claim informer -> assume cache -> plugin.

This ensures that local changes to the assume cache are properly reported to the plugin. This wasn't the case before, leading to the race explained in #123698
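To make that chain concrete, here is a self-contained toy sketch (made-up types, not the PR's code) of how changes fan out from the cache to registered handlers regardless of whether they came from the informer or from a local update:

package main

import "fmt"

// handler is what the dynamicResources plugin registers: it turns object
// changes into "a pod might be schedulable again" hints.
type handler func(obj string)

// toyAssumeCache illustrates the new chain: the informer feeds it, and it
// forwards every change, including local assumptions, to all handlers.
type toyAssumeCache struct {
	objects  map[string]string
	handlers []handler
}

// AddEventHandler registers a handler; in the PR the plugin registers here
// instead of directly on the claim informer.
func (c *toyAssumeCache) AddEventHandler(h handler) {
	c.handlers = append(c.handlers, h)
}

// update is called both for informer events and for local assume-cache
// changes, so handlers are guaranteed to see locally assumed state too.
func (c *toyAssumeCache) update(name, obj string) {
	c.objects[name] = obj
	for _, h := range c.handlers {
		h(obj)
	}
}

func main() {
	cache := &toyAssumeCache{objects: map[string]string{}}
	cache.AddEventHandler(func(obj string) { fmt.Println("plugin notified:", obj) })

	cache.update("claim-1", "allocated")        // change delivered by the informer
	cache.update("claim-1", "assumed reserved") // local change that never hit the informer
}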

Member

Thanks for the explanation, I get the idea now: it seems we have an MPSC architecture, with producers like the informer and local updates, and a single consumer, the event handler.

Based on that, maybe we'd better not consume the events immediately after producing them, but consume them independently in another goroutine. That is, we'd move something like emitEvents somewhere else, decoupled from the producer, which makes the code more readable and maintainable IMHO.

Moreover, we can hide the producer-consumer logic from the cache layer and keep it in the storage layer instead. Take the update for example (this is just a demo):

c.updateStore(objInfo) // rather than calling c.store.Update() and pushing the events separately

func (c *AssumeCache) updateStore(oldObj, newObj interface{}) {
    c.store.Update(newObj)
    c.eventQueue.Push(oldObj, newObj)
}

My two cents.

Contributor Author

What you are proposing sounds very scary to me in terms of correctness and race conditions.

Can we keep the current architecture for event processing and just slightly extend it so that the source of events can be informers as well as local assume cache updates?

Member

I mean something like this: c7da77c, which decouples the consumer logic.

Contributor Author

Can you explain why that is useful? What is the advantage of decoupling?

Event handlers are meant to complete quickly. That is already true for informer events. So making the code more complex to support long-running event handlers doesn't seem warranted to me.

Member

So the original idea was to simplify: the original code is a bit hard to understand, at least to me, so I don't think we're making it more complex.

And I don't think it's slower: we consume the events continuously and send fewer events. On the contrary, I think we get fewer races, because 1) before, while holding the lock, we would iterate over the handlers and send several related events, whereas now we only send one event, and 2) we call emitEvents on each OP, yet each call drains the whole queue, which means we acquire the lock unnecessarily when several emitEvents calls happen at the same time.

BTW, for each OP we push the events to a queue and consume them from that queue in one function; what's your take on that vs. invoking the handlers directly? Are you afraid the handlers will take a long time?

Would like to cc @alculquicondor @jsafrane for advice as well.
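For readers following along, the pattern under discussion is roughly the following self-contained sketch (assumed names, not the PR's actual code); the open question above is essentially whether the drain loop runs inline after each operation or in a separate goroutine:

package main

import (
	"fmt"
	"sync"
)

// event is one pending handler invocation.
type event func()

// cacheSketch queues events while holding the lock; emitEvents drains the
// queue and invokes the handlers without holding it.
type cacheSketch struct {
	mutex      sync.Mutex
	objects    map[string]string
	handlers   []func(obj string)
	eventQueue []event
}

func (c *cacheSketch) update(name, obj string) {
	c.mutex.Lock()
	c.objects[name] = obj
	// Queue one callback per handler under the lock so that events keep
	// the order of the cache mutations.
	for _, h := range c.handlers {
		h := h
		c.eventQueue = append(c.eventQueue, func() { h(obj) })
	}
	c.mutex.Unlock()

	// Call the handlers outside of the lock so a handler may safely call
	// back into the cache.
	c.emitEvents()
}

func (c *cacheSketch) emitEvents() {
	for {
		c.mutex.Lock()
		if len(c.eventQueue) == 0 {
			c.mutex.Unlock()
			return
		}
		next := c.eventQueue[0]
		c.eventQueue = c.eventQueue[1:]
		c.mutex.Unlock()
		next()
	}
}

func main() {
	c := &cacheSketch{objects: map[string]string{}}
	c.handlers = append(c.handlers, func(obj string) { fmt.Println("handler saw:", obj) })
	c.update("claim-1", "allocated")
}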

Contributor Author

Sorry, I am not following. The assume cache should emit the same number of events regardless of which goroutine the event handlers are called in, so "now we only send one event" doesn't make sense to me.

What does "OP" stand for?
