
plugin: Replace readClusterState with existing watch events #904

Merged
sharnoff merged 3 commits into main from sharnoff/plugin-replace-readClusterState on May 22, 2024

Conversation

sharnoff
Member

... and a few other commits tossed in for fun.

Probably best to review commit-by-commit.

The thing I'm most worried about is that this may change buffer handling.

Might be a good idea to go for #840 first, but I suspect that might also have its own issues.

This is a precursor to #517, to minimize the number of places we need to change resource handling.

@Omrigan
Contributor

Omrigan commented Apr 15, 2024

Do you mean these watch events?

vwc := vmWatchCallbacks{
	submitConfigUpdated: func(logger *zap.Logger, pod util.NamespacedName, newCfg api.VmConfig) {
		pushToQueue(logger, pod.Name, func() { p.handleVMConfigUpdated(hlogger, pod, newCfg) })
	},
	submitBoundsChanged: func(logger *zap.Logger, vm *api.VmInfo, podName string) {
		pushToQueue(logger, vm.Name, func() { p.handleUpdatedScalingBounds(hlogger, vm, podName) })
	},
	submitNonAutoscalingVmUsageChanged: func(logger *zap.Logger, vm *api.VmInfo, podName string) {
		pushToQueue(logger, vm.Name, func() { p.handleNonAutoscalingUsageChange(hlogger, vm, podName) })
	},
}

I am trying to understand how this relates to #873. For #873, my understanding was that we'd handle those callbacks from the pod watcher and get rid of the VM watcher entirely.

@sharnoff
Member Author

Do you mean these watch events?

Broadly, watch events like those - but it only relies on the Pod watch events, I think. (and specifically, this PR waits for only the initial Pod "add" events to be completely processed)

I am trying to understand how this relates to #873.

Right, yeah, fair. This PR should be ok - we only care about the Pod watch events. (unless I missed something - please feel free to double-check me on that)

@Omrigan Omrigan (Contributor) left a comment

Have some questions

pkg/plugin/state.go (resolved)
pkg/plugin/state.go (outdated; resolved)
// The initial state building is complete when the counter reaches zero, at which point we close
// the channel that this function will wait on.

var initEventsCount atomic.Int32
Omrigan (Contributor)

Can you use a semaphore here instead? Rough sketch of the idea below:

  • Acquire(nil, n) when events are added
  • Release(1) when an event is processed
  • Acquire(ctxWithTimeout, n) when waiting for events to be processed
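A minimal sketch of that pattern, assuming golang.org/x/sync/semaphore; the names (maxEvents, the timeout, the fake event handling) are illustrative and not taken from the PR, and it uses context.Background() for the per-event acquires rather than nil:

// Hypothetical sketch of the semaphore-as-barrier idea; not code from this PR.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	const maxEvents = 1024 // assumed upper bound on concurrent initial events
	sem := semaphore.NewWeighted(maxEvents)

	// "Acquire(..., 1) when events are added": each queued event takes one unit.
	for i := 0; i < 3; i++ {
		if err := sem.Acquire(context.Background(), 1); err != nil {
			panic(err)
		}
		i := i
		go func() {
			time.Sleep(10 * time.Millisecond) // stand-in for handling the event
			fmt.Println("processed initial event", i)
			sem.Release(1) // "Release(1) when an event is processed"
		}()
	}

	// "Acquire(ctxWithTimeout, n) when waiting": acquiring the full weight only
	// succeeds once every outstanding unit has been released.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := sem.Acquire(ctx, maxEvents); err != nil {
		fmt.Println("timed out waiting for initial events:", err)
		return
	}
	fmt.Println("initial cluster state built")
}

The final Acquire of the full weight acts as the "wait for initial state" barrier, with the context timeout bounding how long startup will wait.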

sharnoff (Member Author)

Seems simpler. Implemented it in caba29f but tbh I'm not sure if it's better. WDYT?

Omrigan (Contributor)

If incEventCount is left as a semaphore wrapper anyway, maybe instead revert to your old implementation, but wrap it all in a struct as a synchronization primitive?

The reason I suggested a semaphore is that I'm wary of using raw channels and atomics in code where it feels like there should be a dedicated synchronization primitive.

sharnoff (Member Author)

Made it a separate type in 521597e - LMK what you think.

Omrigan (Contributor)

I imagined the implementation with initEventsCount hidden inside the new helper.

Also, you don't need incEventCount now

sharnoff (Member Author) commented May 21, 2024

I imagined the implementation with initEventsCount hidden inside the new helper.

Hm, it's hard to make that kind of abstraction safe to use (because the counter wouldn't be monotonic) -- maybe it'd be fine here, but I'd rather keep the logic inlined than make an unsafe abstraction for just this purpose.

So in its current state, incEventCount or some equivalent is still needed (we could just add to the atomic directly; I don't know whether that's better).
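To make the monotonicity concern concrete, here's a small hypothetical illustration (not code from this PR): if a helper hides the counter and treats the first zero-crossing as "initial state complete", it can signal while more initial events are still being added.

// Hypothetical illustration of the hazard: with interleaved increments and
// decrements, zero is not a stable terminal state, so "counter reached zero"
// is only a safe completion signal if the total is fixed before waiting.
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var count atomic.Int32

	count.Add(1)            // initial event A is queued
	if count.Add(-1) == 0 { // event A is processed: the counter is already zero...
		fmt.Println("a naive helper would report: initial state complete")
	}
	count.Add(1) // ...but initial event B is only queued now, after that signal
	fmt.Println("events still outstanding:", count.Load())
}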

pkg/plugin/plugin.go (outdated; resolved)
pkg/plugin/config.go (resolved)
sharnoff force-pushed the sharnoff/plugin-replace-readClusterState branch from 521597e to b1e53de on May 10, 2024 00:20
pkg/plugin/config.go (resolved)
sharnoff force-pushed the sharnoff/plugin-replace-readClusterState branch from b1e53de to 6555a77 on May 21, 2024 01:06
@sharnoff
Member Author

Realized that the necessity of debe3b2 (from #936) is partially due to 25bbe43 here — it adds Buffer always, but there's a handful of circumstances where we don't actually need it (like when a new VM is added but doesn't go through our scheduler)

@sharnoff
Member Author

Adapted debe3b2 here as c8e250e to fix that issue. Impact should have been close to zero in practice, but wanted to fix it just to be safe. @Omrigan wdyt?

allowDeny: false,
// this may be a preexisting VM. If so, we should include it in "buffer" as long as it's
// supposed to be handled by us (otherwise, the "buffer" will never be resolved)
includeBuffer: pod.Spec.SchedulerName == e.state.conf.SchedulerName,
Omrigan (Contributor)

Why do we even have SchedulerName as part of the spec? Why would we ever want it to be different?

sharnoff (Member Author) commented May 22, 2024

Result of discussing this: it's part of the VM spec because neonvm is a logically separate dependency of "autoscaling", so the ability to run it with or without the rest of autoscaling requires being able to set the pod's scheduler name.

@Omrigan
Contributor

Omrigan commented May 22, 2024

Are you going to do a rebase merge? If so, the tmp commit needs to be squashed somewhere, maybe into the middle commit.

After the changes from #865 that switched to fetching VmInfo from Pod
objects, the VM store is no longer needed long-term, just during initial
cluster state reading (which can *also* be swapped out) and for event
generation.

This commit changes the VM store from being stored in the
AutoscaleEnforcer struct to being passed in to readClusterState as an
argument.

In short, readClusterState is super complicated, separately reimplements
the reserveResources() logic, and may be the source of several
startup-related bugs (probably #671 and #852).

So, given that we *already* have a pathway for updating our internal
state from changes in the cluster (i.e. the watch events), we should
just use that instead.
sharnoff force-pushed the sharnoff/plugin-replace-readClusterState branch from c8e250e to ceceb07 on May 22, 2024 16:20
sharnoff merged commit e5b592c into main on May 22, 2024
15 checks passed
sharnoff added a commit that referenced this pull request May 22, 2024
sharnoff deleted the sharnoff/plugin-replace-readClusterState branch on May 22, 2024 17:13