
scheduler plugin should de-prioritize newer nodes #846

Open
sharnoff opened this issue Mar 6, 2024 · 0 comments
Assignees
Labels
c/autoscaling/scheduler Component: autoscaling: k8s scheduler t/feature Issue type: feature, for new features or requests

Comments

@sharnoff
Member

sharnoff commented Mar 6, 2024

Problem description / Motivation

Currently the load on the scheduler is somewhat unusual: computes have (usually) short but uneven lifetimes, and varying external load produces regular usage spikes.

This load sometimes interacts with our node scoring algorithm to produce chaotic (in the mathematical sense) and cyclical fluctuations in reserved resources on the nodes. This has a single primary effect:

  • We fail to produce nodes with lower usage when the cluster has capacity to get rid of a node (meaning we remain overprovisioned)

In particular, this happens most visibly when a node is added due to external demand — sometimes it is removed after demand returns to normal, but sometimes another node's usage goes down instead (but not far enough to be removed).

Here's a recent example:

Graph of reserved CPU on nodes in us-west-2, showing a node added, its usage spiky between 10-40%, and then an hour later its usage swaps with one of the many nodes at 80% reserved, with that other node's usage slowly decreasing afterwards towards 20-25%

Discussion here: https://neondb.slack.com/archives/C03TN5G758R/p1709660933447909

Feature idea(s) / DoD

To mitigate the issues above, the scheduler plugin should de-prioritize newer nodes. This provides both a consistent ordering (preventing usage from "swapping" between nodes) and explicitly prioritizes removing nodes that were added to satisfy immediate demand (which will have fewer long-running computes).

Implementation ideas

From the slack thread linked above:

I'm imagining that the new node scoring algorithm should be the following (note scores are always 0 to 100).

  • If a node's usage is >85%
    • Score is 33 * (1 - (usage fraction - 0.85)) — i.e. higher usage is worse
  • Else, if it's one of the youngest ceil(20% of N) nodes:
    • Score is 33 + min(33, rank within youngest nodes) — i.e. younger (rank is a smaller number) is worse (overloaded terms; I intend that youngest is rank 1, second-youngest is rank 2, etc.)
  • Otherwise
    • Score is 66 + (usage fraction * 33) — i.e. higher usage is better

(specific numbers to replace 85 and 20 TBD)

@sharnoff sharnoff added t/feature Issue type: feature, for new features or requests c/autoscaling/scheduler Component: autoscaling: k8s scheduler labels Mar 6, 2024
@sharnoff sharnoff self-assigned this Mar 6, 2024