
scheduler plugin should de-prioritize newer nodes #846

Open
sharnoff opened this issue Mar 6, 2024 · 0 comments
Assignees
Labels
c/autoscaling/scheduler Component: autoscaling: k8s scheduler t/feature Issue type: feature, for new features or requests

Comments

@sharnoff
Member

sharnoff commented Mar 6, 2024

Problem description / Motivation

Currently the load on the scheduler is somewhat unusual: computes have (usually) short but uneven lifetimes, and varying external load produces regular usage spikes.

This load sometimes interacts with our node scoring algorithm to produce chaotic (in the mathematical sense) and cyclical fluctuations in reserved resources on the nodes. This has a single primary effect:

  • We fail to produce nodes with lower usage when the cluster has capacity to get rid of a node (meaning we remain overprovisioned)

In particular, this happens most visibly when a node is added due to external demand — sometimes it is removed after demand returns to normal, but sometimes another node's usage goes down instead (but not far enough to be removed).

Here's a recent example:

Graph of reserved CPU on nodes in us-west-2, showing a node added, its usage spiky between 10-40%, and then an hour later its usage swaps with one of the many nodes at 80% reserved, with that other node's usage slowly decreasing afterwards towards 20-25%

Discussion here: https://neondb.slack.com/archives/C03TN5G758R/p1709660933447909

Feature idea(s) / DoD

To mitigate the issues above, the scheduler plugin should de-prioritize newer nodes. This provides both a consistent ordering (preventing usage from "swapping" between nodes) and explicitly prioritizes removing nodes that were added to satisfy immediate demand (which will have fewer long-running computes).

Implementation ideas

From the slack thread linked above:

I'm imagining that the new node scoring algorithm should be the following (note scores are always 0 to 100).

  • If a node's usage is >85%
    • Score is 33 * (1 - (usage fraction - 0.85)) — i.e. higher usage is worse
  • Else, if it's one of the youngest ceil(20% of N) nodes:
    • Score is 33 + min(33, rank within youngest nodes) — i.e. younger (rank is a smaller number) is worse (overloaded terms; I intend that youngest is rank 1, second-youngest is rank 2, etc.)
  • Otherwise
    • Score is 66 + (usage fraction * 33) — i.e. higher usage is better

(specific numbers to replace 85 and 20 TBD)

@sharnoff sharnoff added t/feature Issue type: feature, for new features or requests c/autoscaling/scheduler Component: autoscaling: k8s scheduler labels Mar 6, 2024
@sharnoff sharnoff self-assigned this Mar 6, 2024