SCHED-380: Skip maintenance handling based on node labels by ali-sattari · Pull Request #1773 · nebius/soperator

ali-sattari · 2025-11-14T07:58:23Z

Problem

Soperator's handling of maintenance events is geared toward worker nodes: it deletes the node so that it gets recreated. For non-replicated stateful applications such as NFS, we want to stop and start the node instead. The managed K8S control plane already handles maintenance the way we need for these applications.

Solution

Accept a list of labels as input and skip maintenance handling for any node that matches them when a maintenance condition is set.
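
A minimal sketch of the label check, for illustration only. The function name `hasIgnoredLabel` is hypothetical, and the "any configured key/value pair matches" semantics is an assumption; the actual controller code in this PR may match differently.

```go
package main

import "fmt"

// hasIgnoredLabel reports whether at least one key/value pair from ignored
// is present on the node with the same value.
// Assumption: a single matching pair is enough to skip the node.
func hasIgnoredLabel(nodeLabels, ignored map[string]string) bool {
	for key, want := range ignored {
		if got, ok := nodeLabels[key]; ok && got == want {
			return true
		}
	}
	return false
}

func main() {
	ignored := map[string]string{"slurm.nebius.ai/nodeset": "nfs"}

	nfsNode := map[string]string{
		"slurm.nebius.ai/nodeset":  "nfs",
		"slurm.nebius.ai/workload": "cpu",
	}
	// Hypothetical label value for a node that should not be skipped.
	otherNode := map[string]string{"slurm.nebius.ai/nodeset": "other"}

	fmt.Println(hasIgnoredLabel(nfsNode, ignored))   // node is skipped
	fmt.Println(hasIgnoredLabel(otherNode, ignored)) // node is handled as before
}
```

With an empty ignore list the function returns false for every node, so the existing maintenance handling stays unchanged unless labels are configured.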

Testing

Deploy a cluster with a separate node group for NFS in K8S
Configure Soperator to pass the labels for those nodes (e.g. slurm.nebius.ai/nodeset=nfs)
Follow the logs of the soperator-check-checks deployment
Manually set the condition on a node using kubectl proxy and curl (example below)
You should see "skipping draincondition processing due to ignored labels" in the logs
Soperator should not take any action on a node matching the labels above
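
For step 2, the configured value has to be turned into a label map at some point. The sketch below shows one way to do that, assuming a comma-separated list of key=value pairs; that format and the name `parseLabelList` are assumptions for illustration, not the flag or env handling added in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabelList converts "key=value,key2=value2" into a map.
// Assumption: entries are comma-separated and each has the form key=value.
func parseLabelList(raw string) (map[string]string, error) {
	labels := make(map[string]string)
	for _, item := range strings.Split(raw, ",") {
		item = strings.TrimSpace(item)
		if item == "" {
			continue // tolerate empty input and trailing commas
		}
		key, value, found := strings.Cut(item, "=")
		if !found || key == "" {
			return nil, fmt.Errorf("invalid label %q, expected key=value", item)
		}
		labels[key] = value
	}
	return labels, nil
}

func main() {
	labels, err := parseLabelList("slurm.nebius.ai/nodeset=nfs")
	if err != nil {
		panic(err)
	}
	fmt.Println(labels)
}
```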

Manually setting node condition:

curl -X PATCH http://127.0.0.1:8001/api/v1/nodes/computeinstance-xxx/status \
 -H "Content-Type: application/merge-patch+json" \
 --data '{
  "status": {
   "conditions": [
    {
     "type": "MaintenanceScheduled",
     "status": "True",
     "message": "Maintenance scheduled for node",
     "lastTransitionTime": "'"$(date -u +"%Y-%m-%dT%H:%M:%SZ")"'"
    }
   ]
  }
}'

Log line example:

2025-11-14T12:29:01.388403986Z manager {"level":"info","ts":"2025-11-14T12:29:01Z","logger":"K8SNodesController.processDrainCondition","msg":"skipping draincondition processing due to ignored labels","controller":"soperatorchecks.k8snodes","controllerGroup":"","controllerKind":"Node","Node":{"name":"computeinstance-xxx"},"namespace":"","name":"computeinstance-xxx","reconcileID":"yyyy","node":"computeinstance-xxx","nodeLabels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/instance-type":"cpu-d3","beta.kubernetes.io/os":"linux","failure-domain.beta.kubernetes.io/region":"eu-north1","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"computeinstance-xxx","kubernetes.io/os":"linux","nebius.com/node-group-id":"mk8snodegroup-zzz","nebius.com/resource-preset":"32vcpu-128gb","node.kubernetes.io/instance-type":"cpu-d3","slurm.nebius.ai/nodeset":"nfs","slurm.nebius.ai/workload":"cpu","topology.kubernetes.io/region":"eu-north1"},"ignoredLabels":{"slurm.nebius.ai/nodeset":"nfs"}}
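
To check for this log line in an automated test, the JSON part can be decoded and inspected. The sketch below is illustrative only: `skipLogEntry` and `parseSkipLog` are hypothetical names, and the sample input is the log line above shortened to a few fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// skipLogEntry holds the fields of interest from the log line.
// The top-level "name" key is used for the node name because encoding/json
// matches keys case-insensitively, so a field tagged "node" would also be
// offered the "Node" object and fail to decode.
type skipLogEntry struct {
	Msg           string            `json:"msg"`
	Name          string            `json:"name"`
	IgnoredLabels map[string]string `json:"ignoredLabels"`
}

// parseSkipLog decodes the JSON payload that follows the timestamp and
// container name prefix.
func parseSkipLog(line string) (skipLogEntry, error) {
	var entry skipLogEntry
	start := strings.Index(line, "{")
	if start < 0 {
		return entry, fmt.Errorf("no JSON payload in log line")
	}
	if err := json.Unmarshal([]byte(line[start:]), &entry); err != nil {
		return entry, err
	}
	return entry, nil
}

func main() {
	// Shortened version of the log line shown above.
	line := `2025-11-14T12:29:01.388403986Z manager {"level":"info","msg":"skipping draincondition processing due to ignored labels","Node":{"name":"computeinstance-xxx"},"name":"computeinstance-xxx","node":"computeinstance-xxx","ignoredLabels":{"slurm.nebius.ai/nodeset":"nfs"}}`

	entry, err := parseSkipLog(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(entry.Name, entry.Msg, entry.IgnoredLabels)
}
```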

Release Notes

Skip maintenance handling based on node labels. Useful to let other controllers (such as mk8s in Nebius) handle maintenance events.

cmd/soperatorchecks/main.go

ali-sattari added the feature label Nov 14, 2025

ali-sattari marked this pull request as ready for review November 14, 2025 12:51

asteny reviewed Nov 14, 2025

cmd/soperatorchecks/main.go (outdated, resolved)

ali-sattari requested a review from asteny November 14, 2025 14:11

asteny approved these changes Nov 14, 2025

ali-sattari force-pushed the ali/SCHED-381/soperator-skip-maintenance-for-node branch from 268b105 to 5e45836 Compare November 14, 2025 14:48

rdjjke approved these changes Nov 14, 2025

ali-sattari added 3 commits November 17, 2025 14:33

ignore nodes by label in maintenance handling

7659112

pass the new conf as env

4c0aa0e

rename flag

77da62a

ali-sattari force-pushed the ali/SCHED-381/soperator-skip-maintenance-for-node branch from 5e45836 to 77da62a Compare November 17, 2025 13:33

ali-sattari merged commit 5fae0e1 into main Nov 17, 2025
6 checks passed

ali-sattari deleted the ali/SCHED-381/soperator-skip-maintenance-for-node branch November 17, 2025 14:10

ali-sattari merged 3 commits into main from ali/SCHED-381/soperator-skip-maintenance-for-node
