Skip to content

SCHED-848: add ephemeral nodes#2101

Draft
Uburro wants to merge 2 commits intomainfrom
uburro/SCHED-848/0
Draft

SCHED-848: add ephemeral nodes#2101
Uburro wants to merge 2 commits intomainfrom
uburro/SCHED-848/0

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Jan 28, 2026

Problem

This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.

Testing

manually

Release Notes

Added new CRD fields EphemeralNodes and EphemeralTopologyWaitTimeout to NodeSet specification
Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets

@Uburro Uburro added go Pull requests that update Go code feature labels Jan 28, 2026
@Uburro Uburro requested a review from Copilot January 28, 2026 15:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.

Changes:

  • Added new CRD fields EphemeralNodes and EphemeralTopologyWaitTimeout to NodeSet specification
  • Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets
  • Added Python wait_for_topology.py script that waits for and formats topology data from ConfigMap before starting slurmd

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
api/v1alpha1/nodeset_types.go Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to NodeSetSpec
api/v1alpha1/zz_generated.deepcopy.go Generated deep copy methods for new EphemeralNodes pointer field
cmd/main.go Registered TopologyDistributionReconciler and added kruisev1alpha1 scheme
config/crd/bases/slurm.nebius.ai_nodesets.yaml Updated NodeSet CRD with new ephemeral nodes fields and validation
config/rbac/clustercontroller/role.yaml Added RBAC permissions for resourcedistributions resource
helm/soperator/crds/slurmcluster-crd.yaml Updated SlurmCluster CRD with new ephemeral nodes fields
helm/soperator/templates/manager-rbac.yaml Added resourcedistributions to Kruise resources in RBAC
helm/soperator/values.yaml Enabled ResourceDistributionGate feature gate for Kruise
helm/soperator-crds/templates/slurmcluster-crd.yaml Updated CRD template with ephemeral nodes configuration
helm/soperator-fluxcd/values.yaml Enabled ResourceDistributionGate feature gate
images/worker/slurmd.dockerfile Added wait_for_topology.py script to image and made it executable
images/worker/slurmd_entrypoint.sh Added build_dynamic_conf function and ephemeral nodes mode that sources topology env file
images/worker/wait_for_topology.py New Python script that waits for topology data in ConfigMap and writes formatted env file
images/worker/wait_for_topology_test.py Comprehensive unit tests for topology wait script
internal/consts/container.go Added ContainerNameWaitForTopology constant
internal/consts/volume.go Added volume and mount path constants for topology ConfigMap and env file
internal/controller/topologyconfcontroller/resourcedistribution_controller.go New controller that manages ResourceDistribution for topology ConfigMap syncing
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go Unit tests for TopologyDistributionReconciler
internal/render/worker/container.go Added RenderContainerWaitForTopology init container and topology env variables to slurmd container
internal/render/worker/statefulset.go Added topology volumes and init container for ephemeral nodes
internal/values/slurm_nodeset.go Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to SlurmNodeSet

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 2 times, most recently from afd3440 to bf57f8d Compare January 29, 2026 15:06
@Uburro Uburro requested a review from Copilot January 29, 2026 15:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch 3 times, most recently from 64ead49 to 6558232 Compare February 3, 2026 09:36
@Uburro Uburro requested a review from Copilot February 3, 2026 09:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/controller/topologyconfcontroller/workertopology_controller.go:393

  • The InitializeTopologyConf function now always includes a "dummy" node in the topology configuration, even when there are actual worker nodes. This could cause issues with Slurm as it will try to manage a non-existent "dummy" node. The logic should be:
  • If there are actual nodes (after adding real nodes), don't add the dummy node
  • Only add the dummy node when there are no actual nodes to prevent an empty Nodes list

The condition at line 388 if len(nodes) == 0 will never be true after line 374 adds the dummy node, making this check ineffective.

func InitializeTopologyConf(asts *kruisev1b1.StatefulSetList) string {
	if asts == nil {
		return ""
	}

	var nodes []string
	switchName := "SwitchName=unknown"
	dummyNodeName := "dummy"
	nodes = append(nodes, dummyNodeName)

	if len(asts.Items) > 0 {
		for _, sts := range asts.Items {
			if sts.Spec.Replicas == nil || *sts.Spec.Replicas <= 0 {
				continue
			}

			for i := 0; i < int(*sts.Spec.Replicas); i++ {
				nodes = append(nodes, sts.Name+"-"+strconv.Itoa(i))
			}
		}
	}

	if len(nodes) == 0 {
		return switchName
	}

	return switchName + " Nodes=" + strings.Join(nodes, ",")
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch from 6558232 to 2ccb9d1 Compare February 3, 2026 17:12
@Uburro Uburro changed the title SCHED-848: switch to Slurm dynamic topology SCHED-848: add ephemeral nodes Feb 4, 2026
@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch from 2ccb9d1 to 98b38ea Compare February 4, 2026 14:33
@Uburro Uburro force-pushed the uburro/SCHED-848/0 branch from 75e9835 to 250814d Compare February 5, 2026 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature go Pull requests that update Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant