Conversation
There was a problem hiding this comment.
Pull request overview
This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.
Changes:
- Added new CRD fields
EphemeralNodesandEphemeralTopologyWaitTimeoutto NodeSet specification - Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets
- Added Python wait_for_topology.py script that waits for and formats topology data from ConfigMap before starting slurmd
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| api/v1alpha1/nodeset_types.go | Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to NodeSetSpec |
| api/v1alpha1/zz_generated.deepcopy.go | Generated deep copy methods for new EphemeralNodes pointer field |
| cmd/main.go | Registered TopologyDistributionReconciler and added kruisev1alpha1 scheme |
| config/crd/bases/slurm.nebius.ai_nodesets.yaml | Updated NodeSet CRD with new ephemeral nodes fields and validation |
| config/rbac/clustercontroller/role.yaml | Added RBAC permissions for resourcedistributions resource |
| helm/soperator/crds/slurmcluster-crd.yaml | Updated SlurmCluster CRD with new ephemeral nodes fields |
| helm/soperator/templates/manager-rbac.yaml | Added resourcedistributions to Kruise resources in RBAC |
| helm/soperator/values.yaml | Enabled ResourceDistributionGate feature gate for Kruise |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Updated CRD template with ephemeral nodes configuration |
| helm/soperator-fluxcd/values.yaml | Enabled ResourceDistributionGate feature gate |
| images/worker/slurmd.dockerfile | Added wait_for_topology.py script to image and made it executable |
| images/worker/slurmd_entrypoint.sh | Added build_dynamic_conf function and ephemeral nodes mode that sources topology env file |
| images/worker/wait_for_topology.py | New Python script that waits for topology data in ConfigMap and writes formatted env file |
| images/worker/wait_for_topology_test.py | Comprehensive unit tests for topology wait script |
| internal/consts/container.go | Added ContainerNameWaitForTopology constant |
| internal/consts/volume.go | Added volume and mount path constants for topology ConfigMap and env file |
| internal/controller/topologyconfcontroller/resourcedistribution_controller.go | New controller that manages ResourceDistribution for topology ConfigMap syncing |
| internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go | Unit tests for TopologyDistributionReconciler |
| internal/render/worker/container.go | Added RenderContainerWaitForTopology init container and topology env variables to slurmd container |
| internal/render/worker/statefulset.go | Added topology volumes and init container for ephemeral nodes |
| internal/values/slurm_nodeset.go | Added EphemeralNodes and EphemeralTopologyWaitTimeout fields to SlurmNodeSet |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
afd3440 to
bf57f8d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller_test.go
Outdated
Show resolved
Hide resolved
internal/controller/topologyconfcontroller/resourcedistribution_controller.go
Outdated
Show resolved
Hide resolved
64ead49 to
6558232
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/controller/topologyconfcontroller/workertopology_controller.go:393
- The InitializeTopologyConf function now always includes a "dummy" node in the topology configuration, even when there are actual worker nodes. This could cause issues with Slurm as it will try to manage a non-existent "dummy" node. The logic should be:
- If there are actual nodes (after adding real nodes), don't add the dummy node
- Only add the dummy node when there are no actual nodes to prevent an empty Nodes list
The condition at line 388 if len(nodes) == 0 will never be true after line 374 adds the dummy node, making this check ineffective.
func InitializeTopologyConf(asts *kruisev1b1.StatefulSetList) string {
if asts == nil {
return ""
}
var nodes []string
switchName := "SwitchName=unknown"
dummyNodeName := "dummy"
nodes = append(nodes, dummyNodeName)
if len(asts.Items) > 0 {
for _, sts := range asts.Items {
if sts.Spec.Replicas == nil || *sts.Spec.Replicas <= 0 {
continue
}
for i := 0; i < int(*sts.Spec.Replicas); i++ {
nodes = append(nodes, sts.Name+"-"+strconv.Itoa(i))
}
}
}
if len(nodes) == 0 {
return switchName
}
return switchName + " Nodes=" + strings.Join(nodes, ",")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6558232 to
2ccb9d1
Compare
2ccb9d1 to
98b38ea
Compare
75e9835 to
250814d
Compare
Problem
This pull request implements support for Slurm dynamic topology by enabling ephemeral nodes that receive topology configuration at runtime via ConfigMap distribution.
Testing
manually
Release Notes
Added new CRD fields EphemeralNodes and EphemeralTopologyWaitTimeout to NodeSet specification
Implemented TopologyDistributionReconciler to manage OpenKruise ResourceDistribution for syncing topology ConfigMaps to namespaces with ephemeral NodeSets