Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds AppArmor profile support to the soperator for Slurm worker and login nodes, addressing issue #184. It introduces a new UseDefaultAppArmorProfile field (defaulting to true) that generates and applies a default AppArmor profile to restrict write access to NVIDIA libraries and binaries. The PR notes that the mandatory linking to libnvidia-ml.so.1 was fixed in Slurm patch 24.05.5, but the operator currently uses version 22.05.2.
Key changes:
- Added
UseDefaultAppArmorProfileboolean field to the SlurmCluster CRD with a default value oftrue - Implemented AppArmor profile generation with deny rules for NVIDIA library/binary write access
- Integrated AppArmor profile reconciliation into the cluster controller workflow
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| api/v1/slurmcluster_types.go | Added UseDefaultAppArmorProfile field to SlurmClusterSpec |
| config/crd/bases/slurm.nebius.ai_slurmclusters.yaml | Updated CRD with new AppArmor profile field definition |
| helm/soperator-crds/templates/slurmcluster-crd.yaml | Synchronized CRD template with base CRD changes |
| helm/soperator/crds/slurmcluster-crd.yaml | Synchronized CRD with base CRD changes |
| helm/slurm-cluster/values.yaml | Added useDefaultAppArmorProfile: true configuration option |
| helm/slurm-cluster/templates/slurm-cluster-cr.yaml | Wired AppArmor profile setting to cluster CR |
| internal/render/common/apparmorprofile.go | Implemented AppArmor profile rendering and policy generation |
| internal/naming/naming.go | Added AppArmor profile naming function |
| internal/values/slurm_worker.go | Added AppArmor profile field to SlurmWorker and propagated value |
| internal/values/slurm_login.go | Added AppArmor profile field to SlurmLogin and propagated value |
| internal/values/slurm_cluster.go | Integrated AppArmor profile flag into cluster value building |
| internal/values/slurm_worker_test.go | Updated test to pass new AppArmor parameter |
| internal/render/worker/statefulset.go | Refactored annotations to conditionally use AppArmor profile |
| internal/render/login/statefulset.go | Refactored annotations to conditionally use AppArmor profile |
| internal/controller/reconciler/apparmorprofile.go | Added AppArmor profile reconciler implementation |
| internal/controller/reconciler/reconciler.go | Registered AppArmor profile type in reconciler |
| internal/controller/reconciler/k8s_statefulset.go | Added template annotations/labels to statefulset patching |
| internal/controller/clustercontroller/reconcile.go | Added AppArmorProfile reconciler field and initialization |
| internal/controller/clustercontroller/common.go | Integrated AppArmor profile reconciliation step |
| internal/check/installed_crd.go | Added AppArmor CRD installation check |
| cmd/main.go | Registered AppArmor scheme when CRD is installed |
| config/rbac/role.yaml | Added RBAC permissions for AppArmor profiles |
| helm/soperator/templates/manager-rbac.yaml | Added RBAC permissions for AppArmor profiles |
| helm/soperator/values.yaml | Added isApparmorCrdInstalled environment variable |
| helm/soperator/templates/deployment.yaml | Added AppArmor CRD environment variable to deployment |
| config/manager/manager.yaml | Added IS_APPARMOR_CRD_INSTALLED environment variable |
| go.mod | Added security-profiles-operator dependency |
| go.sum | Updated checksums for new dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
|
|
||
| func RenderAppArmorProfile(cluster *values.SlurmCluster) *apparmor.AppArmorProfile { | ||
| if !cluster.NodeLogin.UseDefaultAppArmorProfile || !cluster.NodeWorker.UseDefaultAppArmorProfile { |
There was a problem hiding this comment.
Logic error: The condition uses OR (||) which means the AppArmor profile will NOT be created if EITHER NodeLogin OR NodeWorker has UseDefaultAppArmorProfile set to false. This should use AND (&&) so that the profile is only skipped when BOTH are false. With the current logic, if only one node type wants to use the default profile, it won't be created.
| if !cluster.NodeLogin.UseDefaultAppArmorProfile || !cluster.NodeWorker.UseDefaultAppArmorProfile { | |
| if !cluster.NodeLogin.UseDefaultAppArmorProfile && !cluster.NodeWorker.UseDefaultAppArmorProfile { |
| stepLogger.Info("AppArmor CRD is not installed, skipping AppArmor profile reconciliation") | ||
| return nil | ||
| } | ||
| if !clusterValues.NodeLogin.UseDefaultAppArmorProfile || !clusterValues.NodeWorker.UseDefaultAppArmorProfile { |
There was a problem hiding this comment.
Logic error: Same as in RenderAppArmorProfile, this condition uses OR (||) when it should use AND (&&). The profile reconciliation should be skipped only when BOTH NodeLogin and NodeWorker have UseDefaultAppArmorProfile set to false, not when either one is false.
| if !clusterValues.NodeLogin.UseDefaultAppArmorProfile || !clusterValues.NodeWorker.UseDefaultAppArmorProfile { | |
| if !clusterValues.NodeLogin.UseDefaultAppArmorProfile && !clusterValues.NodeWorker.UseDefaultAppArmorProfile { |
| Error(err, "Failed to reconcile AppArmorProfile ") | ||
| return errors.Wrap(err, "reconciling AppArmorProfile ") |
There was a problem hiding this comment.
Trailing space in error messages. Both "Failed to reconcile AppArmorProfile " and "reconciling AppArmorProfile " have trailing spaces that should be removed for consistency.
| Error(err, "Failed to reconcile AppArmorProfile ") | |
| return errors.Wrap(err, "reconciling AppArmorProfile ") | |
| Error(err, "Failed to reconcile AppArmorProfile") | |
| return errors.Wrap(err, "reconciling AppArmorProfile") |
| /** lrixw, | ||
|
|
||
|
|
||
| # remove [^m], when bump slurm 24.05.5 or higher |
There was a problem hiding this comment.
The comment says "remove [^m], when bump slurm 24.05.5 or higher" but should probably say "when bumping to slurm 24.05.5 or higher" for better grammar. Additionally, consider "bump to" instead of "bump" for clarity.
| # remove [^m], when bump slurm 24.05.5 or higher | |
| # remove [^m] when bumping to slurm 24.05.5 or higher |
| deps ...metav1.Object, | ||
| ) error { | ||
| if desired == nil { | ||
| return errors.New("AppArmorProfile is not needed") |
There was a problem hiding this comment.
The error message "AppArmorProfile is not needed" is misleading for the error case. If desired is nil, it's not necessarily an error condition - the calling code checks if AppArmor is needed and may intentionally pass nil. This check should either be removed (let the calling code handle nil), or the error message should be clearer, such as "desired AppArmorProfile cannot be nil".
| return errors.New("AppArmorProfile is not needed") | |
| return errors.New("desired AppArmorProfile cannot be nil") |
| clusterName: "slurm1" | ||
| # Additional annotations for the cluster | ||
| annotations: {} | ||
| # Add appArmor profile to the cluster |
There was a problem hiding this comment.
Inconsistent capitalization: "appArmor" should be "AppArmor" to match the standard capitalization used throughout the codebase and in the API field name.
| # Add appArmor profile to the cluster | |
| # Add AppArmor profile to the cluster |
| func renderAnnotations(worker *values.SlurmWorker, clusterName, namespace string) map[string]string { | ||
| mungeAppArmorProfile := worker.ContainerMunge.AppArmorProfile | ||
| workerAppArmorProfile := worker.ContainerSlurmd.AppArmorProfile | ||
|
|
||
| if worker.UseDefaultAppArmorProfile { | ||
| workerAppArmorProfile = fmt.Sprintf("%s/%s", "localhost", naming.BuildAppArmorProfileName(clusterName, namespace)) | ||
| mungeAppArmorProfile = fmt.Sprintf("%s/%s", "localhost", naming.BuildAppArmorProfileName(clusterName, namespace)) | ||
| } | ||
|
|
||
| annotations := map[string]string{ | ||
| fmt.Sprintf( | ||
| "%s/%s", consts.AnnotationApparmorKey, consts.ContainerNameSlurmd, | ||
| ): workerAppArmorProfile, | ||
| fmt.Sprintf( | ||
| "%s/%s", consts.AnnotationApparmorKey, consts.ContainerNameMunge, | ||
| ): mungeAppArmorProfile, | ||
| consts.DefaultContainerAnnotationName: consts.ContainerNameSlurmd, | ||
| } | ||
|
|
||
| return annotations | ||
| } |
There was a problem hiding this comment.
[nitpick] Code duplication: The renderAnnotations function in worker/statefulset.go and login/statefulset.go are nearly identical. Consider extracting this logic into a shared function in the common package to reduce duplication and improve maintainability.
| deny /usr/lib/x86_64-linux-gnu/nvidia/wine/nvngx.dll w, | ||
| deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia* w, | ||
| deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so w, | ||
| deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia |
There was a problem hiding this comment.
Missing w, at the end of this deny rule. This line should end with w, to be consistent with the other deny rules and to ensure the AppArmor policy is syntactically correct.
| deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia | |
| deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia w, |
Partially addresses issue but not completely. The mandatory linking to
libnvidia-ml.so.1was fixed in patch 22.05.5. We are currently using version 22.05.2.