Skip to content

MSP-3392: add supporting apparmor to soperator#288

Merged
Uburro merged 8 commits intodevfrom
MSP-3392
Jan 7, 2025
Merged

MSP-3392: add supporting apparmor to soperator#288
Uburro merged 8 commits intodevfrom
MSP-3392

Conversation

@Uburro
Copy link
Collaborator

@Uburro Uburro commented Dec 30, 2024

Partially addresses issue but not completely. The mandatory linking to libnvidia-ml.so.1 was fixed in patch 22.05.5. We are currently using version 22.05.2.

asteny
asteny previously approved these changes Dec 31, 2024
asteny
asteny previously approved these changes Dec 31, 2024
@Uburro Uburro merged commit 74841e3 into dev Jan 7, 2025
@Uburro Uburro deleted the MSP-3392 branch January 7, 2025 14:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds AppArmor profile support to the soperator for Slurm worker and login nodes, addressing issue #184. It introduces a new UseDefaultAppArmorProfile field (defaulting to true) that generates and applies a default AppArmor profile to restrict write access to NVIDIA libraries and binaries. The PR notes that the mandatory linking to libnvidia-ml.so.1 was fixed in Slurm patch 24.05.5, but the operator currently uses version 22.05.2.

Key changes:

  • Added UseDefaultAppArmorProfile boolean field to the SlurmCluster CRD with a default value of true
  • Implemented AppArmor profile generation with deny rules for NVIDIA library/binary write access
  • Integrated AppArmor profile reconciliation into the cluster controller workflow

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
api/v1/slurmcluster_types.go Added UseDefaultAppArmorProfile field to SlurmClusterSpec
config/crd/bases/slurm.nebius.ai_slurmclusters.yaml Updated CRD with new AppArmor profile field definition
helm/soperator-crds/templates/slurmcluster-crd.yaml Synchronized CRD template with base CRD changes
helm/soperator/crds/slurmcluster-crd.yaml Synchronized CRD with base CRD changes
helm/slurm-cluster/values.yaml Added useDefaultAppArmorProfile: true configuration option
helm/slurm-cluster/templates/slurm-cluster-cr.yaml Wired AppArmor profile setting to cluster CR
internal/render/common/apparmorprofile.go Implemented AppArmor profile rendering and policy generation
internal/naming/naming.go Added AppArmor profile naming function
internal/values/slurm_worker.go Added AppArmor profile field to SlurmWorker and propagated value
internal/values/slurm_login.go Added AppArmor profile field to SlurmLogin and propagated value
internal/values/slurm_cluster.go Integrated AppArmor profile flag into cluster value building
internal/values/slurm_worker_test.go Updated test to pass new AppArmor parameter
internal/render/worker/statefulset.go Refactored annotations to conditionally use AppArmor profile
internal/render/login/statefulset.go Refactored annotations to conditionally use AppArmor profile
internal/controller/reconciler/apparmorprofile.go Added AppArmor profile reconciler implementation
internal/controller/reconciler/reconciler.go Registered AppArmor profile type in reconciler
internal/controller/reconciler/k8s_statefulset.go Added template annotations/labels to statefulset patching
internal/controller/clustercontroller/reconcile.go Added AppArmorProfile reconciler field and initialization
internal/controller/clustercontroller/common.go Integrated AppArmor profile reconciliation step
internal/check/installed_crd.go Added AppArmor CRD installation check
cmd/main.go Registered AppArmor scheme when CRD is installed
config/rbac/role.yaml Added RBAC permissions for AppArmor profiles
helm/soperator/templates/manager-rbac.yaml Added RBAC permissions for AppArmor profiles
helm/soperator/values.yaml Added isApparmorCrdInstalled environment variable
helm/soperator/templates/deployment.yaml Added AppArmor CRD environment variable to deployment
config/manager/manager.yaml Added IS_APPARMOR_CRD_INSTALLED environment variable
go.mod Added security-profiles-operator dependency
go.sum Updated checksums for new dependencies

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

)

func RenderAppArmorProfile(cluster *values.SlurmCluster) *apparmor.AppArmorProfile {
if !cluster.NodeLogin.UseDefaultAppArmorProfile || !cluster.NodeWorker.UseDefaultAppArmorProfile {
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic error: The condition uses OR (||) which means the AppArmor profile will NOT be created if EITHER NodeLogin OR NodeWorker has UseDefaultAppArmorProfile set to false. This should use AND (&&) so that the profile is only skipped when BOTH are false. With the current logic, if only one node type wants to use the default profile, it won't be created.

Suggested change
if !cluster.NodeLogin.UseDefaultAppArmorProfile || !cluster.NodeWorker.UseDefaultAppArmorProfile {
if !cluster.NodeLogin.UseDefaultAppArmorProfile && !cluster.NodeWorker.UseDefaultAppArmorProfile {

Copilot uses AI. Check for mistakes.
stepLogger.Info("AppArmor CRD is not installed, skipping AppArmor profile reconciliation")
return nil
}
if !clusterValues.NodeLogin.UseDefaultAppArmorProfile || !clusterValues.NodeWorker.UseDefaultAppArmorProfile {
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic error: Same as in RenderAppArmorProfile, this condition uses OR (||) when it should use AND (&&). The profile reconciliation should be skipped only when BOTH NodeLogin and NodeWorker have UseDefaultAppArmorProfile set to false, not when either one is false.

Suggested change
if !clusterValues.NodeLogin.UseDefaultAppArmorProfile || !clusterValues.NodeWorker.UseDefaultAppArmorProfile {
if !clusterValues.NodeLogin.UseDefaultAppArmorProfile && !clusterValues.NodeWorker.UseDefaultAppArmorProfile {

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +45
Error(err, "Failed to reconcile AppArmorProfile ")
return errors.Wrap(err, "reconciling AppArmorProfile ")
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space in error messages. Both "Failed to reconcile AppArmorProfile " and "reconciling AppArmorProfile " have trailing spaces that should be removed for consistency.

Suggested change
Error(err, "Failed to reconcile AppArmorProfile ")
return errors.Wrap(err, "reconciling AppArmorProfile ")
Error(err, "Failed to reconcile AppArmorProfile")
return errors.Wrap(err, "reconciling AppArmorProfile")

Copilot uses AI. Check for mistakes.
/** lrixw,


# remove [^m], when bump slurm 24.05.5 or higher
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "remove [^m], when bump slurm 24.05.5 or higher" but should probably say "when bumping to slurm 24.05.5 or higher" for better grammar. Additionally, consider "bump to" instead of "bump" for clarity.

Suggested change
# remove [^m], when bump slurm 24.05.5 or higher
# remove [^m] when bumping to slurm 24.05.5 or higher

Copilot uses AI. Check for mistakes.
deps ...metav1.Object,
) error {
if desired == nil {
return errors.New("AppArmorProfile is not needed")
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message "AppArmorProfile is not needed" is misleading for the error case. If desired is nil, it's not necessarily an error condition - the calling code checks if AppArmor is needed and may intentionally pass nil. This check should either be removed (let the calling code handle nil), or the error message should be clearer, such as "desired AppArmorProfile cannot be nil".

Suggested change
return errors.New("AppArmorProfile is not needed")
return errors.New("desired AppArmorProfile cannot be nil")

Copilot uses AI. Check for mistakes.
clusterName: "slurm1"
# Additional annotations for the cluster
annotations: {}
# Add appArmor profile to the cluster
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent capitalization: "appArmor" should be "AppArmor" to match the standard capitalization used throughout the codebase and in the API field name.

Suggested change
# Add appArmor profile to the cluster
# Add AppArmor profile to the cluster

Copilot uses AI. Check for mistakes.
Comment on lines +116 to +136
func renderAnnotations(worker *values.SlurmWorker, clusterName, namespace string) map[string]string {
mungeAppArmorProfile := worker.ContainerMunge.AppArmorProfile
workerAppArmorProfile := worker.ContainerSlurmd.AppArmorProfile

if worker.UseDefaultAppArmorProfile {
workerAppArmorProfile = fmt.Sprintf("%s/%s", "localhost", naming.BuildAppArmorProfileName(clusterName, namespace))
mungeAppArmorProfile = fmt.Sprintf("%s/%s", "localhost", naming.BuildAppArmorProfileName(clusterName, namespace))
}

annotations := map[string]string{
fmt.Sprintf(
"%s/%s", consts.AnnotationApparmorKey, consts.ContainerNameSlurmd,
): workerAppArmorProfile,
fmt.Sprintf(
"%s/%s", consts.AnnotationApparmorKey, consts.ContainerNameMunge,
): mungeAppArmorProfile,
consts.DefaultContainerAnnotationName: consts.ContainerNameSlurmd,
}

return annotations
}
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Code duplication: The renderAnnotations function in worker/statefulset.go and login/statefulset.go are nearly identical. Consider extracting this logic into a shared function in the common package to reduce duplication and improve maintainability.

Copilot uses AI. Check for mistakes.
deny /usr/lib/x86_64-linux-gnu/nvidia/wine/nvngx.dll w,
deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia* w,
deny /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so w,
deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing w, at the end of this deny rule. This line should end with w, to be consistent with the other deny rules and to ensure the AppArmor policy is syntactically correct.

Suggested change
deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia
deny /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia w,

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants