Power management server #29

Open
jaywonchung opened this issue Oct 21, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jaywonchung
Member

jaywonchung commented Oct 21, 2023

NVML requires the Linux SYS_ADMIN capability for applications to set the GPU's power limit or frequency. In production environments, you can't just give your application containers SYS_ADMIN because it allows far too much (man page). These excess permissions can be exploited easily, for instance, if some open source dependency suffers a supply chain attack.

To reduce the attack surface, it makes sense to have one power management server per node that exposes a handful of endpoints allowing applications (without SYS_ADMIN) to set the GPU's power limit or frequency. This is basically IPC and should have very low latency. NVML function calls inside an application typically take 10-20 ms, so the round trip including IPC time should not be much higher.
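
To get a rough feel for the IPC overhead relative to the 10-20 ms figure above, here is a small sketch that times round trips over a local Unix socket pair. It assumes the IPC mechanism is a Unix domain socket, which this issue does not specify, and `measure_round_trip` is a made-up helper name. It measures only the socket hop, not any NVML call.

```python
import socket
import statistics
import threading
import time


def measure_round_trip(num_requests: int = 200) -> float:
    """Return the median round-trip time (seconds) over a Unix socket pair."""
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    def echo() -> None:
        # Stand-in for the server: send back whatever arrives until EOF.
        with server:
            while True:
                data = server.recv(64)
                if not data:
                    break
                server.sendall(data)

    thread = threading.Thread(target=echo, daemon=True)
    thread.start()

    samples = []
    with client:
        for _ in range(num_requests):
            start = time.perf_counter()
            client.sendall(b"ping")
            client.recv(64)
            samples.append(time.perf_counter() - start)
    # Closing the client makes the echo loop see EOF and exit.
    thread.join()
    return statistics.median(samples)
```

Calling `measure_round_trip()` on the target node gives a number to compare against the NVML call time; the result depends on the machine, so none is quoted here.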

Before doing this, we should first abstract away the GPU (#23) so that a different GPU backend is used depending on whether the user is going through the power management server or setting the power knobs directly.
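
A rough sketch of what such an abstraction could look like. Every name here (`GPUBackend`, `DirectBackend`, `ServerBackend`, the request dictionary layout) is hypothetical and not taken from Zeus or from #23. The privileged driver call and the IPC transport are injected as plain callables so the sketch stays independent of NVML.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class GPUBackend(ABC):
    """Hypothetical interface that application code would program against."""

    @abstractmethod
    def set_power_limit(self, gpu_index: int, milliwatts: int) -> None: ...

    @abstractmethod
    def set_frequency(self, gpu_index: int, mhz: int) -> None: ...


class DirectBackend(GPUBackend):
    """Sets the knobs in-process; the process itself needs SYS_ADMIN."""

    def __init__(self, driver_call: Callable[[str, int, int], None]) -> None:
        # driver_call is an assumed wrapper around the real NVML functions.
        self._driver_call = driver_call

    def set_power_limit(self, gpu_index: int, milliwatts: int) -> None:
        self._driver_call("set_power_limit", gpu_index, milliwatts)

    def set_frequency(self, gpu_index: int, mhz: int) -> None:
        self._driver_call("set_frequency", gpu_index, mhz)


class ServerBackend(GPUBackend):
    """Forwards each request to the per-node power management server."""

    def __init__(self, send: Callable[[Dict], Dict]) -> None:
        # send is an assumed IPC function: request dict in, response dict out.
        self._send = send

    def _call(self, command: str, gpu_index: int, value: int) -> None:
        response = self._send(
            {"command": command, "gpu_index": gpu_index, "value": value}
        )
        if not response.get("ok"):
            raise RuntimeError(response.get("error", "request failed"))

    def set_power_limit(self, gpu_index: int, milliwatts: int) -> None:
        self._call("set_power_limit", gpu_index, milliwatts)

    def set_frequency(self, gpu_index: int, mhz: int) -> None:
        self._call("set_frequency", gpu_index, mhz)
```

With this shape, the rest of the code only sees `GPUBackend`, and choosing between the two paths becomes a single decision at startup.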

@jaywonchung jaywonchung added enhancement New feature or request good first issue Good for newcomers labels Oct 21, 2023
@saketjajoo

There are two potential approaches to address this issue, although additional options may also exist:

  • Making a change at the NVML library’s side to reduce the Linux privileges, or
  • Using something like a service account (that has the SYS_ADMIN privilege) that is attached to the process while it executes.

The rationale behind the pull request stems from the concern that the current excessive permissions could be easily exploited, particularly during a supply chain attack on an open-source dependency. If nvidia-ml-py (a PyPI library used to gather energy and power consumption in the Zeus project) is compromised via a supply chain attack, it could possibly lead to a security issue.

Currently, the documentation says that the nvmlDeviceSetPowerManagementLimit() API requires root/admin access. So, I don’t think much can be done for now about changing or using different permissions. However, if this changes in the future, it could solve the problem.

@jaywonchung
Member Author

There are two potential approaches to address this issue, although additional options may also exist:

  • Making a change at the NVML library’s side to reduce the Linux privileges, or

NVML is closed source, so this is not a viable option. Also, SYS_ADMIN is required because changing hardware management knobs is indeed something only a user or process with the system admin role should be allowed to do. So I don't expect NVML to lift this constraint any time soon.

  • Using something like a service account (that has the SYS_ADMIN privilege) that is attached to the process while it executes.

Could you elaborate a bit more? As far as I know, service accounts are typically used in cloud environments to allow access to certain privileged API endpoints. Are there service account implementations in general Linux kernels that grant Linux security capabilities to processes? We don't want to tie anything to the cloud.

The rationale behind the pull request stems from the concern that the current excessive permissions could be easily exploited, particularly during a supply chain attack on an open-source dependency. If nvidia-ml-py (a PyPI library used to gather energy and power consumption in the Zeus project) is compromised via a supply chain attack, it could possibly lead to a security issue.

Currently, the documentation says that the nvmlDeviceSetPowerManagementLimit() API requires root/admin access. So, I don’t think much can be done for now about changing or using different permissions. However, if this changes in the future, it could solve the problem.

That's why we need a separate server process on the node that has SYS_ADMIN and exposes APIs like set_power_limit and set_frequency. Applications without SYS_ADMIN will ask the power management server over IPC to set the GPU's power limit or SM frequency on their behalf.
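
A minimal sketch of that request path, assuming a Unix domain socket and newline-delimited JSON (neither is decided in this thread). Only the names set_power_limit and set_frequency come from the comment above; `apply_setting` is a hypothetical stand-in for the NVML calls the privileged server would make, and the message fields are made up for illustration.

```python
import json
import os
import socket
import socketserver
import tempfile
import threading

ALLOWED_COMMANDS = {"set_power_limit", "set_frequency"}


def apply_setting(command: str, gpu_index: int, value: int) -> None:
    # Hypothetical placeholder: a real server would call NVML here while
    # running with SYS_ADMIN. This sketch only logs the request.
    print(f"[server] {command}(gpu={gpu_index}, value={value})")


class PowerRequestHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        line = self.rfile.readline()
        try:
            request = json.loads(line)
            command = request["command"]
            if command not in ALLOWED_COMMANDS:
                raise ValueError(f"unknown command: {command}")
            apply_setting(command, int(request["gpu_index"]), int(request["value"]))
            response = {"ok": True}
        except (ValueError, KeyError, TypeError) as exc:
            # Reject malformed or disallowed requests instead of crashing.
            response = {"ok": False, "error": str(exc)}
        self.wfile.write((json.dumps(response) + "\n").encode())


def start_server(socket_path: str) -> socketserver.UnixStreamServer:
    server = socketserver.UnixStreamServer(socket_path, PowerRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_request(socket_path: str, command: str, gpu_index: int, value: int) -> dict:
    """Client side: runs in the application container without SYS_ADMIN."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        payload = {"command": command, "gpu_index": gpu_index, "value": value}
        sock.sendall((json.dumps(payload) + "\n").encode())
        with sock.makefile("rb") as reader:
            return json.loads(reader.readline())


def demo() -> list:
    socket_path = os.path.join(tempfile.mkdtemp(), "power.sock")
    server = start_server(socket_path)
    try:
        return [
            send_request(socket_path, "set_power_limit", 0, 200_000),
            send_request(socket_path, "reboot", 0, 0),  # not in the allow-list
        ]
    finally:
        server.shutdown()
        server.server_close()
```

Keeping the allow-list to these two commands is the point of the design: the application can change power knobs but gains nothing else that SYS_ADMIN would otherwise permit.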

@saketjajoo

saketjajoo commented Nov 25, 2023

That's why we need a separate server process on the node that has SYS_ADMIN and exposes APIs like set_power_limit and set_frequency.

However, I believe that, in theory, anyone could call this process to update the power limit and frequency.

Are there service account implementations in general Linux kernels that grant Linux security capabilities to processes? We don't want to tie anything to the cloud.

Yes, Linux has the concept of system accounts as well. The command useradd --system ... creates a system account that can have custom privileges attached to it, and this account can be used to set the power limit and frequency. Perhaps the new system account could be added to a newly created group that also contains the user running the process. This way, unauthorized users would not be able to use the system account to call the APIs.
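
One way the group idea could be applied to the server itself, assuming it listens on a Unix domain socket (an assumption, since the thread does not fix the IPC mechanism): on Linux, connecting to a pathname socket requires write permission on the socket file, so restricting the file to its owner and one group limits who can send requests. The helper names and the group are hypothetical.

```python
import os
import socket
import stat
from typing import Optional


def bind_group_only_socket(path: str, gid: Optional[int] = None) -> socket.socket:
    """Bind a Unix socket that only its owner and one group can connect to."""
    if os.path.exists(path):
        os.unlink(path)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Mask out all "other" bits while binding so the socket file is never
    # world-accessible, even briefly. Note that umask is process-wide.
    old_umask = os.umask(0o117)
    try:
        sock.bind(path)
    finally:
        os.umask(old_umask)
    os.chmod(path, 0o660)  # read/write for owner and group, nothing for others
    if gid is not None:
        # Hand the file to the (hypothetical) group of authorized users.
        os.chown(path, -1, gid)
    sock.listen()
    return sock


def socket_mode(path: str) -> int:
    """Return the permission bits of the socket file, e.g. 0o660."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Users outside the group would then get a permission error on connect, which addresses the concern that anyone on the node could call the server.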
