Power management server #29
Comments
There are two potential approaches to address this issue, although additional options may also exist:
The rationale behind the pull request stems from the concern that the current excessive permissions could be easily exploited, particularly during a supply chain attack on an open-source dependency. If nvidia-ml-py (a PyPI library used to gather energy and power consumption in the Zeus project) is compromised via a supply chain attack, it could lead to a security issue. Currently, the documentation says that the nvmlDeviceSetPowerManagementLimit() API requires root/admin access, so I don't think much can be done for now about changing or using different permissions. However, if this changes in the future, it could solve the problem.
NVML is closed source, so this is not a viable option. Also,
Could you elaborate a bit more? According to my limited knowledge, service accounts are typically used in cloud environments to allow access to certain privileged API endpoints. Are there service account implementations in general Linux kernels that grant Linux security capabilities to processes? We don't want to tie anything to the cloud.
That's why we need a separate server process on the node that has |
However, I believe, theoretically, anyone could call this process to update the power limit and frequency.
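One way to narrow down who can call such a process, sketched here under assumptions: if the server listens on a Unix domain socket, Linux lets it ask the kernel for the caller's credentials via SO_PEERCRED and reject anyone not on an allowlist. The helper names `peer_credentials` and `is_authorized` are hypothetical, not part of Zeus or NVML, and the check is Linux-specific.

```python
import socket
import struct
from typing import Set, Tuple

# Layout of Linux's `struct ucred`: pid_t (int), uid_t and gid_t (unsigned int).
_UCRED_FORMAT = "iII"


def peer_credentials(conn: socket.socket) -> Tuple[int, int, int]:
    """Return (pid, uid, gid) of the process on the other end of a Unix socket."""
    raw = conn.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(_UCRED_FORMAT)
    )
    return struct.unpack(_UCRED_FORMAT, raw)


def is_authorized(conn: socket.socket, allowed_uids: Set[int]) -> bool:
    """Hypothetical policy: only callers whose UID is allowlisted may proceed."""
    _pid, uid, _gid = peer_credentials(conn)
    return uid in allowed_uids
```

The server would call `is_authorized(conn, allowed_uids)` right after `accept()` and close the connection when it returns `False`. Restricting the mode and group of the socket file itself is a complementary option, since connecting to a Unix socket requires write permission on its path.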
Yes, there is the concept of system accounts in Linux as well. The command |
NVML requires the Linux SYS_ADMIN capability for applications to set the GPU's power limit or frequency. In production environments, you can't just give your application containers SYS_ADMIN because it allows way too many things (man page). These excess permissions can be exploited easily, for instance, if some open source dependency experiences a supply chain attack.
To reduce the attack surface, it makes sense to have a power management server per node that exposes a handful of endpoints allowing applications (without SYS_ADMIN) to set the GPU's power limit or frequency. This is basically IPC and should have very low latency. NVML function calls inside an application typically take 10-20 ms, so the round trip including IPC time should not be much higher.
Before doing this, we should first abstract away the GPU (#23) so that the GPU backend can differ depending on whether the user is using the power management server or setting the power knobs directly.
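To make the proposal concrete, here is a minimal sketch of how the two backends and the per-node server could fit together. Everything in it is an assumption for illustration: the names `PowerBackend`, `NVMLBackend`, `RemoteBackend`, `make_listener`, `handle_request`, and `serve`, the one-line JSON protocol, and the use of a Unix domain socket are all hypothetical and not taken from Zeus. The only real library calls are `nvmlInit`, `nvmlDeviceGetHandleByIndex`, and `nvmlDeviceSetPowerManagementLimit` from nvidia-ml-py, where the limit is given in milliwatts.

```python
import json
import socket
from typing import Optional, Protocol


class PowerBackend(Protocol):
    """Hypothetical interface: anything that can set a GPU power limit."""

    def set_power_limit(self, gpu_index: int, limit_mw: int) -> None: ...


class NVMLBackend:
    """Direct backend: calls NVML in-process, so this process needs SYS_ADMIN."""

    def __init__(self) -> None:
        import pynvml  # shipped by nvidia-ml-py; imported lazily

        self._nvml = pynvml
        pynvml.nvmlInit()

    def set_power_limit(self, gpu_index: int, limit_mw: int) -> None:
        handle = self._nvml.nvmlDeviceGetHandleByIndex(gpu_index)
        self._nvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)


class RemoteBackend:
    """Unprivileged backend: forwards the request to the per-node server."""

    def __init__(self, socket_path: str) -> None:
        self._socket_path = socket_path

    def set_power_limit(self, gpu_index: int, limit_mw: int) -> None:
        request = {"op": "set_power_limit", "gpu": gpu_index, "limit_mw": limit_mw}
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(self._socket_path)
            sock.sendall(json.dumps(request).encode() + b"\n")
            with sock.makefile("r") as reader:
                reply = json.loads(reader.readline())
        if not reply.get("ok"):
            raise RuntimeError(reply.get("error", "unknown error"))


def handle_request(raw: str, backend: PowerBackend) -> dict:
    """Validate one JSON request line and apply it via the privileged backend."""
    try:
        req = json.loads(raw)
        if req.get("op") != "set_power_limit":
            return {"ok": False, "error": "unsupported op"}
        gpu, limit_mw = req["gpu"], req["limit_mw"]
        if not (isinstance(gpu, int) and isinstance(limit_mw, int) and limit_mw > 0):
            return {"ok": False, "error": "bad arguments"}
        backend.set_power_limit(gpu, limit_mw)
        return {"ok": True}
    except Exception as exc:  # malformed input or a backend failure
        return {"ok": False, "error": str(exc)}


def make_listener(socket_path: str) -> socket.socket:
    """Create a listening Unix domain socket at the given path."""
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(socket_path)
    listener.listen()
    return listener


def serve(
    listener: socket.socket,
    backend: PowerBackend,
    max_requests: Optional[int] = None,
) -> None:
    """Handle one request per connection; stop after max_requests if given."""
    handled = 0
    while max_requests is None or handled < max_requests:
        conn, _ = listener.accept()
        with conn:
            with conn.makefile("r") as reader:
                line = reader.readline()
            reply = handle_request(line, backend)
            conn.sendall(json.dumps(reply).encode() + b"\n")
        handled += 1
```

In this sketch the server would run as `serve(make_listener(path), NVMLBackend())` in the one process that holds SYS_ADMIN, while applications construct `RemoteBackend(path)` and call the same `set_power_limit` method. Keeping `handle_request` separate from the socket loop means the validation logic can be exercised with a fake backend on a machine without a GPU.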