
Measure GPU consumption #24

Open
bpetit opened this issue Dec 4, 2020 · 21 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), Hacktoberfest (Issues ready to welcome contributors from Hacktoberfest!), help wanted (Extra attention is needed)

@bpetit
Contributor

bpetit commented Dec 4, 2020

Problem

Some power-hungry use cases rely on GPUs. It would be great to offer a way to measure their consumption from the infrastructure point of view.

Solution

We could take inspiration from codecarbon, which uses pynvml.

Alternatives

Any other existing library would be worth a look.

Additional context

The idea is to make it easier to collect those metrics from the infrastructure, and thus to feed metrics pipelines that make it easier to expose this impact to cloud providers' machine learning clients.
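For scaphandre, which is written in Rust, the nvml-wrapper crate would be the natural counterpart of pynvml. As a rough illustration only (assuming a recent nvml-wrapper release where the entry point is Nvml::init and power_usage() returns milliwatts), reading the power draw of the first GPU might look like this:

// Illustrative sketch, not scaphandre code: query the first NVIDIA GPU via NVML.
// Assumes a recent nvml-wrapper release (entry point `Nvml::init`); the value
// returned by `power_usage()` is in milliwatts.
use nvml_wrapper::Nvml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;               // loads libnvidia-ml and initialises NVML
    let device = nvml.device_by_index(0)?;  // first GPU on the host
    let power_mw = device.power_usage()?;   // instantaneous draw, in milliwatts
    println!("{}: {:.1} W", device.name()?, f64::from(power_mw) / 1000.0);
    Ok(())
}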

@bpetit added the enhancement label Dec 4, 2020
@bpetit added the help wanted label Dec 17, 2020
@bpetit added this to Triage in General Jan 12, 2021
@uggla
Collaborator

uggla commented May 5, 2021

Hello, I did a couple of investigations on this topic.
There is a wrapper of the NVML library written in Rust here: https://crates.io/crates/nvml-wrapper, so getting info from an NVIDIA board does not look too complicated.
I have extracted and updated the provided example to report the power usage: https://github.com/uggla/nvml-basic
Unfortunately, I get the following output:

 uggla   main  ~  workspace  rust  nvml-basic  cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`


Your NVIDIA GeForce GTX 1050 is currently sitting at 40 °C with a graphics clock of 139 MHz and a memory
clock of 405 MHz.
Memory usage is 4.92 MB out of an available 2.1 GB.
Right now the device is connected via a PCIe gen 1 x16 interface;
the max your hardware supports is PCIe gen 3 x16.
Power consumption is Not supported.

This device is not on a multi-GPU board.

System CUDA version: 11.3

So I managed to get data from my 1050 board, but the power usage is not supported. :(
I have read that this can be a limitation of the driver, but I suspect it is rather a limitation of my hardware. It would be great if someone could run this short code example on a different GPU before going ahead with the scaphandre implementation.
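For reference, the "Power consumption is Not supported." line above presumably corresponds to NVML returning a "not supported" error for this board. A minimal sketch of degrading gracefully in that case (assuming a recent nvml-wrapper release that exposes NvmlError::NotSupported) could look like this:

// Sketch only: report the power draw when available, and fall back to a
// "not supported" message instead of failing when the sensor is missing.
use nvml_wrapper::{error::NvmlError, Nvml};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;
    let device = nvml.device_by_index(0)?;
    match device.power_usage() {
        Ok(mw) => println!("Power consumption is {:.1} W", f64::from(mw) / 1000.0),
        Err(NvmlError::NotSupported) => println!("Power consumption is Not supported."),
        Err(e) => return Err(e.into()),
    }
    Ok(())
}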

@demeringo
Contributor

Hi, this is neat!

Your feedback triggered my curiosity, so I tested nvml-wrapper on an AWS EC2 instance with an NVIDIA GPU.

Disclaimer: my knowledge of GPUs and their drivers is absolutely zero, so if anything below does not make sense, please tell me ;-)

EC2 instance

  • g3.4xlarge
  • eu-west-1
  • all default settings
  • using an AWS-provided AMI that comes with the NVIDIA Tesla driver preinstalled: amzn2-ami-graphics-hvm-2.0.20210427.0-x86_64-gp2-e6724620-3ffb-4cc9-9690-c310d8e794ef

First attempt: libnvidia-ml.so not found

It did not work out of the box (it complained about a missing libnvidia-ml.so).

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`
Error: LibloadingError(DlOpen { desc: "libnvidia-ml.so: cannot open shared object file: No such file or directory" })

Second attempt: create a symlink to the lib

I did a couple of things to make it work:

  • created the LD_LIBRARY_PATH env variable (export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64"), but it did not work either
  • created a symlink ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so

Relaunched, and we have a measurement:

[root@ip-172-31-3-186 nvml-basic]# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/basic`


Your Tesla M60 is currently sitting at 21 °C with a graphics clock of 405 MHz and a memory clock of 324 MHz. 
Memory usage is 0 B out of an available 7.99 GB. 
Right now the device is connected via a PCIe gen 1 x16 interface; the max your hardware supports is PCIe gen 3 x16. 
Power consumption is 14599.

This device is not on a multi-GPU board.

System CUDA version: 11.0

In retrospect, I am not sure whether setting LD_LIBRARY_PATH was of any use.
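One note on the number above: NVML reports power usage in milliwatts, so a raw reading of 14599 is about 14.6 W, which is consistent with the nvidia-smi readings below. A trivial sketch of the conversion:

// NVML's power readings are in milliwatts; 14599 mW is roughly 14.6 W.
fn milliwatts_to_watts(power_mw: u32) -> f64 {
    f64::from(power_mw) / 1000.0
}

fn main() {
    let raw_reading = 14599_u32; // the value printed by the nvml-basic example above
    println!("Power consumption is {:.2} W", milliwatts_to_watts(raw_reading));
}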

Using the nvidia-smi command

While trying to make this work I came across the nvidia-smi command (see https://serverfault.com/questions/395455/how-to-check-gpu-usages-on-aws-ec2-gpu-instance).

I tried running nvidia-smi -i 0 -l -q -d POWER, which returned results in the same range (around 14 watts idle).
I do not know how the calculation is done, but it displays a measurement summary every few seconds (I include three successive outputs below).

nvidia-smi -i 0 -l -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:27 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W


==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:32 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 15.08 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.52 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.08 W


==============NVSMI LOG==============

Timestamp                                 : Mon May 10 22:25:37 2021
Driver Version                            : 450.119.01
CUDA Version                              : 11.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.88 W
        Power Limit                       : 150.00 W
        Default Power Limit               : 150.00 W
        Enforced Power Limit              : 150.00 W
        Min Power Limit                   : 112.50 W
        Max Power Limit                   : 162.00 W
    Power Samples
        Duration                          : 40.22 sec
        Number of Samples                 : 119
        Max                               : 14.73 W
        Min                               : 13.39 W
        Avg                               : 14.07 W

I have no idea whether these results are meaningful for an idle machine, but I find it very exciting that we are able to probe something out of an AWS GPU instance ;-)

I think it would be interesting to redo the test with some kind of representative workload, and also to verify whether it works the same with other providers like Azure or GCP.

@uggla
Collaborator

uggla commented May 11, 2021

Hello @demeringo,

Thank you very much, this is really helpful. Sorry, I knew about the missing libnvidia-ml.so but forgot to mention it in the previous post.
14 W idle for such a card seems entirely plausible.

As I will not be able to fully test this on my laptop, I will mock the GPU results.
That said, would you be able to run a test with scaphandre as soon as I implement the GPU power reporting? That would be great.

I think it would be interesting to redo the test with some kind of representative workload, and also to verify whether it works the same with other providers like Azure or GCP.

Absolutely, but I don't think there is any reason it would differ between providers, as long as it is NVIDIA GPU hardware.
Another interesting test would be with multiple GPUs, to see how the library reacts in that case.
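A rough sketch of what probing several boards could look like with nvml-wrapper (assuming the crate's device_count and device_by_index calls, as in recent releases):

// Sketch only: enumerate every NVIDIA board visible to NVML and report the
// power draw per device, skipping boards without a readable power sensor.
use nvml_wrapper::Nvml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;
    for index in 0..nvml.device_count()? {
        let device = nvml.device_by_index(index)?;
        match device.power_usage() {
            Ok(mw) => println!("GPU {} ({}): {:.1} W", index, device.name()?, f64::from(mw) / 1000.0),
            Err(_) => println!("GPU {} ({}): power draw not available", index, device.name()?),
        }
    }
    Ok(())
}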

@demeringo
Contributor

Yes, this would be perfect. I can set up different public cloud servers for testing for a limited time... but I lack the Rust skills to do the integration... so if you could take it on, I would be more than happy to test a branch ;-)

@mindrunner

I am happy to test on a bare metal box with a 1050 Ti (if testing is feasible in production mode).

However, it seems that power draw might not be supported by some cards :(

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 10:24:09 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Power Readings
        Power Management                  : Supported
        Power Draw                        : N/A
        Power Limit                       : 75.00 W
        Default Power Limit               : 75.00 W
        Enforced Power Limit              : 75.00 W
        Min Power Limit                   : 52.50 W
        Max Power Limit                   : 75.00 W
    Power Samples
        Duration                          : 18446744073707.55 sec
        Number of Samples                 : 119
        Max                               : 35.50 W
        Min                               : 35.50 W
        Avg                               : 0.00 W

@uggla
Collaborator

uggla commented May 11, 2021

However, it seems that power draw might not be supported by some cards :(

@mindrunner, yes, we have the same issue: I have a 1050 (not Ti) in my laptop and it is not supported. That's why I asked people with different hardware to check.

on a bare metal box

Is your 1050 Ti an embedded chip in a laptop, soldered onto the motherboard, or a "real" card plugged into the PCI Express bus? I understand it is the latter, but just to be sure.

@uggla
Collaborator

uggla commented May 11, 2021

@demeringo ,

Yes, this would be perfect. I can set up different public cloud servers for testing for a limited time... but I lack the Rust skills to do the integration... so if you could take it on, I would be more than happy to test a branch ;-)

Super cool, I'll notify you as soon as I have something usable. I just need to find a bit of spare time to work on it...

@mindrunner

Yeah, laptop cards have different power management, partly because they usually run alongside an Intel integrated GPU, and so on...

The card in my laptop lets me read the power draw:
(01:00.0 VGA compatible controller: NVIDIA Corporation TU106GLM [Quadro RTX 3000 Mobile / Max-Q] (rev a1))

==============NVSMI LOG==============

Timestamp                                 : Tue May 11 18:44:53 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 12.45 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found

The card I was talking about in my previous post is a "normal" PCIe card:
(02:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1))
It is a passively cooled PALIT GeForce GTX 1050 Ti KalmX 4GB, just for reference.

@uggla
Collaborator

uggla commented May 11, 2021

@mindrunner, thank you. This is really helpful.
I think the 1050 is entry-level hardware, so maybe that's why there is no power sensor. Or maybe there is one that is disabled by the driver...

Yahoo, you have a laptop with a Quadro chip! It seems to be a high-end laptop; I did not know laptops could have that kind of chip.

@mindrunner

Yahoo, you have a laptop with a Quadro chip! It seems to be a high-end laptop; I did not know laptops could have that kind of chip.

I guess it is pretty high end. It is a Dell Precision 5750, the business brother of the 2020 XPS 17...

Anyway, searching the internet about this issue creates even more confusion. Some say it is a driver issue and that it is supposed to work with an older driver version; I am not sure about that. If I figure out more, I will get back in touch here. It would be nice to have GPU power included in my Grafana dashboard, but in my case it is really only eye candy, nothing urgent :D

@itwars

itwars commented May 21, 2021

Hi,
You could perhaps have a look at https://pypi.org/project/pyJoules/ ; maybe it is the same as pynvml?

@uggla
Collaborator

uggla commented May 21, 2021

Hello @itwars, thank you. In fact, all these solutions rely on the NVML library from NVIDIA, together with the appropriate driver and hardware.
The Rust NVML wrapper library (https://crates.io/crates/nvml-wrapper) is working very well, so scaphandre should soon be able to report NVIDIA GPU consumption. It might take a bit more time than expected, as @bpetit and @PierreRust are currently changing some internal things.

@itwars

itwars commented May 21, 2021

Excellent! I'm really excited about having GPU power monitoring for my GPU-powered AI lab. Any chance of having something similar for AMD and Intel GPUs?

@uggla
Collaborator

uggla commented May 21, 2021

@itwars, it seems only a subset of NVIDIA boards support this feature, mostly the high-end ones.
Regarding Intel and AMD, I have not done very extensive research, but it seems power data is not available, and the libraries equivalent to NVML are really limited. The only good news is that, if I remember well, the one from AMD is open source (which is not the case for NVML).
It sounds like energy management was not really a priority for GPU vendors. Hopefully that will change in the near future.

@bpetit added the Hacktoberfest label Oct 5, 2021
@bpetit added the good first issue label Nov 3, 2021
@quantumsheep

quantumsheep commented Nov 17, 2022

Hi, is there any news on this issue?

@uggla
Collaborator

uggla commented Nov 17, 2022

@quantumsheep not really.
Do you need this feature? If someone needs it, I could be motivated to implement it.

@quantumsheep

@uggla We have some servers with multiple GPUs for which we want to get the electrical consumption. We can take some time to implement the feature, but if you can guide us on how to do it, we would love that ❤️

@samuelrince

Hey @uggla and @quantumsheep, I also need this feature! It would be perfect to have it in Scaphandre directly. Currently I rely on the utkuozdemir/nvidia_gpu_exporter project, but it is built around Prometheus and, to my knowledge, there is no other way to export its data. In Boavizta/boagent we use the JSON exporter from Scaphandre and would like to keep that workflow for GPU metrics as well.
Happy to help if I can, but I don't think you can count on my Rust programming skills, unfortunately 🙃

@uggla
Collaborator

uggla commented Nov 22, 2022

OK, I need to discuss the plan for the next release with @bpetit.
I also need to discuss how Benoit wants to deal with input plugins; I think this is the main difficulty with this issue.
Then I will try to put this issue on the TODO list.

@bpetit moved this from Triage to To do in General Feb 9, 2023
@bpetit pinned this issue Mar 3, 2023
@yuxin1234

@uggla @bpetit Any update on this issue? Thanks @filga

@bpetit added this to the Release v1.2.0 milestone Jul 25, 2023
@bpetit
Contributor Author

bpetit commented Jul 25, 2023

Hi!

I have a lot to catch up on in this thread, sorry!

@uggla don't hesitate to open a PR on dev; we are not doing many internal changes these days, mostly new features, so there shouldn't be too many conflicts.

I'll be more than happy to look at your PR soon after the next release.
