Negative cpu_energy values #309

Closed
JohannesGetzner opened this issue May 13, 2022 · 11 comments

@JohannesGetzner

  • CodeCarbon version: 2.1.0
  • Python version: 3.9
  • Operating System: Ubuntu

Description

As I did before the update to 2.1.0, I used tracker.start() and tracker.stop() to measure. I found that codecarbon sometimes reports negative energy values.

What I Did

I cannot really reproduce the scenario, as all parameters of what I am measuring are random. Hardware is:
[Screenshot: negativ_energy]
Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz

@benoit-cty
Contributor

Hello,
Thanks for reporting this.
I think it comes from

self.energy_delta = energy - self.last_energy
where we should return 0 if the result is negative.

Can you confirm that RAPL is used for measuring? You will see a log line like this at codecarbon startup:

[codecarbon INFO @ 13:11:29] Tracking Intel CPU via RAPL interface

@vict0rsch I see that for power we use abs(). Do you think that is a better idea?
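
To make the two options concrete, here is a minimal sketch comparing them. The function names and readings are hypothetical and are not taken from the CodeCarbon source; only the subtraction mirrors the line quoted above.

```python
def clamped_delta(energy, last_energy):
    """Return the energy change, or 0.0 if the reading went backwards."""
    return max(0.0, energy - last_energy)


def abs_delta(energy, last_energy):
    """Return the magnitude of the energy change, ignoring its sign."""
    return abs(energy - last_energy)


# Hypothetical readings in joules: the new reading is lower than the last one.
last_energy, energy = 120.0, 100.0
print(clamped_delta(energy, last_energy))  # 0.0
print(abs_delta(energy, last_energy))      # 20.0
```

Clamping discards the suspicious interval, while abs() counts it as consumption of the same magnitude. Neither recovers the energy that was really used during that interval.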

@vict0rsch
Contributor

vict0rsch commented May 16, 2022

That's surprising indeed. We had assumed the energy written in RAPL files would be strictly increasing. Not sure in what scenario that would not be the case. While using abs would indeed prevent negative values, I'm not sure to what extent we'd just be sweeping dirt under the rug... 🤔
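
One scenario in which a cumulative counter appears to go backwards is wraparound. The Linux powercap sysfs interface exposes the RAPL counter as energy_uj together with its range, max_energy_range_uj, and the counter restarts near zero once it reaches that range. Whether this explains the values reported in this issue is not established here. The sketch below is an assumption-laden illustration with a hypothetical function name, not CodeCarbon's code, and it assumes at most one wrap between two readings.

```python
def rapl_delta_uj(current_uj, previous_uj, max_range_uj):
    """Energy consumed between two counter readings, in microjoules.

    Assumes the counter wrapped at most once between the two readings.
    """
    delta = current_uj - previous_uj
    if delta < 0:
        # The counter restarted after reaching max_range_uj, so add the
        # range back to recover the (approximate) true difference.
        delta += max_range_uj
    return delta


# Hypothetical values: a tiny range of 1000 uJ makes the wrap easy to see.
print(rapl_delta_uj(500, 200, 1000))  # 300 (no wrap)
print(rapl_delta_uj(15, 990, 1000))   # 25 (wrapped once)
```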

@JohannesGetzner
Author

JohannesGetzner commented May 17, 2022

I can indeed confirm that RAPL is used for measuring.

@JohannesGetzner
Author

JohannesGetzner commented May 17, 2022

For further context:
Unfortunately, this problem does not occur as rarely as I had hoped, and I keep finding multiple negative energy values in the data collected by codecarbon.
I can also verify that this happens independently of the tracking time, but perhaps it has something to do with having set measure_power_secs=1.
I also noticed that in at least one of the data points where cpu_energy is negative, cpu_power is unreasonably high (see screenshot).

[Screenshot: Screenshot_20220517_211840]

If I have time I will try to debug the issue a bit further.

@benoit-cty
Contributor

Well, setting measure_power_secs=1 is probably the source of the problem.
If you confirm it, we will have to prevent this.
Why do you need such a high measurement frequency?

@JohannesGetzner
Author

JohannesGetzner commented May 19, 2022

I will experiment around with the parameter and see if I still get negative values. EDIT: changing measure_power_secs or leaving it at default does not fix the issue.

Regarding why I set it that low (maybe I misunderstood something):
I assumed that the lower I set it, the more precise my measurements will be.
Let's say everything that I want to measure between Tracker.start() and Tracker.stop() takes less time (e.g. 7secs) than the 15sec default for measure_power_secs. Does that mean that it will never actually measure the power or will it just accumulate everything correctly after the 7secs? Is that parameter only about how often the RAPL file is accessed?

So to sum up, I am not sure whether measure_power_secs actually refers to how often and therefore how accurately it measures or whether this parameter is just for the logging frequency (to show how much was consumed every 15 seconds).

I would actually really appreciate it if somebody could clarify this for me.

While we are on the subject I just wanna make sure that I correctly understand the difference between cpu_power and cpu_energy, because I didn't find much in the docs.

I assume cpu_energy is the total energy consumed by the CPU in kWh, for the tracking duration.

And I assume that cpu_power is the measured wattage of the CPU under the current load. But is the cpu_power that is reported at the end an average of the measurements, or is it the last measured one?

@benoit-cty
Contributor

CodeCarbon was originally made for long training runs, so it's interesting to see you using it for such short operations.
To measure the CPU with RAPL, we read a file containing the total energy in joules, so we could read it only at startup and shutdown and still be accurate. The power is computed by dividing the energy by the duration.
For the GPU, we read the instantaneous power and compute the energy by multiplying the power by the duration between measurements. So here you are right: the more measurements you make, the more accurate you are. But during training you are supposed to use the GPU at 100% most of the time.
We do not compute a mean power, so it is the last measured power that is reported in the CSV file.
With the Code Carbon API you can get a power data point at every measurement.
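
The two approaches described above can be sketched as follows. This is only an illustration of the arithmetic; the function names and sample numbers are assumptions and do not come from the CodeCarbon source.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def cpu_power_from_energy(energy_start_j, energy_end_j, duration_s):
    """CPU case: average power (W) from two cumulative energy readings (J)."""
    return (energy_end_j - energy_start_j) / duration_s


def gpu_energy_from_power(power_samples_w, interval_s):
    """GPU case: energy (J) as the sum of instantaneous power * interval."""
    return sum(p * interval_s for p in power_samples_w)


def joules_to_kwh(energy_j):
    """Convert joules to kilowatt-hours."""
    return energy_j / JOULES_PER_KWH


# Hypothetical 7-second run: the cumulative counter goes from 1000 J to 1315 J.
print(cpu_power_from_energy(1000.0, 1315.0, 7.0))         # 45.0

# Hypothetical GPU power samples taken once per second.
print(gpu_energy_from_power([200.0, 250.0, 150.0], 1.0))  # 600.0
```

In the CPU case only the first and last readings matter for the total, which is why a run shorter than measure_power_secs can still be accounted for. In the GPU case the result depends on how well the samples follow the real power curve.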

@JohannesGetzner
Author

Thanks for the explanation! I understand that codecarbon was built to measure long training-times. Yes, so for my use-case the shorter I can make the tracking duration the better, because overall compute time will be shorter. I am currently using codecarbon to track many short experiments. But it is possible for me to vary the duration of my experiments.

But so far I can confirm that measure_power_secs is not the cause of the negative values. I have tried smaller values, the default and up to 30secs. I can also confirm that the problem is likely to be hardware independent, as I got negative values on two different sets of hardware (both using RAPL).

@benoit-cty benoit-cty self-assigned this May 24, 2022
@benoit-cty
Contributor

Hello,

Could you try running the file https://github.com/mlco2/codecarbon/blob/f893556f4507126897c9227a3c790857ee360b54/examples/logging_to_file.py in your environment?
Let it run for a few minutes, then press Ctrl-C to stop it.
It will write the RAPL readings to a file, codecarbon.log, that you can attach to this issue.

If you know how to do it, you could also test with the codecarbon branch fix-running-exception, which contains a patch that disables negative values and logs an error if they occur.
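
For readers who want a picture of what such a patch might look like, here is a rough sketch of a counter that rejects negative deltas and logs an error. It is not the content of the fix-running-exception branch; the class name, logger name, and message are hypothetical.

```python
import logging

logger = logging.getLogger("rapl_sketch")  # hypothetical logger name


class EnergyCounter:
    """Tracks deltas between successive cumulative energy readings."""

    def __init__(self, initial_energy):
        self.last_energy = initial_energy
        self.energy_delta = 0.0

    def update(self, energy):
        """Store the new reading and return the non-negative delta."""
        delta = energy - self.last_energy
        if delta < 0:
            # Report the anomaly instead of silently hiding it.
            logger.error(
                "Negative energy delta %.3f (last=%.3f, new=%.3f); using 0",
                delta, self.last_energy, energy,
            )
            delta = 0.0
        self.energy_delta = delta
        self.last_energy = energy
        return delta
```

Logging the raw readings alongside the clamp keeps the anomaly visible, so the underlying cause can still be investigated from the log.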

@QQiun

QQiun commented May 29, 2022

@benoit-cty @vict0rsch CodeCarbon reads RAPL files to measure cpu_energy. When it runs, cpu_power increases quickly at the beginning (see logs). I think other processes becoming ready to run at that moment may lead to negative cpu_energy values.
[Image]

@benoit-cty
Copy link
Contributor

Fixed in a previous release.
