Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add max_energy_range_uj to deal with RAPL wraparound #323

Merged
merged 1 commit into from
Aug 3, 2022

Conversation

mp15
Copy link
Contributor

@mp15 mp15 commented Jul 19, 2022

When we detect a negative delta from RAPL energy usage then assume we have wrapped around and add max_energy_range_uj to it. According to Intel docs this should happen about every 60 secs under a heavy load. Fixes #322

@mp15 mp15 marked this pull request as ready for review July 19, 2022 13:17
@mp15
Copy link
Contributor Author

mp15 commented Jul 19, 2022

We probably should write a test to verify the delta functionality with mock rollover readings but this should be better than returning 0.

@mp15 mp15 force-pushed the rapl_wraparound branch 2 times, most recently from df5dc3b to 8302892 Compare July 25, 2022 12:59
@benoit-cty
Copy link
Contributor

benoit-cty commented Aug 2, 2022

Thank you very much for this investigation !

If we understand well:

  • RAPL write energy to a counter energy_uj.
  • At some point the counter energy_uj goes above the constant max_energy_range_uj and fall back to zero.
  • RAPL continue to write energy to energy_uj.

So if it goes to zero while codecarbon is running, we have the negative value problem.

With your patch codecarbon will add max_energy_range_uj to last_energy in case of negative energy to avoid the problem.

May I suggest to refactor your code like this ?

if self.last_energy > energy:
    logger.debug(
      f"In RAPLFile : Current energy value ({energy}) is lower than previous value ({self.last_energy}). Assuming wrap-around! Source file : {self.path}"
    )
    energy = energy + self.max_energy_reading
self.power = self.power.from_energies_and_delay(
energy, self.last_energy, duration
)
self.energy_delta = energy - self.last_energy
self.last_energy = energy
return

@mp15 mp15 force-pushed the rapl_wraparound branch 4 times, most recently from 23fd735 to acd6de0 Compare August 2, 2022 22:29
@mp15
Copy link
Contributor Author

mp15 commented Aug 2, 2022

I've followed your refactor but I have added a new_last_energy variable to it so that it works correctly the next time we call delta(duration). I assume we are accumulating the delta values across calls?
I've also modified the test to compare the results and use almostEqual (as Energy is held as a float internally and non-zero floats are fun to compare).

Copy link
Collaborator

@SaboniAmine SaboniAmine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks!
This corrects the previous patch we made on this and we have learnt something.
Thanks for your contribution!

codecarbon/core/cpu.py Show resolved Hide resolved
tests/test_energy.py Outdated Show resolved Hide resolved
@mp15 mp15 force-pushed the rapl_wraparound branch 3 times, most recently from 3bcf0ec to efb4fbf Compare August 3, 2022 09:50
When we detect a negative delta from RAPL energy usage then assume
we have wrapped around and add max_energy_range_uj to it.
According to Intel docs this should happen about every 60 secs
under a heavy load. Fixes mlco2#322
@benoit-cty
Copy link
Contributor

Thanks, you're right for new_last_energy 👍

@benoit-cty benoit-cty merged commit 96c1ce1 into mlco2:master Aug 3, 2022
@mp15 mp15 deleted the rapl_wraparound branch August 3, 2022 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intel RAPL wraparound
3 participants