-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial support for DPTF adaptive performance policy #224
Conversation
Different thermal policies require different UUIDs to be written to the INT3400 interface. Avoid hardcoding that and allow engines to override it as needed.
Systems may provide the DPTF policy information in an ACPI table. Allow that to be used directly, and disable parsing of XML config to avoid confusion.
If we're going to read a binary sysfs file, we want to know how big it is in advance. Add a helper to do that. In addition, add support for reading files greater than a single page.
We can obtain PPCC parameters from the firmware, so abstract that rather than calling directly into the XML parser code.
The adaptive target tables may include information targeted at specific device types. Add core support for passing that through to the cooling devices.
Firmware may provide multiple PPCC entries, including one for DRAM. Try to use it if we can, but fix up the XML parser to it makes sure that we don't break existing setups.
Add support for setting RAPL parameters based on adaptive targets. Incomplete at present.
In adaptive mode we prefer the use of the MMIO RAPL interface if available.
We want engines to be able to override this.
This is something of a hack right now (it should be event driven rather than polling), but we need to update the adaptive engine state to check whether we should transition to a new target.
The firmware doesn't pay attention to the configured UUID in the INT3400 device until _OSC is called, and this only happens when we enable the associated thermal zone. Make sure we do so, and disable it again on exit.
DPTM provides an adaptive power policy that allows for improved integration between the platform and the OS. Firmware can provide a set of complex conditions to the OS, and an OS agent (in this case thermal_daemon) is responsible for evaluating them, potentially making use of information that is easily available to the OS and not the firmware. The agent evaluates each set of conditions in turn, and once the first evaluates completely it triggers a set of actions. This implementation only covers a subset of the conditions - notably, it only currently handles conditions that are associated with OEM-specific variables exported via ACPI. It is also only currently capable of managing RAPL targets or loading new passive thermal policies, since that's all that my test system does.
Add a command line option to enable the use of the adaptive engine.
GDDV payloads can be compressed using LZMA. Add support for handling that.
PTTV entries may have a uint64 rather than a string for the target argument, so make sure we handle that and also deal with cases where we have multiple PSVTs in the parent object rather than in the shared space by passing through a unique name.
Skipping a uint64 requires also skipping the type defintion, so fix up the offset calculations.
We're not going to do anything with this data right now, so if it's not something we can parse, bail.
This changes both the fatal into an error, and the loop to report _all_ unsupported conditions at once. It should make it easier to gauge how much work is neded to support a certain firmware.
Some systems use APPC to provide additional information associated with conditions. Parse that and then merge it into our conditions.
Doing it this way was a bad idea, but here we are.
Make sure we don't inappropriately walk off the end of the APAT table.
PSVT tables may contain multiple passive trip points. Add support for that.
Add rough support for temperature comparisons. Right now this ignores hysteresis.
If the APPC doesn't have anything that looks like a legitimate version number at the start, ignore it.
This relies on upower, so add a dependency on that as well.
Add support for conditions that depend on whether the system is on AC or not
Add a handler to report that the workload is bursty at all times - we don't currently have a good way to report something meaningful here.
@mj59 Thanks for you effort. What do you think about this Lenovo statement. It Is not clear to me the how that firmware will interact with this. |
/cc @mrhpearson if he can disclose something more from Lenovo. |
Hi, Lenovo hat back on - I don't really know how this will interact with our new firmware. I assume if you're not using thermald then it doesn't really matter, but otherwise the two pieces will potentially be fighting one another. I was planning to try it out on one of our devices to see what happens but haven't had time this week. I did forward the details about @mjg59 work being done to the firmware team last week to get their feedback but they didn't really have anything to add. If there is anything specific that I should ask them please let me know. Sorry - in all that rambling nothing very useful. If you are looking for specifics let me know and I'll do the best to find answers. Thanks for adding me to the thread. |
Thank for your feedback. Yes It would be really nice to know if the new firmware will conflict with Thermald with or without this patch and so if Is It the case to uninstall Thermald in each cases or just if this will not merged on these Lenovo devices. I.e. the other workarund solution conflicted with Thermald: https://github.com/erpalma/throttled/blob/master/README.md#thermald /cc @erpalma |
We were out of sync for higher conditions, which caused misinterpretation of legitimate values.
I played with this some and I think I need some guidance on what needs testing and if what I'm doing makes sense. I appreciate I really need to read the code but it's been a crazy week and I wanted to get some results back so haven't done that. I built and ran this version of thermald. I needed to generate a thermal-conf.xml so used dptfxtract for that - it was interesting how it generated 5 different config files and I really need to understand what it is doing as I suspect that is important (let me know if you want them posted somewhere to review) . I've not personally had much to do with thermald so if there are pointers on what is better please let me know. I'm running the latest version of our firmware and it has a feature where you can switch between thermal modes (low, normal, high - with corresponding power settings of 7.5W for low, 14.5 for normal and 51W for high). With this thermald running I seem to be limited to 15W (and around 75C). I'm running Intels Thermal Analysis Tool and the Package power was clamped at 15 (running a kernel build to keep things busy). If I'm not running thermald but just Lenovo firmware then in performance mode the power goes up to 51W and Temperature to 92 - so better performance (as long as it's in desk mode - you won't get that when it's in lap mode). So...in the 'battle' to control the power thermald is winning, but in this case it's not giving the optimal performance. I suspect that's related to the config files I've generated. I tested lap mode. In this case our firmware detects it's on a lap and lowers the power to 14.5W (or 7.5W if you have low mode selected). If I'm running thermald I get 10W (it was a bit quirky - I saw 15W at one point...but only briefly). Hope the above is helpful - happier to collect more details or try anything out if it would be useful. Especially if I'm not doing things right :) I'm going to try it on some platforms which don't have the updated firmware - I know we have plenty of unhappy customers because we haven't been able to backport it everywhere and I'm really hoping this is a good solution there (longer term I also suspect we'll all be better off if we can support DPTF properly) Let me know if any questions. |
@mrhpearson You need to run with the --adaptive option, which should then avoid the need to use dptfxtract and any xml config files. If you don't do that it's equivalent to the current upstream. |
(But also it would be super helpful to know how Lenovo firmware differentiates between Linux and a DPTF-enabled operating system, and what the differences in behaviour are) |
@mjg59 If I run with --adaptive I get: I'll check with the firmware team how they determine the difference - afraid you'll have to wait a while for the reply as they're in Japan and its a holiday there this week. |
@mrhpearson Did you build a kernel with the additional patches in the repo? They're not upstream yet. |
Crud...missed those. Sorry! |
Send me your thermal-conf.xml. You need to at least have 5.4 kernel (For ICL). Also cat /sys/bus/pci/devices/0000:00:04:.0/power_limits/*.
For Lenovo systems also #rdmsr 0x1a2
Thanks,
Srinivas
From: Mark Pearson <notifications@github.com>
Sent: Tuesday, April 28, 2020 6:58 PM
To: intel/thermal_daemon <thermal_daemon@noreply.github.com>
Cc: Pandruvada, Srinivas <srinivas.pandruvada@intel.com>; Comment <comment@noreply.github.com>
Subject: Re: [intel/thermal_daemon] Add initial support for DPTF adaptive performance policy (#224)
I played with this some and I think I need some guidance on what needs testing and if what I'm doing makes sense. I appreciate I really need to read the code but it's been a crazy week and I wanted to get some results back so haven't done that.
I built and ran this version of thermald. I needed to generate a thermal-conf.xml so used dptfxtract for that - it was interesting how it generated 5 different config files and I really need to understand what it is doing as I suspect that is important (let me know if you want them posted somewhere to review) . I've not personally had much to do with thermald so if there are pointers on what is better please let me know.
I'm running the latest version of our firmware and it has a feature where you can switch between thermal modes (low, normal, high - with corresponding power settings of 7.5W for low, 14.5 for normal and 51W for high).
With this thermald running I seem to be limited to 15W (and around 75C). I'm running Intels Thermal Analysis Tool and the Package power was clamped at 15 (running a kernel build to keep things busy).
If I'm not running thermald but just Lenovo firmware then in performance mode the power goes up to 51W and Temperature to 92 - so better performance (as long as it's in desk mode - you won't get that when it's in lap mode). So...in the 'battle' to control the power thermald is winning, but in this case it's not giving the optimal performance. I suspect that's related to the config files I've generated.
I tested lap mode. In this case our firmware detects it's on a lap and lowers the power to 14.5W (or 7.5W if you have low mode selected). If I'm running thermald I get 10W (it was a bit quirky - I saw 15W at one point...but only briefly).
Hope the above is helpful - happier to collect more details or try anything out if it would be useful. Especially if I'm not doing things right :)
I'm going to try it on some platforms which don't have the updated firmware - I know we have plenty of unhappy customers because we haven't been able to backport it everywhere and I'm really hoping this is a good solution there (longer term I also suspect we'll all be better off if we can support DPTF properly)
Let me know if any questions.
Thanks
Mark
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub<#224 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA5ALNWHR2NNKNN7BSABQQLRO6CSFANCNFSM4MGUAYLA>.
|
So some interesting results but not sure they will be what you are looking for. Thanks for the pointer on the patches - I hadn't seen the instructions on your fork and I think I'm now setup correctly. I confirmed with the Lenovo firmware that it was in deskmode with high performance mode enabled (power up to 51W) and then ran thermald --adaptive. At this point the power fluctuated between 10 and 15W - going up and down in small increments (MSR PL1 changing value). Temperature stayed steady around 65 to 70C. I then picked up the laptop to get it in lap mode (I'm assuming the DPTF tables should handle this as that's their raison d'etre) and interestingly the power then fixed solidly at 15W (temp at 70C). As a minor note - stopping thermald doesn't return everything to 'normal' operation. A reboot is needed at that point (I'm assuming registers different to what our FW uses have been tweaked). Srinivas - I'll forward you my xml file shortly. The other details are: |
@mrhpearson Ok, thanks - I should have an X1v7 showing up in the next few weeks and I'll dig into it. Are you able to open an issue in the github.com/mjg59/thermal_daemon repository and attach the output of the acpidump command when run as root? |
No problem - done. |
/cc @bbian |
As an open source developer I don't have access to the DPTF/DTT specifications. I will bring in these changes to main line in the next release. It may be modified from what is submitted with due credit given to the author. We need volunteers to test implement workaround if this causes any negative user experience. So initially it will still have some command line option to turn on this feature. |
I hoped that some DPTF team member for Chrome OS could comment here but.... |
Afaict Chromeos only supports passive policies today. so I'm not sure how much their comments help this discussion. |
Probably not so much if they don't have access to the dynamic specs. |
@spandruvada Let us know when you will try to submit the kenel patches. |
What is the STOA? |
kernel patches are merged to 5.8-rc1. So someone can try on top of 5.8-rc1. |
Great initiative @mjg59. I tried this out on my Lenovo Thinkpad X1 Carbon 6th (20KH007BMX), running Ubuntu 20.04 with mainline 5.8rc2 kernel, with laptop on desk and plugged into standard 65 W charger, running Looks like it's still being throttled down to 75-80° C / 15 W under load. Ambient temperature was 28.5° C (hot in Sweden today and no AC :p). This is similar to the behavior I see with default Ubuntu 20.04 kernel (5.4.0-39.43) and thermald 1.9.1-1ubuntu0.1 package. Before I tried this, I completely removed the Let me know if you want me to dump some tables out or something that would help. EDIT: I posted an issue on your repo instead, with an acpidump: mjg59#7 |
On Sat, 2020-06-27 at 11:35 -0700, Elvis Stansvik wrote:
Great initiative @mjg59<https://github.com/mjg59>.
I tried this out on my Lenovo Thinkpad X1 Carbon 6th (20KH007BMX), running Ubuntu 20.04 with mainline 5.8rc2 kernel, with laptop on desk and plugged into standard 65 W charger, running thermald --no-daemon --adaptive and starting stress -c 8:
[bild]<https://user-images.githubusercontent.com/86675/85929359-5fda7d80-b8b4-11ea-9e94-a78c1b817d5c.png>
Please attach output of acpidump.
Thanks.
Looks like it's still being throttled down to 75-80° C / 15 W under load.
Ambient temperature was 29° C (hot in Sweden today and no AC :p).
This is similar to the behavior I see with default Ubuntu 20.04 kernel (5.4.0-39.43) and thermald 1.9.1-1ubuntu0.1 package.
Let me know if you want me to dump some tables out or something that would help.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#224 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA5ALNVWROXSUXU5LNRQQC3RYY3YXANCNFSM4MGUAYLA>.
|
On Sat, 2020-06-27 at 11:35 -0700, Elvis Stansvik wrote:
Great initiative @mjg59<https://github.com/mjg59>.
I tried this out on my Lenovo Thinkpad X1 Carbon 6th (20KH007BMX), running Ubuntu 20.04 with mainline 5.8rc2 kernel, with laptop on desk and plugged into standard 65 W charger, running thermald --no-daemon --adaptive and starting stress -c 8:
[bild]<https://user-images.githubusercontent.com/86675/85929359-5fda7d80-b8b4-11ea-9e94-a78c1b817d5c.png>
Looks like it's still being throttled down to 75-80° C / 15 W under load.
Ambient temperature was 29° C (hot in Sweden today and no AC :p).
This is similar to the behavior I see with default Ubuntu 20.04 kernel (5.4.0-39.43) and thermald 1.9.1-1ubuntu0.1 package.
Let me know if you want me to dump some tables out or something that would help.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#224 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA5ALNVWROXSUXU5LNRQQC3RYY3YXANCNFSM4MGUAYLA>.
|
@spandruvada I moved this over to an issue on @mjg59's fork here: mjg59#7. It has an acpidump attached. |
I merged all changes to v2.3_development branch with minor change for removing my name from authorship. I have to use manual merge for conflicts. I hope merged correctly. I did a diff with mjg59/thermal_daemon to check. |
This is a very rough attempt at implementing support for the adaptive performance DPTF policy. I have no documentation on this, so it's a best guess in many places. It only supports a subset of the conditions and actions and I'm really not sure that the hack that uses the dram RAPL engine if the sensor is called TMEM is correct, so please be as critical as necessary.