CPU stuck at 400MHz with thermald enabled with kernel 5.8.16 #280

Closed
dubst3pp4 opened this issue Nov 4, 2020 · 28 comments

@dubst3pp4

Hello,
I'm running into a strange problem: my processor gets throttled to 400MHz whenever I enable thermald. This started after I upgraded to a new release of Fedora (Fedora 33), which uses the kernel version 5.8.16-300.x86_64.

When I run

$ cpupower frequency-info

with thermald enabled, I get the following output:

analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 2.00 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 2.00 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 400 MHz (asserted by call to kernel)
  boost state support:
    Supported: no
    Active: no

The system is not responsive and very slow.

When I disable thermald with

$ sudo systemctl disable thermald

everything works fine again:

$ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 2.00 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 2.00 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 2.00 GHz (asserted by call to kernel)
  boost state support:
    Supported: no
    Active: no
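
The same comparison can be made without cpupower by reading the cpufreq files in sysfs. This is only a minimal sketch: it assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout, and the function names khz_to_mhz and print_cpu_freqs are made up for illustration.

```shell
# Convert a cpufreq value in kHz (the unit sysfs uses) to whole MHz.
khz_to_mhz() {
    echo $(( $1 / 1000 ))
}

# Print the current frequency of every CPU that exposes cpufreq in sysfs.
# Assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout.
print_cpu_freqs() {
    for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue                  # no cpufreq support: skip
        khz=$(cat "$f" 2>/dev/null) || continue  # file vanished or read error
        case $khz in
            ''|*[!0-9]*) continue ;;             # ignore empty or non-numeric values
        esac
        cpu=${f#/sys/devices/system/cpu/}        # strip the directory prefix
        cpu=${cpu%%/*}                           # keep only "cpuN"
        echo "$cpu: $(khz_to_mhz "$khz") MHz"
    done
}

print_cpu_freqs
```

Running it once with thermald stopped and once with thermald running should show the same difference as the two cpupower outputs above.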

My system is an ASUS laptop (model no. X541UAK) with an Intel Core i3 processor. thermald hasn't caused me any problems in recent years. To make sure that no other service is causing the problem, I've removed powertop as well as tlp.

Any suggestions? Thanks in advance!

@spandruvada
Contributor

spandruvada commented Nov 4, 2020 via email

@dubst3pp4
Author

I've done exactly that and the laptop was slow again immediately after starting thermald. I've attached the output - many thanks! :-)
thermald-output.txt

@spandruvada
Contributor

spandruvada commented Nov 4, 2020 via email

@spandruvada
Contributor

Any update?

@dubst3pp4
Author

Sorry, I will post more details later this week! :-)

@dubst3pp4
Author

So I finally ran the commands you mentioned. I've attached two files: one with the output when thermald is enabled (thermald_enabled.txt) and the other with the output when thermald is disabled (thermald_disabled.txt).

There are some differences between the outputs, but unfortunately I can't interpret them.

Thank you so much for your help :-)

@spandruvada
Contributor

In one window, after thermald has started:

echo 0 > /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:2/enabled

while true; do echo /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:2/enabled; sleep 1; done

Then check whether you still get into this state. I guess you have to run some workload to get into this state.

@spandruvada
Contributor

Mistake in the above while command: change echo to "cat":
while true; do cat /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:2/enabled; sleep 1; done

If the above doesn't fix it, then also do this step:

echo 0 > /sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/enabled
while true; do cat /sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/enabled; sleep 1; done
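
The polling loops above can be wrapped in a small helper that stops on its own and timestamps each reading, which makes it easier to see when the flag flips back. This is only a sketch; poll_file is a made-up name and not part of thermald.

```shell
# Print the contents of FILE once per interval, COUNT times, with a timestamp.
# Usage: poll_file FILE [COUNT] [INTERVAL_SECONDS]
poll_file() {
    file=$1
    count=${2:-10}      # default: 10 readings
    interval=${3:-1}    # default: 1 second between readings
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -r "$file" ]; then
            printf '%s %s\n' "$(date +%T)" "$(cat "$file")"
        else
            printf '%s %s\n' "$(date +%T)" "unreadable: $file"
        fi
        i=$((i + 1))
        # Sleep between readings, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (as root, with the paths from this thread):
#   poll_file /sys/class/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:2/enabled 30
#   poll_file /sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/enabled 30
```

If the value changes from 0 back to 1 while thermald is running, something is re-enabling the limit after the echo.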

@spandruvada
Contributor

I have added a fix. Please help to verify it. This is on a special branch.
https://github.com/intel/thermal_daemon/commits/asus_pl2_fix

After checking out thermald, change the branch with:
$ git checkout remotes/origin/asus_pl2_fix -b asus_pl2_fix

Then follow README.txt for build procedure.

@robmusial

Hey @spandruvada I am having the same issue on a Lenovo S730-13IWL with Intel Core i7-8565U 4 x 1.8 - 4.6 GHz, 16 GB RAM, and Intel UHD Graphics 620.

I see that the branch is tagged for Asus, but I thought I'd give it a shot and see if it worked for me. After building and installing it, I am still having the same issue as @dubst3pp4.

With thermald stopped and disabled:
[rob@localhost ~]$ cat /proc/cpuinfo | grep MHz
cpu MHz : 2982.409
cpu MHz : 3007.629
cpu MHz : 3053.956
cpu MHz : 2987.189
cpu MHz : 2965.826
cpu MHz : 2993.816
cpu MHz : 3017.187
cpu MHz : 2912.934

With thermald started and enabled (both the latest for f33 and the one from this branch):
[rob@localhost ~]$ cat /proc/cpuinfo | grep MHz
cpu MHz : 500.000
cpu MHz : 500.000
cpu MHz : 500.000
cpu MHz : 500.000
cpu MHz : 499.999
cpu MHz : 499.999
cpu MHz : 499.999
cpu MHz : 499.999
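
For longer test runs, the per-core lines can be condensed into a single line per sample. The following is only a rough sketch; summarize_mhz is a made-up helper that reads /proc/cpuinfo-style input on stdin and reports the core count plus the lowest and highest frequency.

```shell
# Summarize "cpu MHz" lines: number of CPUs, minimum and maximum frequency.
summarize_mhz() {
    awk -F: '/^cpu MHz/ {
        v = $2 + 0                          # numeric value after the colon
        if (n == 0 || v < min) min = v
        if (n == 0 || v > max) max = v
        n++
    }
    END {
        # Print nothing if the input had no "cpu MHz" lines.
        if (n > 0) printf "cpus=%d min=%.3f max=%.3f\n", n, min, max
    }'
}

# Only query the live system when /proc/cpuinfo is readable.
if [ -r /proc/cpuinfo ]; then
    summarize_mhz < /proc/cpuinfo
fi
```

A throttled system would show min and max both pinned near the lowest frequency, as in the second listing above.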

Thanks for looking into this!

@spandruvada
Contributor

Let's start with:

Disable thermald
reboot

#rdmsr -a 0x774
#grep -r . /sys/class/powercap/intel-rapl/intel-rapl:0/*
#grep -r . /sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/*

#thermald --no-daemon --loglevel=info --adaptive
In another window:

#rdmsr -a 0x774
#grep -r . /sys/class/powercap/intel-rapl/intel-rapl:0/*
#grep -r . /sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/*

Send the thermald log and also the sysfs dumps from before and after.

We can address this ASAP.
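
The before/after dumps requested above can be captured in a form that diff can compare directly. This is a sketch under a few assumptions: snapshot_sysfs is a made-up helper, -H is added to grep so every line carries its file name, and the output file names are only examples.

```shell
# Dump every readable file under DIR as "path:content" lines, sorted so that
# two dumps can be compared with diff. Unreadable files are skipped silently.
snapshot_sysfs() {
    grep -rH . "$1" 2>/dev/null | sort
}

# Example (as root):
#   snapshot_sysfs /sys/class/powercap/intel-rapl/intel-rapl:0 > sysfs_before.log
#   ... start thermald in another window and reproduce the slowdown ...
#   snapshot_sysfs /sys/class/powercap/intel-rapl/intel-rapl:0 > sysfs_after.log
#   diff sysfs_before.log sysfs_after.log
```

Counters such as energy_uj change on every read, so only differences in the limit and enabled attributes are interesting in the diff.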

@robmusial

For the sake of clarity I ran this with the thermald built from the asus_pl2_fix branch.

Logs attached as requested. Let me know if there is anything else you'd like.

thermald.log
sysfs_after.log
sysfs_before.log

@robmusial

robmusial commented Nov 18, 2020

Attached is another thermald log when seeing sustained throttling at 400 MHz.

Edit: I also didn't notice the kernel version in this issue. I have my packages fully updated in F33 and I am running 5.9.8-200.fc33.x86_64.

thermald.0.log

@spandruvada
Contributor

Please apply the attached patch with git apply after unzipping:

git apply test_lenovo_patch.diff
test_lenovo_patch.zip

@dubst3pp4
Author

> I have added a fix. Please help to verify. This is on special branch.
> https://github.com/intel/thermal_daemon/commits/asus_pl2_fix
>
> After checkout thermald change the branch to
> $ git checkout remotes/origin/asus_pl2_fix -b asus_pl2_fix
>
> Then follow README.txt for build procedure.

Thanks @spandruvada, I've just built thermald and started it. The system runs without problems now! When I run

cpupower frequency-info

the output varies depending on the load (and is not stuck at 400MHz):

analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 400 MHz - 2.00 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 2.00 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1.66 GHz (asserted by call to kernel)
  boost state support:
    Supported: no
    Active: no

Do you need any additional info? Thanks for your support!

@robmusial

> Please apply the attached patch with git apply after unzip
>
> git apply test_lenovo_patch.diff
> test_lenovo_patch.zip

After applying this patch I was not able to replicate my issue and it appears to be fixed. I gave it as much load as it could take and the thermald performance was great. I have included the log in case it yields any more useful info.

Thank you so much for your efforts!

thermald.1.log

@spandruvada
Contributor

spandruvada commented Nov 18, 2020 via email

@spandruvada
Contributor

spandruvada commented Nov 18, 2020 via email

@spandruvada
Contributor

@dubst3pp4
Please test the 2.4-pre release. This is pushed to the master branch. I see that we can improve more on your platform.
Check the version with "thermald -v".
Please attach a log.
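
To make sure the freshly built binary is the one being tested, the reported version number can be compared against 2.4. This is only a sketch: version_ge is a made-up helper, it relies on sort -V from GNU coreutils, and it does not try to interpret suffixes such as "-pre". The output format of "thermald -v" is not shown in this thread, so the version number has to be copied from that output by hand.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: "installed" is a placeholder for the number reported by "thermald -v".
installed=2.4
if version_ge "$installed" 2.4; then
    echo "thermald $installed is new enough"
else
    echo "thermald $installed is older than 2.4"
fi
```

Version sort is used instead of a plain string comparison so that a later release such as 2.10 still counts as newer than 2.4.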

@robmusial

> @dubst3pp4
> Please test the 2.4-pre release. This is pushed to master branch. I see that we can improve more on your platform.
> Check the version with "thermald -v".
> Please attach a log.

I know you addressed this to @dubst3pp4 but I gave it a run as well and was not able to replicate my previous errors. To confirm, this had the Lenovo fix as well?

Thanks!

@spandruvada
Contributor

spandruvada commented Nov 22, 2020 via email

@robmusial

Excellent, thanks. I just wanted to make sure I wasn't getting a false result and that my testing was correct. I've been running it for about 24 hours and it has been great. Thanks again for working on this so quickly. Really impressed.

@spandruvada
Contributor

spandruvada commented Nov 22, 2020 via email

@dubst3pp4
Author

> @dubst3pp4
> Please test the 2.4-pre release. This is pushed to master branch. I see that we can improve more on your platform.
> Check the version with "thermald -v".
> Please attach a log.

I will test this tomorrow! Sorry once again for the slow response time ;-)

@ferdnyc

ferdnyc commented Nov 24, 2020

> I will test this tomorrow! Sorry once again for the slow response time ;-)

Perhaps it's you, and not your PC, that's being throttled down to 400MHz @dubst3pp4 ? 😉

@dubst3pp4
Author

@ferdnyc Yes, indeed, I'm a little bit throttled because of some stack overflows ;-)

@spandruvada I've just built the new version, checked that I'm using the correct binary version, and ran thermald with the debug log level. Everything seems to work very well 👍 See my logfile attached.

I can only second @robmusial's statement: thank you so much for your work! I have rarely had so much support from the maintainers of an OSS project! 😀

log_thermald.txt
log_version.txt

@spandruvada
Contributor

spandruvada commented Nov 25, 2020 via email

@spandruvada
Contributor

Released version 2.4 with changes.
