Dell Latitude 5420 (TGL) throttled to 400MHz CPU / 100MHz GPU after 30s #293

Closed
majanes-intel opened this issue Mar 5, 2021 · 220 comments

Comments

@majanes-intel

Kernel: 5.11.3
Debian: Testing
thermald: 2.4.3 (debian unstable)
processor: i7-1185G7 -- 28 W TDP

After running power-intensive workloads for a short amount of time, the CPU and/or GPU will be throttled down drastically to ~10% of peak.

Running turbostat reveals that the peak package power is ~16 W, far below the 28 W TDP limit.

Running lm-sensors shows that the peak temp is ~50C, far below the limit.
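The comparison between measured package power and the configured limit can also be cross-checked without turbostat. The following is a minimal sketch, assuming the package RAPL domain is exposed through the Linux powercap sysfs interface at the path below (it may differ on other systems, and reading `energy_uj` usually requires root). The helper names are illustrative and not part of thermald.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain; adjust if your system differs.
RAPL_PKG = Path("/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0")


def read_int(path):
    """Read a single integer from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def package_power_watts(e0_uj, e1_uj, dt_s, max_range_uj=None):
    """Average power between two energy_uj samples, tolerating one counter wrap."""
    delta = e1_uj - e0_uj
    if delta < 0 and max_range_uj is not None:
        delta += max_range_uj  # the energy counter wrapped around once
    return delta / 1e6 / dt_s


def sample(domain=RAPL_PKG, interval_s=1.0):
    """Return (average package watts, long-term power limit in watts)."""
    domain = Path(domain)
    max_range = read_int(domain / "max_energy_range_uj")
    e0 = read_int(domain / "energy_uj")
    time.sleep(interval_s)
    e1 = read_int(domain / "energy_uj")
    # constraint_0 is normally the long-term limit (PL1); this is an assumption,
    # constraint_0_name in the same directory says what it really is.
    limit_w = read_int(domain / "constraint_0_power_limit_uw") / 1e6
    return package_power_watts(e0, e1, interval_s, max_range), limit_w
```

Calling `sample()` as root while the workload runs gives a pair that can be compared directly: a measured value well below the limit, as with the ~16 W reported here, suggests something other than the long-term limit is holding the package back.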

After reading #291 and #280, I enabled debug logs for thermald.
thermald.log

@spandruvada let me know if more information is needed. I can also bring the system to you in JF1. Mesa team will be using this laptop model for perf analysis.

@majanes-intel
Author

I neglected to mention: running an identical workload on Windows completes with no degradation in power. The system gets much warmer and the fans run at a clearly higher speed. Based on this observation, it seems clear that the system is not being limited by some physical thermal problem.

@spandruvada
Contributor

spandruvada commented Mar 5, 2021 via email

@spandruvada
Contributor

spandruvada commented Mar 5, 2021 via email

@spandruvada
Contributor

spandruvada commented Mar 8, 2021

I see different behavior between version 2.4.3 built from this repository and the version packaged in Debian. The power limit is not being set in the Debian version. Is Debian back-porting patches? If so, the package should carry a distinct private version.
Who can help here? @ColinIanKing

@ColinIanKing
Contributor

I'll sort that out first thing Tuesday.

@majanes-intel
Author

@spandruvada thanks for the work to figure out why this system was turning the GPU down to 100 MHz. Your test branch improves the situation substantially, although it looks like there is still a long way to go.
Running longer benchmarks, I can see that CoreTmp climbs all the way to 72 degrees, with GFXAMHz stable at the 1350 MHz maximum. PkgWatt is around 25 W, near the TDP limit.
After that, power to the system is cut, limiting the GPU to 400 MHz. The temperature declines steadily, with PkgWatt at 12 W. For a short time, GFXAMHz oscillates between 400 and 1000, then stays at 400 MHz. The temperature declines to 40 degrees by the time the benchmark is done.

lm-sensors reports that the package max temp is 100 degrees Celsius. Is that accurate/realistic? If so, then it seems like thermald should wait longer before cutting power. If not, then it seems like thermald could settle on a much higher power level for the package: at 12 W, the package temperature declines further than necessary and performance suffers.

I used Unigine Heaven for this data point. I took a look at the Thermal Analysis Tool on Windows, but I couldn't see how to get similar data from that platform. If you can give me some pointers, I should be able to at least understand what frequency/power levels Windows achieves, and what the stable max temp is.

@majanes-intel
Author

When I booted to the Windows partition, I noticed that updates were running in the background, which can perturb performance measurements. I let the system complete a full software update, which also updated the firmware on the device.
With the firmware update, I now get a stable 1000 MHz GPU clock, with the package temperature stable at 50 degrees Celsius.
While this is much closer to optimal, it still seems to me that the package could target a higher temperature.

@benzea

benzea commented Apr 9, 2021

So, we could support the power slider condition (either using a default value or pulling a value from power-profiles-daemon, p-p-d).

However, that would only help if we can resolve the \_SB_.PCI0.LPCB.ECDV.NGFF sensor. And, even if we do that, the OEM conditions coming through ACPI might still not have a sane value to proceed.

@zamazan4ik

I've run into the same issue. @benzea does thermald 2.4.4 resolve the issue?

I've built 2.4.4 for Fedora 34, installed it, and now get an almost constant 1700 MHz instead of 400 MHz. That's fine, but my CPU temperature is still low (~54°C), so I am sure the CPU could sustain a higher clock speed. Is that possible somehow?

@zamazan4ik

In case it helps, here is a log from journalctl -r (newest entries first) for thermald 2.4.4 on Fedora 34, launched with the options --systemd --dbus-enable --adaptive:

May 02 06:05:24 localhost.localdomain thermald[3010]: ppcc limits is less than def PL1 max power :28000000 check thermal-conf.xml.auto
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:24 localhost.localdomain thermald[3010]: sensor id 10 : No temp sysfs for reading raw temp
May 02 06:05:23 localhost.localdomain thermald[3010]: Polling mode is enabled: 4
May 02 06:05:23 localhost.localdomain thermald[3010]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unable to find a sensor for \_SB_.PCI0.LPCB.ECDV.NGFF
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported conditions are present
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN)
May 02 06:05:16 localhost.localdomain thermald[3010]: 27 CPUID levels; family:model:stepping 0x6:8c:1 (6:140:1)
May 02 06:05:16 localhost.localdomain systemd[1]: Started Thermal Daemon Service.

Laptop: Dell Latitude 5420 with 11th Gen Intel(R) Core(TM) i7-1165G7 CPU.
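Because the same thermald messages repeat many times in the journal output above, it can be easier to look at a count per distinct message. Here is a minimal sketch, assuming journal lines in the default format where each message follows a `thermald[PID]: ` tag. The function name `summarize` is illustrative only.

```python
import re
from collections import Counter

# Matches the "thermald[PID]: " tag that precedes each message in the journal.
TAG = re.compile(r"thermald\[\d+\]: ")


def summarize(lines):
    """Count identical thermald messages, ignoring the timestamp and host prefix."""
    counts = Counter()
    for line in lines:
        match = TAG.search(line)
        if match:  # lines from other units (e.g. systemd) are skipped
            counts[line[match.end():].strip()] += 1
    return counts
```

Feeding it the lines of a saved log and printing `summarize(lines).most_common()` puts the messages that recur most often first, which makes the unresolved sensor and the unsupported condition stand out.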

@benzea

benzea commented May 2, 2021

> I've run into the same issue. @benzea does thermald 2.4.4 resolve the issue?
>
> I've built 2.4.4 for Fedora 34, installed it, and now get an almost constant 1700 MHz instead of 400 MHz. That's fine, but my CPU temperature is still low (~54°C), so I am sure the CPU could sustain a higher clock speed. Is that possible somehow?

Oh, a newer thermald for Fedora would help?

Sorry about that. I thought I had picked up the important patches downstream already (even if I had an older version). I can update the package so that others benefit from that.

@zamazan4ik

zamazan4ik commented May 2, 2021

Yeah, 2.4.4 helps somewhat on Fedora but does not completely resolve the issue. Without thermald 2.4.4 (with an older thermald version or with none at all), the CPU is still downclocked to 400 MHz after ~30 seconds. With thermald 2.4.4 the highest clock is 1700 MHz. It would be awesome if you could build thermald 2.4.4 for Fedora :)

That is still too low, since the usual clock for this CPU is 2800 MHz. And I have no idea how it can be fixed :(

@zamazan4ik

@benzea any news about Fedora updates?

@benzea

benzea commented May 14, 2021

> @benzea any news about Fedora updates?

On its way now.

@zamazan4ik

zamazan4ik commented May 14, 2021

I am not familiar with modern Linux CPU scheduling, but I think the real root of the issue is some bug in the intel_pstate implementation in the Linux kernel, because on Windows I get a stable 2.8 GHz CPU clock on the same hardware. On Linux (Fedora 34) I get only 400 MHz without thermald and 1.7 GHz with it.

Maybe someone from the thermald team can provide more information. I will try to test another Dell Latitude 5420. Also, in a few days I'll test a Dell Latitude 5410 (I hope it'll work better).

By the way, is using thermald necessary with modern Intel CPUs or not?

@benzea

benzea commented May 14, 2021

> I am not familiar with modern Linux CPU scheduling, but I think the real root of the issue is some bug in the intel_pstate implementation in the Linux kernel, because on Windows I get a stable 2.8 GHz CPU clock on the same hardware. On Linux (Fedora 34) I get only 400 MHz without thermald and 1.7 GHz with it.

Please don't jump to such conclusions. The problem is that we need to do thermal management in userspace. To do so, we need to parse data from ACPI which we are not fully implementing because Intel is not publishing the specification. And, on top of that, there may also be vendor specific things.

i.e. the issue is probably: May 02 06:05:23 localhost.localdomain thermald[3010]: Unsupported condition 57 (UKNKNOWN). If you figure out what that condition means, then one might implement it, and that will likely help you.

> Maybe someone from the thermald team can provide more information. I will try to test another Dell Latitude 5420. Also, in a few days I'll test a Dell Latitude 5410 (I hope it'll work better).
>
> By the way, is using thermald necessary with modern Intel CPUs or not?

Yes.

@zamazan4ik

@benzea Thanks! Can you please describe to me a little bit more, what is the real difference in thermal management between the intel_pstate subsystem and thermald? Or just provide a link, where I can read about it. Thanks in advance!

> If you figure out what that condition means, then one might implement it and it will likely help you.

Do you have any suggestions, how can I debug it? Maybe there is some already existing guide for it. I am ready to invest some time into it and assist you as much as I can.

@benzea

benzea commented May 14, 2021

> @benzea Thanks! Can you please describe to me a little bit more, what is the real difference in thermal management between the intel_pstate subsystem and thermald? Or just provide a link, where I can read about it. Thanks in advance!
>
> > If you figure out what that condition means, then one might implement it and it will likely help you.
>
> Do you have any suggestions, how can I debug it? Maybe there is some already existing guide for it. I am ready to invest some time into it and assist you as much as I can.

Not really. You can enable debug logging for thermald and it'll dump more detailed information. It might be possible to guess what the condition is based on by looking at the values and the various limits that are being applied.

At the end, if we can just emulate a sane default value, we might not even need to know the exact meaning. For power-slider we just assume a "balanced" performance right now for example.

@spandruvada
Contributor

I pushed another change to fix the performance gap once you update BIOS on this system.

@mazzz1y
Contributor

mazzz1y commented Jun 29, 2021

Absolutely the same issue with the Latitude 7520.

@spandruvada
Contributor

If the issue is the same on the 7520, does the latest thermald fix it?

@mazzz1y
Contributor

mazzz1y commented Jun 29, 2021

The same as for @zamazan4ik

[root@dell tmp]# thermald --version
2.4.6

With the latest version the CPU is stuck at 1800 MHz; without thermald, at 400 MHz.

@zamazan4ik

Since I now have a Dell Latitude 5410, I cannot test the latest thermald on the 5420. I'll try to test the latest thermald on the 5410. I hope @benzea has ported the latest changes to the Fedora version.

@benzea

benzea commented Jun 29, 2021

> Since I now have a Dell Latitude 5410, I cannot test the latest thermald on the 5420. I'll try to test the latest thermald on the 5410. I hope @benzea has ported the latest changes to the Fedora version.

Fedora 34 and 35 both have thermald 2.4.6 currently.

@mazzz1y
Contributor

mazzz1y commented Jun 29, 2021

I've attached a debug log from the latest (2.4.6) version of thermald. Not sure if it's helpful.

In the log we can see the frequency dropping to 1800 MHz (temperature down to 55 from 73) after a few seconds of stress -c 8.

thermald --no-daemon --loglevel=debug --dbus-enable > /tmp/thermald.log

thermald.log

@spandruvada
Contributor

spandruvada commented Jun 29, 2021 via email

@mazzz1y
Contributor

mazzz1y commented Jun 29, 2021

No, I attached a new one with the adaptive option.

thermald-adaptive.log

@spandruvada
Contributor

spandruvada commented Jun 29, 2021 via email

@mazzz1y
Contributor

mazzz1y commented Jun 29, 2021

[root@dell ~]# cat /sys/class/thermal/thermal_zone*/type
INT3400 Thermal
TCPU
iwlwifi_1
x86_pkg_temp
[root@dell ~]# uname -a
Linux dell 5.12.13-arch1-2 #1 SMP PREEMPT Fri, 25 Jun 2021 22:56:51 +0000 x86_64 GNU/Linux
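The zone listing above shows only the zone types. A minimal sketch, assuming the standard Linux thermal sysfs layout (`thermal_zone*/type` and `thermal_zone*/temp`, with temperatures in millidegrees Celsius), could report the current temperature next to each type. The function name `read_zones` is illustrative.

```python
from pathlib import Path


def read_zones(root="/sys/class/thermal"):
    """Return {zone_type: temperature in degrees C, or None if unreadable}."""
    zones = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        zone_type = (zone / "type").read_text().strip()
        try:
            milli_c = int((zone / "temp").read_text().strip())
            zones[zone_type] = milli_c / 1000.0
        except (OSError, ValueError):
            # Some zones expose no usable temperature (missing file or read error).
            zones[zone_type] = None
    return zones
```

Note that two zones with the same type string would overwrite each other in this dictionary; for the four distinct types listed here that is not a concern.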

@spandruvada
Contributor

spandruvada commented Jun 29, 2021 via email

@sebastianha

How long did you run the stress test, and what were the temperatures?

@VitaliiSerdiuk

@sebastianha 2 minutes, 67°C. It's very unstable. Just now I got 3700 MHz at 97°C for one minute, and then it dropped to 2300 MHz. The result is different each time. Sometimes it gets stuck at 400 MHz.

@sebastianha

Thanks, that is really strange behaviour. Will test it as soon as I have time.

@VitaliiSerdiuk

Has anyone tried a solution like this one on the 5420 or 7420? https://www.ultrabookreview.com/14875-fix-throttling-xps-15/

@sebastianha

Since it works correctly for me on Windows, I don't assume a hardware problem here. With Windows and ThrottleStop I get constantly high frequencies at high temperatures. Of course, with new thermal paste you would hold the higher speeds a little longer, but my current problem is the extreme throttling, which only occurs on Linux.

@spamik

spamik commented Jan 13, 2022

On Linux on the 7320, no matter what I do, I can't get the CPU to 60°C or more, and even in ultra performance mode the cooling fan won't hit its fastest speed. So I believe it has nothing to do with temperature. I also played with throttled: it has a perfmon which reports that the system is throttled because of power. It seems more probable that something is throttling to keep power consumption under some level. The issue is that it's not possible to move that level (and it doesn't matter whether the laptop is powered from a dock, a travel charger or the battery; the behaviour is the same).

@VitaliiSerdiuk

Open a new ticket, as the current one was closed.

@spamik

spamik commented Jan 14, 2022

I've just upgraded the BIOS on the 7320 to 1.14.1 and there is no change. It still downscales to 1800 MHz under load, with the CPU temperature at ~50°C. Anything new from the Dell guys? :-)

@sebastianha

Same here, sadly nothing new from Dell :(

@Joshua-Riek

I do have a slight update from Dell, looks like the remote Linux engineering team has agreed to take a look at this issue, but they don't currently have a local engineer assigned to interface with this team yet, so I don't have any further information.

I have been advised that there is no guarantee that this will be fixed, and that even hoping for a fix calls for caution.

@sebastianha

Dell is collecting information: https://www.dell.com/community/Latitude/Latitude-5420-7420-7520-CPU-Throttling-Issue-on-Linux/m-p/8129749/highlight/true#M39458

@zamazan4ik

@sebastianha since I have no Dell account, can you please re-post information from me to this Dell thread? Thanks in advance!

Model: Dell Latitude 5410
BIOS and other information: https://pastebin.com/8Wek1nTp
Linux: Fedora 35 with Linux kernel 5.15.13-200
Issue on Windows: after installing the Intel Dynamic Tuning driver, no; before, yes
What did I do: stressed the CPU with "stress-ng -c 8" for some time (more than half an hour is usually enough in all cases). It is also reproducible with other CPU-heavy loads such as compilation.
OOB: always
What troubleshooting have you done: see #293

@sebastianha

> @sebastianha since I have no Dell account, can you please re-post information from me to this Dell thread? Thanks in advance!

done

@sebastianha

A finding from my side:

When booting without thermald, the CPU throttles down to 400 MHz. When starting thermald, it is higher and throttles to 1800 MHz.

I have the same effect when not using thermald but only disabling "intel-rapl:0":

echo 0 > /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/enabled

Then the CPU does not go down to 400MHz anymore!

I ran strace on thermald and checked for changes in /sys/. Then I went through them step by step and narrowed it down to the line above. I still think that thermald is not in control of the CPU at all on these notebooks.

My hypothesis: Currently system is only triggering that the CPU is no longer limited to 400MHz but everything else has no effect.
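For repeated experiments it can help to toggle the RAPL domain from a script and remember its previous state, so the change is easy to revert. This is a minimal sketch of the same write as the `echo 0 > .../enabled` line above, assuming the same sysfs path and root privileges. The function name `set_rapl_enabled` is illustrative, and disabling the power limit is an experiment rather than a recommended fix.

```python
from pathlib import Path

# Same package-0 RAPL domain as in the echo command above (path is an assumption
# for other machines).
RAPL_PKG = Path("/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0")


def set_rapl_enabled(enabled, domain=RAPL_PKG):
    """Write 1 or 0 to the domain's 'enabled' file and return the previous state."""
    flag = Path(domain) / "enabled"
    previous = flag.read_text().strip() == "1"
    flag.write_text("1" if enabled else "0")
    return previous
```

A test run could call `previous = set_rapl_enabled(False)` before the stress test and `set_rapl_enabled(previous)` afterwards to restore the original setting.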

@VitaliiSerdiuk

VitaliiSerdiuk commented Jan 21, 2022

#318 (comment)
I found that comment. As I understand it, on some models the firmware should take care of throttling.
In my case, with the thermald service disabled, I get 2300 MHz under load, and it starts throttling to 400 MHz when I run a GPU-intensive task (Google Meet with visual effects). As soon as I stop the GPU-intensive task, the frequency goes back up to the 2300 MHz level.

With thermald.service 2.4.8 enabled, I get a stable 1500 MHz.

Current BIOS version: 1.15.1
Ubuntu 20.04 5.15.13 kernel

@egberts

egberts commented Feb 14, 2022

All commenters need to start mentioning their Linux kernel version. Many TOPower features were incorporated into the Linux kernel over revisions up to 5.15.

@AdamantGarth

> Dell is collecting information: https://www.dell.com/community/Latitude/Latitude-5420-7420-7520-CPU-Throttling-Issue-on-Linux/m-p/8129749/highlight/true#M39458

Sorry for being a bit off topic, but did Dell just silently delete this post? I see message not found when I go there and it vanished from my subscriptions in account settings.

@pjssilva

Same for me, the post vanished and is not appearing in the subscriptions in my account settings.

@MikP0

MikP0 commented Feb 14, 2022

That's ridiculous!! They also deleted a related post here:
https://www.dell.com/community/Linux-General/Ubuntu-external-display-extreme-lags/m-p/8086446

@Grtschnk

Grtschnk commented Feb 14, 2022

They could collect all our posts in one thread, or we all just create a new thread each 😇
Just asked support why the posts were deleted/vanished, there's a 1% chance we get a useful answer.

@AdamantGarth

Well, people recommending Lenovo on Dell's forum probably didn't help. Still, this is totally ridiculous. I'd get it if they just deleted the Lenovo promotion posts (which were totally justified, IMO), but nuking a whole conversation in which their own community manager asked the community for help, which the community happily provided, is just insane.

@Grtschnk

Grtschnk commented Feb 14, 2022

From Dell support:

Regarding the issue with the forum posts/threads regarding this issue that is deleted, We are working on it and would need some time.
We will get back to you in a few hours of time.

(Edit) And a second one:

Regarding the thread that was deleted, our internal team is working on it and would need some time.


@Grtschnk

Grtschnk commented Feb 14, 2022

> All commenters need to start mentioning their Linux kernel version. Many TOPower features were incorporated into the Linux kernel over revisions up to 5.15.

I am on 5.16.8, issue still persists. Seems Dell does not play nicely with Linux kernels and updates.

@0x501D

0x501D commented Feb 16, 2022

5.16.9, issue exists.

@JoshuaPK

5.16.10-1.el8.elrepo.x86_64 it still exists here as well. There's one other thing that just came to mind: the fan sensors don't work. In all of my previous Dell laptops, the system is able to read fan speed from lm_sensors. In this case the fan speed is not available. I wonder if the lack of fan speed data is causing thermald to make some assumptions that aren't correct.

@Silcet

Silcet commented Feb 25, 2022

I have a Dell Latitude 5420 running Ubuntu 20.04 with kernel 5.14.0-1024-oem and BIOS 1.15.1. Running stress, all cores start at 4.3 GHz and then stabilize at ~3.8 GHz until I hit exactly 5 minutes of stress testing. At this point, all cores go down to 2.5 GHz and stay at ~60°C.
Another interesting behavior is that when the same test is run on battery instead of AC power, the cores stabilize at ~3.3 GHz and 80°C, and after 5 minutes all cores start alternating between 2.3 GHz and 3.3 GHz.
I have been wondering about the issue, and I have been looking a lot at what @JoshuaPK points out about not being able to read the fan speed. I've tried all sorts of tools, even using random SMM codes with the dellfan utility, to try to read and control the fans, without luck.
I can use the laptop exclusively for testing, so please let me know if someone smarter than me knows what to test.
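To pin down when a frequency drop like the one at the 5-minute mark happens, the per-core frequencies can be sampled periodically and scanned for the first large fall from the running peak. The following is a minimal sketch, assuming the cpufreq sysfs files `cpuN/cpufreq/scaling_cur_freq` (values in kHz) are present. The function names and the 0.75 threshold are illustrative choices, not values taken from this thread.

```python
from pathlib import Path


def read_freqs_mhz(root="/sys/devices/system/cpu"):
    """Return the current frequency of each CPU in MHz (sysfs reports kHz)."""
    freqs = []
    for f in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        freqs.append(int(f.read_text().strip()) / 1000.0)
    return freqs


def first_throttle_index(avg_mhz_samples, drop_ratio=0.75):
    """Index of the first sample below drop_ratio * running peak, else None."""
    peak = 0.0
    for i, mhz in enumerate(avg_mhz_samples):
        peak = max(peak, mhz)
        if mhz < peak * drop_ratio:
            return i
    return None
```

Sampling the average of `read_freqs_mhz()` once per second during a stress run and passing the resulting list to `first_throttle_index` gives the number of seconds until the first significant drop, which can then be compared between AC and battery runs.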

@sebastianha

That is definitely different behavior than on my 7320. I can see similar behavior when using Windows: there, too, the CPU runs at top speed at high temperatures for a few minutes, then throttles and goes down to ~70°C.

@sebastianha

To clean up things I opened a new ticket only for the Latitude 7320 with all my current findings: #341
