[Nvidia Optimus] SOUND_POWER_SAVE_ON_AC=0 prevents power down of the GPU #495

Closed
marcmerlin opened this issue May 31, 2020 · 23 comments

@marcmerlin

marcmerlin commented May 31, 2020

I'm not sure if there is a way to auto detect or document this.

Problem:
My Lenovo ThinkPad P73 laptop has hybrid graphics, and I only use i915 to save battery. The nouveau driver does not yet properly support my chip for external displays, but it has to be loaded in order to power down the chip and save 40W.
With TLP, I started getting hard hangs when plugging power in.

Fix:

SOUND_POWER_SAVE_CONTROLLER=Y
SOUND_POWER_SAVE_ON_AC=1
SOUND_POWER_SAVE_ON_BAT=1

Symptom:

intel-lpss 0000:00:15.0: power state changed by ACPI to D3cold
intel-lpss 0000:00:15.1: power state changed by ACPI to D3cold
snd_hda_intel 0000:00:1f.3: PME# enabled
intel-lpss 0000:00:1e.0: power state changed by ACPI to D3cold
snd_hda_intel 0000:00:1f.3: power state changed by ACPI to D3hot
xhci_hcd 0000:01:00.2: PME# enabled
nvidia-gpu 0000:01:00.3: PME# enabled
pcieport 0000:05:00.0: PME# enabled
xhci_hcd 0000:2c:00.0: PME# enabled
pcieport 0000:05:02.0: PME# enabled
pcieport 0000:04:00.0: PME# enabled
pcieport 0000:00:1c.0: PME# enabled
pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
nouveau 0000:01:00.0: power state changed by ACPI to D3cold
pcieport 0000:00:01.0: PME# enabled
pcieport 0000:00:01.0: power state changed by ACPI to D3cold

(hang)

Rationale:
https://lists.freedesktop.org/archives/nouveau/2020-May/036016.html
https://lists.freedesktop.org/archives/nouveau/2020-May/036019.html
"note that TLP has a problem where it forces the audio
sub-function to always-on which prevents the GPU from suspending."
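
One way to observe the effect described in the quoted mail is to look at the runtime PM state of every function of the GPU's PCI device, including the audio function. The following is only a sketch: the helper name `gpu_pm_report` and its optional sysfs-root argument are made up for illustration, and the slot `0000:01:00` is taken from the log above and will differ on other machines.

```shell
#!/bin/sh
# Print power/control and power/runtime_status for each function of a PCI slot.
# gpu_pm_report is a hypothetical helper; the default slot and sysfs root
# are assumptions based on the log in this issue.
gpu_pm_report() {
    slot="${1:-0000:01:00}"
    root="${2:-/sys}"
    for dev in "$root"/bus/pci/devices/"$slot".*; do
        # Skip the literal pattern if nothing matched.
        [ -d "$dev" ] || continue
        control=$(cat "$dev/power/control" 2>/dev/null || echo "n/a")
        status=$(cat "$dev/power/runtime_status" 2>/dev/null || echo "n/a")
        printf '%s control=%s runtime_status=%s\n' "${dev##*/}" "$control" "$status"
    done
}

# Usage: gpu_pm_report            (defaults to slot 0000:01:00 under /sys)
```

If the audio function stays at `control=on` with `runtime_status=active`, that would be consistent with the GPU not suspending as described above.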

@karolherbst

karolherbst commented May 31, 2020

the hang was caused by a kernel bug we fixed in the meantime. I don't think that TLP was responsible for that. But I agree that on laptops those defaults need to change in order to power down GPUs even when the AC is plugged in.

@karolherbst

By the way: this is a recurring problem on our end and it wastes our time as we need to debug with users why their GPU is not suspending.

So please fix this.

@linrunner
Owner

SOUND_POWER_SAVE_ON_AC=0 (and splitting the setting between AC and BAT) was chosen as a compromise because of the sound disturbances that many users had when sound power save was active for both AC and BAT.

So the simple solution of changing the default would annoy other users.

@karolherbst : what are the kernel defaults (without TLP) for systems with nouveau?

/sys/bus/pci/devices/0000:01:00.0/power/control = auto

for the nvidia card and

/sys/module/snd_hda_intel/parameters/power_save = 1
/sys/module/snd_hda_intel/parameters/power_save_controller = Y

for the audio?

I'm thinking about blacklisting, i.e. not touching audio power save at all when nouveau is detected.
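
For anyone who wants to answer the audio part of this question on their own machine, the two module parameters can be read directly from sysfs. A minimal sketch, assuming the snd_hda_intel module is loaded; the helper name `hda_power_save` and the optional sysfs-root argument are hypothetical and only exist for this example.

```shell
#!/bin/sh
# Print the snd_hda_intel power-save parameters named in the comment above.
# hda_power_save is a hypothetical helper; the sysfs root is overridable.
hda_power_save() {
    root="${1:-/sys}"
    for p in power_save power_save_controller; do
        f="$root/module/snd_hda_intel/parameters/$p"
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$p" "$(cat "$f")"
        else
            # Module not loaded, or parameter not exposed on this kernel.
            printf '%s = (not available)\n' "$p"
        fi
    done
}

# Usage: hda_power_save
```

Running it once with TLP disabled and once with TLP enabled should show which values come from the kernel and which from TLP.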

@marcmerlin
Author

marcmerlin commented May 31, 2020

@linrunner I obviously don't know all the configurations and tradeoffs like you do, but 2 ideas:

  1. optionally default to the safe mode, the one that might make your sound pop, but better than crashing other laptops. It's also easier to google 'sound skip/pop' and find the sound setting than to have your laptop hard deadlock, maybe without any logs you can read, and google that it was related to a sound setting.
  2. add a big warning on the config file that either the setting can cause crashes with at least nvidia GPUs (which is totally not something I'd have figured out if the nouveau folks hadn't told me, and I guess they're tired of having to tell everyone :) ), or it can cause pops/skips.

@karolherbst

karolherbst commented Jun 1, 2020

"@karolherbst : what are the kernel defaults (without TLP) for systems with nouveau?"

by default everything suspends.

@karolherbst

well.. to be more correct here, it depends on the kernel build, but distributions tend to build with SND_HDA_POWER_SAVE_DEFAULT=y these days.
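
To check what a given kernel was built with, the option can be looked up in the kernel config file. This is a sketch under assumptions: the helper name `hda_power_save_default` is invented here, and the path `/boot/config-$(uname -r)` is where several distributions keep the config but is not guaranteed to exist. As far as I know the option holds a timeout in seconds rather than y/n, with 0 meaning power saving is off by default.

```shell
#!/bin/sh
# Print the value of CONFIG_SND_HDA_POWER_SAVE_DEFAULT from a kernel config file.
# hda_power_save_default is a hypothetical helper; the default path is an
# assumption that holds on many, but not all, distributions.
hda_power_save_default() {
    config="${1:-/boot/config-$(uname -r)}"
    # Strip the option name and print only what follows the "=".
    sed -n 's/^CONFIG_SND_HDA_POWER_SAVE_DEFAULT=//p' "$config"
}

# Usage: hda_power_save_default
```

An empty result means the option is not set in that file.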

@linrunner
Owner

linrunner commented Jun 2, 2020

@karolherbst

"by default everything suspends."

This is probably the reason why owners of Nvidia Optimus laptops are so enthusiastic about the power consumption ;)

Just to be sure (as I don't have access to Optimus hardware): the nouveau kernel driver sets control = auto for the PCIe device?

@linrunner
Owner

linrunner commented Jun 2, 2020

@marcmerlin : the real tradeoff is that your charger is severely underspecced and this could very well be the reason for the crashes. You have to come to terms with the consequences of your free decision. Enough said about that (including #494).

But the core problem here is that the Nvidia doesn't suspend when audio power save is off.

Could you do me a favour?

  1. Disable TLP via configuring

    TLP_ENABLE=0

  2. Make sure it stays that way by disabling your nifty scripting logic.

  3. Reboot once on AC and on BAT and show the full output of

    tlp-stat

for both via https://gist.github.com/
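
The capture step above can be scripted so that each output lands in a file named after the current power source. A rough sketch, with assumptions stated: the helper name `tlp_stat_outfile` is made up, and the power supply name `AC` in the default path varies between machines (other names are common), so the path may need adjusting.

```shell
#!/bin/sh
# Pick an output file name depending on whether the AC adapter is online.
# tlp_stat_outfile is a hypothetical helper; the default sysfs path assumes
# the adapter is called "AC", which is not true on every laptop.
tlp_stat_outfile() {
    online_file="${1:-/sys/class/power_supply/AC/online}"
    if [ "$(cat "$online_file" 2>/dev/null)" = "1" ]; then
        echo ac.txt
    else
        echo bat.txt
    fi
}

# Usage (run as root for the full report):
#   sudo tlp-stat > "$(tlp_stat_outfile)"
```

The resulting files can then be uploaded as a gist.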

@linrunner changed the title from "documentation/default change: turning off sound power save causes hard hang with nouveau driver" to "[Nvidia Optimus] SOUND_POWER_SAVE_ON_AC=0 prevents autosuspend of the GPU" on Jun 2, 2020
@linrunner changed the title from "[Nvidia Optimus] SOUND_POWER_SAVE_ON_AC=0 prevents autosuspend of the GPU" to "[Nvidia Optimus] SOUND_POWER_SAVE_ON_AC=0 prevents power down of the GPU" on Jun 2, 2020
@marcmerlin
Author

@linrunner to be clear, I have no more crashes since I applied the sound suspend fix, so honestly all is good on my side.
As for my charger being underspecced, I'm using the 230W charger at my desk, and had the crash when using it. The charging hack I did is for when I travel, which right now is not so much.
Given that, did you still need that data from me?

@linrunner
Owner

Yes, please. And thanks for the clarification.

@karolherbst

karolherbst commented Jun 3, 2020

"Just to be sure (as I don't have access to Optimus hardware): the nouveau kernel driver sets control = auto for the PCIe device?"

yes

linrunner added a commit that referenced this issue Jun 7, 2020
Rationale:
* amdgpu, nvidia: blacklisting will actually prevent the GPU from suspending
  when unused, which is not what users expect
* pcieport: there have been no indications of problems recently

Note: nouveau selects 'auto' by default, so keep it blacklisted to
not interfere.

References:
* #488
* #498
* #495 (comment)
linrunner added a commit that referenced this issue Jun 7, 2020
Rationale: disabling sound power saving interferes with nouveau
(driver enables runtime pm by default) so that the GPU doesn't
suspend when idle.

Solution: default to SOUND_POWER_SAVE_ON_AC=1.

Discussion: consideration showed that excessive power consumption
of the GPU will annoy users of TLP much more than occasional sound
disturbances.

Reference:
* #495
@linrunner
Owner

@marcmerlin : I changed the default to SOUND_POWER_SAVE_ON_AC=1, thanks for your arguments. Because of @karolherbst's remark I'm waiving the warning about crashes.

@linrunner
Owner

@marcmerlin : I'm still longing for your outputs ...

@karolherbst

"@marcmerlin : I changed the default to SOUND_POWER_SAVE_ON_AC=1, thanks for your arguments. Because of @karolherbst's remark I'm waiving the warning about crashes."

I think it would make more sense to default to whatever distributions are doing here, as it seems there were some bugs in the past, and I'd feel better if the distribution's choice were honored by default.

On newer kernels it's probably not an issue. I'm thinking more about older Debian releases or something.

@marcmerlin
Author

@linrunner sorry for the delay. I have lots of work open on my laptop right now that I can't easily shut down. It will probably take a while before I can reboot, sorry. I will get this to you though.

@linrunner
Owner

@karolherbst : TLP's mission is to be more aggressive than the defensive kernel/distribution defaults. People have been telling me for 10 years that kernel optimizations will soon make TLP redundant; however, this has not yet happened ...

@marcmerlin : Don't stress yourself, please, it's not that important.

linrunner added a commit to linrunner/tlp-doc that referenced this issue Jun 9, 2020
* Hybrid graphics, moved from faq/graphics
* Limit CPU PD, moved from faq/graphics
* Remove refs to Bumblebee (way deprecated)

References:
* linrunner/TLP#488
* linrunner/TLP#495
* linrunner/TLP#498
@marcmerlin
Author

There you go, AC and BAT with TLP off, and then again after TLP was turned on. Hope this helps.
ac.txt
ac-enabled.txt
bat.txt
bat-enabled.txt

00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)

@linrunner
Owner

linrunner commented Jun 15, 2020

Now I'm confused. In your initial post you characterized your problem with:

"When the nouveau driver is loaded"

But your outputs show that no driver is loaded, which indicates that the proprietary nvidia driver is installed and "Intel (Power Saving Mode)" is selected in its tool:

/sys/bus/pci/devices/0000:01:00.0/power/control = on (0x030000, VGA compatible controller, no driver)

So I'm afraid the outputs (a) are unrelated to your problem and (b) don't help me to check the runtime pm default of the nouveau driver on your hardware.
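
Independently of tlp-stat, the driver currently bound to a PCI device can be read from the `driver` symlink in sysfs. A minimal sketch: the helper name `pci_driver` and its optional sysfs-root argument are hypothetical, and the address `0000:01:00.0` is the one from this issue.

```shell
#!/bin/sh
# Print the name of the kernel driver bound to a PCI device, or "no driver".
# pci_driver is a hypothetical helper; device address and sysfs root are
# assumptions taken from the outputs in this issue.
pci_driver() {
    dev="${1:-0000:01:00.0}"
    root="${2:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink points at the driver directory; keep its last component.
        basename "$(readlink "$link")"
    else
        echo "no driver"
    fi
}

# Usage: pci_driver 0000:01:00.0
```

A result of "no driver" here only says that nothing is bound at that moment; it does not say which driver packages are installed.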

@marcmerlin
Copy link
Author

marcmerlin commented Jun 15, 2020

@linrunner I can assure you that the binary nvidia driver was never installed on that machine:

sauron:~# find /lib/modules | grep nvidia
/lib/modules/4.19.0-5-amd64/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.19.0-5-amd64/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/net/ethernet/nvidia
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.5.11-amd64-preempt-sysrq-20190816/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/net/ethernet/nvidia
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/video/fbdev/nvidia
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko

The binary driver, if present, would be here:

/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/updates/dkms/nvidia.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/updates/dkms/nvidia-uvm.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/updates/dkms/nvidia-modeset.ko
/lib/modules/5.6.15-amd64-preempt-sysrq-20190817/updates/dkms/nvidia-drm.ko

(that's another machine)

I do have nouveau installed and loaded, as it is required to turn the nvidia chip off:

sauron:~# lsmod | grep -E '(nvidia|nouveau)'
nouveau              1945600  1
mxm_wmi                16384  1 nouveau
ttm                   102400  1 nouveau
nvidiafb               53248  0
vgastate               16384  1 nvidiafb
fb_ddc                 16384  1 nvidiafb
hwmon                  32768  3 coretemp,thinkpad_acpi,nouveau
i2c_nvidia_gpu         16384  0
wmi                    32768  4 intel_wmi_thunderbolt,wmi_bmof,mxm_wmi,nouveau

@linrunner
Owner

Thanks for the clarification and your patience. Then I'll have to get hold of similar hardware to check why tlp-stat doesn't display the driver.

@marcmerlin
Author

So, one thing I did notice is that the nouveau driver hadn't loaded by itself after boot, not entirely sure why. Maybe it's because I booted on battery (as per your request). I did manually insert the driver while X was running and got this:
stat3.txt

Maybe I can/should force the loading of that driver in /etc/modules; not sure why it stopped autoloading this time.

@linrunner
Owner

/sys/bus/pci/devices/0000:01:00.0/power/control = auto (0x030000, VGA compatible controller, nouveau)

OK, that's the expected result.

linrunner added a commit to linrunner/tlp-doc that referenced this issue Oct 25, 2020
linrunner added a commit to linrunner/tlp-doc that referenced this issue Oct 25, 2020
@linrunner
Owner

FAQ done.
