Lemp9 freezes, requires hard reboot after upgrading to 5.11.0-7612.13~1617215757~20.04~97a8d1a #45

favadi · 2021-04-03T14:44:28Z

OS: Ubuntu 20.04.2 LTS

Kernel: 5.11.0-7612.13~1617215757~20.04~97a8d1a

Hardware: Lemur Pro 9
Processor: i7-10510U

The system randomly freezes and requires a hard reboot after a few minutes. Ubuntu's official 5.8 kernel works without any issue.

bflanagin · 2021-04-05T15:22:59Z

@favadi: I'm not seeing any instability here on my Lemur Pro. Would you mind giving us a bit more info?

Bluetooth devices:
USB Devices:
Non-default gnome extensions:
Any software or service you might be running:

favadi · 2021-04-05T15:53:55Z

Hi @bflanagin,

Bluetooth devices

none

USB Devices

none

Non-default gnome extensions

none

Any software or service you might be running

Nothing special, I can reproduce the issue with nothing running.

@bflanagin I can send you my system/dmesg log via email if you want.

bflanagin · 2021-04-05T16:18:25Z

@favadi: I might not be the only one looking into the problem, so please run sudo dmesg > dmesg.txt and then drag and drop the file into your next reply. That way we can get more eyes on it than by sending things via email.

favadi · 2021-04-05T16:28:15Z

@bflanagin you are right, this is the dmesg log as requested. Let me know if there is anything else I can provide.

dmesg.txt

bflanagin · 2021-04-05T18:41:15Z

Hmm, the only difference between our two machines seems to be that you're using Docker. Could you get me your installed package lists and add them to this issue? Maybe one of your packages is causing the problem. (If we can narrow it down, we can backport a newer version or report it to the right repo.)

deb list:
dpkg -l > debs.txt

flatpak:
flatpak list > flatpak.txt

Also, let's get your docker images too.

docker image ls > docker.txt

favadi · 2021-04-06T07:45:57Z

@bflanagin here is the requested information.

deb list:

debs.txt

flatpak

I don't have flatpak installed.

docker images

I can't share the image list as it contains some private information. But I did change the docker data dir to an empty one and restarted, and the issue still happens.

For completeness, I uploaded the list of installed snap packages, though I'm not sure it is relevant.

snap.txt

favadi · 2021-04-07T09:08:43Z

In an attempt to gather more information, I booted my laptop with the 5.11 kernel, ran sudo dmesg -w in a terminal, and captured a screenshot when the freeze happened.

Let me know if it is helpful, and sorry for the dust.

al12gamer · 2021-04-12T16:56:23Z

A darp7 running 5.11.0-7612.13-generic is having issues shutting down completely after this update as well.

bflanagin · 2021-04-12T17:31:53Z

@favadi Thanks for the information, and sorry for the delay. I think you're right about the snaps not being an issue, nothing really there to cause the problem.

I've run my lemp9 under medium load for 48 hours straight without issues. I know it will probably put a crimp in your ability to work, but let's disable Docker for a moment to see what happens.

$ sudo systemctl disable docker.service
$ sudo systemctl disable docker.socket

After rebooting go about your business and let it run, do non-docker tasks, etc. and report back.

You can re-enable them with these commands once the test is complete:

$ sudo systemctl enable docker.service
$ sudo systemctl enable docker.socket

And reboot.

Also, could you reply with the make and model of your drive(s)? That might also have something to do with the instability, as we have seen in other issues. You can get them in various ways, but the easiest is to use the Disks utility UI.
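
For anyone who prefers a terminal to the Disks utility, here is a minimal sketch that reads drive model strings from sysfs. It assumes a Linux system that exposes /sys/block/<device>/device/model; the function name list_drive_models is made up for illustration and is not part of any System76 tool.

```shell
#!/bin/sh
# Print "<device>: <model>" for every block device that exposes a model string.
# $1: sysfs block directory (defaults to /sys/block); the parameter only exists
#     so the function can be pointed at a test directory.
list_drive_models() {
    block_dir="${1:-/sys/block}"
    for dev in "$block_dir"/*; do
        # Skip devices without a model file (loop devices, RAM disks, ...).
        [ -r "$dev/device/model" ] || continue
        # Model strings are often padded with trailing spaces; trim them.
        model=$(sed 's/[[:space:]]*$//' "$dev/device/model")
        printf '%s: %s\n' "${dev##*/}" "$model"
    done
}
```

Calling list_drive_models with no argument reads the live /sys/block tree.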

bflanagin · 2021-04-12T18:36:16Z

@al12gamer: We've been looking into that issue as well. Would you mind opening a separate issue that states the problem you're having? That way we can keep things organized.

jacobgkau · 2021-04-12T20:41:45Z

@bflanagin @al12gamer There is already an issue open for the power-off issue here: #41

favadi · 2021-04-13T05:14:49Z

@bflanagin I disabled docker.service and it doesn't make any difference: the system still works fine with kernel 5.8 and randomly hangs with 5.11.

The installed drive in my lemp9 is Samsung SSD 970 EVO Plus 250GB.

bflanagin · 2021-04-13T16:51:53Z

@favadi: Gah! I can't believe I missed this before. What firmware are you running on your Lemp9?
I'm running 2021-03-11-50eedc2

Edit: disregard I can see it in the screenshot. You're running the same firmware as I am.

favadi · 2021-04-14T00:08:38Z

> What firmware are you running on your Lemp9?
> I'm running 2021-03-11-50eedc2

@bflanagin I'm running the same version of firmware.

kanru · 2021-04-14T12:40:05Z

Could you try adding intel_idle.max_cstate=4 to the kernel parameters?

I reported a similar issue here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=981807

favadi · 2021-04-15T02:16:31Z

@kanru Hey, thanks for the suggestion. I added the kernel parameter and my machine has been working fine for a few hours already. This raises another question: why does it only happen on my machine but not on @bflanagin's?

bflanagin · 2021-04-15T13:32:57Z

@favadi, if it makes you feel better you're not alone. The tech-support team has been working on the problem with other lemp9 owners.

Please let us know how the computer holds up for the rest of the day.

favadi · 2021-04-15T14:42:59Z

> Please let us know how the computer holds up for the rest of the day.

My computer has been running without any problem for a whole working day.

bflanagin · 2021-04-15T14:58:20Z

Hurray for time differences! Time to test it against my machine.

kanru · 2021-04-15T16:02:48Z

Copying relevant info from my bug report against Debian kernel here

I think I found the root cause. My CPU model is Intel(R) Core(TM)
i5-10210U CPU @ 1.60GHz. On Linux 5.9 the deepest C-state it can enter is
C6, but since 5.10 it will happily go to C8. After entering a deep C-state,
all sorts of weird memory corruption and lockups start to show
up. Testing with boot parameter intel_idle.max_cstate=4 it seems the
system becomes stable again and c-state is capped at C6. C7 is also
not safe.

I bisected and it appears this commit caused the symptom.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0268eed10f12f785a618880920d90ee306fb2a50

I guess that previously the RTS522A PCI Express Card Reader was preventing the system
from going into deeper C-states, and this change fixed that but revealed other
problems. Reverting the commit fixes the problem by keeping the C-state at C6.

@favadi I'm not sure why only some machines are affected. My lemp9 has the same problem.
I guess that if some process or setting (performance mode?) prevents the CPU from entering deep
sleep states, the problem will not appear.

jkibele · 2021-04-16T03:41:20Z

I also just started having complete catastrophic freezes on my Lemur Pro with kernel 5.11.0.7612.13~1617215757~20.04~97a8d1a. I couldn't even get past the encryption password dialog without freezing. Since booting back into 5.8, everything seems fine. I opened a ticket with System76, but figured I'd throw in a "yep, me too" here as well.

bflanagin · 2021-04-16T04:02:10Z

@jkibele: Try the fix that @kanru suggested, as it's the best option we have at the moment. If we get enough positive results, we'll push it to everyone.

As to why it affects some and not others, we're still determining that. In testing, my vanilla install with the 5.11 update has no issues on my Lemur Pro and works as well as it did on 5.8. However, when I install the applications that @favadi has on his machine, I start experiencing some of the same symptoms, though not as severe.

jkibele · 2021-04-16T04:12:11Z

Thanks for the quick reply @bflanagin. I'm happy to give that fix a try, but I'm pretty clueless about the inner workings of linux, so I'll need a little guidance. Can you point me toward some instructions on how to set intel_idle.max_cstate=4 as a kernel parameter?

favadi · 2021-04-16T05:39:23Z

@jkibele you could boot into kernel 5.8 and follow the instructions in this article to add intel_idle.max_cstate=4 to the GRUB_CMDLINE_LINUX_DEFAULT configuration.
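
As a rough illustration of the GRUB route, the sketch below appends a parameter to GRUB_CMDLINE_LINUX_DEFAULT. It assumes the variable sits on a single double-quoted line, as in a stock Ubuntu /etc/default/grub; the helper name add_grub_param is hypothetical and is not taken from the linked article.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB defaults file.
# $1: path to the defaults file (on Ubuntu this would be /etc/default/grub)
# $2: parameter to add, e.g. intel_idle.max_cstate=4
# Assumption: the variable is defined on one line and its value is double-quoted.
add_grub_param() {
    grub_file="$1"
    param="$2"
    # Do nothing if the parameter text already appears on that line
    # (a plain substring check, so the function is safe to run twice).
    grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file" | grep -qF -- "$param" && return 0
    # Keep a .bak copy next to the file, then insert the parameter
    # just before the closing quote of the value.
    sed -i.bak 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"/\1 '"$param"'"/' "$grub_file"
}
```

On a real system the function would need root privileges to edit /etc/default/grub, followed by sudo update-grub and a reboot for the change to take effect.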

bflanagin · 2021-04-16T11:44:38Z

@jkibele If you are using Ubuntu, then do what favadi suggests. On Pop!_OS the command is
sudo kernelstub --add-options intel_idle.max_cstate=4 and then reboot.

You can also use -a in place of --add-options if you prefer to save a few key presses.
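
After rebooting, it can be worth confirming that the running kernel actually received the option. The sketch below is one way to do that by reading the kernel command line; the function name max_cstate_from_cmdline is invented for this example, and the optional file argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# Print the value of intel_idle.max_cstate from a kernel command line.
# $1: file holding the command line (defaults to /proc/cmdline).
# Returns non-zero when the parameter is not present.
max_cstate_from_cmdline() {
    # Put each parameter on its own line, keep only the value of the one we
    # care about, and report the last occurrence if it appears more than once.
    value=$(tr ' ' '\n' < "${1:-/proc/cmdline}" \
        | sed -n 's/^intel_idle\.max_cstate=//p' \
        | tail -n 1)
    [ -n "$value" ] && printf '%s\n' "$value"
}
```

Run without arguments, it prints the cap the current boot is using, or prints nothing and returns a non-zero status if the option was not applied.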

jkibele · 2021-04-16T16:17:02Z

Thanks @bflanagin and @favadi. I am on Pop, so I went with the kernelstub command. I've rebooted now with 5.11. Seems to be working fine for me now (though it's only been a few minutes). I'll let you know if anything changes.

Thanks again for your help. Please let me know if I'll need to unset the cstate option at some point down the road. (I've got a very shallow understanding of kernel guts)

arbitrary-dev · 2021-04-16T20:20:36Z

> Testing with boot parameter intel_idle.max_cstate=4 it seems the system becomes stable again and c-state is capped at C6.

~~Ain't with this setting the system is capped at C4 instead?~~

It seems to depend on the processor; my i7-1165G7 has only these:

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

So probably I should go with intel_idle.max_cstate=3 instead.

Interestingly, intel_idle reports higher available C-state:

$ cat /sys/module/intel_idle/parameters/max_cstate
9

kanru · 2021-04-16T22:33:03Z

@arbitrary-dev This is a little off topic, but no, the number is not the n of the Cn state. It is the index into the table of states supported by your CPU. On many CPUs, C3 is followed by C6.

I'd use powertop to check which C-states are enabled. Note that if you disable deeper C-states, they aren't enumerated in sysfs.
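
To make the index-versus-name distinction visible, here is a small sketch that prints each state's index next to its name and entry count from sysfs. It assumes the usual cpuidle layout (stateN/name and stateN/usage under the CPU's cpuidle directory) and that the sysfs index lines up with the index described above; list_cstates is a made-up helper name.

```shell
#!/bin/sh
# Print "<index> <name> <usage>" for each idle state exposed for one CPU.
# $1: cpuidle directory (defaults to cpu0's); overridable for testing.
list_cstates() {
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$cpuidle_dir"/state*; do
        # Skip anything that is not a readable state entry.
        [ -r "$state/name" ] || continue
        index="${state##*/state}"
        name=$(cat "$state/name")
        # "usage" counts how often the state was entered; print "?" if unreadable.
        usage=$(cat "$state/usage" 2>/dev/null || echo '?')
        printf '%s %s %s\n' "$index" "$name" "$usage"
    done
}
```

A state whose usage count keeps growing between two runs is one the CPU is actually entering, which helps when checking whether a cap took effect.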

arbitrary-dev · 2021-04-17T16:43:33Z

I have intel_idle.max_cstate=3, but my lemp10 still went unresponsive after some time lying unattended with the lid closed.

stickr · 2021-09-12T19:52:52Z

Before reloading Pop!_OS 21.04 I decided to execute the command from the following comment:

> @jkibele If you are using Ubuntu, then do what favadi suggests. On Pop!_OS the command is
> sudo kernelstub --add-options intel_idle.max_cstate=4 and then reboot.
>
> You can also use -a in place of --add-options if you prefer to save a few key presses.

I've been successfully running the latest kernel 5.11.0-7633.

garybowers · 2021-09-17T06:06:26Z

My Lemp9 has once again started freezing with fans running after running the firmware update and an apt-get upgrade to 5.13 but booting into 5.11 .... this machine is now unusable. I raised a ticket with System76, but my trust in this system is deteriorating fast.

leviport · 2021-09-17T21:00:06Z

> apt-get upgrade to 5.13 but booting into 5.11

For clarification: are you experiencing the same trouble while running 5.13?

stickr · 2021-09-17T21:27:30Z

> My Lemp9 has once again started freezing with fans running after running the firmware update and an apt-get upgrade to 5.13 but booting into 5.11 .... this machine is now unusable. I raised a ticket with System76, but my trust in this system is deteriorating fast.

Same here. I updated to 5.13, removed the intel_idle.max_cstate=4 kernel flag, and rebooted; it froze shortly after each test reboot. I added the kernel flag back and it stopped freezing.

garybowers · 2021-09-18T06:00:40Z

> > apt-get upgrade to 5.13 but booting into 5.11
>
> For clarification: are you experiencing the same trouble while running 5.13?

Both, also get freezing in the recovery image.

stickr · 2021-09-18T12:39:03Z

> > > apt-get upgrade to 5.13 but booting into 5.11
> >
> > For clarification: are you experiencing the same trouble while running 5.13?
>
> Both, also get freezing in the recovery image.

Same with mine: 5.11, 5.13, and the recovery image. Setting the cstate kernel flag prevents the lockup, but I'm not sure that's a long-term solution.

garybowers · 2021-09-22T07:00:41Z

I've raised a support ticket. System76 wants me to RMA the machine; I'm not sure what that will do, but we'll see.

varac · 2021-10-20T07:13:58Z

I can confirm that intel_idle.max_cstate=4 also fixed my lemp9 freezing shortly after login.
@system76: Is there a follow-up issue for fixing the root cause of this bug, so that this kernel boot argument is not needed in the long run? Limiting sleep states is not a desirable long-term state, but for the meantime I'm happy that my laptop is usable again.

jacobgkau · 2021-11-02T14:52:42Z

> my trackpad is rattly and broken and screws missing from the bottom of the case

These sound like physical problems that are unrelated to this kernel issue. Please bring this up in your support case, the support team would be happy to send out replacement screws and address any other problems you're still having.

jacobgkau · 2021-11-02T17:57:39Z

> Indeed, I was highlighting the poor QA all around and what a bad experience it is, yet these issues are getting closed when first party hardware still is having issues with the software.

Please keep items 1 and 4 of the Pop!_OS Code of Conduct in mind: 1. Be considerate and 4. Assume good intentions. Everyone here is willing to help you, and your feedback regarding QC procedures (a different thing from QA) is appreciated, but we cannot fix a rattling trackpad or missing screws via a GitHub issue. A support case is the appropriate place for this, a development bug tracker is not.

hyperair · 2021-12-06T08:52:30Z

New datapoint to add: I just upgraded my lemp9 (24GB) firmware to 2021-07-20_93c2809 about a week ago, and the freezes started happening, usually after around 20 hours of uptime. I'm not sure which firmware version I had before, but it's probably either 2021-01-19_d6de7c0 or 2021-03-11_50eedc2.

My kernel version was 5.13.4 at the time of the firmware upgrade, and upgrading to 5.15.4 didn't fix the issue.

I usually use my laptop plugged into a USB-C PD external monitor, but it also froze once while completely disconnected from peripherals and power supplies. The freezing mostly happens when the machine is idle, and it didn't seem to matter whether the screen was on or off. Using intel_idle.max_cstate=4 seems to have stopped the freezing. I haven't tried running it without the 16GB RAM stick yet, though.

jacobgkau · 2021-12-06T15:26:43Z

> Using intel_idle.max_cstate=4 seems to have stopped the freezing from happening.

pop-os/system76-driver#217 should have applied that option automatically if you have the System76 Driver app installed; I'm assuming you're running a non-Pop!_OS and non-Ubuntu-based distribution without the driver app installed. I would recommend keeping that boot parameter in place since it stops the freezing.

garybowers · 2021-12-06T17:07:48Z

The intel_idle.max_cstate=4 option is a workaround, not a solution. It doesn't fix freezes when booting from the rescue image or the ISO.

hyperair · 2021-12-11T14:34:48Z

@jacobgkau Yep, I'm running an old installation of Ubuntu 21.04 with a custom kernel, and I didn't install the system76-driver package.

Btw, an update -- I tried booting without my 16GB RAM stick and then again with the RAM reinstalled. 31 hours so far without intel_idle.max_cstate=4, and it hasn't hung. Looks like removing and reinstalling the RAM stick really does help.

t-lock · 2021-12-26T11:30:59Z

Hi all. I was experiencing this issue on my lemp9 (i7-10510U with 32 GB additional RAM) immediately after updating to 21.04 (or was it the firmware update? I did one right before the other). Memtest revealed no issues. I contacted support and they told me to remove my RAM temporarily. I literally just didn't have a small screwdriver around, so I ended up finding this GitHub thread instead and setting intel_idle.max_cstate=4. This is a work machine, so I just figured "set it and forget it".

Until now. I want to upgrade to 21.10, but I want to make sure I'm up-to-date on this issue before I make any more changes under the hood.

Could anyone expand on what that setting actually does (I don't notice any performance difference except no more freezing), and whether or not it is suitable to leave it this way indefinitely? Or should I remove it / change the value then do the upgrade to 21.10?

Thanks in advance!!

Also, probably wrong place to ask, but I just checked to see if I have a firmware update available as well, and I see "Managed Firmware Unavailable. No devices supporting automatic firmware updates detected". This doesn't make much sense, seeing as this is a System76 machine...

kanru · 2021-12-26T15:21:20Z

> Could anyone expand on what that setting actually does (I don't notice any performance difference except no more freezing), and whether or not it is suitable to leave it this way indefinitely? Or should I remove it / change the value then do the upgrade to 21.10?

The setting prevents the CPU from entering deeper idle state. You might see slightly more power consumption while on battery (it's probably negligible and you won't notice it.) It's safe to leave the setting on for normal use or during upgrade.

tvilot · 2022-02-01T04:43:16Z

I had the same problem. I've been using the intel_idle.max_cstate=4 argument for all kernels.

I just tried removing my RAM temporarily, per the S76 support suggestion for @t-lock, and everything seems to work. My issue was that the machine would never even boot without the intel_idle.max_cstate=4 argument on any kernel above 5.8.0. I'd see the system boot and go through a partial log of events and then ... poof ... no way to the login screen.

Let's see if this lasts!

varac · 2022-11-28T12:04:35Z

What's the current state of this issue? I've been using intel_idle.max_cstate=4 as a kernel parameter on Pop!_OS for a long time now, and it works fine (lemp9).
However, I tried a few live CDs (grml, Arch Linux) and most of them failed to boot properly without this kernel parameter, which is annoying because with live CDs you have to enter the kernel parameter manually on every boot.

13r0ck · 2022-11-28T14:22:34Z

@varac I believe this issue has been resolved? Also, Pop!_OS is no longer on the 5.11 kernel. Please update your system, and if the issue persists, please contact support at https://support.system76.com/articles/before-you-open-a-support-ticket/ since you own a System76 machine.

thomas-zimmerman · 2022-11-28T18:41:02Z

@13r0ck I'm pretty sure this is still an issue with the kernel. We set this option from 'system76-driver' for the 'lemp9': https://github.com/pop-os/system76-driver/blob/70c25a75d7f6b1291f4d025da13db785dfd56bad/system76driver/products.py#L484

kanru · 2022-11-30T13:53:33Z

@varac Have you tried the "remove one RAM stick temporarily" approach to see if it helps?

varac · 2022-12-01T07:40:11Z

@kanru I'm sorry but I don't have much time for extensive debugging right now, will report back once I get to it.

varac · 2023-05-26T14:13:24Z

Wow - removing the additional RAM stick worked. Thanks @kanru! This bug has been super annoying over the last 2 years; glad I finally found at least the cause.

[ Upstream commit d527f51 ] There is a UAF when xfstests on cifs: BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160 Read of size 4 at addr ffff88810103fc08 by task cifsd/923 CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45 ... Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report+0x171/0x472 kasan_report+0xad/0x130 kasan_check_range+0x145/0x1a0 smb2_is_network_name_deleted+0x27/0x160 cifs_demultiplex_thread.cold+0x172/0x5a4 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 923: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x54/0x60 kmem_cache_alloc+0x147/0x320 mempool_alloc+0xe1/0x260 cifs_small_buf_get+0x24/0x60 allocate_buffers+0xa1/0x1c0 cifs_demultiplex_thread+0x199/0x10d0 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 Freed by task 921: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x143/0x1b0 kmem_cache_free+0xe3/0x4d0 cifs_small_buf_release+0x29/0x90 SMB2_negotiate+0x8b7/0x1c60 smb2_negotiate+0x51/0x70 cifs_negotiate_protocol+0xf0/0x160 cifs_get_smb_ses+0x5fa/0x13c0 mount_get_conns+0x7a/0x750 cifs_mount+0x103/0xd00 cifs_smb3_do_mount+0x1dd/0xcb0 smb3_get_tree+0x1d5/0x300 vfs_get_tree+0x41/0xf0 path_mount+0x9b3/0xdd0 __x64_sys_mount+0x190/0x1d0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The UAF is because: mount(pid: 921) | cifsd(pid: 923) -------------------------------|------------------------------- | cifs_demultiplex_thread SMB2_negotiate | cifs_send_recv | compound_send_recv | smb_send_rqst | wait_for_response | wait_event_state [1] | | standard_receive3 | cifs_handle_standard | handle_mid | mid->resp_buf = buf; [2] | dequeue_mid [3] KILL the process [4] | resp_iov[i].iov_base = buf | free_rsp_buf [5] | | is_network_name_deleted [6] | callback 1. After send request to server, wait the response until mid->mid_state != SUBMITTED; 2. Receive response from server, and set it to mid; 3. 
Set the mid state to RECEIVED; 4. Kill the process, the mid state already RECEIVED, get 0; 5. Handle and release the negotiate response; 6. UAF. It can be easily reproduce with add some delay in [3] - [6]. Only sync call has the problem since async call's callback is executed in cifsd process. Add an extra state to mark the mid state to READY before wakeup the waitter, then it can get the resp safely. Fixes: ec637e3 ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages") Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 9fce92f upstream. After the blamed commit below, the TCP sockets (and the MPTCP subflows) can build egress packets larger than 64K. That exceeds the maximum DSS data size, the length being misrepresent on the wire and the stream being corrupted, as later observed on the receiver: WARNING: CPU: 0 PID: 9696 at net/mptcp/protocol.c:705 __mptcp_move_skbs_from_subflow+0x2604/0x26e0 CPU: 0 PID: 9696 Comm: syz-executor.7 Not tainted 6.6.0-rc5-gcd8bdf563d46 #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'. RIP: 0010:__mptcp_move_skbs_from_subflow+0x2604/0x26e0 net/mptcp/protocol.c:705 RSP: 0018:ffffc90000006e80 EFLAGS: 00010246 RAX: ffffffff83e9f674 RBX: ffff88802f45d870 RCX: ffff888102ad0000 netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'. RDX: 0000000080000303 RSI: 0000000000013908 RDI: 0000000000003908 RBP: ffffc90000007110 R08: ffffffff83e9e078 R09: 1ffff1100e548c8a R10: dffffc0000000000 R11: ffffed100e548c8b R12: 0000000000013908 R13: dffffc0000000000 R14: 0000000000003908 R15: 000000000031cf29 FS: 00007f239c47e700(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f239c45cd78 CR3: 000000006a66c006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: <IRQ> mptcp_data_ready+0x263/0xac0 net/mptcp/protocol.c:819 subflow_data_ready+0x268/0x6d0 net/mptcp/subflow.c:1409 tcp_data_queue+0x21a1/0x7a60 net/ipv4/tcp_input.c:5151 tcp_rcv_established+0x950/0x1d90 net/ipv4/tcp_input.c:6098 tcp_v6_do_rcv+0x554/0x12f0 net/ipv6/tcp_ipv6.c:1483 tcp_v6_rcv+0x2e26/0x3810 net/ipv6/tcp_ipv6.c:1749 ip6_protocol_deliver_rcu+0xd6b/0x1ae0 net/ipv6/ip6_input.c:438 ip6_input+0x1c5/0x470 net/ipv6/ip6_input.c:483 ipv6_rcv+0xef/0x2c0 
include/linux/netfilter.h:304 __netif_receive_skb+0x1ea/0x6a0 net/core/dev.c:5532 process_backlog+0x353/0x660 net/core/dev.c:5974 __napi_poll+0xc6/0x5a0 net/core/dev.c:6536 net_rx_action+0x6a0/0xfd0 net/core/dev.c:6603 __do_softirq+0x184/0x524 kernel/softirq.c:553 do_softirq+0xdd/0x130 kernel/softirq.c:454 Address the issue explicitly bounding the maximum GSO size to what MPTCP actually allows. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: multipath-tcp/mptcp_net-next#450 Fixes: 7c4e983 ("net: allow gso_max_size to exceed 65536") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Link: https://lore.kernel.org/r/20231114-upstream-net-20231113-mptcp-misc-fixes-6-7-rc2-v1-1-7b9cd6a7b7f4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 829e0c9 upstream. There is another found exception that the "timerlat/1" thread was scheduled on CPU0, and lead to timer corruption finally: ``` ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220 WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0 Modules linked in: CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:debug_print_object+0x7d/0xb0 ... Call Trace: <TASK> ? __warn+0x7c/0x110 ? debug_print_object+0x7d/0xb0 ? report_bug+0xf1/0x1d0 ? prb_read_valid+0x17/0x20 ? handle_bug+0x3f/0x70 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? debug_print_object+0x7d/0xb0 ? debug_print_object+0x7d/0xb0 ? __pfx_timerlat_irq+0x10/0x10 __debug_object_init+0x110/0x150 hrtimer_init+0x1d/0x60 timerlat_main+0xab/0x2d0 ? __pfx_timerlat_main+0x10/0x10 kthread+0xb7/0xe0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ``` After tracing the scheduling event, it was discovered that the migration of the "timerlat/1" thread was performed during thread creation. Further analysis confirmed that it is because the CPU online processing for osnoise is implemented through workers, which is asynchronous with the offline processing. When the worker was scheduled to create a thread, the CPU may has already been removed from the cpu_online_mask during the offline process, resulting in the inability to select the right CPU: T1 | T2 [CPUHP_ONLINE] | cpu_device_down() osnoise_hotplug_workfn() | | cpus_write_lock() | takedown_cpu(1) | cpus_write_unlock() [CPUHP_OFFLINE] | cpus_read_lock() | start_kthread(1) | cpus_read_unlock() | To fix this, skip online processing if the CPU is already offline. 
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e2 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

bflanagin mentioned this issue Apr 15, 2021

System blanked and required hard reboot pop-os/pop#1638

Open

jackpot51 mentioned this issue Sep 29, 2021

lemp9: Add intel_idle.max_cstate=4 kernel parameter to fix freezes pop-os/system76-driver#217

Merged

jackpot51 closed this as completed in pop-os/system76-driver#217 Sep 29, 2021
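
The change merged in pop-os/system76-driver#217 adds the `intel_idle.max_cstate=4` kernel parameter on the lemp9. To confirm that the parameter is active on a running system, something like the following sketch could be used. It assumes the parameter appears in `/proc/cmdline` as a plain `name=value` word; `kernel_param` is a hypothetical helper name, not part of system76-driver, and it does not handle quoted values that contain spaces.

```shell
#!/bin/sh
# Print the value of a kernel command-line parameter; return 1 if it is absent.
# Usage: kernel_param NAME [CMDLINE]
# CMDLINE defaults to the contents of /proc/cmdline (hypothetical helper, for illustration).
kernel_param() {
    name=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    # Relies on word splitting: each space-separated token is one parameter.
    for word in $cmdline; do
        case $word in
            "$name"=*)
                # Strip everything up to and including the first '='.
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Report whether the parameter is in effect for the current boot.
value=$(kernel_param intel_idle.max_cstate) || value=
if [ "$value" = 4 ]; then
    echo "intel_idle.max_cstate=4 is set on this boot"
else
    echo "intel_idle.max_cstate=4 is not on the kernel command line (found: '${value:-none}')"
fi
```

If the parameter is missing after updating system76-driver, a reboot is needed before it can appear, since the kernel command line is only read at boot.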
