Lemp9 freezes, requires hard reboot after upgrading to 5.11.0-7612.13~1617215757~20.04~97a8d1a #45

Closed
favadi opened this issue Apr 3, 2021 · 132 comments · Fixed by pop-os/system76-driver#217

@favadi

favadi commented Apr 3, 2021

OS: Ubuntu 20.04.2 LTS

Kernel: 5.11.0-7612.13~1617215757~20.04~97a8d1a

Hardware: Lemur Pro 9
Processor: i7-10510U

The system randomly freezes and requires a hard reboot after a few minutes. Ubuntu's official 5.8 kernel works without any issues.

@bflanagin

@favadi: I'm not seeing any instability here on my Lemur Pro. Would you mind giving us a bit more info?

Bluetooth devices:
USB Devices:
Non-default gnome extensions:
Any software or service you might be running:

@favadi
Author

favadi commented Apr 5, 2021

Hi @bflanagin,

Bluetooth devices

none

USB Devices

none

Non-default gnome extensions

none

Any software or service you might be running

Nothing special, I can reproduce the issue with nothing running.

@bflanagin I can send you my system/dmesg log via email if you want.

@bflanagin

@favadi: I might not be the only one looking into the problem, so if you can run sudo dmesg > dmesg.txt and drag and drop the file into your next reply, we can get more eyes on it than by sending things via email.

@favadi
Author

favadi commented Apr 5, 2021

@bflanagin you are right, this is the dmesg log as requested. Let me know if there is anything else I can provide.

dmesg.txt

@bflanagin

Hmm, the only difference between our two machines seems to be that you're using Docker. Could you get me your installed package lists and add them to this issue? Maybe one of your packages is causing the problem. (If we can narrow it down, we can backport a newer version or report it to the right repo.)

deb list:
dpkg -l > debs.txt

flatpak:
flatpak list > flatpak.txt

Also, let's get your Docker images too.

docker image ls > docker.txt
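
The three requests above can be gathered in one pass. The following is only a sketch: the helper name collect and the output file names are made up for illustration, and a tool that is not installed (flatpak, for example) is skipped rather than treated as an error.

```shell
# Run a command and save its output, skipping tools that are not installed.
# $1 = output file, remaining arguments = the command to run.
collect() {
    out=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" 2>&1 || echo "warning: $1 exited non-zero" >&2
    else
        echo "skipping $1 (not installed)" >&2
    fi
}

# Gather the lists requested in this thread into one directory.
# $1 = target directory (hypothetical default name).
collect_report() {
    dir=${1:-./lemp9-report}
    mkdir -p "$dir"
    collect "$dir/debs.txt"    dpkg -l
    collect "$dir/flatpak.txt" flatpak list
    collect "$dir/docker.txt"  docker image ls
}
```

Calling collect_report with no arguments writes the files to ./lemp9-report; review them for private information before attaching them.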

@favadi
Author

favadi commented Apr 6, 2021

@bflanagin here is the requested information.

deb list:

debs.txt

flatpak

I don't have flatpak installed.

docker images

I can't share the image list as it contains some private information. But I did change the Docker data dir to an empty one and restarted, and the issue still happens.

For completeness, I uploaded the list of installed snap packages, though I'm not sure it is relevant.

snap.txt

@favadi
Author

favadi commented Apr 7, 2021

In an attempt to gather more information, I booted my laptop using the 5.11 kernel, ran sudo dmesg -w in a terminal and captured a screenshot when the freeze happened.

IMG_20210407_160413

Let me know if it is helpful, and sorry for the dust.
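
When the screen cannot be photographed in time, the kernel log of the boot that froze can sometimes be recovered after the hard reboot. This is a sketch under the assumption that journald keeps a persistent journal on the machine (a hard freeze may still lose the last few lines); the function names and the list of search patterns are illustrative, not an exhaustive set of kernel error markers.

```shell
# Save the kernel messages of the previous boot to a file.
# Assumes systemd-journald with persistent storage; $1 = output file.
save_previous_boot_log() {
    journalctl -k -b -1 --no-pager > "$1"
}

# Print lines (with line numbers) that commonly accompany lockups.
# $1 = a saved dmesg or journal file.
suspicious_lines() {
    grep -inE 'BUG:|Oops|panic|soft lockup|hard lockup|hung task|Machine check|Hardware Error' "$1"
}
```

For example, save_previous_boot_log prev-boot.txt followed by suspicious_lines prev-boot.txt narrows a long log down to the lines worth attaching.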

@al12gamer

A darp7 running 5.11.0-7612.13-generic is having issues shutting down completely after this update as well.

@bflanagin

@favadi Thanks for the information, and sorry for the delay. I think you're right about the snaps not being an issue, nothing really there to cause the problem.

I've run my lemp9 under medium load for 48 hours straight without issues. I know it will probably put a crimp in your ability to work, but let's disable Docker for a moment to see what happens.

$ sudo systemctl disable docker.service
$ sudo systemctl disable docker.socket

After rebooting go about your business and let it run, do non-docker tasks, etc. and report back.

You can re-enable them with these commands once the test is complete:

$ sudo systemctl enable docker.service
$ sudo systemctl enable docker.socket

And reboot.

Also, could you reply with the make and model of your drive(s)? That might also have something to do with the instability, as we have seen in other issues. You can get them in various ways, but the easiest is to use the Disks utility UI.
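
For anyone who prefers a terminal to the Disks UI, the model strings can also be read from sysfs. A minimal sketch, assuming the drives expose a device/model attribute under /sys/block (not every device type does); list_drive_models is a made-up helper name, and its optional argument exists only so it can be pointed at a test directory.

```shell
# Print "device: model" for each block device that exposes a model string.
# $1 = sysfs root, defaults to /sys.
list_drive_models() {
    root=${1:-/sys}
    for f in "$root"/block/*/device/model; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/block/}
        dev=${dev%%/*}
        # The model string may be padded with spaces; trim both ends.
        model=$(sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$f")
        printf '%s: %s\n' "$dev" "$model"
    done
}
```

lsblk -d -o NAME,MODEL,SIZE should give similar information where lsblk is installed.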

@bflanagin

@al12gamer: We've been looking into that issue as well. Would you mind opening a separate issue stating the problem you're having? That way we can keep things organized.

@jacobgkau
Member

@bflanagin @al12gamer There is already an issue open for the power-off issue here: #41

@favadi
Author

favadi commented Apr 13, 2021

@bflanagin I disabled docker.service and it doesn't make any difference: the system still works fine with kernel 5.8 and randomly hangs with 5.11.

The installed drive in my lemp9 is Samsung SSD 970 EVO Plus 250GB.

@bflanagin

bflanagin commented Apr 13, 2021

@favadi: Gah! I can't believe I missed this before. What firmware are you running on your Lemp9?
I'm running 2021-03-11-50eedc2

Edit: Disregard, I can see it in the screenshot. You're running the same firmware as I am.

@favadi
Author

favadi commented Apr 14, 2021

What firmware are you running on your Lemp9?
I'm running 2021-03-11-50eedc2

@bflanagin I'm running the same version of firmware.

@kanru

kanru commented Apr 14, 2021

Could you try adding intel_idle.max_cstate=4 to the kernel parameters?

I reported a similar issue here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=981807

@favadi
Author

favadi commented Apr 15, 2021

@kanru Hey, thanks for the suggestion. I added the kernel parameter and my machine has been working fine for a few hours already. This raises another question: why does it only happen on my machine but not on @bflanagin's?

@bflanagin

@favadi, if it makes you feel better you're not alone. The tech-support team has been working on the problem with other lemp9 owners.

Please let us know how the computer holds up for the rest of the day.

@favadi
Author

favadi commented Apr 15, 2021

Please let us know how the computer holds up for the rest of the day.

My computer has been running without any problem for a whole working day.

@bflanagin

Hurray for time differences! Time to test it against my machine.

@kanru

kanru commented Apr 15, 2021

Copying relevant info from my bug report against Debian kernel here

I think I found the root cause. My CPU model is Intel(R) Core(TM)
i5-10210U CPU @ 1.60GHz. On Linux 5.9 the deepest C-state it can enter is
C6, but from 5.10 on it will happily go to C8. After entering a deep C-state,
all sorts of weird memory corruption and lockups start to show
up. Testing with boot parameter intel_idle.max_cstate=4 it seems the
system becomes stable again and c-state is capped at C6. C7 is also
not safe.

I bisected and it appears this commit caused the symptom.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0268eed10f12f785a618880920d90ee306fb2a50

I guess that before, the RTS522A PCI Express Card Reader was preventing the system
from going into deeper C-states, and this change fixed that but revealed other
problems. Reverting the commit fixes the problem by keeping the C-state at C6.

@favadi I'm not sure why only some machines are affected. My lemp9 has the same problem.
I guess if there is some process or setting (performance mode?) that prevents the CPU from entering deep
sleep states, then the problem will not appear.
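
To try a similar cap on a running system without rebooting, the cpuidle sysfs interface has a per-state disable file. The sketch below is not confirmed anywhere in this thread: it assumes that the sysfs state index lines up with the count used by intel_idle.max_cstate, cap_cpuidle_states is a made-up name, and the setting is lost on reboot.

```shell
# Disable every cpuidle state whose sysfs index is greater than $1.
# $2 = sysfs CPU root, defaults to /sys/devices/system/cpu.
# Needs root on a real system.
cap_cpuidle_states() {
    max_index=$1
    root=${2:-/sys/devices/system/cpu}
    for d in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -w "$d/disable" ] || continue
        idx=${d##*/state}
        if [ "$idx" -gt "$max_index" ]; then
            echo 1 > "$d/disable"   # 1 = the governor may not pick this state
        fi
    done
}
```

From a root shell, cap_cpuidle_states 4 leaves state0 through state4 enabled on every CPU and disables anything deeper.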

@jkibele

jkibele commented Apr 16, 2021

I also just started having complete catastrophic freezes on my Lemur Pro with kernel: 5.11.0.7612.13~1617215757~20.04~97a8d1a. I couldn't even get past the encryption password dialog without freezing. Since booting back into 5.8, everything seems fine. I opened a ticket with System76, but figured I'd throw in a "yep, me too" here as well.

@bflanagin

@jkibele: Try the fix that @kanru suggested, as it's the best option we have at the moment. If we get enough positive results, we'll push it to everyone.

As to why it affects some and not others, we're still determining that. In testing, my vanilla install with the 5.11 update has no issues on my Lemur Pro and works as well as it did with 5.8. However, when I install the applications that @favadi has on his machine, I start experiencing some of the same symptoms, though not as severe.

@jkibele

jkibele commented Apr 16, 2021

Thanks for the quick reply @bflanagin. I'm happy to give that fix a try, but I'm pretty clueless about the inner workings of linux, so I'll need a little guidance. Can you point me toward some instructions on how to set intel_idle.max_cstate=4 as a kernel parameter?

@favadi
Author

favadi commented Apr 16, 2021

@jkibele you could boot into kernel 5.8 and follow the instructions in this article to add intel_idle.max_cstate=4 to the GRUB_CMDLINE_LINUX_DEFAULT configuration.
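
The linked article is not reproduced here. As a rough sketch of that kind of edit on a GRUB-based Ubuntu install, assuming the setting lives in /etc/default/grub as a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line: add_grub_param is a made-up helper that appends the parameter once and keeps a .bak copy of the file.

```shell
# Append a parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1 = parameter, $2 = grub defaults file (defaults to /etc/default/grub).
add_grub_param() {
    param=$1
    file=${2:-/etc/default/grub}
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    cp "$file" "$file.bak"
    # Insert the parameter just before the closing double quote.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file.bak" > "$file"
}
```

On a real system this would be run as root with add_grub_param intel_idle.max_cstate=4, followed by sudo update-grub and a reboot.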

@bflanagin

bflanagin commented Apr 16, 2021

@jkibele If you are using Ubuntu, then do what favadi suggests. On Pop!_OS the command is
sudo kernelstub --add-options intel_idle.max_cstate=4 and then reboot.

You can also use -a in place of --add-options if you prefer to save a few key presses.

@jkibele

jkibele commented Apr 16, 2021

Thanks @bflanagin and @favadi. I am on Pop, so I went with the kernelstub command. I've rebooted now with 5.11. Seems to be working fine for me now (though it's only been a few minutes). I'll let you know if anything changes.

Thanks again for your help. Please let me know if I'll need to unset the cstate option at some point down the road. (I've got a very shallow understanding of kernel guts)
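
To confirm that the option is really active after the reboot, the running kernel's command line can be checked. A small sketch: has_kernel_param is a made-up helper, and the optional second argument exists only so it can read a test file instead of /proc/cmdline.

```shell
# Succeed if the parameter appears as a whole word on the kernel command line.
# $1 = parameter, $2 = file to read (defaults to /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage: has_kernel_param intel_idle.max_cstate=4 && echo "cap is active". For unsetting the option later, kernelstub's help lists --delete-options (-d) as the counterpart of --add-options; it is worth checking kernelstub --help on your version before relying on that.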

@arbitrary-dev

arbitrary-dev commented Apr 16, 2021

Testing with boot parameter intel_idle.max_cstate=4 it seems the system becomes stable again and c-state is capped at C6.

With this setting, isn't the system capped at C4 instead?

Seems like it depends on the processor; my i7-1165G7 has only these:

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

So probably I should go with intel_idle.max_cstate=3 instead.

Interestingly, intel_idle reports higher available C-state:

$ cat /sys/module/intel_idle/parameters/max_cstate
9

@kanru

kanru commented Apr 16, 2021

@arbitrary-dev a little off topic, but no, the number is not the n of the Cn state. It is the index into the table of states supported by your CPU. On many CPUs, C3 is followed by C6.

I'd use powertop to check the enabled C-states. Note that if you disable deeper C-states, they aren't enumerated in sysfs.
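
Because the parameter is an index rather than a C-state number, it can help to print the index next to each state name. This sketch reads the same sysfs directory used in the grep above; list_cstates is a made-up name and the optional argument is only there for testing against a fake directory.

```shell
# Print "index name disabled-flag" for each idle state of one CPU,
# sorted by index. $1 = cpuidle directory (defaults to cpu0's).
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$dir"/state[0-9]*; do
        [ -r "$d/name" ] || continue
        printf '%s %s %s\n' "${d##*/state}" "$(cat "$d/name")" "$(cat "$d/disable" 2>/dev/null)"
    done | sort -n
}
```

Comparing the output before and after setting the boot parameter shows which states are no longer offered to the CPU.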

@arbitrary-dev

I have intel_idle.max_cstate=3, but my lemp10 still went unresponsive after lying unattended for some time with the lid closed.

@stickr

stickr commented Sep 12, 2021

Before reloading Pop!_OS 21.04 I decided to execute the command from the following comment:

@jkibele If you are using Ubuntu then what favadi suggests. On Pop!_OS the command is
sudo kernelstub --add-options intel_idle.max_cstate=4 and then reboot.

You can also use -a in place of --add-options if you prefer to save a few key presses.

I've been successfully running the latest kernel 5.11.0-7633.

@garybowers

garybowers commented Sep 17, 2021

My Lemp9 has once again started freezing with fans running after running the firmware update and an apt-get upgrade to 5.13 but booting into 5.11 .... This machine is now unusable. I raised a ticket with System76, but my trust in this system is deteriorating fast.

@leviport
Member

apt-get upgrade to 5.13 but booting into 5.11

For clarification: are you experiencing the same trouble while running 5.13?

@stickr

stickr commented Sep 17, 2021

My Lemp9 has once again started freezing with fans running after running the firmware update and an apt-get upgrade to 5.13 but booting into 5.11 .... This machine is now unusable. I raised a ticket with System76, but my trust in this system is deteriorating fast.

Same here. I updated to 5.13, removed the intel_idle.max_cstate=4 kernel flag and rebooted, and it froze shortly after each test reboot. I added the kernel flag back and it stopped freezing.

@garybowers

apt-get upgrade to 5.13 but booting into 5.11

For clarification: are you experiencing the same trouble while running 5.13?

Both, also get freezing in the recovery image.

@stickr

stickr commented Sep 18, 2021

apt-get upgrade to 5.13 but booting into 5.11

For clarification: are you experiencing the same trouble while running 5.13?

Both, also get freezing in the recovery image.

Same with mine: 5.11, 5.13, and the recovery image. Setting the cstate kernel flag prevents the lockup, but I'm not sure that's a long-term solution.

@garybowers

I've raised a support ticket. System76 wants me to RMA the machine; I'm not sure what that will do, but we'll see.

@varac

varac commented Oct 20, 2021

I can confirm that intel_idle.max_cstate=4 also fixed my lemp9 freezing shortly after login.
@system76: Is there a follow-up issue for fixing the root cause of this bug, so that this kernel boot arg is not needed in the long run? Limiting sleep states is not a desirable long-term state, but in the meantime I'm happy with my laptop being usable again.

@jacobgkau
Member

my trackpad is rattly and broken and screws missing from the bottom of the case

These sound like physical problems that are unrelated to this kernel issue. Please bring this up in your support case, the support team would be happy to send out replacement screws and address any other problems you're still having.

@jacobgkau
Member

Indeed, I was highlighting the poor QA all around and what a bad experience it is, yet these issues are getting closed when first party hardware still is having issues with the software.

Please keep items 1 and 4 of the Pop!_OS Code of Conduct in mind: 1. Be considerate and 4. Assume good intentions. Everyone here is willing to help you, and your feedback regarding QC procedures (a different thing from QA) is appreciated, but we cannot fix a rattling trackpad or missing screws via a GitHub issue. A support case is the appropriate place for this, a development bug tracker is not.

@hyperair

hyperair commented Dec 6, 2021

New datapoint to add: I just upgraded my lemp9 (24GB) firmware to 2021-07-20_93c2809 about a week ago, and the freezes started happening, usually after around 20 hours of uptime. I'm not sure which firmware version I had before, but it's probably either 2021-01-19_d6de7c0 or 2021-03-11_50eedc2.

My kernel version was 5.13.4 at the time of the firmware upgrade, and upgrading to 5.15.4 didn't fix the issue.

I usually use my laptop plugged into a USB-C PD external monitor, but it also froze once while completely disconnected from peripherals and power supplies. The freezing mostly happens when the machine is idle, and it didn't seem to matter whether the screen was on or off. Using intel_idle.max_cstate=4 seems to have stopped the freezing from happening. I haven't tried running it without the 16GB RAM stick yet though.

@jacobgkau
Member

jacobgkau commented Dec 6, 2021

Using intel_idle.max_cstate=4 seems to have stopped the freezing from happening.

pop-os/system76-driver#217 should have applied that option automatically if you have the System76 Driver app installed; I'm assuming you're running a non-Pop!_OS and non-Ubuntu-based distribution without the driver app installed. I would recommend keeping that boot parameter in place since it stops the freezing.

@garybowers

The intel_idle.max_cstate=4 option is a workaround, not a solution: it doesn't fix freezes when booting from the rescue image or the ISO.

@hyperair

@jacobgkau Yep, I'm running an old installation of Ubuntu 21.04 with a custom kernel, and I didn't install the system76-driver package.

Btw, an update -- I tried booting without my 16GB RAM stick and then again with the RAM reinstalled. 31 hours so far without intel_idle.max_cstate=4, and it hasn't hung. Looks like removing and reinstalling the RAM stick really does help.

@t-lock

t-lock commented Dec 26, 2021

Hi all. I was experiencing this issue on my lemp9 (i7-10510U with 32 GB additional RAM) immediately after updating to 21.04 (or was it the firmware update? I did one right before the other). Memtest revealed no issues. I contacted support and they told me to remove my RAM temporarily. I literally just didn't have a small screwdriver around, so I ended up finding this GitHub thread instead and setting intel_idle.max_cstate=4. This is a work machine, so I just figured "set it and forget it".

Until now. I want to upgrade to 21.10, but I want to make sure I'm up-to-date on this issue before I make any more changes under the hood.

Could anyone expand on what that setting actually does (I don't notice any performance difference except no more freezing), and whether or not it is suitable to leave it this way indefinitely? Or should I remove it / change the value then do the upgrade to 21.10?

Thanks in advance!!

Also, probably wrong place to ask, but I just checked to see if I have a firmware update available as well, and I see "Managed Firmware Unavailable. No devices supporting automatic firmware updates detected". This doesn't make much sense, seeing as this is a System76 machine...

@kanru

kanru commented Dec 26, 2021

Could anyone expand on what that setting actually does (I don't notice any performance difference except no more freezing), and whether or not it is suitable to leave it this way indefinitely? Or should I remove it / change the value then do the upgrade to 21.10?

The setting prevents the CPU from entering deeper idle states. You might see slightly higher power consumption while on battery (it's probably negligible and you won't notice it). It's safe to leave the setting on for normal use or during an upgrade.
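
Anyone curious about the battery cost can compare the idle power draw with and without the parameter. This is a sketch with assumptions: the battery is taken to be named BAT0 and to expose a power_now file in microwatts, which is common but not universal (some batteries only report current_now and voltage_now). battery_watts is a made-up helper name.

```shell
# Print the current battery power draw in watts.
# $1 = power_now file in microwatts (the default path is an assumption).
battery_watts() {
    awk '{ printf "%.2f\n", $1 / 1000000 }' "${1:-/sys/class/power_supply/BAT0/power_now}"
}
```

Run it a few times on battery with the machine idle, once per configuration, and compare the readings.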

@tvilot

tvilot commented Feb 1, 2022

I had the same problem. I've been using the intel_idle.max_cstate=4 argument for all kernels.

I just tried removing my RAM temporarily, per the S76 support suggestion for @t-lock, and everything seems to work. My issue was that the machine would never even boot without the intel_idle.max_cstate=4 argument with any kernel above 5.8.0. I'd see the system boot and go through a partial log of events and then ... poof ... no way to the login screen.

Let's see if this lasts!

13r0ck pushed a commit that referenced this issue Aug 31, 2022
commit 59c026c upstream.

When use 'echo c > /proc/sysrq-trigger' to trigger kdump, riscv_crash_save_regs()
will be called to save regs for vmcore, we found "epc" value 00ffffffa5537400
is not a valid kernel virtual address, but is a user virtual address. Other
regs(eg, ra, sp, gp...) are correct kernel virtual address.
Actually 0x00ffffffb0dd9400 is the user mode PC of 'PID: 113 Comm: sh', which
is saved in the task's stack.

[   21.201701] CPU: 0 PID: 113 Comm: sh Kdump: loaded Not tainted 5.18.9 #45
[   21.201979] Hardware name: riscv-virtio,qemu (DT)
[   21.202160] epc : 00ffffffa5537400 ra : ffffffff80088640 sp : ff20000010333b90
[   21.202435]  gp : ffffffff810dde38 tp : ff6000000226c200 t0 : ffffffff8032be7c
[   21.202707]  t1 : 0720072007200720 t2 : 30203a7375746174 s0 : ff20000010333cf0
[   21.202973]  s1 : 0000000000000000 a0 : ff20000010333b98 a1 : 0000000000000001
[   21.203243]  a2 : 0000000000000010 a3 : 0000000000000000 a4 : 28c8f0aeffea4e00
[   21.203519]  a5 : 28c8f0aeffea4e00 a6 : 0000000000000009 a7 : ffffffff8035c9b8
[   21.203794]  s2 : ffffffff810df0a8 s3 : ffffffff810df718 s4 : ff20000010333b98
[   21.204062]  s5 : 0000000000000000 s6 : 0000000000000007 s7 : ffffffff80c4a468
[   21.204331]  s8 : 00ffffffef451410 s9 : 0000000000000007 s10: 00aaaaaac0510700
[   21.204606]  s11: 0000000000000001 t3 : ff60000001218f00 t4 : ff60000001218f00
[   21.204876]  t5 : ff60000001218000 t6 : ff200000103338b8
[   21.205079] status: 0000000200000020 badaddr: 0000000000000000 cause: 0000000000000008

With the incorrect PC, the backtrace showed by crash tool as below, the first
stack frame is abnormal,

crash> bt
PID: 113      TASK: ff60000002269600  CPU: 0    COMMAND: "sh"
 #0 [ff2000001039bb90] __efistub_.Ldebug_info0 at 00ffffffa5537400 <-- Abnormal
 #1 [ff2000001039bcf0] panic at ffffffff806578ba
 #2 [ff2000001039bd50] sysrq_reset_seq_param_set at ffffffff8038c030
 #3 [ff2000001039bda0] __handle_sysrq at ffffffff8038c5f8
 #4 [ff2000001039be00] write_sysrq_trigger at ffffffff8038cad8
 #5 [ff2000001039be20] proc_reg_write at ffffffff801b7edc
 #6 [ff2000001039be40] vfs_write at ffffffff80152ba6
 #7 [ff2000001039be80] ksys_write at ffffffff80152ece
 #8 [ff2000001039bed0] sys_write at ffffffff80152f46

With the patch, we can get current kernel mode PC, the output as below,

[   17.607658] CPU: 0 PID: 113 Comm: sh Kdump: loaded Not tainted 5.18.9 #42
[   17.607937] Hardware name: riscv-virtio,qemu (DT)
[   17.608150] epc : ffffffff800078f8 ra : ffffffff8008862c sp : ff20000010333b90
[   17.608441]  gp : ffffffff810dde38 tp : ff6000000226c200 t0 : ffffffff8032be68
[   17.608741]  t1 : 0720072007200720 t2 : 666666666666663c s0 : ff20000010333cf0
[   17.609025]  s1 : 0000000000000000 a0 : ff20000010333b98 a1 : 0000000000000001
[   17.609320]  a2 : 0000000000000010 a3 : 0000000000000000 a4 : 0000000000000000
[   17.609601]  a5 : ff60000001c78000 a6 : 000000000000003c a7 : ffffffff8035c9a4
[   17.609894]  s2 : ffffffff810df0a8 s3 : ffffffff810df718 s4 : ff20000010333b98
[   17.610186]  s5 : 0000000000000000 s6 : 0000000000000007 s7 : ffffffff80c4a468
[   17.610469]  s8 : 00ffffffca281410 s9 : 0000000000000007 s10: 00aaaaaab5bb6700
[   17.610755]  s11: 0000000000000001 t3 : ff60000001218f00 t4 : ff60000001218f00
[   17.611041]  t5 : ff60000001218000 t6 : ff20000010333988
[   17.611255] status: 0000000200000020 badaddr: 0000000000000000 cause: 0000000000000008

With the correct PC, the backtrace showed by crash tool as below,

crash> bt
PID: 113      TASK: ff6000000226c200  CPU: 0    COMMAND: "sh"
 #0 [ff20000010333b90] riscv_crash_save_regs at ffffffff800078f8 <--- Normal
 #1 [ff20000010333cf0] panic at ffffffff806578c6
 #2 [ff20000010333d50] sysrq_reset_seq_param_set at ffffffff8038c03c
 #3 [ff20000010333da0] __handle_sysrq at ffffffff8038c604
 #4 [ff20000010333e00] write_sysrq_trigger at ffffffff8038cae4
 #5 [ff20000010333e20] proc_reg_write at ffffffff801b7ee8
 #6 [ff20000010333e40] vfs_write at ffffffff80152bb2
 #7 [ff20000010333e80] ksys_write at ffffffff80152eda
 #8 [ff20000010333ed0] sys_write at ffffffff80152f52

Fixes: e53d281 ("RISC-V: Add kdump support")
Co-developed-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220811074150.3020189-3-xianting.tian@linux.alibaba.com
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

@varac

varac commented Nov 28, 2022

What's the current state of this issue? I've been using intel_idle.max_cstate=4 as a kernel parameter for Pop!_OS for a long time now, and it works fine (lemp9).
However, I tried a few live CDs (grml, Arch Linux) and most of them failed to boot properly without this kernel parameter, which is annoying because for live CDs you have to enter that kernel parameter manually on every boot.

@13r0ck

13r0ck commented Nov 28, 2022

@varac I believe this issue has been resolved? Also, Pop!_OS is no longer on the 5.11 kernel. Please update your system, and if the issue persists, please contact support at https://support.system76.com/articles/before-you-open-a-support-ticket/ since you own a System76 machine.

@thomas-zimmerman

@13r0ck I'm pretty sure this is still an issue with the kernel. We set this option from 'system76-driver' for the 'lemp9': https://github.com/pop-os/system76-driver/blob/70c25a75d7f6b1291f4d025da13db785dfd56bad/system76driver/products.py#L484

@kanru

kanru commented Nov 30, 2022

@varac Have you tried the "remove one RAM stick temporarily" suggestion to see if it helps?

@varac

varac commented Dec 1, 2022

@kanru I'm sorry but I don't have much time for extensive debugging right now, will report back once I get to it.

13r0ck pushed a commit that referenced this issue Mar 13, 2023
commit f6044cc upstream.

When preparing an AER-CTR request, the driver copies the key provided by
the user into a data structure that is accessible by the firmware.
If the target device is QAT GEN4, the key size is rounded up by 16 since
a rounded up size is expected by the device.
If the key size is rounded up before the copy, the size used for copying
the key might be bigger than the size of the region containing the key,
causing an out-of-bounds read.

Fix by doing the copy first and then update the keylen.

This is to fix the following warning reported by KASAN:

	[  138.150574] BUG: KASAN: global-out-of-bounds in qat_alg_skcipher_init_com.isra.0+0x197/0x250 [intel_qat]
	[  138.150641] Read of size 32 at addr ffffffff88c402c0 by task cryptomgr_test/2340

	[  138.150651] CPU: 15 PID: 2340 Comm: cryptomgr_test Not tainted 6.2.0-rc1+ #45
	[  138.150659] Hardware name: Intel Corporation ArcherCity/ArcherCity, BIOS EGSDCRB1.86B.0087.D13.2208261706 08/26/2022
	[  138.150663] Call Trace:
	[  138.150668]  <TASK>
	[  138.150922]  kasan_check_range+0x13a/0x1c0
	[  138.150931]  memcpy+0x1f/0x60
	[  138.150940]  qat_alg_skcipher_init_com.isra.0+0x197/0x250 [intel_qat]
	[  138.151006]  qat_alg_skcipher_init_sessions+0xc1/0x240 [intel_qat]
	[  138.151073]  crypto_skcipher_setkey+0x82/0x160
	[  138.151085]  ? prepare_keybuf+0xa2/0xd0
	[  138.151095]  test_skcipher_vec_cfg+0x2b8/0x800

Fixes: 67916c9 ("crypto: qat - add AES-CTR support for QAT GEN4 devices")
Cc: <stable@vger.kernel.org>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Fiona Trahe <fiona.trahe@intel.com>
Reviewed-by: Vladis Dronov <vdronov@redhat.com>
Tested-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

@varac

varac commented May 26, 2023

Wow - removing the additional RAM stick worked. Thanks @kanru! This bug has been super annoying over the last 2 years; glad I finally found at least the cause.

mmstick pushed a commit that referenced this issue Oct 13, 2023
[ Upstream commit d527f51 ]

There is a UAF when xfstests on cifs:

  BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160
  Read of size 4 at addr ffff88810103fc08 by task cifsd/923

  CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_report+0x171/0x472
   kasan_report+0xad/0x130
   kasan_check_range+0x145/0x1a0
   smb2_is_network_name_deleted+0x27/0x160
   cifs_demultiplex_thread.cold+0x172/0x5a4
   kthread+0x165/0x1a0
   ret_from_fork+0x1f/0x30
   </TASK>

  Allocated by task 923:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   __kasan_slab_alloc+0x54/0x60
   kmem_cache_alloc+0x147/0x320
   mempool_alloc+0xe1/0x260
   cifs_small_buf_get+0x24/0x60
   allocate_buffers+0xa1/0x1c0
   cifs_demultiplex_thread+0x199/0x10d0
   kthread+0x165/0x1a0
   ret_from_fork+0x1f/0x30

  Freed by task 921:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   kasan_save_free_info+0x2a/0x40
   ____kasan_slab_free+0x143/0x1b0
   kmem_cache_free+0xe3/0x4d0
   cifs_small_buf_release+0x29/0x90
   SMB2_negotiate+0x8b7/0x1c60
   smb2_negotiate+0x51/0x70
   cifs_negotiate_protocol+0xf0/0x160
   cifs_get_smb_ses+0x5fa/0x13c0
   mount_get_conns+0x7a/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

The UAF is because:

 mount(pid: 921)               | cifsd(pid: 923)
-------------------------------|-------------------------------
                               | cifs_demultiplex_thread
SMB2_negotiate                 |
 cifs_send_recv                |
  compound_send_recv           |
   smb_send_rqst               |
    wait_for_response          |
     wait_event_state      [1] |
                               |  standard_receive3
                               |   cifs_handle_standard
                               |    handle_mid
                               |     mid->resp_buf = buf;  [2]
                               |     dequeue_mid           [3]
     KILL the process      [4] |
    resp_iov[i].iov_base = buf |
 free_rsp_buf              [5] |
                               |   is_network_name_deleted [6]
                               |   callback

1. After send request to server, wait the response until
    mid->mid_state != SUBMITTED;
2. Receive response from server, and set it to mid;
3. Set the mid state to RECEIVED;
4. Kill the process, the mid state already RECEIVED, get 0;
5. Handle and release the negotiate response;
6. UAF.

It can be easily reproduce with add some delay in [3] - [6].

Only sync call has the problem since async call's callback is
executed in cifsd process.

Add an extra state to mark the mid state to READY before wakeup the
waitter, then it can get the resp safely.

Fixes: ec637e3 ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

mmstick pushed a commit that referenced this issue Dec 11, 2023
commit 9fce92f upstream.

After the blamed commit below, the TCP sockets (and the MPTCP subflows)
can build egress packets larger than 64K. That exceeds the maximum DSS
data size, the length being misrepresent on the wire and the stream being
corrupted, as later observed on the receiver:

  WARNING: CPU: 0 PID: 9696 at net/mptcp/protocol.c:705 __mptcp_move_skbs_from_subflow+0x2604/0x26e0
  CPU: 0 PID: 9696 Comm: syz-executor.7 Not tainted 6.6.0-rc5-gcd8bdf563d46 #45
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
  netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'.
  RIP: 0010:__mptcp_move_skbs_from_subflow+0x2604/0x26e0 net/mptcp/protocol.c:705
  RSP: 0018:ffffc90000006e80 EFLAGS: 00010246
  RAX: ffffffff83e9f674 RBX: ffff88802f45d870 RCX: ffff888102ad0000
  netlink: 8 bytes leftover after parsing attributes in process `syz-executor.4'.
  RDX: 0000000080000303 RSI: 0000000000013908 RDI: 0000000000003908
  RBP: ffffc90000007110 R08: ffffffff83e9e078 R09: 1ffff1100e548c8a
  R10: dffffc0000000000 R11: ffffed100e548c8b R12: 0000000000013908
  R13: dffffc0000000000 R14: 0000000000003908 R15: 000000000031cf29
  FS:  00007f239c47e700(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f239c45cd78 CR3: 000000006a66c006 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
  PKRU: 55555554
  Call Trace:
   <IRQ>
   mptcp_data_ready+0x263/0xac0 net/mptcp/protocol.c:819
   subflow_data_ready+0x268/0x6d0 net/mptcp/subflow.c:1409
   tcp_data_queue+0x21a1/0x7a60 net/ipv4/tcp_input.c:5151
   tcp_rcv_established+0x950/0x1d90 net/ipv4/tcp_input.c:6098
   tcp_v6_do_rcv+0x554/0x12f0 net/ipv6/tcp_ipv6.c:1483
   tcp_v6_rcv+0x2e26/0x3810 net/ipv6/tcp_ipv6.c:1749
   ip6_protocol_deliver_rcu+0xd6b/0x1ae0 net/ipv6/ip6_input.c:438
   ip6_input+0x1c5/0x470 net/ipv6/ip6_input.c:483
   ipv6_rcv+0xef/0x2c0 include/linux/netfilter.h:304
   __netif_receive_skb+0x1ea/0x6a0 net/core/dev.c:5532
   process_backlog+0x353/0x660 net/core/dev.c:5974
   __napi_poll+0xc6/0x5a0 net/core/dev.c:6536
   net_rx_action+0x6a0/0xfd0 net/core/dev.c:6603
   __do_softirq+0x184/0x524 kernel/softirq.c:553
   do_softirq+0xdd/0x130 kernel/softirq.c:454

Address the issue explicitly bounding the maximum GSO size to what MPTCP
actually allows.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: multipath-tcp/mptcp_net-next#450
Fixes: 7c4e983 ("net: allow gso_max_size to exceed 65536")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20231114-upstream-net-20231113-mptcp-misc-fixes-6-7-rc2-v1-1-7b9cd6a7b7f4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mmstick pushed a commit that referenced this issue Oct 16, 2024
commit 829e0c9 upstream.

There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:

```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
 <TASK>
 ? __warn+0x7c/0x110
 ? debug_print_object+0x7d/0xb0
 ? report_bug+0xf1/0x1d0
 ? prb_read_valid+0x17/0x20
 ? handle_bug+0x3f/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? debug_print_object+0x7d/0xb0
 ? debug_print_object+0x7d/0xb0
 ? __pfx_timerlat_irq+0x10/0x10
 __debug_object_init+0x110/0x150
 hrtimer_init+0x1d/0x60
 timerlat_main+0xab/0x2d0
 ? __pfx_timerlat_main+0x10/0x10
 kthread+0xb7/0xe0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
```

After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:

T1                       | T2
[CPUHP_ONLINE]           | cpu_device_down()
osnoise_hotplug_workfn() |
                         |     cpus_write_lock()
                         |     takedown_cpu(1)
                         |     cpus_write_unlock()
[CPUHP_OFFLINE]          |
    cpus_read_lock()     |
    start_kthread(1)     |
    cpus_read_unlock()   |

To fix this, skip online processing if the CPU is already offline.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e2 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>