Lemp9 freezes, requires hard reboot after upgrading to 5.11.0-7612.13~1617215757~20.04~97a8d1a #45
Comments
@favadi: I'm not seeing any instability here on my lemur Pro. Would you mind giving us a bit more info? Bluetooth devices: |
Hi @bflanagin,
none
none
none
Nothing special, I can reproduce the issue with nothing running. @bflanagin I can send you my system/dmesg log via email if you want. |
@favadi: I might not be the only one to look into the problem so if you can run |
@bflanagin you are right, this is the dmesg log as requested. Let me know if there is anything else I can provide. |
Hmm, the only difference between our two machines seems to be that you're using Docker. Could you get me your installed package lists and add them to this issue? Maybe one of your packages is causing the problem. (If we can narrow it down, we can backport a newer version or report it to the right repo.) deb list: flatpak: Also, let's get your Docker images too.
|
@bflanagin here is the requested information.
I don't have flatpak installed.
I can't share the image list as it contains some private information. But I did change the Docker data dir to an empty one and restart, and the issue still happens. For completeness, I uploaded a list of installed snap packages, though I'm not sure it's relevant. |
A darp7 running 5.11.0-7612.13-generic is having issues shutting down completely after this update as well. |
@favadi Thanks for the information, and sorry for the delay. I think you're right about the snaps not being an issue; there's nothing really there to cause the problem. I've run my lemp9 under medium load for 48 hours straight without issues. I know it will probably put a crimp in your ability to work, but let's disable Docker for a moment to see what happens.
After rebooting, go about your business and let it run, do non-Docker tasks, etc., and report back. You can re-enable them with these commands once the test is complete: $ sudo systemctl enable docker.service And reboot. Also, if you could reply with the make and model of your drive(s), that might help, since the drive might have something to do with the instability, as we have seen in other issues. You can get them in various ways, but the easiest is to use the Disks utility UI |
@al12gamer: We've been looking into that issue as well. Would you mind putting in an issue directly stating the problem you're having. This way we can keep things organized. |
@bflanagin @al12gamer There is already an issue open for the power-off issue here: #41 |
@bflanagin I disabled docker.service and it doesn't make any difference: the system still works fine with kernel 5.8 and randomly hangs with 5.11. The installed drive in my lemp9 is a Samsung SSD 970 EVO Plus 250GB. |
@favadi: Gah! I can't believe I missed this before. What firmware are you running on your Lemp9? Edit: disregard I can see it in the screenshot. You're running the same firmware as I am. |
@bflanagin I'm running the same version of firmware. |
Could you try to add I reported a similar issue here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=981807 |
@kanru Hey, thanks for the suggestion. I added the kernel parameter and my machine has been working fine for a few hours already. This raises another question: why does it only happen on my machine but not on @bflanagin's machine? |
@favadi, if it makes you feel better you're not alone. The tech-support team has been working on the problem with other lemp9 owners. Please let us know how the computer holds up for the rest of the day. |
My computer has been running without any problem for a whole working day. |
Hurray for time differences! Time to test it against my machine. |
Copying relevant info from my bug report against Debian kernel here
@favadi I'm not sure why some machines are affected. My lemp9 has the same problem. |
I also just started having complete catastrophic freezes on my LemurPro with kernel: 5.11.0.7612.13~1617215757~20.04~97a8d1a I couldn't even get past the encryption password dialog without freezing. Since booting back into 5.8, everything seems fine. I opened a ticket with System76, but figured I'd throw in a "yep, me too" here as well. |
@jkibele: Try the fix that @kanru suggested, as it's the best option we have at the moment. If we get enough positive results we'll push it to everyone. As to why it affects some and not others, we're still determining that. In testing my vanilla install with the 5.11 update, my Lemur Pro has no issues and works as well as it did in 5.8. However, when I install the applications that @favadi has on his machine, I start experiencing some of the same symptoms, though not as severe. |
Thanks for the quick reply @bflanagin. I'm happy to give that fix a try, but I'm pretty clueless about the inner workings of linux, so I'll need a little guidance. Can you point me toward some instructions on how to set intel_idle.max_cstate=4 as a kernel parameter? |
@jkibele you could boot into kernel 5.8 and follow the instruction in this article to add |
@jkibele If you are using Ubuntu then what favadi suggests. On Pop!_OS the command is You can also use -a in place of --add-options if you prefer to save a few key presses. |
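For readers following along, here is a minimal sketch of the Pop!_OS route mentioned above, assuming `kernelstub` is installed and the parameter is the `intel_idle.max_cstate=4` workaround discussed in this thread. The `kernelstub` call is left as a comment because it needs root and a reboot; the helper name `has_kernel_param` is hypothetical and only checks whether a parameter appears on a kernel command line.

```shell
#!/bin/sh
# On Pop!_OS, add the workaround (run manually, needs root, then reboot):
#   sudo kernelstub --add-options "intel_idle.max_cstate=4"
# (-a is the short form of --add-options, as noted above.)

# Hypothetical helper: succeed if the parameter ($1) appears as a whole
# word on the kernel command line ($2, defaults to /proc/cmdline).
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# After rebooting, confirm that the running kernel was booted with it.
if has_kernel_param "intel_idle.max_cstate=4"; then
    echo "workaround active"
else
    echo "workaround not active"
fi
```

The check compares whole words, so a different value such as `intel_idle.max_cstate=40` is not mistaken for the workaround.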
Thanks @bflanagin and @favadi. I am on Pop, so I went with the Thanks again for your help. Please let me know if I'll need to unset the cstate option at some point down the road. (I've got a very shallow understanding of kernel guts) |
It seems to depend on the processor; my i7-1165G7 has only these:
So probably I should go with Interestingly,
|
@arbitrary-dev A little off topic, but no: the number is not the n of the Cn state. It is the index into the table of states supported by your CPU. On many CPUs, C3 is followed by C6. I'd use powertop to check which C-states are enabled. Note that if you disable deeper C-states, they aren't enumerated in sysfs. |
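To see how indexes map to state names on a given machine, something like the following sketch may help. It assumes the standard Linux cpuidle sysfs layout under `/sys/devices/system/cpu/cpu0/cpuidle`; the function name `list_cstates` is made up for illustration, and the names printed will differ per CPU and driver.

```shell
#!/bin/sh
# Hypothetical helper: print "<index> <name>" for each idle state that the
# kernel enumerates for one CPU. $1 overrides the sysfs directory.
list_cstates() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for dir in "$base"/state*; do
        # Skip the unexpanded glob when no state directories exist.
        [ -r "$dir/name" ] || continue
        printf '%s %s\n' "${dir##*/state}" "$(cat "$dir/name")"
    done
}

list_cstates
```

The first column is the table index and the second is the state name, which makes it easier to see that the two do not have to match. With ten or more states the glob sorts lexically, so `state10` would be listed before `state2`.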
I have |
Before reloading Pop!_OS 21.04 I decided to execute the command from the following comment:
I've been successfully running the latest kernel 5.11.0-7633. |
My Lemp9 has once again started freezing with fans running after running the firmware update and an apt-get upgrade to 5.13 but booting into 5.11 .... this machine is now unusable. I raised a ticket with System76, but my trust in this system is deteriorating fast. |
For clarification: are you experiencing the same trouble while running 5.13? |
Same here. I updated to 5.13 and removed the intel_idle.max_cstate=4 kernel flag and rebooted and it froze shortly after each test reboot. I added the kernel flag back and it stopped freezing. |
Both, also get freezing in the recovery image. |
Same with mine. 5.11, 5.13, and recovery image. Setting the cstate kernel flag prevents lockup, but not sure if that's a long term solution. |
I've raised a support ticket, System76 want me to RMA the machine, not sure what that will do but we can see. |
I can confirm that the |
These sound like physical problems that are unrelated to this kernel issue. Please bring this up in your support case, the support team would be happy to send out replacement screws and address any other problems you're still having. |
Please keep items 1 and 4 of the Pop!_OS Code of Conduct in mind: |
New datapoint to add: I just upgraded my lemp9 (24GB) firmware to My kernel version was I usually use my laptop plugged into a USB-C PD external monitor, but it also froze once while completely disconnected from peripherals and power supplies. The freezing mostly happens when the machine is idle, and it didn't seem to matter whether the screen is on or off. Using |
pop-os/system76-driver#217 should have applied that option automatically if you have the System76 Driver app installed; I'm assuming you're running a non-Pop!_OS and non-Ubuntu-based distribution without the driver app installed. I would recommend keeping that boot parameter in place since it stops the freezing. |
The intel_idle.max_cstate=4 option is a workaround, not a solution: it doesn't fix freezes when booting from the rescue image or the ISO. |
@jacobgkau Yep, I'm running an old installation of Ubuntu 21.04 with a custom kernel, and I didn't install the Btw, an update -- I tried booting without my 16GB RAM stick and then again with the RAM reinstalled. 31 hours so far without |
Hi all. I was experiencing this issue on my lemp9 (i7-10510U with 32 gb additional ram) immediately after updating to 21.04. (or was it the firmware update.... I did one right before the other).... Memtest revealed no issues. Contacted support and they told me to remove my RAM temporarily. I literally just didn't have a small screwdriver around so ended up finding this github thread instead and setting the Until now. I want to upgrade to 21.10, but I want to make sure I'm up-to-date on this issue before I make any more changes under the hood. Could anyone expand on what that setting actually does (I don't notice any performance difference except no more freezing), and whether or not it is suitable to leave it this way indefinitely? Or should I remove it / change the value then do the upgrade to 21.10? Thanks in advance!! Also, probably wrong place to ask, but I just checked to see if I have a firmware update available as well, and I see "Managed Firmware Unavailable. No devices supporting automatic firmware updates detected". This doesn't make much sense, seeing as this is a System76 machine... |
The setting prevents the CPU from entering deeper idle state. You might see slightly more power consumption while on battery (it's probably negligible and you won't notice it.) It's safe to leave the setting on for normal use or during upgrade. |
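As a rough way to confirm what limit the idle driver is actually using, the sketch below reads the `intel_idle` module parameter from sysfs. It assumes the `intel_idle` driver is in use and exposes `max_cstate` under `/sys/module/intel_idle/parameters`; the function name `show_max_cstate` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the max_cstate value the intel_idle driver
# reports. $1 overrides the sysfs file.
show_max_cstate() {
    param_file=${1:-/sys/module/intel_idle/parameters/max_cstate}
    if [ -r "$param_file" ]; then
        printf 'intel_idle max_cstate: %s\n' "$(cat "$param_file")"
    else
        echo "intel_idle max_cstate: unavailable"
    fi
}

show_max_cstate
```

If the workaround from this thread is in effect, the reported value should be 4. Removing the kernel parameter and rebooting returns the driver to its default.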
I had the same problem. I've been using the I just tried removing my RAM temporarily, per S76 support suggestion for @t-lock, and everything seems to work. My issue was the machine would never even boot without the Let's see if this lasts! |
What's the current state of this issue ? I'm using |
@varac I believe this issue has been resolved? Also, Pop!_OS is no longer on the 5.11 kernel. Please update your system, and if the issue persists please contact support https://support.system76.com/articles/before-you-open-a-support-ticket/ because you own a System76 machine. |
@13r0ck I'm pretty sure this is still an issue with the kernel. We set this option from 'system76-driver' for the 'lemp9': https://github.com/pop-os/system76-driver/blob/70c25a75d7f6b1291f4d025da13db785dfd56bad/system76driver/products.py#L484 |
@varac Have you tried the "remove one RAM stick temporarily" suggestion to see if it helps? |
@kanru I'm sorry but I don't have much time for extensive debugging right now, will report back once I get to it. |
Wow - removing the additional RAM stick worked. Thanks @kanru! This bug has been super annoying over the last 2 years; glad I finally found at least the cause. |
OS: Ubuntu 20.04.2 LTS
Kernel:
5.11.0-7612.13~1617215757~20.04~97a8d1a
Hardware: Lemur Pro 9
Processor: i7-10510U
The system will randomly freeze and require a hard reboot after a few minutes. Ubuntu's official 5.8 kernel works without any issue.