QEMU (KVM) does not work properly on 3C5000 #25

Open
MingcongBai opened this issue Dec 9, 2023 · 11 comments

@MingcongBai
Member

MingcongBai commented Dec 9, 2023

Problem description

On a 3C5000, starting QEMU with KVM acceleration using the following command freezes the host's graphical interface (SSH remains usable):

qemu-system-loongarch64 -accel kvm

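When this happens, the host kernel intermittently prints errors such as workqueue lockup, watchdog: BUG: soft lockup - CPU#8 stuck for 33s! [QSGRenderThread:1532], and even watchdog: Watchdog detected hard LOCKUP on cpu 15. The two attached screenshots show examples:

[Two screenshots of the kernel lockup messages]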

If you download QEMU-EFI-8.1.fd from https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/ and pass it with the -bios option:

qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd

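then everything works and the guest boots to the EFI Shell.

However, that is not the end of it. If you then download the minimal image from the link above and boot from it: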

qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd -hda https://mirrors.wsyu.edu.cn/loongarch/archlinux/images/archlinux-minimal-2023.05.10-loong64.qcow2

QEMU boots to GRUB, but after pressing Enter to boot the system, the guest terminal prints only a few lines and then resets after a while:

MemoryMapPteRange 507 Address DCE0000 End DD20000 Attributes 53
SetUefiImageMemoryAttributes - 0x000000000DC40000 - 0x0000000000040000 (0x0000000000000000)

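This part of the problem turned out to be a false alarm: it was caused by not passing the console=ttyS0,115200 kernel parameter to the guest (the colleague who tested earlier did not mention this). The host kernel failure when -bios is not specified still stands. If -device virtio-gpu-pci is specified, the extra serial console parameter is not needed.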

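Debugging attempts

We have tried the following; neither mitigated the problem (the symptoms were identical):

  • Limiting the number of cores to 4 with the nr_cpus=4 kernel parameter
  • Removing all but one memory module, then booting with nr_cpus=4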
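
When repeating these attempts, it can help to confirm that the host actually booted with the intended parameter. The following is a minimal sketch, not part of the original report: the function names parse_cmdline and cpu_limit are made up for illustration, and the parser does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags (no "=") map to None. Quoted values containing spaces
    are NOT handled; this is only a sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def cpu_limit(cmdline):
    """Return the nr_cpus= value as an int, or None if it is absent or empty."""
    value = parse_cmdline(cmdline).get("nr_cpus")
    return int(value) if value else None


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel on Linux.
    with open("/proc/cmdline") as f:
        print("nr_cpus limit:", cpu_limit(f.read()))
```

Run on the host after rebooting; a result of None would mean the nr_cpus parameter never reached the kernel.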

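Environment

Notes

In the same test environment, the problem cannot be reproduced on either the 3A5000 or the 3A6000 platform:

  • QEMU+KVM does not hang the host system, whether or not -bios is specified
  • The Arch Linux image boots successfully with the commands above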
@cthbleachbit

I tried the same steps under Loongnix. On a 3C5000 machine running Loongnix 20.5, updated to the latest version:

The QEMU window opened by qemu-system-loongarch64 -accel kvm -bios QEMU-EFI-8.1.fd stops at "guest has not initialized the display yet", while the terminal prints:

/sys/devices/system/cpu/cpu0/cpufreq/     cpuinfo_max_freq not exist!
Try /proc/cpuinfo...

However, it gets stuck here even without -accel kvm.

@MingcongBai
Member Author

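The problem above was caused by using a new-world EFI image; with the OVMF image provided by Loongnix, everything works. It looks like the problem reported above may be specific to new-world systems.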

@chenhuacai

Internal feedback says openEuler works fine on the Loongson 3C5000. Please try Yongbao as the host system.
https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/

@liushuyu

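Internal feedback says openEuler works fine on the Loongson 3C5000. Please try Yongbao as the host system. https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/

I tested with Yongbao 20231201 as the host system, and the kernel still shows the same symptoms as on the other distributions: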

[ 2023-12-12T20:50:45+08:00 ] [  269.598480] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:50:45+08:00 ] [  269.604563] rcu: 	3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=2626
[ 2023-12-12T20:50:45+08:00 ] [  269.613474] rcu: 	(detected by 1, t=5255 jiffies, g=5917, q=141 ncpus=16)
[ 2023-12-12T20:50:55+08:00 ] [  279.620509] rcu: rcu_preempt kthread starved for 2494 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=5
[ 2023-12-12T20:50:55+08:00 ] [  279.630711] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:50:55+08:00 ] [  279.639788] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:50:55+08:00 ] [  279.644951] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:51:58+08:00 ] [  342.669638] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:51:58+08:00 ] [  342.675698] rcu: 	3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=10509
[ 2023-12-12T20:51:58+08:00 ] [  342.684692] rcu: 	(detected by 1, t=23520 jiffies, g=5917, q=479 ncpus=16)
[ 2023-12-12T20:52:08+08:00 ] [  352.691810] rcu: rcu_preempt kthread timer wakeup didn't happen for 2502 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 2023-12-12T20:52:08+08:00 ] [  352.703047] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=3498
[ 2023-12-12T20:52:08+08:00 ] [  352.709963] rcu: rcu_preempt kthread starved for 2508 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 2023-12-12T20:52:08+08:00 ] [  352.720250] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:52:08+08:00 ] [  352.729326] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:52:08+08:00 ] [  352.734465] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:53:11+08:00 ] [  415.752308] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:53:11+08:00 ] [  415.758369] rcu: 	3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=17582
[ 2023-12-12T20:53:11+08:00 ] [  415.767364] rcu: 	(detected by 1, t=41791 jiffies, g=5917, q=1049 ncpus=16)
[ 2023-12-12T20:53:21+08:00 ] [  425.774569] rcu: rcu_preempt kthread starved for 2495 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:53:21+08:00 ] [  425.784857] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:53:21+08:00 ] [  425.793933] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:53:21+08:00 ] [  425.799075] rcu: Stack dump where RCU GP kthread last ran:
[ 2023-12-12T20:54:24+08:00 ] [  488.814843] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2023-12-12T20:54:24+08:00 ] [  488.820905] rcu: 	3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=23006
[ 2023-12-12T20:54:24+08:00 ] [  488.829901] rcu: 	(detected by 11, t=60060 jiffies, g=5917, q=1232 ncpus=16)
[ 2023-12-12T20:54:34+08:00 ] [  498.837194] rcu: rcu_preempt kthread starved for 2498 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
[ 2023-12-12T20:54:34+08:00 ] [  498.847481] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2023-12-12T20:54:34+08:00 ] [  498.856557] rcu: RCU grace-period kthread stack dump:
[ 2023-12-12T20:54:34+08:00 ] [  498.861699] rcu: Stack dump where RCU GP kthread last ran:
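
For anyone comparing several captures like this one, the log can be summarized with a short script. This is a rough sketch only: the regular expressions are fitted to the line shapes in the log above and may not match RCU messages from other kernel versions, and the function names are made up for illustration.

```python
import re

# "N-...: (k GPs behind)" lines name the CPU that is holding up the grace period.
STALLED = re.compile(r"rcu:\s+(?P<cpu>\d+)-\S+: \((?P<gps>\d+) GPs behind\)")

# "kthread starved" lines give the uptime, the starvation length in jiffies,
# and the CPU the grace-period kthread last ran on.
STARVED = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+rcu: rcu_preempt kthread starved for "
    r"(?P<jiffies>\d+) jiffies!.*?->cpu=(?P<cpu>\d+)"
)


def stalled_cpus(log_text):
    """Return the sorted set of CPU numbers reported as stalled."""
    return sorted({int(m["cpu"]) for m in STALLED.finditer(log_text)})


def starved_events(log_text):
    """Return (uptime_seconds, jiffies, kthread_cpu) for each starvation report."""
    events = []
    for line in log_text.splitlines():
        m = STARVED.search(line)
        if m:
            events.append((float(m["uptime"]), int(m["jiffies"]), int(m["cpu"])))
    return events


# Three lines copied from the log above ("\t" is the tab the kernel prints).
SAMPLE = """\
[ 2023-12-12T20:50:45+08:00 ] [  269.604563] rcu: \t3-...0: (1 GPs behind) idle=5f7c/1/0x4000000000000000 softirq=1448/1448 fqs=2626
[ 2023-12-12T20:50:55+08:00 ] [  279.620509] rcu: rcu_preempt kthread starved for 2494 jiffies! g5917 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=5
[ 2023-12-12T20:52:08+08:00 ] [  352.709963] rcu: rcu_preempt kthread starved for 2508 jiffies! g5917 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
"""

if __name__ == "__main__":
    print("stalled CPUs:", stalled_cpus(SAMPLE))
    for uptime, jiffies, cpu in starved_events(SAMPLE):
        print(f"{uptime:10.3f}s  starved {jiffies} jiffies, kthread last on CPU {cpu}")
```

In the log above, every stall report names CPU 3, while the CPU the grace-period kthread last ran on varies (5, 1, 15).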

Hardware: Loongson 3C5000 + 7A2000, 128 GB of RAM

Kernel version:

[root@Sunhaiyong ~]# uname -a
Linux Sunhaiyong 6.7.0-rc1 #1 SMP PREEMPT Thu Nov 30 02:07:13 UTC 2023 loongarch64 GNU/Linux

Building QEMU on Yongbao 20231201 ran into a toolchain problem:

collect2 version 14.0.0 20231117 (experimental)
/usr/bin/ld -plugin /usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/loongarch64-unknown-linux-gnu/14.0.0/lto-wrapper -plugin-opt=-fresolution=/tmp/ccbJ3P9u.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr --hash-style=gnu -m elf64loongarch -dynamic-linker /lib64/ld-linux-loongarch-lp64d.so.1 /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crt1.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crti.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtbegin.o -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0 -L/usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../.. -L/lib64 -L/usr/lib64 --version -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/crtend.o /usr/lib64/gcc/loongarch64-unknown-linux-gnu/14.0.0/../../../crtn.o
-----------
Sanity testing C compiler: cc
Is cross compiler: False.
Sanity check compiler command line: cc sanitycheckc.c -o sanitycheckc.exe -D_FILE_OFFSET_BITS=64
Sanity check compile stdout:

-----
Sanity check compile stderr:
/usr/bin/ld: cannot find /usr/lib64/libc_nonshared.a: No such file or directory
collect2: error: ld returned 1 exit status

-----

../meson.build:1:0: ERROR: Compiler cc cannot compile programs.
[root@Sunhaiyong qemu]#

@bibo-mao

bibo-mao commented Dec 12, 2023

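Internal feedback says openEuler works fine on the Loongson 3C5000. Please try Yongbao as the host system. https://mirrors.wsyu.edu.cn/fedora/linux/Yongbao/20231201/

I tested with Yongbao 20231201 as the host system, and the kernel still shows the same symptoms as on the other distributions:

Is this RCU error reported by the host (physical machine) kernel or by the guest kernel?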

What QEMU build command did you use? On my side, building upstream QEMU on openEuler works.

@MingcongBai
Member Author

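@bibo-mao The errors above are from the host kernel. As for the build problem, we later learned from @sunhaiyong1978 that Yongbao needs its development components enabled before it can compile. @liushuyu will continue testing tomorrow.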

@MingcongBai
Member Author

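Following the hint @chenhuacai received, we updated the not-yet-merged KVM LSX/LASX patches and applied them, together with the patches from the loongarch-next branch, to a 6.7.0-rc5 kernel. The symptoms described in the original post are unchanged.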

@bibo-mao

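Is there a machine we can log in to remotely, so we can look into the cause?
We have tested dual-socket 3C5000, single-socket 3C5000, and single-socket 3A6000 here and did not see RCU problems reported on the host. We only saw RCU timeouts reported inside the guest when running stress tests in the guest.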

@MingcongBai
Member Author

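Is there a machine we can log in to remotely, so we can look into the cause? We have tested dual-socket 3C5000, single-socket 3C5000, and single-socket 3A6000 here and did not see RCU problems reported on the host. We only saw RCU timeouts reported inside the guest when running stress tests in the guest.

We have been in touch and provided access.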

@MingcongBai
Member Author

MingcongBai commented Dec 13, 2023

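After investigation, we found that part of this report was a false alarm (I have marked the mistaken part with strikethrough):

  1. When QEMU boots the guest, console=ttyS0,115200 must be specified, otherwise there is no output at all (the earlier reset was actually because the kernel could not find the disk and panicked)
  2. The problem that omitting -bios hangs the 3C5000 host still stands
  3. The -vga option appears to be unusable, but if -device virtio-gpu-pci is specified, the serial console parameter above is not needed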
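
To keep the tested configurations straight, the command lines from this thread can be generated from one helper. This is only a sketch: it uses just the options already quoted in this issue, the helper names are made up for illustration, and it assumes the qcow2 image has been downloaded locally rather than passed as a URL.

```python
import shlex

QEMU = "qemu-system-loongarch64"


def qemu_argv(bios=None, disk=None, virtio_gpu=False):
    """Build a QEMU argument list from the options discussed in this thread."""
    argv = [QEMU, "-accel", "kvm"]
    if bios is not None:
        argv += ["-bios", bios]
    if disk is not None:
        argv += ["-hda", disk]
    if virtio_gpu:
        argv += ["-device", "virtio-gpu-pci"]
    return argv


def guest_needs_serial_console(virtio_gpu):
    """Per findings 1 and 3 above: without virtio-gpu-pci, the guest kernel
    needs console=ttyS0,115200 to show any output."""
    return not virtio_gpu


if __name__ == "__main__":
    bios = "QEMU-EFI-8.1.fd"
    image = "archlinux-minimal-2023.05.10-loong64.qcow2"  # assumed local copy
    cases = [
        ("no -bios (hangs the 3C5000 host)", qemu_argv()),
        ("EFI shell", qemu_argv(bios=bios)),
        ("Arch image, serial console", qemu_argv(bios=bios, disk=image)),
        ("Arch image, virtio-gpu", qemu_argv(bios=bios, disk=image, virtio_gpu=True)),
    ]
    for label, argv in cases:
        print(f"{label}: {shlex.join(argv)}")
```

The first case is the one that still hangs the 3C5000 host, so it should only be run on a machine that can be rebooted.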

@bibo-mao

@MingcongBai
Member Author

Systems built with LSX optimizations all hit SIGILL errors, but that belongs to a separate report; see #24.
