
vm fails to resume(start) after memhotplug+managedsave+start sequence on latest devel branch #28

Closed
sathnaga opened this issue Nov 7, 2017 · 8 comments

Comments

sathnaga commented Nov 7, 2017

Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=161042

kernel: 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le + proposed patch for open-power-host-os/linux#24
qemu-kvm-2.10.0-3.dev.gitbf0fd83.el7.centos.ppc64le
libvirt-3.6.0-3.dev.gitdd9401b.el7.centos.ppc64le

Guest used: Centos 7.4

[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            30G        367M         30G         11M        497M         30G
Swap:          3.0G          0B        3.0G

# cat mem_hp.xml
<?xml version='1.0' encoding='UTF-8'?>
<memory model="dimm"><target><size unit="KiB">8388608</size><node>0</node></target></memory>

# virsh attach-device virt-tests-vm1 mem_hp.xml --live
Device attached successfully

guest console:
[root@localhost ~]# [ 2873.120261] pseries-hotplug-mem: Attempting to hot-add 32 LMB(s) at index 80000080
[ 2873.120819] lpar: Attempting to resize HPT to shift 26
[ 2874.310521] lpar: HPT resize to shift 26 complete (104 ms / 1085 ms)
[ 2874.512430] pseries-hotplug-mem: Memory at 800000000 (drc index 80000080) was hot-added
[ 2874.512544] pseries-hotplug-mem: Memory at 810000000 (drc index 80000081) was hot-added
[ 2874.512636] pseries-hotplug-mem: Memory at 820000000 (drc index 80000082) was hot-added
[ 2874.512889] pseries-hotplug-mem: Memory at 830000000 (drc index 80000083) was hot-added
[ 2874.512976] pseries-hotplug-mem: Memory at 840000000 (drc index 80000084) was hot-added
[ 2874.513206] pseries-hotplug-mem: Memory at 850000000 (drc index 80000085) was hot-added
[ 2874.513501] pseries-hotplug-mem: Memory at 860000000 (drc index 80000086) was hot-added
[ 2874.513828] pseries-hotplug-mem: Memory at 870000000 (drc index 80000087) was hot-added
[ 2874.514087] pseries-hotplug-mem: Memory at 880000000 (drc index 80000088) was hot-added
[ 2874.514249] pseries-hotplug-mem: Memory at 890000000 (drc index 80000089) was hot-added
[ 2874.514758] pseries-hotplug-mem: Memory at 8a0000000 (drc index 8000008a) was hot-added
[ 2874.514849] pseries-hotplug-mem: Memory at 8b0000000 (drc index 8000008b) was hot-added
[ 2874.515342] pseries-hotplug-mem: Memory at 8c0000000 (drc index 8000008c) was hot-added
[ 2874.515757] pseries-hotplug-mem: Memory at 8d0000000 (drc index 8000008d) was hot-added
[ 2874.516164] pseries-hotplug-mem: Memory at 8e0000000 (drc index 8000008e) was hot-added
[ 2874.516422] pseries-hotplug-mem: Memory at 8f0000000 (drc index 8000008f) was hot-added
[ 2874.516738] pseries-hotplug-mem: Memory at 900000000 (drc index 80000090) was hot-added
[ 2874.517061] pseries-hotplug-mem: Memory at 910000000 (drc index 80000091) was hot-added
[ 2874.517303] pseries-hotplug-mem: Memory at 920000000 (drc index 80000092) was hot-added
[ 2874.517549] pseries-hotplug-mem: Memory at 930000000 (drc index 80000093) was hot-added
[ 2874.517794] pseries-hotplug-mem: Memory at 940000000 (drc index 80000094) was hot-added
[ 2874.518048] pseries-hotplug-mem: Memory at 950000000 (drc index 80000095) was hot-added
[ 2874.518293] pseries-hotplug-mem: Memory at 960000000 (drc index 80000096) was hot-added
[ 2874.518530] pseries-hotplug-mem: Memory at 970000000 (drc index 80000097) was hot-added
[ 2874.518768] pseries-hotplug-mem: Memory at 980000000 (drc index 80000098) was hot-added
[ 2874.519012] pseries-hotplug-mem: Memory at 990000000 (drc index 80000099) was hot-added
[ 2874.519255] pseries-hotplug-mem: Memory at 9a0000000 (drc index 8000009a) was hot-added
[ 2874.519443] pseries-hotplug-mem: Memory at 9b0000000 (drc index 8000009b) was hot-added
[ 2874.519619] pseries-hotplug-mem: Memory at 9c0000000 (drc index 8000009c) was hot-added
[ 2874.519845] pseries-hotplug-mem: Memory at 9d0000000 (drc index 8000009d) was hot-added
[ 2874.520015] pseries-hotplug-mem: Memory at 9e0000000 (drc index 8000009e) was hot-added
[ 2874.520242] pseries-hotplug-mem: Memory at 9f0000000 (drc index 8000009f) was hot-added

[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            38G        415M         38G         11M        511M         38G
Swap:          3.0G          0B        3.0G

..
# virsh managedsave virt-tests-vm1

Domain virt-tests-vm1 state saved by libvirt

# virsh start --console virt-tests-vm1

Domain virt-tests-vm1 started
Connected to domain virt-tests-vm1
Escape character is ^]

======> stuck here, even though domstate shows `running`

Without memory hotplug, the managedsave + start sequence works fine.
sathnaga commented Nov 16, 2017

ERROR 107-guest_sanity.hotplug.memory.qemu.qcow2.virtio_scsi.smp2.virtio_net.CentOS.7.4.ppc64le.powerkvm-libvirt.libvirt_mem.positive_test.hot_plug -> LoginTimeoutError: Login timeout expired
danielhb commented Nov 21, 2017

@sathnaga I wasn't able to hit this issue with the following setup:

Host:

cpu: Power8
qemu-kvm 2.10.0.3.dev.gitbf0fd83.el7.centos
kernel 4.13.0-4.dev.git49564cb.el7.centos.ppc64le
libvirt 3.6.0.3.dev.gitdd9401b.el7.centos

Guest:

Ubuntu 17.04
kernel 4.10.0-28

The guest resumes at the same point it was at before firing virsh managedsave, including the hotplugged LMBs.

Can you send the logs from the VM (/var/log/libvirt/qemu/<vm_name>.log)? They would help us debug whether libvirt is somehow misbehaving in your case.

aik pushed a commit that referenced this issue Jan 31, 2018
Spotted thanks to ASAN:

==25226==ERROR: AddressSanitizer: global-buffer-overflow on address 0x556715a1f120 at pc 0x556714b6f6b1 bp 0x7ffcdfac1360 sp 0x7ffcdfac1350
READ of size 1 at 0x556715a1f120 thread T0
    #0 0x556714b6f6b0 in init_disasm /home/elmarco/src/qemu/disas/s390.c:219
    #1 0x556714b6fa6a in print_insn_s390 /home/elmarco/src/qemu/disas/s390.c:294
    #2 0x55671484d031 in monitor_disas /home/elmarco/src/qemu/disas.c:635
    #3 0x556714862ec0 in memory_dump /home/elmarco/src/qemu/monitor.c:1324
    #4 0x55671486342a in hmp_memory_dump /home/elmarco/src/qemu/monitor.c:1418
    #5 0x5567148670be in handle_hmp_command /home/elmarco/src/qemu/monitor.c:3109
    #6 0x5567148674ed in qmp_human_monitor_command /home/elmarco/src/qemu/monitor.c:613
    #7 0x556714b00918 in qmp_marshal_human_monitor_command /home/elmarco/src/qemu/build/qmp-marshal.c:1704
    #8 0x556715138a3e in do_qmp_dispatch /home/elmarco/src/qemu/qapi/qmp-dispatch.c:104
    #9 0x556715138f83 in qmp_dispatch /home/elmarco/src/qemu/qapi/qmp-dispatch.c:131
    #10 0x55671485cf88 in handle_qmp_command /home/elmarco/src/qemu/monitor.c:3839
    #11 0x55671514e80b in json_message_process_token /home/elmarco/src/qemu/qobject/json-streamer.c:105
    #12 0x5567151bf2dc in json_lexer_feed_char /home/elmarco/src/qemu/qobject/json-lexer.c:323
    #13 0x5567151bf827 in json_lexer_feed /home/elmarco/src/qemu/qobject/json-lexer.c:373
    #14 0x55671514ee62 in json_message_parser_feed /home/elmarco/src/qemu/qobject/json-streamer.c:124
    #15 0x556714854b1f in monitor_qmp_read /home/elmarco/src/qemu/monitor.c:3881
    #16 0x556715045440 in qemu_chr_be_write_impl /home/elmarco/src/qemu/chardev/char.c:172
    #17 0x556715047184 in qemu_chr_be_write /home/elmarco/src/qemu/chardev/char.c:184
    #18 0x55671505a8e6 in tcp_chr_read /home/elmarco/src/qemu/chardev/char-socket.c:440
    #19 0x5567150943c3 in qio_channel_fd_source_dispatch /home/elmarco/src/qemu/io/channel-watch.c:84
    #20 0x7fb90292b90b in g_main_dispatch ../glib/gmain.c:3182
    #21 0x7fb90292c7ac in g_main_context_dispatch ../glib/gmain.c:3847
    #22 0x556715162eca in glib_pollfds_poll /home/elmarco/src/qemu/util/main-loop.c:214
    #23 0x556715163001 in os_host_main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:261
    #24 0x5567151631fa in main_loop_wait /home/elmarco/src/qemu/util/main-loop.c:515
    #25 0x556714ad6d3b in main_loop /home/elmarco/src/qemu/vl.c:1950
    #26 0x556714ade329 in main /home/elmarco/src/qemu/vl.c:4865
    #27 0x7fb8fe5c9009 in __libc_start_main (/lib64/libc.so.6+0x21009)
    #28 0x5567147af4d9 in _start (/home/elmarco/src/qemu/build/s390x-softmmu/qemu-system-s390x+0xf674d9)

0x556715a1f120 is located 32 bytes to the left of global variable 'char_hci_type_info' defined in '/home/elmarco/src/qemu/hw/bt/hci-csr.c:493:23' (0x556715a1f140) of size 104
0x556715a1f120 is located 8 bytes to the right of global variable 's390_opcodes' defined in '/home/elmarco/src/qemu/disas/s390.c:860:33' (0x556715a15280) of size 40600

This fix is based on Andreas Arnez's <arnez@linux.vnet.ibm.com> upstream
commit:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=9ace48f3d7d80ce09c5df60cccb433470410b11b

2014-08-19  Andreas Arnez  <arnez@linux.vnet.ibm.com>

       * s390-dis.c (init_disasm): Simplify initialization of
       opc_index[].  This also fixes an access after the last element
       of s390_opcodes[].

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20180104160523.22995-19-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
sathnaga commented Feb 8, 2018

@danielhb I tried with the latest levels again and was able to hit the issue:

Host:
4.15.0-3.dev.gitd34a158.el7.centos.ppc64le
qemu-2.11.50-1.dev.gita815ffa.el7.centos.ppc64le
libvirt-4.0.0-2.dev.git5e6f8a1.el7.centos.ppc64le

Guest:
3.10.0-693.5.2.el7.ppc64le (Centos 7.4)
4.15.0-3.dev.gitd34a158.el7.centos.ppc64le (hostos)

2018-02-08 07:23:39.147+0000: starting up libvirt version: 4.0.0, package: 2.dev.git5e6f8a1.el7.centos (Unknown, 2018-01-21-15:28:09, baratheon), qemu version: 2.11.50, hostname: ltc-test-ci2.aus.stglabs.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name guest=vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-18-vm1/master-key.aes -machine pseries-2.12,accel=kvm,usb=off,dump-guest-core=off -m size=33554432k,slots=16,maxmem=67108864k -realtime mlock=off -smp 32,sockets=32,cores=1,threads=1 -numa node,nodeid=0,cpus=0-31,mem=32768 -uuid 3e8d0ad8-39d0-408f-ad7f-0ea5c6f31012 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-18-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -drive file=/home/sath/hostos-ppc64le.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b3:b4:b5,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
2018-02-08T07:23:39.273386Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2018-02-08 07:25:47.520+0000: shutting down, reason=saved
2018-02-08T07:25:47.524256Z qemu-system-ppc64: terminating on signal 15 from pid 46977 (<unknown process>)
2018-02-08 07:25:56.185+0000: starting up libvirt version: 4.0.0, package: 2.dev.git5e6f8a1.el7.centos (Unknown, 2018-01-21-15:28:09, baratheon), qemu version: 2.11.50, hostname: ltc-test-ci2.aus.stglabs.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name guest=vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-19-vm1/master-key.aes -machine pseries-2.12,accel=kvm,usb=off,dump-guest-core=off -m size=33554432k,slots=16,maxmem=67108864k -realtime mlock=off -smp 32,sockets=32,cores=1,threads=1 -numa node,nodeid=0,cpus=0-31,mem=32768 -object memory-backend-ram,id=memdimm0,size=8589934592 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0,addr=34359738368 -uuid 3e8d0ad8-39d0-408f-ad7f-0ea5c6f31012 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-19-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -drive file=/home/sath/hostos-ppc64le.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b3:b4:b5,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
2018-02-08T07:25:56.316388Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)

(process:99500): GLib-CRITICAL **: g_source_unref: assertion 'source != NULL' failed

(the message above is repeated 17 times in the log)

cdeadmin commented Feb 8, 2018

------- Comment From danielhb@br.ibm.com 2018-02-08 10:04:39 EDT-------
I am still not able to reproduce this bug, although my environment is a bit different:

Host Pegas1.1 running P9 DD2.0

  • kernel 4.14.0-37.el7a.ppc64le
  • upstream QEMU
  • libvirt 3.9.0-9.el7

Guest: Pegas 1.1 kernel 4.14.0-37.el7a.ppc64le

Guest XML:

<domain type='kvm'>
<name>dhb-memhotplug</name>
<uuid>7fda2bdb-a32b-4f84-9b89-868774195fc1</uuid>
<maxMemory slots='16' unit='KiB'>16777216</maxMemory>
<memory unit='KiB'>4194304</memory>
<currentMemory unit='KiB'>4194304</currentMemory>
<vcpu placement='static'>8</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='ppc64le' machine='pseries-2.12'>hvm</type>
<boot dev='hd'/>
</os>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<emulator>/usr/local/bin/qemu-system-ppc64</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/home/danielhb/vm_imgs/pegas1.1.qcow2'/>
<target dev='sda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</controller>
<controller type='usb' index='0' model='qemu-xhci'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<model name='spapr-pci-host-bridge'/>
<target index='0'/>
</controller>
<interface type='bridge'>
<mac address='52:54:00:84:85:86'/>
<source bridge='virbr0'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</interface>
<serial type='pty'>
<target type='spapr-vio-serial' port='0'>
<model name='spapr-vty'/>
</target>
<address type='spapr-vio' reg='0x30000000'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
<address type='spapr-vio' reg='0x30000000'/>
</console>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</memballoon>
<panic model='pseries'/>
</devices>
<seclabel type='none' model='selinux'/>
<seclabel type='none' model='dac'/>
</domain>

This is the output right after the VM is started:

localhost login: root
Password:
Last login: Thu Feb 8 07:17:59 on hvc0
[root@localhost ~]#
[root@localhost ~]#
[root@localhost ~]#
[root@localhost ~]# grep Mem /proc/meminfo
MemTotal: 3633408 kB
MemFree: 3180928 kB
MemAvailable: 3014016 kB
[root@localhost ~]# [ 71.728923] pseries-hotplug-mem: Attempting to hot-add 16 LMB(s) at index 80000010
[ 71.765142] radix-mmu: Mapped 0xc000000100000000-0xc000000110000000 with 2.00 MiB pages
[ 71.767171] Built 1 zonelists, mobility grouping on. Total pages: 56655
[ 71.767242] Policy zone: Normal
[ 71.767477] radix-mmu: Mapped 0xc000000110000000-0xc000000120000000 with 2.00 MiB pages
[ 71.770559] radix-mmu: Mapped 0xc000000120000000-0xc000000130000000 with 2.00 MiB pages
[ 71.772844] radix-mmu: Mapped 0xc000000130000000-0xc000000140000000 with 2.00 MiB pages
[ 71.775209] radix-mmu: Mapped 0xc000000140000000-0xc000000150000000 with 2.00 MiB pages
[ 71.777407] radix-mmu: Mapped 0xc000000150000000-0xc000000160000000 with 2.00 MiB pages
[ 71.779665] radix-mmu: Mapped 0xc000000160000000-0xc000000170000000 with 2.00 MiB pages
[ 71.781897] radix-mmu: Mapped 0xc000000170000000-0xc000000180000000 with 2.00 MiB pages
[ 71.784211] radix-mmu: Mapped 0xc000000180000000-0xc000000190000000 with 2.00 MiB pages
[ 71.786375] radix-mmu: Mapped 0xc000000190000000-0xc0000001a0000000 with 2.00 MiB pages
[ 71.788558] radix-mmu: Mapped 0xc0000001a0000000-0xc0000001b0000000 with 2.00 MiB pages
[ 71.791183] radix-mmu: Mapped 0xc0000001b0000000-0xc0000001c0000000 with 2.00 MiB pages
[ 71.794673] radix-mmu: Mapped 0xc0000001c0000000-0xc0000001d0000000 with 2.00 MiB pages
[ 71.797431] radix-mmu: Mapped 0xc0000001d0000000-0xc0000001e0000000 with 2.00 MiB pages
[ 71.799781] radix-mmu: Mapped 0xc0000001e0000000-0xc0000001f0000000 with 2.00 MiB pages
[ 71.803004] radix-mmu: Mapped 0xc0000001f0000000-0xc000000200000000 with 2.00 MiB pages
[ 71.805369] pseries-hotplug-mem: Memory at 100000000 (drc index 80000010) was hot-added
[ 71.805502] pseries-hotplug-mem: Memory at 110000000 (drc index 80000011) was hot-added
[ 71.805631] pseries-hotplug-mem: Memory at 120000000 (drc index 80000012) was hot-added
[ 71.805769] pseries-hotplug-mem: Memory at 130000000 (drc index 80000013) was hot-added
[ 71.805907] pseries-hotplug-mem: Memory at 140000000 (drc index 80000014) was hot-added
[ 71.806038] pseries-hotplug-mem: Memory at 150000000 (drc index 80000015) was hot-added
[ 71.806186] pseries-hotplug-mem: Memory at 160000000 (drc index 80000016) was hot-added
[ 71.806324] pseries-hotplug-mem: Memory at 170000000 (drc index 80000017) was hot-added
[ 71.806451] pseries-hotplug-mem: Memory at 180000000 (drc index 80000018) was hot-added
[ 71.806585] pseries-hotplug-mem: Memory at 190000000 (drc index 80000019) was hot-added
[ 71.806716] pseries-hotplug-mem: Memory at 1a0000000 (drc index 8000001a) was hot-added
[ 71.806847] pseries-hotplug-mem: Memory at 1b0000000 (drc index 8000001b) was hot-added
[ 71.806970] pseries-hotplug-mem: Memory at 1c0000000 (drc index 8000001c) was hot-added
[ 71.807097] pseries-hotplug-mem: Memory at 1d0000000 (drc index 8000001d) was hot-added
[ 71.807226] pseries-hotplug-mem: Memory at 1e0000000 (drc index 8000001e) was hot-added
[ 71.807353] pseries-hotplug-mem: Memory at 1f0000000 (drc index 8000001f) was hot-added

$
$ sudo virsh managedsave dhb-memhotplug

Domain dhb-memhotplug state saved by libvirt

$
$ sudo virsh start --console dhb-memhotplug
Domain dhb-memhotplug started
Connected to domain dhb-memhotplug
Escape character is ^]

[root@localhost ~]#
[root@localhost ~]#
[root@localhost ~]# grep Mem /proc/meminfo
MemTotal: 7827712 kB
MemFree: 7363520 kB
MemAvailable: 7497984 kB
[root@localhost ~]#

Tried about 10 times in case the problem is intermittent. Still no luck - the VM resumes correctly even after a hotplug.

I'll try hotplugging larger amounts of memory to see if it makes a difference. I'll also try with a Host OS guest like the one used in the bug.

cdeadmin commented Feb 8, 2018

------- Comment From danielhb@br.ibm.com 2018-02-08 14:37:40 EDT-------
Thanks for lending the env, Satheesh.

I was able to reproduce the error in Satheesh's env and also in another Power8E environment I have access to (also running Host OS).

I am not too familiar with how virsh managedsave works - at first I thought it was a savevm call into QEMU, but savevm takes much longer than managedsave to execute. I'll see how I can reproduce the exact behavior of virsh managedsave using QEMU only.

cdeadmin commented Feb 13, 2018

------- Comment From danielhb@br.ibm.com 2018-02-13 12:44:21 EDT-------
This problem occurs with HPT guests running newer kernels that have the htab resize feature. This feature allows the guest kernel to resize its HPT based on the new memory configuration; the resize can happen when hotplugging or unplugging memory.

The problem is on the QEMU side: QEMU isn't setting the new HTAB size in its machine state after the guest kernel resizes it. This does not affect the guest immediately, but any loadvm operation (like the situation described here, or even a migration) triggers the bug, making the guest unresponsive.

This impacts all guests that use HPT (i.e. everyone except P9 Radix).

I've sent a fix proposal to the QEMU mailing list.

dgibson added a commit to dgibson/qemu that referenced this issue Feb 14, 2018
Newer kernels have an htab resize capability when adding or removing
memory. In these situations, the guest kernel may reallocate its
htab to a more suitable size based on the resulting memory.

However, we're not setting the new value back into the machine state
when a KVM guest resizes its htab. At first this doesn't seem harmful,
but when migrating or saving the guest state (via virsh managedsave,
for instance) this mismatch between the htab size known to QEMU and to
the kernel makes the guest hang when trying to load its state.

Inside h_resize_hpt_commit, the hypercall that commits the hash page
table resize, let's set spapr->htab_shift to the new value if we're
sure that kvmppc_resize_hpt_commit was successful.

While we're here, add a "not RADIX" sanity check, as is already done
in the related hypercall h_resize_hpt_prepare.

Fixes: open-power-host-os#28
Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
dgibson added a commit to dgibson/qemu that referenced this issue Feb 16, 2018
cdeadmin commented Feb 21, 2018

------- Comment From danielhb@br.ibm.com 2018-02-16 11:25:39 EDT-------
The fix was accepted and is already available upstream:

commit 9478956
Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Date: Tue Feb 13 15:37:16 2018 -0200

hw/ppc/spapr_hcall: set htab_shift after kvmppc_resize_hpt_commit

I'll move this to FAT. You can test it right now by using upstream QEMU - it is not available on Host OS yet.

And let me emphasize my comment #10: this affects all HPT guests running 4.14+ kernels. I think this is a good candidate to be backported to Pegas.

------- Comment From bssrikanth@in.ibm.com 2018-02-21 00:22:36 EDT-------
Satheesh will clone this to Pegas.
What's the timeline to get this into Host OS?

sathnaga commented May 7, 2018

Tested and found working in the latest builds.

4.16.0-2.dev.gitb24758c.el7.centos.ppc64le
qemu-2.11.91-1.dev.git46a0689.el7.centos.ppc64le
libvirt-4.2.0-2.dev.gitbf217de.el7.centos.ppc64le

https://ltc-jenkins.aus.stglabs.ibm.com/job/runAvocadoFVTTest/1336/artifact/avocado-fvt-wrapper/results/job-2018-05-07T01.24-a1877cd/test-results/103-guest_sanity.hotplug.memory.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.CentOS.7.4.ppc64le.powerkvm-libvirt.libvirt_mem.positive_test.memory.hot.plug/debug.log

<maxMemory slots="16" unit="KiB">67108864</maxMemory><cpu><numa><cell cpus="0-31" id="0" memory="33554432" unit="KiB" /></numa></cpu></domain>
02:06:59 DEBUG| Undefine VM virt-tests-vm1
02:06:59 DEBUG| Define VM from /var/tmp/xml_utils_temp_93u5HX.xml
02:06:59 DEBUG| Starting vm 'virt-tests-vm1'
02:07:01 DEBUG| waiting for domain virt-tests-vm1 to start (0.000019 secs)
02:07:02 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:02 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:07:27 DEBUG| Sending command: cat /proc/meminfo
02:07:27 DEBUG| Memtotal on guest: 32449984
02:07:27 DEBUG| Memory device xml: <?xml version='1.0' encoding='UTF-8'?>
<memory model="dimm"><target><size unit="KiB">8388608</size><node>0</node></target></memory>
02:07:27 DEBUG| Command output:
Device attached successfully
02:07:27 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:27 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:07:29 DEBUG| Sending command: grep memory /lib/udev/rules.d/80-hotplug-cpu-mem.rules || echo 'SUBSYSTEM=="memory", ACTION=="add", TEST=="state", ATTR{state}=="offline", ATTR{state}="online"' >> /lib/udev/rules.d/80-hotplug-cpu-mem.rules
02:07:30 DEBUG| Sending command: echo $?
02:07:45 DEBUG| Sending command: cat /proc/meminfo
02:07:45 DEBUG| Sending command: cat /proc/meminfo
02:07:45 DEBUG| Memtotal on guest: 40838592
02:07:45 DEBUG| Running virsh command: managedsave --domain virt-tests-vm1 
02:07:45 INFO | Running '/bin/virsh managedsave --domain virt-tests-vm1 '
02:07:50 DEBUG| [stdout] 
02:07:51 INFO | Command '/bin/virsh managedsave --domain virt-tests-vm1 ' finished with 0 after 5.41704893112s
02:07:51 DEBUG| [stdout] Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| [stdout] 
02:07:51 DEBUG| status: 0
02:07:51 DEBUG| stdout: Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| stderr: 
02:07:51 DEBUG| Command output:
Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| Starting vm 'virt-tests-vm1'
02:07:58 DEBUG| waiting for domain virt-tests-vm1 to start (0.000014 secs)
02:07:58 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:58 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:08:00 DEBUG| Running virsh command: save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save 
02:08:00 INFO | Running '/bin/virsh save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save '
02:08:06 DEBUG| [stdout] 
02:08:06 INFO | Command '/bin/virsh save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save ' finished with 0 after 6.07074785233s
02:08:06 DEBUG| [stdout] Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| [stdout] 
02:08:06 DEBUG| status: 0
02:08:06 DEBUG| stdout: Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| stderr: 
02:08:06 DEBUG| Command output:
Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| Running virsh command: restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save 
02:08:06 INFO | Running '/bin/virsh restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save '
02:08:14 DEBUG| [stdout] Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| [stdout] 
02:08:14 INFO | Command '/bin/virsh restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save ' finished with 0 after 7.98436999321s
02:08:14 DEBUG| status: 0
02:08:14 DEBUG| stdout: Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| stderr: 
02:08:14 DEBUG| Command output:
Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:08:14 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:08:16 INFO | Running 'true'
sathnaga closed this May 7, 2018