pmcd causes complete system lockup on CentOS 7 on VMware #107

Closed
ghost opened this issue Aug 15, 2016 · 10 comments

ghost commented Aug 15, 2016

Let me start off by saying I know nothing about PCP. I installed PCP on about 80 compute nodes of an HPC cluster. Most of these are working fine, but I noticed four virtual machines completely lock up and die shortly after pmcd starts, whether it is started during boot or manually afterwards. The physical systems, which are configured exactly the same, do not lock up. By lock up I mean the machine is completely dead: not 100% CPU busy, not out of memory, etc., but completely unresponsive. When this happens the system no longer even responds to ping, so the kernel itself (or its networking) is dead. However, I do not see a kernel panic and can't get a crash dump, so I can't see what is happening. The fact that this userland daemon is somehow killing the kernel without triggering a kernel panic is very odd and worrying.

Here's how I am using PCP. I am using XDMoD, which has a plugin, SUPReMM, that requires PCP. I installed and configured PCP via this SaltStack config:

xdmod.sls.txt

I then enabled and started pmcd, pmlogger, and pmie. At this point the VMs will hang, sometimes in under a minute. Since there are four machines affected and three daemons in the mix, I configured which daemon starts on boot as follows (left side is the VM name) to narrow down the issue:

dn1 = nothing, control for this test
dn2 = pmcd
dn3 = pmie
dn4 = pmlogger

After a number of reboots only dn2 hangs. I was also able to hang a machine by starting pmcd via systemctl after the system had booted. Oddly, I have not been able to hit this issue every time I start the daemon. Any clue what is going on, or how we can proceed with this issue?

@kmcdonell
Member

Jeff, this is very odd and does not match any failure scenario I've seen with PCP over, er, the last 22 years.

Let's start with some configuration stuff ... What sort of distro is this (flavour and version)? What version of PCP is installed? And can you please post the contents of /etc/pcp/pmcd/pmcd.conf and /var/log/pcp/pmcd/pmcd.log (preferably from a failed start, though that may not be possible if the filesystem cache is not flushed as the system goes down; in that case the pmcd.log from a successful start of pmcd may still be helpful)?
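For example, something along these lines gathers all of that in one go (pcp(1) with no arguments prints a summary of the local installation, including the pmcd version and the active PMDAs):

    cat /etc/os-release    # distro flavour and version
    rpm -q pcp             # installed PCP package version
    pcp                    # summary of the running pmcd and its agents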

ghost commented Aug 16, 2016

Yup, this is a very odd issue. It wouldn't be the first time I've found a kernel bug, so let's hope it isn't that. I'm now also seeing problems on my physical machines, though I don't know if they are related. The curious bit is that since installing PCP the VMs sometimes hang on shutdown, even when they weren't running pmcd at the time but had it (and other PCP components) installed and possibly running at some point before the reboot. My physical machines are now doing that too. Does PCP load or include any kernel modules when it is started?

All systems in question run CentOS 7, kernel 3.10.0-327.13.1.el7.x86_64. Here are the files you requested:

pmcd.conf.txt
pmcd.log.txt

@natoscott
Member

| By lock up I mean the machine is completely dead. Not 100% CPU busy, not out of memory, etc. but completely unresponsive

This sounds like a kernel / hardware problem (like Ken, I've never come across anything like this, nor seen reports from anyone else along these lines FWIW). You may have some success extracting additional diagnostics via a kernel debugger, and/or sysrq-'t' from the console - https://www.kernel.org/doc/Documentation/sysrq.txt
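For example, assuming the magic SysRq key is enabled (CentOS 7 restricts it by default), a full task-state dump to the kernel log can be requested just before the hang:

    echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
    echo t > /proc/sysrq-trigger      # dump the state of every task to the kernel log
    # on a console the same dump is available via Alt+SysRq+t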

| Does PCP load or include any kernel modules when it is started?

No, it doesn't. Your log file and configuration file look to be in good shape too - no warnings/errors there, so from a PCP point of view everything looks normal with your machine setups.
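A quick way to double-check that on your own systems (package names here assume the stock CentOS/EPEL pcp packaging) is to confirm no installed PCP package ships a kernel object and that lsmod output does not change when pmcd starts:

    rpm -ql $(rpm -qa 'pcp*') | grep '\.ko$'   # expect no output
    lsmod > /tmp/lsmod.before
    systemctl start pmcd
    lsmod | diff /tmp/lsmod.before -           # expect no difference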

ghost commented Aug 16, 2016

@natoscott, a hardware issue is unlikely, as we're talking about VMs hanging when pmcd runs and numerous physical machines exhibiting odd behavior like hanging on shutdown. I agree that the kernel is the likely source of the problem, as this is indeed a kernel crash; pmcd is somehow triggering it. After much fighting I was able to get kdump working and I have crash dumps of the problem. One of them is here:

https://drive.google.com/open?id=0B6emnuNXtougZmhGSWhrS2ZJZVU
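(For anyone else trying to capture a vmcore from a hang like this, the CentOS 7 setup is roughly the following; crashkernel=auto is just the common default, adjust to taste:)

    yum install -y kexec-tools
    grubby --update-kernel=ALL --args="crashkernel=auto"   # reserve memory for the capture kernel
    systemctl enable kdump
    reboot    # the crashkernel reservation only takes effect after a reboot
    # subsequent crashes land under /var/crash/<timestamp>/vmcore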

I also have this poorly made screencast demonstrating the behavior by simply starting pmcd:

https://youtu.be/Gq25aZrWodg

I'm not much of a kernel engineer so I'm of limited help at this point.

@pcpemail

Hi Jeff, I'm looking at the vmcore now - it's a stock RHEL7 kernel

pcp-vmcore: Kdump compressed dump v6, system Linux, node dn1, release 3.10.0-327.13.1.el7.x86_64, version #1 SMP Thu Mar 31 16:04:38 UTC 2016, machine x86_64, domain (none)

more later ..
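(For anyone following along, opening one of these dumps with the crash(8) utility looks roughly like this, assuming the matching kernel-debuginfo package is installed; the dump path is a placeholder:)

    yum install -y crash
    debuginfo-install -y kernel-3.10.0-327.13.1.el7
    crash /usr/lib/debug/lib/modules/3.10.0-327.13.1.el7.x86_64/vmlinux /var/crash/<host-timestamp>/vmcore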

@kmcdonell
Member

The screencast suggests the hang is about 20 seconds after pmcd starts, which is interesting and suggests it is NOT an initialization error, but possibly pmFetch-related or some self-timer-driven event in a PMDA.

Was pmlogger enabled on this system?

Another possible approach is trying to find the PMDA that is responsible (it is unlikely to be pmcd itself). You have 11 PMDAs in /etc/pcp/pmcd/pmcd.conf ... I'd start by commenting out about half of them (insert a # at the start of the line), especially the ones with low-level hardware or deep kernel contact, e.g. perfevent, jbd2, nvidia, slurm, xfs, linux, proc. Then try again.

If this survives, you may be able to binary-chop your way to identifying which PMDA is the culprit.
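For instance, a first pass at that might leave pmcd.conf looking something like this - the domain numbers and paths below are purely illustrative, keep whatever values are already in your file and just prefix the lines you want disabled with '#':

    # perfevent  127  pipe  binary      /var/lib/pcp/pmdas/perfevent/pmdaperfevent -d 127
    # nvidia     120  pipe  binary      /var/lib/pcp/pmdas/nvidia/pmdanvidia -d 120
    linux        60   dso   linux_init  /var/lib/pcp/pmdas/linux/pmda_linux.so

then restart the daemon and see whether the machine survives:

    systemctl restart pmcd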

@pcpemail

The vmcore shows it's the perfevent PMDA - this is a VMware guest, and there are known kernel issues with x86_perf_event_update() calling native_read_pmc(), but VMware apparently doesn't implement all the h/w events. Your actual crash was triggered by 'salt-minion', which is also tripping up in native_read_pmc().

I guess h/w perf events are probably not much use in a virtual machine, so either manually comment out 'perfevent' in pmcd.conf, or run /var/lib/pcp/pmdas/perfevent/Remove. You may also have to turn off 'salt-minion'.
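Concretely, that is something like the following as root (the Remove script updates pmcd.conf and notifies pmcd for you; the manual route is shown as the alternative):

    # option 1: let the PMDA remove itself
    cd /var/lib/pcp/pmdas/perfevent && ./Remove
    # option 2: comment the perfevent line out of /etc/pcp/pmcd/pmcd.conf, then
    systemctl restart pmcd
    # and stop salt-minion from poking the perf counters on these guests
    systemctl stop salt-minion && systemctl disable salt-minion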

For PCP, the perfevent PMDA should probably detect that it's running in a guest and not start unless forced, or something along those lines. I'm not sure of a programmatic way to determine that, but there will be some way for sure. This particular issue has been reported before - see BZ 1178606, 'general protection fault in native_read_pmc while running perf on VMware guest', which was filed against RHEL6.
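As a sketch of what that detection could look like (not something the PMDA does today), any of the usual hypervisor checks would do from a startup script:

    systemd-detect-virt --vm                                       # prints the hypervisor name, or "none" on bare metal
    grep -qw hypervisor /proc/cpuinfo && echo "running in a VM"    # CPUID hypervisor bit exposed to guests
    cat /sys/class/dmi/id/sys_vendor                               # reads "VMware, Inc." on these guests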

[ 134.800273] general protection fault: 0000 [#1] SMP
[ 134.800304] Modules linked in: ext4 mbcache jbd2 rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_core(OE) mlx_compat(OE) coretemp ppdev sg vmw_balloon pcspkr shpchp parport_pc i2c_piix4 parport vmw_vmci nfsd knem(OE) auth_rpcgss ip_tables nfsv3 nfs_acl nfs lockd grace fscache sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel vmwgfx aesni_intel lrw gf128mul glue_helper ablk_helper cryptd serio_raw drm_kms_helper ttm vmxnet3 ahci vmw_pvscsi libahci drm ata_piix libata i2c_core floppy sunrpc dm_mirror dm_region_hash
[ 134.800645] dm_log dm_mod
[ 134.800656] CPU: 0 PID: 2934 Comm: salt-minion Tainted: G OE ------------ 3.10.0-327.13.1.el7.x86_64 #1
[ 134.800691] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[ 134.800735] task: ffff880073299700 ti: ffff88007977c000 task.ti: ffff88007977c000
[ 134.800762] RIP: 0010:[] [] native_read_pmc+0x6/0x20
[ 134.800796] RSP: 0000:ffff88007ce03ef0 EFLAGS: 00010083
[ 134.800814] RAX: ffffffff81957ee0 RBX: 0000000000000000 RCX: 0000000040000002
[ 134.800838] RDX: 0000000051c31ddb RSI: ffff88007ce17fa8 RDI: 0000000040000002
[ 134.800863] RBP: ffff88007ce03ef0 R08: 000000000000001b R09: 00007fff2dc71714
[ 134.800887] R10: 0000000000000001 R11: 00007ff043fd9c40 R12: ffffffff80000001
[ 134.800910] R13: ffff880077763400 R14: ffff880077763578 R15: 0000000000000010
[ 134.800934] FS: 00007ff045117740(0000) GS:ffff88007ce00000(0000) knlGS:0000000000000000
[ 134.800960] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 134.800979] CR2: 0000000001342220 CR3: 0000000077c78000 CR4: 00000000001407f0
[ 134.801030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 134.801082] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 134.801106] Stack:
[ 134.801115] ffff88007ce03f28 ffffffff81029e03 0000000000000000 ffff880077763400
[ 134.801146] ffff88007ce17fb4 00007ff044ee9540 00007ff044ee9710 ffff88007ce03f38
[ 134.801175] ffffffff8102a079 ffff88007ce03f60 ffffffff811591fe ffff88007323fd90
[ 134.801205] Call Trace:
[ 134.801215]
[ 134.801223]
[ 134.801234] [] x86_perf_event_update+0x43/0x90
[ 134.801252] [] x86_pmu_read+0x9/0x10
[ 134.801272] [] __perf_event_read+0xfe/0x110
[ 134.801294] [] flush_smp_call_function_queue+0x5d/0x130
[ 134.801318] [] generic_smp_call_function_single_interrupt+0x13/0x30
[ 134.801345] [] smp_call_function_single_interrupt+0x27/0x40
[ 134.801371] [] call_function_single_interrupt+0x6d/0x80
[ 134.801393]
[ 134.801401] Code:
[ 134.801411] c0 48 c1 e2 20 89 0e 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00 00 55 89 f0 89 f9 48 89 e5 0f 30 31 c0 5d c3 66 90 55 89 f9 48 89 e5 <0f> 33 89 c0 48 c1 e2 20 48 09 c2 48 89 d0 5d c3 66 2e 0f 1f 84
[ 134.801600] RIP [] native_read_pmc+0x6/0x20
[ 134.801633] RSP

Interestingly, both the perfevent PMDA and salt-minion were running when the crash occurred, and both were reading a perf event:

crash> ps | grep '^>'

2934 1 0 ffff880073299700 RU 3.3 712948 69680 salt-minion
4958 4950 1 ffff880073355c00 RU 0.2 76468 3412 pmdaperfevent

crash> bt 4958
PID: 4958 TASK: ffff880073355c00 CPU: 1 COMMAND: "pmdaperfevent"
#0 [ffff88007cf05e70] crash_nmi_callback at ffffffff810458f2
#1 [ffff88007cf05e80] nmi_handle at ffffffff8163e8d9
#2 [ffff88007cf05ec8] do_nmi at ffffffff8163e9f0
#3 [ffff88007cf05ef0] nmi_restore at ffffffff8163dd13
[exception RIP: generic_exec_single+314]
RIP: ffffffff810e687a RSP: ffff88007323fd90 RFLAGS: 00000202
RAX: 00000000000008fb RBX: ffff88007323fd90 RCX: 0000000000000000
RDX: 00000000000008fb RSI: 00000000000000fb RDI: 0000000000000286
RBP: ffff88007323fdd8 R8: 0000000000000001 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000001 R14: ffff880077763400 R15: ffff88007323fea0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- ---
#4 [ffff88007323fd90] generic_exec_single at ffffffff810e687a
#5 [ffff88007323fde0] smp_call_function_single at ffffffff810e697f
#6 [ffff88007323fe10] perf_event_read_value at ffffffff811584e2
#7 [ffff88007323fe40] perf_event_read_value at ffffffff81158533
#8 [ffff88007323fe80] perf_read at ffffffff81158cf0
#9 [ffff88007323ff08] vfs_read at ffffffff811de4ec
#10 [ffff88007323ff38] sys_write at ffffffff811df03f
#11 [ffff88007323ff80] sysret_check at ffffffff81645ec9
RIP: 00007f72e884222d RSP: 00007fff33242fd8 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffffff81645ec9 RCX: 0000000000000001
RDX: 0000000000000018 RSI: 0000000000789250 RDI: 0000000000000006
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000051c2fbf5
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000789b48
R13: 0000000000000000 R14: 0000000000789568 R15: 0000000000789250
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b

ghost commented Aug 16, 2016

Thank you for looking into this. Here are some more vmcore dumps in case you are interested. The first one shows "swapper" as the process which triggered it, so salt-minion isn't really at fault here; any process that happens to make a call hitting this path could trigger the bug.

https://drive.google.com/open?id=0B6emnuNXtougWXktR0l2VGRuYVk
https://drive.google.com/open?id=0B6emnuNXtougUEEwU183ZjFMNUU

I sure hope the calls which trigger the issue can only be run by a privileged user, otherwise this is a security issue as well. Do you know whether that is the case?
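For what it's worth, unprivileged access to the perf syscalls is governed by the kernel.perf_event_paranoid sysctl; by default non-root processes can still open per-process events, so tightening it is worth considering while the kernel bug is outstanding, although it does not fix the underlying fault:

    sysctl kernel.perf_event_paranoid   # -1 = no restrictions ... 2 = user-space-only measurements for non-root
    echo 'kernel.perf_event_paranoid = 2' > /etc/sysctl.d/99-perf.conf
    sysctl -p /etc/sysctl.d/99-perf.conf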

fche commented Sep 1, 2016

mgoodwin kindly filed a RHEL7.2 kernel bug for this problem. (Sorry, it's set to 'private' at the moment.)
https://bugzilla.redhat.com/show_bug.cgi?id=1370023

@natoscott
Member

From the discussion in Mark's RH BZ, this appears to be a VMware issue outside of PCP. There's no kernel code in PCP, and kernel panics are definitely not something we can fix.
