pmcd causes complete system lockup on CentOS 7 on VMware #107
Jeff, this is very odd and does not match any failure scenario I've seen with PCP over, er, the last 22 years. Let's start with some configuration stuff: what sort of distro is this (flavour and version)? What version of PCP is installed? And can you please post the contents of /etc/pcp/pmcd/pmcd.conf and /var/log/pcp/pmcd/pmcd.log (preferably from a failed start, but that may not be possible if the filesystem cache is not flushed as the system is taken down, in which case the pmcd.log file from a successful start of pmcd may be helpful)?
Yup, this is a very odd issue. It wouldn't be the first time I've found a kernel bug, so let's hope it isn't that. I'm now also seeing problems on my physical machines, though I don't know if that's related. The curious bit is that since installing PCP I've noticed the VMs sometimes hang on shutdown, even if they weren't running pmcd at the time but had it (and other PCP things) installed and possibly running at some point before the reboot. My physical machines are now doing that too. Does PCP load or include any kernel modules when it is started? All systems in question run CentOS 7, kernel 3.10.0-327.13.1.el7.x86_64. Here are the files you requested:
> By lock up I mean the machine is completely dead. Not 100% CPU busy, not out of memory, etc. but completely unresponsive

This sounds like a kernel / hardware problem (like Ken, I've never come across anything like this, nor seen reports from anyone else along these lines, FWIW). You may have some success extracting additional diagnostics via a kernel debugger, and/or sysrq 't' from the console - https://www.kernel.org/doc/Documentation/sysrq.txt

> Does PCP load or include any kernel modules when it is started?

No, it doesn't. Your log file and configuration file look in good shape too - no warnings/errors there, so from a PCP point of view everything looks normal with your machine setups.
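A minimal sketch of driving sysrq from a root console, per the document linked above (the procfs paths are the standard kernel interfaces; whether the output survives a hard hang is another matter):

```shell
# Enable all sysrq functions (0 disables, 1 enables everything;
# see Documentation/sysrq.txt for the finer-grained bitmask values).
echo 1 > /proc/sys/kernel/sysrq

# Dump a stack trace of all tasks to the kernel ring buffer -
# equivalent to pressing Alt-SysRq-t on the console.
echo t > /proc/sysrq-trigger

# Read the resulting traces (they also land in /var/log/messages via rsyslog).
dmesg | tail -n 100
```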
@natoscott, a hardware issue is unlikely as we're talking about VMs hanging when pmcd runs and numerous physical machines exhibiting odd behavior like hanging on shutdown. I agree that the kernel is a likely source of the problem, as this is indeed a kernel crash; pmcd is somehow triggering it. After much fighting I have been able to get kdump working and I have crash dumps of the problem. One of them is here: https://drive.google.com/open?id=0B6emnuNXtougZmhGSWhrS2ZJZVU I also have this poorly made screencast demonstrating the behavior by simply starting pmcd: I'm not much of a kernel engineer, so I'm of limited help at this point.
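For anyone else fighting with this, the rough steps to get kdump capturing vmcores on CentOS 7 look something like the following (a sketch using the stock RHEL/CentOS package and parameter names; crashkernel sizing and dump targets vary per system):

```shell
# Install the capture tooling.
yum install -y kexec-tools

# Reserve memory for the capture kernel: add crashkernel=auto to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config.
grub2-mkconfig -o /boot/grub2/grub.cfg

# Enable the capture service; the reservation takes effect after a reboot.
systemctl enable kdump.service
reboot

# After the next crash, the dump appears under /var/crash/<timestamp>/vmcore.
```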
Hi Jeff, I'm looking at the vmcore now - it's a stock RHEL7 kernel:

    pcp-vmcore: Kdump compressed dump v6, system Linux, node dn1, release ...

More later. On Wed, Aug 17, 2016 at 4:46 AM, Jeff White notifications@github.com
The screencast suggests the hang comes about 20 seconds after pmcd starts, which is interesting - it suggests this is NOT an initialization error, but possibly something pmFetch-related or some self-timer driven event in a PMDA. Was pmlogger enabled on this system? Another possible approach is trying to find the PMDA that is responsible (it is unlikely to be pmcd itself). You have 11 PMDAs in /etc/pcp/pmcd/pmcd.conf ... I'd start by commenting about half of them out (insert a # at the start of the line), especially the ones with low-level hardware contact or deep kernel contact, e.g. perfevent, jbd2, nvidia, slurm, xfs, linux, proc. Then try again. If this survives, you may be able to binary-chop your way to identifying which PMDA is the culprit.
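To illustrate the binary-chop: the sed pattern below is my own sketch, not an official PCP tool, and the printf lines just fabricate a tiny stand-in pmcd.conf so the command can be tried safely. On a real system you'd back up and edit /etc/pcp/pmcd/pmcd.conf, then restart pmcd between attempts.

```shell
# Stand-in pmcd.conf with two agent lines (real ones start with the PMDA name).
printf '%s\n' \
  'perfevent 127 pipe binary /var/lib/pcp/pmdas/perfevent/pmdaperfevent' \
  'linux 60 dso linux_init /var/lib/pcp/pmdas/linux/pmda_linux.so' \
  > pmcd.conf

# Comment out the suspect low-level PMDAs by prefixing their lines with '#'.
sed -E -i 's/^(perfevent|jbd2|nvidia|slurm|xfs)([[:space:]])/#\1\2/' pmcd.conf

cat pmcd.conf
```

If the machine survives with half the PMDAs disabled, re-enable half of the disabled set and repeat until the culprit is isolated.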
The vmcore shows it's the perfevent PMDA - this is a VMware guest, and there I guess h/w perf events are probably not much use in a virtual machine. For PCP, the perfevent PMDA should probably detect it's in a guest and not ...

    [  134.800273] general protection fault: 0000 [#1] SMP

Interestingly, both the perfevent PMDA and salt-minion were running when ...

    crash> ps | grep '^>'
    crash> bt 4958

On Wed, Aug 17, 2016 at 7:08 AM, Ken McDonell notifications@github.com
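For others reading along, the crash(8) session above boils down to something like this (a sketch; the vmcore path depends on where kdump wrote the dump, and the debug vmlinux comes from the matching kernel-debuginfo package):

```shell
# Symbols for the running kernel are needed to interpret the dump.
yum install -y crash kernel-debuginfo-$(uname -r)

# Open the dump against the debug vmlinux.
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore

# Then, at the crash> prompt:
#   log            - kernel ring buffer (shows the general protection fault)
#   ps | grep '^>' - tasks that were running on a CPU at crash time
#   bt <pid>       - stack backtrace of a given task
```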
Thank you for looking into this. Here are some more vmcore dumps in case you are interested. The first one shows "swapper" as the process which triggered it, so salt-minion isn't really at fault here - it could happen to anything that makes a call which triggers the bug. https://drive.google.com/open?id=0B6emnuNXtougWXktR0l2VGRuYVk I sure hope the calls which trigger the issue can only be run by a privileged user, otherwise this is a security issue as well. Do you know if that is possible?
mgoodwin kindly filed a RHEL7.2 kernel bug for this problem. (Sorry, it's set to 'private' at the moment.)
This appears to be a VMware issue outside of PCP, based on the discussion in Mark's RH BZ. There's no kernel code in PCP, and kernel panics are definitely not something we can fix.
Let me start off by saying I know nothing about PCP. I installed PCP on about 80 compute nodes of an HPC cluster. Most of these are working fine but I noticed 4 virtual machines completely lock up and die shortly after starting pmcd after booting or starting it during boot. The physical systems, which are configured exactly the same, do not lock up. By lock up I mean the machine is completely dead. Not 100% CPU busy, not out of memory, etc. but completely unresponsive. When this happens the system no longer even responds to ping so the kernel itself (or its networking) is dead. However, I do not see a kernel panic and can't get a crash dump so I can't see what is happening. The fact that this userland daemon is somehow killing the kernel but not triggering a kernel panic is very odd and worrying.
Here's how I am using PCP. I am using XDMod which has a plugin, SUPReMM, which requires PCP. I installed and configured PCP via this SaltStack config:
xdmod.sls.txt
I then enabled and started pmcd, pmlogger, and pmie. At this point the VMs will hang in as little as under a minute. Since there are four machines affected and three daemons in the mix, I set the following for which daemon to start on boot (left side being the names of the VMs) to narrow down the issue:
After a number of reboots only dn2 will hang. I also was able to hang a machine by starting pmcd via systemctl after the system booted. Oddly, I have not been able to hit this issue every time I start the daemon. Any clue what is going on or how we can proceed with this issue?