Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POWER9] OCC locks up over time if power saving modes enabled #26

Open
madscientist159 opened this issue Jul 11, 2019 · 30 comments

Comments

@madscientist159
Copy link
Contributor

commented Jul 11, 2019

After a random amount of time, the OCC locks up and downclocks the host if default Linux power saving modes are enabled. When this happens, not only is the machine slowed down but fan controls are completely disabled and a full host power-off reboot is required to restore functionality. opal-prd occ reset is sometimes able to restore proper clocks without a reboot but the fan controls remain offline without a full reboot.

Host dmesg:

[32809.485411] powernv-cpufreq: PMSR = 6363630040000000
[32809.520296] powernv-cpufreq: CPU Frequency could be throttled
[32844.323314] powernv-cpufreq: PMSR =  12d630000000000
[32844.356229] powernv-cpufreq: CPU Frequency could be throttled

BMC dmesg:

[390435.513335] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390435.521174] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.569249] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.575975] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.583663] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[390436.706403] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390436.713044] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390436.720885] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390437.780615] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390437.787314] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390437.795153] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390441.406828] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390441.413412] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390441.421240] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a
[390442.895970] sbefifo 01:01:00:06: SBE error cmd a4:04 status=00fe:000a
[390442.902534] sbefifo 01:01:00:06: SBE FFDC package len 9 words but only 6 remaining
[390442.910341] occ sbefifo2-dev0: SRAM attn returned failure status: 00fe000a

The following command, run at host start, completely prevents the issue from occurring, but is undesirable as the machines use significantly more power at idle:

echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable

Host Linux version: 4.19.0-5-powerpc64le
OCC GIT hash: 58e422d

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

Were any eSELs generated to the BMC? There should be HTMGT traces in /var/log/syslog or /var/log/messages on the host (collect SOS data). You should also confirm that the opal-prd service is running on the host (services opal-prd status)

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

The only thing in /var/log/messages is:

Jul 10 20:52:03 host kernel: [32809.485411] powernv-cpufreq: PMSR = 6363630040000000
Jul 10 20:52:03 host kernel: [32809.520296] powernv-cpufreq: CPU Frequency could be throttled
Jul 10 20:52:38 host kernel: [32844.323314] powernv-cpufreq: PMSR =  12d630000000000
Jul 10 20:52:38 host kernel: [32844.356229] powernv-cpufreq: CPU Frequency could be throttled

opal-prd did record the error and a ton of debugging information however in /var/log/opal-prd.log See attached.
opal-prd-occ-fail.log

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

Looks like OCC generated a log:

Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:>>processOccAttn(0x0x7fffb94241d0)
...
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:E>elogProcessActions: OCC1 requested a WOF reset
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:E>elogProcessActions: WOF Reset Reasons for OCC1 = 0x00000040
Jul 10 20:52:03 host opal-prd: HBRT: HTMGT:bldErrLog(mod: 0x03, rc: 0x2ADF=WOF_DISABLED, data: 0x00000040 00000040 00000000 11030206, sev: 0x00, fw:n

This indicates that there was a VFRT timeout (WOF_RC_VFRT_REQ_TIMEOUT 0x00000040) on the second processor and the OCC requests a reset to try to recover. There were some SCOM errors during the reset which is not expected. After the reset, the OCC in first processor was not responsive:

Jul 10 20:52:09 host opal-prd: HBRT: HTMGT:processOccStartStatus(Start Success=y, failedOcc=0x00000000)
Jul 10 20:52:24 host opal-prd: HBRT: HTMGT:E>_waitForOccCheckpoint OCC0 still NOT ready! (last checkpoint=0x0000)
Jul 10 20:52:24 host opal-prd: HBRT: HTMGT:bldErrLog(mod: 0x0B, rc: 0x2605, data: 0x00000000 00000000 00000000 00000000, sev: 0x20, fw:n
Jul 10 20:52:24 host opal-prd: HBRT: ERRL:E>Error created : PLID=8900004A, RC=2605, Mod=0B, Userdata=0000000000000000 0000000000000000, Sev=PREDICTIVE
Jul 10 20:52:25 host opal-prd: HBRT: HTMGT:E>OCCs not all active (rc=0x2605).  Attempting OCC Reset

After this next reset, the OCCs did seem to recover:

Jul 10 20:52:29 host opal-prd: HBRT: HTMGT:processOccStartStatus(Start Success=y, failedOcc=0x00000000)
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:OCC0 Poll change: Status:C300 Occs:03 Cfg:00 State:03 Error:00000000/00000000
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:pollRspHandler: updating OCC0 state to ACTIVE
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:OCC1 Poll change: Status:0300 Occs:02 Cfg:00 State:03 Error:00000000/00000000
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:pollRspHandler: updating OCC1 state to ACTIVE
...
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:_setOccState: All OCCs have reached state 0x03
Jul 10 20:52:38 host opal-prd: HBRT: HTMGT:setOccActiveSensors: active

At this point both OCCs are active and running and I do not see any further errors. From this trace it shows that there should have only been a few minutes in time where the frequency would have been impacted (due to OCCs in recovery.) I believe there should have been eSELs generated.

You can confirm the current state of the OCCs with:

occtoolp9 -I

If states are active you can send poll commands to the OCCs:

occtoolp9 -p
occtoolp9 -p -o 1

So you are still seeing performance impacts? What is the governor sent to?

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

@cjcain The system in question was a critical production system. It has since been restarted with the stop modes disabled so I can't gather current diagnostic information about the fault.

That being said, our rack management systems confirm that the OCC did /not/ fully recover. At minimum the BMC was unable to reconnect to the OCC for fan management, and our experience is that once the server enters this state it tends to fail with a hard CPU lockup within a week.

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

Was BMC data collected after the failure? (that should have the eSEL data)

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

No, we tend not to gather eSEL data since we have no way of decoding / parsing it with the open source tools. My understanding was that opal-prd gathers basically the same diagnostic information, is this not the case?

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

opal-prd gathers some error data and sends it to the BMC. The system should be handling the recovery attempts, but if there is a persistent error or need for service, then the data will be sent to the BMC in the form of an eSEL. There is also other data on the BMC that is not known by the host. The BMC also maintains some data across power cycles which the host does not always do. You can run dreport on the BMC to collect data.

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

@cjcain Thanks for the info. I'm of the opinion that the OCC shouldn't be crashing at all when stop states are enabled -- we've gone well over a year without any OCC problems with the stop states disabled, but as soon as they are enabled OCC goes down within a few days.

The eSEL data may not be too useful (see the following event list). You can see that the failure isn't isolated to a single event; just trying to run the server with stop states enabled generated two faults in four days:

Entry  | ID            | Timestamp            | Serviceable  | Severity  | Message
1      | FQPSPCR0063F  | 2019-07-08 21:23:24  | No           | Warning   | Failed to read from the OCC on CPU 0
2      | FQPSPCR0063F  | 2019-07-08 21:23:26  | No           | Warning   | Failed to read from the OCC on CPU 1
3      | FQPSPCR0063F  | 2019-07-11 01:52:11  | No           | Warning   | Failed to read from the OCC on CPU 0
4      | FQPSPCR0063F  | 2019-07-11 01:52:14  | No           | Warning   | Failed to read from the OCC on CPU 1
5      | FQPSPCR0063F  | 2019-07-11 01:52:36  | No           | Warning   | Failed to read from the OCC on CPU 0
@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

What level/commit are you using for hostboot and hostboot-binaries?

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

Hostboot 1e1ea69f, binaries (ring files) 2b4eedbd, HCODE 003286ea4

@rbatraAustinIBM

This comment has been minimized.

Copy link

commented Jul 11, 2019

@madscientist159 There was PGPE Hcode commit merged back in Feb/Mar that exacerbated a long standing bug. This fail signature of this issue matches with that. This issue was discovered end of May and a fix was put our in early Jun. Specifically, this commit fixes that issue and was pulled into open power on Jun 6th
open-power/hcode@cade833

Please pull that in and re-try. If the problem persists, then let me know.

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 14, 2019

@rbatraAustinIBM Unfortunately the problem persists -- in fact it's gotten worse, when it disabled the BMC fan controls it locked one of the CPU fans to low speed so we had a box with a significantly overheating CPU until it was caught. Notably we have two machines that seem far more susceptible to this than the others, I'm not sure if this is because of workload or potentially marginal / faulty CPUs.

The SEL recorded a single instance of the same failure code as before on CPU1. opal-prd shows crash / recovery (attached), but the BMC again was unable to reestablish sensor reads so it locked the fan to a setting that could damage hardware.

BMC dmesg:

[310226.808382] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310226.816063] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[310227.862235] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[310227.868921] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310227.876745] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a
[310249.641800] sbefifo 00:00:00:06: SBE error cmd a4:04 status=00fe:000a
[310249.648524] sbefifo 00:00:00:06: SBE FFDC package len 9 words but only 6 remaining
[310249.656352] occ sbefifo1-dev0: SRAM attn returned failure status: 00fe000a

opal-prd-fail.log

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 14, 2019

Digging a little further, the BMC was eventually able to restore communication, but it doesn't look like the OpenBMC userspace is written to handle the OCC dropping off and reappearing. Our watchdog (added in because the IBM thermal monitoring would lock up and drop off for a variety of reasons) looks for occ_error to be non-zero -- during / after this event occ_error remained zero so the userspace processes to read CPU1 temperatures on the BMC remained hung.

Until the BMC software stack can reliably handle the OCC disappearing and reappearing, I'm considering this a serious bug -- one of the main concerns is who has liability right now for CPUs that are destroyed due to overheating once the OCC crashes. If there's anything I can do to assist with debugging this please let me know ASAP.

Thanks!

@marthabroyles

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

There are two different issues here, one is the OCC going thru a reset which the BMC must handle and this should not be causing fans to go low, fans should go high when the BMC detects that it cannot get temperatures anymore from the OCC. The BMC team should be looking at the fan issue. Did you gather the esels from this last time so we can see why the OCC was reset?

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2019

@marthabroyles What's happening is the BMC user space isn't recovering from the OCC dropping off / reappearing. Working theory right now is that the hwmon DBUS proxy isn't detecting the device node going away, so keeps trying to read from a stale handle, since the process is present and the BMC kernel is able to read the temps after OCC recovery has finished. This means the fan controls are using stale data from when the CPU was in throttled mode, and allowing the reactivated CPU to overheat.

What I still don't understand is why the OCC resets are hitting only two machines out of a large number.

There was only one eSEL generated, for the OCC read failure BMC-side. Since this time the OCC recovered and the BMC fan controls could be manually reset, the machine is still online so I can gather additional info if needed.

@sharkcz

This comment has been minimized.

Copy link

commented Jul 15, 2019

@madscientist159 asked me to comment here, because I've also met the

úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: Pstate set to safe frequency
úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: PMSR = 6363630040000000
úno 05 17:32:30 talos.danny.cz kernel: powernv-cpufreq: CPU Frequency could be throttled

messages, happened twice in cca 1 year on my Talos with their last stable firmware. In both cases it was when I started a KVM based VM. Don't think I can provide more info except kernel versions for host & guest.

@rbatraAustinIBM

This comment has been minimized.

Copy link

commented Jul 15, 2019

@madscientist159 Chris Cain, pointed out from the HBRT log the following information about the image build level you are running with.

Jul 14 13:00:49 host opal-prd: HBRT: ISTEPS_TRACE:pullTraceBuildInfo: image build date = 20190713, build time = 0011, build tag = hw042519a.940

As you can see, the build tag is hw042519a.940 which is April 25th. However, the fix that I mentioned earlier went in on Jun 6th, so you will need to run with HW Image tag which at least Jun 5th level or higher. It should be named hw060519a.940

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2019

@rbatraAustinIBM We build our own HCODE images, the linked patch was applied to our existing tree and the HCODE was rebuilt. That is why the tag is still the older string value.

Was there an additional patch that fixed this problem as well? We don't really want to have to go through the entire months-long re-qualification process starting with beta firmware again, which would be needed if we pulled the newest HCODE at this point.

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

FYI, The symptoms in that last opal-prd-fail.log (from yesterday) were the same as the original failure. Which is why I had @rbatraAustinIBM confirm the code level.

@rbatraAustinIBM

This comment has been minimized.

Copy link

commented Jul 15, 2019

@madscientist159 You mentioned you are seeing this problem on couple of machines only. Have you ever seen the problem on these machines before even with older code level? Also, what was the previous working code level in terms HW image tag?

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2019

@cjcain Understood, and that's why I'm wondering if there was a different patch that is also needed.

@rbatraAustinIBM Yes, we saw this problem basically from day 1. It's persisted across parts swaps (first thought was defective CPU / mainboard) and multiple firmware levels including revisions from back in 2018. That particular machine is a heavily loaded production VM host, which is also why it is not a trivial matter to get approval to test a firmware level against that particular system.

The other system still needs to have a parts swap done as a matter of routine to rule out hardware problems, but may be a better test bed if that pans out as it does not handle the same level of critical workloads.

@marthabroyles

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

The handling of OCC resets should be looked at from the BMC side. The BMC is told whenever the OCC is being reset so the BMC will stop trying to communicate with the OCC and adjust the fans accordingly (higher) for not having temperatures from the OCC. Once the OCC reset is complete and the OCCs are running again the BMC is again informed so it can start communicating with the OCCs again. We use an OCC reset as a recovery method which must work, it is designed to allow 3 OCC resets before we consider there to be a real problem. In this case the OCC reset did recover the OCC but it sounds like the BMC did not recover which is a big problem. OCC resets is a known architected recovery mechanism that the BMC should be handling.

@msbarth

This comment has been minimized.

Copy link

commented Jul 15, 2019

To handle the OCC reset (active status), there is an openpower OCC control application(https://github.com/openbmc/openpower-occ-control) that was created to set an associated OCC active property on dbus and bind/unbind the driver per OCC. The active property controlled by this application would be the signal that can be configured within fan control to set fans to full speed and unload any events that use this active property as a precondition. Is this application being used and is fan control configured with a precondition against all OCC active properties that must pass to allow fan speed changes otherwise fans run at a set speed, i.e. full speed?

Once all OCCs become active again, the fan control's configured precondition would pass, allowing the events configured to be loaded, subscribing to the OCC sensors and enabling fan speed changes. (https://github.com/openbmc/phosphor-fan-presence/blob/master/control/preconditions.cpp#L17-L76)

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 15, 2019

@msbarth Yes, it is being used and it did work as intended in every OCC failure up until the last one. The BMC revision level did not change, for some reason this latest OCC failure was not properly detected by the BMC and therefore the fans did not ramp up as they normally do (the ramped up fans cause a massive temperature drop, which shows on our rack management systems as an alert; this normally the first indication we get that the OCC has failed).

@msbarth

This comment has been minimized.

Copy link

commented Jul 16, 2019

@madscientist159 Ok, I've mentioned this issue to the maintainer of the OCC control application to take a look into that OCC-to-BMC communication path.

@eddiejames

This comment has been minimized.

Copy link

commented Jul 16, 2019

@madscientist159 I see from the BMC dmesg that there are errors communicating with the OCC. Can you tell if this is before, during, or after the OCC was reset?

What kernel is on your BMC?

You also said "the BMC kernel is able to read the temps after OCC recovery has finished." This means that the BMC recovered after the OCC reset fine? I don't think it's possible there is stale temperature data coming from the BMC OCC driver. If it can't get the poll response, it will return an error when the hwmon files are read.

@madscientist159

This comment has been minimized.

Copy link
Contributor Author

commented Jul 16, 2019

@eddiejames The dmesg failure entries on host and guest start at exactly the same time down to the second. I'm not comfortable trying to infer order below 1 second resolution since the BMC and host dmesg stamps can drift slightly compared to one another.

Yes, the BMC kernel recovered after the OCC reset this time. The BMC userspace for whatever reason did not. Normally the OCC reset knocks the BMC kernel out too, requiring a reboot of host and BMC.

Kernel 5.0.7 is on the BMC.

@eddiejames

This comment has been minimized.

Copy link

commented Jul 17, 2019

Normally the OCC reset breaks the BMC? That seems odd.

But there were a couple of fixes put into 5.1 for the OCC driver that could be helpful here. Specifically:

openbmc/linux@b5c46a5

Without this, it was possible to get stale data.

@rbatraAustinIBM

This comment has been minimized.

Copy link

commented Jul 17, 2019

@madscientist159 Ok, so we still need to understand why the failure occured in the first place. If you have the commit, then we would need to debug this further. Can you please gather the BMC FFDC at the point of failure?

@cjcain

This comment has been minimized.

Copy link
Contributor

commented Jul 23, 2019

Let us first confirm that the PM Complex can get reset and still maintain communication with the BMC.
Copy the occtoolp9 script to the host.
Boot your system so that the OCCs are active and running. You can verify by running the following on the host:

#  occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
OCC0: Master ACTIVE (0x03) resetCount:0 wofResets:0  pollRsp:0xc3000100...

Then you can manually issue a reset (takes ~15 seconds) and then you should see the OCCs go active again and the reset count increment:

# opal-prd occ reset 
# ./occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
           : PM Complex Resets since boot: 1
OCC0: Master ACTIVE (0x03) resetCount:1 wofResets:0  pollRsp:0xc3000100...

If you continue to issue reset commands (total of 4), you can verify the system goes to safe mode:

# occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: ACTIVE (0x03), 1 OCC (master:OCC0), resetCount:0
           : PM Complex Resets since boot: 3
OCC0: Master ACTIVE (0x03) resetCount:3 wofResets:0  pollRsp:0xc3000100...
# opal-prd occ reset
# ./occtoolp9 -I
==> opal-prd --expert-mode htmgt-passthru 0x01
HTMGT State: SAFE   (0x04), 1 OCC (master:OCC0), resetCount:0,  (SAFE MODE: rc=0x2616/OCC0)
           : PM Complex Resets since boot: 4
OCC0: Master RESET  (0x85) resetCount:4 wofResets:0 FAILED  pollRsp:0xc3000100...

The OCCs/PM Complex should be held in reset and the system is running in safe mode. At this point, an eSEL should have been generated and sent to the BMC. This test should verify the full reset/recovery process as well sending eSELs to the BMC. You can send me the BMC FFDC and opal_prd.log if you need help debugging. To recovery you can just reboot the system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.