Monitor for faults at runtime #1732

Closed
spinler opened this issue Jun 6, 2017 · 6 comments

@spinler
Contributor

spinler commented Jun 6, 2017

At runtime, there are a handful of faults to detect from STATUS_WORD. We need error logs and real-time status.

Detection can be polled or interrupt driven. This is the same application as the standby one; it just takes a different argument when started.

@bjwyman
Contributor

bjwyman commented Aug 10, 2017

Task List

  • Update standby fault monitor to detect power on state (Monitor for faults at standby #1731)
  • After IPL complete, call out power supply if PG# or UNIT_IS_OFF shows fault.
  • After IPL complete, call out power supply if IOUT_OC_FAULT (overcurrent fault seen).
  • After IPL complete, call out power supply if VOUT_OV_FAULT (output over voltage fault).
  • After IPL complete, call out both power supplies if either has the TEMPERATURE_FAULT bit turned on (over temperature condition; the paired supply may be putting out less current).
  • After IPL complete, call out power supply if FAN_FAULT is on (bad fan operation detected).
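
The runtime checks in the task list above amount to testing individual STATUS_WORD bits. Below is a minimal sketch of that decode, written in Python for brevity. The monitor's actual code is not shown in this thread, so `RUNTIME_FAULT_BITS` and `runtime_faults` are hypothetical names. The bit positions assume the standard PMBus STATUS_WORD layout; the fault names are the ones used in this issue.

```python
# Assumed STATUS_WORD bit masks (standard PMBus layout), keyed by the
# fault names used in the task list. Order follows the task list.
RUNTIME_FAULT_BITS = {
    "PG#": 0x0800,                # POWER_GOOD negated
    "UNIT_IS_OFF": 0x0040,        # unit is not providing power
    "IOUT_OC_FAULT": 0x0010,      # output overcurrent
    "VOUT_OV_FAULT": 0x0020,      # output overvoltage
    "TEMPERATURE_FAULT": 0x0004,  # temperature fault or warning
    "FAN_FAULT": 0x0400,          # fan fault or warning
}


def runtime_faults(status_word):
    """Return the names of the runtime fault bits set in status_word."""
    return [name for name, mask in RUNTIME_FAULT_BITS.items()
            if status_word & mask]
```

Under these assumptions, the two values reported in the 08/23 status entry decode as PG# only for 0x0800, and PG# plus UNIT_IS_OFF for 0x0840.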

@bjwyman
Contributor

bjwyman commented Aug 10, 2017

Separate story -> If power on fails due to time out or fault status, check STATUS_WORD for PG#, IOUT_OC_FAULT, VOUT_OV_FAULT, TEMPERATURE_FAULT, FAN_FAULT, or MFR_SPECIFIC. Log SEL pointing to the power supply that shows any of those fault bits.
#1797
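
As a rough illustration of the power-on failure check described above, the sketch below picks out which supplies a SEL would point to, given each supply's STATUS_WORD. It is an assumption-laden example, not the code tracked in #1797: `POWER_ON_FAULT_MASK`, `supplies_to_call_out`, and the supply names are hypothetical, and the bit positions assume the standard PMBus STATUS_WORD layout.

```python
# Assumed STATUS_WORD masks for the bits listed in the comment above.
POWER_ON_FAULT_MASK = (
    0x0800    # PG# (POWER_GOOD negated)
    | 0x0010  # IOUT_OC_FAULT
    | 0x0020  # VOUT_OV_FAULT
    | 0x0004  # TEMPERATURE_FAULT
    | 0x0400  # FAN_FAULT
    | 0x1000  # MFR_SPECIFIC
)


def supplies_to_call_out(status_words):
    """Given {supply_name: STATUS_WORD}, return the supplies showing
    any of the power-on fault bits, in the order they were supplied."""
    return [name for name, word in status_words.items()
            if word & POWER_ON_FAULT_MASK]
```

For example, `supplies_to_call_out({"psu0": 0x0000, "psu1": 0x0800})` selects only `psu1`.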

@bjwyman
Contributor

bjwyman commented Aug 11, 2017

STATUS

  • 08/11:
  • 08/12:
  • 08/17:
  • 08/18:
    • Today: Tested code for detecting power state. Discovered at least one Witherspoon system with power supplies behaving strangely. They report the wrong CCIN, indicate invalid command, and thus the device driver cannot show in1_alarm or power1_alarm. Tested on system with single power supply, all looked good (after minor tweak). Pushed code up for review.
    • Next: Add checks for pgood off or UNIT_IS_OFF on, and log an error.
    • Blockers: Monitor for faults at standby #1731.
  • 08/21:
    • Today: Talked with Spinler about the missing STATUS_MFR_SPECIFIC; apparently I overlooked something, and it looks to be status0_mfr. Have a start on code to check PG# and UNIT_IS_OFF, but having trouble with metadata that includes a callout to the inventory path.
    • Tomorrow: Try to figure out how to get a call out and include the other bytes of data in the metadata.
    • Blockers: Monitor for faults at standby #1731.
  • 08/23:
    • Today: Worked with Deepak and Spinler to fix the call out and metadata problem. Tested on a system with two power supplies, with the top supply set up to require both the enable line and a command to power on. Fault logged when power state changed to on (bug with repeated logging fixed). Pushed up code for review (removed prior WIP label, added reviewers).
    • Tomorrow: Noticed that the second repeat of the test logged a fault for both supplies, one 0x0800 and the other 0x0840, but only the top supply was expected to have fault bits on. Timing issue? Move on to overcurrent fault.
    • Blockers: Monitor for faults at standby #1731
  • 08/24:
    • Today: Discussed timing of POWER_GOOD Negated going low with Jordan. Older power supplies could apparently take up to a second to change that bit, newer ones should be faster, but it has not been measured. Addressed review comments/fixes from Matt Spinler. Did not get to the over current fault checking.
    • Tomorrow: Address more review comments in earlier commit. Add in additional fault checks.
    • Blockers: Monitor for faults at standby #1731
  • 08/25:
    • Today: More review comments/updates. Re-tested code for VIN_UV_FAULT and INPUT_FAULT. Re-tested the should-be-on fault and noted odd behavior with PG# not changing in a timely fashion. Retried on another system with one supply, not an official Witherspoon power supply; it did not have that problem. Tried on a third system with two supplies, similar to the first, and it had the same odd PG# behavior. Problems with INPUT_FAULT clearing found and fixed. Added in support for output overcurrent detection and error logging.
    • Next: Get information on power supplies, try to track down reason for odd PG# behavior. Add in support for output over voltage fault.
    • Blockers: Monitor for faults at standby #1731
  • 08/28:
    • Today: Checked for feedback on odd PG# behavior (nothing). Refactored the analysis function, splitting off the various blocks of bit checking and error logging into separate functions.
    • Tomorrow: Look for power supply PG# feedback. Test the refactored analysis code. Add in support for output over voltage fault.
    • Blockers: Monitor for faults at standby #1731
  • 08/29:
    • Today: Hopefully got rebased in line with Monitor for faults at standby #1731 to avoid merge conflicts. Re-tested standby input faults after a refactor. Re-tested PG# or off when should be on. E-mail replies from Mike Miller: apparently the power supply can take up to one second to change that PG# bit. The power supplies I have seen are early bring-up levels; I may want to do fault testing with GA level power supplies.
    • Tomorrow: Look for feedback on testing over current condition. Look for feedback on updating power supply firmware. FINALLY add in the output over voltage fault detection?
    • Blockers: Monitor for faults at standby #1731.
  • 08/30:
    • Today: Addressed review feedback (after doing quite a few reviews for others). Somehow dropped a commit, so had to redo some things to get correct updates pushed to Gerrit. Added in some code for output overvoltage detection. Need to test those changes before pushing. Noticed some other possibilities for refactoring to avoid some duplicate code. Replies on fake fault testing and firmware updates. There are some commands to fake out the various fault bits getting set; waiting on the PMBus version of sending those for testing. Updating the power supply firmware could be done, but some of the updates require hardware changes, without which bad things could happen.
    • Tomorrow: Hopefully test the output over current and output overvoltage fault conditions via fake out. Look for any additional review feedback. Revisit the power on 1 second time out before checking for PG# and UNIT_IS_OFF.
    • Blockers: Monitor for faults at standby #1731.
  • 08/31:
    • Today: Addressed review feedback. Received information on commands to inject fake OC, OV, and OT faults, but they did not work on the system I was using to test. CML bit turned on. Added in code for detecting bad fan operation. Sent e-mail to Jordan regarding call outs requested for temperature fault condition.
    • Next: Look for updates on fake fault inject. Add temperature fault detection and logging. Address any further review feedback.
    • Blockers: Monitor for faults at standby #1731.
  • 09/07:
  • 09/08:
    • Today: Addressed review comments. Added a timer for the 1 second delay before checking PG#, etc. Read e-mail updates for the fault injection commands.
    • Next: Add temperature fault detection and logging. Address any further review feedback.
    • Blockers: Monitor for faults at standby #1731
  • 09/11:
    • Today: Re-did a bunch of commits due to PMBus::read() change.
    • Tomorrow: Temperature fault detection and logging, any further changes due to PMBus::read(). Look for and address further review feedback.
    • Blockers: Monitor for faults at standby #1731.
  • 09/12:
    • Today: Changes to work with new PMBus::read() from Monitor for faults at standby #1731. Re-test, more changes, re-test again, push, build failures on four commits, fix one commit, rebase addressed the other failures. Noted odd behavior with fake fan fault. May need changes, or test with manually stopped rotor.
    • Tomorrow: Updates for metadata will be needed. Track down commit with the helper class. Add in temperature fault checking and logging.
    • Blockers: Monitor for faults at standby #1731.
  • 09/13:
    • Today: Cherry picked changes on top of commit with helper function for metadata. Redo/update code to use the new helper for logging errors with metadata. Added in final piece of code to check for temperature fault or warning condition.
    • Tomorrow: Do some re-testing, especially fan fault (inject commands had some odd behavior). Address review comments. Check into possible power state property checking concern if app started during power on state (reset/reload, reboot conditions).
    • Blockers: Monitor for faults at standby #1731, and https://gerrit.openbmc-project.xyz/#/c/6579/.
  • 09/14:
    • Today: Some information from Jordan on fault clearing. Realized the clear of faults on presence change was missing, so updated all the commits with new code for detecting the faults. Also noticed that the bind/unbind story that will affect presence and device driver binding is now merged.
    • Tomorrow: Do the re-testing I had planned for today, include remove/replace power supply in testing. Power state property checking?
    • Blockers: Monitor for faults at standby #1731, and https://gerrit.openbmc-project.xyz/#/c/6579/.
  • 09/15:
  • 09/18:
  • 09/19:
    • Today: Merge, fetch, rebase, ... changes. Discussion(s) about over temperature fault not working, PMBus core driver apparently clearing bit(s). Re-test with proposed changes had same problem.
    • Tomorrow: Figure out how to address problem with thermal fault bits getting cleared.
    • Blockers: Monitor for faults at standby #1731.
  • 09/20:
    • Today: Read e-mail reply from Jordan on odd temperature fault injection results. They are looking into it. Made a few minor changes for review comments/feedback.
    • Tomorrow: Look for update on temperature fault, possibly some code changes to address the problem. If I get a final commit, be sure to put the Resolves tag into it.
    • Blockers: Monitor for faults at standby #1731
  • 09/22:
    • Today: Installed newer power supply in bottom slot (REV:08?). The top supply cannot be completely removed, so left it pulled out. Chassis powered on, injected over temperature fault; error logged for should be on, not over temperature. System shut down (only one supply). Using i2cget, I could see the bits for the thermal fault turn off and on; assuming this is still due to PMBus core doing the CLEAR_FAULTS command.
    • Next: Consider adding in read of STATUS_MFR to log the thermal fault. A bit in there appears to be sticking.
    • Blockers: Monitor for faults at standby #1731.
  • 09/25:
    • Today: Finally got time on a JEMT system with ship level power supplies. More difficulties with testing over-temperature fault. Files sometimes missing, not clear why. Opened issue Power supply PMBus files missing on some systems #2364. Added a series of changes on top of the over-temperature commit to work around the missing file issue that may crop up, along with an order change for which faults to check first. FINALLY have a Resolves commit.
    • Tomorrow: Look for review feedback.
    • Blockers: Monitor for faults at standby #1731.
  • 09/26:
  • 09/28:
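
The status entries above mention a 1 second delay after power on before PG# and UNIT_IS_OFF are checked, since the supplies can take up to a second to update that bit. A minimal sketch of that gating logic follows, in Python for brevity. It is only an illustration under stated assumptions: `PowerOnMonitor`, its methods, and `PG_SETTLE_SECONDS` are hypothetical names, and the monitor in this issue uses a timer whose implementation is not shown in the thread.

```python
import time

# From the thread: supplies may take up to one second to update PG#.
PG_SETTLE_SECONDS = 1.0


class PowerOnMonitor:
    """Tracks when power was turned on and gates the PG# check."""

    def __init__(self, clock=time.monotonic):
        # clock is injectable so the logic can be exercised without sleeping.
        self._clock = clock
        self._powered_on_at = None

    def power_state_changed(self, powered_on):
        """Record a power state change; a power off cancels the wait."""
        self._powered_on_at = self._clock() if powered_on else None

    def should_check_pgood(self):
        """True once power has been on for at least the settle time."""
        if self._powered_on_at is None:
            return False
        return self._clock() - self._powered_on_at >= PG_SETTLE_SECONDS
```

Polling code would call `should_check_pgood()` before reading STATUS_WORD for the should-be-on fault, which avoids logging an error for a supply that is still ramping up.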

@bjwyman
Contributor

bjwyman commented Aug 31, 2017

Removing this task "Update PMBus hwmon directory when power supply goes from missing to present."
It will be handled in #2205.

@bjwyman
Contributor

bjwyman commented Sep 8, 2017

The desired call out of both/all power supplies for the temperature fault was noted as not an easy task. I communicated that to Jordan, along with the plan to call out only the one supply reporting the fault. He has made a note about that.

@rfrandse

https://gerrit.openbmc-project.xyz/6866 Change order of powered on fault checks
Resolves: #1732 Monitor for faults at runtime
