HBRT infinite loop on ECC error during startup #67

ghost · 2016-10-20T03:44:25Z

Start opal-prd and observe this log before opal-prd gets stuck at 100% CPU.

HBRT: PRDF:>>PRDF::main() Global attnType=0004
HBRT: PRDF:>>PRDF::noLock_initialize() 
HBRT: PRDF:>>PegasusConfigurator::build()
HBRT: PRDF:<<PegasusConfigurator::build()
HBRT: PRDF:<<PRDF::noLock_initialize() 
HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(at which point everything stops with opal-prd chewing 100% CPU)

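This turns out to be a fairly classic race: HBRT tries to log an error before everything has been initialized. In the trace above, the ECC error is hit inside setupPnorInfo, while the ErrlManager constructor is still running.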

Consequently, opal-prd spins a core and is right off into the weeds.
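For illustration, here is a minimal sketch of how this kind of ordering problem can turn into a hang. It is not Hostboot code: `ToyErrorLog`, `readLogPartition`, and `setupWithNaiveRetry` are hypothetical names, and the retry loop is an assumption about the general failure pattern rather than the actual HBRT code path. The loop is capped so that the example terminates; the hang corresponds to having no cap.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an error log; not taken from the Hostboot source.
struct ToyErrorLog {
    bool ready = false;                // set only once setup has finished
    std::vector<std::string> entries;  // committed error entries

    // A commit succeeds only after initialization has completed.
    bool commit(const std::string& msg) {
        if (!ready) return false;
        entries.push_back(msg);
        return true;
    }
};

// Simulates a flash read that always hits an uncorrectable ECC error.
inline bool readLogPartition() { return false; }

// Models the problematic ordering: setup reads the partition, the read
// fails, and the failure handler tries to commit to a log that is not
// ready yet. Returns the number of commit attempts made before giving up.
// With no cap on attempts, this loop would never exit.
inline int setupWithNaiveRetry(ToyErrorLog& log, int maxAttempts) {
    int attempts = 0;
    if (!readLogPartition()) {
        while (attempts < maxAttempts) {
            ++attempts;
            if (log.commit("uncorrectable ECC error")) break;  // never true here
        }
    }
    log.ready = true;  // only reached because the loop is capped
    return attempts;
}
```

In this sketch every attempt fails because `ready` is set only after setup returns, so `setupWithNaiveRetry(log, 1000)` uses all 1000 attempts and leaves `log.entries` empty.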

dcrowell77 · 2016-10-20T04:04:26Z

Ironically enough, we just encountered this in a different environment. I'm surprised it took this long. I've got some hacks that need to be ironed out. Since this is a secondary failure, it will go on the back burner for a little while.

ghost · 2016-10-20T06:06:04Z

FWIW, the workaround is to erase the HBEL partition:

pflash -P HBEL -e

and then you'll be okay.

ghost · 2016-10-20T06:31:20Z

Or, rather, okay until you reboot, then you fail to IPL:

  0.61819|ECC error in PNOR flash in section offset 0x00008000

  0.62322|System shutting down with error status 0x60F
  3.22583|Ignoring boot flags, incorrect version 0x0
  3.30757|ISTEP  6. 3
  1.09243|ECC error in PNOR flash in section offset 0x00008000

  1.09246|System shutting down with error status 0x60F
  3.68140|Ignoring boot flags, incorrect version 0x0
  3.76426|ISTEP  6. 3

ghost · 2016-10-20T06:52:34Z

And then you get into the stupid, inescapable mess that is the golden side, and your life is horrible.

But it seems that flashing zeros over the partition, rather than just erasing it, does the trick.
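One plausible explanation, which this thread does not confirm, is that erased flash typically reads back as all 0xFF bytes, and an all-0xFF word is not necessarily a valid ECC codeword, whereas all zeros is a valid codeword for any linear code. The sketch below shows this with a deliberately simple toy code (check byte = XOR of eight data bytes). It is not the ECC algorithm Hostboot uses, and the 8+1 byte layout and all function names are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy check byte: XOR of the eight data bytes. This is a linear code,
// but it is NOT the algorithm used for PNOR ECC.
inline std::uint8_t toyCheckByte(const std::array<std::uint8_t, 8>& data) {
    std::uint8_t check = 0;
    for (std::uint8_t b : data) check ^= b;
    return check;
}

// A 9-byte word (8 data bytes + 1 check byte) is valid when the stored
// check byte matches the one recomputed from the data bytes.
inline bool toyWordIsValid(const std::array<std::uint8_t, 9>& word) {
    std::array<std::uint8_t, 8> data{};
    for (std::size_t i = 0; i < 8; ++i) data[i] = word[i];
    return toyCheckByte(data) == word[8];
}

// Builds a word with every byte, data and check alike, set to one value.
inline std::array<std::uint8_t, 9> filledWord(std::uint8_t fill) {
    std::array<std::uint8_t, 9> w{};
    w.fill(fill);
    return w;
}
```

Under this toy code, `toyWordIsValid(filledWord(0x00))` is true, while `toyWordIsValid(filledWord(0xFF))` is false: the XOR of eight 0xFF bytes is 0x00, which does not match the stored 0xFF check byte.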

Changes Included for package witherspoon-xml, branch master:
  7bec10c - Erich Hauptli - 2017-09-26 - Adding new WOF data
  3d66657 - e-liner - 2017-09-21 - Merge pull request open-power#69 from e-liner/memd_binary
  8b9fa55 - Elizabeth Liner - 2017-09-21 - Adding witherspoon MEMD binary
  ac74311 - Erich Hauptli - 2017-09-21 - Updating Memory Attributes
  c0b9bc1 - William Hoffa - 2017-09-15 - Mark Xbus Targets Deconfigurable (open-power#67)
  5736f3e - Prachi Gupta - 2017-09-08 - sync with common_mrw_xml -- 09/07 (open-power#66)

Changes Included for package hostboot, branch master:
  7f59b42 - Jacob Harvey - 2017-09-26 - Increment red_waterfall for low vdn fix
  ad079f5 - Zane Shelley - 2017-09-26 - PRD: Nimbus DD2.0.1 workaround for nce/tce/mpe/impe
  49d2286 - Matt K. Light - 2017-09-25 - remove cas_latency.H include from p9_mss_freq.H
  4930d04 - Luke Mulkey - 2017-09-25 - Memory buffer vpd accessor functions
  3027cb5 - Thi Tran - 2017-09-25 - L3 Update - p9_hcd_cache_stopclocks HWP
  72b46fb - Ben Gass - 2017-09-25 - Fix DMI scom translation.
  190d346 - Prem Shanker Jha - 2017-09-25 - 24x7: Corrected handling of MCA on a direct attached systems.
  3245f4f - Louis Stermole - 2017-09-25 - Restore original training settings if mss_draminit_training_adv fails
  ecb8cf7 - Sachin Gupta - 2017-09-25 - Added comment for INVALID enum value
  7085e6b - Andre Marin - 2017-09-25 - Add Write CRC attributes to xml and eff_dimm
  84e9979 - Andre Marin - 2017-09-25 - Modify VPD decoder to take into account deconfigured ports
  b6c7737 - Matthew Hickman - 2017-09-25 - Changed two symbol correction disable to mnfg flag DISABLE_DRAM_REPAIRS
  f0e99cd - Nick Klazynski - 2017-09-25 - Core workarounds for multiple issues.
  d58dbd6 - Soma BhanuTej - 2017-09-25 - Nimbus DD22 support updates to ekb
  c0719c3 - Ben Gass - 2017-09-25 - Updates for HW416934 and HW417233
  bc88548 - Elizabeth Liner - 2017-09-25 - Removing first byte from MEMD binary
  0e38c62 - Nick Bofferding - 2017-09-25 - Secure Boot: Direct signature temp files to specific scratch dir
  54c1fc7 - Nick Bofferding - 2017-09-25 - Secure Boot: Support open signing with component IDs
  1b3b999 - Christian Geddes - 2017-09-25 - Set variables to nullptr after they are deleted

dcrowell77 · 2018-08-17T15:10:43Z

Resolved with 1e784c0. HBRT will now assert and crash on early-life PNOR failures instead of spinning in an infinite loop that pegs the CPU.
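As a rough sketch of the fail-fast idea described here (not the code from 1e784c0), the handler below refuses to wait or retry when a flash failure arrives during early init. `InitPhase`, `EarlyFlashFailure`, and `handleFlashFailure` are hypothetical names, and an exception stands in for the assert and crash so the example can be exercised without killing the process.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; this is not code from commit 1e784c0.
enum class InitPhase { Early, Running };

// Thrown in place of a real assert/crash so the sketch stays testable.
struct EarlyFlashFailure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Handles a flash read failure. During early init there is no error log
// to commit to, so the function fails fast rather than waiting or retrying.
// Once running, it reports that the error went down the normal log path.
inline bool handleFlashFailure(InitPhase phase, const std::string& detail) {
    if (phase == InitPhase::Early) {
        throw EarlyFlashFailure("early PNOR failure: " + detail);
    }
    return true;  // normal path: the error would be committed to the log
}
```

The trade-off is that the process dies loudly on a corrupt partition, which is easier to notice and diagnose than a daemon silently pinned at 100% CPU.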

dcrowell77 self-assigned this Oct 20, 2016

ghost mentioned this issue Oct 21, 2016

IPL failure with corrupt HBEL partition #68

Closed

dcrowell77 closed this as completed Aug 17, 2018
