HBRT infinite loop on ECC error during startup #67

Closed
ghost opened this issue Oct 20, 2016 · 5 comments

ghost commented Oct 20, 2016

Start opal-prd and observe this log before opal-prd gets stuck at 100% CPU.

HBRT: PRDF:>>PRDF::main() Global attnType=0004
HBRT: PRDF:>>PRDF::noLock_initialize() 
HBRT: PRDF:>>PegasusConfigurator::build()
HBRT: PRDF:<<PegasusConfigurator::build()
HBRT: PRDF:<<PRDF::noLock_initialize() 
HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(at which point everything stops with opal-prd chewing 100% CPU)

This turns out to be a fairly classic race: HBRT tries to log an error before everything has been initialized.

Consequently, opal-prd spins a core and is right off into the weeds.
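One plausible shape of this failure, sketched in Python rather than taken from the hostboot source: the log shows the `ErrlManager` constructor reading PNOR, so if that read fails and the failure path asks for the error logger again, construction is re-entered before it has finished. The names `ErrorLog`, `EarlyInitFailure` and `bad_read` are hypothetical stand-ins, and the guard flag is only one way to turn the spin into an immediate failure.

```python
class EarlyInitFailure(RuntimeError):
    """Raised when error logging is requested before the logger exists."""


class ErrorLog:
    _instance = None       # the singleton, once fully constructed
    _constructing = False  # True while the constructor is still running

    def __init__(self, read_section):
        self.entries = []
        try:
            # Setup needs to read a storage section (stand-in for PNOR).
            self.section = read_section()
        except IOError as err:
            # The failure path reports through the logger that is still
            # being constructed: this is the re-entrant call.
            ErrorLog.get(read_section).entries.append(str(err))

    @classmethod
    def get(cls, read_section):
        if cls._instance is None:
            if cls._constructing:
                # Fail fast instead of re-entering construction forever.
                raise EarlyInitFailure(
                    "error logged before logger was initialized")
            cls._constructing = True
            try:
                cls._instance = cls(read_section)
            finally:
                cls._constructing = False
        return cls._instance


def bad_read():
    # Hypothetical stand-in for the failing PNOR read in the log above.
    raise IOError("Uncorrectable ECC error : chip=0,offset=0x0")


if __name__ == "__main__":
    try:
        ErrorLog.get(bad_read)
    except EarlyInitFailure as exc:
        print("fail fast:", exc)
```

Without the `_constructing` check, `get()` would keep calling the constructor, which keeps failing and calling `get()` again. That is the Python analogue of a process that never makes progress.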

@dcrowell77 dcrowell77 self-assigned this Oct 20, 2016
@dcrowell77
Collaborator

Ironically enough we just encountered this in a different environment. I'm surprised it took this long. I've got some hacks that need to be ironed out. Since this is a secondary fail it will go on the backburner for a little while.

ghost commented Oct 20, 2016

FWIW, the workaround is to erase the HBEL partition:

pflash -P HBEL -e

and then you'll be okay.

ghost commented Oct 20, 2016

Or, rather, okay until you reboot, then you fail to IPL:

  0.61819|ECC error in PNOR flash in section offset 0x00008000

  0.62322|System shutting down with error status 0x60F
  3.22583|Ignoring boot flags, incorrect version 0x0
  3.30757|ISTEP  6. 3
  1.09243|ECC error in PNOR flash in section offset 0x00008000

  1.09246|System shutting down with error status 0x60F
  3.68140|Ignoring boot flags, incorrect version 0x0
  3.76426|ISTEP  6. 3

ghost commented Oct 20, 2016

and then you get into this stupid inescapable mess that is the golden side and your life is horrible.

But it seems that flashing zeros over the partition, rather than just erasing it, does the trick.
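A minimal sketch of the "flash zeros" variant, under some assumptions. The size is a placeholder borrowed from the `size=0x20000` in the log above and may not match the HBEL partition's real size on flash, so check your own layout first, for example with `pflash --info`. The helper names `make_zero_image` and `pflash_program_command` are hypothetical. The command uses pflash's `-e` (erase) and `-p` (program from file) options; confirm them against `pflash --help` on your build.

```python
import os
import tempfile


def make_zero_image(path, size):
    """Write `size` zero bytes to `path`; return the resulting file size."""
    if size <= 0:
        raise ValueError("size must be positive")
    with open(path, "wb") as f:
        f.write(bytes(size))  # bytes(n) is n zero bytes
    return os.path.getsize(path)


def pflash_program_command(partition, image_path):
    """Build (but do not run) a pflash erase-and-program command line."""
    return ["pflash", "-P", partition, "-e", "-p", image_path]


if __name__ == "__main__":
    # Placeholder size, NOT a verified HBEL partition size.
    PLACEHOLDER_SIZE = 0x20000
    image = os.path.join(tempfile.gettempdir(), "hbel_zeros.bin")
    make_zero_image(image, PLACEHOLDER_SIZE)
    print(" ".join(pflash_program_command("HBEL", image)))
```

The script only prints the command instead of running it, since writing to host flash is destructive and should be a deliberate manual step.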

wghoffa pushed a commit to wghoffa/hostboot that referenced this issue Sep 28, 2017
Changes Included for package witherspoon-xml, branch master:
7bec10c - Erich Hauptli - 2017-09-26 - Adding new WOF data
3d66657 - e-liner - 2017-09-21 - Merge pull request open-power#69 from e-liner/memd_binary
8b9fa55 - Elizabeth Liner - 2017-09-21 - Adding witherspoon MEMD binary
ac74311 - Erich Hauptli - 2017-09-21 - Updating Memory Attributes
c0b9bc1 - William Hoffa - 2017-09-15 - Mark Xbus Targets Deconfigurable (open-power#67)
5736f3e - Prachi Gupta - 2017-09-08 - sync with common_mrw_xml -- 09/07 (open-power#66)

Changes Included for package hostboot, branch master:
7f59b42 - Jacob Harvey - 2017-09-26 - Increment red_waterfall for low vdn fix
ad079f5 - Zane Shelley - 2017-09-26 - PRD: Nimbus DD2.0.1 workaround for nce/tce/mpe/impe
49d2286 - Matt K. Light - 2017-09-25 - remove cas_latency.H include from p9_mss_freq.H
4930d04 - Luke Mulkey - 2017-09-25 - Memory buffer vpd accessor functions
3027cb5 - Thi Tran - 2017-09-25 - L3 Update - p9_hcd_cache_stopclocks HWP
72b46fb - Ben Gass - 2017-09-25 - Fix DMI scom translation.
190d346 - Prem Shanker Jha - 2017-09-25 - 24x7: Corrected handling of MCA on a direct attached systems.
3245f4f - Louis Stermole - 2017-09-25 - Restore original training settings if mss_draminit_training_adv fails
ecb8cf7 - Sachin Gupta - 2017-09-25 - Added comment for INVALID enum value
7085e6b - Andre Marin - 2017-09-25 - Add Write CRC attributes to xml and eff_dimm
84e9979 - Andre Marin - 2017-09-25 - Modify VPD decoder to take into account deconfigured ports
b6c7737 - Matthew Hickman - 2017-09-25 - Changed two symbol correction disable to mnfg flag DISABLE_DRAM_REPAIRS
f0e99cd - Nick Klazynski - 2017-09-25 - Core workarounds for multiple issues.
d58dbd6 - Soma BhanuTej - 2017-09-25 - Nimbus DD22 support updates to ekb
c0719c3 - Ben Gass - 2017-09-25 - Updates for HW416934 and HW417233
bc88548 - Elizabeth Liner - 2017-09-25 - Removing first byte from MEMD binary
0e38c62 - Nick Bofferding - 2017-09-25 - Secure Boot: Direct signature temp files to specific scratch dir
54c1fc7 - Nick Bofferding - 2017-09-25 - Secure Boot: Support open signing with component IDs
1b3b999 - Christian Geddes - 2017-09-25 - Set variables to nullptr after they are deleted
@dcrowell77
Collaborator

Resolved with 1e784c0. HBRT will now assert and crash if we get early-life PNOR failures instead of doing an infinite loop that pegs the CPU.
