Romulus / Talos does not IPL on DD2.2 #10

Closed
madscientist159 opened this issue Mar 14, 2018 · 13 comments

Comments

@madscientist159

On the latest op-build and DD2.2 the SBE hangs on ISTEP 5. Nothing is printed to console. IBM Austin has reproduced this issue on Romulus; we see it on Talos.

We've also reproduced the issue on this end with SBE hash 75ddac2. Working through a bisect / regression test as the SBE images from November 2017 allow hostboot to initialize.

@madscientist159
Author

madscientist159 commented Mar 15, 2018

After extensive testing, one of the known working SBE versions is 9b78381.

However, there is a confounding factor in all of this: a physical power cycle to at least our Talos boards is required to recover from a "bad" SBE version, or even after a standard SBE update from hostboot. This poorly-understood issue means that the "bad" SBE versions also need to be power-cycle tested to see if they recover. Power cycle here means pulling standby power to the entire mainboard, not just cycling the host power via the BMC.

This was not seen until recently, so something seems to have changed in newer SBE code and/or the DD2.2 silicon itself.
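The bisect between a known-good and a known-bad SBE commit can be sketched as below. This is only an illustration under stated assumptions: the regression is assumed to be a single good-to-bad transition in a linear history, the intermediate commit names are made up, and `is_good` is a hypothetical stand-in for the manual procedure (flash the image, pull standby power, attempt IPL).

```python
def first_bad(commits, is_good):
    """Return the first failing commit.

    commits: list ordered oldest to newest, where commits[0] is known good
             and commits[-1] is known bad.
    is_good: hypothetical callback that tests one commit on hardware.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# 9b78381 (good) and 75ddac2 (bad) come from this thread; the names in
# between are placeholders, not real SBE commits.
history = ["9b78381", "placeholder-1", "placeholder-2", "placeholder-3", "75ddac2"]

# Pretend the regression landed in placeholder-3.
simulated_result = lambda commit: history.index(commit) < 3
culprit = first_bad(history, simulated_result)
print(culprit)
```

Because a bad image may leave the board in a state that only a standby power cycle clears, each `is_good` call would need to include that power cycle, otherwise a good commit tested after a bad one could be misclassified.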

@sgupta2m
Contributor

Currently almost all zz/ws/zaius systems are DD2.1 or DD2.2 systems. No such issue has been reported anywhere else, so I doubt it is related to the SBE code. Can you please give us a system where this issue reproduces?

@madscientist159
Author

Talos has this issue, and IBM Austin has replicated it on Romulus.

@sgupta2m
Contributor

Do you have a system on which this issue occurs? We will need a live system for any debug, as the BMC currently does not capture any debug data for SBE failures.

@madscientist159
Author

@sgupta2m I have direct access to the DD2.2 system showing the problem and a Cronus box. Just let me know what you need to see / have run on the system.

@sgupta2m
Contributor

sgupta2m commented Mar 15, 2018

To start with, can you please give us the output of these commands after the failure:
croquery sbe_version pu
sbe-debug.py -l sbestatus -t HW

@madscientist159
Author

@sgupta2m

croquery sbe_version pu
p9n     k0:n0:s0:p00
fwCommitId = 9b783817
fwTag = 628464939fe1f795e
/opt/openpower/p9/cronus/p9/exe/dev/p9_dev_x86_64.exe croquery sbe_version pu

I don't have the sbe-debug.py script, but here are the status fields over CFAM:

getcfam pu 2809 -p0
p9n     k0:n0:s0:p00       0x82405083

getcfam pu 2809 -p0
p9n     k0:n0:s0:p00       0x00600002

@madscientist159
Author

@sgupta2m Found the sbe-debug.py script and ran it as requested:

sbe-debug.py -l sbestatus -t HW
k0
/opt/openpower/p9/cronus/p9/exe/dev/p9_dev_x86_64.exe setconfig USE_SBE_FIFO off
NOTE: (System::~System): Config settings have changed, forcing a write of the config file ...
cmd: getcfam pu 2809 -n0 -p0
SBE Booted           : True
Async FFDC           : False
Reserver Bit [2:3]   : 00
SBE Previous State   : SBE_STATE_ISTEP (0010)
SBE Current State    : SBE_STATE_RUNTIME (0100)
Istep Major          : 5
Istep Minor          : 2
Reserved Bit [26:31] : 000011
k0
/opt/openpower/p9/cronus/p9/exe/dev/p9_dev_x86_64.exe setconfig USE_SBE_FIFO off
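
The fields printed above line up bit-for-bit with the 0x82405083 value read from CFAM 2809 earlier in the thread. A minimal sketch of that decoding follows. The bit positions of the two reserved fields are taken from the tool output; the positions of the remaining fields are an assumption inferred by fitting the printed values to the raw word, not taken from the SBE source. Only the two state names that appear in this thread are mapped.

```python
def decode_sbe_status(value):
    """Split a 32-bit CFAM 2809 value into the fields sbe-debug.py prints.

    The field boundaries are inferred from this thread, not from SBE code.
    """

    def field(first, last):
        # IBM bit numbering: bit 0 is the most significant bit of the word.
        width = last - first + 1
        return (value >> (31 - last)) & ((1 << width) - 1)

    # Only the state encodings visible in the tool output are known here.
    known_states = {0b0010: "SBE_STATE_ISTEP", 0b0100: "SBE_STATE_RUNTIME"}

    def state(raw):
        return known_states.get(raw, "unknown (0b{:04b})".format(raw))

    return {
        "sbe_booted": bool(field(0, 0)),
        "async_ffdc": bool(field(1, 1)),
        "reserved_2_3": field(2, 3),
        "previous_state": state(field(4, 7)),
        "current_state": state(field(8, 11)),
        "istep_major": field(12, 19),
        "istep_minor": field(20, 25),
        "reserved_26_31": field(26, 31),
    }


status = decode_sbe_status(0x82405083)
for name, val in status.items():
    print(f"{name:16}: {val}")
```

Read this way, the register says the SBE left the istep state at 5.2 and reached runtime, which matches the tool output.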

@sgupta2m
Contributor

The SBE looks good here. We need to understand from the HB team where they are failing, so you need to send this issue to the HB team.
@dcrowell77 is the right person to look into it.

@madscientist159
Author

@sgupta2m Why would downgrading the SBE alone with Cronus (without changing hostboot) cause the system to work again if this is not an SBE issue?

FWIW, since I have the debug box up, here is the full trace from the SBE:

-------------------------------------------------------------------------------
TRACEBUFFER: Mixed buffer
-------------------------------------------------------------------------------
 Sec    Usec      PID Comp             Line Entry Data
-------------------------------------------------------------------------------
00000160.480706666|    0|sbe_seeprom_DD2 |   1|!!! NO STRING NO TRACE !!! for hash=940781583
00000160.480706666|    0|sbe_seeprom_DD2 |   1|~[0x0000] 3813F0F1                                *8...            *
00000160.480709847|    0|sbe_seeprom_DD2 |   1|I> sbeIsCmdAllowedAtState SBE State [0x00000002] Fence State[0x0045]
00000160.480716844|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Processing command from client :0x0
00000160.480721834|    0|sbe_seeprom_DD2 |   1|I>sbeUpFifoAckEot
00000160.480724072|    0|sbe_seeprom_DD2 |   1|I>validateIstep prevMajorNumber:4 prevMinorNumber:33
00000160.480726679|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x4 minor number:0x22
00000160.480728531|    0|sbe_seeprom_DD2 |   1|I>istepNoOp
00000160.480737603|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Command processesed. l_rc=[0x0000]
00000160.480741033|    0|sbe_seeprom_DD2 |   1|I> sbeHandleFifoResponse ChipOp Done
00000160.481675352|    0|sbe_seeprom_DD2 |   1|I> sbeValidateCmdClass i_cmdClass[0xA1], i_cmdOpcode[0x01]
00000160.481678533|    0|sbe_seeprom_DD2 |   1|I> sbeIsCmdAllowedAtState SBE State [0x00000002] Fence State[0x0045]
00000160.481685530|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Processing command from client :0x0
00000160.481690529|    0|sbe_seeprom_DD2 |   1|I>sbeUpFifoAckEot
00000160.481692758|    0|sbe_seeprom_DD2 |   1|I>validateIstep prevMajorNumber:4 prevMinorNumber:34
00000160.481695502|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x5 minor number:0x1
00000163.609474212|    0|sbe_seeprom_DD2 |   1|I>istep 5.1 HB Dump mem Region [0x0000000008000000]
00000163.612226675|    0|sbe_seeprom_DD2 |   1|I>SBESecureMemRegionManager::add Adding region Mem[0x0000000008000000], size[0x00A00000]
00000163.613668227|    0|sbe_seeprom_DD2 |   1|I>SBESecureMemRegionManager::add after addition iv_regionsOpenCnt [1]
00000163.614332347|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Command processesed. l_rc=[0x0000]
00000163.614335785|    0|sbe_seeprom_DD2 |   1|I> sbeHandleFifoResponse ChipOp Done
00000163.615932861|    0|sbe_seeprom_DD2 |   1|I> sbeValidateCmdClass i_cmdClass[0xA1], i_cmdOpcode[0x01]
00000163.615936034|    0|sbe_seeprom_DD2 |   1|I> sbeIsCmdAllowedAtState SBE State [0x00000002] Fence State[0x0045]
00000163.615943039|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Processing command from client :0x0
00000163.615948021|    0|sbe_seeprom_DD2 |   1|I>sbeUpFifoAckEot
00000163.615950259|    0|sbe_seeprom_DD2 |   1|I>validateIstep prevMajorNumber:5 prevMinorNumber:1
00000163.615953003|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x5 minor number:0x2
00000163.631154838|    0|sbe_seeprom_DD2 |   1|I>SbeRegAccess::stateTransition Event Received 0 CurrState 0x00000002 StartCnt8 EndCnt3
00000163.631156278|    0|sbe_seeprom_DD2 |   1|I>SbeRegAccess::stateTransition Updating State as 4
00000163.631166816|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Command processesed. l_rc=[0x0000]
00000163.631170255|    0|sbe_seeprom_DD2 |   1|I> sbeHandleFifoResponse ChipOp Done
00000166.273899129|    0|sbe_seeprom_DD2 |   1|I> sbeValidateCmdClass i_cmdClass[0xA8], i_cmdOpcode[0x02]
00000166.273902310|    0|sbe_seeprom_DD2 |   1|I> sbeIsCmdAllowedAtState SBE State [0x00000004] Fence State[0x0000]
00000166.273909256|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Processing command from client :0x0
00000166.285959296|    0|sbe_seeprom_DD2 |   1|I>sbeUpFifoAckEot
00000166.288281183|    0|sbe_seeprom_DD2 |   1|I> sbeSyncCommandProcessor_routine Command processesed. l_rc=[0x0000]
00000166.288284621|    0|sbe_seeprom_DD2 |   1|I> sbeHandleFifoResponse ChipOp Done
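
To see how far the SBE got, the `sbeExecuteIstep` entries can be pulled out of a trace like the one above. The following is a rough sketch, assuming the pipe-separated layout shown here and that the major and minor numbers on those lines are hexadecimal (the trace prints 0x22 where the next `validateIstep` line reports 34). The function name is illustrative, not part of any SBE tooling.

```python
import re

# Matches e.g.
# 00000163.615953003|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x5 minor number:0x2
ISTEP_RE = re.compile(
    r"^(\d+\.\d+)\|.*sbeExecuteIstep Major number:0x([0-9a-fA-F]+)"
    r" minor number:0x([0-9a-fA-F]+)"
)


def executed_isteps(trace_text):
    """Return (timestamp, major, minor) for every sbeExecuteIstep entry."""
    steps = []
    for line in trace_text.splitlines():
        match = ISTEP_RE.match(line.strip())
        if match:
            timestamp, major, minor = match.groups()
            steps.append((timestamp, int(major, 16), int(minor, 16)))
    return steps


# Three entries copied from the trace in this thread.
SAMPLE = """\
00000160.480726679|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x4 minor number:0x22
00000160.481695502|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x5 minor number:0x1
00000163.615953003|    0|sbe_seeprom_DD2 |   1|I>sbeExecuteIstep Major number:0x5 minor number:0x2
"""

steps = executed_isteps(SAMPLE)
for timestamp, major, minor in steps:
    print(f"{timestamp}  istep {major}.{minor}")
```

The last entry is istep 5.2, and the trace shows it completing with l_rc=[0x0000] before the state changes to 4.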

@sgupta2m
Contributor

The SBE image contains HBBL (owned by the HB team). If you are failing after istep 5.2 but before the HB isteps start, the issue is most probably in HBBL.

@madscientist159
Author

OK, that helps. Thanks!

@madscientist159
Author

Migrated back to open-power/hostboot#128
