Skip to content

v5.10.3

Choose a tag to compare

released this 01 May 06:07
· 2105 commits to master since this release
v5.10.3

skiboot-5.10.3


skiboot 5.10.3 was released on Thursday March 28th, 2018. It replaces
skiboot-5.10.2 as the current stable release in the 5.10.x series.

It is recommended that 5.10.3 be used instead of any previous 5.10.x
version due to the bug fixes and debugging enhancements in it.

Over skiboot-5.10.2, we have a few improvements and bug fixes:

  • NPU2: dump NPU2 registers on npu2 HMI

    Due to the nature of debugging npu2 issues, folk are wanting the
    full list of NPU2 registers dumped when there’s a problem.

    This is different than the solution introduced in 5.10.1 as there we
    would dump the registers in a way that would trigger a FIR bit that
    would confuse PRD.

  • npu2: Add performance tuning SCOM inits

    Peer-to-peer GPU bandwidth latency testing has produced some tunable
    values that improve performance. Add them to our device
    initialization.

    File these under things that need to be cleaned up with nice
    #defines for the register names and bitfields when we get time.

    A few of the settings are dependent on the system’s particular
    NVLink topology, so introduce a helper to determine how many links
    go to a single GPU.

  • hw/npu2: Assign a unique LPARSHORTID per GPU

    This gets used elsewhere to index items in the XTS tables.

  • occ: Set up OCC messaging even if we fail to setup pstates

    This means that we no longer hit this bug if we fail to get valid
    pstates from the OCC.

    [console-pexpect]#echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
    echo 1 > //sys/firmware/opal/sensor_groups//occ-csm0/clear
    [ 94.019971181,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
    [ 94.020098392,5] CPU ATTEMPT TO RE-ENTER FIRMWARE! PIR=083d cpu @0x33cf4000 -> pir=083d token=8
    [ 10.318805] Disabling lock debugging due to kernel taint
    [ 10.318808] Severe Machine check interrupt [Not recovered]
    [ 10.318812] NIP [000000003003e434]: 0x3003e434
    [ 10.318813] Initiator: CPU
    [ 10.318815] Error type: Real address [Load/Store (foreign)]
    [ 10.318817] opal: Hardware platform error: Unrecoverable Machine Check exception
    [ 10.318821] CPU: 117 PID: 2745 Comm: sh Tainted: G M 4.15.9-openpower1 #3
    [ 10.318823] NIP: 000000003003e434 LR: 000000003003025c CTR: 0000000030030240
    [ 10.318825] REGS: c00000003fa7bd80 TRAP: 0200 Tainted: G M (4.15.9-openpower1)
    [ 10.318826] MSR: 9000000000201002 <SF,HV,ME,RI> CR: 48002888 XER: 20040000
    [ 10.318831] CFAR: 0000000030030258 DAR: 394a00147d5a03a6 DSISR: 00000008 SOFTE: 1

  • core/fast-reboot: disable fast reboot upon fundamental
    entry/exit/locking errors

    This disables fast reboot in several more cases where serious errors
    like lock corruption or call re-entrancy are detected.

  • core/opal: allow some re-entrant calls

    This allows a small number of OPAL calls to succeed despite re-
    entering the firmware, and rejects others rather than aborting.

    This allows a system reset interrupt that interrupts OPAL to do
    something useful. Sreset other CPUs, use the console, which allows
    xmon to work or stack traces to be printed, reboot the system.

    Use OPAL_INTERNAL_ERROR when rejecting, rather than OPAL_BUSY, which
    is used for many other things that does not mean a serious permanent
    error.

  • core/opal: abort in case of re-entrant OPAL call

    The stack is already destroyed by the time we get here, so there is
    not much point continuing.

  • npu2: Disable fast reboot

    Fast reboot does not yet work right with the NPU. It’s been disabled
    on NVLink and OpenCAPI machines. Do the same for NVLink2.

    This amounts to a port of 3e45779 (“npu: Fix broken fast
    reset”) from the npu code to npu2.