Sidecar: avoid Monorail restart if Tofino PCIe link not up #1510

Merged (2 commits) on Aug 28, 2023

arjenroodselaar
Contributor

Monorail monitors the 10G link with Tofino and periodically restarts the task if this link is not up. Each restart in turn causes the technician port PHY to flap, which is disruptive while working with pilot racktest.

To reduce spurious restarts, the PCIe link with Tofino is now monitored as well, and restarts are only performed while that link is up. If the PCIe link is down, the 10G port is assumed to be down too, so no restart is attempted. In addition, logging in the sequencer task is improved, making the events more accurate and cutting down on noise.
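
The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LinkStatus`, `RestartWatchdog`, and `RESTART_AFTER_TICKS` are hypothetical names, and the threshold of 25 is only taken from the "~25 ticks" figure quoted later in this description.

```rust
/// Hypothetical snapshot of the two links; the field names are
/// illustrative and are not taken from the Hubris sources.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LinkStatus {
    pub pcie_link_up: bool,
    pub link_10g_up: bool,
}

/// Counts consecutive ticks on which a restart would be justified.
#[derive(Debug, Default)]
pub struct RestartWatchdog {
    ticks_down: u32,
}

impl RestartWatchdog {
    /// Assumed threshold, based on the "~25 ticks" mentioned in the text.
    pub const RESTART_AFTER_TICKS: u32 = 25;

    /// Returns `true` when the caller should restart the task on this tick.
    pub fn tick(&mut self, status: LinkStatus) -> bool {
        if !status.pcie_link_up || status.link_10g_up {
            // Either the PCIe link is down (so the 10G port is assumed to
            // be down as well), or the 10G link is healthy. In both cases
            // there is nothing to recover from, so reset the counter.
            self.ticks_down = 0;
            return false;
        }

        // PCIe is up but the 10G link is not: count towards a restart.
        self.ticks_down += 1;
        if self.ticks_down >= Self::RESTART_AFTER_TICKS {
            self.ticks_down = 0;
            true
        } else {
            false
        }
    }
}
```

With this shape, any number of ticks with the PCIe link down never triggers a restart, while a 10G link that stays down with PCIe up triggers one every `RESTART_AFTER_TICKS` ticks.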

The following ringbuf snippets were captured on a freshly booted chassis while the PCIe link with a Gimlet is not (yet) up:

humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
   0  802        1        1 FpgaInit
   1  810        1        1 LoadingFpgaBitstream
   2  855        1        1 MainboardControllerId(0x1de5bae)
   3  869        1        1 MainboardControllerChecksum(0x8e40be86)
   4  901        1        1 MainboardControllerVersion(0xe6)
   5  902        1        1 MainboardControllerSha(0xfebfa15a)
   6  903        1        1 FpgaInitComplete
   7   27        1        1 LoadingClockConfiguration
   8  928        1        1 ClockConfigurationComplete
   9  219        1        1 FrontIOBoardPowerEnable(true)
  10  933        1        1 FrontIOBoardPresent
  11   80        1        1 LoadingFrontIOControllerBitstream { fpga_id: 0x0 }
  12   92        1        1 FrontIOControllerIdent { fpga_id: 0x0, ident: 0x1deaa55 }
  13   99        1        1 FrontIOControllerChecksum { fpga_id: 0x0, checksum: [ 0x5, 0x9a, 0xa, 0x59 ], expected: [ 0x5, 0x9a, 0xa, 0x59 ] }
  14   80        1        1 LoadingFrontIOControllerBitstream { fpga_id: 0x1 }
  15   92        1        1 FrontIOControllerIdent { fpga_id: 0x1, ident: 0x1deaa55 }
  16   99        1        1 FrontIOControllerChecksum { fpga_id: 0x1, checksum: [ 0x5, 0x9a, 0xa, 0x59 ], expected: [ 0x5, 0x9a, 0xa, 0x59 ] }
  17  336        1        1 TofinoSequencerTick(Disabled, A2 { error: None })
  18  145        1        1 FanModuleLedUpdate(Zero, On)
  19  145        1        1 FanModuleLedUpdate(One, On)
  20  145        1        1 FanModuleLedUpdate(Two, On)
  21  145        1        1 FanModuleLedUpdate(Three, On)
  22  336        1        3 TofinoSequencerTick(Disabled, A2 { error: None })
  23  299        1        1 FrontIOBoardPhyPowerEnable(true)
  24  336        1        1 TofinoSequencerTick(Disabled, A2 { error: None })
  25  521        1        1 FrontIOBoardPhyOscGood
  26  336        1      116 TofinoSequencerTick(Disabled, A2 { error: None })
humility: ring buffer task_monorail_server::bsp::__RINGBUF in monorail:
 NDX LINE      GEN    COUNT PAYLOAD
   0  173        1        1 Reinit
   1  433        1        1 FrontIoSpeedChange { port: 0x2d, before: Speed1G, after: Speed100M }

As shown, 116 sequencer ticks have elapsed without Monorail restarting itself, a restart that would otherwise occur roughly every 25 ticks (about 25 seconds).

After attaching a Gimlet:

humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
  17  336        1        1 TofinoSequencerTick(Disabled, A2 { error: None })
  18  145        1        1 FanModuleLedUpdate(Zero, On)
  19  145        1        1 FanModuleLedUpdate(One, On)
  20  145        1        1 FanModuleLedUpdate(Two, On)
  21  145        1        1 FanModuleLedUpdate(Three, On)
  22  336        1        3 TofinoSequencerTick(Disabled, A2 { error: None })
  23  299        1        1 FrontIOBoardPhyPowerEnable(true)
  24  336        1        1 TofinoSequencerTick(Disabled, A2 { error: None })
  25  521        1        1 FrontIOBoardPhyOscGood
  26  336        1      137 TofinoSequencerTick(Disabled, A2 { error: None })
  27  374        1        1 ClearingTofinoSequencerFault(None)
  28  339        1        1 TofinoSequencerPolicyUpdate(LatchOffOnFault)
  29  336        1        1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
  30   80        1        1 TofinoPowerUp
  31   51        1        1 SetVddCoreVout(Volts(0.79))
   0  105        2        1 TofinoVidAck
   1  122        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc0f0)
   2  152        2        1 TofinoBar0RegisterValue(ResetOptions, 0x70000a8)
   3  165        2        1 TofinoEepromIdCode(0x220134)
   4  191        2        1 TofinoBar0RegisterValue(PciePhyLaneControl0, 0xc000c)
   5  191        2        1 TofinoBar0RegisterValue(PciePhyLaneControl1, 0xc000c)
   6  217        2        1 TofinoCfgRegisterValue(KGen, 0xe20f03)
   7  236        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc000)
   8   61        2        1 SetPCIePresent
   9  336        2        1 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
  10  336        2       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  11  279        2        1 FrontIOBoardPhyPowerEnable(false)
  12  299        2        1 FrontIOBoardPhyPowerEnable(true)
  13  336        2       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  14  279        2        1 FrontIOBoardPhyPowerEnable(false)
  15  299        2        1 FrontIOBoardPhyPowerEnable(true)
  16  336        2        7 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
humility: ring buffer task_monorail_server::bsp::__RINGBUF in monorail:
 NDX LINE      GEN    COUNT PAYLOAD
   0  173        1        1 Reinit
   1  433        1        1 FrontIoSpeedChange { port: 0x2d, before: Speed1G, after: Speed100M }
   2  481        1       41 Restarted10GAneg
   3  173        1        1 Reinit
   4  481        1        4 Restarted10GAneg
   5  433        1        1 FrontIoSpeedChange { port: 0x2d, before: Speed1G, after: Speed100M }
   6  481        1       36 Restarted10GAneg
   7  173        1        1 Reinit
   8  481        1        4 Restarted10GAneg
   9  433        1        1 FrontIoSpeedChange { port: 0x2d, before: Speed1G, after: Speed100M }
  10  481        1       36 Restarted10GAneg
  11  173        1        1 Reinit
  12  481        1        4 Restarted10GAneg
  13  433        1        1 FrontIoSpeedChange { port: 0x2d, before: Speed1G, after: Speed100M }
  14  481        1       34 Restarted10GAneg

With the PCIe link up, Monorail now monitors the 10G link and restarts itself roughly every 25 ticks (about 25 seconds).

After the PCIe link is disconnected, 228 ticks (about 228 seconds) pass without Monorail restarting itself. The restarts resume only once the PCIe link comes back up:

humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
 NDX LINE      GEN    COUNT PAYLOAD
  30   61        2        1 ClearPCIePresent
  31  338        2      133 TofinoSequencerTick(Disabled, A2 { error: None })
   0  372        3        1 ClearingTofinoSequencerFault(None)
   1  337        3        1 TofinoSequencerPolicyUpdate(LatchOffOnFault)
   2  338        3        1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
   3   82        3        1 TofinoPowerUp
   4   51        3        1 SetVddCoreVout(Volts(0.79))
   5  107        3        1 TofinoVidAck
   6  124        3        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc0f0)
   7  154        3        1 TofinoBar0RegisterValue(ResetOptions, 0x70000a8)
   8  167        3        1 TofinoEepromIdCode(0x220134)
   9  193        3        1 TofinoBar0RegisterValue(PciePhyLaneControl0, 0xc000c)
  10  193        3        1 TofinoBar0RegisterValue(PciePhyLaneControl1, 0xc000c)
  11  219        3        1 TofinoCfgRegisterValue(KGen, 0xe20f03)
  12  238        3        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc000)
  13   61        3        1 SetPCIePresent
  14  338        3        1 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
  15  338        3       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  16  277        3        1 FrontIOBoardPhyPowerEnable(false)
  17  297        3        1 FrontIOBoardPhyPowerEnable(true)
  18  338        3       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  19  277        3        1 FrontIOBoardPhyPowerEnable(false)
  20  297        3        1 FrontIOBoardPhyPowerEnable(true)
  21  338        3       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  22  277        3        1 FrontIOBoardPhyPowerEnable(false)
  23  297        3        1 FrontIOBoardPhyPowerEnable(true)
  24  338        3       14 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  25  338        3      228 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
  26  338        3       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })
  27  277        3        1 FrontIOBoardPhyPowerEnable(false)
  28  297        3        1 FrontIOBoardPhyPowerEnable(true)
  29  338        3        7 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })

@arjenroodselaar arjenroodselaar force-pushed the reduce_monorail_restart branch 2 times, most recently from d2ed99c to 5b22a4c Compare August 28, 2023 21:02
@arjenroodselaar arjenroodselaar enabled auto-merge (squash) August 28, 2023 22:47
@arjenroodselaar arjenroodselaar merged commit 580b0fc into master Aug 28, 2023
71 checks passed
@arjenroodselaar arjenroodselaar deleted the reduce_monorail_restart branch August 29, 2023 04:32
mkeeter added a commit that referenced this pull request Apr 26, 2024
We noticed on the dogfood rack that a Sidecar's 10G link was
persistently down.

Normally, this is fixed by a watchdog. However, the watchdog only fires
if the PCIe link is active
(#1510).

The PCIe link was falsely being reported as down because the debug port
state (`TOFINO_DEBUG_PORT_STATE`) had `receive_buffer_empty = false`, so
we bailed out of the check at [this
condition](https://github.com/oxidecomputer/hubris/blob/020d014880382d872d048fbfe1e8152a39e7c47a/drv/sidecar-mainboard-controller/src/tofino2.rs#L662).
This failure was persistent through SP reboots (which notably do not
reflash the FPGA), so it's likely out-of-sync state between the FPGA and
SP.

This PR adds a startup step to reset the debug port tx/rx buffers by
writing to the `TOFINO_DEBUG_PORT_STATE` register.

Flashing this firmware onto the misbehaving system brought it back into
working state (i.e. reporting `pcie_link = true`).
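
The failure mode described in this commit message can be illustrated with a small sketch. Everything here is hypothetical: `DebugPortState` is modelled as a plain struct rather than the real `TOFINO_DEBUG_PORT_STATE` register, `send_buffer_empty` and `reset_buffers` are assumed names, and the real check in `tofino2.rs` may report the bail-out differently. The sketch only shows why a stale receive buffer makes the link look down until the buffers are reset.

```rust
/// Hypothetical model of the debug port state. Only `receive_buffer_empty`
/// is named in the commit message; the rest is assumed for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DebugPortState {
    pub send_buffer_empty: bool,
    pub receive_buffer_empty: bool,
}

impl DebugPortState {
    /// Stand-in for the startup step that resets the tx/rx buffers.
    pub fn reset_buffers(&mut self) {
        self.send_buffer_empty = true;
        self.receive_buffer_empty = true;
    }
}

/// Returns the PCIe link state as the checker would report it.
/// `actual_link_up` stands in for whatever the hardware would answer
/// if the debug port request could be issued.
pub fn reported_pcie_link(state: DebugPortState, actual_link_up: bool) -> bool {
    if !state.receive_buffer_empty {
        // Stale data in the receive buffer: the check bails out early,
        // so the link is reported as down regardless of its real state.
        return false;
    }
    actual_link_up
}
```

In this model, a state left over from before an SP reboot (with `receive_buffer_empty == false`) keeps the reported link down until `reset_buffers` runs, which matches the behavior the commit message describes.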