QSGMII link to front IO PHY sometimes doesn't come up #1410

Closed
mkeeter opened this issue Jun 13, 2023 · 10 comments

@mkeeter
Collaborator

mkeeter commented Jun 13, 2023

When running a hard reboot loop to debug #1399 , I found the system in a state where the link from the VSC7448 to the technician port PHY was down.

PORT | MODE    SPEED  DEV     SERDES  LINK |   PHY    MAC LINK  MEDIA LINK
-----|-------------------------------------|-------------------------------
 44  | QSGMII  1G     1G_20   6G_15   err  | VSC8562  err       down
 45  | QSGMII  1G     1G_21   6G_15   down | VSC8562  err       down

This is distinct from #1399, where the QSGMII link is fine, but the VSC7448 is dropping packets in its queue system.

We see various bits indicating that the QSGMII link can't sync up:

matt@lurch ~ (sidecar-sp) $ h monorail read HW_QSGMII_STAT[11]
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
humility: Reading HSIO:HW_CFGSTAT:HW_QSGMII_STAT[11] from 0x714601a0
HSIO:HW_CFGSTAT:HW_QSGMII_STAT[11] => 0x20
  bits |    value   | field
   6:1 | 0x10       | DELAY_VAR_X200PS
     0 | 0x0        | SYNC
matt@lurch ~ (sidecar-sp) $ h monorail phy read -p44 MAC_SERDES_PCS_STATUS
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
Reading from port 44 PHY, register EXTENDED_3:MAC_SERDES_PCS_STATUS
Got result 0xc405
  bits |    value   | field
    15 | 0x1        | MAC_SYNC_FAIL
    14 | 0x1        | MAC_CGBAD
    12 | 0x0        | SGMII_ALIGN_ERROR
    11 | 0x0        | MAC_LP_ANEG_RESTART
     5 | 0x0        | MAC_FDX_ADV
     4 | 0x0        | MAC_HDX_ADV
     3 | 0x0        | MAC_LP_ANEG_CAPABLE
     2 | 0x1        | MAC_LINK_STATUS
     1 | 0x0        | MAC_ANEG_COMPLETE
     0 | 0x1        | MAC_PCS_SIG_DETECT
matt@lurch ~ (sidecar-sp) $ h monorail phy read -p44 MAC_SERDES_STATUS
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
Reading from port 44 PHY, register EXTENDED_3:MAC_SERDES_STATUS
Got result 0xd000

(the latter has bits set for "Comma realigned", "SerDes signal detect", and "MAC comma detect", but not "QSGMII sync status")
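The field values in the dumps above can be reproduced by hand. The following is a minimal sketch, assuming only the bit positions printed by `humility` for `HW_QSGMII_STAT` and `MAC_SERDES_PCS_STATUS`; the `field` helper is hypothetical and is not part of Hubris or Humility.

```rust
/// Hypothetical helper: extracts the inclusive bit range `hi:lo` from a
/// register value, mirroring the "bits | value | field" tables above.
fn field(value: u32, hi: u32, lo: u32) -> u32 {
    let width = hi - lo + 1;
    let mask = if width >= 32 { u32::MAX } else { (1u32 << width) - 1 };
    (value >> lo) & mask
}

fn main() {
    // HSIO:HW_CFGSTAT:HW_QSGMII_STAT[11] read back as 0x20.
    let qsgmii_stat = 0x20;
    println!("DELAY_VAR_X200PS = {:#x}", field(qsgmii_stat, 6, 1));
    println!("SYNC             = {:#x}", field(qsgmii_stat, 0, 0));

    // EXTENDED_3:MAC_SERDES_PCS_STATUS on port 44 read back as 0xc405.
    let pcs_status = 0xc405;
    println!("MAC_SYNC_FAIL      = {}", field(pcs_status, 15, 15));
    println!("MAC_CGBAD          = {}", field(pcs_status, 14, 14));
    println!("MAC_LINK_STATUS    = {}", field(pcs_status, 2, 2));
    println!("MAC_ANEG_COMPLETE  = {}", field(pcs_status, 1, 1));
    println!("MAC_PCS_SIG_DETECT = {}", field(pcs_status, 0, 0));
}
```

Both registers agree on the symptom: `SYNC` is clear on the VSC7448 side, and `MAC_SYNC_FAIL` is set on the PHY side.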

The failure was not resolved by reinitializing the PHY (using Monorail.reinit); it was only resolved by power-cycling the entire Sidecar. This is confusing, because reinitialization should also power-cycle the PHY. It's possible that the 10 ms wait time isn't long enough to fully discharge the rail.

Next steps are:

  • Scope that power rail during reinitialization, to see how quickly it discharges
  • Get the system stuck again, then
    • Scope the QSGMII lines to get ground-truth data
    • Test reinitialization with a longer wait time (100 ms?)

cc @Aaron-Hartwig @refugeesus

@mkeeter mkeeter added this to the FCS milestone Jun 13, 2023
@mkeeter mkeeter changed the title Link to front IO PHY sometimes doesn't come up QSGMII link to front IO PHY sometimes doesn't come up Jun 13, 2023
@mkeeter
Collaborator Author

mkeeter commented Jun 13, 2023

Testing PHY reinitialization with a long wait time (10s) still leaves the system stuck. This leads me to suspect the VSC7448, since it's not being power cycled, but getting ground-truth readings is going to be essential for debugging.

@refugeesus

:( ok. Someone will need to probe the board in the office asap

@refugeesus

refugeesus commented Jun 20, 2023

OK so sad news, our osc3 does a thing that makes it output the incorrect frequency sometimes. See this note from microshit:
https://ww1.microchip.com/downloads/en/DeviceDoc/DSC11xx-Family-Silicon-Errata-DS80000982A.pdf

Unfortunately the parts which tri-state are unobtainable or have very long lead times.

@refugeesus

This is a happy clock:
image

This is a sad clock:
image

Unfortunately sometimes we see a sad clock...

@refugeesus

refugeesus commented Jun 20, 2023

The off-frequency signal, which is a symptom of the above problems:
image

This is the VSC7448 side of our link.

@nathanaelhuffman
Contributor

Unfortunately, I don't think it's wise to plan any more rework on this pass of sidecar. As discussed on the hardware tactical today, given the ~2% boot failure rate, we're proposing that the sidecar power-cycle the qsfp board (software workaround) in the cases where this issue is detected. @arjenroodselaar is signed up to scope out that work.

This isn't awesome and we should consider using a different part in the future.

@refugeesus

refugeesus commented Jun 20, 2023

First, we are going to try and power cycle the Front IO board from Sidecar.

Alternatively, per our huddle, I plan to sever the FPGA's connection to the enable of our current osc (in a reparable way), and we will attempt to work with the VSC when we violate its sequencing instructions (it typically wants power before refclk).

@arjenroodselaar
Contributor

An update on this issue: https://github.com/oxidecomputer/hubris/tree/front_io_bad_osc contains changes across the sequencer task, monorail task and the controller bitstreams to work around this issue. This is currently running in a loop where the system is power cycled and the links are checked afterwards. So far the monorail task has detected two instances where the QSGMII link did not come up, and the front IO board needed to be power cycled and the PHY reinitialized to work around the problem. Afterwards the QSGMII link and technician ports worked as intended.

This will take a few days to get through review, but so far a software workaround seems adequate.
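The workaround described above (check the link, power cycle the front IO board, reinitialize the PHY, repeat) can be sketched as a bounded retry loop. This is an illustration only, not the code on the `front_io_bad_osc` branch: the `FrontIo` trait, `bring_up_link`, and `SimulatedBoard` are hypothetical names, and the real implementation is split across the monorail and sequencer tasks.

```rust
/// Hypothetical abstraction over the operations the workaround needs.
trait FrontIo {
    /// Returns true if the QSGMII link to the front IO PHY has synced.
    fn qsgmii_synced(&mut self) -> bool;
    /// Asks the sequencer to power cycle the front IO board.
    fn power_cycle_front_io(&mut self);
    /// Reinitializes the PHY after the board is powered again.
    fn reinit_phy(&mut self);
}

/// Power cycles the front IO board until the link syncs or the budget is
/// used up. Returns the number of power cycles performed: `Ok` if the
/// link came up, `Err` if it is still down.
fn bring_up_link<H: FrontIo>(hw: &mut H, max_power_cycles: u32) -> Result<u32, u32> {
    let mut cycles = 0;
    loop {
        if hw.qsgmii_synced() {
            return Ok(cycles);
        }
        if cycles == max_power_cycles {
            return Err(cycles);
        }
        hw.power_cycle_front_io();
        hw.reinit_phy();
        cycles += 1;
    }
}

/// Simulated board whose oscillator starts badly `bad_boots` times in a row.
struct SimulatedBoard {
    bad_boots: u32,
    power_cycles: u32,
}

impl FrontIo for SimulatedBoard {
    fn qsgmii_synced(&mut self) -> bool {
        self.power_cycles >= self.bad_boots
    }
    fn power_cycle_front_io(&mut self) {
        self.power_cycles += 1;
    }
    fn reinit_phy(&mut self) {}
}

fn main() {
    let mut board = SimulatedBoard { bad_boots: 2, power_cycles: 0 };
    match bring_up_link(&mut board, 5) {
        Ok(n) => println!("link up after {n} front IO power cycle(s)"),
        Err(n) => println!("link still down after {n} power cycle(s)"),
    }
}
```

Bounding the number of retries keeps a board with a genuinely dead link from power cycling forever; the limit of 5 used here is arbitrary.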

@arjenroodselaar
Contributor

arjenroodselaar commented Jun 30, 2023

This ran overnight and 1464 power cycles of Sidecar were done. During 57 of those cycles, monorail-server determined that the QSGMII link was not functional and requested one or more power cycles of the front IO board from the sequencer. Once the QSGMII link came up, ping tests using both technician ports succeeded in all 1464 cycles.
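For reference, the rate implied by those counts can be worked out directly. A small sketch, where `failure_rate_percent` is a hypothetical helper and the only inputs are the two numbers reported above:

```rust
/// Hypothetical helper: percentage of runs in which the event occurred.
fn failure_rate_percent(failures: u32, total: u32) -> f64 {
    100.0 * f64::from(failures) / f64::from(total)
}

fn main() {
    // 57 of 1464 Sidecar power cycles needed at least one front IO power cycle.
    let rate = failure_rate_percent(57, 1464);
    println!("QSGMII link failed to come up on first try in {rate:.1}% of boots");
}
```

That works out to roughly 3.9% of boots in this run, somewhat higher than the ~2% estimate mentioned earlier in the thread.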

@mkeeter
Collaborator Author

mkeeter commented Jul 10, 2023

Done in #1449

@mkeeter mkeeter closed this as completed Jul 10, 2023