Commit
npu2/hw-procedures: Remove assertion from check_credits()
The RX clock mux in the NVLink PHY can glitch, which will manifest in hard to diagnose behavior--at best, a checkstop during the first link traffic. The only reliable way we found to detect this was by checking for a discrepancy in the credits we expect to receive during link training.

Since the time the check was added, we've found that

* Commit ac6f159 ("npu2: hw-procedures: Add phy_rx_clock_sel()") does work around the original glitch.

* Asserting is too harsh. Before root cause was established, it was thought this could have been a manufacturing defect and we wanted to loudly fail hardware acceptance boot cycle tests.

* It seems there is a valid situation in which credits are off from the expected value. During GPU hot reset, a CPU prefetch across the link can affect the credit count before we check.

Given all of the above, remove the assert().

Cc: stable # 6.0.x
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
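
The behavioral change described above, reporting a credit mismatch instead of asserting on it, might look roughly like the sketch below. This is an illustration only, not the skiboot code: `struct credit_check`, `count_credit_mismatches()`, and the use of `fprintf()` for logging are all hypothetical names and choices made for this example. The real check_credits() reads hardware registers and uses the firmware's own logging.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration; not the skiboot implementation. */
struct credit_check {
	const char *name;   /* label used in the log message */
	uint64_t observed;  /* value read back after link training */
	uint64_t expected;  /* value link training is expected to produce */
};

/*
 * Log every mismatch and return how many were found.
 * The caller is not aborted: a mismatch can be legitimate, for
 * example when a prefetch consumes credits before the check runs.
 */
static int count_credit_mismatches(const struct credit_check *checks, size_t n)
{
	int mismatches = 0;

	for (size_t i = 0; i < n; i++) {
		if (checks[i].observed == checks[i].expected)
			continue;

		fprintf(stderr, "%s: expected 0x%" PRIx64 ", read 0x%" PRIx64 "\n",
			checks[i].name, checks[i].expected, checks[i].observed);
		mismatches++;
	}

	/* Previously the equivalent of assert(mismatches == 0) would sit here. */
	return mismatches;
}
```

A caller in this sketch would treat a nonzero return as diagnostic information and continue, rather than halting boot.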