npu2/hw-procedures: Remove assertion from check_credits()
The RX clock mux in the NVLink PHY can glitch, which manifests as
hard-to-diagnose behavior; at best, a checkstop during the first link
traffic. The only reliable way we found to detect this was by checking
for a discrepancy in the credits we expect to receive during link
training.

Since the check was added, we've found that:

* Commit ac6f159 ("npu2: hw-procedures: Add phy_rx_clock_sel()")
does work around the original glitch.

* Asserting is too harsh. Before the root cause was established, it was
thought this could have been a manufacturing defect, and we wanted
hardware acceptance boot cycle tests to fail loudly.

* There appears to be a valid situation in which credits differ from
the expected value. During GPU hot reset, a CPU prefetch across the link
can affect the credit count before we check.

Given all of the above, remove the assert().

Cc: stable # 6.0.x
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
rarbab authored and oohal committed Nov 19, 2019
1 parent c9c6815 commit 24664b4
Showing 1 changed file with 6 additions and 9 deletions.
15 changes: 6 additions & 9 deletions hw/npu2-hw-procedures.c
@@ -780,17 +780,14 @@ static uint32_t check_credit(struct npu2_dev *ndev, uint64_t reg,

 static uint32_t check_credits(struct npu2_dev *ndev)
 {
-	int fail = 0;
 	uint64_t val;

-	fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL);
-	fail += CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL);
-
-	assert(!fail);
+	CHECK_CREDIT(ndev, NPU2_NTL_CRED_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_RSP_HDR_CREDIT_RX, 0x0BE0BE0000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_CRED_DATA_CREDIT_RX, 0x1001000000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_RSP_DATA_CREDIT_RX, 0x1001000000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_DBD_HDR_CREDIT_RX, 0x0640640000000000ULL);
+	CHECK_CREDIT(ndev, NPU2_NTL_ATSD_HDR_CREDIT_RX, 0x0200200000000000ULL);

 	val = npu2_read(ndev->npu, NPU2_NTL_MISC_CFG1(ndev));
 	val &= 0xFF3FFFFFFFFFFFFFUL;
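
For readers who want to see the warn-and-continue pattern outside of skiboot, here is a minimal standalone sketch. It is not the code from this commit: `struct sim_dev`, `sim_check_credit()` and `sim_check_credits()` are hypothetical names, the register array stands in for real MMIO reads, and the assumption that each per-register check logs its own mismatch is ours, since the body of `check_credit()` is not shown in this hunk. Only the two expected values are taken from the diff above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a device; real code would read MMIO registers. */
struct sim_dev {
	uint64_t regs[2];
};

/*
 * Compare one simulated credit register against its expected value.
 * Logs to stderr and returns 1 on mismatch, returns 0 on match.
 */
static int sim_check_credit(const struct sim_dev *dev, int idx,
			    const char *name, uint64_t expected)
{
	uint64_t val = dev->regs[idx];

	if (val == expected)
		return 0;

	fprintf(stderr, "%s: expected 0x%016llx, read 0x%016llx\n", name,
		(unsigned long long)expected, (unsigned long long)val);
	return 1;
}

/*
 * Warn-and-continue: every register is checked and mismatches are
 * reported, but nothing here aborts. The caller gets the mismatch count.
 */
static int sim_check_credits(const struct sim_dev *dev)
{
	int mismatches = 0;

	mismatches += sim_check_credit(dev, 0, "HDR_CREDIT_RX",
				       0x0BE0BE0000000000ULL);
	mismatches += sim_check_credit(dev, 1, "DATA_CREDIT_RX",
				       0x1001000000000000ULL);

	return mismatches;
}
```

Unlike the commit, which discards the result of each `CHECK_CREDIT()` call, this sketch returns the mismatch count so a caller can decide how to react. That is a design variation for illustration, not something the patch does.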