npu2-opencapi: Log a warning when resetting a broken device
On P9, the NPU doesn't support recovery if the link goes down
unexpectedly; recovery was not fully verified. We mark the device as
broken when we receive an error interrupt from the NPU. However,
nothing prevents the OS from trying to reset the device. The reset may
or may not work, and it is unsupported territory, so log a message to
make that clear, as it could help when debugging. We haven't hit any
cases where the reset goes badly enough that we'd want to prevent it,
so allow it for now. We can revisit later if we have evidence that
it's causing more problems than it is worth.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
fbarrat authored and oohal committed Oct 22, 2019
1 parent 9d5faaf commit 233e863
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions hw/npu2-opencapi.c
@@ -1203,6 +1203,10 @@ static int64_t npu2_opencapi_poll_link(struct pci_slot *slot)
 	case OCAPI_SLOT_LINK_TRAINED:
 		otl_enabletx(chip_id, dev->npu->xscom_base, dev);
 		pci_slot_set_state(slot, OCAPI_SLOT_NORMAL);
+		if (dev->flags & NPU2_DEV_BROKEN) {
+			OCAPIERR(dev, "Resetting a device which hit a previous error. Device recovery is not supported, so future behavior is undefined\n");
+			dev->flags &= ~NPU2_DEV_BROKEN;
+		}
 		check_perf_counters(dev);
 		dev->phb_ocapi.scan_map = 1;
 		return OPAL_SUCCESS;
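
The pattern in this hunk (a flag set on error, then a warning and a clear once the link retrains) can be sketched on its own. This is a minimal illustration and not skiboot code: `demo_dev`, `DEMO_DEV_BROKEN`, `demo_mark_broken()` and `demo_link_trained()` are hypothetical stand-ins for the real NPU2 structures and the `OCAPIERR()` logging macro, and plain `fprintf()` is assumed in place of the firmware log.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag bit, standing in for NPU2_DEV_BROKEN. */
#define DEMO_DEV_BROKEN 0x1u

/* Hypothetical device state; the real struct carries much more. */
struct demo_dev {
	uint32_t flags;
	int index;
};

/* Assumed to be called from the error-interrupt path: remember the failure. */
static void demo_mark_broken(struct demo_dev *dev)
{
	dev->flags |= DEMO_DEV_BROKEN;
}

/*
 * Assumed to be called once the link has retrained after a reset.
 * If the device was previously marked broken, log a warning and clear
 * the flag so the warning is emitted once per reset.
 * Returns true if a warning was logged.
 */
static bool demo_link_trained(struct demo_dev *dev)
{
	if (!(dev->flags & DEMO_DEV_BROKEN))
		return false;

	fprintf(stderr,
		"dev %d: resetting a device which hit a previous error; "
		"recovery is not supported\n", dev->index);
	dev->flags &= ~DEMO_DEV_BROKEN;
	return true;
}
```

Calling `demo_mark_broken()` and then `demo_link_trained()` twice should log on the first call only, since the first call clears the flag. The reset itself is not blocked, which matches the intent stated in the commit message.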