Rack 2 powered off and left blinking power sequencers behind #1800

Open
leftwo opened this issue May 22, 2024 · 2 comments

Comments

leftwo commented May 22, 2024

It was noticed that rack2 (dogfood) was not responding.
This rack has a single power whip connected to it.

PSC-blinks.mp4

The rack had powered off, and three of the sequencers were blinking green.

The PSC was removed and re-inserted into the chassis.
From chat, @cbiffle wrote this summary:

Alright, what we know / don't about that PSC behavior:

  • The rectifier enable lights appear to have been cycling at a bit over 1 Hz.
  • They were not perfectly in lockstep -- in the time between the initial video and when we got on the call, their phases had shifted relative to one another.
  • The PSC status light was steady.
  • When the PSC was detached, the rectifiers started behaving.
  • We had no management network access at the time, likely because the power was off, so we don't have logs from the PSC. Because this is a B rev (and, more generally, because persistent log support isn't implemented yet due to lack of C rev availability), any logs will have been lost when the PSC was unplugged.
  • Reinserting the PSC did not reproduce the problem.
Collaborator

cbiffle commented May 23, 2024

Alright, I think we've managed to tease this one out.

When the PSUs hit certain fault conditions, they drop their "OK" (active high) line. Up until this month, nobody had written code to actually monitor that line, and we learned of fault conditions in which the PSUs required active intervention to turn back on. In that state they would hang out with an amber light lit.

I added code to the PSC to attempt to cycle the PSUs and clear faults like this, which has been released. Because we don't have a power shelf for testing in EMY, I did all the testing of that change with a hand-wired mockup. It appears my hand-wired mockup got one of the PSU behaviors wrong:

It turns out the PSUs require you to re-enable them before they will stop indicating a fault condition. I had added logic to try to avoid cycling them on and off unnecessarily, which in practice has the effect of never turning them back on in this class of fault condition. We need to change this logic to turn the PSU on and wait a bit before deciding if it's back or not.
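To make the stalemate concrete, here is a minimal sketch in Rust. It is not the PSC firmware: `SimPsu`, `Recovery`, both step functions, and the tick counts are hypothetical names and values invented for illustration. The simulated PSU models only the behavior described in this thread, under the assumption that a fault latches OK low until the enable line goes from off to on.

```rust
/// Simulated PSU (hypothetical model, based only on this thread):
/// after a fault, OK (active high) stays low until the PSU is re-enabled.
#[derive(Debug)]
struct SimPsu {
    enabled: bool,
    faulted: bool,
}

impl SimPsu {
    fn new() -> Self {
        Self { enabled: true, faulted: false }
    }

    /// Latch a fault; OK drops until the next off -> on transition.
    fn inject_fault(&mut self) {
        self.faulted = true;
    }

    /// Drive the enable line. Assumption: a rising edge clears the fault.
    fn set_enabled(&mut self, on: bool) {
        if on && !self.enabled {
            self.faulted = false;
        }
        self.enabled = on;
    }

    /// OK is only asserted while enabled and not faulted.
    fn ok(&self) -> bool {
        self.enabled && !self.faulted
    }
}

/// Stalemate policy: only (re-)enable once OK is seen.
/// A disabled PSU never asserts OK, so it is never turned back on.
fn step_wait_for_ok(psu: &mut SimPsu) {
    let ok = psu.ok();
    psu.set_enabled(ok);
}

/// Recovery states for the "turn it on, then wait a bit" policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Recovery {
    Running,
    Off { ticks_left: u32 },     // PSU commanded off
    Probing { ticks_left: u32 }, // PSU re-enabled, waiting to see OK
}

// Arbitrary illustrative durations, in ticks.
const OFF_TICKS: u32 = 2;
const PROBE_TICKS: u32 = 3;

/// Re-enable first, then give the PSU a grace period before judging it.
fn step_enable_then_check(state: Recovery, psu: &mut SimPsu) -> Recovery {
    match state {
        Recovery::Running if psu.ok() => Recovery::Running,
        Recovery::Running => {
            psu.set_enabled(false);
            Recovery::Off { ticks_left: OFF_TICKS }
        }
        Recovery::Off { ticks_left: 0 } => {
            psu.set_enabled(true);
            Recovery::Probing { ticks_left: PROBE_TICKS }
        }
        Recovery::Off { ticks_left } => Recovery::Off { ticks_left: ticks_left - 1 },
        Recovery::Probing { .. } if psu.ok() => Recovery::Running,
        Recovery::Probing { ticks_left: 0 } => {
            // Still not OK after the grace period: cycle it off again.
            psu.set_enabled(false);
            Recovery::Off { ticks_left: OFF_TICKS }
        }
        Recovery::Probing { ticks_left } => Recovery::Probing { ticks_left: ticks_left - 1 },
    }
}
```

Calling `step_wait_for_ok` repeatedly on a faulted `SimPsu` leaves it disabled indefinitely, while stepping `step_enable_then_check` from `Recovery::Running` cycles the PSU off, back on, and returns to `Running` once OK is observed.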

While a PSU is disabled in this manner, it blinks its light green at about 1 Hz. This means "I'm off," confusingly. This is the signal we've been seeing: it's a sign that the PSC is commanding the PSU off. Due to my misunderstanding of the behavior of the PSU fault signals, the PSC unfortunately never turns the PSU back on.

It turns out that this class of fault condition is relatively easy to reproduce on a lab rack: sneak in via Humility and alter the PSU enable line state. So we have a way to test this in Dogfood now that we have an extender card mounted.

cbiffle added a commit that referenced this issue May 23, 2024
See #1800.

In brief: the PSU won't start asserting OK again until it's re-enabled,
whereas we were waiting to see OK before we would re-enable. This
produced something of a stalemate.