psc: allow PSUs a grace period before requiring OK after enable. #1802

Merged (1 commit) on May 24, 2024

@cbiffle cbiffle commented May 24, 2024

In fault injection testing I've seen exactly _one time_ when the PSU failed to assert OK quickly after being enabled. This caused us to fault it and try again, at which point it worked. But faulting it inserts an additional 5s delay into the fault recovery cycle, and in theory that cycle could repeat forever.

This change adds a grace period after taking a faulted PSU out of fault status, before we start expecting it to assert OK. This mirrors the grace period we provide for newly inserted PSUs.
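
For readers who want a concrete picture of the grace period, here is a minimal sketch of the idea, not the actual `psc-seq-server` code. The type `PsuState`, the function `step`, and both constants are hypothetical names made up for illustration. The 5s and 1s values are taken from the numbers mentioned in this conversation, and timestamps are assumed to be in milliseconds.

```rust
/// Hypothetical per-PSU state. These names are illustrative only and are
/// not the real types used in drv/psc-seq-server.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PsuState {
    /// PSU is held disabled after a fault until `until` has passed.
    Faulted { until: u64 },
    /// PSU has just been enabled; its OK line is ignored until `deadline`.
    Probation { deadline: u64 },
    /// PSU is enabled and its OK line is being monitored.
    On,
}

/// Assumed timings in milliseconds (5s fault delay, 1s grace period).
pub const FAULT_DELAY_MS: u64 = 5_000;
pub const PROBATION_MS: u64 = 1_000;

/// Advances one PSU's state given the current time and its OK line.
pub fn step(state: PsuState, now: u64, ok: bool) -> PsuState {
    match state {
        PsuState::Faulted { until } => {
            if until <= now {
                // Re-enable the PSU, but do not expect OK right away.
                PsuState::Probation { deadline: now + PROBATION_MS }
            } else {
                state
            }
        }
        PsuState::Probation { deadline } => {
            if deadline <= now {
                // Grace period over: start monitoring the OK line.
                // This transition is unconditional; OK is not consulted.
                PsuState::On
            } else {
                state
            }
        }
        PsuState::On => {
            if ok {
                state
            } else {
                // OK deasserted while monitored: fault and wait to retry.
                PsuState::Faulted { until: now + FAULT_DELAY_MS }
            }
        }
    }
}
```

In this sketch a PSU that is slow to assert OK after enable stays in `Probation` rather than bouncing straight back to `Faulted`, which is what avoids the extra 5s on each retry.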

@cbiffle cbiffle requested a review from hawkw May 24, 2024 17:33
@hawkw hawkw left a comment

this looks good to me, very straightforward!

drv/psc-seq-server/src/main.rs (outdated)

    ) => {
        if deadline <= now {
            // Take PSU out of probation state and start monitoring its
            // OK line.

hawkw (Member):

do you think it's worth including a ringbuf entry indicating that the PSU has come out of probation?

cbiffle (Collaborator, Author):

I considered that, and I don't think it's worth it, since it's an unconditional event 1 second after On -- you'll either see it stay on, or you'll see it disable.

@cbiffle cbiffle enabled auto-merge (rebase) May 24, 2024 18:06
@cbiffle cbiffle merged commit 01adfc4 into master May 24, 2024
104 checks passed
@cbiffle cbiffle deleted the cbiffle/psc-noflap branch May 24, 2024 18:12