
gimlet-seq: retry I2C errors, or die trying #1611

Merged
merged 3 commits into master from eliza/dont-panic on Feb 14, 2024

Conversation

@hawkw (Member) commented Feb 10, 2024

Currently, the sequencer task uses `unwrap()` on I2C bus operations.
However, these may fail due to transient bus errors, or fail permanently
for reasons wholly outside the sequencer task's control. As discussed in
issue #1205, it's incorrect for the sequencer task to unwrap these
errors, as panicking results in the task being restarted. On fatal bus
errors, this means we crash loop.

Instead, this commit changes the `gimlet-seq` task to retry failed I2C
reads/writes up to three times, and, if they still fail, transition to a
permanent failed state, setting the FAULT net to Ignition to indicate
that the Gimlet must be power-cycled.
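The retry-then-fail behavior described above can be sketched roughly as follows; the function name, retry-count plumbing, and error types are illustrative, not the actual gimlet-seq code:

```rust
const MAX_RETRIES: usize = 3;

/// Run `txn`; on failure, retry up to `MAX_RETRIES` more times, returning
/// the last error if every attempt fails.
fn retry_i2c_txn<T, E>(mut txn: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut retries_remaining = MAX_RETRIES;
    loop {
        match txn() {
            Ok(x) => return Ok(x),
            Err(_transient) if retries_remaining > 0 => {
                // A real implementation would record the fault in a ring
                // buffer here before retrying.
                retries_remaining -= 1;
            }
            // Out of retries: surface the final error to the caller, which
            // then transitions to the permanent failed state.
            Err(e) => return Err(e),
        }
    }
}
```

The caller, on `Err`, would set the FAULT net and stop servicing the power sequence rather than panicking.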

@hawkw hawkw requested a review from cbiffle February 10, 2024 01:16
drv/gimlet-seq-server/src/main.rs (outdated; resolved)
@hawkw (Member, Author):

N.B. that the git diff is super messy just because I moved a big chunk of code out of main and into a function; it's mostly unchanged, but GitHub doesn't really seem to get that. All the functional changes are in commit ee898f0, which might be a bit easier to review.

@rmustacc (Contributor) commented:

A general question I have from an upstack consumer perspective:

> Instead, this commit changes the gimlet-seq task to retry failed I2C
> reads/writes up to three times, and, if they still fail, transition to a
> permanent failed state, setting the FAULT net to Ignition to indicate
> that the Gimlet must be power-cycled.

My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.

If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?

@hawkw (Member, Author) commented Feb 12, 2024

@rmustacc If I'm being entirely honest, that seems like more of a question for @cbiffle — I just picked up the issue as he described it in #1205. Cliff probably has a better sense of how this is expected to interact with the rest of the system.

With that said, I believe the visible behavior on failures should be similar? Currently, the fault pin is pulled low during the startup process, and only set high once the sequencer task is actually serving IPC requests. If the sequencer task crashes and enters a crashloop before it starts serving IPC requests, the fault pin remains asserted. Now, if a failure occurs during startup, the sequencer task no longer crash loops, but it still won't set the fault pin high, so the fault pin still remains asserted indefinitely if the sequencer task can't come up.

@hawkw hawkw marked this pull request as ready for review February 13, 2024 19:54
@cbiffle (Collaborator) left a comment:
I think moving init to ServerImpl has more justification than you suggest in your commit message -- you've basically done a "parse don't validate," since ServerImpl is written with a lot of embedded assumptions that init succeeded. So it's now harder to accidentally start a server without finishing init.
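A minimal sketch of that "parse, don't validate" shape, with entirely illustrative names: `init` is the only way to construct a `ServerImpl`, so merely holding a `ServerImpl` proves initialization finished:

```rust
/// Illustrative stand-in for whatever state successful init produces.
struct Initialized {
    fpga_rev: u32,
}

/// The server can only be built from an `Initialized`, so its methods may
/// assume init succeeded.
struct ServerImpl {
    state: Initialized,
}

/// The sole way to obtain a `ServerImpl`: run initialization to completion.
fn init() -> Result<ServerImpl, &'static str> {
    // ...bring up rails, program the FPGA, etc. (elided)...
    Ok(ServerImpl {
        state: Initialized { fpga_rev: 2 },
    })
}

impl ServerImpl {
    fn fpga_rev(&self) -> u32 {
        // No unwrap needed: existing at all proves init finished.
        self.state.fpga_rev
    }
}
```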

Generally good, a few questions below:

```rust
// All these moments will be lost in time, like tears in rain...
// Time to die.
loop {
    hl::sleep_for(100);
```
@cbiffle (Collaborator):

The IPC server probably assumes that initialization was successful, so it's unsafe to start it here... that's too bad because it'd be really great to be able to ask it questions.

We'd have to introduce some notion of "I'm broken" into the IPC server for that to work.

Mostly just thinking out loud here.
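One hedged sketch of what such an "I'm broken" notion might look like (all names hypothetical, not the actual gimlet-seq IPC surface):

```rust
/// Hypothetical server state: `Running` wraps the real server; `Broken`
/// remembers why init failed so every request can be answered with it.
enum SeqState<S> {
    Running(S),
    Broken { code: u8 },
}

impl<S> SeqState<S> {
    /// Handle a request; once broken, every request gets the stored code.
    fn handle(&self, _request: u32) -> Result<u32, u8> {
        match self {
            // In the real task this would dispatch into the IPC server.
            SeqState::Running(_) => Ok(0),
            SeqState::Broken { code } => Err(*code),
        }
    }
}
```

The appeal is that a broken sequencer remains queryable for diagnostics instead of going silent.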

@hawkw (Member, Author):

Yeah, I was also thinking about possibly having it start serving, and just respond to everything with an "I'm broken" message. Would be happy to change that now, if you think that's a good idea, or in a later PR?

@cbiffle (Collaborator):

I feel like this PR is a significant marginal improvement, so there's no reason to ask you to cram more stuff in before merging it.

drv/gimlet-seq-server/src/main.rs (resolved)
drv/gimlet-seq-server/src/main.rs (resolved)
```rust
{
    // Chosen by fair dice roll, seems reasonable-ish?
    let mut retries_remaining = 3;
    loop {
```
@cbiffle (Collaborator):

I might consider structuring this as a `for _ in 0..retries_remaining` just to avoid accidentally creating an infinite loop through a logic error below.

@hawkw (Member, Author) commented Feb 13, 2024:

That was what I was going to do. But we want to return the last error when we run out of retries, which means we need to keep it around when we fall through to the end of the loop, and that felt a bit ugly; it would have to be something like this:

```rust
// Chosen arbitrarily, but don't worry! We'll never actually return this --- it's
// just because we need to initialize this to *something*.
let mut last_error = i2c::ResponseCode::BadResponse;
for retries_remaining in (0..3).rev() {
    match txn() {
        Ok(x) => return Ok(x),
        Err(e) => {
            let code = e.into();
            ringbuf_entry!(Trace::I2cFault {
                retries_remaining,
                code,
            });
            last_error = code;
        }
    }
}

Err(last_error)
```

In particular, having to assign `last_error` to something, even though we've never actually seen that error, felt a bit unfortunate.

what do you think?

@cbiffle (Collaborator):

Hmmmm. Fair point. Current structure's fine.

@hawkw (Member, Author):

Honestly, the more I think about it, I'm not sure whether initializing the variable to an arbitrary error code we'll never actually return, or the theoretically-infinite loop that isn't actually infinite, is more unpleasant. They're both kinda ugly...

@hawkw (Member, Author):

A third option, which is also not my favorite, is to do something like this:

```rust
for retries_remaining in (0..MAX_RETRIES).rev() {
    match txn() {
        Ok(x) => return Ok(x),
        Err(error) => {
            let code = error.into();
            ringbuf_entry!(Trace::I2cFault {
                retries_remaining,
                code,
            });
            if retries_remaining == 0 {
                return Err(code);
            }
        }
    }
}

unreachable!()
```

Now, we've gotten rid of the arbitrary error code that we'll never actually return, and the infinite loop that isn't actually infinite, but this comes at the cost of an `unreachable!()`, and having to write `unreachable!()` also makes me feel kind of sad.

If you have a preference for any of these bad options, I'll happily go with that one --- I don't really have strong feelings about which is the least objectionable. :)
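For completeness, a further variant that was not proposed in the thread: track the last error in an `Option`, so no placeholder error code is ever constructed. (`MAX_ATTEMPTS` is an assumed name, and note the `expect` is itself a panic site, which may make this unattractive in a panic-averse task.)

```rust
const MAX_ATTEMPTS: u32 = 3;

/// Run `txn` up to `MAX_ATTEMPTS` times, returning the last error if all
/// attempts fail.
fn retry<T, E>(mut txn: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last_error = None;
    for _retries_remaining in (0..MAX_ATTEMPTS).rev() {
        match txn() {
            Ok(x) => return Ok(x),
            // Remember the most recent error; a real version would also
            // record it in the ring buffer here.
            Err(e) => last_error = Some(e),
        }
    }
    // MAX_ATTEMPTS > 0, so the loop body ran at least once and set this.
    Err(last_error.expect("loop ran at least once"))
}
```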

drv/gimlet-seq-server/src/main.rs (outdated; resolved)
drv/gimlet-seq-server/src/main.rs (outdated; resolved)
@cbiffle (Collaborator) commented Feb 13, 2024:

> My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.
>
> If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?

@rmustacc - Until we're confident that potential crashloops are all gone from the sequencer task, I'm hesitant to remove the code that sets fault on startup. So we probably need to work out a "how much time does the sequencer have to start up before we consider it faulted" delay either way until we're more confident in the code.

This change fixes most (or all, possibly) of the I2C-related panics, but not all the panics.

@rmustacc (Contributor) commented:

> > My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.
> > If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?
>
> @rmustacc - Until we're confident that potential crashloops are all gone from the sequencer task, I'm hesitant to remove the code that sets fault on startup. So we probably need to work out a "how much time does the sequencer have to start up before we consider it faulted" delay either way until we're more confident in the code.

That's fine. Just want to make sure that we can actually use this up stack and then we'll have to figure out how that fits into the ignition controller and related (or the future looking interface overhaul you have planned). I realize the upstack stuff isn't there yet so I don't ask this to block this at all, just to make sure we know how it all would fit together.

@cbiffle (Collaborator) commented Feb 13, 2024:

@rmustacc - understood. The main improvement of this change is that it'll stop frantically reinitializing the FPGA and bouncing rails when I2C messes up; improved upstream diagnostics are kind of a side effect. There's more we can do to improve the diagnostics beyond this change.

@cbiffle (Collaborator) left a comment:

(please squash or fold your most recent commits)
(or all of them, I'm not the cops)

hawkw added a commit that referenced this pull request Feb 14, 2024
Currently, the sequencer task uses `unwrap()` on I2C bus operations.
However, these may fail due to transient bus errors, or fail permanently
for reasons wholly outside the sequencer task's control. As discussed in
issue #1205, it's incorrect for the sequencer task to unwrap these
errors, as panicking results in the task being restarted. On fatal bus
errors, this means we crash loop.

Instead, this commit changes the `gimlet-seq` task to retry failed I2C
reads/writes up to three times, and, if they still fail, transition to a
permanent failed state, setting the FAULT net to Ignition to indicate
that the Gimlet must be power-cycled.

Fixes #1205
hawkw added a commit that referenced this pull request Feb 14, 2024
This way, we only construct a `ServerImpl` when the sequencer task is in
a state to serve IPC requests. This isn't all that important, but it's
nice to make it a bit harder to accidentally start a server without
finishing init.
hawkw added a commit that referenced this pull request Feb 14, 2024
Sleeping for 100ms in a loop is not a particularly efficient way for a
task to die permanently. Thanks to suggestions from @labbott and
@cbiffle, I've changed this so that the sequencer task now dies by
waiting for a notification with an empty notification mask, essentially
waiting forever for a notification that should never happen. This is
much more respectful of other tasks that might still be trying to do
stuff, and would like to be scheduled to do that stuff.

Also, I've cleaned up a couple other things based on some of Cliff's
suggestions. Because we no longer set the state when dying, we can pass
the `jefe` handle into `init` by value, which is a bit nicer.
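The "wait forever on an empty notification mask" idiom can be approximated outside Hubris with std channels; this is a model of the idea, not the actual Hubris syscall API. The receiver blocks without consuming CPU because the sender is kept alive but never used:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Returns a receiver that can never be notified: the paired sender is
/// handed back so the caller can keep it alive (dropping it would wake the
/// receiver with a disconnect error instead of blocking forever).
fn never_notified() -> (Sender<()>, Receiver<()>) {
    channel::<()>()
}

// A task that wants to die quietly would then do something like:
//     let (_keepalive, rx) = never_notified();
//     let _ = rx.recv(); // blocks forever, consuming no CPU
```

This is the scheduler-friendly analogue of the old `sleep_for(100)` loop, which woke the task ten times a second for no reason.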
@hawkw hawkw force-pushed the eliza/dont-panic branch 2 times, most recently from a0cdaf6 to b02c0d7, February 14, 2024 00:16
@hawkw hawkw enabled auto-merge (rebase) February 14, 2024 00:17
@hawkw hawkw merged commit 66b9453 into master Feb 14, 2024
83 checks passed