
gimlet-seq: retry I2C errors, or die trying #1611

Merged
merged 3 commits into master from eliza/dont-panic on Feb 14, 2024

Conversation

@hawkw (Member) commented Feb 10, 2024

Currently, the sequencer task uses `unwrap()` on I2C bus operations.
However, these may fail due to transient bus errors, or fail permanently
for reasons wholly outside the sequencer task's control. As discussed in
issue #1205, it's incorrect for the sequencer task to unwrap these
errors, as panicking results in the task being restarted. On fatal bus
errors, this means we crash loop.

Instead, this commit changes the `gimlet-seq` task to retry failed I2C
reads/writes up to three times, and, if they still fail, transition to a
permanent failed state, setting the FAULT net to Ignition to indicate
that the Gimlet must be power-cycled.
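The retry-then-fail behavior described above can be sketched roughly as follows; the function name, retry-count plumbing, and error types are illustrative, not the actual gimlet-seq code:

```rust
const MAX_RETRIES: usize = 3;

/// Run `txn`; on failure, retry up to `MAX_RETRIES` more times, returning
/// the last error if every attempt fails.
fn retry_i2c_txn<T, E>(mut txn: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut retries_remaining = MAX_RETRIES;
    loop {
        match txn() {
            Ok(x) => return Ok(x),
            Err(_transient) if retries_remaining > 0 => {
                // A real implementation would record the fault in a ring
                // buffer here before retrying.
                retries_remaining -= 1;
            }
            // Out of retries: surface the final error to the caller, which
            // then transitions to the permanent failed state.
            Err(e) => return Err(e),
        }
    }
}
```

The caller, on `Err`, would set the FAULT net and stop servicing the power sequence rather than panicking.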

@hawkw hawkw requested a review from cbiffle February 10, 2024 01:16
drv/gimlet-seq-server/src/main.rs (outdated; resolved)
@hawkw (Member, Author):

N.B. that the git diff is super messy just because I moved a big chunk of code out of main and into a function; it's mostly unchanged, but GitHub doesn't really seem to get that. All the functional changes are in commit ee898f0, which might be a bit easier to review.

@rmustacc (Contributor) commented:

A general question I have from an upstack consumer perspective:

> Instead, this commit changes the gimlet-seq task to retry failed I2C
> reads/writes up to three times, and, if they still fail, transition to a
> permanent failed state, setting the FAULT net to Ignition to indicate
> that the Gimlet must be power-cycled.

My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.

If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?

@hawkw (Member, Author) commented Feb 12, 2024

@rmustacc If I'm being entirely honest, that seems like more of a question for @cbiffle — I just picked up the issue as he described it in #1205. Cliff probably has a better sense of how this is expected to interact with the rest of the system.

With that said, I believe the visible behavior on failures should be similar? Currently, the fault pin is pulled low during the startup process, and only set high once the sequencer task is actually serving IPC requests. If the sequencer task crashes and enters a crashloop before it starts serving IPC requests, the fault pin remains asserted. Now, if a failure occurs during startup, the sequencer task no longer crash loops, but it still won't set the fault pin high, so the fault pin still remains asserted indefinitely if the sequencer task can't come up.

@hawkw hawkw marked this pull request as ready for review February 13, 2024 19:54
@cbiffle (Collaborator) left a comment:
I think moving init to ServerImpl has more justification than you suggest in your commit message -- you've basically done a "parse don't validate," since ServerImpl is written with a lot of embedded assumptions that init succeeded. So it's now harder to accidentally start a server without finishing init.
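A minimal sketch of that "parse, don't validate" shape, with entirely illustrative names: `init` is the only way to construct a `ServerImpl`, so merely holding a `ServerImpl` proves initialization finished:

```rust
/// Illustrative stand-in for whatever state successful init produces.
struct Initialized {
    fpga_rev: u32,
}

/// The server can only be built from an `Initialized`, so its methods may
/// assume init succeeded.
struct ServerImpl {
    state: Initialized,
}

/// The sole way to obtain a `ServerImpl`: run initialization to completion.
fn init() -> Result<ServerImpl, &'static str> {
    // ...bring up rails, program the FPGA, etc. (elided)...
    Ok(ServerImpl {
        state: Initialized { fpga_rev: 2 },
    })
}

impl ServerImpl {
    fn fpga_rev(&self) -> u32 {
        // No unwrap needed: existing at all proves init finished.
        self.state.fpga_rev
    }
}
```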

Generally good, a few questions below:

```rust
// All these moments will be lost in time, like tears in rain...
// Time to die.
loop {
    hl::sleep_for(100);
```
@cbiffle (Collaborator):

The IPC server probably assumes that initialization was successful, so it's unsafe to start it here... that's too bad because it'd be really great to be able to ask it questions.

We'd have to introduce some notion of "I'm broken" into the IPC server for that to work.

Mostly just thinking out loud here.
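One hedged sketch of what such an "I'm broken" notion might look like (all names hypothetical, not the actual gimlet-seq IPC surface):

```rust
/// Hypothetical server state: `Running` wraps the real server; `Broken`
/// remembers why init failed so every request can be answered with it.
enum SeqState<S> {
    Running(S),
    Broken { code: u8 },
}

impl<S> SeqState<S> {
    /// Handle a request; once broken, every request gets the stored code.
    fn handle(&self, _request: u32) -> Result<u32, u8> {
        match self {
            // In the real task this would dispatch into the IPC server.
            SeqState::Running(_) => Ok(0),
            SeqState::Broken { code } => Err(*code),
        }
    }
}
```

The appeal is that a broken sequencer remains queryable for diagnostics instead of going silent.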

@hawkw (Member, Author):

Yeah, I was also thinking about possibly having it start serving, and just respond to everything with an "I'm broken" message. Would be happy to change that now, if you think that's a good idea, or in a later PR?

@cbiffle (Collaborator):

I feel like this PR is a significant marginal improvement, so there's no reason to ask you to cram more stuff in before merging it.

drv/gimlet-seq-server/src/main.rs (resolved)
drv/gimlet-seq-server/src/main.rs (resolved)
```rust
{
    // Chosen by fair dice roll, seems reasonable-ish?
    let mut retries_remaining = 3;
    loop {
```
@cbiffle (Collaborator):

I might consider structuring this as a `for _ in 0..retries_remaining` just to avoid accidentally creating an infinite loop through a logic error below.

@hawkw (Member, Author) commented Feb 13, 2024:

That was what I was going to do. But we want to return the last error when we run out of retries, which means we need to keep it around when we fall through to the end of the loop, and that felt a bit ugly; it would have to be something like this:

```rust
// Chosen arbitrarily, but don't worry! We'll never actually return this --- it's
// just because we need to initialize this to *something*.
let mut last_error = i2c::ResponseCode::BadResponse;
for retries_remaining in (0..3).rev() {
    match txn() {
        Ok(x) => return Ok(x),
        Err(e) => {
            let code = e.into();
            ringbuf_entry!(Trace::I2cFault {
                retries_remaining,
                code,
            });
            last_error = code;
        }
    }
}

Err(last_error)
```

In particular, having to assign `last_error` to something, even though we've never actually seen that error, felt a bit unfortunate.

what do you think?

@cbiffle (Collaborator):

Hmmmm. Fair point. Current structure's fine.

@hawkw (Member, Author):

Honestly, the more I think about it, I'm not sure whether initializing the variable to an arbitrary error code we'll never actually return, or the theoretically-infinite loop that isn't actually infinite, is more unpleasant. They're both kinda ugly...

@hawkw (Member, Author):

A third option, which is also not my favorite, is to do something like this:

```rust
for retries_remaining in (0..MAX_RETRIES).rev() {
    match txn() {
        Ok(x) => return Ok(x),
        Err(error) => {
            let code = error.into();
            ringbuf_entry!(Trace::I2cFault {
                retries_remaining,
                code,
            });
            if retries_remaining == 0 {
                return Err(code);
            }
        }
    }
}

unreachable!()
```

Now, we've gotten rid of the arbitrary error code that we'll never actually return, and the infinite loop that isn't actually infinite, but this comes at the cost of an `unreachable!()`, and having to write `unreachable!()` also makes me feel kind of sad.

If you have a preference for any of these bad options, I'll happily go with that one --- I don't really have strong feelings about which is the least objectionable. :)
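For completeness, a further variant that was not proposed in the thread: track the last error in an `Option`, so no placeholder error code is ever constructed. (`MAX_ATTEMPTS` is an assumed name, and note the `expect` is itself a panic site, which may make this unattractive in a panic-averse task.)

```rust
const MAX_ATTEMPTS: u32 = 3;

/// Run `txn` up to `MAX_ATTEMPTS` times, returning the last error if all
/// attempts fail.
fn retry<T, E>(mut txn: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last_error = None;
    for _retries_remaining in (0..MAX_ATTEMPTS).rev() {
        match txn() {
            Ok(x) => return Ok(x),
            // Remember the most recent error; a real version would also
            // record it in the ring buffer here.
            Err(e) => last_error = Some(e),
        }
    }
    // MAX_ATTEMPTS > 0, so the loop body ran at least once and set this.
    Err(last_error.expect("loop ran at least once"))
}
```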

drv/gimlet-seq-server/src/main.rs (outdated; resolved)
drv/gimlet-seq-server/src/main.rs (outdated; resolved)
@cbiffle (Collaborator) commented Feb 13, 2024:

> My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.
>
> If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?

@rmustacc - Until we're confident that potential crashloops are all gone from the sequencer task, I'm hesitant to remove the code that sets fault on startup. So we probably need to work out a "how much time does the sequencer have to start up before we consider it faulted" delay either way until we're more confident in the code.

This change fixes most (or all, possibly) of the I2C-related panics, but not all the panics.

@rmustacc (Contributor) commented:

> > My understanding is that #1556 has caused the visible state of the fault signal to be asserted the entire time that we are booting up. So I'm not sure how this intersects with this. In particular, given the change in that PR, we cannot actually know that we need to power cycle that Gimlet without creating some sort of startup time-out countdown timer. That is, upstack we'd have to know that this has remained asserted for some time x to believe that this is valid versus the original theory of being able to say whenever it's asserted we know it's a problem.
> > If the logic in 1556 happens before this, how do we distinguish the two? If not, do they intersect?
>
> @rmustacc - Until we're confident that potential crashloops are all gone from the sequencer task, I'm hesitant to remove the code that sets fault on startup. So we probably need to work out a "how much time does the sequencer have to start up before we consider it faulted" delay either way until we're more confident in the code.

That's fine. Just want to make sure that we can actually use this up stack and then we'll have to figure out how that fits into the ignition controller and related (or the future looking interface overhaul you have planned). I realize the upstack stuff isn't there yet so I don't ask this to block this at all, just to make sure we know how it all would fit together.

@cbiffle (Collaborator) commented Feb 13, 2024:

@rmustacc - understood. The main improvement of this change is that it'll stop frantically reinitializing the FPGA and bouncing rails when I2C messes up; improved upstream diagnostics are kind of a side effect. There's more we can do to improve the diagnostics beyond this change.

@cbiffle (Collaborator) left a comment:

(please squash or fold your most recent commits)
(or all of them, I'm not the cops)

hawkw added a commit that referenced this pull request Feb 14, 2024
Currently, the sequencer task uses `unwrap()` on I2C bus operations.
However, these may fail due to transient bus errors, or fail permanently
for reasons wholly outside the sequencer task's control. As discussed in
issue #1205, it's incorrect for the sequencer task to unwrap these
errors, as panicking results in the task being restarted. On fatal bus
errors, this means we crash loop.

Instead, this commit changes the `gimlet-seq` task to retry failed I2C
reads/writes up to three times, and, if they still fail, transition to a
permanent failed state, setting the FAULT net to Ignition to indicate
that the Gimlet must be power-cycled.

Fixes #1205
hawkw added a commit that referenced this pull request Feb 14, 2024
This way, we only construct a `ServerImpl` when the sequencer task is in
a state to serve IPC requests. This isn't all that important, but it's
nice to make it a bit harder to accidentally start a server without
finishing init.
hawkw added a commit that referenced this pull request Feb 14, 2024
Sleeping for 100ms in a loop is not a particularly efficient way for a
task to die permanently. Thanks to suggestions from @labbott and
@cbiffle, I've changed this so that the sequencer task now dies by
waiting for a notification with an empty notification mask, essentially
waiting forever for a notification that should never happen. This is
much more respectful of other tasks that might still be trying to do
stuff, and would like to be scheduled to do that stuff.

Also, I've cleaned up a couple other things based on some of Cliff's
suggestions. Because we no longer set the state when dying, we can pass
the `jefe` handle into `init` by value, which is a bit nicer.
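The "wait forever on an empty notification mask" idiom can be approximated outside Hubris with std channels; this is a model of the idea, not the actual Hubris syscall API. The receiver blocks without consuming CPU because the sender is kept alive but never used:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Returns a receiver that can never be notified: the paired sender is
/// handed back so the caller can keep it alive (dropping it would wake the
/// receiver with a disconnect error instead of blocking forever).
fn never_notified() -> (Sender<()>, Receiver<()>) {
    channel::<()>()
}

// A task that wants to die quietly would then do something like:
//     let (_keepalive, rx) = never_notified();
//     let _ = rx.recv(); // blocks forever, consuming no CPU
```

This is the scheduler-friendly analogue of the old `sleep_for(100)` loop, which woke the task ten times a second for no reason.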
@hawkw hawkw force-pushed the eliza/dont-panic branch 2 times, most recently from a0cdaf6 to b02c0d7, February 14, 2024 00:16
@hawkw hawkw enabled auto-merge (rebase) February 14, 2024 00:17
@hawkw hawkw merged commit 66b9453 into master Feb 14, 2024
83 checks passed