Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thermal: don't panic if MAX31790 init fails #1798

Merged
merged 6 commits into from
May 29, 2024
Merged

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented May 21, 2024

Currently, the thermal task tries to initialize the MAX31790 fan
controller(s) once when the task starts, and unwrap_lites the result.
This means that in the face of a transient I2C error, the whole task
panics and gets restarted. This is unfortunate: sometimes, the bus is
briefly locked or arbitration is lost or something, and then everything
is fine. Furthermore, if we can't talk to the MAX31790, its watchdog
will trip and it'll just run the fans at 100%...and would probably
prefer to just let the thermal task keep running and publishing sensor
readings, rather than crashlooping.

This commit changes the thermal task to track whether it has
initialized the MAX31790(s), and if it hasn't yet, any operation which
will attempt to talk to the MAX31790(s) will first try to initialize the
fan controller. IPC operations which require an initialized fan
controller will return an error to the caller if it can't be
fan controller. IPC operations which require an initialized fan
controller will return an error to the caller if it can't be
initialized, but the thermal task will stay running, reading sensor
data, and continue trying to initialize the fan controller every time
the thermal control loop runs. This way, we don't have crash loops and
can tolerate temporary I2C jank.

For example, here's what happens when I flash the gimlet-c firmware on
a board that doesn't have a rear I2C bus:

eliza@niles ~ $ pfexec humility -t gimlet-c tasks
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
system time = 8810
ID TASK                       GEN PRI STATE
  0 jefe                         0   0 recv, notif: fault timer(T+90)
  1 net                          0   5 notif: jefe-state-change
  2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
  3 spi2_driver                  0   3 recv(gimlet_seq/gen0 only)
  4 i2c_driver                   0   3 recv
  5 spd                          0   2 notif: jefe-state-change
  6 packrat                      0   1 recv
  7 thermal                      0   5 wait: send to gimlet_seq/gen0
  8 power                        0   6 wait: send to gimlet_seq/gen0
  9 hiffy                        0   5 notif: bit31(T+232)
10 gimlet_seq                   0   4 notif: bit31(T+2)
11 gimlet_inspector             0   6 wait: send to net/gen0
12 hash_driver                  0   2 recv
13 hf                           0   3 notif: bit31(T+218)
14 update_server                0   3 recv
15 sensor                       0   4 recv
16 host_sp_comms                0   7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
17 udpecho                      0   6 wait: send to net/gen0
18 udpbroadcast                 0   6 wait: send to net/gen0
19 control_plane_agent          0   6 wait: send to net/gen0
20 sprot                        0   4 recv
21 validate                     0   5 recv
22 vpd                          0   4 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: send to net/gen0
25 sbrmi                        0   4 recv
26 idle                         0   8 RUNNING
27 udprpc                       0   6 wait: send to net/gen0
eliza@niles ~ $ pfexec humility -t gimlet-c ringbuf thermal
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
    TOTAL VARIANT
        8 FanControllerInitError(BusLocked)
        6 FanReadFailed
        6 FanAdded
        5 MiscReadFailed
        1 Start
        1 ThermalMode(Auto)
        1 AutoState(Boot)
  NDX LINE      GEN    COUNT PAYLOAD
    0  394        1        1 Start
    1  116        1        1 FanControllerInitError(BusLocked)
    2  180        1        1 ThermalMode(Auto)
    3  633        1        1 AutoState(Boot)
    4  116        1        1 FanControllerInitError(BusLocked)
    5  642        1        1 FanAdded(Fan(0x0))
    6  642        1        1 FanAdded(Fan(0x1))
    7  642        1        1 FanAdded(Fan(0x2))
    8  642        1        1 FanAdded(Fan(0x3))
    9  642        1        1 FanAdded(Fan(0x4))
  10  642        1        1 FanAdded(Fan(0x5))
  11  116        1        1 FanControllerInitError(BusLocked)
  12  675        1        1 FanReadFailed(SensorId(0x63), I2cError(BusLocked))
  13  116        1        1 FanControllerInitError(BusLocked)
  14  675        1        1 FanReadFailed(SensorId(0x64), I2cError(BusLocked))
  15  116        1        1 FanControllerInitError(BusLocked)
  16  675        1        1 FanReadFailed(SensorId(0x65), I2cError(BusLocked))
  17  116        1        1 FanControllerInitError(BusLocked)
  18  675        1        1 FanReadFailed(SensorId(0x66), I2cError(BusLocked))
  19  116        1        1 FanControllerInitError(BusLocked)
  20  675        1        1 FanReadFailed(SensorId(0x67), I2cError(BusLocked))
  21  116        1        1 FanControllerInitError(BusLocked)
  22  675        1        1 FanReadFailed(SensorId(0x68), I2cError(BusLocked))
  23  688        1        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
  24  688        1        1 MiscReadFailed(SensorId(0x6e), I2cError(BusLocked))
  25  688        1        1 MiscReadFailed(SensorId(0x6c), I2cError(BusLocked))
  26  688        1        1 MiscReadFailed(SensorId(0x6d), I2cError(BusLocked))
  27  688        1        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
eliza@niles ~ $

Note that:

  • The thermal task has not crashed, even though there's no MAX31790 on
    the bus whatsoever
  • Thermal control loop iterations have continued trying to poll the
    various sensors --- in this case, they also don't exist, but if they
    did, the sensor readings would be published successfully even though
    the thermal task isn't driving the fan controller.

Fixes #1779

hawkw added 5 commits May 21, 2024 14:20
Currently, the `thermal` task tries to initialize the MAX31790 fan
controller(s) once when the task starts, and `unwrap_lite`s the result.
This means that in the face of a transient I2C error, the whole task
panics and gets restarted. This is unfortunate: sometimes, the bus is
briefly locked or arbitration is lost or something, and then everything
is fine. Furthermore, if we can't talk to the MAX31790, its watchdog
will trip and it'll just run the fans at 100%...and would probably
prefer to just let the `thermal` task keep running and publishing sensor
readings, rather than crashlooping.

This commit changes the `thermal` task to track whether it has
initialized the MAX31790(s), and if it hasn't yet, any operation which
will attempt to talk to the MAX31790(s) will first try to initialize the
fan controller. IPC operations which require an initialized fan
controller will return an error to the caller if it can't be
initialized, but the `thermal` task will stay running, reading sensor
data, and continue trying to initialize the fan controller every time
the thermal control loop runs. This way, we don't have crash loops and
can tolerate temporary I2C jank.

For example, here's what happens when I flash the `gimlet-c` firmware on
a board that doesn't have a `rear` I2C bus:

```console
eliza@niles ~ $ pfexec humility -t gimlet-c tasks
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
system time = 8810
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+90)
 1 net                          0   5 notif: jefe-state-change
 2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 3 spi2_driver                  0   3 recv(gimlet_seq/gen0 only)
 4 i2c_driver                   0   3 recv
 5 spd                          0   2 notif: jefe-state-change
 6 packrat                      0   1 recv
 7 thermal                      0   5 wait: send to gimlet_seq/gen0
 8 power                        0   6 wait: send to gimlet_seq/gen0
 9 hiffy                        0   5 notif: bit31(T+232)
10 gimlet_seq                   0   4 notif: bit31(T+2)
11 gimlet_inspector             0   6 wait: send to net/gen0
12 hash_driver                  0   2 recv
13 hf                           0   3 notif: bit31(T+218)
14 update_server                0   3 recv
15 sensor                       0   4 recv
16 host_sp_comms                0   7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
17 udpecho                      0   6 wait: send to net/gen0
18 udpbroadcast                 0   6 wait: send to net/gen0
19 control_plane_agent          0   6 wait: send to net/gen0
20 sprot                        0   4 recv
21 validate                     0   5 recv
22 vpd                          0   4 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: send to net/gen0
25 sbrmi                        0   4 recv
26 idle                         0   8 RUNNING
27 udprpc                       0   6 wait: send to net/gen0
eliza@niles ~ $ pfexec humility -t gimlet-c ringbuf thermal
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
       8 FanControllerInitError(BusLocked)
       6 FanReadFailed
       6 FanAdded
       5 MiscReadFailed
       1 Start
       1 ThermalMode(Auto)
       1 AutoState(Boot)
 NDX LINE      GEN    COUNT PAYLOAD
   0  394        1        1 Start
   1  116        1        1 FanControllerInitError(BusLocked)
   2  180        1        1 ThermalMode(Auto)
   3  633        1        1 AutoState(Boot)
   4  116        1        1 FanControllerInitError(BusLocked)
   5  642        1        1 FanAdded(Fan(0x0))
   6  642        1        1 FanAdded(Fan(0x1))
   7  642        1        1 FanAdded(Fan(0x2))
   8  642        1        1 FanAdded(Fan(0x3))
   9  642        1        1 FanAdded(Fan(0x4))
  10  642        1        1 FanAdded(Fan(0x5))
  11  116        1        1 FanControllerInitError(BusLocked)
  12  675        1        1 FanReadFailed(SensorId(0x63), I2cError(BusLocked))
  13  116        1        1 FanControllerInitError(BusLocked)
  14  675        1        1 FanReadFailed(SensorId(0x64), I2cError(BusLocked))
  15  116        1        1 FanControllerInitError(BusLocked)
  16  675        1        1 FanReadFailed(SensorId(0x65), I2cError(BusLocked))
  17  116        1        1 FanControllerInitError(BusLocked)
  18  675        1        1 FanReadFailed(SensorId(0x66), I2cError(BusLocked))
  19  116        1        1 FanControllerInitError(BusLocked)
  20  675        1        1 FanReadFailed(SensorId(0x67), I2cError(BusLocked))
  21  116        1        1 FanControllerInitError(BusLocked)
  22  675        1        1 FanReadFailed(SensorId(0x68), I2cError(BusLocked))
  23  688        1        1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice))
  24  688        1        1 MiscReadFailed(SensorId(0x6e), I2cError(BusLocked))
  25  688        1        1 MiscReadFailed(SensorId(0x6c), I2cError(BusLocked))
  26  688        1        1 MiscReadFailed(SensorId(0x6d), I2cError(BusLocked))
  27  688        1        1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice))
eliza@niles ~ $
```

Note that:
- The `thermal` task has not crashed, even though there's no MAX31790 on
  the bus whatsoever
- Thermal control loop iterations have continued trying to poll the
  various sensors --- in this case, they *also* don't exist, but if they
  did, the sensor readings would be published successfully even though
  the thermal task isn't driving the fan controller.

Fixes #1779
Seemed nicer to have it be in the `control` module. I also added some
comments.
Copy link
Contributor

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had some questions out of curiosity.

@@ -26,6 +26,7 @@ pub enum ThermalError {
InvalidWatchdogTime = 7,
InvalidParameter = 8,
InvalidIndex = 9,
FanControllerUninitialized = 10,

This comment was marked as resolved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is for use with C-like error types, where we use 0 to represent "no error".

Copy link
Contributor

@aapoalas aapoalas May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry I was very unclear. I meant why keep defining = 2 etc. instead of letting Rust lay those out in numeric order after the = 1. Skipping 0 of course makes sense for the C-like error types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Yeah, we could just use the auto-increment numbering and it would be equivalent, I guess it's just a matter of taste. Personally, I kind of like the explicitness, you can look at the code and see which variant corresponds to a given number more easily without having to count variants yourself. But, I don't think it matters a whole lot either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, makes sense, especially if number-to-code reverse lookup is done at any regularity.

Thank you for humouring my curiosity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's something we expect to do super often, but it certainly could happen while e.g. debugging a crash dump. Although, I wasn't the one who originally decided to use explicit numbers here, so, I'm mostly just following the existing code for consistency. :)

}
match self.fan_control(4.into()) {
Ok(c) => fctrl(c),
Err(e) => last_err = Err(e),

This comment was marked as resolved.

Copy link
Member Author

@hawkw hawkw May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, my rationale here was mostly just that existing code in this module will return the last error in cases where both fan controllers return an error. In practice, if both fan controllers return an error, the errors will have been recorded elsewhere (in the I2C driver's ring buffer), so it's not totally thrown away...and the higher level code that calls this function ends up discarding the more detailed error information anyway. The caller of the thermal IPC interface is getting an error type that doesn't include details about the underlying I2C error condition regardless of what this function does.

@hawkw hawkw merged commit 0aa6370 into master May 29, 2024
104 checks passed
@hawkw hawkw deleted the eliza/fanctrl-retry branch May 29, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic from fctrl.initialize().unwrap_lite(); in thermal task
3 participants