-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
thermal: don't panic if MAX31790 init fails #1798
Conversation
Currently, the `thermal` task tries to initialize the MAX31790 fan controller(s) once when the task starts, and `unwrap_lite`s the result. This means that in the face of a transient I2C error, the whole task panics and gets restarted. This is unfortunate: sometimes, the bus is briefly locked or arbitration is lost or something, and then everything is fine. Furthermore, if we can't talk to the MAX31790, its watchdog will trip and it'll just run the fans at 100%...and would probably prefer to just let the `thermal` task keep running and publishing sensor readings, rather than crashlooping. This commit changes the `thermal` task to track whether it has initialized the MAX31790(s), and if it hasn't yet, any operation which will attempt to talk to the MAX31790(s) will first try to initialize the fan controller. IPC operations which require an initialized fan controller will return an error to the caller if it can't be initialized, but the `thermal` task will stay running, reading sensor data, and continue trying to initialize the fan controller every time the thermal control loop runs. This way, we don't have crash loops and can tolerate temporary I2C jank. For example, here's what happens when I flash the `gimlet-c` firmware on a board that doesn't have a `rear` I2C bus: ```console eliza@niles ~ $ pfexec humility -t gimlet-c tasks humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3 system time = 8810 ID TASK GEN PRI STATE 0 jefe 0 0 recv, notif: fault timer(T+90) 1 net 0 5 notif: jefe-state-change 2 sys 0 1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40) 3 spi2_driver 0 3 recv(gimlet_seq/gen0 only) 4 i2c_driver 0 3 recv 5 spd 0 2 notif: jefe-state-change 6 packrat 0 1 recv 7 thermal 0 5 wait: send to gimlet_seq/gen0 8 power 0 6 wait: send to gimlet_seq/gen0 9 hiffy 0 5 notif: bit31(T+232) 10 gimlet_seq 0 4 notif: bit31(T+2) 11 gimlet_inspector 0 6 wait: send to net/gen0 12 hash_driver 0 2 recv 13 hf 0 3 notif: bit31(T+218) 14 update_server 0 3 recv 15 sensor 0 4 recv 16 host_sp_comms 0 7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent 17 udpecho 0 6 wait: send to net/gen0 18 udpbroadcast 0 6 wait: send to net/gen0 19 control_plane_agent 0 6 wait: send to net/gen0 20 sprot 0 4 recv 21 validate 0 5 recv 22 vpd 0 4 recv 23 user_leds 0 2 recv, notif: timer 24 dump_agent 0 6 wait: send to net/gen0 25 sbrmi 0 4 recv 26 idle 0 8 RUNNING 27 udprpc 0 6 wait: send to net/gen0 eliza@niles ~ $ pfexec humility -t gimlet-c ringbuf thermal humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3 humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal: humility: ring buffer task_thermal::__RINGBUF in thermal: TOTAL VARIANT 8 FanControllerInitError(BusLocked) 6 FanReadFailed 6 FanAdded 5 MiscReadFailed 1 Start 1 ThermalMode(Auto) 1 AutoState(Boot) NDX LINE GEN COUNT PAYLOAD 0 394 1 1 Start 1 116 1 1 FanControllerInitError(BusLocked) 2 180 1 1 ThermalMode(Auto) 3 633 1 1 AutoState(Boot) 4 116 1 1 FanControllerInitError(BusLocked) 5 642 1 1 FanAdded(Fan(0x0)) 6 642 1 1 FanAdded(Fan(0x1)) 7 642 1 1 FanAdded(Fan(0x2)) 8 642 1 1 FanAdded(Fan(0x3)) 9 642 1 1 FanAdded(Fan(0x4)) 10 642 1 1 FanAdded(Fan(0x5)) 11 116 1 1 FanControllerInitError(BusLocked) 12 675 1 1 FanReadFailed(SensorId(0x63), I2cError(BusLocked)) 13 116 1 1 FanControllerInitError(BusLocked) 14 675 1 1 FanReadFailed(SensorId(0x64), I2cError(BusLocked)) 15 116 1 1 FanControllerInitError(BusLocked) 16 675 1 1 FanReadFailed(SensorId(0x65), I2cError(BusLocked)) 17 116 1 1 FanControllerInitError(BusLocked) 18 675 1 1 FanReadFailed(SensorId(0x66), I2cError(BusLocked)) 19 116 1 1 FanControllerInitError(BusLocked) 20 675 1 1 FanReadFailed(SensorId(0x67), I2cError(BusLocked)) 21 116 1 1 FanControllerInitError(BusLocked) 22 675 1 1 FanReadFailed(SensorId(0x68), I2cError(BusLocked)) 23 688 1 1 MiscReadFailed(SensorId(0x2), I2cError(NoDevice)) 24 688 1 1 MiscReadFailed(SensorId(0x6e), I2cError(BusLocked)) 25 688 1 1 MiscReadFailed(SensorId(0x6c), I2cError(BusLocked)) 26 688 1 1 MiscReadFailed(SensorId(0x6d), I2cError(BusLocked)) 27 688 1 1 MiscReadFailed(SensorId(0x1), I2cError(NoDevice)) eliza@niles ~ $ ``` Note that: - The `thermal` task has not crashed, even though there's no MAX31790 on the bus whatsoever - Thermal control loop iterations have continued trying to poll the various sensors --- in this case, they *also* don't exist, but if they did, the sensor readings would be published successfully even though the thermal task isn't driving the fan controller. Fixes #1779
Seemed nicer to have it be in the `control` module. I also added some comments.
152c822
to
847f170
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had some questions out of curiosity.
@@ -26,6 +26,7 @@ pub enum ThermalError { | |||
InvalidWatchdogTime = 7, | |||
InvalidParameter = 8, | |||
InvalidIndex = 9, | |||
FanControllerUninitialized = 10, |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is for use with C-like error types, where we use 0 to represent "no error".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry I was very unclear. I meant why keep defining = 2
etc. instead of letting Rust lay those out in numeric order after the = 1
. Skipping 0 of course makes sense for the C-like error types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Yeah, we could just use the auto-increment numbering and it would be equivalent, I guess it's just a matter of taste. Personally, I kind of like the explicitness, you can look at the code and see which variant corresponds to a given number more easily without having to count variants yourself. But, I don't think it matters a whole lot either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, makes sense, especially if number-to-code reverse lookup is done at any regularity.
Thank you for humouring my curiosity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's something we expect to do super often, but it certainly could happen while e.g. debugging a crash dump. Although, I wasn't the one who originally decided to use explicit numbers here, so, I'm mostly just following the existing code for consistency. :)
} | ||
match self.fan_control(4.into()) { | ||
Ok(c) => fctrl(c), | ||
Err(e) => last_err = Err(e), |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, my rationale here was mostly just that existing code in this module will return the last error in cases where both fan controllers return an error. In practice, if both fan controllers return an error, the errors will have been recorded elsewhere (in the I2C driver's ring buffer), so it's not totally thrown away...and the higher level code that calls this function ends up discarding the more detailed error information anyway. The caller of the thermal
IPC interface is getting an error type that doesn't include details about the underlying I2C error condition regardless of what this function does.
Currently, the
thermal
task tries to initialize the MAX31790 fancontroller(s) once when the task starts, and
unwrap_lite
s the result.This means that in the face of a transient I2C error, the whole task
panics and gets restarted. This is unfortunate: sometimes, the bus is
briefly locked or arbitration is lost or something, and then everything
is fine. Furthermore, if we can't talk to the MAX31790, its watchdog
will trip and it'll just run the fans at 100%...and would probably
prefer to just let the
thermal
task keep running and publishing sensorreadings, rather than crashlooping.
This commit changes the
thermal
task to track whether it hasinitialized the MAX31790(s), and if it hasn't yet, any operation which
will attempt to talk to the MAX31790(s) will first try to initialize the
fan controller. IPC operations which require an initialized fan
controller will return an error to the caller if it can't be
fan controller. IPC operations which require an initialized fan
controller will return an error to the caller if it can't be
initialized, but the
thermal
task will stay running, reading sensordata, and continue trying to initialize the fan controller every time
the thermal control loop runs. This way, we don't have crash loops and
can tolerate temporary I2C jank.
For example, here's what happens when I flash the
gimlet-c
firmware ona board that doesn't have a
rear
I2C bus:Note that:
thermal
task has not crashed, even though there's no MAX31790 onthe bus whatsoever
various sensors --- in this case, they also don't exist, but if they
did, the sensor readings would be published successfully even though
the thermal task isn't driving the fan controller.
Fixes #1779