Thermal loop doesn't distinguish between "real" and "fake" temperatures #2385

@nathanaelhuffman

➜  ~ humility -d 1001-sled26_hubris.core.0 ringbuf thermal
humility: attached to dump
humility: ring buffer drv_i2c_devices::emc2305::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
    7545 ControlPwm
      57 SensorReadFailed
       3 AutoState(Boot)
       2 AutoState(Running)
       1 AutoState(Overheated)
       1 AutoState(Uncontrollable)
       6 FanAdded
       3 MiscReadFailed
       2 PowerModeChanged
       1 Start
       1 ThermalMode(Auto)
       1 PowerDownDueTo
       1 CriticalDueTo
       1 FanControllerInitialized
       1 PowerDownAt
       1 SetFanWatchdogOk
 NDX LINE      GEN    COUNT PAYLOAD
  30  895        4        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  31 1206        4        1 ControlPwm(0x57)
   0  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
   1 1112        5        1 CriticalDueTo { sensor_id: SensorId(0x38), temperature: Celsius(80.3285) }
   2 1120        5        1 AutoState(Overheated)
   3 1206        5        1 ControlPwm(0x64)
   4  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
   5 1206        5        1 ControlPwm(0x64)
   6  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
   7 1206        5        1 ControlPwm(0x64)
   8  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
   9 1206        5        1 ControlPwm(0x64)
  10  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  11 1206        5        1 ControlPwm(0x64)
  12  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  13 1206        5        1 ControlPwm(0x64)
  14  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  15 1206        5        1 ControlPwm(0x64)
  16  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  17 1206        5        1 ControlPwm(0x64)
  18  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  19 1206        5        1 ControlPwm(0x64)
  20  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  21 1206        5        1 ControlPwm(0x64)
  22  895        5        1 SensorReadFailed(SensorId(0x38), I2cError(NoDevice))
  23 1164        5        1 PowerDownDueTo { sensor_id: SensorId(0x38), temperature: Celsius(85.351) }
  24 1169        5        1 AutoState(Uncontrollable)
  25 1210        5        1 PowerDownAt(0x72435)
  26  964        5        1 PowerModeChanged(PowerBitmask(0b1))
  27  788        5        1 AutoState(Boot)
  28 1065        5        1 AutoState(Running)
  29 1206        5     7085 ControlPwm(0x0)

While debugging #2384, we noticed that the temperatures reported here may actually be artificially inflated due to this logic

but nothing in the ring buffer output indicates that a reported value might not be a real, measured temperature, which can be extremely confusing. We should provide some way of marking a temperature as synthesized in this case.
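
One hypothetical way to make the distinction visible is to carry a source tag alongside every reported temperature. The sketch below is not the thermal task's actual code: `TemperatureSource`, `ReportedTemperature`, `report`, and the inflation rule (last good reading plus a fixed slew rate per second since that reading) are all assumptions made up for illustration.

```rust
/// Where a reported temperature came from (hypothetical type).
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum TemperatureSource {
    /// Read directly from the sensor on this control-loop iteration.
    Measured,
    /// Estimated because the sensor could not be read.
    Synthesized { secs_since_last_read: u32 },
}

/// A temperature together with its provenance (hypothetical type).
#[derive(Copy, Clone, Debug, PartialEq)]
pub struct ReportedTemperature {
    pub celsius: f32,
    pub source: TemperatureSource,
}

/// Builds the value to report for one sensor.
///
/// ASSUMPTION: when a read fails, the estimate is the last good reading
/// plus `slew_deg_per_sec` for every second since that reading. The real
/// logic referenced in this issue may differ.
pub fn report(
    reading: Option<f32>,
    last_good_celsius: f32,
    secs_since_last_read: u32,
    slew_deg_per_sec: f32,
) -> ReportedTemperature {
    match reading {
        // A successful read is passed through and tagged as measured.
        Some(celsius) => ReportedTemperature {
            celsius,
            source: TemperatureSource::Measured,
        },
        // A failed read yields an inflated estimate, tagged as synthesized.
        None => ReportedTemperature {
            celsius: last_good_celsius
                + slew_deg_per_sec * secs_since_last_read as f32,
            source: TemperatureSource::Synthesized { secs_since_last_read },
        },
    }
}
```

With something along these lines, entries such as `CriticalDueTo` and `PowerDownDueTo` could record the source next to the `Celsius` value, so a reader of the ring buffer could tell at a glance whether a threshold was crossed by a measurement or by an estimate.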

Labels

fault-management: Everything related to the Oxide's Fault Management architecture implementation
product: Required for the product to be generally useful
