Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add humility counters, show counts in humility ringbuf #449

Merged
merged 25 commits into from
Feb 29, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Feb 22, 2024

Hubris PR oxidecomputer/hubris#1621 added the ability for ring buffers
whose entry type is an enum to generate a set of counters that track
the total number of times each entry enum variant has been added to the
ring buffer. PR oxidecomputer/hubris#1630 generalizes this to also allow
counters to be generated for any enum without requiring a
corresponding ring buffer.

These counters are designed to be consumed by Humility. This PR updates
Humility to understand these counters. In particular, it:

  • Changes the humility ringbuf command to understand CountedRingbufs
    as well as un-counted Ringbufs
  • When displaying a CountedRingbuf, the humility ringbuf command
    will now display the total counts for that ringbuf's entry variants
    (unless explicitly asked not to)
  • Adds a new humility counters subcommand for displaying all counters,
    whether or not they are attached to a ringbuf.

The humility counters command behaves similarly to humility ringbuf,
and allows filtering which counters are displayed based on the name of
the variable and task, and can also be used to list all counters in the
attached Hubris instance or dump. Both commands support a CLI flag to
display counters with zero values.

Examples

Listing all counters in a dump:
$ humility -d hubris.c12.bufcounts5.core counters --list
humility: attached to dump
control_plane_agent:
    ADDR        SIZE VARIABLE
    0x2402b938   664 task_control_plane_agent::CRITICAL
    0x24029000  1272 task_control_plane_agent::__RINGBUF
gimlet_inspector:
    ADDR        SIZE VARIABLE
    0x2404ce40    12 task_gimlet_inspector::__COUNTERS
gimlet_seq:
    ADDR        SIZE VARIABLE
    0x2403f424    48 drv_gimlet_seq_server::IPC_REQUESTS
    0x2403e640  3448 drv_gimlet_seq_server::__RINGBUF
thermal:
    ADDR        SIZE VARIABLE
    0x240039e0   912 task_thermal::__RINGBUF
Displaying all counters in a dump:
$ humility -d hubris.c12.bufcounts5.core counters
humility: attached to dump
control_plane_agent
 |
 +---> task_control_plane_agent::CRITICAL:
 |     <no counts recorded>
 +---> task_control_plane_agent::__RINGBUF:
       <no counts recorded>
gimlet_inspector
 |
 +---> task_gimlet_inspector::__COUNTERS:
       <no counts recorded>
gimlet_seq
 |
 +---> drv_gimlet_seq_server::IPC_REQUESTS:
 |       411 GetState
 |         1 SetState(A0)
 +---> drv_gimlet_seq_server::__RINGBUF:
        2068 Status
         446 ClockConfigWrite
         295 A0Status
         102 A1Status
          15 A0Power
           2 V3P3SysA0VOut
           1 SMStatus
           1 SetState
           1 RailsOff
           1 Ident
           1 A2Status
           1 A2
           1 SpdDimmsFound
           1 Ice40PowerGoodV1P2
           1 IdentValid
           1 CPUPresent
           1 Coretype
           1 Programmed
           1 Ice40Rails
           1 InterruptFlags
           1 RailsOn
           1 UartEnabled
           1 Ice40PowerGoodV3P3
           1 UpdateState(A0)
           1 A0
           1 Reprogram
           1 ClockConfigSuccess
           1 ChecksumValid
           1 PGStatus
           1 Programming
           1 PowerControl
thermal
 |
 +---> task_thermal::__RINGBUF:
         138 ControlPwm
           6 FanAdded
           3 AutoState(Boot)
           2 AutoState(Running)
           2 PowerModeChanged
           1 Start
           1 ThermalMode(Auto)
Filtering by task:
$ humility -d hubris.c12.bufcounts5.core counters thermal
humility: attached to dump
thermal
 |
 +---> task_thermal::__RINGBUF:
         138 ControlPwm
           6 FanAdded
           3 AutoState(Boot)
           2 AutoState(Running)
           2 PowerModeChanged
           1 Start
           1 ThermalMode(Auto)
Sorting counters by their value:
$ humility -d hubris.c12.bufcounts5.core counters thermal --sort value
humility: attached to dump
thermal
 |
 +---> task_thermal::__RINGBUF:
         138 ControlPwm
           6 FanAdded
           3 AutoState(Boot)
           2 AutoState(Running)
           2 PowerModeChanged
           1 Start
           1 ThermalMode(Auto)
Sorting counters alphabetically:
$ humility -d hubris.c12.bufcounts5.core counters thermal --sort alpha
humility: attached to dump
thermal
 |
 +---> task_thermal::__RINGBUF:
           3 AutoState(Boot)
           2 AutoState(Running)
         138 ControlPwm
           6 FanAdded
           2 PowerModeChanged
           1 Start
           1 ThermalMode(Auto)
Displaying all counters, including those with zero values:
$ humility -d hubris.c12.bufcounts5.core counters gimlet_seq --full
humility: attached to dump
gimlet_seq
 |
 +---> drv_gimlet_seq_server::IPC_REQUESTS:
 |       411 GetState
 |         1 SetState(_)
 |         0 |     SetState(A2)
 |         0 |     SetState(A2PlusFans)
 |         0 |     SetState(A1)
 |         1 +---> SetState(A0)
 |         0 |     SetState(A0PlusHP)
 |         0 |     SetState(A0Thermtrip)
 |         0 |     SetState(A0Reset)
 |         0 FansOn
 |         0 FansOff
 |         0 SendHardwareNmi
 |         0 ReadFpgaRegs
 +---> drv_gimlet_seq_server::__RINGBUF:
           1 Ice40Rails
           1 IdentValid
           1 ChecksumValid
           1 Reprogram
           1 Programmed
           1 Programming
           1 Ice40PowerGoodV1P2
           1 Ice40PowerGoodV3P3
           1 RailsOff
           1 Ident
           1 A2Status
           1 A2
           0 A0FailureDetails
           0 A0Failed(_)
           0 |     A0Failed(IllegalTransition)
           0 |     A0Failed(MuxToHostCPUFailed)
           0 |     A0Failed(MuxToSPFailed)
           0 |     A0Failed(ReadRegsFailed)
           0 |     A0Failed(CPUNotPresent)
           0 |     A0Failed(UnrecognizedCPU)
           0 |     A0Failed(A1Timeout)
           0 |     A0Failed(A0TimeoutGroupC)
           0 |     A0Failed(A0Timeout)
           0 |     A0Failed(I2cFault)
           0 |     A0Failed(ServerRestarted)
         102 A1Status
           1 CPUPresent
           1 Coretype
         295 A0Status
          15 A0Power
           0 NICPowerEnableLow
           1 RailsOn
           1 UartEnabled
           1 A0
           1 SetState
           1 UpdateState(_)
           0 |     UpdateState(A2)
           0 |     UpdateState(A2PlusFans)
           0 |     UpdateState(A1)
           1 +---> UpdateState(A0)
           0 |     UpdateState(A0PlusHP)
           0 |     UpdateState(A0Thermtrip)
           0 |     UpdateState(A0Reset)
         446 ClockConfigWrite
           1 ClockConfigSuccess
        2068 Status
           1 PGStatus
           1 SMStatus
           0 ResetCounts
           1 PowerControl
           1 InterruptFlags
           2 V3P3SysA0VOut
           0 SpdBankAbsent
           0 SpdAbsent
           1 SpdDimmsFound
           0 I2cFault(_)
           0 |     I2cFault(BadResponse)
           0 |     I2cFault(BadArg)
           0 |     I2cFault(NoDevice)
           0 |     I2cFault(BadController)
           0 |     I2cFault(ReservedAddress)
           0 |     I2cFault(BadPort)
           0 |     I2cFault(NoRegister)
           0 |     I2cFault(BadMux)
           0 |     I2cFault(BadSegment)
           0 |     I2cFault(MuxNotFound)
           0 |     I2cFault(SegmentNotFound)
           0 |     I2cFault(SegmentDisconnected)
           0 |     I2cFault(MuxDisconnected)
           0 |     I2cFault(MuxMissing)
           0 |     I2cFault(BadMuxRegister)
           0 |     I2cFault(BusReset)
           0 |     I2cFault(BusResetMux)
           0 |     I2cFault(BusLocked)
           0 |     I2cFault(BusLockedMux)
           0 |     I2cFault(ControllerBusy)
           0 |     I2cFault(BusError)
           0 |     I2cFault(BadDeviceState)
           0 |     I2cFault(OperationNotSupported)
           0 |     I2cFault(IllegalLeaseCount)
           0 |     I2cFault(TooMuchData)
           0 StartFailed(_)
           0 |     StartFailed(IllegalTransition)
           0 |     StartFailed(MuxToHostCPUFailed)
           0 |     StartFailed(MuxToSPFailed)
           0 |     StartFailed(ReadRegsFailed)
           0 |     StartFailed(CPUNotPresent)
           0 |     StartFailed(UnrecognizedCPU)
           0 |     StartFailed(A1Timeout)
           0 |     StartFailed(A0TimeoutGroupC)
           0 |     StartFailed(A0Timeout)
           0 |     StartFailed(I2cFault)
           0 |     StartFailed(ServerRestarted)
Displaying variant total counts when running `humility ringbuf`:
$ humility -d hubris.c12.bufcounts5.core ringbuf thermal
humility: attached to dump
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
     138 ControlPwm
       6 FanAdded
       3 AutoState(Boot)
       2 AutoState(Running)
       2 PowerModeChanged
       1 Start
       1 ThermalMode(Auto)
 NDX LINE      GEN    COUNT PAYLOAD
   0  346        1        1 Start
   1  133        1        1 ThermalMode(Auto)
   2  625        1        1 AutoState(Boot)
   3  634        1        1 FanAdded(Fan(0x0))
   4  634        1        1 FanAdded(Fan(0x1))
   5  634        1        1 FanAdded(Fan(0x2))
   6  634        1        1 FanAdded(Fan(0x3))
   7  634        1        1 FanAdded(Fan(0x4))
   8  634        1        1 FanAdded(Fan(0x5))
   9  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x1)))
  10  625        1        1 AutoState(Boot)
  11  876        1        1 AutoState(Running)
  12  990        1      117 ControlPwm(0x0)
  13  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x2)))
  14  625        1        1 AutoState(Boot)
  15  876        1        1 AutoState(Running)
  16  990        1       21 ControlPwm(0x0)
Displaying totals with zero values when running `humility ringbuf`:
$ humility -d hubris.c12.bufcounts5.core ringbuf thermal --full-totals
humility: attached to dump
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
   TOTAL VARIANT
       1 Start
       1 ThermalMode(_)
       0 |     ThermalMode(Off)
       0 |     ThermalMode(Manual)
       1 +---> ThermalMode(Auto)
       5 AutoState(_)
       3 +---> AutoState(Boot)
       2 +---> AutoState(Running)
       0 |     AutoState(Overheated)
       0 |     AutoState(Uncontrollable)
       0 FanReadFailed
       0 MiscReadFailed
       0 SensorReadFailed
       0 PostFailed
     138 ControlPwm
       2 PowerModeChanged
       0 PowerDownFailed
       0 ControlError(_)
       0 |     ControlError(InvalidFan)
       0 |     ControlError(InvalidPWM)
       0 |     ControlError(DeviceError)
       0 |     ControlError(NotInManualMode)
       0 |     ControlError(NotInAutoMode)
       0 |     ControlError(AlreadyInAutoMode)
       0 |     ControlError(InvalidWatchdogTime)
       0 |     ControlError(InvalidParameter)
       0 |     ControlError(InvalidIndex)
       0 |     ControlError(ServerDeath)
       0 FanPresenceUpdateFailed
       6 FanAdded
       0 FanRemoved
       0 PowerDownAt
       0 AddedDynamicInput
       0 RemovedDynamicInput
 NDX LINE      GEN    COUNT PAYLOAD
   0  346        1        1 Start
   1  133        1        1 ThermalMode(Auto)
   2  625        1        1 AutoState(Boot)
   3  634        1        1 FanAdded(Fan(0x0))
   4  634        1        1 FanAdded(Fan(0x1))
   5  634        1        1 FanAdded(Fan(0x2))
   6  634        1        1 FanAdded(Fan(0x3))
   7  634        1        1 FanAdded(Fan(0x4))
   8  634        1        1 FanAdded(Fan(0x5))
   9  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x1)))
  10  625        1        1 AutoState(Boot)
  11  876        1        1 AutoState(Running)
  12  990        1      117 ControlPwm(0x0)
  13  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x2)))
  14  625        1        1 AutoState(Boot)
  15  876        1        1 AutoState(Running)
  16  990        1       21 ControlPwm(0x0)
Opting out of totals when displaying a counted ringbuf:
$ humility -d hubris.c12.bufcounts5.core ringbuf thermal --no-totals
humility: attached to dump
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer task_thermal::__RINGBUF in thermal:
 NDX LINE      GEN    COUNT PAYLOAD
   0  346        1        1 Start
   1  133        1        1 ThermalMode(Auto)
   2  625        1        1 AutoState(Boot)
   3  634        1        1 FanAdded(Fan(0x0))
   4  634        1        1 FanAdded(Fan(0x1))
   5  634        1        1 FanAdded(Fan(0x2))
   6  634        1        1 FanAdded(Fan(0x3))
   7  634        1        1 FanAdded(Fan(0x4))
   8  634        1        1 FanAdded(Fan(0x5))
   9  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x1)))
  10  625        1        1 AutoState(Boot)
  11  876        1        1 AutoState(Running)
  12  990        1      117 ControlPwm(0x0)
  13  778        1        1 PowerModeChanged(PowerBitmask(InternalBitFlags(0x2)))
  14  625        1        1 AutoState(Boot)
  15  876        1        1 AutoState(Running)
  16  990        1       21 ControlPwm(0x0)

hawkw added a commit to oxidecomputer/hubris that referenced this pull request Feb 26, 2024
Ringbufs provide a fixed-size buffer for storing diagnostic events,
which may include data. This is very useful for debugging. However,
because the ring buffer has a fixed size, only the last `N` events are
visible at any given time. If a ringbuf has a capacity of, say, 16
events, and a given event has occurred 10,000 times over the lifespan of
the task, there's currently no way to know that the event has happened
more than 16 times. Furthermore, there's no way to use the ringbuf to
determine whether a given event has _ever_ occurred during the task's
lifetime: if a given variant is not _currently_ present in the ringbuf,
it may still have occurred previously and been "drowned out" by
subsequent events. Finally, some targets disable the ringbuf entirely,
for space reasons, and on those targets, it may still be desirable to be
able to get some less granular diagnostic data out of the ringbuf events
a task records.

Therefore, this commit extends the `ringbuf` crate to support generating
counts of the number of times a particular event variant has been
recorded, in addition to a fixed-size buffer of events. These counters
provide less detailed data than inspecting the ringbuf contents, because
they don't record the values of any _fields_ on that event. However,
they allow recording very large numbers of events in a fixed amount of
space, and provide historical data for events that may have occurred in
the past and then "fallen off" the end of the ringbuf as new events were
recorded. By inspecting these counters, we can determine if a given
variant has *ever* been recorded, even if it isn't currently in the
buffer, and we can see the total number of times it has been recorded
over the task's entire lifetime.

Event counters are implemented using a new `CountedRingbuf` type, which
marries a ring buffer with a type which can count occurances of entry
variants. `CountedRingbuf`s may be declared using the `counted_ringbuf!`
macro, and entries can be recorded using the existing `ringbuf_entry!`
macro.

`CountedRingbuf` requires that the entry type implement a new `Count`
trait, which defines the counter type, an initializer for creating new
instances of the counter type, and a method to increment the counter
type with a given instance of the entry type. An implementation of
`ringbuf::Count` can be generated for an entry type using the
`#[derive(ringbuf::Count)]` attribute, which generates a `struct` with
an `AtomicU32` field for each of the deriving `enum`'s variants. This
`struct` can then be loaded by Humility to provide a view of the event
counters.

Because recording event counters requires only `4 * <N_VARIANTS>` bytes,
counters are currently recorded even when the `ringbuf` crate has the
`disabled` feature set. This way, we can still record some diagnostic
data on targets that don't have the space to store the ringbuf itself.
If it becomes necessary to also disable counters, we could add a
separate `disable-counters` feature as well, so that a target can pick
and choose between recording ringbuf entries, counts, or both.

As a proof of concept, I've also updated `gimlet-seq-server` to use the
new event counters for its `Trace` ringbuf. Subsequent commits will roll
it out elsewhere.

Future work will include:

- [ ] Actually adopting event counters in existing ringbufs
- [ ] Updating `humility ringbuf` to also dump ringbuf event counts (see
  oxidecomputer/humility#449).

Depends on #1624, both to reduce merge conflicts and to mitigate an
increase in size due to the changes in this branch.
This way, we can add a separate `humility counters` subcommand for
printing counters that may not be attached to a ringbuf.
This adds a command for displaying all counters, regardless of whether
or not they are associated with a ringbuf. Given that we now have this
command, I've removed the `-t`/`--totals-only` flag from `humility
ringbuf`, as it no longer seems necessary
@hawkw hawkw changed the title WIP: support event counters in humility ringbuf add humility counters, show counts in humility ringbuf Feb 28, 2024
@hawkw hawkw marked this pull request as ready for review February 28, 2024 20:00
@hawkw hawkw requested review from cbiffle, mkeeter and bcantrill and removed request for cbiffle February 28, 2024 20:00
README.md Show resolved Hide resolved
cmd/counters/src/lib.rs Outdated Show resolved Hide resolved
@hawkw hawkw requested a review from mkeeter February 28, 2024 20:59
Copy link
Contributor

@bcantrill bcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great -- please take a dump from a Gimlet system (or another that has counters) and add it to the test suite. Also, please add a counters test to the test suite -- and then be sure to bump the version number! (And considering that this is a breaking change in that new Hubris needs new Humility to properly display some ringbufs, I think we should bump the minor and not just the patch -- taking us to 0.11.0.)

@hawkw
Copy link
Member Author

hawkw commented Feb 29, 2024

This looks great -- please take a dump from a Gimlet system (or another that has counters) and add it to the test suite. Also, please add a counters test to the test suite

Ah, thanks — I had totally missed that we had a test suite with representative dumps, very nice! I’ll add some tests shortly.

Copy link
Contributor

@bcantrill bcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thanks for adding the tests!

@hawkw hawkw merged commit f18e452 into master Feb 29, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants