Skip to content

Add ereports for host panic/boot failure#2503

Open
jamesmunns wants to merge 6 commits into
masterfrom
james/host-panic-ereport
Open

Add ereports for host panic/boot failure#2503
jamesmunns wants to merge 6 commits into
masterfrom
james/host-panic-ereport

Conversation

@jamesmunns
Copy link
Copy Markdown
Contributor

This PR adds an ereport for a host panic, as well as boot failures. This PR does not currently implement the interface necessary to retrieve the panic message, which I will do in a follow-up commit/PR.

For @rmustacc, I've chosen the ereport class hw.host.panic and hw.host.btfail. Open to suggestions on this. Also worth noting, we will truncate any Host Panics or Boot Failures that are each larger than MAX_HOST_FAIL_MESSAGE_LEN, which is currently 4KiB. I will probably need to figure out how to paginate this for access in a follow up issue, but I wanted to make sure that 4KiB limit wasn't concerning to you (either not enough, or excessive).

For @hawkw, is there a guideline for whether declare_ereporter should be done in the relevant task, or in a central place like lib/ereports? The current hw.cpu ereports are there, but drv/xxx-seq-servers do it in their own crate (like this PR). I can move this to lib/ereports if that's the preferred approach.

I intend to include the ttl_ct/panic_len fields in the host comms API, either having the host include it in the "get" request, or just including it in the send, to ensure that we are actually retrieving the panic expected. We might also crosscheck the SP's boot count, to make sure we aren't getting any extra-stale requests. I'll note that in the follow-up PR.

Closes #2140.
Related to #2337.

@jamesmunns jamesmunns requested review from hawkw and rmustacc May 8, 2026 12:40
@hawkw
Copy link
Copy Markdown
Member

hawkw commented May 8, 2026

@jamesmunns:

For @rmustacc, I've chosen the ereport class hw.host.panic and hw.host.btfail.

The hw.* hierarchy is for ereports related to hardware components (hw stands for HardWare). In general we try to abide by the convention that all hw.* ereports have a refdes --- I think I mentioned this in the watercooler a bit ago? In this case, I don't think that the host boot fail and panic ereports should be under hw.* since they are not indicating an event related to a particular hardware component, but an event related to upstack software (the host OS). IMO, we should probably just define a new top-level host.* hierarchy for these ereports, but perhaps Robert will have an opinion.

For @hawkw, is there a guideline for whether declare_ereporter should be done in the relevant task, or in a central place like lib/ereports? The current hw.cpu ereports are there, but drv/xxx-seq-servers do it in their own crate (like this PR). I can move this to lib/ereports if that's the preferred approach.

I would only put ereport type definitions in lib/ereports when the same ereport type is reported by multiple tasks. The hw.cpu ereports are defined in lib/ereports because both the Gimlet and Cosmo sequencer tasks will report some of the same ereport types (but with different type parameters in some cases, which you can see if you look at their declare_ereporter! invocations). As for the declare_ereporter! macro, it should always be invoked in a task, and not in a library crate. That macro is responsible for creating some static allocations for the ereport buffer and ereport debugging counters, so if you invoke it in a lib crate, that will exist in every task that depends on that lib. And, it's supposed to be invoked with the complete list of ereport types that that task will want to be able to report. For example, if you look at different tasks that report different sets of ereports, you can see that they invoke declare_ereporter! with a different list of ereport message types:

ereports::declare_ereporter! {
pub(crate) struct Ereporter<SeqEreport> {
PmbusAlert(
ereports::pwr::PmbusAlert<
FixedStr<'static, 9>,
{ REFDES_LEN },
>,
),
Bmr491MitigationFailure(
ereports::pwr::Bmr491MitigationFailure<{ REFDES_LEN }>
),
Thermtrip(ereports::cpu::Thermtrip),
UnsupportedCpu(ereports::cpu::UnsupportedCpu<1, 2>),
CpuMissing(ereports::cpu::CpuMissing),
}

ereports::declare_ereporter! {
struct Ereporter<Ereport> {
PsuInserted(PsuInsertedEreport),
PsuRemoved(PsuRemovedEreport),
PowerGood(PowerGoodEreport),
PowerUngood(PowerUngoodEreport)
}
}

If the intended use of the macro wasn't clear in its docs, I should probably go improve that!

Copy link
Copy Markdown
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions about the ereport messages and data contained therein.

I also wonder somewhat if we might want to try to include which BSU (host boot slot) we were booting from in the boot fail (and also panic) messages. See

// Per RFD 241, the phase 1 device (which we can read via
// `hf`) is tightly bound to the BSU, so we can map flash0 to
// BSU A and flash1 to BSU B.
//
// What should we do if we fail to get the device from the host
// flash task? That should only happen if `hf` is unable to
// respond to us at all, which makes it seem unlikely that the
// host could even be up. We'll default to returning Bsu::A.
let bsu = match self.hf.get_dev() {
Ok(HfDevSelect::Flash0) | Err(_) => Bsu::A,
Ok(HfDevSelect::Flash1) => Bsu::B,
};
Some(SpToHost::BootStorageUnit(bsu))
for an example of getting it from the host-flash task. I'm not sure whether we expect the host to include this state in the panic messages someplace, but I don't think it will...

Comment thread task/host-sp-comms/src/main.rs
Comment thread task/host-sp-comms/src/main.rs Outdated
Comment thread task/host-sp-comms/src/main.rs Outdated
Comment thread task/host-sp-comms/src/main.rs Outdated
Comment thread task/host-sp-comms/src/main.rs Outdated
Comment thread task/host-sp-comms/src/main.rs Outdated
Comment thread task/host-sp-comms/src/main.rs
@jamesmunns jamesmunns changed the title Add ereports for host panic/boot failure. Add ereports for host panic/boot failure May 14, 2026
@jamesmunns
Copy link
Copy Markdown
Contributor Author

@hawkw should be ready for re-review now! I think I've addressed all comments.

@jamesmunns jamesmunns force-pushed the james/host-panic-ereport branch from 07e7912 to 5e68f88 Compare May 18, 2026 11:19
@jamesmunns
Copy link
Copy Markdown
Contributor Author

@hawkw CI is passing now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ereport: host panicked

2 participants