
Metrics for crucible disks #1073

Merged: leftwo merged 3 commits into master from alan/cru-wants-oximeter, Mar 31, 2026
Conversation

@leftwo (Contributor) commented Mar 9, 2026

Crucible disk metrics were missing because set_metric_consumer was called on a DeviceAttachment before Crucible's queues were associated with it. The original implementation only walked the already-associated queues and handed the consumer to each one's QueueMinder. Any queue that arrived late simply never received it.

The fix stores the MetricConsumer in QueueColState (the collection-level state guarded by the collection's mutex), then in queue_associate propagates it to any newly-arriving queue's minder before the minder is installed into the slot. This mirrors exactly how the paused flag is already propagated to late-associating queues.

A minor correctness fix was also included: as_mut() -> as_ref() on the minder Option, since set_metric_consumer takes &self.
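The propagation described above could look roughly like this. This is a minimal self-contained sketch, not Propolis's actual code: the type and field definitions here are simplified stand-ins for the QueueColState/QueueMinder/MetricConsumer names the PR mentions.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical simplified stand-ins for the types named in this PR.
trait MetricConsumer: Send + Sync {}

struct QueueMinder {
    consumer: Option<Arc<dyn MetricConsumer>>,
}

impl QueueMinder {
    fn set_metric_consumer(&mut self, c: Arc<dyn MetricConsumer>) {
        self.consumer = Some(c);
    }
}

// Collection-level state, guarded by the collection's mutex.
#[derive(Default)]
struct QueueColState {
    // The fix: remember the consumer here so queues that associate
    // later can still receive it (mirroring how `paused` is handled).
    metric_consumer: Option<Arc<dyn MetricConsumer>>,
    minders: Vec<QueueMinder>,
}

struct DeviceAttachment {
    state: Mutex<QueueColState>,
}

impl DeviceAttachment {
    fn set_metric_consumer(&self, c: Arc<dyn MetricConsumer>) {
        let mut st = self.state.lock().unwrap();
        // Hand the consumer to every already-associated queue...
        for m in st.minders.iter_mut() {
            m.set_metric_consumer(c.clone());
        }
        // ...and stash it for any queue that arrives late.
        st.metric_consumer = Some(c);
    }

    fn queue_associate(&self, mut minder: QueueMinder) {
        let mut st = self.state.lock().unwrap();
        // Propagate stored state to the newcomer before installing it.
        if let Some(c) = st.metric_consumer.as_ref() {
            minder.set_metric_consumer(c.clone());
        }
        st.minders.push(minder);
    }
}

fn main() {
    struct Oximeter;
    impl MetricConsumer for Oximeter {}

    let att = DeviceAttachment { state: Mutex::new(QueueColState::default()) };
    // The consumer is registered before any queue has associated...
    att.set_metric_consumer(Arc::new(Oximeter));
    // ...yet a late-associating queue still receives it.
    att.queue_associate(QueueMinder { consumer: None });
    assert!(att.state.lock().unwrap().minders[0].consumer.is_some());
    println!("late-associating queue received the metric consumer");
}
```

The key point is that both paths (consumer-first and queue-first) converge on the same stored state under the one collection mutex, so ordering no longer matters.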

@jmcarp commented Mar 11, 2026

I just ran into this myself. Looking forward to this landing so my dashboards will look nicer.

@leftwo (Contributor, Author) commented Mar 11, 2026

Fix for: #1077

@leftwo (Contributor, Author) commented Mar 11, 2026

Crucible disk metrics from Dublin running this repo:
[screenshot: Crucible disk metrics]

@leftwo leftwo marked this pull request as ready for review March 11, 2026 21:22
@leftwo (Contributor, Author) commented Mar 11, 2026

To answer a question @iximeow had: does the scrub count toward these metrics?

Here we have a propolis server running a scrub:

23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub check for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122               
23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub pause 120 seconds before starting                            
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122 begins              
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub with total_size:85899345920 block_size:512                   
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrubs from block 0 to 167772160 in (256) 131072 size IOs pm:25 

So, we have a read-only parent copying IO onto a new blank disk.
We can see this in dtrace output, where we read from one volume and write to the other:

  PID     UUID  SESSION DS0 DS1 DS2   NEXT_JOB  DELTA CONN   ELR   ELC   ERR   ERN                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18359     37    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20817     37    3     0     0     0     0                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18397     38    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20854     37    3     0     0     0     0           

Looking in the console, we can see this:

[screenshot: console metrics, IOs dropping to zero after boot]

We see the traffic from the initial boot, then IOs go to zero.
Given we are doing 1M IOs, we should see something in the metrics if the scrub was counted.

A few minutes later, we see a little traffic on the disk, but I suspect this is from whatever background activities the boot disk is logging. I don't see traffic here to indicate the scrub traffic is being counted.

[screenshot: console metrics a few minutes later, showing only light traffic]

Now I have a question for myself: why are these not counted?

@iximeow (Member) left a comment

thanks for the fix, sorry for the slow review.

this almost-but-not-quite gets us metrics for local disks, only because the plumbing in propolis-server/bin/lib/initializer.rs is a little more clever than I'd expected to find! as-is I think we'll experience a lot of heartburn around the Arc<dyn MetricConsumer> (same problem as #1003, but slightly different words), so that works out in our favor.

I'll do a followup issue about local metrics (because I think collecting them should be cheap!) and the imminent bottleneck if we turned that on as-is.

@iximeow iximeow added bug Something that isn't working. storage Related to storage devices/backends. labels Mar 31, 2026
@iximeow iximeow added this to the 19 milestone Mar 31, 2026
@iximeow iximeow mentioned this pull request Mar 31, 2026
@leftwo leftwo merged commit 7478363 into master Mar 31, 2026
12 checks passed
@leftwo leftwo deleted the alan/cru-wants-oximeter branch March 31, 2026 03:13