
Metrics for crucible disks #1073

Merged: leftwo merged 3 commits into master from alan/cru-wants-oximeter, Mar 31, 2026
Conversation

@leftwo (Contributor) commented Mar 9, 2026

Crucible disk metrics were missing because set_metric_consumer was called on a DeviceAttachment before Crucible's queues were associated with it. The original implementation only walked the already-associated queues and handed the consumer to each one's QueueMinder. Any queue that arrived late simply never received it.

The fix stores the MetricConsumer in QueueColState (the collection-level state guarded by the collection's mutex), then in queue_associate propagates it to any newly-arriving queue's minder before the minder is installed into the slot. This mirrors exactly how the paused flag is already propagated to late-associating queues.

A minor correctness fix was also included: as_mut() -> as_ref() on the minder Option, since set_metric_consumer takes &self.
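The propagation described above could look roughly like this. This is a minimal self-contained sketch, not Propolis's actual code: the type and field definitions here are simplified stand-ins for the QueueColState/QueueMinder/MetricConsumer names the PR mentions.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical simplified stand-ins for the types named in this PR.
trait MetricConsumer: Send + Sync {}

struct QueueMinder {
    consumer: Option<Arc<dyn MetricConsumer>>,
}

impl QueueMinder {
    fn set_metric_consumer(&mut self, c: Arc<dyn MetricConsumer>) {
        self.consumer = Some(c);
    }
}

// Collection-level state, guarded by the collection's mutex.
#[derive(Default)]
struct QueueColState {
    // The fix: remember the consumer here so queues that associate
    // later can still receive it (mirroring how `paused` is handled).
    metric_consumer: Option<Arc<dyn MetricConsumer>>,
    minders: Vec<QueueMinder>,
}

struct DeviceAttachment {
    state: Mutex<QueueColState>,
}

impl DeviceAttachment {
    fn set_metric_consumer(&self, c: Arc<dyn MetricConsumer>) {
        let mut st = self.state.lock().unwrap();
        // Hand the consumer to every already-associated queue...
        for m in st.minders.iter_mut() {
            m.set_metric_consumer(c.clone());
        }
        // ...and stash it for any queue that arrives late.
        st.metric_consumer = Some(c);
    }

    fn queue_associate(&self, mut minder: QueueMinder) {
        let mut st = self.state.lock().unwrap();
        // Propagate stored state to the newcomer before installing it.
        if let Some(c) = st.metric_consumer.as_ref() {
            minder.set_metric_consumer(c.clone());
        }
        st.minders.push(minder);
    }
}

fn main() {
    struct Oximeter;
    impl MetricConsumer for Oximeter {}

    let att = DeviceAttachment { state: Mutex::new(QueueColState::default()) };
    // The consumer is registered before any queue has associated...
    att.set_metric_consumer(Arc::new(Oximeter));
    // ...yet a late-associating queue still receives it.
    att.queue_associate(QueueMinder { consumer: None });
    assert!(att.state.lock().unwrap().minders[0].consumer.is_some());
    println!("late-associating queue received the metric consumer");
}
```

The key point is that both paths (consumer-first and queue-first) converge on the same stored state under the one collection mutex, so ordering no longer matters.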

@jmcarp commented Mar 11, 2026

I just ran into this myself. Looking forward to this landing so my dashboards will look nicer.

@leftwo (Contributor, Author) commented Mar 11, 2026

Fix for: #1077

@leftwo (Contributor, Author) commented Mar 11, 2026

Crucible disk metrics from Dublin running this repo:
[screenshot: Crucible disk metrics]

@leftwo leftwo marked this pull request as ready for review March 11, 2026 21:22
@leftwo (Contributor, Author) commented Mar 11, 2026

To answer a question @iximeow had: does the scrub count toward these metrics?

Here we have a propolis server running a scrub:

23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub check for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122               
23:19:20.856Z INFO propolis-server (vm_state_driver): Scrub pause 120 seconds before starting                            
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub for f04ccd2c-bbc9-4db0-9f3a-9c8bc25ec122 begins              
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrub with total_size:85899345920 block_size:512                   
23:21:20.857Z INFO propolis-server (vm_state_driver): Scrubs from block 0 to 167772160 in (256) 131072 size IOs pm:25 

So, we have a read-only parent copying IO onto a new blank disk.
We can see this in dtrace output, where we read from one volume and write to the other:

  PID     UUID  SESSION DS0 DS1 DS2   NEXT_JOB  DELTA CONN   ELR   ELC   ERR   ERN                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18359     37    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20817     37    3     0     0     0     0                                       
16419 1122ef37 de0b7fed ACT ACT ACT      18397     38    3     0     0     0     0                                       
16419 f04ccd2c f6dbfa5c ACT ACT ACT      20854     37    3     0     0     0     0           

Looking in the console, we can see this:

[screenshot: console metrics, IOs dropping to zero after boot]

We see the traffic from the initial boot, then IOs go to zero.
Given we are doing 1M IOs, we should see something in the metrics if the scrub was counted.

A few minutes later, we see a little traffic on the disk, but I suspect this is from whatever background activities the boot disk is logging. I don't see traffic here to indicate the scrub traffic is being counted.

[screenshot: console metrics a few minutes later, showing only light traffic]

Now I have a question for myself: why are these not counted?

@iximeow (Member) left a comment

thanks for the fix, sorry for the slow review.

this almost-but-not-quite gets us metrics for local disks, only because the plumbing in propolis-server/bin/lib/initializer.rs is a little more clever than I'd expected to find! as-is I think we'll experience a lot of heartburn around the Arc<dyn MetricConsumer> (same problem as #1003, but slightly different words), so that works out in our favor.

I'll do a followup issue about local metrics (because I think collecting them should be cheap!) and the imminent bottleneck if we turned that on as-is.

@iximeow iximeow added bug Something that isn't working. storage Related to storage devices/backends. labels Mar 31, 2026
@iximeow iximeow added this to the 19 milestone Mar 31, 2026
@iximeow iximeow mentioned this pull request Mar 31, 2026
@leftwo leftwo merged commit 7478363 into master Mar 31, 2026
12 checks passed
@leftwo leftwo deleted the alan/cru-wants-oximeter branch March 31, 2026 03:13