In the console we want to show system-level utilization and capacity over time for CPU, storage, and memory (ideally also broken down by silo, org, project, etc.). If I understand correctly, at any given point in time, the relevant data is available in CockroachDB in the instance and disk tables:
omicron/nexus/db-model/src/schema.rs, lines 105 to 106 in 508906b:

```rust
ncpus -> Int8,
memory -> Int8,
```

omicron/nexus/db-model/src/schema.rs, line 24 in 508906b:

```rust
size_bytes -> Int8,
```
However, we want to see these values as they change over time. We discussed this at the 9/22 product council and it seemed reasonable that Nexus would send metrics to ClickHouse whenever an event changes the allocations (likely creation and deletion of instances and disks). Nexus is already set up to be a metrics producer for request latencies:
Lines 115 to 121 in d1fbdd2:

```rust
let producer_registry = ProducerRegistry::with_id(config.deployment.id);
producer_registry
    .register_producer(internal_latencies.clone())
    .unwrap();
producer_registry
    .register_producer(external_latencies.clone())
    .unwrap();
```
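To record allocation changes, Nexus could register an additional producer alongside the latency producers above. As a rough, purely illustrative sketch (the type and field names below are hypothetical, and this assumes oximeter's `Target`/`Metric` derives with a `datum` field, as used for the existing metrics), a storage-allocation metric might be shaped like this:

```rust
use oximeter::{Metric, Target};
use uuid::Uuid;

/// Hypothetical target: the project whose storage allocation changed.
#[derive(Clone, Debug, Target)]
struct Project {
    silo_id: Uuid,
    organization_id: Uuid,
    project_id: Uuid,
}

/// Hypothetical metric: the change in allocated bytes caused by one
/// disk event (positive on creation, negative on deletion).
#[derive(Clone, Debug, Metric)]
struct DiskAllocationDelta {
    disk_id: Uuid,
    datum: i64,
}

// Samples built from these types would be emitted by a `Producer`
// registered with the `ProducerRegistry` shown above whenever a disk
// is created or deleted.
```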
There are some interesting problems to solve around accumulated values and filtering by project, org, etc. Say a disk is created and we want to record the change in allocated space. We can send a metric to ClickHouse that records the size delta (i.e., the size of the created disk) along with IDs for project, org, and silo to facilitate aggregate queries. But what we actually want to display in the console is the accumulation of all these changes over time. So to show disk allocation from time t1 to t2, we first need the sum of all changes from the beginning of time (t0) up to t1 as a baseline, and then, starting from that baseline, we compute the running total at each point between t1 and t2 and return those values as the result of the query.
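As a rough illustration of that two-step computation (the types and function here are hypothetical, just to make the arithmetic concrete):

```rust
use chrono::{DateTime, Utc};

/// One stored sample: the change in allocated bytes at a point in time.
struct AllocationDelta {
    timestamp: DateTime<Utc>,
    delta_bytes: i64,
}

/// Given the baseline (the sum of all deltas from t0 up to t1) and the
/// deltas recorded between t1 and t2, produce the cumulative allocation
/// at each sample point in the [t1, t2] window.
fn cumulative_allocation(
    baseline_bytes: i64,
    deltas_in_window: &[AllocationDelta],
) -> Vec<(DateTime<Utc>, i64)> {
    let mut total = baseline_bytes;
    deltas_in_window
        .iter()
        .map(|d| {
            total += d.delta_bytes;
            (d.timestamp, total)
        })
        .collect()
}
```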
As @bnaecker mentioned here on a related issue, we could calculate and store the cumulative value in each row along with the delta. That way, when we're interested in what happens between t1 and t2, we can at least skip the calculation for t0 to t1. But wait! If we want to be able to filter by individual disk, project, org, or silo, we'd also need to maintain a cumulative value for each of those groupings. So the row stored in ClickHouse would end up looking like this:
```
storage allocation metric
-------------------------
timestamp
size_bytes          (delta bytes)
disk_id
disk_acc_value      (bytes)
project_id
project_acc_value   (bytes)
org_id
org_acc_value       (bytes)
silo_id
silo_acc_value      (bytes)
fleet_acc_value     (bytes)
```
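A Rust-side sketch of that row (illustrative field names only, not an existing Nexus or oximeter type) could look like:

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// One storage-allocation sample as it might be stored in ClickHouse,
/// carrying a running total for every grouping we want to filter by.
struct StorageAllocationSample {
    timestamp: DateTime<Utc>,
    /// Change in allocated bytes for this event (positive on create,
    /// negative on delete).
    size_bytes: i64,
    disk_id: Uuid,
    /// Cumulative allocated bytes for this disk after the event.
    disk_acc_value: i64,
    project_id: Uuid,
    /// Cumulative allocated bytes for the project.
    project_acc_value: i64,
    org_id: Uuid,
    /// Cumulative allocated bytes for the organization.
    org_acc_value: i64,
    silo_id: Uuid,
    /// Cumulative allocated bytes for the silo.
    silo_acc_value: i64,
    /// Cumulative allocated bytes across the whole fleet.
    fleet_acc_value: i64,
}
```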
I hope the accumulation side of the problem is common to most metrics and can be solved in a general way.