In the console we want to show system-level utilization and capacity over time for CPU, storage, and memory (ideally also broken down by silo, org, project, etc.). If I understand correctly, at any given point in time, the relevant data is available in CockroachDB in the instance and disk tables:
omicron/nexus/db-model/src/schema.rs, lines 105 to 106 in 508906b:

```rust
ncpus -> Int8,
memory -> Int8,
```

omicron/nexus/db-model/src/schema.rs, line 24 in 508906b:

```rust
size_bytes -> Int8,
```
However, we want to see these values as they change over time. We discussed this at the 9/22 product council and it seemed reasonable that Nexus would send metrics to ClickHouse whenever an event changes the allocations (likely creation and deletion of instances and disks). Nexus is already set up to be a metrics producer for request latencies:
Lines 115 to 121 in d1fbdd2:

```rust
let producer_registry = ProducerRegistry::with_id(config.deployment.id);
producer_registry
    .register_producer(internal_latencies.clone())
    .unwrap();
producer_registry
    .register_producer(external_latencies.clone())
    .unwrap();
```
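To record allocation changes, Nexus could register an additional producer alongside the latency producers above. As a rough, purely illustrative sketch (the type and field names below are hypothetical, and this assumes oximeter's `Target`/`Metric` derives with a `datum` field, as used for the existing metrics), a storage-allocation metric might be shaped like this:

```rust
use oximeter::{Metric, Target};
use uuid::Uuid;

/// Hypothetical target: the project whose storage allocation changed.
#[derive(Clone, Debug, Target)]
struct Project {
    silo_id: Uuid,
    organization_id: Uuid,
    project_id: Uuid,
}

/// Hypothetical metric: the change in allocated bytes caused by one
/// disk event (positive on creation, negative on deletion).
#[derive(Clone, Debug, Metric)]
struct DiskAllocationDelta {
    disk_id: Uuid,
    datum: i64,
}

// Samples built from these types would be emitted by a `Producer`
// registered with the `ProducerRegistry` shown above whenever a disk
// is created or deleted.
```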
There are some interesting problems to solve around accumulated values and filtering by project, org, etc. Say a disk is created and we want to record the change in allocated space. We can send a metric to ClickHouse that records the size delta (i.e., the size of the created disk) along with IDs for project, org, and silo to facilitate aggregate queries. But what we actually want to display in the console is the accumulation of all these changes over time. So to show disk allocation from time t1 to t2, we first need the sum of all changes from the beginning of time (t0) up to t1 as a baseline, and then, starting from that baseline, we compute the running total at each point between t1 and t2 and return those values as the result of the query.
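As a rough illustration of that two-step computation (the types and function here are hypothetical, just to make the arithmetic concrete):

```rust
use chrono::{DateTime, Utc};

/// One stored sample: the change in allocated bytes at a point in time.
struct AllocationDelta {
    timestamp: DateTime<Utc>,
    delta_bytes: i64,
}

/// Given the baseline (the sum of all deltas from t0 up to t1) and the
/// deltas recorded between t1 and t2, produce the cumulative allocation
/// at each sample point in the [t1, t2] window.
fn cumulative_allocation(
    baseline_bytes: i64,
    deltas_in_window: &[AllocationDelta],
) -> Vec<(DateTime<Utc>, i64)> {
    let mut total = baseline_bytes;
    deltas_in_window
        .iter()
        .map(|d| {
            total += d.delta_bytes;
            (d.timestamp, total)
        })
        .collect()
}
```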
As @bnaecker mentioned here on a related issue, we could calculate and store the cumulative value in each row along with the delta. That way, when we're interested in what happens between t1 and t2, we can at least skip the calculation for t0 to t1. But wait! If we want to be able to filter by individual disk, project, org, or silo, we'd also need to maintain a cumulative value for each of those groupings. So the row stored in ClickHouse would end up looking like this:
```
storage allocation metric
-------------------------
timestamp
size_bytes          (delta bytes)
disk_id
disk_acc_value      (bytes)
project_id
project_acc_value   (bytes)
org_id
org_acc_value       (bytes)
silo_id
silo_acc_value      (bytes)
fleet_acc_value     (bytes)
```
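A Rust-side sketch of that row (illustrative field names only, not an existing Nexus or oximeter type) could look like:

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// One storage-allocation sample as it might be stored in ClickHouse,
/// carrying a running total for every grouping we want to filter by.
struct StorageAllocationSample {
    timestamp: DateTime<Utc>,
    /// Change in allocated bytes for this event (positive on create,
    /// negative on delete).
    size_bytes: i64,
    disk_id: Uuid,
    /// Cumulative allocated bytes for this disk after the event.
    disk_acc_value: i64,
    project_id: Uuid,
    /// Cumulative allocated bytes for the project.
    project_acc_value: i64,
    org_id: Uuid,
    /// Cumulative allocated bytes for the organization.
    org_acc_value: i64,
    silo_id: Uuid,
    /// Cumulative allocated bytes for the silo.
    silo_acc_value: i64,
    /// Cumulative allocated bytes across the whole fleet.
    fleet_acc_value: i64,
}
```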
I hope the accumulation side of the problem is common to most metrics and can be solved in a general way.