feat(BA-4050): Add Prometheus-based kernel live stat query action #10998
seedspirit wants to merge 5 commits into main
gauge, diff, rate = await asyncio.gather(
    self._query_gauge_kernel_live_stat(action.kernel_ids),
    self._query_diff_kernel_live_stat(action.kernel_ids),
    self._query_rate_kernel_live_stat(action.kernel_ids),
)
It was difficult to write a single PromQL query that handles everything at once, so I run the three queries concurrently with asyncio.gather. The Prometheus client uses a client pool, so this should not cause issues, but I'm not entirely sure this is the right approach.
Could you describe in more detail why a single batch query is difficult?
Got it. Letting Prometheus compute the diff/rate metrics is the accurate approach.
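For reference, the pattern under discussion can be sketched as below. This is a minimal illustration, not the code in this PR: the three `query_*` coroutines, the metric names, and the `{kernel_id: {metric_name: value}}` result shape are hypothetical stand-ins for the real Prometheus queries.

```python
import asyncio


# Hypothetical stand-ins for the three Prometheus instant queries.
# Each returns {kernel_id: {metric_name: value}}.
async def query_gauge(kernel_ids):
    await asyncio.sleep(0)  # placeholder for the network round trip
    return {k: {"mem": 1.0} for k in kernel_ids}


async def query_diff(kernel_ids):
    await asyncio.sleep(0)
    return {k: {"net_rx": 2.0} for k in kernel_ids}


async def query_rate(kernel_ids):
    await asyncio.sleep(0)
    return {k: {"cpu_util": 3.0} for k in kernel_ids}


async def query_live_stat_batch(kernel_ids):
    # Run the three queries concurrently; gather preserves argument order.
    gauge, diff, rate = await asyncio.gather(
        query_gauge(kernel_ids),
        query_diff(kernel_ids),
        query_rate(kernel_ids),
    )
    # Merge the three partial results into one mapping per kernel.
    merged = {k: {} for k in kernel_ids}
    for part in (gauge, diff, rate):
        for kernel_id, metrics in part.items():
            merged.setdefault(kernel_id, {}).update(metrics)
    return merged


result = asyncio.run(query_live_stat_batch(["k1", "k2"]))
```

Because the three queries are awaited together, the total wait is roughly that of the slowest query rather than the sum of all three.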
Pull request overview
Adds a Prometheus-backed “kernel live stat” batch query path to the manager metric service, exposing it as a new action and integrating it into the utilization metric processor package.
Changes:
- Implement UtilizationMetricService.query_kernel_live_stat_batch() that executes GAUGE/DIFF/RATE PromQL instant queries in parallel and merges results per kernel.
- Introduce new live-stat action/result types and supporting result/value DTOs for per-kernel batch outputs.
- Add unit tests for the batch query pipeline (including grouping, empty-kernel behavior, and PromQL rendering expectations).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py | Adds unit tests for the new batch live-stat query pipeline. |
| src/ai/backend/manager/services/metric/types.py | Adds kernel live-stat result/value dataclasses and metric classification constants. |
| src/ai/backend/manager/services/metric/root_service.py | Implements the batch live-stat Prometheus querying and query preset construction. |
| src/ai/backend/manager/services/metric/processors/utilization_metric.py | Wires the new live-stat action into the utilization metric processor package. |
| src/ai/backend/manager/services/metric/container_metric.py | Refactors metric type detection to use shared DIFF/RATE metric sets; uses UnreachableError. |
| src/ai/backend/manager/services/metric/actions/live_stat.py | Introduces KernelLiveStatAction and KernelLiveStatActionResult. |
| src/ai/backend/common/data/permission/types.py | Adds EntityType.CONTAINER_LIVE_STAT for permission/action typing. |
| changes/10998.feature.md | Adds a changelog entry for the new batch live-stat pipeline. |
Add service-layer infrastructure for querying kernel live stats from Prometheus instead of Valkey. Introduces KernelLiveStatAction, batch query methods (gauge/diff/rate), and unit tests for the pipeline.
template = (
    "sum by ({group_by})(rate("
    + CONTAINER_UTILIZATION_METRIC_NAME
    + "{{{labels}}}[{window}]))"
    " / " + str(UTILIZATION_METRIC_INTERVAL)
)
How about
template = (
"sum by ({group_by})("
"rate({metric}{{{labels}}}[{window}])"
")"
" / {interval}"
).format(
group_by=group_by,
metric=CONTAINER_UTILIZATION_METRIC_NAME,
labels=labels,
window=window,
interval=UTILIZATION_METRIC_INTERVAL,
)
or
template = (
f"sum by ({group_by})("
f"rate({CONTAINER_UTILIZATION_METRIC_NAME}{{{labels}}}[{window}])"
f")"
f" / {UTILIZATION_METRIC_INTERVAL}"
)
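To make the brace escaping concrete, here is a small sketch of both styles. The metric name, interval value, and label selector are placeholder assumptions, not the constants defined in the codebase. In both `str.format` and f-strings, `{{` and `}}` produce literal braces, so `{{{labels}}}` renders as the label selector wrapped in a single pair of braces.

```python
# Placeholder values; the real constants live in the manager codebase.
METRIC_NAME = "container_utilization"
INTERVAL = 5

TEMPLATE = (
    "sum by ({group_by})("
    "rate({metric}{{{labels}}}[{window}])"
    ")"
    " / {interval}"
)


def render_with_format(group_by: str, labels: str, window: str) -> str:
    # "{{" and "}}" become literal braces around the label selector.
    return TEMPLATE.format(
        group_by=group_by,
        metric=METRIC_NAME,
        labels=labels,
        window=window,
        interval=INTERVAL,
    )


def render_with_fstring(group_by: str, labels: str, window: str) -> str:
    # Same escaping rule applies inside f-strings.
    return (
        f"sum by ({group_by})("
        f"rate({METRIC_NAME}{{{labels}}}[{window}])"
        f")"
        f" / {INTERVAL}"
    )


query = render_with_format("kernel_id", 'kernel_id=~"k1|k2"', "1m")
```

Under these placeholder values, both functions produce the same PromQL string.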
if (
    info.kernel_id is None
    or info.container_metric_name is None
    or info.value_type is None
    or not metric.values
):
How about adding a method to the Metric type that performs these checks?
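One possible shape for such a method, as a rough sketch only: the `MetricInfo` and `Metric` dataclasses below are simplified, hypothetical versions of the real types, and the method name `is_usable` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


# Simplified, hypothetical versions of the real types.
@dataclass
class MetricInfo:
    kernel_id: Optional[str] = None
    container_metric_name: Optional[str] = None
    value_type: Optional[str] = None


@dataclass
class Metric:
    info: MetricInfo
    values: list = field(default_factory=list)

    def is_usable(self) -> bool:
        """Return True when all identifying labels and at least one value exist."""
        info = self.info
        return (
            info.kernel_id is not None
            and info.container_metric_name is not None
            and info.value_type is not None
            and bool(self.values)
        )


complete = Metric(MetricInfo("k1", "cpu_util", "current"), [(0.0, "1.5")])
missing_label = Metric(MetricInfo(None, "cpu_util", "current"), [(0.0, "1.5")])
no_values = Metric(MetricInfo("k1", "cpu_util", "current"))
```

The call site would then reduce to a single check such as `if not metric.is_usable(): continue`, keeping the validity rule in one place.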
Summary
- Add query_kernel_live_stat_batch to UtilizationMetricService that issues gauge / diff / rate PromQL queries in parallel via asyncio.gather, reducing latency from 3×RTT to 1×RTT
- Add KernelNode._entry_to_live_stat_mapping to transform Prometheus pipeline output into the legacy Valkey-compatible GQL shape
- Add KernelLiveStatAction / KernelLiveStatActionResult action pair and wire through UtilizationMetricProcessors
Test plan
- tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py
- tests/component/metric/test_kernel_live_stat_valkey_equivalence.py
Resolves BA-4050
🤖 Generated with Claude Code