fix(BA-5768): Add Prometheus relabel for model-service metrics#11170

Open
seedspirit wants to merge 5 commits into main from fix/BA-5768

seedspirit commented Apr 17, 2026

resolve #11169 (BA-5768)

Summary

  • add a Prometheus relabel rule that rewrites model-services scrape targets to a host-accessible address
  • make the relabel behavior configurable in the pyinfra Prometheus dashboard config via kernel_metrics_host
  • add a component test that verifies Prometheus can scrape a rewritten model-service target end to end

Why

Model-service targets returned by HTTP service discovery may use Docker-internal or loopback addresses that are not directly reachable from the Prometheus container. This change adds an optional host rewrite so Prometheus can still scrape the service metrics path.
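To make the problem concrete, here is a minimal sketch that parses an HTTP SD response and flags targets whose host is a loopback or private IP. The payload, addresses, hostname, and the `manager` group are hypothetical; only the `targets`/`labels` shape of HTTP SD responses and the `model-services` group name come from Prometheus and this PR.

```python
import ipaddress
import json

# Hypothetical HTTP SD response body; addresses and the "manager" group are made up.
SD_PAYLOAD = json.dumps([
    {"targets": ["172.17.0.5:30001"], "labels": {"service_group": "model-services"}},
    {"targets": ["127.0.0.1:30002"], "labels": {"service_group": "model-services"}},
    {"targets": ["manager.example.com:6001"], "labels": {"service_group": "manager"}},
])


def needs_rewrite(target: str) -> bool:
    """Return True when the target host is a loopback or private (e.g. Docker bridge) IP."""
    host, _, _port = target.rpartition(":")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname rather than an IP literal; left alone in this sketch
    return ip.is_loopback or ip.is_private


def unreachable_targets(payload: str) -> list[str]:
    """Collect every target in the SD payload that would likely need a host rewrite."""
    groups = json.loads(payload)
    return [t for g in groups for t in g["targets"] if needs_rewrite(t)]


print(unreachable_targets(SD_PAYLOAD))
```

Whether a private address is actually unreachable depends on how the Prometheus container is networked, so this check is a heuristic for illustration and is not part of the PR.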

Validation

  • added component coverage in tests/component/common/clients/prometheus/test_sd_relabel.py
  • no local test run was completed in this session

github-actions bot added the size:L 100~500 LoC label Apr 17, 2026
seedspirit changed the title from "[codex] Add Prometheus relabel for model-service metrics" to "fix(BA-5768): Add Prometheus relabel for model-service metrics" Apr 17, 2026
seedspirit marked this pull request as ready for review April 17, 2026 03:07
Copilot AI review requested due to automatic review settings April 17, 2026 03:07
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds an optional Prometheus relabel rule to rewrite model-services HTTP-SD targets to a host-accessible address (to handle Docker-internal/loopback targets), wires the behavior into the pyinfra Prometheus dashboard configuration via kernel_metrics_host, and introduces a component test to validate end-to-end scraping through the rewrite.

Changes:

  • Add relabel_configs for the http-sd job to rewrite model-services __address__ to a configured host (kernel_metrics_host).
  • Plumb kernel_metrics_host through pyinfra’s Prometheus dashboard config and template rendering.
  • Add a new component test that spins up Prometheus + mock SD/metrics endpoints and verifies scrape success after relabeling.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/component/common/clients/prometheus/test_sd_relabel.py | New component test covering HTTP-SD + relabel rewrite + scrape verification. |
| src/ai/backend/install/pyinfra/deploy/monitor/dashboard/prometheus/templates/prometheus.yml.j2 | Adds conditional relabel rule to rewrite model-service targets when configured. |
| src/ai/backend/install/pyinfra/deploy/monitor/dashboard/prometheus/deploy.py | Passes kernel_metrics_host into the Jinja template context. |
| src/ai/backend/install/pyinfra/configs/dashboard.py | Adds the kernel_metrics_host Prometheus dashboard setting. |
| configs/prometheus/prometheus.yaml | Updates halfstack Prometheus config to include the relabel rewrite rule. |
| changes/11170.fix.md | Changelog entry for the scraping fix. |

Comment on lines +205 to +207
max_attempts = 15
result: PrometheusResponse | None = None

Comment on lines +205 to +210
max_attempts = 15
result: PrometheusResponse | None = None

for _ in range(max_attempts):
    time.sleep(2)
    result = await prometheus_client_with_relabel.query_instant(up_model_service_preset)
Comment on lines +137 to +146
container = (
    DockerContainer("prom/prometheus:v2.53.0")
    .with_name(f"test--prom-relabel-slot-{get_parallel_slot()}-{random_id}")
    .with_exposed_ports(9090)
    .with_volume_mapping(
        str(prometheus_config_with_relabel),
        "/etc/prometheus/prometheus.yml",
        mode="ro",
    )
    .with_kwargs(
)


class TestKernelMetricsScrapeWithRelabel:
# Jinja2 context (resolve host.docker.internal to actual host IP)
http_sd_host=self.resolve_host(self.config.http_sd_host),
http_sd_port=self.config.http_sd_port,
kernel_metrics_host=self.config.kernel_metrics_host,
Comment on lines +31 to +38
{%- if kernel_metrics_host %}
relabel_configs:
  # Rewrite model-service targets from Docker-internal IPs to host-accessible address
  - source_labels: [service_group, __address__]
    separator: ;
    regex: model-services;[^:]+:(.+)
    target_label: __address__
    replacement: {{ kernel_metrics_host }}:${1}
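To see what the rule above does to a single target, the following sketch imitates Prometheus's default `replace` action in plain Python: the source label values are joined with the separator, the regex must match the whole joined string, and `${1}` expands to the first capture group. The function name is hypothetical, the sample address is made up, and `host.docker.internal` is only an assumed value for `kernel_metrics_host`. Prometheus itself uses RE2, so this is an approximation.

```python
import re
from typing import Dict


def apply_replace_rule(
    labels: Dict[str, str],
    source_labels: list,
    separator: str,
    regex: str,
    target_label: str,
    replacement: str,
) -> Dict[str, str]:
    """Approximate a Prometheus relabel rule with action 'replace'."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    # Translate ${1}-style references into Python's \g<1> template syntax
    py_template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    out = dict(labels)
    out[target_label] = match.expand(py_template)
    return out


RULE = dict(
    source_labels=["service_group", "__address__"],
    separator=";",
    regex=r"model-services;[^:]+:(.+)",
    target_label="__address__",
    replacement="host.docker.internal:${1}",  # assumed kernel_metrics_host value
)

model_target = {"service_group": "model-services", "__address__": "172.17.0.5:30001"}
other_target = {"service_group": "manager", "__address__": "172.17.0.9:6001"}

print(apply_replace_rule(model_target, **RULE)["__address__"])
print(apply_replace_rule(other_target, **RULE)["__address__"])
```

Only the host part changes for `model-services` targets; the port captured by `(.+)` is carried over, and targets from other service groups fall through unchanged because the joined string does not match.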
Comment on lines +28 to +31
"Host-accessible address for kernel metrics scraping. "
"When set, model-service targets returned by HTTP SD will have their "
"Docker-internal IPs rewritten to this address via relabel_configs. "
"Leave empty to disable rewriting (use when kernel IPs are already routable)."

Development

Successfully merging this pull request may close these issues.

Prometheus cannot scrape model-service kernel metrics due to Docker-internal kernel_host in HTTP SD targets