Skip to content

fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess#11036

Closed
agatha197 wants to merge 2 commits intomainfrom
fix/BA-5707-prometheus-multiproc-early-import
Closed

fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess#11036
agatha197 wants to merge 2 commits intomainfrom
fix/BA-5707-prometheus-multiproc-early-import

Conversation

@agatha197
Copy link
Copy Markdown
Contributor

@agatha197 agatha197 commented Apr 14, 2026

Summary

  • Move prometheus_client imports in common/metrics/multiprocess.py from module top level into the function bodies that use them.
  • Without this, importing setup_prometheus_multiprocess_dir transitively imports prometheus_client.values, which binds ValueClass once at import time based on the current PROMETHEUS_MULTIPROC_DIR. Because every CLI entry point imports the setup helper before it sets that env var, ValueClass was frozen to MutexValue (single-process) in the parent. Fork-based aiotools workers (default for manager/agent) inherited that state and every Gauge write was silently dropped.
  • Storage was accidentally safe because storage/server.py:818 calls multiprocessing.set_start_method("spawn"), which makes workers start a fresh Python process that re-imports prometheus_client after the env var is already set.

Jira

BA-5707

Symptom

  • curl :6003/metrics (agent) and curl :18080/metrics (manager) returned 0 bytes.
  • run/prometheus/{agent,manager}/ contained no .db files.
  • Prometheus had no backendai_container_utilization samples → auto-scaling rules backed by Prometheus metrics never got data → prometheusQueryPresetResult GQL always returned result: [].

Verification (local dev, full stack restart)

  • run/prometheus/agent/ and run/prometheus/manager/ now contain the three .db files per worker pid.
  • curl :6003/metrics → ~40 KB, 12 backendai_container_utilization samples.
  • curl :18080/metrics → ~74 KB, 504 backendai_* metric lines.
  • Prometheus backendai_container_utilization sample count: 10.
  • prometheusQueryPresetResult GQL returns actual vector result with kernel_id + numeric value.
  • pants fmt | lint | check all pass.

Test plan

  • After deploy, curl :18080/metrics on manager shows non-empty backendai_* metrics.
  • After deploy, curl :6003/metrics on agent shows backendai_container_utilization while any kernel is running.
  • Prometheus has non-zero samples for backendai_container_utilization.
  • prometheusQueryPresetResult GQL returns non-empty data when a healthy replica is running.
  • Grafana manager dashboards no longer show "No data".

Note

Other multi-worker components using the same setup_prometheus_multiprocess_dir pattern without set_start_method("spawn") (account-manager, appproxy coordinator/worker) were almost certainly affected by the same bug and are fixed by this change. Storage's explicit set_start_method("spawn") can stay or be removed later — it is no longer required for multi-process metrics to work after this fix.

🤖 Generated with Claude Code

…process

Imports at the top of `common/metrics/multiprocess.py` pulled in
`prometheus_client.values`, which binds `ValueClass` exactly once at
import time based on the current `PROMETHEUS_MULTIPROC_DIR`. Because
every CLI entry point imports `setup_prometheus_multiprocess_dir`
*before* it sets that env var, `ValueClass` was frozen as `MutexValue`
for the whole parent process. With fork-based `aiotools` workers (the
default for manager/agent), children inherited the single-process
registry and every Gauge write was silently dropped — `/metrics`
returned 0 bytes and `run/prometheus/{manager,agent}/` never grew `.db`
files.

Move the three `prometheus_client` imports into the function bodies
that use them. `setup_prometheus_multiprocess_dir` no longer triggers a
prometheus_client import, so by the time the server module finally
imports it the env var is in place and `ValueClass` binds to
`MmapedValue` correctly.

Storage was not affected because it calls
`multiprocessing.set_start_method("spawn")` in server.py, which makes
workers re-import prometheus_client fresh in a new Python process —
masking the bug. Agent and manager never set the spawn method and hit
it.

Verified end-to-end after restart: agent /metrics returns ~40 KB with
12 backendai_container_utilization samples, manager /metrics returns
~74 KB with 504 backendai_* metrics, Prometheus ingests the samples,
and prometheusQueryPresetResult GQL returns non-empty data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 14, 2026 06:17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:S 10~30 LoC comp:common Related to Common component labels Apr 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes multiprocess Prometheus metrics being silently dropped in fork-based worker processes by deferring prometheus_client imports until after PROMETHEUS_MULTIPROC_DIR is set, ensuring ValueClass is initialized in multiprocess mode.

Changes:

  • Move prometheus_client imports in common/metrics/multiprocess.py from module scope into the functions that use them.
  • Add an in-file note documenting why module-level imports break multiprocess metrics initialization.
  • Add a changelog entry describing the fix and its effect (MmapedValue vs MutexValue).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/ai/backend/common/metrics/multiprocess.py Defers prometheus_client imports to avoid premature ValueClass binding and ensure correct multiprocess metrics behavior.
changes/11036.fix.md Documents the bugfix for silent Prometheus metrics loss in manager/agent.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@agatha197 agatha197 requested a review from HyeockJinKim April 14, 2026 06:21
graphite-app Bot pushed a commit to lablup/backend.ai-webui that referenced this pull request Apr 16, 2026
…6642)

Resolves #6642 (FR-2494)

Blocked by lablup/backend.ai#11036

> [!NOTE]
> Test Node: /serving/e21d2f21-028a-405b-b196-69736b7343b0 with 10.122.10.215

## Summary

- Add Prometheus query result preview in the auto-scaling rule editor modal
- When a Prometheus preset is selected, an instant query result is displayed below the query template using `prometheusQueryPresetResult` API
- Shows current metric value with a refresh button to re-fetch on demand
- Handles multiple series results, empty results, loading (Suspense), and errors (ErrorBoundary)

## Changes

- `react/src/components/AutoScalingRuleEditorModal.tsx`: Added `PrometheusPresetPreview` component using `useLazyLoadQuery` with `fetchKey` for manual refresh support
- `resources/i18n/en.json`: Added i18n keys for preview UI (`CurrentValue`, `MultipleSeriesResult`, `NoDataAvailable`, `RefreshPreview`)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:common Related to Common component size:S 10~30 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants