fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess#11036
Closed
…process
Imports at the top of `common/metrics/multiprocess.py` pulled in
`prometheus_client.values`, which binds `ValueClass` exactly once at
import time based on the current `PROMETHEUS_MULTIPROC_DIR`. Because
every CLI entry point imports `setup_prometheus_multiprocess_dir`
*before* it sets that env var, `ValueClass` was frozen as `MutexValue`
for the whole parent process. With fork-based `aiotools` workers (the
default for manager/agent), children inherited the single-process
registry and every Gauge write was silently dropped — `/metrics`
returned 0 bytes and `run/prometheus/{manager,agent}/` never grew `.db`
files.
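The ordering problem described above can be reproduced without the real library. The sketch below uses a hypothetical stand-in module (`fake_prom_values`, not part of `prometheus_client`) that picks its value class once in the module body, the way the message describes `prometheus_client.values` doing. The helper names are illustrative assumptions.

```python
import os
import sys
import types

# Hypothetical stand-in for a module that, like prometheus_client.values,
# chooses its value class once, when the module body first runs.
FAKE_VALUES_SOURCE = """
import os

class MutexValue:
    pass

class MmapedValue:
    pass

def get_value_class():
    if "PROMETHEUS_MULTIPROC_DIR" in os.environ:
        return MmapedValue
    return MutexValue

ValueClass = get_value_class()  # evaluated exactly once, on first import
"""

MODULE_NAME = "fake_prom_values"


def import_fake_values() -> types.ModuleType:
    """Mimic `import`: run the module body on first use, then reuse the cache."""
    cached = sys.modules.get(MODULE_NAME)
    if cached is not None:
        return cached
    module = types.ModuleType(MODULE_NAME)
    exec(FAKE_VALUES_SOURCE, module.__dict__)
    sys.modules[MODULE_NAME] = module
    return module


def bound_class(set_env_before_import: bool) -> str:
    """Return the name ValueClass ends up bound to for a given ordering."""
    saved = os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None)
    sys.modules.pop(MODULE_NAME, None)
    try:
        if set_env_before_import:
            os.environ["PROMETHEUS_MULTIPROC_DIR"] = "/tmp/prom-sketch"
            import_fake_values()
        else:
            import_fake_values()  # import happens too early
            os.environ["PROMETHEUS_MULTIPROC_DIR"] = "/tmp/prom-sketch"
        # A second import returns the cached module; nothing is re-evaluated.
        return import_fake_values().ValueClass.__name__
    finally:
        # Restore the environment and drop the stand-in module again.
        os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None)
        if saved is not None:
            os.environ["PROMETHEUS_MULTIPROC_DIR"] = saved
        sys.modules.pop(MODULE_NAME, None)
```

With the import first, the stand-in stays on `MutexValue` even though the env var is set afterwards; with the env var first, it binds `MmapedValue`.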
Move the three `prometheus_client` imports into the function bodies
that use them. `setup_prometheus_multiprocess_dir` no longer triggers a
prometheus_client import, so by the time the server module finally
imports it the env var is in place and `ValueClass` binds to
`MmapedValue` correctly.
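A minimal sketch of the deferred-import pattern, assuming a simplified helper. The function names, signature, and directory layout here are hypothetical and are not the actual contents of `common/metrics/multiprocess.py`. The point is only that the setup helper touches nothing from `prometheus_client`, while the function that needs the library imports it inside its own body.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def setup_multiprocess_dir_sketch(component: str, base: Optional[Path] = None) -> Path:
    """Hypothetical stand-in for the setup helper: no prometheus_client import."""
    if base is None:
        # Assumed default location, chosen only for this illustration.
        base = Path(tempfile.gettempdir()) / "prometheus-sketch"
    path = base / component
    path.mkdir(parents=True, exist_ok=True)
    os.environ["PROMETHEUS_MULTIPROC_DIR"] = str(path)
    return path


def build_registry_sketch():
    """Import prometheus_client only when a registry is actually needed."""
    # Deferred imports: by the time this runs, the env var is expected to be set,
    # so the library's import-time choice sees multiprocess mode.
    from prometheus_client import CollectorRegistry
    from prometheus_client.multiprocess import MultiProcessCollector

    registry = CollectorRegistry()
    MultiProcessCollector(registry)
    return registry
```

Importing the module that defines these functions is cheap and has no side effects on `prometheus_client`; the library is loaded only when `build_registry_sketch()` is first called.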
Storage was not affected because it calls
`multiprocessing.set_start_method("spawn")` in server.py, which makes
workers re-import prometheus_client fresh in a new Python process —
masking the bug. Agent and manager never set the spawn method and hit
it.
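Why a fresh interpreter masks the bug can be illustrated with a plain subprocess, used here as a rough stand-in for a spawn-started worker. The child re-evaluates the import-time style decision against the environment it is handed. The child snippet and the function name are assumptions made for illustration; it does not import `prometheus_client`.

```python
import os
import subprocess
import sys
from typing import Optional

# Child program: makes the same kind of env-based choice at startup.
CHILD_CODE = (
    "import os\n"
    "print('MmapedValue' if 'PROMETHEUS_MULTIPROC_DIR' in os.environ else 'MutexValue')\n"
)


def fresh_interpreter_choice(multiproc_dir: Optional[str]) -> str:
    """Start a new Python process and report which class it would pick."""
    env = dict(os.environ)
    env.pop("PROMETHEUS_MULTIPROC_DIR", None)
    if multiproc_dir is not None:
        env["PROMETHEUS_MULTIPROC_DIR"] = multiproc_dir
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A forked worker, by contrast, copies the parent's already-imported modules and never makes this decision again, which is the case the message describes for agent and manager.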
Verified end-to-end after restart: agent /metrics returns ~40 KB with
12 backendai_container_utilization samples, manager /metrics returns
~74 KB with 504 backendai_* metrics, Prometheus ingests the samples,
and prometheusQueryPresetResult GQL returns non-empty data.
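A check like the sample counts above could be scripted roughly as follows. This is a sketch that counts sample lines in Prometheus text exposition output by metric-name prefix; the function name and the sample payload (including its label and values) are made up for illustration.

```python
def count_samples(exposition_text: str, prefix: str) -> int:
    """Count sample lines whose metric name starts with `prefix`."""
    count = 0
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(prefix):
            count += 1
    return count


# Illustrative payload only; not captured from a real agent.
SAMPLE_TEXT = """\
# HELP backendai_container_utilization Example help text
# TYPE backendai_container_utilization gauge
backendai_container_utilization{kernel_id="k1"} 0.5
backendai_container_utilization{kernel_id="k2"} 1.5
other_metric 3
"""
```

An empty `/metrics` body, the symptom before the fix, yields a count of zero for any prefix.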
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes multiprocess Prometheus metrics being silently dropped in fork-based worker processes by deferring prometheus_client imports until after PROMETHEUS_MULTIPROC_DIR is set, ensuring ValueClass is initialized in multiprocess mode.
Changes:
- Move `prometheus_client` imports in `common/metrics/multiprocess.py` from module scope into the functions that use them.
- Add an in-file note documenting why module-level imports break multiprocess metrics initialization.
- Add a changelog entry describing the fix and its effect (`MmapedValue` vs `MutexValue`).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ai/backend/common/metrics/multiprocess.py | Defers prometheus_client imports to avoid premature ValueClass binding and ensure correct multiprocess metrics behavior. |
| changes/11036.fix.md | Documents the bugfix for silent Prometheus metrics loss in manager/agent. |
graphite-app Bot pushed a commit to lablup/backend.ai-webui that referenced this pull request on Apr 16, 2026
…6642)
Resolves #6642 (FR-2494)
Blocked by lablup/backend.ai#11036
> [!NOTE]
> Test Node: /serving/e21d2f21-028a-405b-b196-69736b7343b0 with 10.122.10.215
## Summary
- Add Prometheus query result preview in the auto-scaling rule editor modal
- When a Prometheus preset is selected, an instant query result is displayed below the query template using `prometheusQueryPresetResult` API
- Shows current metric value with a refresh button to re-fetch on demand
- Handles multiple series results, empty results, loading (Suspense), and errors (ErrorBoundary)
## Changes
- `react/src/components/AutoScalingRuleEditorModal.tsx`: Added `PrometheusPresetPreview` component using `useLazyLoadQuery` with `fetchKey` for manual refresh support
- `resources/i18n/en.json`: Added i18n keys for preview UI (`CurrentValue`, `MultipleSeriesResult`, `NoDataAvailable`, `RefreshPreview`)
Summary
- Move `prometheus_client` imports in `common/metrics/multiprocess.py` from module top level into the function bodies that use them.
- Before this change, importing `setup_prometheus_multiprocess_dir` transitively imported `prometheus_client.values`, which binds `ValueClass` once at import time based on the current `PROMETHEUS_MULTIPROC_DIR`. Because every CLI entry point imports the setup helper before it sets that env var, `ValueClass` was frozen to `MutexValue` (single-process) in the parent. Fork-based aiotools workers (default for manager/agent) inherited that state and every Gauge write was silently dropped.
- Storage was not affected: `storage/server.py:818` calls `multiprocessing.set_start_method("spawn")`, which makes workers start a fresh Python process that re-imports `prometheus_client` after the env var is already set.

Jira
BA-5707

Symptom
- `curl :6003/metrics` (agent) and `curl :18080/metrics` (manager) returned 0 bytes.
- `run/prometheus/{agent,manager}/` contained no `.db` files.
- No `backendai_container_utilization` samples → auto-scaling rules backed by Prometheus metrics never got data → `prometheusQueryPresetResult` GQL always returned `result: []`.

Verification (local dev, full stack restart)
- `run/prometheus/agent/` and `run/prometheus/manager/` now contain the three `.db` files per worker pid.
- `curl :6003/metrics` → ~40 KB, 12 `backendai_container_utilization` samples.
- `curl :18080/metrics` → ~74 KB, 504 `backendai_*` metric lines.
- `backendai_container_utilization` sample count: 10.
- `prometheusQueryPresetResult` GQL returns actual vector result with kernel_id + numeric value.
- `pants fmt | lint | check` all pass.

Test plan
- `curl :18080/metrics` on manager shows non-empty `backendai_*` metrics.
- `curl :6003/metrics` on agent shows `backendai_container_utilization` while any kernel is running.
- `backendai_container_utilization`.
- `prometheusQueryPresetResult` GQL returns non-empty data when a healthy replica is running.

Note
Other multi-worker components using the same `setup_prometheus_multiprocess_dir` pattern without `set_start_method("spawn")` (account-manager, appproxy coordinator/worker) were almost certainly affected by the same bug and are fixed by this change. Storage's explicit `set_start_method("spawn")` can stay or be removed later. It is no longer required for multi-process metrics to work after this fix.

🤖 Generated with Claude Code