fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess by agatha197 · Pull Request #11036 · lablup/backend.ai

agatha197 · 2026-04-14T06:17:04Z

Summary

Move prometheus_client imports in common/metrics/multiprocess.py from module top level into the function bodies that use them.
Without this, importing setup_prometheus_multiprocess_dir transitively imports prometheus_client.values, which binds ValueClass once at import time based on the current PROMETHEUS_MULTIPROC_DIR. Because every CLI entry point imports the setup helper before it sets that env var, ValueClass was frozen to MutexValue (single-process) in the parent. Fork-based aiotools workers (default for manager/agent) inherited that state and every Gauge write was silently dropped.
Storage was accidentally safe because storage/server.py:818 calls multiprocessing.set_start_method("spawn"), which makes workers start a fresh Python process that re-imports prometheus_client after the env var is already set.

Jira

BA-5707

Symptom

curl :6003/metrics (agent) and curl :18080/metrics (manager) returned 0 bytes.
run/prometheus/{agent,manager}/ contained no .db files.
Prometheus had no backendai_container_utilization samples → auto-scaling rules backed by Prometheus metrics never got data → prometheusQueryPresetResult GQL always returned result: [].

Verification (local dev, full stack restart)

run/prometheus/agent/ and run/prometheus/manager/ now contain the three .db files per worker pid.
curl :6003/metrics → ~40 KB, 12 backendai_container_utilization samples.
curl :18080/metrics → ~74 KB, 504 backendai_* metric lines.
Prometheus backendai_container_utilization sample count: 10.
prometheusQueryPresetResult GQL returns actual vector result with kernel_id + numeric value.
pants fmt | lint | check all pass.

Test plan

After deploy, curl :18080/metrics on manager shows non-empty backendai_* metrics.
After deploy, curl :6003/metrics on agent shows backendai_container_utilization while any kernel is running.
Prometheus has non-zero samples for backendai_container_utilization.
prometheusQueryPresetResult GQL returns non-empty data when a healthy replica is running.
Grafana manager dashboards no longer show "No data".

Note

Other multi-worker components using the same setup_prometheus_multiprocess_dir pattern without set_start_method("spawn") (account-manager, appproxy coordinator/worker) were almost certainly affected by the same bug and are fixed by this change. Storage's explicit set_start_method("spawn") can stay or be removed later — it is no longer required for multi-process metrics to work after this fix.

🤖 Generated with Claude Code

…process Imports at the top of `common/metrics/multiprocess.py` pulled in `prometheus_client.values`, which binds `ValueClass` exactly once at import time based on the current `PROMETHEUS_MULTIPROC_DIR`. Because every CLI entry point imports `setup_prometheus_multiprocess_dir` *before* it sets that env var, `ValueClass` was frozen as `MutexValue` for the whole parent process. With fork-based `aiotools` workers (the default for manager/agent), children inherited the single-process registry and every Gauge write was silently dropped — `/metrics` returned 0 bytes and `run/prometheus/{manager,agent}/` never grew `.db` files. Move the three `prometheus_client` imports into the function bodies that use them. `setup_prometheus_multiprocess_dir` no longer triggers a prometheus_client import, so by the time the server module finally imports it the env var is in place and `ValueClass` binds to `MmapedValue` correctly. Storage was not affected because it calls `multiprocessing.set_start_method("spawn")` in server.py, which makes workers re-import prometheus_client fresh in a new Python process — masking the bug. Agent and manager never set the spawn method and hit it. Verified end-to-end after restart: agent /metrics returns ~40 KB with 12 backendai_container_utilization samples, manager /metrics returns ~74 KB with 504 backendai_* metrics, Prometheus ingests the samples, and prometheusQueryPresetResult GQL returns non-empty data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes multiprocess Prometheus metrics being silently dropped in fork-based worker processes by deferring prometheus_client imports until after PROMETHEUS_MULTIPROC_DIR is set, ensuring ValueClass is initialized in multiprocess mode.

Changes:

Move prometheus_client imports in common/metrics/multiprocess.py from module scope into the functions that use them.
Add an in-file note documenting why module-level imports break multiprocess metrics initialization.
Add a changelog entry describing the fix and its effect (MmapedValue vs MutexValue).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
src/ai/backend/common/metrics/multiprocess.py	Defers `prometheus_client` imports to avoid premature `ValueClass` binding and ensure correct multiprocess metrics behavior.
changes/11036.fix.md	Documents the bugfix for silent Prometheus metrics loss in manager/agent.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…6642) Resolves #6642 (FR-2494) Blocked by lablup/backend.ai#11036 > [!NOTE] > Test Node: /serving/e21d2f21-028a-405b-b196-69736b7343b0 with 10.122.10.215 ## Summary - Add Prometheus query result preview in the auto-scaling rule editor modal - When a Prometheus preset is selected, an instant query result is displayed below the query template using `prometheusQueryPresetResult` API - Shows current metric value with a refresh button to re-fetch on demand - Handles multiple series results, empty results, loading (Suspense), and errors (ErrorBoundary) ## Changes - `react/src/components/AutoScalingRuleEditorModal.tsx`: Added `PrometheusPresetPreview` component using `useLazyLoadQuery` with `fetchKey` for manual refresh support - `resources/i18n/en.json`: Added i18n keys for preview UI (`CurrentValue`, `MultipleSeriesResult`, `NoDataAvailable`, `RefreshPreview`)

Copilot AI review requested due to automatic review settings April 14, 2026 06:17

github-actions Bot assigned agatha197 Apr 14, 2026

chore: add news fragment for BA-5707 fix

b2c8f4c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added size:S 10~30 LoC comp:common Related to Common component labels Apr 14, 2026

Copilot AI reviewed Apr 14, 2026

View reviewed changes

agatha197 requested a review from HyeockJinKim April 14, 2026 06:21

agatha197 mentioned this pull request Apr 14, 2026

feat(FR-2494): add Prometheus query result preview in editor modal lablup/backend.ai-webui#6642

Merged

HyeockJinKim closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess#11036

fix(BA-5707): defer prometheus_client imports in common/metrics/multiprocess#11036
agatha197 wants to merge 2 commits intomainfrom
fix/BA-5707-prometheus-multiproc-early-import

agatha197 commented Apr 14, 2026 •

edited by atlassian Bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

agatha197 commented Apr 14, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Jira

Symptom

Verification (local dev, full stack restart)

Test plan

Note

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

agatha197 commented Apr 14, 2026 •

edited by atlassian Bot

Loading