🐛 Fix GPU exhaustion: cluster-level threshold, not per-node #4185

Merged
clubanderson merged 2 commits into main from fix/gpu-exhaustion-cluster-level
Apr 1, 2026
Conversation

@clubanderson
Collaborator

Summary

The Predictive Health Monitor was flagging every node at 100% GPU allocation (e.g., 8/8) as an issue. A fully utilized GPU node is healthy — not a problem.

Changed to cluster-level detection:

  • Over-allocation (allocated > capacity): critical severity
  • Cluster >80% allocated: warning severity
  • Individual nodes at 100%: no longer flagged

Uses strict > (not >=) for the 80% threshold — exactly 80% is fine.
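The three rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the component's actual code: the function name and signature are hypothetical, while GPU_CLUSTER_EXHAUSTION_THRESHOLD mirrors the constant introduced in the diff.

```typescript
// Minimal sketch of the cluster-level rules, assuming a simple
// (allocated, total) pair per cluster. Function name is hypothetical.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

type GpuRisk = 'critical' | 'warning' | null

function detectGpuExhaustion(allocated: number, total: number): GpuRisk {
  if (total > 0 && allocated > total) {
    return 'critical' // over-allocation is always an error
  }
  if (total > 0 && allocated / total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
    return 'warning' // strict >, so exactly 80% is not flagged
  }
  return null // includes the exactly-80% case and empty clusters
}
```

Under these rules a cluster at 8/10 GPUs (exactly 80%) produces no risk, while 9 allocated against a capacity of 8 is critical.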

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
… per-node

A fully utilized node (8/8 GPUs) is healthy — not an issue. Changed
GPU exhaustion detection from per-node to cluster-level:

- Over-allocation (allocated > capacity): critical (always an error)
- Cluster at >80% GPU allocation: warning (capacity planning)
- Individual nodes at 100%: no longer flagged (normal utilization)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings April 1, 2026 17:53
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 1, 2026
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 1, 2026
@clubanderson clubanderson merged commit 3ba35f3 into main Apr 1, 2026
17 of 18 checks passed
@netlify

netlify bot commented Apr 1, 2026

Deploy Preview for kubestellarconsole ready!

🔨 Latest commit: ef53d6f
🔍 Latest deploy log: https://app.netlify.com/projects/kubestellarconsole/deploys/69cd5bb94511ef0007e35bf4
😎 Deploy Preview: https://deploy-preview-4185.console-deploy-preview.kubestellar.io

@kubestellar-prow kubestellar-prow bot deleted the fix/gpu-exhaustion-cluster-level branch April 1, 2026 17:54
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

Contributor

Copilot AI left a comment


Pull request overview

Adjusts GPU exhaustion detection to avoid flagging healthy, fully utilized GPU nodes by switching to cluster-level allocation thresholds, and tweaks coverage workflow gating.

Changes:

  • Replace per-node “100% allocated” GPU exhaustion detection with per-cluster aggregation and thresholding (critical on over-allocation; warning when cluster allocation exceeds 80%).
  • Update risk messaging/IDs to be cluster-scoped for GPU exhaustion signals.
  • Make the hourly coverage merge job run even when upstream shard jobs fail.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx — Aggregates GPU capacity/allocations per cluster and emits cluster-level risks instead of per-node 100% utilization risks.
File: .github/workflows/coverage-hourly.yml — Changes merge job condition to always() so it executes regardless of upstream shard job outcomes (still repo-gated).
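The per-cluster aggregation the review describes can be sketched as follows. The NodeGpu shape and function name here are hypothetical stand-ins, not the component's actual types:

```typescript
// Hypothetical node shape for illustration; the real component's
// data model may differ.
interface NodeGpu {
  cluster: string
  gpuCapacity: number
  gpuAllocated: number
}

// Sum GPU capacity and allocation per cluster so thresholds can be
// applied to cluster totals rather than individual nodes.
function aggregateGpusByCluster(
  nodes: NodeGpu[],
): Map<string, { total: number; allocated: number }> {
  const clusters = new Map<string, { total: number; allocated: number }>()
  for (const node of nodes) {
    const agg = clusters.get(node.cluster) ?? { total: 0, allocated: 0 }
    agg.total += node.gpuCapacity
    agg.allocated += node.gpuAllocated
    clusters.set(node.cluster, agg)
  }
  return clusters
}
```

With this shape, a node at 8/8 GPUs in a cluster whose totals come to 10/16 (62.5%) stays below the 80% threshold, so no risk is emitted even though one node is fully utilized.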

Comment on lines +423 to +424
// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
// total GPUs are allocated. Individual nodes at 100% is normal utilization.

Copilot AI Apr 1, 2026


The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).

Suggested change
-// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
-// total GPUs are allocated. Individual nodes at 100% is normal utilization.
+// 3. Cluster-level GPU exhaustion — only flag when above 80% of a cluster's
+// total GPUs are allocated (>80%). Individual nodes at 100% is normal utilization.

source: 'heuristic',
})
} else if (gpus.total > 0 && gpus.allocated / gpus.total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
// Flag cluster-level near-exhaustion (80%+ allocated)

Copilot AI Apr 1, 2026


The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).

Suggested change
-// Flag cluster-level near-exhaustion (80%+ allocated)
+// Flag cluster-level near-exhaustion (>80% allocated)

merge-coverage:
  needs: test-shard
-  if: github.repository == 'kubestellar/console'
+  if: always() && github.repository == 'kubestellar/console'

Copilot AI Apr 1, 2026


Using always() here will run merge-coverage even when test-shard fails or is cancelled. If merge-coverage expects artifacts from all shards, this can cause noisy secondary failures and make the workflow harder to interpret. Consider instead gating on shard success (or handling missing artifacts explicitly inside the merge job), so the merge step doesn’t fail in a cascading way.

Suggested change
-if: always() && github.repository == 'kubestellar/console'
+if: github.repository == 'kubestellar/console' && needs.test-shard.result == 'success'

// 3. GPU nodes with high allocation - risk of GPU exhaustion
// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
// total GPUs are allocated. Individual nodes at 100% is normal utilization.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

Copilot AI Apr 1, 2026


This constant is defined inside the component function, so it will be re-created on each render. Consider hoisting it to module scope (or a shared constants file) to reduce per-render churn and keep policy thresholds centralized.

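Copilot's hoisting suggestion amounts to declaring the threshold at module scope rather than inside the component body. A minimal sketch, with a hypothetical function standing in for the real ConsoleOfflineDetectionCard render path:

```typescript
// Module scope: evaluated once when the module loads, shared by all renders,
// and easy to relocate later into a shared constants file.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

// Hypothetical stand-in for the component's render-path logic. Before the
// fix, the constant above was declared inside a function like this one,
// so it was re-created on every call (i.e. every render).
function gpuRiskLabel(allocated: number, total: number): 'warning' | 'ok' {
  if (total > 0 && allocated / total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
    return 'warning'
  }
  return 'ok'
}
```

For a primitive constant the per-render cost is tiny; the stronger argument in the review is centralizing the policy threshold in one place.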
@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 1 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 424): // 3. Cluster-level GPU exhaustion — only flag when above 80% of a cluster's ...
  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 456): // Flag cluster-level near-exhaustion (>80% allocated)
  • .github/workflows/coverage-hourly.yml (line 62): if: github.repository == 'kubestellar/console' && needs.test-shard.result == 'su...

Also address these general comments:

  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 425): This constant is defined inside the component function, so it will be re-created on each render. Consider hoisting it to

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

clubanderson added a commit that referenced this pull request Apr 1, 2026
- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 1, 2026
…ow (#4186)

- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 1, 2026
)

* 🐛 Address Copilot comments: tour a11y, GPU threshold, coverage workflow

- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🌱 Add debug logging for screenshot upload flow in feedback dialog

All logs use console.debug with [Screenshot] prefix for easy filtering
in browser DevTools. Logs at each stage:
- File selection (drag, drop, paste, file picker): file count, types, sizes
- FileReader: loaded data URI length, errors
- Paste: clipboard item types, getAsFile results
- Submit: screenshot count, sizes, timeout, success/failure details

This helps team members diagnose why screenshot uploads fail on their
machines while working on the reporter's machine.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 2, 2026
Remove continue-on-error: true from test shard steps, which caused
the test-shard job to always report success even when tests failed.
The merge-coverage gate (needs.test-shard.result == 'success') was
effectively a no-op. Copilot flagged this on 6 PRs (#4185, #4186,
#4187, #4222, #4223, #4225).

Changes:
- Remove continue-on-error: true so shard failures are properly reported
- Widen merge-coverage gate to !cancelled() so coverage is still merged
  from passing shards even when some fail
- Add if-no-files-found: ignore on artifact upload so failed shards
  that produce no coverage file don't fail the upload step
- Add final 'Fail if any shard failed' step so the overall workflow
  correctly surfaces test failures after merging coverage and updating
  the badge
- Fix stale comment: '4 shards' -> '12 shards'

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

3 participants