🐛 Fix GPU exhaustion: cluster-level threshold, not per-node#4185
🐛 Fix GPU exhaustion: cluster-level threshold, not per-node#4185clubanderson merged 2 commits intomainfrom
Conversation
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
… per-node A fully utilized node (8/8 GPUs) is healthy — not an issue. Changed GPU exhaustion detection from per-node to cluster-level: - Over-allocation (allocated > capacity): critical (always an error) - Cluster at >80% GPU allocation: warning (capacity planning) - Individual nodes at 100%: no longer flagged (normal utilization) Signed-off-by: Andrew Anderson <andy@clubanderson.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for kubestellarconsole ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
👋 Hey @clubanderson — thanks for opening this PR!
This is an automated message. |
|
Thank you for your contribution! Your PR has been merged. Check out what's new:
Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey |
There was a problem hiding this comment.
Pull request overview
Adjusts GPU exhaustion detection to avoid flagging healthy, fully utilized GPU nodes by switching to cluster-level allocation thresholds, and tweaks coverage workflow gating.
Changes:
- Replace per-node “100% allocated” GPU exhaustion detection with per-cluster aggregation and thresholding (critical on over-allocation; warning when cluster allocation exceeds 80%).
- Update risk messaging/IDs to be cluster-scoped for GPU exhaustion signals.
- Make the hourly coverage merge job run even when upstream shard jobs fail.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx | Aggregates GPU capacity/allocations per cluster and emits cluster-level risks instead of per-node 100% utilization risks. |
| .github/workflows/coverage-hourly.yml | Changes merge job condition to always() so it executes regardless of upstream shard job outcomes (still repo-gated). |
| // 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's | ||
| // total GPUs are allocated. Individual nodes at 100% is normal utilization. |
There was a problem hiding this comment.
The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).
| // 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's | |
| // total GPUs are allocated. Individual nodes at 100% is normal utilization. | |
| // 3. Cluster-level GPU exhaustion — only flag when above 80% of a cluster's | |
| // total GPUs are allocated (>80%). Individual nodes at 100% is normal utilization. |
| source: 'heuristic', | ||
| }) | ||
| } else if (gpus.total > 0 && gpus.allocated / gpus.total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) { | ||
| // Flag cluster-level near-exhaustion (80%+ allocated) |
There was a problem hiding this comment.
The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).
| // Flag cluster-level near-exhaustion (80%+ allocated) | |
| // Flag cluster-level near-exhaustion (>80% allocated) |
| merge-coverage: | ||
| needs: test-shard | ||
| if: github.repository == 'kubestellar/console' | ||
| if: always() && github.repository == 'kubestellar/console' |
There was a problem hiding this comment.
Using always() here will run merge-coverage even when test-shard fails or is cancelled. If merge-coverage expects artifacts from all shards, this can cause noisy secondary failures and make the workflow harder to interpret. Consider instead gating on shard success (or handling missing artifacts explicitly inside the merge job), so the merge step doesn’t fail in a cascading way.
| if: always() && github.repository == 'kubestellar/console' | |
| if: github.repository == 'kubestellar/console' && needs.test-shard.result == 'success' |
| // 3. GPU nodes with high allocation - risk of GPU exhaustion | ||
| // 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's | ||
| // total GPUs are allocated. Individual nodes at 100% is normal utilization. | ||
| const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8 |
There was a problem hiding this comment.
This constant is defined inside the component function, so it will be re-created on each render. Consider hoisting it to module scope (or a shared constants file) to reduce per-render churn and keep policy thresholds centralized.
🔄 Auto-Applying Copilot Code ReviewCopilot code review found 3 code suggestion(s) and 1 general comment(s). @copilot Please apply all of the following code review suggestions:
Also address these general comments:
Push all fixes in a single commit. Run Auto-generated by copilot-review-apply workflow. |
- Tour.tsx: Add aria-label on icon-only tour button (#4172) - ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD to module scope, fix comments "80%+" → ">80%" (#4185) - coverage-hourly.yml: Replace always() with success check to avoid noisy merge-coverage failures when shards fail (#4185) Signed-off-by: Andrew Anderson <andy@clubanderson.com>
…ow (#4186) - Tour.tsx: Add aria-label on icon-only tour button (#4172) - ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD to module scope, fix comments "80%+" → ">80%" (#4185) - coverage-hourly.yml: Replace always() with success check to avoid noisy merge-coverage failures when shards fail (#4185) Signed-off-by: Andrew Anderson <andy@clubanderson.com>
) * 🐛 Address Copilot comments: tour a11y, GPU threshold, coverage workflow - Tour.tsx: Add aria-label on icon-only tour button (#4172) - ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD to module scope, fix comments "80%+" → ">80%" (#4185) - coverage-hourly.yml: Replace always() with success check to avoid noisy merge-coverage failures when shards fail (#4185) Signed-off-by: Andrew Anderson <andy@clubanderson.com> * 🌱 Add debug logging for screenshot upload flow in feedback dialog All logs use console.debug with [Screenshot] prefix for easy filtering in browser DevTools. Logs at each stage: - File selection (drag, drop, paste, file picker): file count, types, sizes - FileReader: loaded data URI length, errors - Paste: clipboard item types, getAsFile results - Submit: screenshot count, sizes, timeout, success/failure details This helps team members diagnose why screenshot uploads fail on their machines while working on the reporter's machine. Signed-off-by: Andrew Anderson <andy@clubanderson.com> --------- Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Remove continue-on-error: true from test shard steps, which caused the test-shard job to always report success even when tests failed. The merge-coverage gate (needs.test-shard.result == 'success') was effectively a no-op. Copilot flagged this on 6 PRs (#4185, #4186, #4187, #4222, #4223, #4225). Changes: - Remove continue-on-error: true so shard failures are properly reported - Widen merge-coverage gate to !cancelled() so coverage is still merged from passing shards even when some fail - Add if-no-files-found: ignore on artifact upload so failed shards that produce no coverage file don't fail the upload step - Add final 'Fail if any shard failed' step so the overall workflow correctly surfaces test failures after merging coverage and updating the badge - Fix stale comment: '4 shards' -> '12 shards' Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Remove continue-on-error: true from test shard steps, which caused the test-shard job to always report success even when tests failed. The merge-coverage gate (needs.test-shard.result == 'success') was effectively a no-op. Copilot flagged this on 6 PRs (#4185, #4186, #4187, #4222, #4223, #4225). Changes: - Remove continue-on-error: true so shard failures are properly reported - Widen merge-coverage gate to !cancelled() so coverage is still merged from passing shards even when some fail - Add if-no-files-found: ignore on artifact upload so failed shards that produce no coverage file don't fail the upload step - Add final 'Fail if any shard failed' step so the overall workflow correctly surfaces test failures after merging coverage and updating the badge - Fix stale comment: '4 shards' -> '12 shards' Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Summary
The Predictive Health Monitor was flagging every node at 100% GPU allocation (e.g., 8/8) as an issue. A fully utilized GPU node is healthy — not a problem.
Changed to cluster-level detection:
Uses strict
>(not>=) for the 80% threshold — exactly 80% is fine.