🐛 Fix GPU exhaustion: cluster-level threshold, not per-node #4185

Merged
clubanderson merged 2 commits into main from fix/gpu-exhaustion-cluster-level
Apr 1, 2026
Conversation

@clubanderson
Collaborator

Summary

The Predictive Health Monitor was flagging every node at 100% GPU allocation (e.g., 8/8) as an issue. A fully utilized GPU node is healthy — not a problem.

Changed to cluster-level detection:

  • Over-allocation (allocated > capacity): critical severity
  • Cluster >80% allocated: warning severity
  • Individual nodes at 100%: no longer flagged

Uses strict > (not >=) for the 80% threshold — exactly 80% is fine.
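The three rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the component's actual code: the function name and signature are hypothetical, while GPU_CLUSTER_EXHAUSTION_THRESHOLD mirrors the constant introduced in the diff.

```typescript
// Minimal sketch of the cluster-level rules, assuming a simple
// (allocated, total) pair per cluster. Function name is hypothetical.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

type GpuRisk = 'critical' | 'warning' | null

function detectGpuExhaustion(allocated: number, total: number): GpuRisk {
  if (total > 0 && allocated > total) {
    return 'critical' // over-allocation is always an error
  }
  if (total > 0 && allocated / total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
    return 'warning' // strict >, so exactly 80% is not flagged
  }
  return null // includes the exactly-80% case and empty clusters
}
```

Under these rules a cluster at 8/10 GPUs (exactly 80%) produces no risk, while 9 allocated against a capacity of 8 is critical.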

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
… per-node

A fully utilized node (8/8 GPUs) is healthy — not an issue. Changed
GPU exhaustion detection from per-node to cluster-level:

- Over-allocation (allocated > capacity): critical (always an error)
- Cluster at >80% GPU allocation: warning (capacity planning)
- Individual nodes at 100%: no longer flagged (normal utilization)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings April 1, 2026 17:53
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 1, 2026
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 1, 2026
@clubanderson clubanderson merged commit 3ba35f3 into main Apr 1, 2026
17 of 18 checks passed
@netlify

netlify bot commented Apr 1, 2026

Deploy Preview for kubestellarconsole ready!

🔨 Latest commit: ef53d6f
🔍 Latest deploy log: https://app.netlify.com/projects/kubestellarconsole/deploys/69cd5bb94511ef0007e35bf4
😎 Deploy Preview: https://deploy-preview-4185.console-deploy-preview.kubestellar.io

@kubestellar-prow kubestellar-prow bot deleted the fix/gpu-exhaustion-cluster-level branch April 1, 2026 17:54
@github-actions
Contributor

github-actions bot commented Apr 1, 2026

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

Contributor

Copilot AI left a comment


Pull request overview

Adjusts GPU exhaustion detection to avoid flagging healthy, fully utilized GPU nodes by switching to cluster-level allocation thresholds, and tweaks coverage workflow gating.

Changes:

  • Replace per-node “100% allocated” GPU exhaustion detection with per-cluster aggregation and thresholding (critical on over-allocation; warning when cluster allocation exceeds 80%).
  • Update risk messaging/IDs to be cluster-scoped for GPU exhaustion signals.
  • Make the hourly coverage merge job run even when upstream shard jobs fail.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx — Aggregates GPU capacity/allocations per cluster and emits cluster-level risks instead of per-node 100% utilization risks.
File: .github/workflows/coverage-hourly.yml — Changes merge job condition to always() so it executes regardless of upstream shard job outcomes (still repo-gated).
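The per-cluster aggregation the review describes can be sketched as follows. The NodeGpu shape and function name here are hypothetical stand-ins, not the component's actual types:

```typescript
// Hypothetical node shape for illustration; the real component's
// data model may differ.
interface NodeGpu {
  cluster: string
  gpuCapacity: number
  gpuAllocated: number
}

// Sum GPU capacity and allocation per cluster so thresholds can be
// applied to cluster totals rather than individual nodes.
function aggregateGpusByCluster(
  nodes: NodeGpu[],
): Map<string, { total: number; allocated: number }> {
  const clusters = new Map<string, { total: number; allocated: number }>()
  for (const node of nodes) {
    const agg = clusters.get(node.cluster) ?? { total: 0, allocated: 0 }
    agg.total += node.gpuCapacity
    agg.allocated += node.gpuAllocated
    clusters.set(node.cluster, agg)
  }
  return clusters
}
```

With this shape, a node at 8/8 GPUs in a cluster whose totals come to 10/16 (62.5%) stays below the 80% threshold, so no risk is emitted even though one node is fully utilized.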

Comment on lines +423 to +424
// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
// total GPUs are allocated. Individual nodes at 100% is normal utilization.

Copilot AI Apr 1, 2026


The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).

Suggested change
-// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
-// total GPUs are allocated. Individual nodes at 100% is normal utilization.
+// 3. Cluster-level GPU exhaustion — only flag when above 80% of a cluster's
+// total GPUs are allocated (>80%). Individual nodes at 100% is normal utilization.

source: 'heuristic',
})
} else if (gpus.total > 0 && gpus.allocated / gpus.total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
// Flag cluster-level near-exhaustion (80%+ allocated)

Copilot AI Apr 1, 2026


The comments say “80%+” / “80%+ allocated”, but the code uses a strict > threshold (> 0.8), meaning exactly 80% is not flagged. To avoid misleading future readers, update the comments to reflect the strict comparison (e.g., “above 80%” / “>80%”).

Suggested change
-// Flag cluster-level near-exhaustion (80%+ allocated)
+// Flag cluster-level near-exhaustion (>80% allocated)

merge-coverage:
  needs: test-shard
-  if: github.repository == 'kubestellar/console'
+  if: always() && github.repository == 'kubestellar/console'

Copilot AI Apr 1, 2026


Using always() here will run merge-coverage even when test-shard fails or is cancelled. If merge-coverage expects artifacts from all shards, this can cause noisy secondary failures and make the workflow harder to interpret. Consider instead gating on shard success (or handling missing artifacts explicitly inside the merge job), so the merge step doesn’t fail in a cascading way.

Suggested change
-if: always() && github.repository == 'kubestellar/console'
+if: github.repository == 'kubestellar/console' && needs.test-shard.result == 'success'

// 3. GPU nodes with high allocation - risk of GPU exhaustion
// 3. Cluster-level GPU exhaustion — only flag when 80%+ of a cluster's
// total GPUs are allocated. Individual nodes at 100% is normal utilization.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

Copilot AI Apr 1, 2026


This constant is defined inside the component function, so it will be re-created on each render. Consider hoisting it to module scope (or a shared constants file) to reduce per-render churn and keep policy thresholds centralized.

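Copilot's hoisting suggestion amounts to declaring the threshold at module scope rather than inside the component body. A minimal sketch, with a hypothetical function standing in for the real ConsoleOfflineDetectionCard render path:

```typescript
// Module scope: evaluated once when the module loads, shared by all renders,
// and easy to relocate later into a shared constants file.
const GPU_CLUSTER_EXHAUSTION_THRESHOLD = 0.8

// Hypothetical stand-in for the component's render-path logic. Before the
// fix, the constant above was declared inside a function like this one,
// so it was re-created on every call (i.e. every render).
function gpuRiskLabel(allocated: number, total: number): 'warning' | 'ok' {
  if (total > 0 && allocated / total > GPU_CLUSTER_EXHAUSTION_THRESHOLD) {
    return 'warning'
  }
  return 'ok'
}
```

For a primitive constant the per-render cost is tiny; the stronger argument in the review is centralizing the policy threshold in one place.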
@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 3 code suggestion(s) and 1 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 424): // 3. Cluster-level GPU exhaustion — only flag when above 80% of a cluster's ...
  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 456): // Flag cluster-level near-exhaustion (>80% allocated)
  • .github/workflows/coverage-hourly.yml (line 62): if: github.repository == 'kubestellar/console' && needs.test-shard.result == 'su...

Also address these general comments:

  • web/src/components/cards/console-missions/ConsoleOfflineDetectionCard.tsx (line 425): This constant is defined inside the component function, so it will be re-created on each render. Consider hoisting it to

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

clubanderson added a commit that referenced this pull request Apr 1, 2026
- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 1, 2026
…ow (#4186)

- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 1, 2026
)

* 🐛 Address Copilot comments: tour a11y, GPU threshold, coverage workflow

- Tour.tsx: Add aria-label on icon-only tour button (#4172)
- ConsoleOfflineDetectionCard: Hoist GPU_CLUSTER_EXHAUSTION_THRESHOLD
  to module scope, fix comments "80%+" → ">80%" (#4185)
- coverage-hourly.yml: Replace always() with success check to avoid
  noisy merge-coverage failures when shards fail (#4185)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🌱 Add debug logging for screenshot upload flow in feedback dialog

All logs use console.debug with [Screenshot] prefix for easy filtering
in browser DevTools. Logs at each stage:
- File selection (drag, drop, paste, file picker): file count, types, sizes
- FileReader: loaded data URI length, errors
- Paste: clipboard item types, getAsFile results
- Submit: screenshot count, sizes, timeout, success/failure details

This helps team members diagnose why screenshot uploads fail on their
machines while working on the reporter's machine.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Apr 2, 2026
Remove continue-on-error: true from test shard steps, which caused
the test-shard job to always report success even when tests failed.
The merge-coverage gate (needs.test-shard.result == 'success') was
effectively a no-op. Copilot flagged this on 6 PRs (#4185, #4186,
#4187, #4222, #4223, #4225).

Changes:
- Remove continue-on-error: true so shard failures are properly reported
- Widen merge-coverage gate to !cancelled() so coverage is still merged
  from passing shards even when some fail
- Add if-no-files-found: ignore on artifact upload so failed shards
  that produce no coverage file don't fail the upload step
- Add final 'Fail if any shard failed' step so the overall workflow
  correctly surfaces test failures after merging coverage and updating
  the badge
- Fix stale comment: '4 shards' -> '12 shards'

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

3 participants