✨ Add per-GPU-type history with node filtering and churn metrics#4008

Merged
clubanderson merged 1 commit into main from feature/gpu-history-granular on Mar 31, 2026

Conversation

@clubanderson
Collaborator

Summary

Fixes #3946 — Enhances the GPU Inventory History card with granular per-GPU-type tracking, per-node filtering, and churn/duration metrics as requested by @MikeSpreitzer.

Changes

Phase 1: Enrich Snapshot Data

  • Added GPUType field to GPUNodeMetricSnapshot struct (Go backend) and MetricsSnapshot.gpuNodes type (frontend)
  • Backend captureSnapshot() now populates gpuType from GPUNode.GPUType
  • Frontend useMetricsHistory captures gpuType in both auto and manual snapshots
  • Backward compatible: legacy snapshots without gpuType display as "Unknown"
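The backward-compatibility behavior can be sketched as follows. This is an illustrative shape, not the PR's exact types: only `gpuType` and the "Unknown" fallback are taken from the description above; the other field names are assumptions.

```typescript
// Hypothetical snapshot-node shape; gpuType is optional because legacy
// snapshots captured before this PR do not carry it.
interface GPUNodeSnapshot {
  name: string
  gpuAllocated: number
  gpuTotal: number
  gpuType?: string
}

const UNKNOWN_GPU_TYPE = 'Unknown'

// Normalize for display: legacy snapshots without gpuType show as "Unknown".
function displayGPUType(node: GPUNodeSnapshot): string {
  return node.gpuType ?? UNKNOWN_GPU_TYPE
}
```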

Phase 2: Enhanced GPUInventoryHistory Card

  • Per-GPU-type stacked area chart: Each GPU type gets its own colored series in the stacked chart (dynamic colors, up to 8 types)
  • Chart mode toggle: Switch between "Aggregate" (original allocated/free view) and "By Type" (per-GPU-type breakdown)
  • GPU type filter dropdown: Filter history to a specific GPU type (e.g., only NVIDIA A100)
  • Node selector dropdown: Filter to a specific node for per-node history view
  • Table view toggle: Switch between chart and table view showing per-node, per-type breakdown with utilization percentages and pagination
  • Churn metrics in footer: Arrival rate (GPUs newly allocated per interval), departure rate (GPUs freed per interval), and approximate average allocation duration
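One way to compute the footer's arrival/departure rates from consecutive snapshots is to diff allocations per node so that simultaneous frees and allocations do not cancel out. This is a hedged sketch under assumed node shapes, not the card's exact implementation:

```typescript
// Per-node churn between two snapshots: positive deltas count as arrivals
// (GPUs newly allocated), negative deltas as departures (GPUs freed).
type NodeAlloc = { name: string; gpuAllocated: number }

function churn(prev: NodeAlloc[], curr: NodeAlloc[]): { arrivals: number; departures: number } {
  const prevMap = new Map<string, number>(prev.map(n => [n.name, n.gpuAllocated]))
  const currMap = new Map<string, number>(curr.map(n => [n.name, n.gpuAllocated]))
  let arrivals = 0
  let departures = 0
  const names = new Set([...prevMap.keys(), ...currMap.keys()])
  for (const name of names) {
    const delta = (currMap.get(name) ?? 0) - (prevMap.get(name) ?? 0)
    if (delta > 0) arrivals += delta
    else departures += -delta
  }
  return { arrivals, departures }
}
```

Diffing per node rather than on the aggregate total matters: if 2 GPUs are freed on one node while 2 are allocated on another in the same interval, the aggregate delta is zero but the churn is real.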

Phase 3: Extended Retention

  • Increased MAX_SNAPSHOTS from 144 to 1008 (7 days at 10-min intervals)
  • Updated snapshotRetentionHrs from 24 to 168 (7 days) in both backend and frontend
  • Updated CACHE_TTL_MS to 7 days in frontend
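The retention arithmetic behind these constants, plus a hypothetical pruning helper (the `prune` function and `timestamp` field are illustrative assumptions, not the hook's exact code):

```typescript
// 6 snapshots/hour * 24 hours * 7 days = 1008, matching MAX_SNAPSHOTS above.
const SNAPSHOT_INTERVAL_MIN = 10
const RETENTION_DAYS = 7
const SNAPSHOTS_PER_HOUR = 60 / SNAPSHOT_INTERVAL_MIN
const MAX_SNAPSHOTS = SNAPSHOTS_PER_HOUR * 24 * RETENTION_DAYS
const CACHE_TTL_MS = RETENTION_DAYS * 24 * 60 * 60 * 1000

// Drop snapshots older than the TTL, then cap the count at MAX_SNAPSHOTS,
// keeping the most recent entries.
function prune<T extends { timestamp: number }>(snapshots: T[], now: number): T[] {
  const cutoff = now - CACHE_TTL_MS
  const fresh = (snapshots || []).filter(s => s.timestamp >= cutoff)
  return fresh.slice(-MAX_SNAPSHOTS)
}
```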

Code Quality

  • All numeric literals use named constants with comments
  • All array operations guarded against undefined with (arr || [])
  • Zero new lint errors (verified with eslint on changed files)
  • Both Go backend and TypeScript frontend compile clean

Test plan

  • Verify the card renders in demo mode with per-GPU-type stacked chart
  • Verify "By Type" / "Aggregate" toggle switches chart series correctly
  • Verify GPU type filter dropdown filters chart data
  • Verify node filter dropdown filters chart data
  • Verify table view shows per-node breakdown with utilization colors
  • Verify table pagination works correctly
  • Verify churn metrics appear in footer when sufficient history exists
  • Verify backward compatibility: old snapshots without gpuType show as "Unknown"
  • Verify cluster filter still works correctly
  • Verify Go backend captures gpuType in new snapshots

Fixes #3946

- Enrich GPUNodeMetricSnapshot with gpuType field in both Go backend
  and frontend types, populated from GPUNode.GPUType during snapshot
  capture. Legacy snapshots without gpuType display as "Unknown".

- Enhance GPUInventoryHistory card with:
  - Per-GPU-type stacked area chart (dynamic colors per type)
  - GPU type filter dropdown
  - Node selector dropdown for per-node filtering
  - Chart mode toggle (Aggregate vs By Type)
  - Table view with per-node, per-type breakdown and pagination
  - Churn metrics in footer (arrival rate, departure rate, avg duration)

- Extend retention from 24h to 7 days (MAX_SNAPSHOTS 144 -> 1008,
  snapshotRetentionHrs 24 -> 168) in both backend and frontend.

- Guard all array operations against undefined with (arr || []).

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings March 31, 2026 16:10
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Mar 31, 2026
@netlify

netlify bot commented Mar 31, 2026

Deploy Preview for kubestellarconsole ready!

Name Link
🔨 Latest commit aa208f7
🔍 Latest deploy log https://app.netlify.com/projects/kubestellarconsole/deploys/69cbf1fe8a0b5f00087afb4c
😎 Deploy Preview https://deploy-preview-4008.console-deploy-preview.kubestellar.io

@kubestellar-prow kubestellar-prow bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 31, 2026
@clubanderson clubanderson merged commit 43d0ef4 into main Mar 31, 2026
21 of 22 checks passed
@kubestellar-prow kubestellar-prow bot deleted the feature/gpu-history-granular branch March 31, 2026 16:14
@github-actions
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool.

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@github-actions
Contributor

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

Contributor

Copilot AI left a comment


Pull request overview

Enhances the GPU Inventory History experience by adding GPU-type metadata to historical snapshots, extending snapshot retention to 7 days, and expanding the GPUInventoryHistory card with per-GPU-type visualization plus node/type filtering and churn-style metrics.

Changes:

  • Added gpuType to GPU node snapshot shapes in both backend (agent snapshot) and frontend types/history capture.
  • Extended snapshot retention from 24h to 7d in both the agent and the frontend local history cache.
  • Expanded GPUInventoryHistory UI with GPU-type/node filters, chart/table view toggles, and churn/duration metrics (plus new i18n strings).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
web/src/types/predictions.ts Extends MetricsSnapshot.gpuNodes with optional gpuType.
web/src/locales/en/cards.json Adds new i18n strings for filters, view toggles, table labels, and churn tooltips.
web/src/hooks/useMetricsHistory.ts Increases local history retention/MAX_SNAPSHOTS and captures gpuType in snapshots.
web/src/components/cards/GPUInventoryHistory.tsx Adds per-GPU-type charting, filters, table view, pagination, and churn metrics UI.
pkg/agent/metrics_history.go Extends agent snapshot retention to 7 days and includes gpuType in captured GPU node snapshots.

Comment on lines +358 to +359
// If more than MAX_CHART_SERIES, we just render them all — Recharts handles it
return sorted.slice(0, MAX_CHART_SERIES)

Copilot AI Mar 31, 2026


chartGPUTypes is sliced to MAX_CHART_SERIES, but allocated/free are computed from all types. When more than 8 GPU types exist, the chart will omit some allocated GPUs while still computing free = total - allocated, so the stacked areas no longer sum to total and the visualization becomes inaccurate. Either render all types, or aggregate the omitted types into an "Other" series (and include it in the stack) so the stack remains consistent with allocated/total.

Suggested change
// If more than MAX_CHART_SERIES, we just render them all — Recharts handles it
return sorted.slice(0, MAX_CHART_SERIES)
// Render all GPU types; Recharts can handle multiple series
return sorted
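The "Other" bucket the comment proposes could look roughly like this. It is a sketch under assumed shapes (`Series`, `allocated`); only `MAX_CHART_SERIES` and the sorted input come from the surrounding code:

```typescript
const MAX_CHART_SERIES = 8

// Keep the top series and fold the remainder into a single synthetic
// "Other" entry so the stack still sums to the total allocated.
type Series = { gpuType: string; allocated: number }

function capSeries(sorted: Series[]): Series[] {
  if (sorted.length <= MAX_CHART_SERIES) return sorted
  const kept = sorted.slice(0, MAX_CHART_SERIES - 1)
  const other = sorted
    .slice(MAX_CHART_SERIES - 1)
    .reduce((sum, s) => sum + s.allocated, 0)
  return [...kept, { gpuType: 'Other', allocated: other }]
}
```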

Comment on lines +404 to +409
const prevAllocated = prev.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const currAllocated = curr.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const delta = currAllocated - prevAllocated

if (delta > 0) totalArrivals += delta
if (delta < 0) totalDepartures += Math.abs(delta)

Copilot AI Mar 31, 2026


Arrival/departure rates are currently derived from the net change in total allocated GPUs between snapshots (delta = currAllocated - prevAllocated). If some GPUs are freed while others are allocated in the same interval, netting them hides churn and the displayed rates won’t match the tooltip text (“newly allocated” / “freed”). Consider computing churn by diffing per-node (and/or per-node+type) allocations and summing positive diffs as arrivals and negative diffs as departures.

Suggested change
const prevAllocated = prev.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const currAllocated = curr.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const delta = currAllocated - prevAllocated
if (delta > 0) totalArrivals += delta
if (delta < 0) totalDepartures += Math.abs(delta)
// Build per-node allocation maps so we can capture churn even when
// allocations and frees cancel out at the aggregate level.
const prevMap: Record<string, number> = {}
prev.forEach((g, index) => {
const key = (g as { name?: string }).name ?? `idx-${index}`
prevMap[key] = (prevMap[key] || 0) + (g.gpuAllocated || 0)
})
const currMap: Record<string, number> = {}
curr.forEach((g, index) => {
const key = (g as { name?: string }).name ?? `idx-${index}`
currMap[key] = (currMap[key] || 0) + (g.gpuAllocated || 0)
})
const allKeys = new Set<string>([
...Object.keys(prevMap),
...Object.keys(currMap),
])
allKeys.forEach((key) => {
const prevAllocated = prevMap[key] ?? 0
const currAllocated = currMap[key] ?? 0
const delta = currAllocated - prevAllocated
if (delta > 0) totalArrivals += delta
if (delta < 0) totalDepartures += Math.abs(delta)
})

Comment on lines +418 to +423
// Approximate average duration: if arrival rate > 0, avgDuration ~ totalAllocated / arrivalRate
const latestAllocated = (chartData || []).length > 0
? chartData[chartData.length - 1].allocated
: 0
const avgDurationIntervals = arrivalRate > 0 ? latestAllocated / arrivalRate : 0


Copilot AI Mar 31, 2026


avgDurationIntervals uses latestAllocated / arrivalRate, which can fluctuate heavily and doesn’t match the usual Little’s Law approximation (L/λ) unless L is the average allocated over the measurement window. Consider using the mean allocated across the same diffs window (or across chartData) instead of only the latest point.
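The fix the comment suggests, sketched under Little's Law (W = L / λ, with L the mean allocated over the window rather than the latest point). Names are illustrative, not the card's exact code:

```typescript
// Average allocation duration in snapshot intervals, using the mean
// allocated count across the window so single-point noise is smoothed out.
function avgDurationIntervals(allocatedSeries: number[], arrivalRate: number): number {
  if (arrivalRate <= 0 || allocatedSeries.length === 0) return 0
  const meanAllocated =
    allocatedSeries.reduce((s, v) => s + v, 0) / allocatedSeries.length
  return meanAllocated / arrivalRate
}
```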

Comment on lines +450 to +456
const paginatedRows = useMemo(() => {
const start = tablePage * TABLE_PAGE_SIZE
return (tableRows || []).slice(start, start + TABLE_PAGE_SIZE)
}, [tableRows, tablePage])

const totalTablePages = Math.max(1, Math.ceil((tableRows || []).length / TABLE_PAGE_SIZE))


Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tablePage isn’t clamped/reset when filters change and tableRows.length shrinks. This can leave the UI on an out-of-range page and show “No matching nodes” even though rows exist on earlier pages. Consider resetting tablePage to 0 when relevant filters change, or clamping it when totalTablePages decreases.
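The clamping option could be sketched as below; `TABLE_PAGE_SIZE` mirrors the snippet above, but the helper itself is an illustrative assumption:

```typescript
const TABLE_PAGE_SIZE = 10

// Clamp the current page index so it stays in range when the filtered
// row count shrinks; an empty table still has one (empty) page.
function clampPage(tablePage: number, rowCount: number): number {
  const totalPages = Math.max(1, Math.ceil(rowCount / TABLE_PAGE_SIZE))
  return Math.min(tablePage, totalPages - 1)
}
```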

Comment on lines +221 to +228
const [viewMode, setViewMode] = useState<ViewMode>('chart')
const [chartMode, setChartMode] = useState<ChartMode>('by-type')
const [selectedGPUType, setSelectedGPUType] = useState<string>('all')
const [selectedNode, setSelectedNode] = useState<string>('all')
const [showTypeDropdown, setShowTypeDropdown] = useState(false)
const [showNodeDropdown, setShowNodeDropdown] = useState(false)
const typeDropdownRef = useRef<HTMLDivElement>(null)
const nodeDropdownRef = useRef<HTMLDivElement>(null)

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU type / node dropdowns open via local state, but there’s no click-outside or Escape handling to close them (unlike other dropdowns in this codebase that attach a document.addEventListener('mousedown', ...)). Since refs are already present, consider adding an effect to close the dropdown(s) when clicking outside the ref elements and when pressing Escape to avoid menus getting “stuck” open.

Comment on lines +8 to +10
const MAX_SNAPSHOTS = 1008 // 7 days at 10-min intervals (6 per hour * 24 hours * 7 days)
/** Cache TTL: 7 days — remove snapshots older than this */
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000

Copilot AI Mar 31, 2026


CACHE_TTL_MS was updated to 7 days, but there’s still an inline comment later in this file (“Remove snapshots older than 24 hours”) next to the cutoff filter. Please update that comment to match the new TTL so future maintenance doesn’t get misleading guidance.

Comment on lines 16 to 20
const (
maxSnapshots = 144 // 24 hours at 10-min intervals
maxSnapshots = 1008 // 7 days at 10-min intervals (6 per hour * 24 hours * 7 days)
metricsHistoryFile = "metrics_history.json"
snapshotRetentionHrs = 24
snapshotRetentionHrs = 168 // 7 days of retention
metricsHistoryTimeout = 30 * time.Second

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retention was changed to 7 days, but there’s still a comment later in captureSnapshot() that says it trims to “keep last 24 hours”. Please update that comment to match snapshotRetentionHrs so the code and documentation stay consistent.

@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 2 code suggestion(s) and 5 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • web/src/components/cards/GPUInventoryHistory.tsx (line 359): // Render all GPU types; Recharts can handle multiple series return sorted
  • web/src/components/cards/GPUInventoryHistory.tsx (line 409): // Build per-node allocation maps so we can capture churn even when // all...

Also address these general comments:

  • web/src/components/cards/GPUInventoryHistory.tsx (line 423): avgDurationIntervals uses latestAllocated / arrivalRate, which can fluctuate heavily and doesn’t match the usual Lit
  • web/src/components/cards/GPUInventoryHistory.tsx (line 456): tablePage isn’t clamped/reset when filters change and tableRows.length shrinks. This can leave the UI on an out-of-r
  • web/src/components/cards/GPUInventoryHistory.tsx (line 228): The GPU type / node dropdowns open via local state, but there’s no click-outside or Escape handling to close them (unlik
  • web/src/hooks/useMetricsHistory.ts (line 10): CACHE_TTL_MS was updated to 7 days, but there’s still an inline comment later in this file (“Remove snapshots older th
  • pkg/agent/metrics_history.go (line 20): Retention was changed to 7 days, but there’s still a comment later in captureSnapshot() that says it trims to “keep la

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

@github-actions github-actions bot mentioned this pull request Mar 31, 2026
clubanderson added a commit that referenced this pull request Mar 31, 2026
- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing,
  use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Mar 31, 2026
* 🐛 fix: address Copilot review comments from PRs #4008 and #4009

- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing,
  use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🐛 fix: use explicit isLoading pattern for card-standard compliance

The card-standard CI check requires `isLoading: var && !hasData` instead
of shorthand `isLoading` in useCardLoadingState calls.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* Initial plan

Agent-Logs-Url: https://github.com/kubestellar/console/sessions/c7cb576d-dfee-4502-bbb7-c682f8489def

Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature: Historical view of Inventory of GPUs

3 participants