✨ Add per-GPU-type history with node filtering and churn metrics #4008

clubanderson merged 1 commit into `main`
Conversation
Fixes #3946

- Enrich GPUNodeMetricSnapshot with gpuType field in both Go backend and frontend types, populated from GPUNode.GPUType during snapshot capture. Legacy snapshots without gpuType display as "Unknown".
- Enhance GPUInventoryHistory card with:
  - Per-GPU-type stacked area chart (dynamic colors per type)
  - GPU type filter dropdown
  - Node selector dropdown for per-node filtering
  - Chart mode toggle (Aggregate vs By Type)
  - Table view with per-node, per-type breakdown and pagination
  - Churn metrics in footer (arrival rate, departure rate, avg duration)
- Extend retention from 24h to 7 days (MAX_SNAPSHOTS 144 -> 1008, snapshotRetentionHrs 24 -> 168) in both backend and frontend.
- Guard all array operations against undefined with (arr || []).

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Pull request overview
Enhances the GPU Inventory History experience by adding GPU-type metadata to historical snapshots, extending snapshot retention to 7 days, and expanding the GPUInventoryHistory card with per-GPU-type visualization plus node/type filtering and churn-style metrics.
Changes:
- Added `gpuType` to GPU node snapshot shapes in both backend (agent snapshot) and frontend types/history capture.
- Extended snapshot retention from 24h to 7d in both the agent and the frontend local history cache.
- Expanded `GPUInventoryHistory` UI with GPU-type/node filters, chart/table view toggles, and churn/duration metrics (plus new i18n strings).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| web/src/types/predictions.ts | Extends MetricsSnapshot.gpuNodes with optional gpuType. |
| web/src/locales/en/cards.json | Adds new i18n strings for filters, view toggles, table labels, and churn tooltips. |
| web/src/hooks/useMetricsHistory.ts | Increases local history retention/MAX_SNAPSHOTS and captures gpuType in snapshots. |
| web/src/components/cards/GPUInventoryHistory.tsx | Adds per-GPU-type charting, filters, table view, pagination, and churn metrics UI. |
| pkg/agent/metrics_history.go | Extends agent snapshot retention to 7 days and includes gpuType in captured GPU node snapshots. |
```typescript
// If more than MAX_CHART_SERIES, we just render them all — Recharts handles it
return sorted.slice(0, MAX_CHART_SERIES)
```
chartGPUTypes is sliced to MAX_CHART_SERIES, but allocated/free are computed from all types. When more than 8 GPU types exist, the chart will omit some allocated GPUs while still computing free = total - allocated, so the stacked areas no longer sum to total and the visualization becomes inaccurate. Either render all types, or aggregate the omitted types into an "Other" series (and include it in the stack) so the stack remains consistent with allocated/total.
```diff
-// If more than MAX_CHART_SERIES, we just render them all — Recharts handles it
-return sorted.slice(0, MAX_CHART_SERIES)
+// Render all GPU types; Recharts can handle multiple series
+return sorted
```
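The review's alternative fix — folding the overflow types into an "Other" bucket so the stacked areas still sum to the allocated total — could be sketched like this; the `TypeSeries` shape and `capSeries` name are illustrative, not from the PR:

```typescript
const MAX_CHART_SERIES = 8

interface TypeSeries {
  gpuType: string
  allocated: number
}

// `sorted` is assumed to be GPU types ordered by descending allocation.
// Keep the top N-1 series and fold the remainder into a synthetic "Other"
// series so the stack stays consistent with allocated/total.
function capSeries(sorted: TypeSeries[]): TypeSeries[] {
  if (sorted.length <= MAX_CHART_SERIES) return sorted
  const kept = sorted.slice(0, MAX_CHART_SERIES - 1)
  const other = sorted
    .slice(MAX_CHART_SERIES - 1)
    .reduce((sum, s) => sum + s.allocated, 0)
  return [...kept, { gpuType: 'Other', allocated: other }]
}
```

Because "Other" carries the folded-in allocations, the chart keeps at most eight series while `free = total - allocated` still matches the rendered stack.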
```typescript
const prevAllocated = prev.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const currAllocated = curr.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
const delta = currAllocated - prevAllocated

if (delta > 0) totalArrivals += delta
if (delta < 0) totalDepartures += Math.abs(delta)
```
Arrival/departure rates are currently derived from the net change in total allocated GPUs between snapshots (delta = currAllocated - prevAllocated). If some GPUs are freed while others are allocated in the same interval, netting them hides churn and the displayed rates won’t match the tooltip text (“newly allocated” / “freed”). Consider computing churn by diffing per-node (and/or per-node+type) allocations and summing positive diffs as arrivals and negative diffs as departures.
```diff
-const prevAllocated = prev.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
-const currAllocated = curr.reduce((s, g) => s + (g.gpuAllocated || 0), 0)
-const delta = currAllocated - prevAllocated
-if (delta > 0) totalArrivals += delta
-if (delta < 0) totalDepartures += Math.abs(delta)
+// Build per-node allocation maps so we can capture churn even when
+// allocations and frees cancel out at the aggregate level.
+const prevMap: Record<string, number> = {}
+prev.forEach((g, index) => {
+  const key = (g as { name?: string }).name ?? `idx-${index}`
+  prevMap[key] = (prevMap[key] || 0) + (g.gpuAllocated || 0)
+})
+const currMap: Record<string, number> = {}
+curr.forEach((g, index) => {
+  const key = (g as { name?: string }).name ?? `idx-${index}`
+  currMap[key] = (currMap[key] || 0) + (g.gpuAllocated || 0)
+})
+const allKeys = new Set<string>([
+  ...Object.keys(prevMap),
+  ...Object.keys(currMap),
+])
+allKeys.forEach((key) => {
+  const prevAllocated = prevMap[key] ?? 0
+  const currAllocated = currMap[key] ?? 0
+  const delta = currAllocated - prevAllocated
+  if (delta > 0) totalArrivals += delta
+  if (delta < 0) totalDepartures += Math.abs(delta)
+})
```
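To see why per-node diffing matters, here is a minimal standalone version of the per-key diff (the `churn` helper and `Alloc` shape are assumptions for illustration). One GPU freed on one node and one allocated on another yield an arrival and a departure of 1 each, while the aggregate net delta reports zero:

```typescript
type Alloc = Record<string, number> // node name -> allocated GPUs

function churn(prev: Alloc, curr: Alloc): { arrivals: number; departures: number } {
  let arrivals = 0
  let departures = 0
  // Diff per node so simultaneous allocations and frees don't cancel out.
  const keys = new Set([...Object.keys(prev), ...Object.keys(curr)])
  keys.forEach((k) => {
    const delta = (curr[k] ?? 0) - (prev[k] ?? 0)
    if (delta > 0) arrivals += delta
    if (delta < 0) departures += -delta
  })
  return { arrivals, departures }
}
```

For example, `churn({ 'node-a': 2, 'node-b': 0 }, { 'node-a': 1, 'node-b': 1 })` reports one arrival and one departure, whereas the net totals (2 before, 2 after) would show no churn at all.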
```typescript
// Approximate average duration: if arrival rate > 0, avgDuration ~ totalAllocated / arrivalRate
const latestAllocated = (chartData || []).length > 0
  ? chartData[chartData.length - 1].allocated
  : 0
const avgDurationIntervals = arrivalRate > 0 ? latestAllocated / arrivalRate : 0
```
avgDurationIntervals uses latestAllocated / arrivalRate, which can fluctuate heavily and doesn’t match the usual Little’s Law approximation (L/λ) unless L is the average allocated over the measurement window. Consider using the mean allocated across the same diffs window (or across chartData) instead of only the latest point.
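The suggested fix follows Little's Law, L = λW, so W ≈ L / λ with L as the mean allocated over the window rather than the latest point. A sketch under the assumption that `chartData` carries an `allocated` field and `arrivalRate` is in GPUs per interval:

```typescript
// Little's Law approximation: average duration W = L / λ, where L is the
// mean number allocated over the window and λ is the arrival rate.
function avgDurationIntervals(
  chartData: { allocated: number }[],
  arrivalRate: number
): number {
  if (arrivalRate <= 0 || chartData.length === 0) return 0
  const meanAllocated =
    chartData.reduce((s, d) => s + d.allocated, 0) / chartData.length
  return meanAllocated / arrivalRate
}
```

Averaging over the window smooths out the fluctuation that a single latest-point reading would introduce.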
```typescript
const paginatedRows = useMemo(() => {
  const start = tablePage * TABLE_PAGE_SIZE
  return (tableRows || []).slice(start, start + TABLE_PAGE_SIZE)
}, [tableRows, tablePage])

const totalTablePages = Math.max(1, Math.ceil((tableRows || []).length / TABLE_PAGE_SIZE))
```
tablePage isn’t clamped/reset when filters change and tableRows.length shrinks. This can leave the UI on an out-of-range page and show “No matching nodes” even though rows exist on earlier pages. Consider resetting tablePage to 0 when relevant filters change, or clamping it when totalTablePages decreases.
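One way to implement the clamping half of this suggestion is a pure helper that the component could feed into a `setTablePage` effect whenever `tableRows` changes (the `clampPage` name is hypothetical):

```typescript
// Keep the current page index valid when the row count shrinks under it.
function clampPage(page: number, rowCount: number, pageSize: number): number {
  const totalPages = Math.max(1, Math.ceil(rowCount / pageSize))
  return Math.min(page, totalPages - 1)
}
```

With 12 rows at a page size of 10, a stale page index of 5 would be clamped back to 1 (the last valid page), so the table never renders an empty out-of-range page.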
```typescript
const [viewMode, setViewMode] = useState<ViewMode>('chart')
const [chartMode, setChartMode] = useState<ChartMode>('by-type')
const [selectedGPUType, setSelectedGPUType] = useState<string>('all')
const [selectedNode, setSelectedNode] = useState<string>('all')
const [showTypeDropdown, setShowTypeDropdown] = useState(false)
const [showNodeDropdown, setShowNodeDropdown] = useState(false)
const typeDropdownRef = useRef<HTMLDivElement>(null)
const nodeDropdownRef = useRef<HTMLDivElement>(null)
```
The GPU type / node dropdowns open via local state, but there’s no click-outside or Escape handling to close them (unlike other dropdowns in this codebase that attach a document.addEventListener('mousedown', ...)). Since refs are already present, consider adding an effect to close the dropdown(s) when clicking outside the ref elements and when pressing Escape to avoid menus getting “stuck” open.
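In the component, the wiring would be a `useEffect` attaching document-level `mousedown` and `keydown` listeners (with cleanup). The close decision itself can be isolated as a pure function, sketched here with assumed names and an `insideRef` flag standing in for `ref.current.contains(event.target)`:

```typescript
// Decide whether an open dropdown should close: Escape always closes it,
// and a mousedown closes it unless the click landed inside the ref element.
function shouldClose(
  isOpen: boolean,
  event: { type: 'mousedown' | 'keydown'; key?: string; insideRef?: boolean }
): boolean {
  if (!isOpen) return false
  if (event.type === 'keydown') return event.key === 'Escape'
  return !event.insideRef
}
```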
```typescript
const MAX_SNAPSHOTS = 1008 // 7 days at 10-min intervals (6 per hour * 24 hours * 7 days)
/** Cache TTL: 7 days — remove snapshots older than this */
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000
```
CACHE_TTL_MS was updated to 7 days, but there’s still an inline comment later in this file (“Remove snapshots older than 24 hours”) next to the cutoff filter. Please update that comment to match the new TTL so future maintenance doesn’t get misleading guidance.
```diff
 const (
-	maxSnapshots = 144 // 24 hours at 10-min intervals
+	maxSnapshots = 1008 // 7 days at 10-min intervals (6 per hour * 24 hours * 7 days)
 	metricsHistoryFile = "metrics_history.json"
-	snapshotRetentionHrs = 24
+	snapshotRetentionHrs = 168 // 7 days of retention
 	metricsHistoryTimeout = 30 * time.Second
```
Retention was changed to 7 days, but there’s still a comment later in captureSnapshot() that says it trims to “keep last 24 hours”. Please update that comment to match snapshotRetentionHrs so the code and documentation stay consistent.
🔄 Auto-Applying Copilot Code Review

Copilot code review found 2 code suggestions and 5 general comments. @copilot Please apply all of the code review suggestions above, and also address the general comments. Push all fixes in a single commit.

Auto-generated by copilot-review-apply workflow.
🐛 fix: address Copilot review comments from PRs #4008 and #4009

- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing, use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

🐛 fix: use explicit isLoading pattern for card-standard compliance

The card-standard CI check requires `isLoading: var && !hasData` instead of shorthand `isLoading` in useCardLoadingState calls.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

Initial plan

Agent-Logs-Url: https://github.com/kubestellar/console/sessions/c7cb576d-dfee-4502-bbb7-c682f8489def
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>
Summary
Fixes #3946 — Enhances the GPU Inventory History card with granular per-GPU-type tracking, per-node filtering, and churn/duration metrics as requested by @MikeSpreitzer.
Changes
Phase 1: Enrich Snapshot Data
- Added `GPUType` field to `GPUNodeMetricSnapshot` struct (Go backend) and `MetricsSnapshot.gpuNodes` type (frontend)
- `captureSnapshot()` now populates `gpuType` from `GPUNode.GPUType`
- `useMetricsHistory` captures `gpuType` in both auto and manual snapshots
- Legacy snapshots without `gpuType` display as "Unknown"

Phase 2: Enhanced GPUInventoryHistory Card
Phase 3: Extended Retention
- `MAX_SNAPSHOTS` from 144 to 1008 (7 days at 10-min intervals)
- `snapshotRetentionHrs` from 24 to 168 (7 days) in both backend and frontend
- `CACHE_TTL_MS` to 7 days in frontend

Code Quality

- Guard all array operations against undefined with `(arr || [])`

Test plan