✨ feat: add KubeRay, SLO, Failover, and Trino dashboard cards with preset dashboards #4009
Conversation
…eset dashboards

Add 4 new dashboard cards to support the Karmada+KubeRay and Karmada+Trino architectures showcased at KubeCon EU 2026:

- **KubeRay Fleet Monitor**: Discovers RayCluster, RayService, and RayJob CRDs across all clusters with GPU allocation tracking and fleet stats
- **SLO Compliance Tracker**: Configurable SLO targets with error budget burn-rate gauges, compliance donut charts, and per-cluster indicators
- **Cross-Region Failover Timeline**: Forensic timeline of Karmada ResourceBinding failover events with cluster outage/recovery tracking
- **Trino Gateway Monitor**: Discovers Trino coordinator/worker/gateway pods with per-cluster query health and routing status

Add 2 preset dashboards:

- **Karmada AI Operations**: Pre-arranged dashboard for Karmada+KubeRay inference environments (KubeRay Fleet + SLO + Failover + GPU Overview)
- **Karmada Data Platform**: Pre-arranged dashboard for Karmada+Trino data platforms (Trino Gateway + SLO + Failover + Cluster Locations)

All cards registered in cardRegistry, cardMetadata, and AddCardModal catalog under the Orchestration category.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Thank you for your contribution! Your PR has been merged.
Pull request overview
Adds new Orchestration-focused dashboard cards (KubeRay fleet, SLO compliance, Karmada failover timeline, Trino gateway) and introduces two multi-card preset dashboards to support Day-2 operations views for Karmada-based multi-cluster environments.
Changes:
- Register 4 new card types in the UI catalog/registry/metadata (titles, descriptions, lazy loading, default widths, chunk preloaders).
- Implement new card UIs + data hooks with demo datasets for KubeRay Fleet, SLO Compliance, Failover Timeline, and Trino Gateway.
- Add two preset dashboard JSON definitions for AI ops and data platform layouts.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| web/src/components/dashboard/AddCardModal.tsx | Adds the 4 new cards to the Orchestration catalog list. |
| web/src/components/cards/cardRegistry.ts | Registers new card components (lazy imports), preloaders, live-data set, and default widths. |
| web/src/components/cards/cardMetadata.ts | Adds titles/descriptions for the 4 new cards. |
| web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts | Fetches/aggregates Ray CRDs across clusters via custom-resources API. |
| web/src/components/cards/kuberay_fleet/KubeRayFleet.tsx | Renders KubeRay fleet summary + per-resource sections. |
| web/src/components/cards/kuberay_fleet/demoData.ts | Defines KubeRay fleet types + demo dataset. |
| web/src/components/cards/kuberay_fleet/index.ts | Barrel export for KubeRayFleet card. |
| web/src/components/cards/slo_compliance/useSLOCompliance.ts | Fetches SLO config + queries Prometheus to compute compliance data. |
| web/src/components/cards/slo_compliance/SLOCompliance.tsx | Renders donut gauges + per-target burn/compliance rows. |
| web/src/components/cards/slo_compliance/demoData.ts | Defines SLO types + demo dataset. |
| web/src/components/cards/slo_compliance/index.ts | Barrel export for SLOCompliance card. |
| web/src/components/cards/failover_timeline/useFailoverTimeline.ts | Fetches Karmada CRDs and correlates conditions/bindings into timeline events. |
| web/src/components/cards/failover_timeline/FailoverTimeline.tsx | Renders timeline UI with severity/type badges and summary stats. |
| web/src/components/cards/failover_timeline/demoData.ts | Defines failover event types + demo dataset. |
| web/src/components/cards/failover_timeline/index.ts | Barrel export for FailoverTimeline card. |
| web/src/components/cards/trino_gateway/useTrinoGateway.ts | Discovers Trino + gateway pods via label selectors and aggregates status. |
| web/src/components/cards/trino_gateway/TrinoGateway.tsx | Renders Trino cluster/gateway summary and backend routing rows. |
| web/src/components/cards/trino_gateway/demoData.ts | Defines Trino gateway types + demo dataset. |
| web/src/components/cards/trino_gateway/index.ts | Barrel export for TrinoGateway card. |
| presets/karmada-ai-operations.json | Adds a 6-card preset dashboard layout for Karmada + KubeRay AI operations. |
| presets/karmada-data-platform.json | Adds a 6-card preset dashboard layout for Karmada + Trino data platform ops. |
```ts
function safeNumber(val: unknown, fallback = 0): number {
  return typeof val === 'number' ? val : fallback
}

function safeString(val: unknown, fallback = ''): string {
  return typeof val === 'string' ? val : fallback
}

// ---------------------------------------------------------------------------
// CRD parsers
// ---------------------------------------------------------------------------

function parseRayCluster(item: CRItem): RayClusterInfo {
  const status = getRecord(item.status)
  const spec = getRecord(item.spec)
  const workerSpecs = Array.isArray(spec.workerGroupSpecs) ? spec.workerGroupSpecs : []

  let gpuCount = 0
  for (const wg of workerSpecs) {
    const wgObj = getRecord(wg)
    const replicas = safeNumber(wgObj.replicas, 1)
    const template = getRecord(getRecord(wgObj.template).spec)
    const containers = Array.isArray(template.containers) ? template.containers : []
    for (const c of containers) {
      const limits = getRecord(getRecord(getRecord(c).resources).limits)
      const gpuVal = limits['nvidia.com/gpu']
      if (gpuVal) gpuCount += safeNumber(gpuVal) * replicas
    }
```
safeNumber() only accepts numeric JSON values, but Kubernetes resource quantities (including resources.limits['nvidia.com/gpu']) are typically serialized as strings (e.g. "1"). As written, gpuCount will stay 0 for real RayCluster specs. Parse numeric strings (and consider handling Quantity formats) before multiplying by replicas.
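One way to address this is a small quantity parser; this is a hedged sketch (the helper name `parseQuantity` is illustrative, not from the PR). It accepts plain numbers, numeric strings like `"1"`, and milli-quantities like `"500m"`, but deliberately does not cover the full Kubernetes Quantity grammar (binary suffixes such as `"1Ki"` would need real Quantity parsing):

```typescript
// Hypothetical helper: coerce a Kubernetes resource quantity to a number.
// Handles numbers, numeric strings ("1"), and milli-units ("500m" -> 0.5);
// NOT a full Quantity parser (binary suffixes like "Ki" are out of scope).
function parseQuantity(val: unknown, fallback = 0): number {
  if (typeof val === 'number') return val
  if (typeof val === 'string') {
    const n = parseFloat(val)
    if (!Number.isNaN(n)) {
      return val.endsWith('m') ? n / 1000 : n
    }
  }
  return fallback
}
```

With something like this in place, `gpuCount += parseQuantity(gpuVal) * replicas` would count GPUs even when the API server serializes the limit as a string.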
```ts
if (!detected) throw new Error('No Ray resources detected')
```
Treating "no Ray resources detected" as an exception marks the cache as failed and increments consecutiveFailures, which can surface as an error/offline state even though "not installed" is a valid empty state. Return detected: false (with empty arrays + lastCheckTime) instead of throwing so the card can show the intended empty state without reporting a fetch failure.
Suggested change:

```diff
-if (!detected) throw new Error('No Ray resources detected')
+if (!detected) {
+  return {
+    detected: false,
+    rayClusters: [],
+    rayServices: [],
+    rayJobs: [],
+    totalGPUs: 0,
+    lastCheckTime: new Date().toISOString(),
+  }
+}
```
```ts
])

const detected = coordinators.length > 0 || workers.length > 0 || gatewayPods.length > 0
if (!detected) throw new Error('No Trino resources detected')
```
fetchTrinoGatewayData() throws when no Trino/Gateway pods are found. That turns a normal "not installed" state into a failed fetch (and bumps consecutiveFailures). Return detected: false (plus lastCheckTime) instead of throwing so the card can render the empty state without being marked as failed.
Suggested change:

```diff
-if (!detected) throw new Error('No Trino resources detected')
+if (!detected) {
+  return {
+    detected: false,
+    trinoClusters: [],
+    gateways: [],
+    totalWorkers: 0,
+    totalActiveQueries: 0,
+    lastCheckTime: new Date().toISOString(),
+  }
+}
```
```ts
async function fetchSLOCompliance(): Promise<SLOComplianceData> {
  // Step 1: Get SLO target configuration from the backend
  const configResp = await authFetch('/api/mcp/slo-targets', {
    headers: { Accept: 'application/json' },
    signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
  })
  if (!configResp.ok) throw new Error('SLO targets not configured')

  const configBody: SLOConfigResponse = await configResp.json().catch(() => ({}))
  const sloConfigs = configBody.targets ?? []
  if (sloConfigs.length === 0) throw new Error('No SLO targets defined')

  // Step 2: Query Prometheus for each target's current compliance
  const targets: SLOTarget[] = await Promise.all(
    sloConfigs.map(async (cfg) => {
      try {
        const params = new URLSearchParams({ query: cfg.query })
        const resp = await authFetch(`/api/mcp/prometheus/query?${params}`, {
          headers: { Accept: 'application/json' },
          signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
        })
        if (!resp.ok) {
```
These API routes don't appear to exist in the repo (/api/mcp/slo-targets and /api/mcp/prometheus/query): the backend registers /api/mcp/... handlers, but only the agent exposes /prometheus/query (no /api/mcp/prometheus/query route), and there is no slo-targets handler. This will 404 and permanently keep the card in a failed/empty state in live mode. Either add the corresponding backend endpoints + mocks, or change the card to use the existing agent Prometheus proxy (${LOCAL_AGENT_HTTP_URL}/prometheus/query) and a real source for SLO target config.
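A minimal sketch of the alternative the comment suggests, under stated assumptions: the `/prometheus/query` agent route comes from the comment itself, while the base-URL parameter, function names, and error handling here are illustrative. The value extraction follows the standard Prometheus instant-query response shape (`data.result[n].value = [timestamp, "value"]`):

```typescript
// Hypothetical sketch: pull a single scalar out of a Prometheus
// instant-query response body (standard API shape).
function extractInstantValue(body: unknown): number | null {
  const result = (body as { data?: { result?: { value?: [number, string] }[] } })?.data?.result
  const value = result?.[0]?.value?.[1]
  return typeof value === 'string' ? parseFloat(value) : null
}

// Hypothetical sketch: query via the agent proxy route the comment
// references; `agentBaseUrl` would be something like LOCAL_AGENT_HTTP_URL.
async function queryPrometheus(agentBaseUrl: string, promql: string): Promise<number | null> {
  const params = new URLSearchParams({ query: promql })
  const resp = await fetch(`${agentBaseUrl}/prometheus/query?${params}`, {
    headers: { Accept: 'application/json' },
  })
  if (!resp.ok) return null
  return extractInstantValue(await resp.json())
}
```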
```ts
function calculateBurnRate(target: SLOTarget): number {
  const budgetUsed = FULL_COMPLIANCE - target.currentCompliance
  const budgetAllowed = FULL_COMPLIANCE - target.threshold
  if (budgetAllowed <= 0) return 0
  return budgetUsed / budgetAllowed
}
```
calculateBurnRate() assumes target.threshold is a 0–100% SLO objective, but the UI/data model also uses thresholds like 500ms (see demo data) and displays them as latency/error targets. For non-percent thresholds this makes budgetAllowed negative and forces burn rate to 0, hiding problems. Burn rate needs to be computed from an explicit objective percentage (e.g. 99.9%) or from error budget math that matches the metric type, rather than subtracting a latency threshold from 100.
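A minimal sketch of the suggested fix, assuming the SLO config gains an explicit metric type and objective percentage (the field names `metricType` and `objective` are hypothetical, not from the PR):

```typescript
// Hypothetical target shape: burn rate is only meaningful when the
// objective is a percentage (e.g. 99.9% availability).
interface SLOTargetLike {
  metricType: 'percentage' | 'latency'
  objective: number          // e.g. 99.9 (% of good events)
  currentCompliance: number  // e.g. 99.7 (% measured)
}

// Returns null for non-percentage targets instead of silently forcing 0.
function burnRate(t: SLOTargetLike): number | null {
  if (t.metricType !== 'percentage') return null
  const budgetAllowed = 100 - t.objective
  if (budgetAllowed <= 0) return 0
  const budgetUsed = 100 - t.currentCompliance
  return budgetUsed / budgetAllowed
}
```

Latency-style targets (e.g. a 500ms threshold) return `null` here; they would need their own error-budget math (such as the fraction of requests over the threshold) rather than subtracting a millisecond value from 100.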
```ts
// Find binding reschedules within the correlation window
for (const bt of bindingTransitions) {
  if (!bt.scheduledTime) continue
  const bindingMs = new Date(bt.scheduledTime).getTime()
  if (isNaN(bindingMs)) continue

  const delta = bindingMs - transitionMs
  if (delta >= 0 && delta <= CORRELATION_WINDOW_MS) {
    const targetCluster = bt.clusters.length > 0 ? bt.clusters[0] : 'unknown'
    events.push({
      timestamp: bt.scheduledTime,
      eventType: 'binding_reschedule' as FailoverEventType,
      cluster: targetCluster,
      workload: bt.resourceKind ? `${bt.resourceKind}/${bt.bindingName}` : bt.bindingName,
      details: `ResourceBinding rescheduled from ${ct.clusterName} to ${targetCluster}`,
      severity: 'warning' as FailoverSeverity,
    })
  }
}
```
In the cluster-down correlation block, reschedule events are emitted for any binding that became Scheduled within the correlation window, even if it wasn't actually rescheduled (i.e., bt.isRescheduled is false). This can generate false "Reschedule" events unrelated to failover. Filter to bindings that are explicitly marked rescheduled (or otherwise verify the binding changed due to the down cluster) before emitting binding_reschedule events.
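The filter the comment asks for can be sketched as a small predicate over the transition list, assuming the hook's `bindingTransitions` entries carry the `isRescheduled` flag the comment references (the interface here is illustrative):

```typescript
// Hypothetical transition shape; only the fields the filter needs.
interface BindingTransitionLike {
  bindingName: string
  scheduledTime: string | null
  isRescheduled: boolean
}

// Keep only bindings explicitly marked as rescheduled (and with a usable
// timestamp) before the correlation loop emits binding_reschedule events.
function eligibleReschedules(transitions: BindingTransitionLike[]): BindingTransitionLike[] {
  return transitions.filter(bt => bt.isRescheduled && bt.scheduledTime !== null)
}
```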
```ts
const alreadyCorrelated = events.some(
  e => e.eventType === 'binding_reschedule' && e.timestamp === bt.scheduledTime,
)
if (alreadyCorrelated) continue
```
alreadyCorrelated de-dupes reschedule events only by eventType + timestamp. If multiple ResourceBindings reschedule at the same scheduledTime, later ones will be dropped. Include a binding/workload identifier (e.g. workload or bindingName) in the de-dupe key to avoid losing events.
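A sketch of the composite de-dupe key the comment suggests (the helper shape is illustrative; the point is that the key includes a binding/workload identifier, not just the timestamp):

```typescript
// Hypothetical sketch: de-dupe reschedule events by workload + timestamp so
// two bindings that reschedule at the same instant are both kept.
const seenReschedules = new Set<string>()

function isDuplicateReschedule(workload: string, timestamp: string): boolean {
  const key = `${workload}@${timestamp}`
  if (seenReschedules.has(key)) return true
  seenReschedules.add(key)
  return false
}
```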
🔄 Auto-Applying Copilot Code Review

Copilot code review found 2 code suggestions and 5 general comments. @copilot Please apply all of the code review suggestions, address the general comments, and push all fixes in a single commit.

Auto-generated by copilot-review-apply workflow.
* 🐛 fix: address Copilot review comments from PRs #4008 and #4009

  - KubeRay: parse string GPU quantities, return empty state instead of throwing
  - Trino Gateway: return empty state instead of throwing when not detected
  - SLO Compliance: skip burn rate for non-percentage targets, fix API routes
  - Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
  - GPU History: aggregate overflow types into "Other", per-node churn diffing, use mean allocated for Little's Law, clamp table page, add dropdown close handlers
  - Fix stale "24 hours" comments to match 7-day retention

  Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🐛 fix: use explicit isLoading pattern for card-standard compliance

  The card-standard CI check requires `isLoading: var && !hasData` instead of shorthand `isLoading` in useCardLoadingState calls.

  Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* Initial plan

  Agent-Logs-Url: https://github.com/kubestellar/console/sessions/c7cb576d-dfee-4502-bbb7-c682f8489def
  Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>
Summary
Adds 4 new dashboard cards and 2 preset dashboards to position KubeStellar Console as the Day 2 operations dashboard for Karmada-based multi-cluster architectures showcased at KubeCon EU 2026:
New Cards (registered in Orchestration category)
- **KubeRay Fleet Monitor** (`kuberay_fleet`) — Discovers RayCluster, RayService, and RayJob CRDs across all clusters. Shows fleet stats (clusters, workers, GPUs, jobs), per-cluster drill-down, serving endpoint status, and job progress.
- **SLO Compliance Tracker** (`slo_compliance`) — Configurable SLO targets with error budget burn-rate gauges, compliance donut charts, and per-cluster compliance indicators. Supports AI inference SLOs (TTFT, TPOT) and data platform SLOs.
- **Cross-Region Failover Timeline** (`failover_timeline`) — Forensic timeline of Karmada ResourceBinding failover events. Shows cluster outages, binding rescheduling, replica rebalancing, and recovery events with severity coloring.
- **Trino Gateway Monitor** (`trino_gateway`) — Discovers Trino coordinator, worker, and gateway pods across clusters. Shows per-cluster query health, gateway routing status, and worker distribution.

New Preset Dashboards
- **Karmada AI Operations** (`presets/karmada-ai-operations.json`) — Pre-arranged 6-card dashboard for Karmada+KubeRay inference environments
- **Karmada Data Platform** (`presets/karmada-data-platform.json`) — Pre-arranged 6-card dashboard for Karmada+Trino data platforms

Registration (all 5 locations)
All cards registered in:
`cardRegistry.ts`, `cardMetadata.ts`, `AddCardModal.tsx` (Orchestration category), the DEMO_EXEMPT set, the chunk map, and the default width map.
Two KubeCon EU 2026 sessions from Bloomberg showcase Karmada+KubeRay (multi-cluster AI inference) and Karmada+Trino (disaster-resilient data platform). These sessions provide infrastructure plumbing (YAML manifests). KubeStellar Console fills the Day 2 gap: operational intelligence for running these architectures in production.
Test plan
- `npm run build` passes