✨ feat: add KubeRay, SLO, Failover, and Trino dashboard cards with preset dashboards by clubanderson · Pull Request #4009 · kubestellar/console

clubanderson · 2026-03-31T16:10:40Z

Summary

Adds 4 new dashboard cards and 2 preset dashboards to position KubeStellar Console as the Day 2 operations dashboard for Karmada-based multi-cluster architectures showcased at KubeCon EU 2026:

New Cards (registered in Orchestration category)

KubeRay Fleet Monitor (kuberay_fleet) — Discovers RayCluster, RayService, and RayJob CRDs across all clusters. Shows fleet stats (clusters, workers, GPUs, jobs), per-cluster drill-down, serving endpoint status, and job progress.
SLO Compliance Tracker (slo_compliance) — Configurable SLO targets with error budget burn-rate gauges, compliance donut charts, and per-cluster compliance indicators. Supports AI inference SLOs (TTFT, TPOT) and data platform SLOs.
Cross-Region Failover Timeline (failover_timeline) — Forensic timeline of Karmada ResourceBinding failover events. Shows cluster outages, binding rescheduling, replica rebalancing, and recovery events with severity coloring.
Trino Gateway Monitor (trino_gateway) — Discovers Trino coordinator, worker, and gateway pods across clusters. Shows per-cluster query health, gateway routing status, and worker distribution.

New Preset Dashboards

Karmada AI Operations (presets/karmada-ai-operations.json) — Pre-arranged 6-card dashboard for Karmada+KubeRay inference environments
Karmada Data Platform (presets/karmada-data-platform.json) — Pre-arranged 6-card dashboard for Karmada+Trino data platforms

Registration (all 5 locations)

All cards registered in: cardRegistry.ts, cardMetadata.ts, AddCardModal.tsx (Orchestration category), DEMO_EXEMPT set, chunk map, and default width map.

Motivation

Two KubeCon EU 2026 sessions from Bloomberg showcase Karmada+KubeRay (multi-cluster AI inference) and Karmada+Trino (disaster-resilient data platform). These sessions provide infrastructure plumbing (YAML manifests). KubeStellar Console fills the Day 2 gap: operational intelligence for running these architectures in production.

Test plan

npm run build passes
Cards appear in Add Cards dialog under Orchestration category
Each card renders with demo data when added to a dashboard
Preset dashboards load correctly from the marketplace
Cards show loading skeleton → demo data flow correctly

…eset dashboards

Add 4 new dashboard cards to counter Karmada+KubeRay and Karmada+Trino architectures showcased at KubeCon EU 2026:

- **KubeRay Fleet Monitor**: Discovers RayCluster, RayService, and RayJob CRDs across all clusters with GPU allocation tracking and fleet stats
- **SLO Compliance Tracker**: Configurable SLO targets with error budget burn-rate gauges, compliance donut charts, and per-cluster indicators
- **Cross-Region Failover Timeline**: Forensic timeline of Karmada ResourceBinding failover events with cluster outage/recovery tracking
- **Trino Gateway Monitor**: Discovers Trino coordinator/worker/gateway pods with per-cluster query health and routing status

Add 2 preset dashboards:

- **Karmada AI Operations**: Pre-arranged dashboard for Karmada+KubeRay inference environments (KubeRay Fleet + SLO + Failover + GPU Overview)
- **Karmada Data Platform**: Pre-arranged dashboard for Karmada+Trino data platforms (Trino Gateway + SLO + Failover + Cluster Locations)

All cards registered in cardRegistry, cardMetadata, and AddCardModal catalog under the Orchestration category.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

kubestellar-prow · 2026-03-31T16:10:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2026-03-31T16:10:51Z

✅ Deploy Preview for kubestellarconsole ready!

Name	Link
🔨 Latest commit	`ade983b`
🔍 Latest deploy log	https://app.netlify.com/projects/kubestellarconsole/deploys/69cbf2067853580009bcecfa
😎 Deploy Preview	https://deploy-preview-4009.console-deploy-preview.kubestellar.io

To edit notification comments on pull requests, go to your Netlify project configuration.

github-actions · 2026-03-31T16:17:58Z

Thank you for your contribution! Your PR has been merged.

Check out what's new:

KubeStellar Console — Live multi-cluster dashboard
Marketplace — Community extensions
Knowledge Base — Troubleshooting and how-tos

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

github-actions · 2026-03-31T16:18:01Z

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

Claude Code (Opus 4.5 / 4.6) — recommended
GitHub Copilot
Cursor
Other AI coding assistants

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.

This is an automated message.

Copilot

Pull request overview

Adds new Orchestration-focused dashboard cards (KubeRay fleet, SLO compliance, Karmada failover timeline, Trino gateway) and introduces two multi-card preset dashboards to support Day-2 operations views for Karmada-based multi-cluster environments.

Changes:

Register 4 new card types in the UI catalog/registry/metadata (titles, descriptions, lazy loading, default widths, chunk preloaders).
Implement new card UIs + data hooks with demo datasets for KubeRay Fleet, SLO Compliance, Failover Timeline, and Trino Gateway.
Add two preset dashboard JSON definitions for AI ops and data platform layouts.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.


File	Description
web/src/components/dashboard/AddCardModal.tsx	Adds the 4 new cards to the Orchestration catalog list.
web/src/components/cards/cardRegistry.ts	Registers new card components (lazy imports), preloaders, live-data set, and default widths.
web/src/components/cards/cardMetadata.ts	Adds titles/descriptions for the 4 new cards.
web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts	Fetches/aggregates Ray CRDs across clusters via custom-resources API.
web/src/components/cards/kuberay_fleet/KubeRayFleet.tsx	Renders KubeRay fleet summary + per-resource sections.
web/src/components/cards/kuberay_fleet/demoData.ts	Defines KubeRay fleet types + demo dataset.
web/src/components/cards/kuberay_fleet/index.ts	Barrel export for KubeRayFleet card.
web/src/components/cards/slo_compliance/useSLOCompliance.ts	Fetches SLO config + queries Prometheus to compute compliance data.
web/src/components/cards/slo_compliance/SLOCompliance.tsx	Renders donut gauges + per-target burn/compliance rows.
web/src/components/cards/slo_compliance/demoData.ts	Defines SLO types + demo dataset.
web/src/components/cards/slo_compliance/index.ts	Barrel export for SLOCompliance card.
web/src/components/cards/failover_timeline/useFailoverTimeline.ts	Fetches Karmada CRDs and correlates conditions/bindings into timeline events.
web/src/components/cards/failover_timeline/FailoverTimeline.tsx	Renders timeline UI with severity/type badges and summary stats.
web/src/components/cards/failover_timeline/demoData.ts	Defines failover event types + demo dataset.
web/src/components/cards/failover_timeline/index.ts	Barrel export for FailoverTimeline card.
web/src/components/cards/trino_gateway/useTrinoGateway.ts	Discovers Trino + gateway pods via label selectors and aggregates status.
web/src/components/cards/trino_gateway/TrinoGateway.tsx	Renders Trino cluster/gateway summary and backend routing rows.
web/src/components/cards/trino_gateway/demoData.ts	Defines Trino gateway types + demo dataset.
web/src/components/cards/trino_gateway/index.ts	Barrel export for TrinoGateway card.
presets/karmada-ai-operations.json	Adds a 6-card preset dashboard layout for Karmada + KubeRay AI operations.
presets/karmada-data-platform.json	Adds a 6-card preset dashboard layout for Karmada + Trino data platform ops.

Copilot · 2026-03-31T16:30:42Z

web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts

+function safeNumber(val: unknown, fallback = 0): number {
+  return typeof val === 'number' ? val : fallback
+}
+
+function safeString(val: unknown, fallback = ''): string {
+  return typeof val === 'string' ? val : fallback
+}
+
+// ---------------------------------------------------------------------------
+// CRD parsers
+// ---------------------------------------------------------------------------
+
+function parseRayCluster(item: CRItem): RayClusterInfo {
+  const status = getRecord(item.status)
+  const spec = getRecord(item.spec)
+  const workerSpecs = Array.isArray(spec.workerGroupSpecs) ? spec.workerGroupSpecs : []
+
+  let gpuCount = 0
+  for (const wg of workerSpecs) {
+    const wgObj = getRecord(wg)
+    const replicas = safeNumber(wgObj.replicas, 1)
+    const template = getRecord(getRecord(wgObj.template).spec)
+    const containers = Array.isArray(template.containers) ? template.containers : []
+    for (const c of containers) {
+      const limits = getRecord(getRecord(getRecord(c).resources).limits)
+      const gpuVal = limits['nvidia.com/gpu']
+      if (gpuVal) gpuCount += safeNumber(gpuVal) * replicas
+    }


safeNumber() only accepts numeric JSON values, but Kubernetes resource quantities (including resources.limits['nvidia.com/gpu']) are typically serialized as strings (e.g. "1"). As written, gpuCount will stay 0 for real RayCluster specs. Parse numeric strings (and consider handling Quantity formats) before multiplying by replicas.
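
One way to address this is to coerce the limit value before multiplying. The following is a minimal sketch, not the fix that was merged: `parseGpuQuantity` is a hypothetical helper name, and it assumes GPU limits arrive either as JSON numbers or as plain integer strings such as "1". Suffixed Quantity forms (for example "500m") are not interpreted here and fall back to the default.

```typescript
// Hypothetical helper (not part of the PR): coerce a GPU resource limit
// into a whole-number count. Accepts JSON numbers and plain integer strings.
function parseGpuQuantity(val: unknown, fallback = 0): number {
  if (typeof val === 'number') {
    // Guard against NaN / Infinity sneaking into the fleet total
    return Number.isFinite(val) ? val : fallback
  }
  if (typeof val === 'string') {
    const trimmed = val.trim()
    // Only plain non-negative integers are accepted ("1", "8");
    // anything with a unit suffix is treated as unparseable.
    if (/^\d+$/.test(trimmed)) return Number.parseInt(trimmed, 10)
  }
  return fallback
}
```

With a helper like this, the accumulation in `parseRayCluster` could read `gpuCount += parseGpuQuantity(gpuVal) * replicas`, so a limit serialized as "1" is no longer silently dropped.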

Copilot · 2026-03-31T16:30:43Z

web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts

+  if (!detected) throw new Error('No Ray resources detected')
+


Treating "no Ray resources detected" as an exception marks the cache as failed and increments consecutiveFailures, which can surface as an error/offline state even though "not installed" is a valid empty state. Return detected: false (with empty arrays + lastCheckTime) instead of throwing so the card can show the intended empty state without reporting a fetch failure.

Suggested change

-  if (!detected) throw new Error('No Ray resources detected')
+  if (!detected) {
+    return {
+      detected: false,
+      rayClusters: [],
+      rayServices: [],
+      rayJobs: [],
+      totalGPUs: 0,
+      lastCheckTime: new Date().toISOString(),
+    }
+  }

Copilot · 2026-03-31T16:30:44Z

web/src/components/cards/trino_gateway/useTrinoGateway.ts

+  ])
+
+  const detected = coordinators.length > 0 || workers.length > 0 || gatewayPods.length > 0
+  if (!detected) throw new Error('No Trino resources detected')


fetchTrinoGatewayData() throws when no Trino/Gateway pods are found. That turns a normal "not installed" state into a failed fetch (and bumps consecutiveFailures). Return detected: false (plus lastCheckTime) instead of throwing so the card can render the empty state without being marked as failed.

Suggested change

-  if (!detected) throw new Error('No Trino resources detected')
+  if (!detected) {
+    return {
+      detected: false,
+      trinoClusters: [],
+      gateways: [],
+      totalWorkers: 0,
+      totalActiveQueries: 0,
+      lastCheckTime: new Date().toISOString(),
+    }
+  }

Copilot · 2026-03-31T16:30:44Z

web/src/components/cards/slo_compliance/useSLOCompliance.ts

+async function fetchSLOCompliance(): Promise<SLOComplianceData> {
+  // Step 1: Get SLO target configuration from the backend
+  const configResp = await authFetch('/api/mcp/slo-targets', {
+    headers: { Accept: 'application/json' },
+    signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
+  })
+  if (!configResp.ok) throw new Error('SLO targets not configured')
+
+  const configBody: SLOConfigResponse = await configResp.json().catch(() => ({}))
+  const sloConfigs = configBody.targets ?? []
+  if (sloConfigs.length === 0) throw new Error('No SLO targets defined')
+
+  // Step 2: Query Prometheus for each target's current compliance
+  const targets: SLOTarget[] = await Promise.all(
+    sloConfigs.map(async (cfg) => {
+      try {
+        const params = new URLSearchParams({ query: cfg.query })
+        const resp = await authFetch(`/api/mcp/prometheus/query?${params}`, {
+          headers: { Accept: 'application/json' },
+          signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
+        })
+        if (!resp.ok) {


These API routes don't appear to exist in the repo (/api/mcp/slo-targets and /api/mcp/prometheus/query): the backend registers /api/mcp/... handlers, but only the agent exposes /prometheus/query (no /api/mcp/prometheus/query route), and there is no slo-targets handler. This will 404 and permanently keep the card in a failed/empty state in live mode. Either add the corresponding backend endpoints + mocks, or change the card to use the existing agent Prometheus proxy (${LOCAL_AGENT_HTTP_URL}/prometheus/query) and a real source for SLO target config.

Copilot · 2026-03-31T16:30:45Z

web/src/components/cards/slo_compliance/SLOCompliance.tsx

+function calculateBurnRate(target: SLOTarget): number {
+  const budgetUsed = FULL_COMPLIANCE - target.currentCompliance
+  const budgetAllowed = FULL_COMPLIANCE - target.threshold
+  if (budgetAllowed <= 0) return 0
+  return budgetUsed / budgetAllowed
+}


calculateBurnRate() assumes target.threshold is a 0–100% SLO objective, but the UI/data model also uses thresholds like 500ms (see demo data) and displays them as latency/error targets. For non-percent thresholds this makes budgetAllowed negative and forces burn rate to 0, hiding problems. Burn rate needs to be computed from an explicit objective percentage (e.g. 99.9%) or from error budget math that matches the metric type, rather than subtracting a latency threshold from 100.
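
The point about an explicit objective percentage can be illustrated with a small sketch. The `PercentSLO` shape and its field names are assumptions made for this example (the PR's full `SLOTarget` type is not shown here), and returning `null` for non-percentage targets is one possible design, not necessarily what the follow-up fix does.

```typescript
// Hypothetical shape for illustration; field names are assumptions,
// not the PR's actual SLOTarget type.
interface PercentSLO {
  objectivePct: number          // SLO objective as a percentage, e.g. 99.9
  currentCompliancePct: number  // measured compliance as a percentage
}

const FULL_COMPLIANCE_PCT = 100

// Returns the fraction of the error budget consumed (1 means fully spent),
// or null when the objective is not a usable percentage, so callers can
// render "n/a" instead of a misleading 0.
function burnRate(slo: PercentSLO): number | null {
  const budgetAllowed = FULL_COMPLIANCE_PCT - slo.objectivePct
  if (slo.objectivePct < 0 || !(budgetAllowed > 0)) return null
  const budgetUsed = Math.max(0, FULL_COMPLIANCE_PCT - slo.currentCompliancePct)
  return budgetUsed / budgetAllowed
}
```

Under this sketch a latency threshold such as 500 (ms) passed as the objective yields `null` rather than 0, which keeps a non-percentage target from looking healthy by accident.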

Copilot · 2026-03-31T16:30:46Z

web/src/components/cards/failover_timeline/useFailoverTimeline.ts

+      // Find binding reschedules within the correlation window
+      for (const bt of bindingTransitions) {
+        if (!bt.scheduledTime) continue
+        const bindingMs = new Date(bt.scheduledTime).getTime()
+        if (isNaN(bindingMs)) continue
+
+        const delta = bindingMs - transitionMs
+        if (delta >= 0 && delta <= CORRELATION_WINDOW_MS) {
+          const targetCluster = bt.clusters.length > 0 ? bt.clusters[0] : 'unknown'
+          events.push({
+            timestamp: bt.scheduledTime,
+            eventType: 'binding_reschedule' as FailoverEventType,
+            cluster: targetCluster,
+            workload: bt.resourceKind ? `${bt.resourceKind}/${bt.bindingName}` : bt.bindingName,
+            details: `ResourceBinding rescheduled from ${ct.clusterName} to ${targetCluster}`,
+            severity: 'warning' as FailoverSeverity,
+          })
+        }
+      }


In the cluster-down correlation block, reschedule events are emitted for any binding that became Scheduled within the correlation window, even if it wasn't actually rescheduled (i.e., bt.isRescheduled is false). This can generate false "Reschedule" events unrelated to failover. Filter to bindings that are explicitly marked rescheduled (or otherwise verify the binding changed due to the down cluster) before emitting binding_reschedule events.
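
The suggested filter might look like the sketch below. `BindingTransition` is an assumed subset of the PR's internal type (only `isRescheduled`, `scheduledTime`, `bindingName`, and `clusters` are taken from the review text and diff), and `rescheduledWithinWindow` is a hypothetical helper name.

```typescript
// Assumed subset of the hook's binding record; only fields mentioned in
// the review or visible in the diff are included.
interface BindingTransition {
  bindingName: string
  scheduledTime?: string
  isRescheduled: boolean
  clusters: string[]
}

// Hypothetical helper: keep only bindings that were explicitly marked as
// rescheduled AND whose scheduledTime falls inside the correlation window
// that opens when the cluster went down.
function rescheduledWithinWindow(
  transitions: BindingTransition[],
  clusterDownMs: number,
  windowMs: number,
): BindingTransition[] {
  return transitions.filter((bt) => {
    if (!bt.isRescheduled || !bt.scheduledTime) return false
    const bindingMs = Date.parse(bt.scheduledTime)
    if (Number.isNaN(bindingMs)) return false
    const delta = bindingMs - clusterDownMs
    return delta >= 0 && delta <= windowMs
  })
}
```

Filtering first and emitting `binding_reschedule` events only from the returned list keeps bindings that were merely scheduled during the window out of the timeline.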

Copilot · 2026-03-31T16:30:46Z

web/src/components/cards/failover_timeline/useFailoverTimeline.ts

+    const alreadyCorrelated = events.some(
+      e => e.eventType === 'binding_reschedule' && e.timestamp === bt.scheduledTime,
+    )
+    if (alreadyCorrelated) continue


alreadyCorrelated de-dupes reschedule events only by eventType + timestamp. If multiple ResourceBindings reschedule at the same scheduledTime, later ones will be dropped. Include a binding/workload identifier (e.g. workload or bindingName) in the de-dupe key to avoid losing events.
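
A composite de-dupe key along the lines suggested could be sketched as follows. `TimelineEvent` is a reduced, assumed version of the card's event type, and `eventKey` / `dedupeEvents` are hypothetical helper names introduced for illustration.

```typescript
// Reduced event shape for illustration; the real type carries more fields.
interface TimelineEvent {
  timestamp: string
  eventType: string
  workload: string
}

// Key on event type, workload, and timestamp so that two different
// bindings rescheduled at the same instant remain distinct events.
function eventKey(e: TimelineEvent): string {
  return [e.eventType, e.workload, e.timestamp].join('|')
}

// Keeps the first occurrence of each key and preserves input order.
function dedupeEvents<T extends TimelineEvent>(events: T[]): T[] {
  const seen = new Set<string>()
  const out: T[] = []
  for (const e of events) {
    const key = eventKey(e)
    if (seen.has(key)) continue
    seen.add(key)
    out.push(e)
  }
  return out
}
```

A `Set` lookup also avoids rescanning the whole `events` array with `some()` for every binding, which may matter on clusters with many ResourceBindings.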

clubanderson · 2026-03-31T16:32:23Z

🔄 Auto-Applying Copilot Code Review

Copilot code review found 2 code suggestion(s) and 5 general comment(s).

@copilot Please apply all of the following code review suggestions:

web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts (line 183): if (!detected) { return { detected: false, rayClusters: [], ...
web/src/components/cards/trino_gateway/useTrinoGateway.ts (line 201): if (!detected) { return { detected: false, trinoClusters: [], ...

Also address these general comments:

web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts (line 114): safeNumber() only accepts numeric JSON values, but Kubernetes resource quantities (including `resources.limits['nvidia
web/src/components/cards/slo_compliance/useSLOCompliance.ts (line 95): These API routes don't appear to exist in the repo (/api/mcp/slo-targets and /api/mcp/prometheus/query): the backend
web/src/components/cards/slo_compliance/SLOCompliance.tsx (line 60): calculateBurnRate() assumes target.threshold is a 0–100% SLO objective, but the UI/data model also uses thresholds l
web/src/components/cards/failover_timeline/useFailoverTimeline.ts (line 210): In the cluster-down correlation block, reschedule events are emitted for any binding that became Scheduled within the
web/src/components/cards/failover_timeline/useFailoverTimeline.ts (line 233): alreadyCorrelated de-dupes reschedule events only by eventType + timestamp. If multiple ResourceBindings reschedul

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.

Auto-generated by copilot-review-apply workflow.

* 🐛 fix: address Copilot review comments from PRs #4008 and #4009

- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing, use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🐛 fix: use explicit isLoading pattern for card-standard compliance

The card-standard CI check requires `isLoading: var && !hasData` instead of shorthand `isLoading` in useCardLoadingState calls.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* Initial plan

Agent-Logs-Url: https://github.com/kubestellar/console/sessions/c7cb576d-dfee-4502-bbb7-c682f8489def

Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 31, 2026 16:10

kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Mar 31, 2026

kubestellar-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 31, 2026

clubanderson merged commit 0e3b876 into main Mar 31, 2026
18 of 20 checks passed

kubestellar-prow bot deleted the feat/karmada-ai-operations-cards branch March 31, 2026 16:10

Copilot started reviewing on behalf of clubanderson March 31, 2026 16:22 View session

Copilot AI reviewed Mar 31, 2026


clubanderson assigned Copilot Mar 31, 2026

github-actions bot mentioned this pull request Mar 31, 2026

📋 [aw] No-Op Runs #3237

Closed

clubanderson mentioned this pull request Mar 31, 2026

🐛 fix: address Copilot review comments from PRs #4008 and #4009 #4012

Merged

5 tasks

