✨ feat: add KubeRay, SLO, Failover, and Trino dashboard cards with preset dashboards#4009

Merged
clubanderson merged 1 commit into main from
feat/karmada-ai-operations-cards
Mar 31, 2026

Conversation

@clubanderson
Collaborator

Summary

Adds 4 new dashboard cards and 2 preset dashboards to position KubeStellar Console as the Day 2 operations dashboard for Karmada-based multi-cluster architectures showcased at KubeCon EU 2026:

New Cards (registered in Orchestration category)

  • KubeRay Fleet Monitor (kuberay_fleet) — Discovers RayCluster, RayService, and RayJob CRDs across all clusters. Shows fleet stats (clusters, workers, GPUs, jobs), per-cluster drill-down, serving endpoint status, and job progress.
  • SLO Compliance Tracker (slo_compliance) — Configurable SLO targets with error budget burn-rate gauges, compliance donut charts, and per-cluster compliance indicators. Supports AI inference SLOs (TTFT, TPOT) and data platform SLOs.
  • Cross-Region Failover Timeline (failover_timeline) — Forensic timeline of Karmada ResourceBinding failover events. Shows cluster outages, binding rescheduling, replica rebalancing, and recovery events with severity coloring.
  • Trino Gateway Monitor (trino_gateway) — Discovers Trino coordinator, worker, and gateway pods across clusters. Shows per-cluster query health, gateway routing status, and worker distribution.

New Preset Dashboards

  • Karmada AI Operations (presets/karmada-ai-operations.json) — Pre-arranged 6-card dashboard for Karmada+KubeRay inference environments
  • Karmada Data Platform (presets/karmada-data-platform.json) — Pre-arranged 6-card dashboard for Karmada+Trino data platforms

Registration (all 5 locations)

All cards registered in: cardRegistry.ts, cardMetadata.ts, AddCardModal.tsx (Orchestration category), DEMO_EXEMPT set, chunk map, and default width map.

Motivation

Two KubeCon EU 2026 sessions from Bloomberg showcase Karmada+KubeRay (multi-cluster AI inference) and Karmada+Trino (disaster-resilient data platform). These sessions provide infrastructure plumbing (YAML manifests). KubeStellar Console fills the Day 2 gap: operational intelligence for running these architectures in production.

Test plan

  • npm run build passes
  • Cards appear in Add Cards dialog under Orchestration category
  • Each card renders with demo data when added to a dashboard
  • Preset dashboards load correctly from the marketplace
  • Cards show loading skeleton → demo data flow correctly

…eset dashboards

Add 4 new dashboard cards to support the Karmada+KubeRay and Karmada+Trino
architectures showcased at KubeCon EU 2026:

- **KubeRay Fleet Monitor**: Discovers RayCluster, RayService, and RayJob
  CRDs across all clusters with GPU allocation tracking and fleet stats
- **SLO Compliance Tracker**: Configurable SLO targets with error budget
  burn-rate gauges, compliance donut charts, and per-cluster indicators
- **Cross-Region Failover Timeline**: Forensic timeline of Karmada
  ResourceBinding failover events with cluster outage/recovery tracking
- **Trino Gateway Monitor**: Discovers Trino coordinator/worker/gateway
  pods with per-cluster query health and routing status

Add 2 preset dashboards:
- **Karmada AI Operations**: Pre-arranged dashboard for Karmada+KubeRay
  inference environments (KubeRay Fleet + SLO + Failover + GPU Overview)
- **Karmada Data Platform**: Pre-arranged dashboard for Karmada+Trino
  data platforms (Trino Gateway + SLO + Failover + Cluster Locations)

All cards registered in cardRegistry, cardMetadata, and AddCardModal
catalog under the Orchestration category.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings March 31, 2026 16:10
@kubestellar-prow kubestellar-prow bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Mar 31, 2026
@kubestellar-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubestellar-prow kubestellar-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 31, 2026
@netlify

netlify bot commented Mar 31, 2026

Deploy Preview for kubestellarconsole ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | ade983b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubestellarconsole/deploys/69cbf2067853580009bcecfa |
| 😎 Deploy Preview | https://deploy-preview-4009.console-deploy-preview.kubestellar.io |

@clubanderson clubanderson merged commit 0e3b876 into main Mar 31, 2026
18 of 20 checks passed
@kubestellar-prow kubestellar-prow bot deleted the feat/karmada-ai-operations-cards branch March 31, 2026 16:10
@github-actions
Contributor

Thank you for your contribution! Your PR has been merged.

Check out what's new:

Stay connected: Slack #kubestellar-dev | Multi-Cluster Survey

@github-actions
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

Contributor

Copilot AI left a comment

Pull request overview

Adds new Orchestration-focused dashboard cards (KubeRay fleet, SLO compliance, Karmada failover timeline, Trino gateway) and introduces two multi-card preset dashboards to support Day-2 operations views for Karmada-based multi-cluster environments.

Changes:

  • Register 4 new card types in the UI catalog/registry/metadata (titles, descriptions, lazy loading, default widths, chunk preloaders).
  • Implement new card UIs + data hooks with demo datasets for KubeRay Fleet, SLO Compliance, Failover Timeline, and Trino Gateway.
  • Add two preset dashboard JSON definitions for AI ops and data platform layouts.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| web/src/components/dashboard/AddCardModal.tsx | Adds the 4 new cards to the Orchestration catalog list. |
| web/src/components/cards/cardRegistry.ts | Registers new card components (lazy imports), preloaders, live-data set, and default widths. |
| web/src/components/cards/cardMetadata.ts | Adds titles/descriptions for the 4 new cards. |
| web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts | Fetches/aggregates Ray CRDs across clusters via custom-resources API. |
| web/src/components/cards/kuberay_fleet/KubeRayFleet.tsx | Renders KubeRay fleet summary + per-resource sections. |
| web/src/components/cards/kuberay_fleet/demoData.ts | Defines KubeRay fleet types + demo dataset. |
| web/src/components/cards/kuberay_fleet/index.ts | Barrel export for KubeRayFleet card. |
| web/src/components/cards/slo_compliance/useSLOCompliance.ts | Fetches SLO config + queries Prometheus to compute compliance data. |
| web/src/components/cards/slo_compliance/SLOCompliance.tsx | Renders donut gauges + per-target burn/compliance rows. |
| web/src/components/cards/slo_compliance/demoData.ts | Defines SLO types + demo dataset. |
| web/src/components/cards/slo_compliance/index.ts | Barrel export for SLOCompliance card. |
| web/src/components/cards/failover_timeline/useFailoverTimeline.ts | Fetches Karmada CRDs and correlates conditions/bindings into timeline events. |
| web/src/components/cards/failover_timeline/FailoverTimeline.tsx | Renders timeline UI with severity/type badges and summary stats. |
| web/src/components/cards/failover_timeline/demoData.ts | Defines failover event types + demo dataset. |
| web/src/components/cards/failover_timeline/index.ts | Barrel export for FailoverTimeline card. |
| web/src/components/cards/trino_gateway/useTrinoGateway.ts | Discovers Trino + gateway pods via label selectors and aggregates status. |
| web/src/components/cards/trino_gateway/TrinoGateway.tsx | Renders Trino cluster/gateway summary and backend routing rows. |
| web/src/components/cards/trino_gateway/demoData.ts | Defines Trino gateway types + demo dataset. |
| web/src/components/cards/trino_gateway/index.ts | Barrel export for TrinoGateway card. |
| presets/karmada-ai-operations.json | Adds a 6-card preset dashboard layout for Karmada + KubeRay AI operations. |
| presets/karmada-data-platform.json | Adds a 6-card preset dashboard layout for Karmada + Trino data platform ops. |

Comment on lines +87 to +114
function safeNumber(val: unknown, fallback = 0): number {
  return typeof val === 'number' ? val : fallback
}

function safeString(val: unknown, fallback = ''): string {
  return typeof val === 'string' ? val : fallback
}

// ---------------------------------------------------------------------------
// CRD parsers
// ---------------------------------------------------------------------------

function parseRayCluster(item: CRItem): RayClusterInfo {
  const status = getRecord(item.status)
  const spec = getRecord(item.spec)
  const workerSpecs = Array.isArray(spec.workerGroupSpecs) ? spec.workerGroupSpecs : []

  let gpuCount = 0
  for (const wg of workerSpecs) {
    const wgObj = getRecord(wg)
    const replicas = safeNumber(wgObj.replicas, 1)
    const template = getRecord(getRecord(wgObj.template).spec)
    const containers = Array.isArray(template.containers) ? template.containers : []
    for (const c of containers) {
      const limits = getRecord(getRecord(getRecord(c).resources).limits)
      const gpuVal = limits['nvidia.com/gpu']
      if (gpuVal) gpuCount += safeNumber(gpuVal) * replicas
    }

Copilot AI Mar 31, 2026

safeNumber() only accepts numeric JSON values, but Kubernetes resource quantities (including resources.limits['nvidia.com/gpu']) are typically serialized as strings (e.g. "1"). As written, gpuCount will stay 0 for real RayCluster specs. Parse numeric strings (and consider handling Quantity formats) before multiplying by replicas.
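
One way to address this is sketched below, assuming a hypothetical helper named parseQuantityCount (not part of the PR). It accepts JSON numbers, plain numeric strings such as "1", and milli-unit strings such as "2000m". It does not attempt the full Kubernetes Quantity grammar (suffixes like Ki or k fall back to the default), so treat it as a starting point rather than a complete parser.

```typescript
// Hypothetical helper: the name and behavior are assumptions for this sketch.
// Accepts a JSON number or a simple quantity string and returns a
// non-negative count, or the fallback when the value cannot be interpreted.
function parseQuantityCount(val: unknown, fallback = 0): number {
  if (typeof val === 'number') {
    return Number.isFinite(val) && val >= 0 ? val : fallback
  }
  if (typeof val !== 'string') return fallback

  const trimmed = val.trim()

  // Plain integers or decimals such as "1", "8", "0.5"
  if (/^\d+(\.\d+)?$/.test(trimmed)) return Number(trimmed)

  // Milli-units such as "2000m" (unusual for GPUs, handled defensively)
  const milli = /^(\d+)m$/.exec(trimmed)
  if (milli) return Number(milli[1]) / 1000

  // Anything else (binary/decimal suffixes, garbage) is not handled here
  return fallback
}

// Usage in the spirit of the reviewed loop: a string limit times replicas
const limits: Record<string, unknown> = { 'nvidia.com/gpu': '2' }
const replicas = 3
const gpuCount = parseQuantityCount(limits['nvidia.com/gpu']) * replicas
```

With this helper, a worker group that declares a string limit of "2" and 3 replicas contributes 6 GPUs to the total instead of 0.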

Copilot uses AI. Check for mistakes.
Comment on lines +182 to +183
if (!detected) throw new Error('No Ray resources detected')

Copilot AI Mar 31, 2026

Treating "no Ray resources detected" as an exception marks the cache as failed and increments consecutiveFailures, which can surface as an error/offline state even though "not installed" is a valid empty state. Return detected: false (with empty arrays + lastCheckTime) instead of throwing so the card can show the intended empty state without reporting a fetch failure.

Suggested change
- if (!detected) throw new Error('No Ray resources detected')
+ if (!detected) {
+   return {
+     detected: false,
+     rayClusters: [],
+     rayServices: [],
+     rayJobs: [],
+     totalGPUs: 0,
+     lastCheckTime: new Date().toISOString(),
+   }
+ }

])

const detected = coordinators.length > 0 || workers.length > 0 || gatewayPods.length > 0
if (!detected) throw new Error('No Trino resources detected')

Copilot AI Mar 31, 2026

fetchTrinoGatewayData() throws when no Trino/Gateway pods are found. That turns a normal "not installed" state into a failed fetch (and bumps consecutiveFailures). Return detected: false (plus lastCheckTime) instead of throwing so the card can render the empty state without being marked as failed.

Suggested change
- if (!detected) throw new Error('No Trino resources detected')
+ if (!detected) {
+   return {
+     detected: false,
+     trinoClusters: [],
+     gateways: [],
+     totalWorkers: 0,
+     totalActiveQueries: 0,
+     lastCheckTime: new Date().toISOString(),
+   }
+ }
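
The distinction both suggestions rely on (an empty result is not a failure) can be sketched as follows. The FetchOutcome, DetectionResult, and ViewState names are assumptions made for illustration and are not taken from the PR or its caching layer.

```typescript
// Hypothetical types: names and shapes are assumptions for this sketch.
interface DetectionResult {
  detected: boolean
  lastCheckTime: string
}

type FetchOutcome<T> = { ok: true; data: T } | { ok: false; error: Error }

type ViewState = 'ready' | 'empty' | 'error'

// Maps a fetch outcome to what a card should render. Only a real failure
// (thrown error, bad response) becomes 'error'; "not installed" is 'empty'.
function toViewState(outcome: FetchOutcome<DetectionResult>): ViewState {
  if (!outcome.ok) return 'error'
  return outcome.data.detected ? 'ready' : 'empty'
}

// A fetcher that returns detected: false leads to the empty state,
// without being counted as a failed fetch.
const notInstalled: FetchOutcome<DetectionResult> = {
  ok: true,
  data: { detected: false, lastCheckTime: new Date().toISOString() },
}
const state = toViewState(notInstalled)
```

Keeping "not installed" on the success path means failure counters only move when a fetch actually fails.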

Comment on lines +74 to +95
async function fetchSLOCompliance(): Promise<SLOComplianceData> {
  // Step 1: Get SLO target configuration from the backend
  const configResp = await authFetch('/api/mcp/slo-targets', {
    headers: { Accept: 'application/json' },
    signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
  })
  if (!configResp.ok) throw new Error('SLO targets not configured')

  const configBody: SLOConfigResponse = await configResp.json().catch(() => ({}))
  const sloConfigs = configBody.targets ?? []
  if (sloConfigs.length === 0) throw new Error('No SLO targets defined')

  // Step 2: Query Prometheus for each target's current compliance
  const targets: SLOTarget[] = await Promise.all(
    sloConfigs.map(async (cfg) => {
      try {
        const params = new URLSearchParams({ query: cfg.query })
        const resp = await authFetch(`/api/mcp/prometheus/query?${params}`, {
          headers: { Accept: 'application/json' },
          signal: AbortSignal.timeout(FETCH_DEFAULT_TIMEOUT_MS),
        })
        if (!resp.ok) {

Copilot AI Mar 31, 2026

These API routes don't appear to exist in the repo (/api/mcp/slo-targets and /api/mcp/prometheus/query): the backend registers /api/mcp/... handlers, but only the agent exposes /prometheus/query (no /api/mcp/prometheus/query route), and there is no slo-targets handler. This will 404 and permanently keep the card in a failed/empty state in live mode. Either add the corresponding backend endpoints + mocks, or change the card to use the existing agent Prometheus proxy (${LOCAL_AGENT_HTTP_URL}/prometheus/query) and a real source for SLO target config.

Comment on lines +55 to +60
function calculateBurnRate(target: SLOTarget): number {
  const budgetUsed = FULL_COMPLIANCE - target.currentCompliance
  const budgetAllowed = FULL_COMPLIANCE - target.threshold
  if (budgetAllowed <= 0) return 0
  return budgetUsed / budgetAllowed
}

Copilot AI Mar 31, 2026

calculateBurnRate() assumes target.threshold is a 0–100% SLO objective, but the UI/data model also uses thresholds like 500ms (see demo data) and displays them as latency/error targets. For non-percent thresholds this makes budgetAllowed negative and forces burn rate to 0, hiding problems. Burn rate needs to be computed from an explicit objective percentage (e.g. 99.9%) or from error budget math that matches the metric type, rather than subtracting a latency threshold from 100.
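
A minimal sketch of one possible fix follows. It assumes the target carries an explicit unit field, which is a hypothetical addition and not confirmed by the PR. The function returns null when the ratio is undefined, so the UI can skip the gauge instead of showing a misleading 0.

```typescript
const FULL_COMPLIANCE = 100

// Hypothetical shape: the `unit` field and this interface name are
// assumptions for the sketch, not the PR's SLOTarget type.
interface SLOTargetSketch {
  name: string
  unit: 'percent' | 'ms'
  threshold: number // objective in percent (e.g. 99.9) or a latency bound in ms
  currentCompliance: number // percent of good events, 0 to 100
}

// Fraction of the error budget consumed, or null when it is not defined
// for this target (non-percentage threshold, or a 100% objective).
function burnRate(target: SLOTargetSketch): number | null {
  if (target.unit !== 'percent') return null
  const budgetAllowed = FULL_COMPLIANCE - target.threshold
  if (budgetAllowed <= 0) return null
  const budgetUsed = FULL_COMPLIANCE - target.currentCompliance
  return Math.max(0, budgetUsed / budgetAllowed)
}

const availability: SLOTargetSketch = {
  name: 'availability', unit: 'percent', threshold: 99, currentCompliance: 99.5,
}
const latency: SLOTargetSketch = {
  name: 'ttft', unit: 'ms', threshold: 500, currentCompliance: 97,
}

const availabilityBurn = burnRate(availability) // half of the 1-point budget used
const latencyBurn = burnRate(latency) // null: threshold is not a percentage
```

Returning null rather than 0 keeps "no burn" and "burn not computable" distinguishable in the per-target rows.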

Comment on lines +192 to +210
// Find binding reschedules within the correlation window
for (const bt of bindingTransitions) {
  if (!bt.scheduledTime) continue
  const bindingMs = new Date(bt.scheduledTime).getTime()
  if (isNaN(bindingMs)) continue

  const delta = bindingMs - transitionMs
  if (delta >= 0 && delta <= CORRELATION_WINDOW_MS) {
    const targetCluster = bt.clusters.length > 0 ? bt.clusters[0] : 'unknown'
    events.push({
      timestamp: bt.scheduledTime,
      eventType: 'binding_reschedule' as FailoverEventType,
      cluster: targetCluster,
      workload: bt.resourceKind ? `${bt.resourceKind}/${bt.bindingName}` : bt.bindingName,
      details: `ResourceBinding rescheduled from ${ct.clusterName} to ${targetCluster}`,
      severity: 'warning' as FailoverSeverity,
    })
  }
}

Copilot AI Mar 31, 2026

In the cluster-down correlation block, reschedule events are emitted for any binding that became Scheduled within the correlation window, even if it wasn't actually rescheduled (i.e., bt.isRescheduled is false). This can generate false "Reschedule" events unrelated to failover. Filter to bindings that are explicitly marked rescheduled (or otherwise verify the binding changed due to the down cluster) before emitting binding_reschedule events.
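
The filter described above might look like the sketch below. The interface is a reduced stand-in for the PR's binding transition type, and the 5-minute CORRELATION_WINDOW_MS value is an assumption, since the actual constant is not shown in the hunk.

```typescript
// Reduced stand-in for the PR's binding transition type (assumed fields).
interface BindingTransitionSketch {
  bindingName: string
  scheduledTime?: string
  isRescheduled: boolean
  clusters: string[]
}

// Assumed value for illustration; the real constant is not visible in the diff.
const CORRELATION_WINDOW_MS = 5 * 60 * 1000

// Keeps only bindings that are explicitly marked as rescheduled AND whose
// scheduled time falls inside the window after the cluster went down.
function reschedulesWithinWindow(
  bindings: BindingTransitionSketch[],
  clusterDownMs: number,
  windowMs: number = CORRELATION_WINDOW_MS,
): BindingTransitionSketch[] {
  return bindings.filter((bt) => {
    if (!bt.isRescheduled || !bt.scheduledTime) return false
    const bindingMs = new Date(bt.scheduledTime).getTime()
    if (Number.isNaN(bindingMs)) return false
    const delta = bindingMs - clusterDownMs
    return delta >= 0 && delta <= windowMs
  })
}

const downAt = Date.parse('2026-03-31T16:00:00Z')
const correlated = reschedulesWithinWindow(
  [
    { bindingName: 'a', scheduledTime: '2026-03-31T16:02:00Z', isRescheduled: true, clusters: ['eu-1'] },
    { bindingName: 'b', scheduledTime: '2026-03-31T16:02:00Z', isRescheduled: false, clusters: ['eu-1'] },
    { bindingName: 'c', scheduledTime: '2026-03-31T16:10:00Z', isRescheduled: true, clusters: ['eu-2'] },
  ],
  downAt,
)
```

Here only binding a survives: b was scheduled in the window but never rescheduled, and c was rescheduled outside the window.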

Comment on lines +230 to +233
const alreadyCorrelated = events.some(
  e => e.eventType === 'binding_reschedule' && e.timestamp === bt.scheduledTime,
)
if (alreadyCorrelated) continue

Copilot AI Mar 31, 2026

alreadyCorrelated de-dupes reschedule events only by eventType + timestamp. If multiple ResourceBindings reschedule at the same scheduledTime, later ones will be dropped. Include a binding/workload identifier (e.g. workload or bindingName) in the de-dupe key to avoid losing events.
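
A composite de-dupe key along those lines could be sketched as follows. The event shape is reduced to the three fields that matter here, and the eventKey and dedupeEvents names are hypothetical. A Set lookup also avoids rescanning the events array for every binding.

```typescript
// Reduced event shape for illustration (assumed subset of the PR's type).
interface TimelineEventSketch {
  timestamp: string
  eventType: string
  workload: string
}

// Hypothetical helper: builds a key that includes the workload, so two
// bindings rescheduled at the same instant stay distinct.
function eventKey(e: TimelineEventSketch): string {
  return `${e.eventType}|${e.workload}|${e.timestamp}`
}

// Drops exact repeats while preserving first-seen order.
function dedupeEvents(events: TimelineEventSketch[]): TimelineEventSketch[] {
  const seen = new Set<string>()
  const unique: TimelineEventSketch[] = []
  for (const e of events) {
    const key = eventKey(e)
    if (seen.has(key)) continue
    seen.add(key)
    unique.push(e)
  }
  return unique
}

const ts = '2026-03-31T16:02:00Z'
const deduped = dedupeEvents([
  { timestamp: ts, eventType: 'binding_reschedule', workload: 'Deployment/api' },
  { timestamp: ts, eventType: 'binding_reschedule', workload: 'Deployment/worker' },
  { timestamp: ts, eventType: 'binding_reschedule', workload: 'Deployment/api' },
])
```

Both workloads are kept even though they share a timestamp, and only the exact repeat of Deployment/api is dropped.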

@clubanderson
Collaborator Author

🔄 Auto-Applying Copilot Code Review

Copilot code review found 2 code suggestion(s) and 5 general comment(s).

@copilot Please apply all of the following code review suggestions:

  • web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts (line 183): if (!detected) { return { detected: false, rayClusters: [], ...
  • web/src/components/cards/trino_gateway/useTrinoGateway.ts (line 201): if (!detected) { return { detected: false, trinoClusters: [], ...

Also address these general comments:

  • web/src/components/cards/kuberay_fleet/useKubeRayFleet.ts (line 114): safeNumber() only accepts numeric JSON values, but Kubernetes resource quantities (including `resources.limits['nvidia
  • web/src/components/cards/slo_compliance/useSLOCompliance.ts (line 95): These API routes don't appear to exist in the repo (/api/mcp/slo-targets and /api/mcp/prometheus/query): the backend
  • web/src/components/cards/slo_compliance/SLOCompliance.tsx (line 60): calculateBurnRate() assumes target.threshold is a 0–100% SLO objective, but the UI/data model also uses thresholds l
  • web/src/components/cards/failover_timeline/useFailoverTimeline.ts (line 210): In the cluster-down correlation block, reschedule events are emitted for any binding that became Scheduled within the
  • web/src/components/cards/failover_timeline/useFailoverTimeline.ts (line 233): alreadyCorrelated de-dupes reschedule events only by eventType + timestamp. If multiple ResourceBindings reschedul

Push all fixes in a single commit. Run cd web && npm run build && npm run lint before committing.


Auto-generated by copilot-review-apply workflow.

@github-actions github-actions bot mentioned this pull request Mar 31, 2026
clubanderson added a commit that referenced this pull request Mar 31, 2026
- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing,
  use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
clubanderson added a commit that referenced this pull request Mar 31, 2026
* 🐛 fix: address Copilot review comments from PRs #4008 and #4009

- KubeRay: parse string GPU quantities, return empty state instead of throwing
- Trino Gateway: return empty state instead of throwing when not detected
- SLO Compliance: skip burn rate for non-percentage targets, fix API routes
- Failover Timeline: filter to rescheduled bindings only, include workload in dedup key
- GPU History: aggregate overflow types into "Other", per-node churn diffing,
  use mean allocated for Little's Law, clamp table page, add dropdown close handlers
- Fix stale "24 hours" comments to match 7-day retention

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* 🐛 fix: use explicit isLoading pattern for card-standard compliance

The card-standard CI check requires `isLoading: var && !hasData` instead
of shorthand `isLoading` in useCardLoadingState calls.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>

* Initial plan

Agent-Logs-Url: https://github.com/kubestellar/console/sessions/c7cb576d-dfee-4502-bbb7-c682f8489def

Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

---------

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: clubanderson <407614+clubanderson@users.noreply.github.com>

Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


3 participants