refactor(dashboard): align BFF data layer with OpenAPI v1 contract and improve job feed#282
Conversation
…PI compliance - Enhanced BFF warm route documentation to reference the upstream OpenAPI contract. - Updated developer network models endpoint to merge streaming and requests models. - Refined NAAP API warm route comments to clarify data sources. - Adjusted performance by model resolver to utilize a single cache key for streaming models. - Improved dashboard hooks to handle new data structures and ensure compatibility with requests. - Introduced new utility functions for dashboard window formatting and timeout management. - Merged streaming and requests model data in the network data layer for better consistency. - Updated orchestrator and pipeline resolvers to reflect changes in data fetching and merging logic. - Enhanced pricing resolver to accommodate new OpenAPI v1 structure and ensure accurate pricing data representation.
- Introduced a new utility function to determine the effective upstream window for orchestrators based on the requested period. - Updated the orchestrator cache key to utilize the effective window for improved data accuracy. - Simplified the KPI resolver by removing unnecessary net orchestrator data fetching and directly returning parsed KPI data. - Enhanced documentation to clarify the behavior of the orchestrator endpoint and its limitations on timeframes.
…ed performance - Changed default sorting in PipelinesCard from 'model' to 'gpus' to prioritize busy models. - Introduced OVERVIEW_TIMEFRAME_MAX_HOURS constant to cap timeframe options and ensure consistent handling across various resolvers. - Updated KPI and pipeline resolvers to utilize the new normalization function for timeframe parsing, enhancing data accuracy and performance.
…sponse handling - Updated the JobFeedCard component to conditionally display input FPS, ensuring proper formatting when values are null. - Clarified documentation in orchestrators resolver regarding the scaling of API response values for effectiveSuccessRate, noSwapRatio, slaScore, and successRatio. - Enhanced error handling in the developer API server by adding labels to the ingest function for better logging of failed requests.
…oved error handling - Added a `bustCache` option to `appendJobFeedPollQuery` for cache invalidation. - Updated `fetchJobFeed` to accept options for cache busting and improved error handling. - Introduced `JOB_FEED_RECOVERY_RETRY_DELAYS` for better retry logic on upstream errors. - Enhanced logging to reflect the actual request URL during fetch operations.
- Enhanced `fetchJobFeed` to accept a base URL and effective polling interval, improving flexibility. - Updated retry logic with `JOB_FEED_RECOVERY_RETRY_DELAYS` for better handling of transient errors. - Refactored job feed polling to utilize the new fetch function, ensuring consistent cache busting behavior. - Added documentation for recovery backoff strategy after failed polls.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
📝 WalkthroughWalkthroughThis PR migrates dashboard data aggregation from legacy NAAP endpoints to OpenAPI v1 Changes
Sequence DiagramssequenceDiagram
participant Dashboard as Dashboard<br/>(Client)
participant BFF as BFF Facade<br/>(Server)
participant NAAP as NAAP API<br/>(Upstream)
Dashboard->>BFF: GET /api/v1/dashboard/kpi<br/>(timeframe)
BFF->>BFF: normalizeTimeframeHours()<br/>formatDashboardWindow()
BFF->>NAAP: GET /dashboard/kpi<br/>(window, timeoutMs)
NAAP-->>BFF: { streaming: {...},<br/>requests?: {...} }
BFF->>BFF: parseDashboardKpiWithRequests()<br/>(merge KPI + requests)
BFF-->>Dashboard: DashboardKPIWithRequests<br/>(orchestratorsObserved,<br/>requests metadata)
sequenceDiagram
participant Dashboard as Dashboard<br/>(Client)
participant BFF as BFF Facade<br/>(Server)
participant Cache as SWR/Redis<br/>(Cache)
participant StreamAPI as NAAP<br/>Streaming API
participant ReqAPI as NAAP<br/>Requests API
Dashboard->>BFF: GET /api/v1/facade/net/orchestrators
BFF->>Cache: Check (streaming+requests)<br/>aggregated cache
alt Cache Hit
Cache-->>BFF: cached orchestrator data
else Cache Miss
par
BFF->>StreamAPI: GET /v1/streaming/orchestrators
BFF->>ReqAPI: GET /v1/requests/orchestrators
and
end
StreamAPI-->>BFF: rows (models, gpu_count)
ReqAPI-->>BFF: rows (capabilities, last_seen)
BFF->>BFF: ingestRow() aggregate<br/>dedupe by address
BFF->>BFF: merge inventory URIs<br/>with dashboard serviceUri
BFF->>Cache: store merged result
Cache-->>BFF: cached data
end
BFF-->>Dashboard: NetOrchestratorData
sequenceDiagram
participant Dashboard as Dashboard<br/>(Client)
participant BFF as BFF Facade<br/>(Server)
participant Cache as SWR/Redis<br/>(Cache)
participant StreamAPI as NAAP<br/>Streaming API
participant ReqAPI as NAAP<br/>Requests API
participant PricingAPI as Pricing<br/>API
Dashboard->>BFF: GET /api/v1/facade/network/models
BFF->>Cache: Check (streaming+requests)<br/>merged cache
alt Cache Hit
Cache-->>BFF: merged NetworkModel[]
else Cache Miss
par
BFF->>StreamAPI: GET /v1/streaming/models
BFF->>ReqAPI: GET /v1/requests/models
and
end
StreamAPI-->>BFF: rows (avg_fps, WarmOrchCount)
ReqAPI-->>BFF: rows (TotalCapacity)
BFF->>BFF: merge by pipeline:model<br/>sum counts/capacity
BFF->>PricingAPI: GET /v1/dashboard/pricing
PricingAPI-->>BFF: pricing rows<br/>(legacy or per-orch)
BFF->>BFF: aggregate pricing<br/>compute min/max/avg
BFF->>BFF: enrich NetworkModel<br/>with pricing
BFF->>Cache: store merged+priced
Cache-->>BFF: cached data
end
BFF-->>Dashboard: NetworkModel[]<br/>(with pricing)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The diff involves substantial, multi-pattern changes across facades, resolvers, and API routes: orchestrator/KPI field renames, migration from single-endpoint to merged streaming/requests aggregation patterns, introduction of new cache utilities and parsing abstractions, UI component updates with new props, and failure-recovery logic. While many changes follow consistent patterns (e.g., field renames across contracts), the resolver logic is heterogeneous—each resolver addresses merging, deduplication, or data transformation differently—requiring separate reasoning for pricing, orchestrators, network models, and performance caching. Type/export expansion adds clarity but requires tracing public API surface changes across multiple packages. Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/web-next/src/hooks/useJobFeedStream.ts (1)
90-93:⚠️ Potential issue | 🟡 MinorFallback query builder drops cache-bust flag.
When URL parsing fails,
refresh=1is not appended even ifbustCacheis true, so recovery polling may not bypass stale caches.💡 Proposed fix
} catch { - const sep = fetchUrl.includes('?') ? '&' : '?'; - return `${fetchUrl}${sep}pollMs=${encodeURIComponent(String(ms))}`; + const sep = fetchUrl.includes('?') ? '&' : '?'; + const qs = [`pollMs=${encodeURIComponent(String(ms))}`]; + if (bustCache) qs.push('refresh=1'); + return `${fetchUrl}${sep}${qs.join('&')}`; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/hooks/useJobFeedStream.ts` around lines 90 - 93, The fallback URL builder in useJobFeedStream.ts's catch block currently appends only pollMs and loses the bustCache flag; update the catch branch that builds the fallback string using fetchUrl, sep, and ms to also append refresh=1 when the bustCache boolean is true (e.g., include "&refresh=1" or "?refresh=1" as appropriate), ensuring both pollMs and refresh are URL-encoded via encodeURIComponent and joined using the existing sep logic so recovery polling bypasses cache when requested.plugins/developer-api/backend/src/server.ts (1)
355-374:⚠️ Potential issue | 🟠 Major
limitis not enforced on merged upstream rows.When upstream returns more than
limit, response still includes all merged rows; only catalog backfill is capped. This violates the endpoint’slimitbehavior.💡 Proposed fix
- merged.sort( + merged.sort( (a, b) => a.Pipeline.localeCompare(b.Pipeline) || a.Model.localeCompare(b.Model), ); - return { merged, partial }; + const bounded = limitIsAll ? merged : merged.slice(0, limit ?? merged.length); + return { merged: bounded, partial };Also applies to: 400-404, 430-437
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/developer-api/backend/src/server.ts` around lines 355 - 374, The merged array is not being capped by the endpoint limit so upstream rows can exceed `limit`; after merging upstream rows into `merged` (the array built from `modelRows` and where keys are tracked by `seen` using `netModelRowKey`), enforce the limit by truncating `merged` to `limit` when `limitIsAll` is false (e.g., set merged = merged.slice(0, limit ?? merged.length)); recalc `catalogSlotsRemaining` from the post-truncation length (or compute it before backfill using Math.max(0, (limit ?? 0) - merged.length) but ensure merged has been truncated), and apply the same truncation logic in the other similar merge blocks referenced (the ones around the other fetch/merge sections).
🧹 Nitpick comments (2)
apps/web-next/src/lib/facade/dashboard-window.ts (1)
28-31: Timeout helper can be simplified.
hours <= 72and> 72return the same value, so this branch is redundant and can be collapsed for clarity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/dashboard-window.ts` around lines 28 - 31, The conditional in the timeout helper is redundant: both branches of "if (hours <= 72)" return 55_000; remove the branch and replace it with a single return of 55_000 (i.e., eliminate the if (...) { return 55_000; } return 55_000; pattern) so the helper simply returns 55_000 unconditionally.apps/web-next/src/lib/facade/resolvers/net-capacity.ts (1)
12-12: Don’t swallow net-model fetch failures silently.Returning
[]here hides upstream outages and makes capacity collapse look like valid data. Add a warning log (or metric) before fallback.♻️ Suggested patch
- const models = await getRawNetModels().catch(() => []); + const models = await getRawNetModels().catch((err) => { + console.warn('[facade/net-capacity] getRawNetModels failed; returning empty capacity map:', err); + return []; + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/resolvers/net-capacity.ts` at line 12, The current await getRawNetModels().catch(() => []) swallows errors; change the catch to log a warning with the caught error before falling back to [] so upstream outages are visible — e.g. replace .catch(() => []) with .catch((err) => { /* log a warning including err and context (use existing logger like processLogger or console) */; return []; }) referencing the getRawNetModels call and the models variable so the log includes which fetch failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/web-next/src/lib/facade/network-data.ts`:
- Around line 52-55: The merge logic currently adds incoming values into an
existing entry (existing.WarmOrchCount += warm; existing.TotalCapacity +=
slots), which double-counts when the same (pipeline, model) key appears from
multiple sources; update the merge in the network-data.ts routine that handles
existing entries so it uses max semantics instead of summation: set
existing.WarmOrchCount = Math.max(existing.WarmOrchCount, warm) and
existing.TotalCapacity = Math.max(existing.TotalCapacity, slots) (apply the same
max logic wherever the code merges duplicate (pipeline, model) rows).
- Around line 60-70: The loops over stream and req only read snake_case fields
(r.pipeline, r.model, r.warm_orch_count, r.gpu_slots) so PascalCase payloads are
ignored; update the parsing to accept both snake_case and PascalCase by reading
fallback properties (e.g., pipeline = (r.pipeline ?? r.Pipeline)?.trim() ?? '',
model = (r.model ?? r.Model)?.trim() ?? '', and use Number(r.warm_orch_count ??
r.WarmOrchCount ?? 0) and Number(r.gpu_slots ?? r.GpuSlots ?? 0)) before calling
add; apply the same change in both for..of loops so both stream and req entries
support either casing.
In `@apps/web-next/src/lib/facade/resolvers/net-orchestrators.ts`:
- Around line 157-171: The fetch functions fetchStreamingOrchestrators and
fetchRequestsOrchestrators currently call the offset/limit endpoints without a
limit, risking truncated results; update both naapGet calls to include a query
parameter setting limit=1000 (e.g., pass { limit: 1000 } or include it in the
request options/query) so each function explicitly requests up to 1000 rows and
prevents silent undercounting.
In `@apps/web-next/src/lib/facade/resolvers/perf-by-model.ts`:
- Around line 32-39: The current code swallows all naapGet errors by .catch(()
=> []), which causes transport failures to be cached as empty FPS data; change
the rawRows load so that you do not catch/rewrite network errors—let exceptions
from naapGet propagate (so the route can return 503), but still coerce null or
undefined responses into an empty array (i.e., after awaiting
naapGet<StreamingModelRow[] | null | undefined>(...) set rawRows = result ?? []
rather than catching rejections). Keep the same request options and preserve the
'streaming/models' call and the StreamingModelRow type.
In `@apps/web-next/src/lib/facade/resolvers/pricing.ts`:
- Around line 159-166: The aggregation loop is creating entries for rows whose
pipeline or model are blank/whitespace; before computing pricingKey(r.pipeline,
r.model) and updating map, trim r.pipeline and r.model and skip the row if
either trimmed value is an empty string (e.g., if (!trimmedPipeline ||
!trimmedModel) continue;). Apply the same pre-check in the other aggregation
block that creates slots (the second occurrence around the code handling minWei
at lines ~205-208) so no per-orchestrator rows with blank IDs are inserted into
map or emitted.
In `@packages/plugin-sdk/src/contracts/dashboard.ts`:
- Around line 504-509: The kpi resolver signature in the Dashboard contract
declares pipeline and model_id parameters that are never forwarded; either
remove pipeline and model_id from the kpi type declaration or wire them
end-to-end: update the GraphQL schema to accept pipeline and model_id, update
createDashboardProvider (the provider factory that currently extracts only
window and timeframe) to also extract and forward pipeline and model_id into the
resolver call, and ensure any consumer code that implements kpi handles those
additional args; reference the kpi type in
packages/plugin-sdk/src/contracts/dashboard.ts and the createDashboardProvider
function to apply the change consistently.
In `@plugins/developer-api/backend/src/server.ts`:
- Around line 258-275: Promise.all currently causes total failure if either
fetch rejects or res.json() throws, breaking the intended partial-merge
behavior; change the fetch logic to use a safe pattern (e.g., Promise.allSettled
or a safeFetch wrapper) so failures return a sentinel (null or rejected state)
instead of rejecting the whole Promise, and update ingest (the
ingest(res,label,source) function) to guard the JSON parse in a try/catch (log a
warning on parse error and return false) so transport errors and JSON parse
failures are treated as per-source failures rather than aborting the whole
merge; ensure code that currently reads sRes and rRes handles the sentinel/null
results and still calls ingest only for valid Response objects.
---
Outside diff comments:
In `@apps/web-next/src/hooks/useJobFeedStream.ts`:
- Around line 90-93: The fallback URL builder in useJobFeedStream.ts's catch
block currently appends only pollMs and loses the bustCache flag; update the
catch branch that builds the fallback string using fetchUrl, sep, and ms to also
append refresh=1 when the bustCache boolean is true (e.g., include "&refresh=1"
or "?refresh=1" as appropriate), ensuring both pollMs and refresh are
URL-encoded via encodeURIComponent and joined using the existing sep logic so
recovery polling bypasses cache when requested.
In `@plugins/developer-api/backend/src/server.ts`:
- Around line 355-374: The merged array is not being capped by the endpoint
limit so upstream rows can exceed `limit`; after merging upstream rows into
`merged` (the array built from `modelRows` and where keys are tracked by `seen`
using `netModelRowKey`), enforce the limit by truncating `merged` to `limit`
when `limitIsAll` is false (e.g., set merged = merged.slice(0, limit ??
merged.length)); recalc `catalogSlotsRemaining` from the post-truncation length
(or compute it before backfill using Math.max(0, (limit ?? 0) - merged.length)
but ensure merged has been truncated), and apply the same truncation logic in
the other similar merge blocks referenced (the ones around the other fetch/merge
sections).
---
Nitpick comments:
In `@apps/web-next/src/lib/facade/dashboard-window.ts`:
- Around line 28-31: The conditional in the timeout helper is redundant: both
branches of "if (hours <= 72)" return 55_000; remove the branch and replace it
with a single return of 55_000 (i.e., eliminate the if (...) { return 55_000; }
return 55_000; pattern) so the helper simply returns 55_000 unconditionally.
In `@apps/web-next/src/lib/facade/resolvers/net-capacity.ts`:
- Line 12: The current await getRawNetModels().catch(() => []) swallows errors;
change the catch to log a warning with the caught error before falling back to
[] so upstream outages are visible — e.g. replace .catch(() => []) with
.catch((err) => { /* log a warning including err and context (use existing
logger like processLogger or console) */; return []; }) referencing the
getRawNetModels call and the models variable so the log includes which fetch
failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 93f0e15a-ba73-4410-ad65-dfdebdd5543b
📒 Files selected for processing (33)
apps/web-next/src/app/(dashboard)/dashboard/page.tsxapps/web-next/src/app/api/internal/bff-warm/route.tsapps/web-next/src/app/api/v1/dashboard/orchestrators/route.tsapps/web-next/src/app/api/v1/developer/network-models/route.tsapps/web-next/src/app/api/v1/naap-api/warm/route.tsapps/web-next/src/app/api/v1/network/perf-by-model/route.tsapps/web-next/src/components/dashboard/overview-content.tsxapps/web-next/src/content/docs/guides/dashboard-data-provider.mdxapps/web-next/src/hooks/useJobFeedStream.tsapps/web-next/src/hooks/usePublicDashboard.tsapps/web-next/src/lib/dashboard/overview-timeframe.tsapps/web-next/src/lib/facade/dashboard-window.tsapps/web-next/src/lib/facade/index.tsapps/web-next/src/lib/facade/network-data.tsapps/web-next/src/lib/facade/resolvers/gpu-capacity.tsapps/web-next/src/lib/facade/resolvers/kpi.tsapps/web-next/src/lib/facade/resolvers/net-capacity.tsapps/web-next/src/lib/facade/resolvers/net-orchestrators.tsapps/web-next/src/lib/facade/resolvers/orchestrators.tsapps/web-next/src/lib/facade/resolvers/perf-by-model.tsapps/web-next/src/lib/facade/resolvers/pipelines.tsapps/web-next/src/lib/facade/resolvers/pricing.tsapps/web-next/src/lib/facade/stubs.tsapps/web-next/src/lib/facade/upstream-parse.tspackages/plugin-sdk/src/contracts/__tests__/dashboard.test.tspackages/plugin-sdk/src/contracts/dashboard.tspackages/plugin-sdk/src/contracts/index.tspackages/plugin-sdk/src/index.tspackages/plugin-sdk/src/types/network-model.tspackages/types/src/index.tsplugins/dashboard-data-provider/frontend/src/__tests__/provider.test.tsplugins/dashboard-data-provider/frontend/src/provider.tsplugins/developer-api/backend/src/server.ts
…filter arguments - Updated the `kpi` query in the dashboard schema to accept `pipeline` and `model_id` as filter arguments, improving query flexibility. - Modified the `createDashboardProvider` function to forward the new arguments to the KPI resolver. - Added tests to verify the correct handling of the new filter arguments in the KPI resolver. - Enhanced error handling in the developer API for improved robustness during data fetching.
Summary
This PR re-applies #270 which was rolled back due to an upstream data issue which has been resolved.
This change refactors the dashboard's BFF (Backend for Frontend) data layer to align with the upstream OpenAPI v1 contract, consolidates streaming and requests model data, improves orchestrator/pipeline data handling, and hardens job feed polling with cache busting and recovery retry logic.
Changes
API / Data Layer
Orchestrator & KPI Resolvers
orchestratorsOnline→orchestratorObservedto better reflect actual semanticsPipeline & Timeframe Handling
OVERVIEW_TIMEFRAME_MAX_HOURSconstant to cap timeframe options consistently across resolversPipelinesCardfrommodeltogpusto surface the busiest models firstJob Feed
bustCacheoption toappendJobFeedPollQueryfor explicit cache invalidationJOB_FEED_RECOVERY_RETRY_DELAYSfor staged recovery backoff after upstream errorsfetchJobFeedto accept a base URL and effective polling interval for better flexibilityUI / Display Fixes
JobFeedCardnow conditionally renders input FPS only when a non-null value is presenteffectiveSuccessRate,noSwapRatio,slaScore, andsuccessRatioare scaled API response values in the orchestrators resolverCleanup
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Documentation