feat(web-next): align NAAP dashboard and BFF with OpenAPI v1 (streaming + requests)#270
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR consolidates dashboard data sources by fetching from merged NAAP streaming and requests endpoints instead of legacy single endpoints. It renames the orchestrators KPI metric from Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant BFF as Web Server<br/>(Facade)
participant StreamModels as NAAP:<br/>streaming/models
participant RequestModels as NAAP:<br/>requests/models
participant Pricing as NAAP:<br/>dashboard/pricing
participant Cache as Redis Cache
Client->>BFF: GET /api/v1/network/models
alt Cache Hit
BFF->>Cache: lookup networkModels
Cache-->>BFF: cached payload
else Cache Miss
par Concurrent Fetch
BFF->>+StreamModels: GET /v1/streaming/models
BFF->>+RequestModels: GET /v1/requests/models
and
BFF->>+Pricing: GET /v1/dashboard/pricing
end
StreamModels-->>-BFF: rows (pipeline, model, warm_orch_count...)
RequestModels-->>-BFF: rows (pipeline, model, gpu_slots...)
Pricing-->>-BFF: pricing rows (min/max/avg per key)
Note over BFF: Merge by pipeline:model key<br/>Sum capacity fields<br/>Aggregate pricing
BFF->>Cache: store aggregated models
end
BFF-->>Client: NetworkModel[] with pricing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
…PI compliance - Enhanced BFF warm route documentation to reference the upstream OpenAPI contract. - Updated developer network models endpoint to merge streaming and requests models. - Refined NAAP API warm route comments to clarify data sources. - Adjusted performance by model resolver to utilize a single cache key for streaming models. - Improved dashboard hooks to handle new data structures and ensure compatibility with requests. - Introduced new utility functions for dashboard window formatting and timeout management. - Merged streaming and requests model data in the network data layer for better consistency. - Updated orchestrator and pipeline resolvers to reflect changes in data fetching and merging logic. - Enhanced pricing resolver to accommodate new OpenAPI v1 structure and ensure accurate pricing data representation.
- Introduced a new utility function to determine the effective upstream window for orchestrators based on the requested period. - Updated the orchestrator cache key to utilize the effective window for improved data accuracy. - Simplified the KPI resolver by removing unnecessary net orchestrator data fetching and directly returning parsed KPI data. - Enhanced documentation to clarify the behavior of the orchestrator endpoint and its limitations on timeframes.
3733b93 to
09f65b3
Compare
…ed performance - Changed default sorting in PipelinesCard from 'model' to 'gpus' to prioritize busy models. - Introduced OVERVIEW_TIMEFRAME_MAX_HOURS constant to cap timeframe options and ensure consistent handling across various resolvers. - Updated KPI and pipeline resolvers to utilize the new normalization function for timeframe parsing, enhancing data accuracy and performance.
|
Two issues remain blockers for this PR. ** partial blocker ** workaround added in @53bb1cea2d121a864580b268dc5b464e87f75514 to query only past 48 hours
|
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/web-next/src/components/dashboard/overview-content.tsx (1)
798-852:⚠️ Potential issue | 🟡 MinorKeep missing FPS values last when sorting descending.
Now that missing FPS is
null, descending FPS sort puts missing rows first becausemissingLast()returns1and the final comparator reverses it.Proposed comparator fix
let c = 0; + const applyDirection = (inner: number): number => (sortDir === 'asc' ? inner : -inner); const missingLast = (aMissing: boolean, bMissing: boolean, inner: number): number => { if (aMissing && bMissing) return 0; if (aMissing) return 1; if (bMissing) return -1; - return inner; + return applyDirection(inner); }; switch (sortCol) { case 'model': - c = a.localeCompare(b); + c = applyDirection(a.localeCompare(b)); break; case 'gpus': - c = ra.gpus - rb.gpus; + c = applyDirection(ra.gpus - rb.gpus); break; @@ } - if (c !== 0) return sortDir === 'asc' ? c : -c; + if (c !== 0) return c; return a.localeCompare(b);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/components/dashboard/overview-content.tsx` around lines 798 - 852, The comparator puts missing FPS rows first when sorting descending because missingLast currently always returns 1 for aMissing and -1 for bMissing and then the final sortDir reversal flips that; update missingLast (used in the sort block) to take sortDir into account (capture the outer sortDir or add a parameter) so it returns 1 for aMissing only when sortDir === 'asc' and -1 when sortDir === 'desc' (and the inverse for bMissing), then keep the existing final reversal logic; this ensures cases like the 'fps' branch (ra.modelFps / rb.modelFps) keep missing values last regardless of asc/desc.packages/plugin-sdk/src/contracts/dashboard.ts (1)
70-79:⚠️ Potential issue | 🟡 MinorAdd plugin-sdk CHANGELOG entry for breaking
KPI.orchestratorsObservedrename.The field rename from
orchestratorsOnline→orchestratorsObservedis already complete in both GraphQL schema and TypeScript contracts. However, plugin-sdk has no CHANGELOG documenting this breaking change for downstream consumers. Add a semver-major entry (or breaking-change note) to track this contract change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/plugin-sdk/src/contracts/dashboard.ts` around lines 70 - 79, Add a semver-major breaking-change entry to the plugin-sdk CHANGELOG documenting the field rename orchestratorsOnline → orchestratorsObserved in the GraphQL contract and TypeScript types; mention the affected contract/type (KPI.orchestratorsObserved) and that downstream consumers must update references, include the version bump or "BREAKING CHANGE" header, and provide a short migration note showing the old and new field names so consumers can find and update usages.
🧹 Nitpick comments (7)
docs/pymthouse-minimal-design.md (2)
341-355: Considerstale-if-errordirective for improved discovery resilience.The error handling strategy uses
Cache-Control: public, max-age=0, s-maxage=5, stale-while-revalidate=0, which provides fail-closed behavior (no stale data on error). This is conservative and appropriate for preventing clients from caching error states.However, line 173 states: "If NaaP discovery is temporarily unavailable, clients receive a
503..." — emphasizing temporary unavailability. For transient failures, allowing stale discovery data may provide better user experience than hard failure.Consider whether the
stale-if-errordirective would improve resilience:Cache-Control: public, max-age=0, s-maxage=1800, stale-if-error=3600This would allow CDN/clients to serve up to 1-hour-old discovery data if NaaP returns a 503, reducing the blast radius of transient infrastructure issues while still preventing indefinite stale data serving.
Trade-off: slightly stale orchestrator data vs. complete discovery unavailability during incidents.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/pymthouse-minimal-design.md` around lines 341 - 355, Update the Error Handling Cache-Control recommendation to include a stale-if-error directive for transient failures: in the "Error Handling" section where the header is currently specified as "Cache-Control: public, max-age=0, s-maxage=5, stale-while-revalidate=0" (and the sentence "If NaaP discovery is temporarily unavailable, clients receive a `503`..."), change the suggested header to include stale-if-error (for example "public, max-age=0, s-maxage=1800, stale-if-error=3600") and add a short note explaining that this allows CDNs/clients to serve recent discovery data on 503s to improve resilience while accepting limited staleness.
280-282: Add language identifier to fenced code block.The fenced code block on line 280 lacks a language specifier, which can affect syntax highlighting and accessibility in some markdown renderers.
🎨 Suggested fix
-``` +```http GET /api/v1/network/discover-orchestrators</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/pymthouse-minimal-design.mdaround lines 280 - 282, The fenced code
block containing the HTTP request "GET /api/v1/network/discover-orchestrators"
is missing a language identifier; update the opening fence to include the
language (e.g., usehttp) so the block becomes fenced withhttp to enable
proper syntax highlighting and accessibility for the GET
/api/v1/network/discover-orchestrators example.</details> </blockquote></details> <details> <summary>plugins/developer-api/backend/src/server.ts (1)</summary><blockquote> `400-406`: **Brittle error-to-status mapping via substring match.** Matching `error.message.includes('NAAP_API_SERVER_URL')` couples control flow to the error's wording. A future message tweak silently downgrades 503 to 502. Prefer a typed error (or sentinel) thrown from `getNaapApiServerBase`. <details> <summary>Suggested approach</summary> ```diff +class NaapConfigError extends Error {} function getNaapApiServerBase(): string { const raw = process.env.NAAP_API_SERVER_URL?.trim(); if (!raw) { - throw new Error( + throw new NaapConfigError( 'NAAP_API_SERVER_URL is not set. Set it to the NAAP OpenAPI base URL (including /v1). See repo root .env.example.', ); } return raw.replace(/\/+$/, ''); } @@ - if (error instanceof Error && error.message.includes('NAAP_API_SERVER_URL')) { - return res.status(503).json({ error: error.message }); - } + if (error instanceof NaapConfigError) { + return res.status(503).json({ error: error.message }); + } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@plugins/developer-api/backend/src/server.ts` around lines 400 - 406, The catch block in server.ts should not rely on error.message.includes; modify getNaapApiServerBase to throw a specific Error subclass or sentinel (e.g., NaapConfigError) when the NAAP API server URL is missing/invalid, and then in the catch branch replace the substring check with an instanceof check against that class (or compare the sentinel) to return 503; leave other errors returning 502. Update references to getNaapApiServerBase and the catch block handling to use the new NaapConfigError type. ``` </details> </blockquote></details> <details> <summary>apps/web-next/src/lib/facade/dashboard-window.ts (1)</summary><blockquote> `24-32`: **Redundant branch in `dashboardUpstreamTimeoutMs`.** The `hours <= 72` branch and the fallback both return `55_000`, so the intermediate check is dead. Either collapse to a single non-`<=24` branch, or differentiate (e.g., cap > 72h to a smaller value or reject) to make intent explicit. <details> <summary>Proposed simplification</summary> ```diff export function dashboardUpstreamTimeoutMs(hours: number): number { if (hours <= 24) { return 30_000; } - if (hours <= 72) { - return 55_000; - } return 55_000; } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/dashboard-window.ts` around lines 24 - 32, The function dashboardUpstreamTimeoutMs has a redundant branch: both the hours <= 72 case and the final fallback return 55_000; simplify by collapsing the branches so that dashboardUpstreamTimeoutMs(hours) returns 30_000 when hours <= 24 and 55_000 otherwise (or, if intended, add distinct handling for >72h), updating the implementation of dashboardUpstreamTimeoutMs to remove the dead hours <= 72 check and make the intent explicit. ``` </details> </blockquote></details> <details> <summary>apps/web-next/src/app/api/v1/network/perf-by-model/route.ts (1)</summary><blockquote> `47-54`: **Intentional single cache key — consider noting start/end are contract-only.** Since the upstream `streaming/models` resolver ignores `start`/`end` and always returns a 24h weighted average, the single SWR key is correct. The inline comment already explains this clearly. You may also want to consider either removing the `start`/`end` query param requirement or keeping a clearer doc note that they're contract-only (not data-affecting) so API consumers don't assume time-window filtering. No change required. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/app/api/v1/network/perf-by-model/route.ts` around lines 47 - 54, The current code uses a single SWR cache key 'perf-by-model:streaming-models' with bffStaleWhileRevalidate and getPerfByModel while upstream ignores start/end; update the inline comment or API doc in this file to state explicitly that the query params start and end are contract-only and do not affect returned data (or optionally remove start/end from the route signature), and reference the bffStaleWhileRevalidate call, the cacheKey 'perf-by-model:streaming-models', and getPerfByModel to make clear where this behavior originates. ``` </details> </blockquote></details> <details> <summary>apps/web-next/src/lib/facade/resolvers/orchestrators.ts (1)</summary><blockquote> `64-74`: **`mergeOrchestratorServiceUris` dedup is exact-match; inventory URIs aren't normalized.** `out.includes(extra)` compares the raw strings. If inventory has `https://orch.example.com` and dashboard returns `https://orch.example.com/` (or different case in the host), both will be kept. Minor, but since this now feeds the `uris` list that drives row-filtering via `hasNonBlankServiceUri`, a small normalization (trim + strip trailing `/`) would reduce UI noise. <details> <summary>♻️ Optional normalization</summary> ```diff function mergeOrchestratorServiceUris(netUris: string[], dash: ApiOrchestrator): string[] { - const out = [...netUris]; + const norm = (u: string) => u.trim().replace(/\/+$/, ''); + const seen = new Set(netUris.map(norm)); + const out = [...netUris]; const extra = (typeof dash.serviceUri === 'string' && dash.serviceUri.trim()) || (typeof dash.service_uri === 'string' && dash.service_uri.trim()) || ''; - if (extra && !out.includes(extra)) { + if (extra && !seen.has(norm(extra))) { out.push(extra); } return out; } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts` around lines 64 - 74, mergeOrchestratorServiceUris currently deduplicates by exact string match so variants like "https://orch.example.com" and "https://orch.example.com/" both appear; update the function to normalize URIs before comparison by trimming whitespace and removing a single trailing slash (e.g., normalize both existing entries in out and the extracted extra from dash.serviceUri / dash.service_uri), compare using these normalized forms, and only push the extra (preferably the normalized form) if no normalized match exists; reference mergeOrchestratorServiceUris and the dash.serviceUri / dash.service_uri extraction when making the change. ``` </details> </blockquote></details> <details> <summary>apps/web-next/src/lib/facade/upstream-parse.ts (1)</summary><blockquote> `22-41`: **Minor: fallback when `streaming` key exists but is nullish loses `requests`.** If upstream ever returns `{ streaming: null, requests: {...} }`, the current guard `o.streaming && typeof o.streaming === 'object'` fails and you fall through to `{ kpi: body as DashboardKPI }`, casting the wrapper itself as KPI and dropping the `requests` breakdown. Harmless given current API shape, but small hardening: <details> <summary>♻️ Defensive fallback</summary> ```diff export function parseDashboardKpiBody(body: unknown): { kpi: DashboardKPI; requests?: DashboardJobsOverview; } { if (body && typeof body === 'object' && 'streaming' in body) { const o = body as { streaming?: DashboardKPI; requests?: DashboardJobsOverview }; if (o.streaming && typeof o.streaming === 'object') { return { kpi: o.streaming, requests: o.requests, }; } + // Wrapper shape present but `streaming` missing/null: preserve requests if any. + return { kpi: {} as DashboardKPI, requests: o.requests }; } return { kpi: body as DashboardKPI }; } ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/upstream-parse.ts` around lines 22 - 41, The guard in parseDashboardKpiBody drops requests when body has a streaming key that's nullish; change the logic in parseDashboardKpiBody to detect the presence of the streaming key (using 'streaming' in body) and when present return { kpi: o.streaming as DashboardKPI, requests: o.requests } so requests are preserved even if o.streaming is null, and keep parseDashboardKpiWithRequests unchanged to merge requests when present. ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In@apps/web-next/src/components/dashboard/overview-content.tsx:
- Line 487: The label currently uses the timeframe variable tfLabel making the
orchestrator count appear timeframe-scoped; update the tile invocation that
renders Server (tile(Server,Orchestrators (${tfLabel}),
data.orchestratorsObserved.value)) to a timeframe-agnostic label such as
"Orchestrators (observed)" or "Orchestrators (current)" — replace the template
stringOrchestrators (${tfLabel})with the new static label while leaving the
Server and data.orchestratorsObserved.value arguments unchanged.- Line 1443: The current JSX in the that renders FPS hides valid outputFps
when inputFps is missing; update the conditional to show outputFps whenever it
exists. Specifically, modify the expression in the (the one referencing
job.inputFps and job.outputFps) to: if job.outputFps != null then render
${job.inputFps != null ? job.inputFps.toFixed(1) : '—'} / ${job.outputFps.toFixed(1)}, otherwise render '—'; this ensures outputFps is
displayed even when inputFps is null while preserving the existing formatting
and fallback.In
@apps/web-next/src/lib/facade/resolvers/net-orchestrators.ts:
- Around line 157-171: The two functions fetchStreamingOrchestrators and
fetchRequestsOrchestrators currently call naapGet without pagination and
therefore only return the first page; fix both by reintroducing a page size
(e.g. NET_ORCHESTRATORS_PAGE_SIZE) and implementing a paginated fetch loop that
calls naapGet repeatedly with pagination parameters (page/limit or offset/cursor
depending on naapGet API), concatenates results into a single
StreamingOrchestratorRow[] / RequestsOrchestratorRow[] and stops when a page
returns fewer items than the page size (or no cursor is returned); apply the
same pagination change for the other calls noted around lines 229–273 so all
orchestrator records are collected before returning.In
@apps/web-next/src/lib/facade/resolvers/orchestrators.ts:
- Around line 13-19: Update the stale header comment to reflect the actual
upstream value ranges: state that effectiveSuccessRate and noSwapRatio are
provided in 0–1 (and are scaled via pct()), while slaScore and successRatio are
provided in 0–100 (slaScore is used via Math.round(dash.slaScore) and
successRatio is handled as Math.round(dash.successRatio * 10) / 10); remove the
assertion that slaScore and successRatio are 0–1 and note that dashboard
endpoints are capped at 24h as before.In
@docs/pymthouse-minimal-design.md:
- Around line 254-270: The doc's Tenant Model mentions client_id RLS but doesn't
list which DB tables are protected; add a new subsection under the "Tenant
Model" heading that explicitly enumerates the tables with PostgreSQL RLS
(examples: usage_records, api_keys, subscriptions, app_users, refresh_tokens,
billing_events), state the canonical RLS pattern (e.g., client_id =
current_setting('app.current_client_id')::uuid), list any tables that are only
filtered at the application layer, and describe exemptions for platform/admin
queries that require cross-tenant visibility so Phase 0 migration and security
review have an unambiguous RLS scope.- Around line 649-664: The docs currently state "Signer API keys stored
encrypted at rest with one application encryption key" and defer per-provider
derived-key encryption; add a required subsection under Decision 7 that mandates
operational key-management controls for the MVP: include a documented and tested
key rotation runbook for the application encryption key, explicit access-control
and audit requirements for which services/roles can access the key, an assurance
that the key is never in source control and must be provisioned via secret
management or secure env vars, and an incident-response procedure (revocation
and re-encryption steps) to follow if the application encryption key is
compromised; reference the existing lines mentioning "Signer API keys", "one
application encryption key", "planned rotation runbook", and the deferred
"per-provider derived-key encryption" so reviewers can find and expand Decision
- Around line 774-784: The header "Phase 1: NaaP Discovery Endpoint (In Progress
— PR#266)" conflicts with all checklist items marked complete; update the
document to reconcile status by either changing the header to indicate
completion (e.g., "Complete — PR#266" or "Done") if all tasks are finished, or
by unchecking any tasks that remain outstanding and adding a short note next to
the header clarifying that completion is pending review/merge; make the change
around the "Phase 1: NaaP Discovery Endpoint (In Progress — PR#266)" heading
and the checklist entries so the header and the eight checkbox items reflect the
same actual state.- Around line 284-289: Update the "Query Parameters" section row for the caps
parameter to explicitly document the multi-value syntax and give a concrete
example URL; mention that caps is repeated and show the preferred format (e.g.,
repeated query params like
?caps=text-to-image/stabilityai/sd-turbo&caps=image-to-video/other) and clarify
the OR semantics for multiple caps values, or state if comma-separated values
are also supported — include one short example line demonstrating the exact
request syntax and a one-sentence note that multiple caps are evaluated with OR
semantics.- Line 339: The doc is ambiguous about whether the pipeline slug regex
^[a-z][a-z0-9-]{0,63}$applies to the full slug or to each slash-separated
segment (e.g.,text-to-image/stabilityai/sd-turbo); update the text to
explicitly state which approach we use: either say "validation is applied
per-segment (each segment must match^[a-z][a-z0-9-]{0,63}$)" or state "the
full slug is validated and use an updated regex that permits/characters" and
provide the chosen behavior and rationale next to the example slug so
implementers know to validate segments vs. use a different pattern.In
@plugins/developer-api/backend/src/server.ts:
- Around line 243-281: The ingest helper currently drops non-ok Response objects
silently (in function ingest called with sRes and rRes), which hides upstream
4xx/5xx failures; update ingest to log a warning when res.ok is false including
res.status and the request URL (or a caller-provided label) and return early,
and additionally in the caller that awaits ingest(sRes) and ingest(rRes) detect
if both responses were non-ok (or both logged failures) and surface an error
(e.g., return a 502) instead of merging into pipeline-catalog-only rows;
reference ingest, sRes, rRes and netModelRowKey to locate where to add the
logging and the dual-failure check.
Outside diff comments:
In@apps/web-next/src/components/dashboard/overview-content.tsx:
- Around line 798-852: The comparator puts missing FPS rows first when sorting
descending because missingLast currently always returns 1 for aMissing and -1
for bMissing and then the final sortDir reversal flips that; update missingLast
(used in the sort block) to take sortDir into account (capture the outer sortDir
or add a parameter) so it returns 1 for aMissing only when sortDir === 'asc' and
-1 when sortDir === 'desc' (and the inverse for bMissing), then keep the
existing final reversal logic; this ensures cases like the 'fps' branch
(ra.modelFps / rb.modelFps) keep missing values last regardless of asc/desc.In
@packages/plugin-sdk/src/contracts/dashboard.ts:
- Around line 70-79: Add a semver-major breaking-change entry to the plugin-sdk
CHANGELOG documenting the field rename orchestratorsOnline →
orchestratorsObserved in the GraphQL contract and TypeScript types; mention the
affected contract/type (KPI.orchestratorsObserved) and that downstream consumers
must update references, include the version bump or "BREAKING CHANGE" header,
and provide a short migration note showing the old and new field names so
consumers can find and update usages.
Nitpick comments:
In@apps/web-next/src/app/api/v1/network/perf-by-model/route.ts:
- Around line 47-54: The current code uses a single SWR cache key
'perf-by-model:streaming-models' with bffStaleWhileRevalidate and getPerfByModel
while upstream ignores start/end; update the inline comment or API doc in this
file to state explicitly that the query params start and end are contract-only
and do not affect returned data (or optionally remove start/end from the route
signature), and reference the bffStaleWhileRevalidate call, the cacheKey
'perf-by-model:streaming-models', and getPerfByModel to make clear where this
behavior originates.In
@apps/web-next/src/lib/facade/dashboard-window.ts:
- Around line 24-32: The function dashboardUpstreamTimeoutMs has a redundant
branch: both the hours <= 72 case and the final fallback return 55_000; simplify
by collapsing the branches so that dashboardUpstreamTimeoutMs(hours) returns
30_000 when hours <= 24 and 55_000 otherwise (or, if intended, add distinct
handling for >72h), updating the implementation of dashboardUpstreamTimeoutMs to
remove the dead hours <= 72 check and make the intent explicit.In
@apps/web-next/src/lib/facade/resolvers/orchestrators.ts:
- Around line 64-74: mergeOrchestratorServiceUris currently deduplicates by
exact string match so variants like "https://orch.example.com" and
"https://orch.example.com/" both appear; update the function to normalize URIs
before comparison by trimming whitespace and removing a single trailing slash
(e.g., normalize both existing entries in out and the extracted extra from
dash.serviceUri / dash.service_uri), compare using these normalized forms, and
only push the extra (preferably the normalized form) if no normalized match
exists; reference mergeOrchestratorServiceUris and the dash.serviceUri /
dash.service_uri extraction when making the change.In
@apps/web-next/src/lib/facade/upstream-parse.ts:
- Around line 22-41: The guard in parseDashboardKpiBody drops requests when body
has a streaming key that's nullish; change the logic in parseDashboardKpiBody to
detect the presence of the streaming key (using 'streaming' in body) and when
present return { kpi: o.streaming as DashboardKPI, requests: o.requests } so
requests are preserved even if o.streaming is null, and keep
parseDashboardKpiWithRequests unchanged to merge requests when present.In
@docs/pymthouse-minimal-design.md:
- Around line 341-355: Update the Error Handling Cache-Control recommendation to
include a stale-if-error directive for transient failures: in the "Error
Handling" section where the header is currently specified as "Cache-Control:
public, max-age=0, s-maxage=5, stale-while-revalidate=0" (and the sentence "If
NaaP discovery is temporarily unavailable, clients receive a503..."), change
the suggested header to include stale-if-error (for example "public, max-age=0,
s-maxage=1800, stale-if-error=3600") and add a short note explaining that this
allows CDNs/clients to serve recent discovery data on 503s to improve resilience
while accepting limited staleness.- Around line 280-282: The fenced code block containing the HTTP request "GET
/api/v1/network/discover-orchestrators" is missing a language identifier; update
the opening fence to include the language (e.g., usehttp) so the block becomes fenced withhttp to enable proper syntax highlighting and
accessibility for the GET /api/v1/network/discover-orchestrators example.In
@plugins/developer-api/backend/src/server.ts:
- Around line 400-406: The catch block in server.ts should not rely on
error.message.includes; modify getNaapApiServerBase to throw a specific Error
subclass or sentinel (e.g., NaapConfigError) when the NAAP API server URL is
missing/invalid, and then in the catch branch replace the substring check with
an instanceof check against that class (or compare the sentinel) to return 503;
leave other errors returning 502. Update references to getNaapApiServerBase and
the catch block handling to use the new NaapConfigError type.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `781edadd-689e-42d4-8c16-23c90a6b9ccf` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 77244d7cc8b07f9a6f77f62bce178db810a54776 and aa89fcb050a54302425de6cb8a00ad6d3fd75203. </details> <details> <summary>📒 Files selected for processing (33)</summary> * `apps/web-next/src/app/(dashboard)/dashboard/page.tsx` * `apps/web-next/src/app/api/internal/bff-warm/route.ts` * `apps/web-next/src/app/api/v1/dashboard/orchestrators/route.ts` * `apps/web-next/src/app/api/v1/developer/network-models/route.ts` * `apps/web-next/src/app/api/v1/naap-api/warm/route.ts` * `apps/web-next/src/app/api/v1/network/perf-by-model/route.ts` * `apps/web-next/src/components/dashboard/overview-content.tsx` * `apps/web-next/src/content/docs/guides/dashboard-data-provider.mdx` * `apps/web-next/src/hooks/usePublicDashboard.ts` * `apps/web-next/src/lib/dashboard/overview-timeframe.ts` * `apps/web-next/src/lib/facade/dashboard-window.ts` * `apps/web-next/src/lib/facade/index.ts` * `apps/web-next/src/lib/facade/network-data.ts` * `apps/web-next/src/lib/facade/resolvers/gpu-capacity.ts` * `apps/web-next/src/lib/facade/resolvers/kpi.ts` * `apps/web-next/src/lib/facade/resolvers/net-capacity.ts` * `apps/web-next/src/lib/facade/resolvers/net-orchestrators.ts` * `apps/web-next/src/lib/facade/resolvers/orchestrators.ts` * `apps/web-next/src/lib/facade/resolvers/perf-by-model.ts` * `apps/web-next/src/lib/facade/resolvers/pipelines.ts` * `apps/web-next/src/lib/facade/resolvers/pricing.ts` * `apps/web-next/src/lib/facade/stubs.ts` * `apps/web-next/src/lib/facade/upstream-parse.ts` * `docs/pymthouse-minimal-design.md` * `packages/plugin-sdk/src/contracts/__tests__/dashboard.test.ts` * `packages/plugin-sdk/src/contracts/dashboard.ts` * `packages/plugin-sdk/src/contracts/index.ts` * `packages/plugin-sdk/src/index.ts` * `packages/plugin-sdk/src/types/network-model.ts` * `packages/types/src/index.ts` * `plugins/dashboard-data-provider/frontend/src/__tests__/provider.test.ts` * `plugins/dashboard-data-provider/frontend/src/provider.ts` * `plugins/developer-api/backend/src/server.ts` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
…sponse handling - Updated the JobFeedCard component to conditionally display input FPS, ensuring proper formatting when values are null. - Clarified documentation in orchestrators resolver regarding the scaling of API response values for effectiveSuccessRate, noSwapRatio, slaScore, and successRatio. - Enhanced error handling in the developer API server by adding labels to the ingest function for better logging of failed requests.
…oved error handling - Added a `bustCache` option to `appendJobFeedPollQuery` for cache invalidation. - Updated `fetchJobFeed` to accept options for cache busting and improved error handling. - Introduced `JOB_FEED_RECOVERY_RETRY_DELAYS` for better retry logic on upstream errors. - Enhanced logging to reflect the actual request URL during fetch operations.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/web-next/src/lib/facade/resolvers/orchestrators.ts (1)
178-191:⚠️ Potential issue | 🟠 MajorData loss via URI filter — likely root cause of the "33 of 84 orchestrators" blocker.
orderedKeyscorrectly unions dashboard addresses with net-only addresses, but line 190 drops every row with emptyurisviahasNonBlankServiceUri. Given the PR blocker (84 orchestrators in → ~33 out), the most plausible explanation is:
- Upstream
GET /v1/dashboard/orchestratorsis not populatingserviceUri/service_urifor most rows (the new optional fields onApiOrchestratorare the only dashboard-side URI source feedingmergeOrchestratorServiceUris).- Those addresses also lack an entry in
netData.urisByAddress(i.e., neitherstreaming/orchestratorsnorrequests/orchestratorsreturned a URI), somapDashboardIntoNetRowproducesuris: []and the row is filtered out at line 190.To diagnose and prevent silent data loss:
- Add instrumentation before returning: count rows entering the merge, count after the filter, and count dropped because both inventory
urisByAddressand dashboardserviceUriwere empty.- Verify upstream payload contains
serviceUri/service_urion the dashboard rows you're seeing in staging; if not, recovery requires upstream or inventory fixes.- Consider expanding
orderedKeysto include addresses frompipelineModelsByAddress/lastSeenMsByAddressthat lackurisByAddressentries—today net-only addresses are sourced fromurisByAddress.keys()only, making any inventory record without a URI invisible.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts` around lines 178 - 191, The merge is silently dropping rows with empty uris because merged rows are filtered by hasNonBlankServiceUri; update the logic around orderedKeys/merged to (1) add instrumentation counters inside the mergeOrchestratorServiceUris flow to log total input keys (orderedKeys), rows before filter, rows after filter and count of rows dropped due to both dashboard serviceUri and netData.urisByAddress being empty (use mapDashboardIntoNetRow/netOnlyPlaceholder outputs to detect drops), and (2) expand orderedKeys to also include addresses from pipelineModelsByAddress and lastSeenMsByAddress so inventory-only records without uris are not excluded; reference functions mapDashboardIntoNetRow, netOnlyPlaceholder, hasNonBlankServiceUri and the merged/orderedKeys variables when making these changes.
🧹 Nitpick comments (4)
plugins/developer-api/backend/src/server.ts (1)
405-413: Prefer a typed/sentinel error over substring matching for the 503 path.
error.message.includes('NAAP_API_SERVER_URL')is brittle — any future error whose message happens to mention that env name (e.g., from upstream logs orfetchwrappers) will get mapped to503instead of502, and a wording change ingetNaapApiServerBasewould silently flip the status code. A smallclass NaapConfigError extends Error {}(thrown fromgetNaapApiServerBase) and aninstanceofcheck here would be more robust.♻️ Sketch
+class NaapConfigError extends Error {} + function getNaapApiServerBase(): string { const raw = process.env.NAAP_API_SERVER_URL?.trim(); if (!raw) { - throw new Error( + throw new NaapConfigError( 'NAAP_API_SERVER_URL is not set. Set it to the NAAP OpenAPI base URL (including /v1). See repo root .env.example.', ); } return raw.replace(/\/+$/, ''); } @@ - if (error instanceof Error && error.message.includes('NAAP_API_SERVER_URL')) { + if (error instanceof NaapConfigError) { return res.status(503).json({ error: error.message }); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/developer-api/backend/src/server.ts` around lines 405 - 413, The catch block in server.ts currently checks error.message.includes('NAAP_API_SERVER_URL') which is brittle; instead introduce a sentinel error class (e.g., class NaapConfigError extends Error) and throw it from getNaapApiServerBase when the NAAP config is missing/invalid, then replace the substring check with an instanceof NaapConfigError check in the catch (near netModelsInflight.delete and the existing res.status calls) so true config errors return 503 and all other errors return 502.apps/web-next/src/lib/facade/resolvers/orchestrators.ts (2)
67-77: Dedup is whitespace/case-sensitive — may push a near-duplicate URI.
extrais.trim()ed before the membership check, butnetUrisentries are not normalized at this site. If inventory yields"https://example.com/"and dashboardserviceUriis" https://example.com/ "(or differs only in trailing slash / scheme casing), theout.includes(extra)check fails and a duplicate-looking URI is appended. Consider normalizing both sides (trim at minimum) before the comparison, e.g.:♻️ Suggested normalization
function mergeOrchestratorServiceUris(netUris: string[], dash: ApiOrchestrator): string[] { - const out = [...netUris]; + const out = [...netUris]; const extra = (typeof dash.serviceUri === 'string' && dash.serviceUri.trim()) || (typeof dash.service_uri === 'string' && dash.service_uri.trim()) || ''; - if (extra && !out.includes(extra)) { + if (extra && !out.some((u) => u.trim() === extra)) { out.push(extra); } return out; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts` around lines 67 - 77, mergeOrchestratorServiceUris can append near-duplicate URIs because only dash.serviceUri is trimmed before the includes check; normalize both sides instead: map netUris to a normalized form (trim, optional lowercase, and remove trailing slash) and compute the normalized form of extra (from dash.serviceUri or dash.service_uri) and check against that normalized list before pushing; update references to netUris, extra, out, and the function mergeOrchestratorServiceUris so the original array preserves values but the membership check uses the normalized set to avoid whitespace/case/trailing-slash duplicates.
120-125: Minor: null-guard inconsistency vsApiOrchestratortyping.
ApiOrchestrator.successRatiois typednumber(non-nullable, line 42), yet line 122 defensively guardsdash.successRatio != null. Either tighten the runtime contract (drop the guard) or relax the type tonumber | nullto match the guard — pick one for consistency.slaScore(line 125) is correctly typednumber | nulland guarded with!== null, which is fine. Not a functional bug, just a contract mismatch.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts` around lines 120 - 125, The runtime uses a null-check for dash.successRatio but ApiOrchestrator.successRatio is declared non-nullable; make the contract consistent by updating ApiOrchestrator.successRatio to be number | null (or number|null) and keep the existing guard in orchestrators.ts (the mapping that produces the successRatio field); also review any other callsites that assume a non-null number and adjust them to handle null (or coerce to 0) so the type change does not break consumers.apps/web-next/src/hooks/useJobFeedStream.ts (1)
135-139: Tighten the comment — the ladder applies to any failure streak, not just first load.
JOB_FEED_RECOVERY_RETRY_DELAYSis consulted wheneverfeedFailureStreak > 0, regardless of whether it's the first load or a transient blip mid-session. The current wording ("on first load") under-describes the actual behavior.📝 Proposed wording
-/** - * When the upstream briefly errors on first load, retry quickly before falling - * back to the steady poll interval. - */ +/** + * Recovery back-off (ms) consulted whenever the feed reports an unhealthy + * response (HTTP error, invalid JSON, or `queryFailed:true`). Each delay is + * additionally capped by the steady `pollIntervalMs`, so it can only shorten + * — never lengthen — the cadence relative to a healthy stream. + */🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web-next/src/hooks/useJobFeedStream.ts` around lines 135 - 139, The comment above JOB_FEED_RECOVERY_RETRY_DELAYS is misleading — it says “on first load” but the delay ladder is used whenever feedFailureStreak > 0 (any failure streak). Update the comment to state that these delays are used for recovery retries on any failure streak (transient or persistent) rather than only on first load, and mention the tie to feedFailureStreak so future readers see when the array is consulted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/developer-api/backend/src/server.ts`:
- Around line 269-288: The merge currently blindly sums capacities for identical
Pipeline:Model keys (netModelRowKey) across `streaming/models` and
`requests/models` which can overstate capacity if the endpoints report
overlapping orchestrators; change the merge in the loop over `accum` (variables:
byKey, netModelRowKey, WarmOrchCount, TotalCapacity) to detect duplicates coming
from different sources and resolve them by either asserting disjointness
(throw/log) or by a deterministic merge rule (e.g., prefer the larger capacity,
or prefer one source over the other) instead of always summing. Also propagate a
`partial` indicator from the `ingest` calls (use the boolean results from
`streamingOk` and `requestsOk`) back to the cache layer: if either upstream
failed set `partial=true` and either skip writing to cache or write with a much
shorter TTL (instead of the current 60s) so transient partial results are not
served for a minute.
---
Outside diff comments:
In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts`:
- Around line 178-191: The merge is silently dropping rows with empty uris
because merged rows are filtered by hasNonBlankServiceUri; update the logic
around orderedKeys/merged to (1) add instrumentation counters inside the
mergeOrchestratorServiceUris flow to log total input keys (orderedKeys), rows
before filter, rows after filter and count of rows dropped due to both dashboard
serviceUri and netData.urisByAddress being empty (use
mapDashboardIntoNetRow/netOnlyPlaceholder outputs to detect drops), and (2)
expand orderedKeys to also include addresses from pipelineModelsByAddress and
lastSeenMsByAddress so inventory-only records without uris are not excluded;
reference functions mapDashboardIntoNetRow, netOnlyPlaceholder,
hasNonBlankServiceUri and the merged/orderedKeys variables when making these
changes.
---
Nitpick comments:
In `@apps/web-next/src/hooks/useJobFeedStream.ts`:
- Around line 135-139: The comment above JOB_FEED_RECOVERY_RETRY_DELAYS is
misleading — it says “on first load” but the delay ladder is used whenever
feedFailureStreak > 0 (any failure streak). Update the comment to state that
these delays are used for recovery retries on any failure streak (transient or
persistent) rather than only on first load, and mention the tie to
feedFailureStreak so future readers see when the array is consulted.
In `@apps/web-next/src/lib/facade/resolvers/orchestrators.ts`:
- Around line 67-77: mergeOrchestratorServiceUris can append near-duplicate URIs
because only dash.serviceUri is trimmed before the includes check; normalize
both sides instead: map netUris to a normalized form (trim, optional lowercase,
and remove trailing slash) and compute the normalized form of extra (from
dash.serviceUri or dash.service_uri) and check against that normalized list
before pushing; update references to netUris, extra, out, and the function
mergeOrchestratorServiceUris so the original array preserves values but the
membership check uses the normalized set to avoid whitespace/case/trailing-slash
duplicates.
- Around line 120-125: The runtime uses a null-check for dash.successRatio but
ApiOrchestrator.successRatio is declared non-nullable; make the contract
consistent by updating ApiOrchestrator.successRatio to be number | null (or
number|null) and keep the existing guard in orchestrators.ts (the mapping that
produces the successRatio field); also review any other callsites that assume a
non-null number and adjust them to handle null (or coerce to 0) so the type
change does not break consumers.
In `@plugins/developer-api/backend/src/server.ts`:
- Around line 405-413: The catch block in server.ts currently checks
error.message.includes('NAAP_API_SERVER_URL') which is brittle; instead
introduce a sentinel error class (e.g., class NaapConfigError extends Error) and
throw it from getNaapApiServerBase when the NAAP config is missing/invalid, then
replace the substring check with an instanceof NaapConfigError check in the
catch (near netModelsInflight.delete and the existing res.status calls) so true
config errors return 503 and all other errors return 502.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1620015-2792-4c8d-b05e-9cd8e27dc01a
📒 Files selected for processing (4)
apps/web-next/src/components/dashboard/overview-content.tsxapps/web-next/src/hooks/useJobFeedStream.tsapps/web-next/src/lib/facade/resolvers/orchestrators.tsplugins/developer-api/backend/src/server.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/web-next/src/components/dashboard/overview-content.tsx
- Enhanced `fetchJobFeed` to accept a base URL and effective polling interval, improving flexibility. - Updated retry logic with `JOB_FEED_RECOVERY_RETRY_DELAYS` for better handling of transient errors. - Refactored job feed polling to utilize the new fetch function, ensuring consistent cache busting behavior. - Added documentation for recovery backoff strategy after failed polls.
Summary
This PR aligns the web BFF and related resolvers with the repo OpenAPI v1 contract: dashboard data is sourced from streaming and requests resources where specified, merged where the UI needs a single view, and cached consistently. It also tightens dashboard
windowhandling and timeouts for large ranges, fixes orchestrator listing by unioning dashboard/orchestrators with inventory and mergingserviceUri, documents warm routes against the spec, and updates the developer-api network-models path to useNAAP_API_SERVER_URL(no hardcoded NAAP base) and merged streaming/models + requests/models.Changes
openapi.yaml/ upstream semantics.NAAP_API_SERVER_URLonly.perf-by-modelbacked bystreaming/modelswith a single in-process + SWR cache key (24h semantics per OpenAPI).avg_fpswith header tooltip (24h, live-video scope); KPI/pipelines useformatDashboardWindowand scaled upstream timeouts for large windows.serviceUriintouris; pipelines/pricing resolvers updated for OpenAPI v1 shapes.Type
Plugin(s) Affected
plugins/developer-api(backend network-models / NAAP upstream env)Checklist
npm run lint)npm run build)Breaking Changes
None for consumers of the public dashboard JSON shapes beyond what is already required by the OpenAPI v1 migration (e.g.
NAAP_API_SERVER_URLmust be set for developer-api network-models when calling NAAP).Screenshots / Recordings
Optional: Overview Pipelines & GPUs (FPS column + info tooltip) and Orchestrators table length before/after if you want visual proof.
Summary by CodeRabbit
New Features
Changes