[SDKs] Self-heal catalog cache on BYOM cache-miss#760
Conversation
The v1 SDKs (js, python, cs, rust) cache the catalog in memory with a 6-hour TTL. When a user manually drops a model directory (BYOM) into the cache while the SDK is running, the SDK's in-memory map is stale: Core returns the BYOM id from `get_cached_models` IPC and `get_model_list`, but the SDK filters it out because its `modelIdToModelVariant` map does not yet know the id. Add a force-refresh path on cache-miss in the four user-facing entry points: - get_cached_models / get_loaded_models: when any id in the fresh list returned by Core does not resolve against the local map, force a catalog refresh and re-resolve only the unresolved ids. - get_model(alias) / get_model_variant(id): on miss, force a refresh and retry the lookup once. No new public APIs and no new IPC commands. Relies on the FLCore `GetModelsAsync` override that surfaces BYOM stubs. Validated end-to-end via the Python SDK against a real BYOM directory dropped into the cache at runtime: get_cached_models now returns the BYOM, get_model resolves the alias, and Core's view is consistent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
Adds “self-healing” catalog refresh behavior across the Rust, Python, JS, and C# SDKs so that when the on-disk BYOM cache changes while the SDK is running (and Core returns new IDs), the SDK refreshes its in-memory catalog on cache-miss and retries resolution.
Changes:
- Add a forced refresh + retry path to
get_model(alias)/get_model_variant(id)when the initial lookup misses. - Add forced refresh-on-miss when resolving IDs returned by Core for
get_cached_models/get_loaded_models, via new helper resolvers. - Add
force/force: trueparameters to the various “update models” caching methods to bypass the 6-hour TTL when self-healing.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/rust/src/catalog.rs | Adds forced-refresh retry on alias/id miss and a helper to resolve cached/loaded model IDs. |
| sdk/python/src/catalog.py | Adds force refresh support and self-heal retry logic for lookups and cached/loaded model resolution. |
| sdk/js/src/catalog.ts | Adds force refresh support and self-heal retry logic plus ID resolution helper for cached/loaded APIs. |
| sdk/cs/src/Catalog.cs | Adds forced-refresh retry logic and shared ID-resolution helper; updates UpdateModels signature to support forced refresh. |
PR #760 review feedback: the previous resolve_model_ids implementation in all four SDKs (Rust, Python, JS, C#) split the input model_ids into `resolved` and `unresolved` lists in a single pass, then on cache-miss forced a refresh and appended the newly-resolvable IDs after the first batch. This reordered the output relative to the input model_ids list, which is a behavioral regression vs. pre-PR get_cached_models / get_loaded_models that returned variants in Core's ID order (minus unknowns). Switch all four SDKs to the same shape: 1. First pass: check whether any id is missing (short-circuit on first miss); this needs no scratch storage. 2. If anything is missing, force-refresh once. 3. Final pass: iterate model_ids in input order and collect variants present in the catalog (filter-map style). The single final pass preserves input ordering on both the warm path (no refresh) and the self-heal path (after refresh). The JS dedupe behavior via Set<IModel> is preserved. Side benefit: -32 net lines across the four files; the implementation now matches the structure the reviewer suggested. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses PR #760 review comments: - Python get_model/get_model_variant and C# GetModelImplAsync/GetModelVariantImplAsync now short-circuit on null/empty/whitespace input, mirroring the validation already present in Rust and JS. Without this, those entry points triggered the wasteful self-heal forced refresh just to return None on bad input. - Adds Python unit tests for self-heal on get_model and get_cached_models (mock _MockCoreInterop returns no models on warm fetch, then BYOM on forced refresh) and a no-refresh-on-empty-input test that warms the catalog and asserts the forced refresh is not triggered by '', ' ', or None. - Adds matching JS self-heal unit tests for getModel/getModelVariant and getCachedModels using the same mock pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| // Self-heal: the alias may belong to a BYOM model added to the cache | ||
| // directory after our last catalog refresh. | ||
| self.invalidator.invalidate(); | ||
| self.update_models().await?; |
There was a problem hiding this comment.
Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.
Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.
| // Self-heal: the id may belong to a BYOM model added to the cache | ||
| // directory after our last catalog refresh. | ||
| self.invalidator.invalidate(); | ||
| self.update_models().await?; |
There was a problem hiding this comment.
Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.
Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.
| if needs_refresh { | ||
| self.invalidator.invalidate(); | ||
| self.update_models().await?; | ||
| } |
There was a problem hiding this comment.
Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.
Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.
The SDKs cache the catalog in memory with a 6-hour TTL. When a user manually drops a model directory (BYOM) into the cache while the SDK is running, the SDK's in-memory map is stale: Core returns the BYOM id from
get_cached_modelsIPC andget_model_list, but the SDK filters it out because itsmodelIdToModelVariantmap does not yet know the id.Add a force-refresh path on cache-miss in the four user-facing entry points:
Added calls to
await UpdateModels(ct).ConfigureAwait(false);in the C# SDK for consistency with the other SDKs, though it is not completely necessary for the BYOM featureValidated against a real BYOM directory dropped into the cache at runtime