
[Bug]: Regression: OpenCode probe blocks cold start 45-75s (no timeout) #2248

@mwolson

Description


Before submitting

  • I searched existing issues and did not find a duplicate.
  • I included enough detail to reproduce or investigate the problem.

Area

apps/server

Steps to reproduce

  1. Install the opencode CLI (/usr/bin/opencode, v1.4.7 on this box). Do not configure an external serverUrl, so T3 Code will spawn its own local OpenCode server for the probe.
  2. Leave the default setting providers.opencode.enabled: true.
  3. Cold-start T3 Code and measure the gap between Running all migrations... / Migrations ran successfully and Listening on http://... in ~/.t3/userdata/logs/server-child.log.

Expected behavior

Backend HTTP readiness (Listening on ...) does not block on any one provider's probe. A slow or hung provider should degrade that provider's status display, not delay every other layer that transitively depends on ProviderRegistry.

Actual behavior

On this install, cold starts routinely block 45-75s waiting for the OpenCode probe. The gap before Listening on ... is dominated by checkOpenCodeProviderStatus, and it is unbounded because runOpenCodeSdk has no timeout.

Measured from the apps/server/src/telemetry trace output (server.trace.ndjson), across 56 checkOpenCodeProviderStatus spans during normal use:

    metric  duration
    min     3.6s
    p50     47s
    p90     55s
    max     73s
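For reference, the summary above is just order statistics over the span durations. A minimal sketch of that computation (the ndjson parsing and span field names are omitted here, since the trace schema isn't part of this report; `summarize` and `percentile` are illustrative helpers, not codebase functions):

```typescript
// Nearest-rank percentile over a sorted array of durations (ms).
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.min(sortedMs.length - 1, Math.floor((p / 100) * sortedMs.length));
  return sortedMs[idx];
}

// Reduce a list of span durations to the min/p50/p90/max shape shown above.
function summarize(durationsMs: number[]) {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return {
    min: sorted[0],
    p50: percentile(sorted, 50),
    p90: percentile(sorted, 90),
    max: sorted[sorted.length - 1],
  };
}
```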

Independently verified by toggling the setting. With providers.opencode.enabled: false, cold start drops by 45-70s on the same install:

    Scenario  DB size  opencode.enabled  Migrations -> Listening
    Small DB  30 MB    true              ~50s+ (varies)
    Small DB  30 MB    false             1.58s
    Large DB  584 MB   false             3.51s

The DB-size axis is a separate, much smaller (~2s) issue tracked in #2245. The remaining tens of seconds are entirely the OpenCode probe.

Root cause

Three mechanisms compose to make one provider's slow probe a global cold-start blocker:

  1. makeManagedServerProvider (apps/server/src/provider/makeManagedServerProvider.ts:140) forks the initial snapshot with Effect.forkScoped, but that fiber is serialized against every subsequent .getSnapshot call via refreshSemaphore.withPermits(1) (makeManagedServerProvider.ts:121). So getSnapshot is effectively synchronous with the initial probe:

    const applySnapshot = (nextSettings, options?) =>
      refreshSemaphore.withPermits(1)(applySnapshotBase(nextSettings, options));
    // ...
    yield* applySnapshot(initialSettings, { forceRefresh: true }).pipe(
      Effect.ignoreCause({ log: true }),
      Effect.forkScoped,
    );
    // ...
    return {
      getSnapshot: input.getSettings.pipe(Effect.flatMap(applySnapshot), ...),
      ...
    };
  2. ProviderRegistryLive calls .getSnapshot on every provider during layer acquisition (apps/server/src/provider/Layers/ProviderRegistry.ts:266), so the Layer body waits on the forked OpenCode probe before it can return:

    yield* loadProviders(providerSources).pipe(
      Effect.flatMap((providers) => upsertProviders(providers, { publish: false })),
    );
  3. runOpenCodeSdk (apps/server/src/provider/opencodeRuntime.ts:78) wraps the OpenCode SDK calls (client.provider.list(), client.app.agents()) in Effect.tryPromise with no timeout. loadOpenCodeInventory runs them concurrently with { concurrency: "unbounded" }, but either one hanging is enough to hold the probe:

    export const runOpenCodeSdk = <A>(operation, fn) =>
      Effect.tryPromise({ try: fn, catch: ... })
        .pipe(Effect.withSpan(`opencode.${operation}`));

startOpenCodeServerProcess does have a timeout (opencodeRuntime.ts:406, Deferred.await(readyDeferred).pipe(Effect.timeoutOption(timeoutMs))), but that only bounds server startup; once the local server reports ready, the SDK calls against it are unbounded.

Because ProviderRegistryLive is a transitive dependency of the HTTP router layer, the Listening on ... log cannot fire until the probe returns.
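The blocking in mechanism 1 can be reproduced outside Effect. A minimal plain-Promise sketch of a one-permit lock serializing a forked slow probe against a later snapshot read (the `Mutex` class, `demo`, and the 200ms delay are illustrative stand-ins, not codebase types):

```typescript
// One-permit lock, analogous to refreshSemaphore.withPermits(1):
// every task queues behind whatever is already holding the permit.
class Mutex {
  private tail: Promise<void> = Promise.resolve();
  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function demo(): Promise<number> {
  const refreshMutex = new Mutex();
  // "Forked" initial probe: fire-and-forget, but it holds the permit for 200ms.
  void refreshMutex.run(() => sleep(200));
  const t0 = Date.now();
  // A later getSnapshot-style call is queued behind the probe...
  await refreshMutex.run(async () => "snapshot");
  // ...so its latency is roughly the probe's full duration.
  return Date.now() - t0;
}
```

Swap the 200ms for a 45-75s probe and this is the cold-start gap: the fork does not help, because the very next getSnapshot waits behind the same permit.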

Impact

Major degradation or frequent failure. Cold-start delay is the visible symptom, but the real problem is that any single provider being slow (for any reason: slow disk I/O spawning the process, OpenCode internal slowness, a wedged local server) holds up the whole backend. On this install the user-visible gap before the app becomes usable was 50-75s per launch.

Version or commit

Observed on a local build of main at f6978db6 (the package.json version string still reads 0.0.20, but that commit is post-v0.0.20 and corresponds to the v0.0.21-nightly.20260420.77 release).

This is a regression in the post-v0.0.20 nightly series:

  • v0.0.20 stable (tag v0.0.20, commit b2cca674, 2026-04-16) did not have OpenCode provider support at all, so it was not affected.
  • feat: add opencode provider support (#1758, commit ce94feee, 2026-04-17) introduced OpenCodeProviderLive using makeManagedServerProvider and runOpenCodeSdk without any timeout around the SDK calls. This commit first shipped in nightly v0.0.21-nightly.20260420.75 (commit 66c326b8).
  • Refactor OpenCode lifecycle and structured output handling (#2218, commit 306ec4bb, 2026-04-19) reshaped the lifecycle but did not add a timeout to the probe; shipped in nightly v0.0.21-nightly.20260420.77.

Users who stayed on v0.0.20 stable will not see this; users who updated to any nightly from v0.0.21-nightly.20260420.75 onward and have opencode installed locally will.

Environment

Linux (niri, Wayland), Electron 40.6.0, AppImage. opencode v1.4.7 at /usr/bin/opencode. No external serverUrl configured (local server path). Default timeoutMs for server startup.

Workaround

Disable the OpenCode probe via ~/.t3/userdata/settings.json:

{
  "providers": {
    "opencode": {
      "enabled": false
    }
  }
}

This hits the early-return at apps/server/src/provider/Layers/OpenCodeProvider.ts:308 and skips the probe entirely. Verified end-to-end: cold start drops from 50-75s to ~1.6s on the same DB. The OpenCode provider then shows as disabled in the UI.

Possible directions

A few ideas, though I'll likely leave the decision to the maintainers:

  1. Bound checkOpenCodeProviderStatus with a timeout (e.g. a few seconds total, covering version + inventory). On timeout, fall through to the existing fallback(...) path the function already uses for errors, and surface the timeout in the probe status. This is the smallest fix and localizes the bound to the provider.
  2. Add a timeout inside runOpenCodeSdk itself (or a timeoutOption-based wrapper around each SDK call) so client.provider.list() and client.app.agents() cannot hang indefinitely. Similar bound, different seam.
  3. Make ProviderRegistryLive layer acquisition non-blocking on the first probe: have loadProviders consume the already-persisted initialSnapshot (the "pending" stub from makePendingOpenCodeProvider already exists for this) and let streamChanges publish the real snapshot when the forked probe finishes. That decouples every provider's slowness from Listening.
  4. Drop the refreshSemaphore serialization between the forked initial probe and .getSnapshot, or have .getSnapshot return the current cached snapshot without waiting for an in-flight refresh. Today a caller asking "what's the current snapshot?" during startup is forced to wait for a refresh it didn't request.
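To illustrate direction 2, here is a hedged plain-Promise sketch of bounding an individual SDK call; `withTimeout`, `SdkTimeoutError`, and the 5s default are hypothetical names I made up for this issue, not the codebase's API. In Effect terms this would roughly correspond to piping Effect.timeout (or Effect.timeoutFail, to surface a typed error) onto runOpenCodeSdk:

```typescript
// Hypothetical default bound covering a single SDK call.
const SDK_CALL_TIMEOUT_MS = 5_000;

class SdkTimeoutError extends Error {
  constructor(readonly operation: string, timeoutMs: number) {
    super(`opencode.${operation} timed out after ${timeoutMs}ms`);
  }
}

// Race the SDK call against a timer so a wedged local server rejects
// instead of hanging the probe (and, transitively, layer acquisition).
function withTimeout<A>(
  operation: string,
  fn: () => Promise<A>,
  timeoutMs: number = SDK_CALL_TIMEOUT_MS,
): Promise<A> {
  return new Promise<A>((resolve, reject) => {
    const timer = setTimeout(() => reject(new SdkTimeoutError(operation, timeoutMs)), timeoutMs);
    fn().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

On rejection the caller would take the same fallback(...) path checkOpenCodeProviderStatus already uses for SDK errors, so a timeout degrades the provider's status instead of cold start.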
