Fix launcher startup regressions#48501

Merged
Takhoffman merged 5 commits into main from codex/startup-memory-launcher-fix
Mar 16, 2026

@Takhoffman
Contributor

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: openclaw.mjs startup could fail after build when a transitive import was missing, and the launcher incorrectly reported that dist/entry.(m)js was missing.
  • Why it matters: this broke the startup-memory CI lane at the launcher boundary and hid the real root cause from developers.
  • What changed: made gaxios compat lazy/optional with a safe fallback, hardened the launcher to preserve transitive import errors, added launcher smoke coverage in CI, and added startup-memory failure guidance with an LLM-ready prompt.
  • What did NOT change (scope boundary): this PR does not change the separate startup RSS budget or reduce command memory usage.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

  • node openclaw.mjs --help now surfaces real transitive import failures instead of always collapsing them into a generic missing-build-output error.
  • startup-memory CI failures now print a short local repro path and an LLM-ready prompt.
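
The launcher behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in openclaw.mjs: `isDirectMiss` and `importCandidate` are hypothetical names, and the sketch assumes the failed import error carries a `code` of `ERR_MODULE_NOT_FOUND` and a `url` naming the module that could not be resolved. When `url` is absent, the sketch errs on the side of re-throwing so the original error stays visible.

```typescript
// Hypothetical helpers; the names and the reliance on `err.url` are assumptions.
interface ImportFailure {
  code?: string;
  url?: string;
}

// True only when the module that could not be resolved is the entry file itself.
function isDirectMiss(err: unknown, entryUrl: string): boolean {
  if (typeof err !== "object" || err === null) return false;
  const failure = err as ImportFailure;
  return failure.code === "ERR_MODULE_NOT_FOUND" && failure.url === entryUrl;
}

// Returns false when the candidate entry does not exist, so the caller can try
// the next candidate. Any other failure, including a missing transitive import,
// is re-thrown unchanged.
async function importCandidate(
  entryUrl: string,
  load: (url: string) => Promise<unknown> = (url) => import(url),
): Promise<boolean> {
  try {
    await load(entryUrl);
    return true;
  } catch (err) {
    if (isDirectMiss(err, entryUrl)) return false;
    throw err;
  }
}
```

A launcher built this way would print the friendly "missing build output" message only after every candidate returned false, and would let everything else propagate.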

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 24.14.0 for build/help verification; local shell for targeted tests
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): default local config; one unrelated stale plugin warning observed during status --json

Steps

  1. pnpm build
  2. node openclaw.mjs --help
  3. pnpm test:startup:memory

Expected

  • Built launcher commands succeed.
  • Missing transitive imports surface the real error.
  • startup-memory failures emit actionable guidance.

Actual

  • Verified locally after the patch: launcher smoke commands succeed and startup-memory passes on this branch.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: pnpm vitest run src/infra/gaxios-fetch-compat.test.ts test/openclaw-launcher.e2e.test.ts src/index.test.ts, pnpm build, node openclaw.mjs --help, node openclaw.mjs status --json --timeout 1, and pnpm test:startup:memory.
  • Edge cases checked: missing direct gaxios resolution falls back to the legacy fetch shim; launcher preserves transitive ERR_MODULE_NOT_FOUND; true missing build output still reports the friendly launcher message.
  • What you did not verify: full GitHub Actions execution on Linux runners.
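
The fallback edge case above (missing gaxios resolution falls back to the legacy fetch shim) follows a common optional-dependency pattern. The sketch below is an assumption about the general shape, not the contents of src/infra/gaxios-fetch-compat.ts: `loadOptional` and `installCompat` are hypothetical names, and the two "not found" codes are the ones Node uses for ESM and `require` resolution failures respectively.

```typescript
type Loader<T> = () => Promise<T>;

// Resolve an optional dependency. Returns null when the module is absent and
// re-throws every other failure so real bugs are not hidden.
async function loadOptional<T>(load: Loader<T>): Promise<T | null> {
  try {
    return await load();
  } catch (err) {
    const code = (err as { code?: string } | null)?.code;
    if (code === "ERR_MODULE_NOT_FOUND" || code === "MODULE_NOT_FOUND") {
      return null;
    }
    throw err;
  }
}

// Hypothetical installer: patch the dependency when present, otherwise install
// the fallback shim. The returned string mirrors the states named in the review.
async function installCompat<T>(
  load: Loader<T>,
  patch: (mod: T) => void,
  shim: () => void,
): Promise<"installed" | "shimmed"> {
  const mod = await loadOptional(load);
  if (mod === null) {
    shim();
    return "shimmed";
  }
  patch(mod);
  return "installed";
}
```

Passing the loader in as a function keeps the sketch testable without the real package being installed.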

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit aa8180156c.
  • Files/config to restore: openclaw.mjs, src/entry.ts, src/index.ts, src/infra/gaxios-fetch-compat.ts, .github/workflows/ci.yml, scripts/check-cli-startup-memory.mjs.
  • Known bad symptoms reviewers should watch for: openclaw.mjs --help failing after build, generic missing-dist errors masking transitive import failures, or startup-memory failures without remediation guidance.

Risks and Mitigations

  • Risk: the fallback compat path could diverge from the directly patched gaxios path in nested dependency layouts.
    • Mitigation: targeted compat tests now cover both direct patching and fallback shim behavior.

@aisle-research-bot

aisle-research-bot bot commented Mar 16, 2026

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

| # | Severity | Title |
|---|----------|-------|
| 1 | 🔵 Low | Race condition in async installGaxiosFetchCompat can allow requests before proxy/TLS dispatcher patch is applied |
| 2 | 🔵 Low | CI smoke test runs openclaw status --json which performs outbound network calls (npm registry, git fetch, gateway probe) despite --timeout 1 |

1. 🔵 Race condition in async installGaxiosFetchCompat can allow requests before proxy/TLS dispatcher patch is applied

| Property | Value |
|----------|-------|
| Severity | Low |
| CWE | CWE-362 |
| Location | src/infra/gaxios-fetch-compat.ts:268-274 |

Description

The async initializer installGaxiosFetchCompat() uses a global installState guard that returns immediately for any state other than "not-installed".

Because the function now performs asynchronous work (dynamic module resolution/import), concurrent callers can observe installState === "installing" and return without waiting for the patch to complete. If such a caller proceeds to make HTTP requests via gaxios immediately after its awaited installGaxiosFetchCompat() call resolves, those requests may run without the compat fetch, meaning:

  • proxy / NO_PROXY and HTTP(S)_PROXY translation to undici ProxyAgent may not be applied
  • mTLS options (cert/key) may not be applied via an undici dispatcher

This creates a startup-time window where requests can bypass expected proxy/TLS routing controls.

Vulnerable code:

export async function installGaxiosFetchCompat(): Promise<void> {
  if (installState !== "not-installed" || typeof globalThis.fetch !== "function") {
    return;
  }

  installState = "installing";
  // async work ...
}

Notes on current repo call sites:

  • Current internal call sites (src/entry.ts, src/index.ts) are awaited and appear to be invoked once, so exploitability depends on whether other code (plugins/consumers) can call this concurrently or trigger network traffic during startup. However, the function is exported and the race is real if used concurrently.

Recommendation

Make installation idempotent and awaitable by sharing a single in-flight promise. Callers arriving during "installing" should await the same promise rather than returning early.

Example fix:

let installPromise: Promise<void> | null = null;

export function installGaxiosFetchCompat(): Promise<void> {
  if (typeof globalThis.fetch !== "function") return Promise.resolve();
  if (installState === "installed" || installState === "shimmed") return Promise.resolve();
  if (installState === "installing" && installPromise) return installPromise;

  installState = "installing";
  installPromise = (async () => {
    const Gaxios = await loadGaxiosConstructor();
    if (!Gaxios) {
      installLegacyWindowFetchShim();
      installState = "shimmed";
      return;
    }

    const prototype = Gaxios.prototype;
    const originalDefaultAdapter = prototype._defaultAdapter;
    const compatFetch = createGaxiosCompatFetch();

    prototype._defaultAdapter = function patchedDefaultAdapter(this: unknown, config: GaxiosFetchRequestInit) {
      return originalDefaultAdapter.call(this, config.fetchImplementation ? config : {
        ...config,
        fetchImplementation: compatFetch,
      });
    };

    installState = "installed";
  })().catch(err => {
    installState = "not-installed";
    throw err;
  });

  return installPromise;
}

Optionally, consider whether "shimmed" should allow retrying installation if gaxios becomes available later (to avoid a permanent degraded mode).
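
The shared in-flight promise idea can also be reduced to a small reusable helper, which makes the intended guarantee easy to check in isolation. This is a sketch under stated assumptions: `singleFlight` is a hypothetical name and is not part of the PR.

```typescript
// Wrap an async task so that concurrent callers share one in-flight promise.
// On failure the slot is cleared so a later call can retry.
function singleFlight(task: () => Promise<void>): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = task().catch((err: unknown) => {
        inFlight = null;
        throw err;
      });
    }
    return inFlight;
  };
}

// Usage: both calls receive the same promise and the task body runs once.
let runs = 0;
const install = singleFlight(async () => {
  runs += 1;
  await Promise.resolve(); // stands in for the dynamic import
});
const first = install();
const second = install();
```

Because `first` and `second` are the same promise, a caller that awaits either one cannot proceed until the patch has actually been applied.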


2. 🔵 CI smoke test runs openclaw status --json which performs outbound network calls (npm registry, git fetch, gateway probe) despite --timeout 1

| Property | Value |
|----------|-------|
| Severity | Low |
| CWE | CWE-693 |
| Location | .github/workflows/ci.yml:322-326 |

Description

The new CI smoke test step runs node openclaw.mjs status --json --timeout 1 on PRs. The status --json command performs multiple network-capable operations that are not disabled by --timeout 1:

  • NPM registry HTTP request: getUpdateCheckResult({ includeRegistry: true }) calls fetchWithTimeout("https://registry.npmjs.org/openclaw/<tag>").
  • Git remote fetch: getUpdateCheckResult({ fetchGit: true }) runs git fetch --quiet --prune.
  • Gateway probe WebSocket connection attempt: scanStatusJsonFast() calls probeGateway({ url, timeoutMs }), which creates a GatewayClient connection to the resolved Gateway URL.

Even with --timeout 1, the gateway probe enforces a minimum 250ms timer, and the update check uses a fixed ~2500ms timeout (independent of CLI --timeout).
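
To make the clamp concrete, here is a small sketch of the arithmetic. It assumes the CLI `--timeout` value reaches the probe unchanged as milliseconds, which this review does not confirm. The 250 floor comes from the `Math.max(250, opts.timeoutMs)` expression cited for src/gateway/probe.ts, and `effectiveProbeTimeoutMs` is a hypothetical name.

```typescript
// Floor taken from the Math.max(250, opts.timeoutMs) expression cited in the review.
const MIN_PROBE_TIMEOUT_MS = 250;

// Hypothetical helper showing how a tiny requested timeout is raised to the floor.
function effectiveProbeTimeoutMs(requestedMs: number): number {
  return Math.max(MIN_PROBE_TIMEOUT_MS, requestedMs);
}

const tiny = effectiveProbeTimeoutMs(1); // raised to the 250 ms floor
const large = effectiveProbeTimeoutMs(5000); // unchanged
```

Under that assumption, a requested timeout of 1 still leaves the probe a 250 ms window in which to open a connection.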

Security impact / risk in CI context:

  • Outbound requests from PR jobs can be undesirable (policy/compliance) and may leak metadata (runner IP, timing) to external services.
  • Adds flakiness due to reliance on network availability.

Relevant code locations:

  • Workflow step: .github/workflows/ci.yml
  • Status JSON scan triggers update check + probe: src/commands/status.scan.fast-json.ts (getUpdateCheckResult({ fetchGit: true, includeRegistry: true }), resolveGatewayProbeSnapshot())
  • Registry fetch + git fetch implementation: src/infra/update-check.ts (fetchWithTimeout("https://registry.npmjs.org/... "), git fetch --quiet --prune)
  • Gateway probe uses WebSocket client + min timer: src/gateway/probe.ts (Math.max(250, opts.timeoutMs) timer)

Recommendation

Make the CI smoke test deterministic and offline-safe:

  • Prefer a pure local smoke test that does not reach the network (e.g., only --help, or a dedicated --smoke/--offline mode).
  • If you need to keep status --json in CI, add and use a flag/env to disable update checks and probes in smoke contexts (example):
- name: Smoke test CLI launcher status json (offline)
  env:
    OPENCLAW_DISABLE_UPDATE_CHECK: "1"
    OPENCLAW_DISABLE_GATEWAY_PROBE: "1"
  run: node openclaw.mjs status --json --timeout 1

and gate the implementation in code (example):

const disableUpdate = process.env.OPENCLAW_DISABLE_UPDATE_CHECK === "1";
const updatePromise = disableUpdate
  ? Promise.resolve({ /* minimal update object */ })
  : getUpdateCheckResult({ timeoutMs: updateTimeoutMs, fetchGit: true, includeRegistry: true });

This keeps CI from making outbound HTTP/git requests and reduces flakiness.


Analyzed PR: #48501 at commit be78da9

Last updated on: 2026-03-16T23:00:41Z

openclaw-barnacle bot added the docs, scripts, size: XL, and maintainer labels on Mar 16, 2026
Takhoffman force-pushed the codex/startup-memory-launcher-fix branch from aa81801 to b53818c on March 16, 2026, 21:41
openclaw-barnacle bot added the size: M label and removed the docs and size: XL labels on Mar 16, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR fixes two startup regressions: (1) the openclaw.mjs launcher now distinguishes direct entry-file misses from transitive import failures, surfacing the real root cause instead of always printing a generic "missing dist/entry.(m)js" message; and (2) installGaxiosFetchCompat is made async/lazy so a missing gaxios package no longer crashes startup—it falls back to the legacy window-fetch shim instead. Supporting changes add CI smoke steps for the launcher, structured actionable guidance for startup-memory failures, and a new plugin-boundary ratchet script + baseline.

  • openclaw.mjs: The tryImport helper uses a URL-aware isDirectModuleNotFoundError check so transitive ERR_MODULE_NOT_FOUND errors are re-thrown and visible. Complementary e2e tests cover both the transitive-failure and truly-missing-dist paths.
  • gaxios-fetch-compat.ts: Top-level import { Gaxios } from "gaxios" removed; loading is now lazy via loadGaxiosConstructor(). The fallback installLegacyWindowFetchShim is idempotent. entry.ts and index.ts correctly await the now-async function. One concern: the idempotency guard (installState !== "not-installed") is evaluated before the first await, leaving a window where concurrent callers can both proceed and double-wrap _defaultAdapter.
  • check-plugin-boundary-ratchet.mjs: Solid new lint script using TypeScript AST parsing to track cross-boundary imports in bundled plugins against a snapshot baseline, supporting gradual migration without blocking on a single large refactor.
  • check-cli-startup-memory.mjs: Failure output now includes a local repro command and an LLM-ready prompt, making CI failures directly actionable.
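
The ratchet idea mentioned for check-plugin-boundary-ratchet.mjs can be illustrated with a small comparison routine. This is a sketch, not the script itself: the baseline shape (file path mapped to an allowed count of cross-boundary imports) and the name `compareToBaseline` are assumptions made for illustration.

```typescript
// Assumed baseline shape: file path -> number of cross-boundary imports allowed.
type Counts = Record<string, number>;

interface RatchetResult {
  regressions: string[]; // files that now exceed their baseline (should fail CI)
  improvements: string[]; // files that dropped below it (baseline can be tightened)
}

function compareToBaseline(baseline: Counts, current: Counts): RatchetResult {
  const regressions: string[] = [];
  const improvements: string[] = [];
  const files = Array.from(
    new Set([...Object.keys(baseline), ...Object.keys(current)]),
  ).sort();
  for (const file of files) {
    const allowed = baseline[file] ?? 0;
    const found = current[file] ?? 0;
    if (found > allowed) {
      regressions.push(`${file}: ${found} > ${allowed}`);
    } else if (found < allowed) {
      improvements.push(`${file}: ${allowed} -> ${found}`);
    }
  }
  return { regressions, improvements };
}
```

A check of this shape fails only when a file gets worse, which is what lets existing violations be migrated gradually.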

Confidence Score: 4/5

  • Safe to merge; the core launcher and gaxios fixes are correct and well-tested, with one low-risk async idempotency gap in the compat module.
  • The launcher fix is clean and URL-precise. The gaxios lazy-load fallback is functionally correct for all realistic single-caller paths. The one real concern — the async gap between the installState guard and the first await in installGaxiosFetchCompat — would only manifest with concurrent callers, which is not how any current call site uses it. All changes ship with targeted tests and the new CI smoke steps catch launcher regressions early.
  • src/infra/gaxios-fetch-compat.ts — review the async idempotency gap in installGaxiosFetchCompat before the pattern is reused elsewhere.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/gaxios-fetch-compat.ts
Line: 228-246

Comment:
**Async gap leaves idempotency guard open to concurrent calls**

`installState` is checked before the `await loadGaxiosConstructor()` call but is not updated until after that `await` resolves. If two callers invoke `installGaxiosFetchCompat()` concurrently (the second call starts before the first has resolved), both see `installState === "not-installed"` and get past the early-return guard. The first caller then patches `Gaxios.prototype._defaultAdapter` and sets `installState = "installed"`. The second caller either captured `originalDefaultAdapter` before the first patch landed, in which case its patch overwrites the first, or it captures the already-patched adapter and double-wraps it.

In practice this is low risk today (both call sites are serial), but the invariant is fragile. A minimal fix is to set `installState` to a sentinel before the `await`:

```typescript
export async function installGaxiosFetchCompat(): Promise<void> {
  if (installState !== "not-installed" || typeof globalThis.fetch !== "function") {
    return;
  }
  installState = "installed"; // claim the slot before the first await

  const Gaxios = await loadGaxiosConstructor();
  if (!Gaxios) {
    installLegacyWindowFetchShim();
    installState = "shimmed";
    return;
  }
  // ... rest of install
}
```

(Adjust the sentinel value as appropriate for the final state machine.)

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: test/openclaw-launcher.e2e.test.ts
Line: 7-16

Comment:
**Temp directory leaks if `makeLauncherFixture` throws mid-setup**

`fs.mkdtemp` creates the directory before `fs.copyFile` and `fs.mkdir` are called. If either of those subsequent operations throws, the function exits without returning the path, so the caller never pushes it to `fixtureRoots`, and the `afterEach` cleanup never runs on it.

Consider registering the path for cleanup as soon as it is created:

```typescript
async function makeLauncherFixture(fixtureRoots: string[]): Promise<string> {
  const fixtureRoot = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-launcher-"));
  fixtureRoots.push(fixtureRoot); // register immediately so afterEach can clean it up
  await fs.copyFile(
    path.resolve(process.cwd(), "openclaw.mjs"),
    path.join(fixtureRoot, "openclaw.mjs"),
  );
  await fs.mkdir(path.join(fixtureRoot, "dist"), { recursive: true });
  return fixtureRoot;
}
```

Each test would then pass `fixtureRoots` to the helper instead of pushing after the fact.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: b53818c

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dc3411350

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 26f6757a3c

@Takhoffman Takhoffman merged commit 313e5bb into main Mar 16, 2026
29 of 37 checks passed
@Takhoffman Takhoffman deleted the codex/startup-memory-launcher-fix branch March 16, 2026 22:21
vincentkoc pushed a commit to vincentkoc/openclaw that referenced this pull request Mar 17, 2026
* Fix launcher startup regressions

* Fix CI follow-up regressions

* Fix review follow-ups

* Fix workflow audit shell inputs

* Handle require resolve gaxios misses

Labels

maintainer, scripts, size: M

1 participant