[Security] ClusterClient askId predictability via Date.now()+counter

## Severity / Size

- **Severity**: HIGH — an attacker who can inject a single frame onto the cluster-client TCP socket (MitM on cleartext `tcp://`, compromised peer, or a malicious cluster node sitting between the client's contact-point and the eventual target) can resolve in-flight asks with attacker-chosen payloads.
- **Size**: S (~1d).
- **Threat model**: anyone with frame-level write access to the wire — MitM on plaintext, malicious peer, on-path observer + injector.  TLS for the cluster transport (which exists today via `TlsTransportSettings`) closes the network-injection path, but doesn't help against a compromised cluster peer that can issue frames legitimately.

## Affected files

- `src/cluster/ClusterClient.ts:82-86` — `nextAskId()` generator.
- `src/cluster/ClusterClient.ts:153-180` — `ask()` registers the pending callback under that ID and waits for `cluster-client-reply { askId, ... }`.
- `src/cluster/ClusterClient.ts:270-300` — reply-handler: looks up `pending.get(askId)` and resolves the promise.

## Background

`ClusterClient` is the outside-in handle landed in v0.8.0 (#86 / commit `5567dc5`).  It exchanges `cluster-client-envelope` / `cluster-client-reply` frames with a `ClusterClientReceptionist` on the cluster side.  Each `ask()` call generates an `askId` used to route the matching reply back to the right pending-promise.

The current generator is:

```ts
let _askCounter = 0;
function nextAskId(): string {
  _askCounter = (_askCounter + 1) >>> 0;
  return `c${Date.now()}-${_askCounter}`;
}
```

Two predictable inputs: millisecond wallclock + monotonic counter starting at 0.  Both are observable from outside the process — a single captured frame gives the attacker a tight window for the next 100+ askIds.

## Exploit walkthrough

Setup: legitimate `ClusterClient` connected to a cluster.  Attacker is on the network path (or in a malicious cluster node that observes traffic).

Step 1 — observation: attacker captures one outbound frame from the client.  The frame contains `askId: c1715000000123-7` (let's say).  Attacker now knows the current `_askCounter = 7` and the current `Date.now()` epoch on the client process.

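Step 2 — prediction: the next ask carries counter 8, so an ask issued in the same millisecond would be `c1715000000123-8`, and one issued about a millisecond later (jitter) would be `c1715000000124-8`.  To allow for other asks issued in between, the attacker also covers a few subsequent counter values.  For example, 2 timestamps × 5 counter values = 10 candidates.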
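
To make the arithmetic concrete, here is a minimal sketch of the enumeration, assuming the legacy `c<epochMs>-<counter>` format shown in Background.  `parseLegacyAskId` and `enumerateCandidates` are hypothetical helper names for illustration, not code from the repo.

```ts
// Hypothetical helper: parses a legacy askId of the form `c<epochMs>-<counter>`.
interface ParsedAskId {
  epochMs: number;
  counter: number;
}

function parseLegacyAskId(askId: string): ParsedAskId | null {
  const m = /^c(\d+)-(\d+)$/.exec(askId);
  if (m === null) return null;
  return { epochMs: Number(m[1]), counter: Number(m[2]) };
}

// Hypothetical helper: pairs every timestamp in [epochMs, epochMs + msWindow)
// with every counter in (counter, counter + counterWindow].
function enumerateCandidates(
  observed: string,
  msWindow: number,
  counterWindow: number,
): string[] {
  const parsed = parseLegacyAskId(observed);
  if (parsed === null) return [];
  const out: string[] = [];
  for (let dt = 0; dt < msWindow; dt++) {
    for (let dc = 1; dc <= counterWindow; dc++) {
      // Same uint32 wrap-around as the legacy generator.
      const counter = (parsed.counter + dc) >>> 0;
      out.push(`c${parsed.epochMs + dt}-${counter}`);
    }
  }
  return out;
}

const candidates = enumerateCandidates('c1715000000123-7', 2, 5);
console.log(candidates.length); // 10
```

A wider window only multiplies the two factors, so the attacker's cost stays linear in how far ahead they want to guess.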

Step 3 — pre-emptive injection: attacker sends a forged `cluster-client-reply` frame for each predicted askId, each with attacker-chosen `body`.  The client's reply-handler matches `askId` to the pending promise and resolves with the forged body.  The legitimate reply, when it eventually arrives, finds nothing in the pending map and is silently dropped.

Step 4 — impact: any `ask()` consumer that trusts the reply is now operating on attacker-injected data.  For an actor coordinating cluster-wide invariants (e.g. "is X allowed?", "what's the current configuration?"), this is total compromise of the ask channel.

The exploit requires network-frame injection.  It does **not** require breaking TLS — a compromised cluster node has legitimate frame-write access by definition.  And cleartext `tcp://` deployments are fully exposed.

## How the 8 already-landed security fixes inform this

- **Hello-handshake hijack** (`9c3b005`): first-conn-wins on the TCP transport.  That fix closed the byPeer-overwrite path; the askId fix closes the orthogonal reply-injection path on the same wire.  Both fixes harden the v0.8.0 ClusterClient feature.
- **FrameDecoder size cap** (`d454079`): used the pattern "validate at the entry-point, throw before allocating".  Same pattern fits here: validate askId entropy at generation; bind reply to socket at handling.
- **Idempotency body-fingerprint** (`4cac92a`): bound a cached response to the request that produced it via SHA-256 of method+path+body.  Same idea reused here: bind a reply to the *connection* it must come back on, plus an unguessable correlation ID.

## Fix design

Two complementary defenses, both small.

**Defense A — unguessable askId (primary).**

Replace `nextAskId()` with crypto-random:

```ts
function nextAskId(): string {
  // Prefer crypto.randomUUID (v4 UUID, 122 random bits) where available;
  // otherwise fall back to 128 random bits from getRandomValues,
  // encoded compactly as base64url.
  if (typeof globalThis.crypto?.randomUUID === 'function') {
    return globalThis.crypto.randomUUID();
  }
  const bytes = new Uint8Array(16);
  globalThis.crypto.getRandomValues(bytes);
  // base64url without padding: 16 bytes -> 22 chars
  return btoa(String.fromCharCode(...bytes))
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '');
}
```

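`crypto.randomUUID()` is available on Node 19+, Bun 1.0+, Deno 2.0+, which covers every runtime we already support.  The fallback still relies on `globalThis.crypto.getRandomValues`, so it only covers runtimes that expose Web Crypto globally but lack `randomUUID`.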

**Defense B — bind reply to socket (defense-in-depth).**

The `pending` map is currently keyed only by `askId`.  Augment it to include the socket the ask was sent on:

```ts
this.pending.set(askId, {
  socket: this.socket,  // capture the socket reference at ask-time
  resolve, reject, timer,
});
```

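In the reply-handler, the frame's `askId` remains the lookup key, but the entry is resolved only when `entry.socket === currentSocket`, where `entry` is the value returned by `pending.get(askId)` and `currentSocket` is the socket the reply arrived on.  If the socket has changed (reconnect), the pending ask is invalidated.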

This protects against the scenario where the attacker can inject frames on a **different** socket while spoofing the askId — which becomes irrelevant in practice once Defense A makes askIds unguessable, but is cheap defense-in-depth.
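
One possible shape for the socket-bound pending map is sketched below.  `PendingAsks`, `SocketLike`, `ReplyFrame`, `handleReply` and `invalidateSocket` are hypothetical names; the real `ClusterClient` internals may differ, and the sketch rejects with a plain `Error` because the constructor of `ConnectionLostError` is not shown in this issue.

```ts
// Hypothetical minimal shapes; the real ClusterClient types may differ.
type SocketLike = object;

interface PendingAsk {
  socket: SocketLike; // socket the ask was sent on
  resolve: (body: unknown) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
}

interface ReplyFrame {
  askId: string;
  body: unknown;
}

class PendingAsks {
  private readonly pending = new Map<string, PendingAsk>();

  register(askId: string, entry: PendingAsk): void {
    this.pending.set(askId, entry);
  }

  // Returns true only if the reply resolved a pending ask.
  handleReply(frame: ReplyFrame, from: SocketLike): boolean {
    const entry = this.pending.get(frame.askId);
    if (entry === undefined) return false; // unknown or already settled
    if (entry.socket !== from) return false; // wrong socket: ignore, keep waiting
    this.pending.delete(frame.askId);
    clearTimeout(entry.timer);
    entry.resolve(frame.body);
    return true;
  }

  // Rejects every ask that was sent on a socket that has gone away.
  invalidateSocket(dead: SocketLike, reason: Error): void {
    this.pending.forEach((entry, askId) => {
      if (entry.socket !== dead) return;
      this.pending.delete(askId);
      clearTimeout(entry.timer);
      entry.reject(reason);
    });
  }
}
```

In this sketch a reply arriving on the wrong socket is ignored and the ask keeps waiting for its legitimate reply or its timeout; calling `invalidateSocket` from the disconnect path is what rejects asks stranded by a reconnect.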

**Receptionist side (symmetric).**

The `ClusterClientReceptionist` echoes the `askId` from the envelope into the reply.  It already trusts the envelope's askId without validation; that's fine on the cluster side since the cluster doesn't track in-flight asks — but for completeness, the receptionist should refuse to echo `askId` longer than a reasonable cap (say 256 chars) to prevent body-bloat attacks.
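
A minimal sketch of that guard follows.  `sanitizeAskId` and `MAX_ASK_ID_LENGTH` are hypothetical names, and what the receptionist does with a rejected envelope (drop it, log it, close the connection) is left open here.

```ts
// Cap suggested above; the exact value is a judgment call.
const MAX_ASK_ID_LENGTH = 256;

// Hypothetical guard: returns the askId if it is safe to echo, otherwise null.
function sanitizeAskId(askId: unknown): string | null {
  if (typeof askId !== 'string') return null;
  if (askId.length === 0 || askId.length > MAX_ASK_ID_LENGTH) return null;
  return askId;
}
```

Both new formats (a 36-character UUID or 22-character base64url) and the legacy format fit comfortably under the cap, so the check does not interfere with mixed-version clients.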

## API surface

No public API changes.  `ClusterClient.ask()` keeps its signature.  The fix is internal to the askId generator + the pending-map shape.

## Backward compatibility

The askId format changes from `c1715000000123-7` to either a UUID (`a1b2c3...`) or 22-char base64url.  Cluster-side receptionist parses the `askId` opaquely — it's just an echo field — so no change there.  No on-the-wire breaking change.

## Test plan

Four tests, again mirroring the established Security-Fix style:

1. **Exploit test (`tests/multi-node/cluster-client-security.test.ts`)**: capture an outgoing ask frame from a real `ClusterClient`; enumerate the next 20 plausible askIds based on Date.now() + counter; inject a forged `cluster-client-reply` for each.  Pre-fix: the legitimate ask resolves with the attacker payload, which demonstrates the vulnerability on the old generator.  Post-fix: the enumeration yields no matching askId, and the legitimate reply lands.

2. **Defense test**: generate 100K askIds in a tight loop; verify pairwise uniqueness; no Date.now()-prefix detectable; no monotonic counter pattern.

3. **Reconnect test**: ask issued on socket A; socket A drops; client reconnects on socket B; pending ask is invalidated (rejected with `ConnectionLostError`) rather than being matched by a reply on socket B.  Validates Defense B.

4. **Regression**: existing `tests/multi-node/cluster-client.test.ts` tests still pass.
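
The defense test (item 2) could look roughly like the framework-agnostic sketch below.  `checkAskIds` and `LEGACY_PATTERN` are hypothetical names, and the local `nextAskId` is a stand-in for the fixed generator that assumes a runtime exposing Web Crypto globally.  Note that a pattern check like this only catches a regression to the old format; it is not a statistical entropy test.

```ts
// Stand-in for the fixed generator (assumption: global Web Crypto is available).
function nextAskId(): string {
  return globalThis.crypto.randomUUID();
}

// Shape of the old `c<epochMs>-<counter>` askIds.
const LEGACY_PATTERN = /^c\d+-\d+$/;

interface AskIdCheck {
  unique: boolean; // no two generated askIds were equal
  legacyShaped: number; // how many askIds still look like the old format
}

// Hypothetical check: generates `count` askIds and inspects them.
function checkAskIds(generate: () => string, count: number): AskIdCheck {
  const seen = new Set<string>();
  let legacyShaped = 0;
  for (let i = 0; i < count; i++) {
    const id = generate();
    if (LEGACY_PATTERN.test(id)) legacyShaped++;
    seen.add(id);
  }
  return { unique: seen.size === count, legacyShaped };
}

const fixedResult = checkAskIds(nextAskId, 100000);
console.log(fixedResult.unique, fixedResult.legacyShaped);
```

Feeding the old generator through the same check should report every askId as legacy-shaped, which gives the test a useful negative control.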

## Acceptance criteria

- [ ] `nextAskId()` uses `crypto.randomUUID()` (or `getRandomValues` fallback).
- [ ] Pending map captures the socket at ask-time; reply-handler verifies socket match.
- [ ] Exploit test demonstrates the pre-fix vulnerability and the post-fix block.
- [ ] Defense test: 100K-askId uniqueness + entropy distribution.
- [ ] Reconnect test: cross-socket reply rejected.
- [ ] No public API change; receptionist symmetric (defensive askId length cap optional).
- [ ] Plan-doc + README "Known security caveats" entry updated on land.

