Skip to content

fix(matrix): rewrite /keys/upload OTK ID-collision 400 to a synthetic 200#74529

Open
nklock wants to merge 2 commits intoopenclaw:mainfrom
nklock:fix/matrix-otk-upload-collision
Open

fix(matrix): rewrite /keys/upload OTK ID-collision 400 to a synthetic 200#74529
nklock wants to merge 2 commits intoopenclaw:mainfrom
nklock:fix/matrix-otk-upload-collision

Conversation

@nklock
Copy link
Copy Markdown
Contributor

@nklock nklock commented Apr 29, 2026

Rewrite /keys/upload OTK ID-collision 400 to a synthetic 200

What this PR does

Unblocks Matrix E2EE bootstrap on accounts where matrix-rust-sdk regenerates a one-time-key with an ID it has already published. Before this change, the OTK upload deterministically 400s with signed_curve25519:<id> already exists, requestWithRetry does not retry 4xx, and bootstrapCrossSigning() fails — every affected account is stuck in plaintext-only.

This change adds a tiny, narrowly-scoped rewrite at the fetchFn boundary:

  • If a POST to a URL ending /keys/upload returns HTTP 400 and the body matches /(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i, the response is rewritten to 200 {"one_time_key_counts":{}}.
  • The empty one_time_key_counts signals to the rust SDK's outgoing-request loop that it should mint fresh OTK IDs and re-upload them on the next tick. Semantically the swallow is correct — the colliding key is genuinely already on the server.
  • Anything else — non-collision 400 bodies, 400s on other endpoints (/keys/query, /keys/device_signing/upload), non-POST methods — passes through unchanged.
  • Logs once per host so the workaround is observable but not noisy.

This is a transport-layer mitigation only. The long-term fix belongs in matrix-rust-sdk (OlmMachine OTK ID generation/tracking — likely the ID counter is seeded from persisted state without advancing past previously-published-but-not-yet-claimed IDs). I'll file the upstream rust-sdk issue separately and reference it once the workaround lands.

Where the change lives

extensions/matrix/src/matrix/sdk/keys-upload-collision.ts (new, ~55 lines)

Pure helpers. No I/O.

  • isKeysUploadCollision400({url, method, status, body}) — strictly gated: status === 400 AND method === POST AND URL(url).pathname ends with /keys/upload AND body matches the collision regex.
  • synthesizeKeysUploadCollisionResponse(url) — builds new Response('{"one_time_key_counts":{}}', { status: 200, ... }) with a content-type: application/json header and response.url set to the original request URL.

extensions/matrix/src/matrix/sdk/transport.ts

Inside createMatrixGuardedFetch, after the response body is buffered:

  1. Decode the body as UTF-8 (non-fatal).
  2. Check the matcher.
  3. On match, log once via a module-scoped Set<string> keyed by URL host, and return the synthetic 200.
  4. Otherwise fall through to buildBufferedResponse.

The check is inside createMatrixGuardedFetch (not performMatrixRequest) so it scopes to the matrix-js-sdk fetchFn path only — the higher-level MatrixAuthedHttpClient.requestJson semantics are untouched.

Tests

extensions/matrix/src/matrix/sdk/keys-upload-collision.test.ts (new, 9 tests):

  • collision body returns true on 400 POST /keys/upload
  • matches both signed_curve25519 and bare curve25519
  • rejects non-400 statuses
  • rejects non-POST methods
  • rejects other paths (/keys/query, /keys/device_signing/upload)
  • rejects non-collision 400 bodies
  • handles malformed URLs
  • handles non-JSON bodies
  • synthesized response is 200 with the correct shape and URL

extensions/matrix/src/matrix/sdk/transport.test.ts (4 new tests on createMatrixGuardedFetch):

  • collision 400 → caller observes 200 with {"one_time_key_counts":{}}
  • consecutive collisions both synthesize 200
  • non-collision 400 (M_FORBIDDEN) on /keys/upload passes through unchanged
  • collision-shaped 400 on /keys/query passes through unchanged

pnpm test:extension matrix — 118 test files green, 0 failed.

What this PR explicitly does not do

  • Does not modify matrix-rust-sdk or matrix-js-sdk. The workaround is openclaw-only.
  • Does not rewrite any other 4xx response. The matcher is intentionally narrow.
  • Does not silence the failure — the warn-once log surfaces the synthesis on each affected host.
  • Does not address the cross-signing UIA / MAS-reset surface, which is the companion change in the MSC3967 PR (filed separately).

Verified end-to-end

Confirmed on matrix.thepolycule.ca (Synapse 1.145.0+ess.1, MAS-fronted): the same account that previously wedged on /keys/upload retries now completes cross-signing bootstrap idempotently. Re-running openclaw matrix encryption setup against an already-bootstrapped account no longer 400s.

AI-assisted disclosure

This change was authored with AI assistance (Claude). The author reviewed every line, ran pnpm test:extension matrix locally, and verified the fix end-to-end against a MAS-fronted Synapse 1.145.0+ess.1 deployment.

@openclaw-barnacle openclaw-barnacle Bot added channel: matrix Channel integration: matrix size: M labels Apr 29, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs maintainer review before merge.

Summary
The PR adds a Matrix guarded-fetch workaround that converts narrowly matched POST /keys/upload OTK-collision 400 responses to a synthetic 200, with focused tests and a Matrix changelog entry.

Reproducibility: yes. for the failure mechanics: the linked Matrix recovery report includes the exact /keys/upload OTK collision 400, current main lacks a rewrite, and matrix-js-sdk parses non-OK responses as errors after the injected fetch boundary. I did not run a live homeserver reproduction in this read-only pass.

Next step before merge
Maintainers should review the Matrix E2EE semantics and inspect or rerun the failed PR-head checks; no focused automated repair is evident from the diff or annotations.

Security
Cleared: The diff introduces no dependency, workflow, lockfile, install-script, secret-handling, or permission changes; the security-sensitive behavior is limited to a tightly gated Matrix transport response rewrite.

Review details

Best possible solution:

Land the narrow Matrix transport mitigation if maintainers accept the temporary synthetic-response semantics, keep matrix-org/matrix-rust-sdk#6520 as the root-cause tracker, and require focused Matrix tests plus a clean changed-lane gate.

Do we have a high-confidence way to reproduce the issue?

Yes for the failure mechanics: the linked Matrix recovery report includes the exact /keys/upload OTK collision 400, current main lacks a rewrite, and matrix-js-sdk parses non-OK responses as errors after the injected fetch boundary. I did not run a live homeserver reproduction in this read-only pass.

Is this the best way to solve the issue?

Yes, pending Matrix maintainer sign-off: a Matrix-plugin transport hook is the narrowest OpenClaw-owned workaround because the failing request goes through fetchFn and the PR passes non-collision 400s and other endpoints through. The safer long-term solution remains fixing OTK ID generation/tracking upstream in matrix-rust-sdk.

Acceptance criteria:

  • pnpm test extensions/matrix/src/matrix/sdk/keys-upload-collision.test.ts extensions/matrix/src/matrix/sdk/transport.test.ts
  • pnpm exec oxfmt --check --threads=1 extensions/matrix/src/matrix/sdk/keys-upload-collision.ts extensions/matrix/src/matrix/sdk/keys-upload-collision.test.ts extensions/matrix/src/matrix/sdk/transport.ts extensions/matrix/src/matrix/sdk/transport.test.ts
  • pnpm check:changed in Testbox before merge
  • Inspect or rerun failed PR-head GitHub Actions jobs checks-node-core and checks-node-agentic-control-plane

What I checked:

Likely related people:

  • gumadeiras: Recent current-main commits touch Matrix E2EE setup, identity trust, QA stabilization, sdk.ts, crypto-bootstrap.ts, and Matrix docs adjacent to the reported cross-signing/bootstrap failure. (role: recent Matrix E2EE and crypto-bootstrap maintainer; confidence: high; commits: 99159f89da03, 72731a37d255, 94081d886336; files: extensions/matrix/src/matrix/sdk.ts, extensions/matrix/src/matrix/sdk/crypto-bootstrap.ts, docs/channels/matrix.md)
  • steipete: Recent path history for transport.ts includes guarded transport mockability, Matrix transport tests, runtime fetch guarding, and test-boundary work. (role: recent Matrix guarded-transport maintainer; confidence: high; commits: 8c249a8cca1a, 7815d25eef59, 41ef752dd888; files: extensions/matrix/src/matrix/sdk/transport.ts, extensions/matrix/src/matrix/sdk/transport.test.ts)
  • nklock: Separate from this PR, current-main history includes merged Matrix E2EE verification work around SAS completion and cross-signing upload behavior. (role: recent adjacent Matrix E2EE contributor; confidence: medium; commits: 86956f71e62a; files: extensions/matrix/src/matrix/actions/verification.ts, extensions/matrix/src/matrix/sdk/verification-manager.ts, extensions/matrix/CHANGELOG.md)

Remaining risk / open question:

  • The PR head currently has two failed GitHub Actions checks that need inspection or rerun before merge.
  • The durable fix remains upstream in matrix-rust-sdk; this PR is a temporary OpenClaw transport mitigation whose synthetic-response semantics need Matrix maintainer sign-off.
  • I did not independently reproduce the live homeserver scenario in this read-only review, so end-to-end proof rests on the linked issue context and the PR author's validation.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 0fad53a19281.

… 200

matrix-rust-sdk's OlmMachine occasionally emits a `KeysUploadRequest`
containing a one-time-key whose `signed_curve25519:<id>` collides with an
OTK ID already published earlier in the same session. Synapse rejects the
duplicate with HTTP 400 and a body whose `error` field matches
"<algorithm>:<id> already exists". matrix-js-sdk's `requestWithRetry` does
not retry 4xx, so the bootstrap call fails outright. The collision
reproduces deterministically on `matrix-sdk-crypto-nodejs@0.5.1` /
`matrix-sdk-crypto-wasm@18.2.0` during cross-signing bootstrap retries
and blocks every affected account from completing E2EE setup.

Until the upstream rust-sdk OTK ID generation/tracking issue is fixed,
this rewrites the single failure mode at the `fetchFn` boundary:

- Detects POST /keys/upload responses where status is 400, errcode is
  M_INVALID_PARAM, and the error string matches
  /(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i.
- Synthesizes a 200 with `{"one_time_key_counts":{}}`. The empty counts
  signal to the rust SDK's outgoing-request loop to mint fresh OTKs
  (with new IDs) and re-upload them on the next tick. Semantically
  correct — the colliding key is genuinely already on the server.
- Logs once per host on the synthesis path so the workaround is
  observable but not noisy.
- Strictly scoped: only POST, only /keys/upload (not /keys/query, not
  /keys/device_signing/upload), only the precise collision shape. Other
  4xx bodies pass through unchanged.

Tests cover the matcher in isolation (status, method, path, body shape)
and the transport-layer integration (collision rewrite, repeated
collisions, non-collision 400 passthrough, /keys/query passthrough).

Verified on `matrix.thepolycule.ca` (Synapse 1.145.0+ess.1, MAS-fronted)
where the same account that previously wedged on /keys/upload retries
now completes bootstrap idempotently.
@nklock nklock force-pushed the fix/matrix-otk-upload-collision branch from 9561d81 to 9a3f0df Compare April 30, 2026 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: matrix Channel integration: matrix size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant