fix(matrix): rewrite /keys/upload OTK ID-collision 400 to a synthetic 200 #74529
nklock wants to merge 2 commits into openclaw:main
Codex review: needs maintainer review before merge.
Best possible solution: Land the narrow Matrix transport mitigation if maintainers accept the temporary synthetic-response semantics, keep matrix-org/matrix-rust-sdk#6520 as the root-cause tracker, and require focused Matrix tests plus a clean changed-lane gate.
Do we have a high-confidence way to reproduce the issue? Yes for the failure mechanics: the linked Matrix recovery report includes the exact …
Is this the best way to solve the issue? Yes, pending Matrix maintainer sign-off: a Matrix-plugin transport hook is the narrowest OpenClaw-owned workaround because the failing request goes through …
Codex review notes: model gpt-5.5, reasoning high; reviewed against 0fad53a19281.
bdb169f to 9561d81
… 200
matrix-rust-sdk's OlmMachine occasionally emits a `KeysUploadRequest`
containing a one-time-key whose `signed_curve25519:<id>` collides with an
OTK ID already published earlier in the same session. Synapse rejects the
duplicate with HTTP 400 and a body whose `error` field matches
"<algorithm>:<id> already exists". matrix-js-sdk's `requestWithRetry` does
not retry 4xx, so the bootstrap call fails outright. The collision
reproduces deterministically on `matrix-sdk-crypto-nodejs@0.5.1` /
`matrix-sdk-crypto-wasm@18.2.0` during cross-signing bootstrap retries
and blocks every affected account from completing E2EE setup.
Until the upstream rust-sdk OTK ID generation/tracking issue is fixed,
this rewrites the single failure mode at the `fetchFn` boundary:
- Detects POST /keys/upload responses where status is 400, errcode is
  M_INVALID_PARAM, and the error string matches
  /(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i.
- Synthesizes a 200 with `{"one_time_key_counts":{}}`. The empty counts
  signal to the rust SDK's outgoing-request loop to mint fresh OTKs
  (with new IDs) and re-upload them on the next tick. Semantically
  correct: the colliding key is genuinely already on the server.
- Logs once per host on the synthesis path so the workaround is
  observable but not noisy.
- Strictly scoped: only POST, only /keys/upload (not /keys/query, not
  /keys/device_signing/upload), only the precise collision shape. Other
  4xx bodies pass through unchanged.
Tests cover the matcher in isolation (status, method, path, body shape)
and the transport-layer integration (collision rewrite, repeated
collisions, non-collision 400 passthrough, /keys/query passthrough).
Verified on `matrix.thepolycule.ca` (Synapse 1.145.0+ess.1, MAS-fronted)
where the same account that previously wedged on /keys/upload retries
now completes bootstrap idempotently.
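As a rough illustration of the detection and synthesis rules in the commit message above, the following is a minimal sketch, not the code in this PR. The function names are borrowed from the PR description, but the bodies, the `CollisionProbe` type, and the assumption that the response body arrives as JSON text are all hypothetical.

```typescript
// Hypothetical sketch of the matcher and the synthetic response.
// Only the regex and the gating conditions come from the commit message.
const OTK_COLLISION_RE =
  /(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i;

// Assumed shape of the already-buffered response data handed to the matcher.
interface CollisionProbe {
  url: string;
  method: string;
  status: number;
  body: string; // raw response body text, assumed to be JSON
}

function isKeysUploadCollision400(probe: CollisionProbe): boolean {
  if (probe.status !== 400) return false;
  if (probe.method.toUpperCase() !== "POST") return false;

  let pathname: string;
  try {
    pathname = new URL(probe.url).pathname;
  } catch {
    return false; // not an absolute URL: never rewrite
  }
  // "/keys/device_signing/upload" and "/keys/query" do not end with this suffix.
  if (!pathname.endsWith("/keys/upload")) return false;

  let parsed: unknown;
  try {
    parsed = JSON.parse(probe.body);
  } catch {
    return false; // non-JSON error body: pass through
  }
  if (typeof parsed !== "object" || parsed === null) return false;

  const { errcode, error } = parsed as { errcode?: unknown; error?: unknown };
  return (
    errcode === "M_INVALID_PARAM" &&
    typeof error === "string" &&
    OTK_COLLISION_RE.test(error)
  );
}

function synthesizeKeysUploadCollisionResponse(): Response {
  // Empty counts, as described above; status and content type mark it as a normal success.
  return new Response('{"one_time_key_counts":{}}', {
    status: 200,
    headers: { "content-type": "application/json" },
  });
}
```

Keeping the matcher a pure function over already-buffered data means it can be tested without any network I/O, which matches the "matcher in isolation" tests mentioned above.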
9561d81 to 9a3f0df
Rewrite /keys/upload OTK ID-collision 400 to a synthetic 200
What this PR does
Unblocks Matrix E2EE bootstrap on accounts where `matrix-rust-sdk` regenerates a one-time-key with an ID it has already published. Before this change, the OTK upload deterministically 400s with `signed_curve25519:<id> already exists`, `requestWithRetry` does not retry 4xx, and `bootstrapCrossSigning()` fails, so every affected account is stuck in plaintext-only.

This change adds a tiny, narrowly scoped rewrite at the `fetchFn` boundary:

- When a `POST` to a URL ending in `/keys/upload` returns HTTP 400 and the body matches `/(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i`, the response is rewritten to `200 {"one_time_key_counts":{}}`.
- The empty `one_time_key_counts` signals to the rust SDK's outgoing-request loop that it should mint fresh OTK IDs and re-upload them on the next tick. Semantically the swallow is correct: the colliding key is genuinely already on the server.
- Anything else (other 4xx bodies, other paths such as `/keys/query` and `/keys/device_signing/upload`, non-POST methods) passes through unchanged.

This is a transport-layer mitigation only. The long-term fix belongs in `matrix-rust-sdk` (OlmMachine OTK ID generation/tracking; likely the ID counter is seeded from persisted state without advancing past previously-published-but-not-yet-claimed IDs). I'll file the upstream rust-sdk issue separately and reference it once the workaround lands.

Where the change lives
`extensions/matrix/src/matrix/sdk/keys-upload-collision.ts` (new, ~55 lines)

Pure helpers. No I/O.

- `isKeysUploadCollision400({url, method, status, body})` is strictly gated: status === 400 AND method === POST AND `URL(url).pathname` ends with `/keys/upload` AND body matches the collision regex.
- `synthesizeKeysUploadCollisionResponse(url)` builds `new Response('{"one_time_key_counts":{}}', { status: 200, ... })` with a `content-type: application/json` header and `response.url` set to the original request URL.

`extensions/matrix/src/matrix/sdk/transport.ts`

Inside `createMatrixGuardedFetch`, after the response body is buffered:

- On a collision match, log once per host (tracked in a `Set<string>` keyed by URL host), and return the synthetic 200.
- Otherwise, continue to `buildBufferedResponse`.

The check is inside `createMatrixGuardedFetch` (not `performMatrixRequest`) so it scopes to the matrix-js-sdk `fetchFn` path only; the higher-level `MatrixAuthedHttpClient.requestJson` semantics are untouched.

Tests
`extensions/matrix/src/matrix/sdk/keys-upload-collision.test.ts` (new, 9 tests): covers the matcher in isolation (status, method, path, body shape), including `signed_curve25519` and bare `curve25519` bodies and non-matching paths (`/keys/query`, `/keys/device_signing/upload`).

`extensions/matrix/src/matrix/sdk/transport.test.ts` (4 new tests on `createMatrixGuardedFetch`):

- collision 400 rewritten to 200 `{"one_time_key_counts":{}}`
- repeated collisions
- non-collision 400 (`M_FORBIDDEN`) on `/keys/upload` passes through unchanged
- `/keys/query` passes through unchanged

`pnpm test:extension matrix`: 118 test files green, 0 failed.

What this PR explicitly does not do

- Does not modify `matrix-rust-sdk` or `matrix-js-sdk`. The workaround is openclaw-only.

Verified end-to-end
Confirmed on `matrix.thepolycule.ca` (Synapse 1.145.0+ess.1, MAS-fronted): the same account that previously wedged on `/keys/upload` retries now completes cross-signing bootstrap idempotently. Re-running `openclaw matrix encryption setup` against an already-bootstrapped account no longer 400s.

AI-assisted disclosure

This change was authored with AI assistance (Claude). The author reviewed every line, ran `pnpm test:extension matrix` locally, and verified the fix end-to-end against a MAS-fronted Synapse 1.145.0+ess.1 deployment.
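
For readers who want a concrete picture of the transport hook described under "Where the change lives", here is a minimal sketch of a fetch wrapper that rewrites the collision 400 and logs once per host. It is an illustration under stated assumptions, not the actual `createMatrixGuardedFetch` code: `withKeysUploadCollisionRewrite`, `FetchLike`, and the log message are hypothetical names, and the wrapper reads a clone of the response instead of using the PR's buffering path.

```typescript
// Hypothetical, simplified fetch signature; the real fetchFn accepts more input shapes.
type FetchLike = (
  input: string,
  init?: { method?: string },
) => Promise<Response>;

const COLLISION_RE =
  /(?:signed_curve25519|curve25519):[A-Za-z0-9_-]+ already exists/i;

function withKeysUploadCollisionRewrite(
  inner: FetchLike,
  log: (message: string) => void = console.warn,
): FetchLike {
  // Hosts that have already produced a log line, so the workaround stays quiet.
  const loggedHosts = new Set<string>();

  return async (input, init) => {
    const response = await inner(input, init);

    // Cheap gates first: only POST responses with status 400 are candidates.
    const method = (init?.method ?? "GET").toUpperCase();
    if (response.status !== 400 || method !== "POST") return response;

    // Assumes an absolute URL, which a real fetch call requires anyway.
    const url = new URL(input);
    if (!url.pathname.endsWith("/keys/upload")) return response;

    // Read a clone so the original body is still readable on passthrough.
    const body = await response.clone().text();
    if (!COLLISION_RE.test(body)) return response;

    if (!loggedHosts.has(url.host)) {
      loggedHosts.add(url.host);
      log(`matrix: rewriting /keys/upload OTK collision 400 to 200 for ${url.host}`);
    }

    return new Response('{"one_time_key_counts":{}}', {
      status: 200,
      headers: { "content-type": "application/json" },
    });
  };
}
```

A caller would wrap its existing fetch function once and pass the result wherever `fetchFn` is expected; every non-matching response is returned untouched.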