[Security] DurableState revision tampering via JSON-body manipulation

## Severity / Size

- **Severity**: CRITICAL — bypasses optimistic-concurrency control on DurableState; opens a write/read corruption path for any deployment that doesn't enable client-side encryption.
- **Size**: M (~1.5d).
- **Threat model**: an attacker with **write access to the object-storage backend** (shared S3 bucket with overly permissive IAM, co-tenant on a filesystem-backed store, insider, compromised CI pipeline pushing to the bucket).  TCP-level cluster access is **not** required.

## Affected files

- `src/persistence/durable-state-stores/ObjectStorageDurableStateStore.ts:75-98` — `load()` parses `{ revision, state, timestamp }` from the body and trusts the in-body `revision` for both the return value and the etagCache.
- `src/persistence/durable-state-stores/ObjectStorageDurableStateStore.ts:100-185` — `upsert()` uses the cached revision for the CAS check; tampered cache → bypass.
- `src/persistence/object-storage/BodyCodec.ts` — the wire format.  AES-GCM's auth-tag covers the ciphertext, **but** for unencrypted bodies there is no integrity field.

## Background

The framework's ObjectStorage-backed DurableStateStore wraps any S3/MinIO/filesystem backend.  Records are JSON-encoded into the wire format defined in `BodyCodec.ts`:

```
ATS1 | flags | [keyVersion] | [iv] | payload
```

The `payload` for DurableState is `JSON.stringify({ revision: N, state: <user>, timestamp: T })`.  When encryption is enabled (`mode: 'client-aes256-gcm'`), AES-GCM's auth tag covers the whole ciphertext; flipping any byte in the body — including the `revision` digits — invalidates the tag and the decode throws.

For the **`mode: 'none'`** setup (the default), there is no integrity check.  `decodeBody()` returns the bytes; `load()` `JSON.parse`s and trusts the result.  This is the gap the exploit walks through.

## Exploit walkthrough

Setup: a DurableState-backed actor (`BankAccount` say) with `encryption: 'none'`, stored on a shared S3 bucket where the attacker has `PutObject` permission.

Step 1 — legitimate write: the actor upserts state `{ balance: 100 }` with `expectedRevision=4`.  The store writes a body containing JSON `{ revision: 5, state: { balance: 100 }, timestamp: 1715... }`.  Backend returns etag `"abc"`.  The store caches `{ etag: "abc", revision: 5 }`.

Step 2 — attacker tamper: attacker downloads the object, edits the JSON to `{ revision: 999, state: { balance: 100 }, timestamp: 1715... }` and PUTs it back.  Backend assigns a new etag `"def"` (we never see this).

Step 3 — actor cold-start (or any other process that calls `load()` without a stale cache): `load()` fetches, parses, and caches `{ etag: "def", revision: 999 }`.  Returns `{ revision: 999, state: { balance: 100 } }` to the caller.

Step 4 — actor logic relies on the revision counter for ordering decisions (audit log, idempotency, billing). It now reasons about an artificially-inflated revision.  Worst case (banking-style): attacker writes `revision: 7` instead of `999`.  Next legitimate `upsert(expectedRevision=5)` compares cached.revision (7) ≠ expected (5) → throws CAS error.  Operator sees a phantom "concurrent writer" and may re-issue with `expectedRevision=7`.  That succeeds — and overwrites with whatever state the attacker chose (because the attacker's payload also kept the `state` field).

The attacker doesn't need to be in the cluster.  They don't need encryption keys.  They just need write access to the bucket.

## How the 8 already-landed security fixes inform this

Two existing fixes are closely related and should set the pattern:

- **Snapshot seq corruption** (`99de741`): in `PersistentActor.recover()` we now cross-check `snapshot.sequenceNr` against `journal.highestSeq(pid)`.  A snapshot from outer space (`MAX_SAFE_INTEGER`) is rejected because the journal can't corroborate it.  → For DurableState there is no journal to cross-check against; revisions are self-contained.  Need a different defense.
- **Memcached CRLF / Frame-DoS / Path-Traversal**: each added a small input-validation helper at the boundary.  → For revision-tampering the equivalent is an integrity-MAC at the codec boundary.

The eight existing fixes are all in the body of the commit history (`d454079..4cac92a`).  Follow the same style: exploit-test (reproduces the issue), defense-test (verifies fix), regression-test (legitimate path unchanged).

## Fix design

Two-track approach because the encrypted vs unencrypted paths have different integrity stories.

**Track 1 — encrypted bodies: lift `revision` into the manifest header.**

Move `revision` out of the JSON payload and into the ATS1 manifest, just like `keyVersion`.  AES-GCM's `additionalAuthenticatedData` parameter binds the manifest to the ciphertext: tampering with `revision` in the header invalidates the auth tag.

```
ATS1 | flags | [keyVersion] | revision: u64-LE | [iv] | ciphertext-with-AAD
```

The `aesGcmEncrypt(subkey, iv, plaintext, aad)` call gets a new `aad` parameter; the manifest's revision-byte-range is the AAD.  `decode()` parses `revision` from the manifest and passes the same byte-range as AAD.

This is the strongest defense — it costs ~8 bytes of header per record and adds one `subtle.encrypt` argument.

**Track 2 — unencrypted bodies: opt-in HMAC-SHA256.**

Add an `IntegrityConfig`:

```ts
export type IntegrityConfig =
  | { readonly mode: 'none' }
  | { readonly mode: 'hmac-sha256';
      readonly integrityKey: Uint8Array;  // 32 bytes
    };
```

When set, the codec computes HMAC over `(revision-bytes || timestamp-bytes || state-bytes)` with the integrity key and appends an 8-byte HMAC-tag (truncated SHA-256, since we just need collision-resistance not cryptographic-strength signature).  A new flag bit (`FLAG_INTEGRITY_HMAC`) signals presence in the manifest.

The integrity key is **separate** from the encryption master key — it's only used for HMAC, and is allowed to live alongside the data (the threat is tampering, not confidentiality).  Operators can rotate it with a similar sweep helper to the existing re-encryption sweep (`reEncryptObjectStorage`).

Default: `integrity: { mode: 'none' }` — back-compat.  Bodies written before this fix have `FLAG_INTEGRITY_HMAC` unset and decode normally without the HMAC check (legacy-safe).

**Track 3 — revise the etagCache.**

Even with integrity in place, `load()`'s etagCache should not trust an unverified revision.  Refactor so the cache is populated AFTER the integrity check passes.

## API surface

```ts
// src/persistence/PersistenceOptions.ts
export interface DurableStateOptions {
  // ... existing
  readonly integrity?: IntegrityConfig;
}

// New helper for legitimate sweep + key rotation:
// src/persistence/object-storage/integrityKeyRotation.ts
export async function rotateIntegrityKey(
  backend: ObjectStorageBackend,
  opts: { keyPrefix: string; oldKey: Uint8Array; newKey: Uint8Array; onProgress?: ... }
): Promise<{ rewrote: number }>;
```

## Backward compatibility

- Legacy bodies (no FLAG_INTEGRITY_HMAC) decode unchanged — the integrity check is opt-in.
- Operators enable `integrity: { mode: 'hmac-sha256' }` in their `DurableStateStore` config; existing records get re-validated on next write.
- A migration helper (sweep) rewrites legacy records with the HMAC.  Pattern matches the existing `reEncryptObjectStorage`.

## Test plan

Three tests, mirroring the established Security-Fix style (exploit / defense / regression):

1. **Exploit test (`tests/unit/persistence/durable-state-stores/integrity-tampering.test.ts`)**: write a record with `{revision: 5}`, manually edit the on-disk JSON to `{revision: 999}`, `load()` returns the **tampered** value (proves the issue exists pre-fix).  After the fix lands, this test asserts that load throws when `integrity` is configured.

2. **Defense test**: same record, same tamper, but `DurableStateStore({integrity: { mode: 'hmac-sha256', integrityKey }})`; `load()` throws `JournalError: integrity check failed`.  Subsequent legitimate writes still work.

3. **Regression test**: legacy body (written without integrity, `FLAG_INTEGRITY_HMAC` unset) decodes cleanly under a config that has integrity enabled — back-compat preserved.

4. **Encrypted-AAD test**: with encryption enabled, tamper with `revision` byte in the manifest → `aesGcmDecrypt` throws (AAD mismatch).  No new code needed beyond the AAD wiring.

5. **Concurrency-test regression**: existing `ObjectStorageDurableStateStore` concurrency tests still pass.

## Acceptance criteria

- [ ] `IntegrityConfig` exposed in `PersistenceOptions.ts`.
- [ ] `BodyCodec` carries an optional `integrity` argument on encode + verifies on decode; new flag bit reserved.
- [ ] `aesGcmEncrypt` / `aesGcmDecrypt` accept an `aad` parameter; manifest revision flows through as AAD when encryption is active.
- [ ] `ObjectStorageDurableStateStore.load()` checks integrity before populating etagCache.
- [ ] Exploit-test demonstrates the pre-fix vulnerability and the post-fix block (4 tests above).
- [ ] Backward-compat: existing tests in `tests/unit/persistence/durable-state-stores/` all green.
- [ ] Plan-doc + README "Known security caveats" entry updated when this lands (move out of "remaining" into "fixed").


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Security] DurableState revision tampering via JSON-body manipulation #116

Severity / Size

Affected files

Background

Exploit walkthrough

How the 8 already-landed security fixes inform this

Fix design

API surface

Backward compatibility

Test plan

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Security] DurableState revision tampering via JSON-body manipulation #116

Description

Severity / Size

Affected files

Background

Exploit walkthrough

How the 8 already-landed security fixes inform this

Fix design

API surface

Backward compatibility

Test plan

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions