feat(audit): hash-chained audit stream for tamper detection #222

Closed
naveen-kurra wants to merge 1 commit into initializ:main from naveen-kurra:feat/gov-r5-hash-chain-v2
@naveen-kurra

Collaborator

Closes #212. (Re-open of #217 which GitHub auto-closed after main was force-pushed; local branch rebased onto new main.)

Summary

  • Every emitted AuditEvent carries a prev_hash field: sha256 of the previous event's marshaled JSON (genesis = 64 zeros).
  • Chain-mint + marshal + hash update + sink write happen under a single mutex in AuditLogger.Emit so concurrent goroutines cannot interleave and produce an invalid chain.
  • coreruntime.VerifyAuditLog walks NDJSON and reports the first bad line + reason; forge audit verify <file> is the CLI on top.

Test plan

  • go test ./forge-core/runtime/... (genesis, verify clean, tampering, deletion, 200-goroutine concurrent, genesis constant, malformed, prev_hash always emitted)
  • go test ./forge-cli/cmd/... (CLI OK / failure paths)
  • gofmt -w + golangci-lint run all clean

Docs at docs/security/audit-tamper-evidence.md.

…nitializ#212)

Every AuditEvent now carries a `prev_hash` field pinning it to the
sha256 of the previous event's canonical JSON bytes. Together with
the per-emit tail-hash update this forms a chain over the audit
stream — any post-hoc alteration (changed field, added byte, dropped
line) breaks the chain at the point of tampering, and the new
`forge audit verify` CLI walks a captured stream to report the
break.

This satisfies governance requirement R5 (tamper-evident receipts)
in the strict reading. R6 (per-event Ed25519 signing) layers on top
in issue initializ#213.

## Changes

- `forge-core/runtime/audit.go`
  - New `AuditChainGenesis` constant (32 zero bytes, hex-encoded)
  - `AuditEvent.PrevHash` field (json:"prev_hash", NOT omitempty —
    absence is itself a tampering signal)
  - `AuditLogger.lastHash` state, mu-guarded
  - `Emit` now serializes chain-mint + marshal + tail-update +
    sink-write under a single mutex so concurrent emits produce a
    strictly-ordered chain

- `forge-core/runtime/audit_verify.go` (new)
  - `VerifyAuditLog(io.Reader) (VerifyResult, error)` — walks an
    NDJSON stream, recomputes each event's canonical hash,
    reports the first break with expected/actual prev_hash

- `forge-cli/cmd/audit.go` (new) — `forge audit verify <file>`
  subcommand. Exits 0 on clean, non-zero with a report on
  tampering. Reads from stdin when file is "-".

- `docs/security/audit-tamper-evidence.md` — operator guide

## Tests

`forge-core/runtime/audit_hash_chain_test.go`:
- `TestHashChain_GenesisAndProgression` — first event carries genesis;
  each subsequent event's prev_hash equals sha256 of the previous
  event's canonical JSON
- `TestHashChain_VerifyWalksCleanly` — 20-event stream round-trips
  cleanly through the verifier
- `TestHashChain_TamperingDetected` — altering event #2's contents
  makes the verifier flag line 3 (successor whose prev_hash no
  longer matches the tampered event's recomputed hash)
- `TestHashChain_DeletionDetected` — dropping event #2 flags line 2
  (the successor now sees line 1's hash as its predecessor's, but
  its own prev_hash pointed at event #2)
- `TestHashChain_ConcurrentEmitsProduceValidChain` — 200 concurrent
  emitters still produce a chain that verifies (mutex covers the
  whole chain+write path)
- `TestHashChain_GenesisConstantShape` — pins the 64-hex-zero wire
  constant so a well-meaning refactor can't change it silently
- `TestHashChain_VerifyDetectsMalformedLine` — non-JSON garbage is
  reported cleanly, not a panic
- `TestHashChain_PrevHashAlwaysWritten` — pins the "no omitempty"
  choice

`forge-cli/cmd/audit_test.go`:
- `TestAuditVerify_CleanStreamExitsZero` — end-to-end OK path
- `TestAuditVerify_TamperedStreamReports` — end-to-end fail path;
  stdout contains "TAMPERING DETECTED" + line number + hashes
- `TestAuditVerify_UnreadablePath` — missing file returns OS-shaped
  error, not a crash

## Compatibility

The stream shape is backward-compatible for consumers that ignore
unknown fields. `AuditSchemaVersion` is unchanged (this is an
additive optional field per the documented schema policy).
Existing tests in `forge-core/runtime/` continue to pass without
modification.

## Effort

1 day as estimated in the R5 issue.

@initializ-mk initializ-mk left a comment

Contributor

Read the full diff and cross-checked it against #220 and the base audit.go. This is a well-tested feature, but it has two blocking problems — one is the same deadlock I flagged on #220, and the other is that #222 and #220 collide and can't both merge as-is.

🔴 Blocking #1 — self-deadlock in Emit (same bug as #220)

Emit now takes a.mu.Lock(); defer a.mu.Unlock() at the top and runs the sink-write loop while holding the lock. On a sink error it calls a.logSinkErrorOnce(...), which itself does a.mu.Lock() (audit.go:584). Go's sync.Mutex isn't reentrant → self-deadlock: the emitting goroutine hangs while holding a.mu, which then hangs every other Emit/config call.

  • Trigger: any sink Write error with an ops logger configured (SetOpsLogger).
  • Why tests miss it: the test sinks (buffer/file) never error and opsLog is nil.

Here it's trickier than #220 because you have a legitimate reason to hold the lock across the writes — chain order must equal on-disk order — so the fix isn't "move writes outside the lock." Instead, don't call the re-locking logSinkErrorOnce from inside the held section: either add a logSinkErrorOnceLocked variant that assumes the caller holds a.mu, or collect the errored sinks during the locked loop and log them after an explicit Unlock() (not via defer). Ordering is preserved either way; only the error logging moves.

🔴 Blocking #2: #222 and #220 cannot both merge as-is

#220 (signing, #213) and this PR create the same files with incompatible contents:

  • forge-core/runtime/audit_verify.go — both define VerifyAuditLog/VerifyResult, with different signatures: #220 is VerifyAuditLog(r, opts VerifyOptions) with FirstBadLine/SigChecked; this PR is VerifyAuditLog(r) with FirstTamperedLine/GenesisSeen/ExpectedPrevHash. Same symbol, same package — the second to merge won't compile.
  • forge-cli/cmd/audit.go — both define auditCmd/auditVerifyCmd/auditVerifyRun.
  • forge-core/runtime/audit.go — both add fields to AuditEvent and both restructure Emit's locking, in conflicting ways.

The two features are explicitly meant to compose (this PR's docs say "R6 layers Ed25519 signatures on top"; #220's say "combine signing + hash chaining"), but they're built as independent rewrites of the same code. They need to be reconciled into: one AuditEvent (prev_hash + kid + sig), one VerifyAuditLog that checks both chain and signatures, one forge audit verify, and one Emit locking/sequencing strategy. Sequencing matters too — the signature must cover prev_hash, and the tail hash must be computed over the final signed bytes. Please coordinate before either merges (or land one, then rebase the other onto it as a real integration rather than a re-add). Posting a linked note on #220 as well.

🟠 Important — re-marshal hashing has a large-integer precision hole

The verifier deliberately re-marshals the parsed event (json.Marshal(evt)) rather than hashing the raw line, to tolerate benign whitespace. That makes verification depend on unmarshal→marshal being a fixed point — which it isn't for Fields map[string]any numbers: JSON numbers decode to float64, so any field value > 2^53 (or a non-round float) re-marshals to different bytes than the producer emitted from its native int64, so an untampered stream fails verification. Today's audit fields are all small ints, so it doesn't bite yet, but it's a latent landmine (a nanosecond epoch or large ID in a field would trip it).

Worth reconsidering: for a tamper-evidence tool, hashing the raw written bytes (line minus trailing newline) is both simpler and strictly safer — it catches every byte change (including whitespace, which for tamper-evidence you want to catch) and sidesteps the precision issue. The producer already controls the canonical bytes, so the verifier doesn't need to reconstruct them. If you keep re-marshaling, document the "no field integers > 2^53" constraint.

🟡 Minor

  • Doc comment contradicts code: VerifyAuditLog's comment says it "does NOT stop at the first parse error — it keeps reading," but the code returns immediately on the first malformed line. Fix the comment (or the behavior).
  • Cross-language portability overstated: the docs invite reimplementing the walk "in any language (sha256 + json.Marshal-equivalent canonical form)." Go's json.Marshal isn't a standardized canonical form (struct field order, HTML-escaping of < > &, map-key sorting), so a non-Go verifier reproducing it is non-trivial. Same caveat as #220; hashing raw bytes (above) largely removes it.
  • Head-truncation is only a soft stderr note (GenesisSeen=false still returns OK). That's an inherent hash-chain limitation and it's honestly disclosed — just confirming it's intentional.
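On the portability point, two of the Go-specific behaviors are quick to see in a small sketch. `marshalString` is a hypothetical helper, not part of the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalString returns json.Marshal's output as a string, or "" on error.
func marshalString(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Map keys come out sorted, and < and & are escaped by default.
	fmt.Println(marshalString(map[string]string{"b": "x<y", "a": "1&2"}))
	// prints {"a":"1\u00262","b":"x\u003cy"}
}
```

A verifier in another language would have to reproduce both behaviors (plus struct field order) to arrive at the same bytes, which is the difficulty the bullet describes.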

What's good

The concurrency design is sound and its rationale (writes must land in chain order) is correct and well-tested with 200 goroutines. prev_hash without omitempty is the right call and is pinned by a test. The "what it does NOT buy you" section (confidentiality, availability, non-repudiation-without-R6) is honest and accurate.

Verdict

Request changes — the deadlock must be fixed, and the #220/#222 collision needs to be resolved before either lands. The precision / raw-bytes question is worth settling now since it's cheap before merge and awkward after. Once those are addressed this is a solid R5 implementation.

@naveen-kurra

Collaborator Author

Ack on all points. Plan: land #220 first (its review is addressed on the -v2 branch), then rebase this PR onto post-#220 main as a real integration — single AuditEvent (prev_hash + kid + sig), single VerifyAuditLog (chain + sig checks), single forge audit verify CLI, single Emit sequencing where the signature covers prev_hash and the tail hash is computed over the final signed bytes.

Fixes I'll roll into the rebase:

  • Deadlock: chain order requires holding a.mu across the sink-write loop, so I can't move writes out (that's the #220 pattern). Will add logSinkErrorOnceLocked for the caller-holds-lock path, or collect errored sinks under lock and log after an explicit Unlock().
  • Precision hole: dropping the re-marshal, hashing raw line bytes (minus trailing \n) instead. Simpler, strictly safer, and closes the >2^53 landmine cleanly. Verifier no longer needs to reconstruct producer bytes.
  • Comment/behavior mismatch on VerifyAuditLog + portability wording — fixed alongside.
  • Head-truncation soft note: confirming intentional; will surface in the doc.

Pre-staging on the branch now so the integration is ready when #220 merges.

@naveen-kurra

Collaborator Author

Pre-staged the integration at feat/gov-r5-r6-integrated (branch pushed, no PR opened yet — waiting for #220 to merge before promoting).

What's in it:

  1. Unified AuditEvent carries prev_hash (always, no omitempty) + kid + sig (both omitempty). The signature covers prev_hash, so a tamperer who tries to recompute downstream prev_hashes breaks the sig too — proved in TestIntegration_SigCoversPrevHash.

  2. Single Emit under one mutex: chain-mint → sign → marshal → hash → sink write → release. Deadlock fix: since chain order requires holding the lock across sink writes, I collect sink errors under lock and call logSinkErrorOnce after Unlock(). New regression test hangs the pre-fix code and passes with the current impl.

  3. Precision-hole fix as suggested: hash the raw line bytes (writer-authored, minus trailing \n), not a re-marshaled event. Verifier hashes raw bytes read from stream. Closes the >2^53 landmine.

  4. Unified verifier: VerifyAuditLog(r, opts VerifyOptions) walks chain + (when pubkeys supplied) signatures. --skip-chain on the CLI for SIEM tail ingestion. Head-truncation is a soft warning per your intentional-limitation note.

  5. Single CLI forge audit verify <file> [--pubkey <jwks>] [--skip-chain].

  6. Docs merged in audit-tamper-evidence.md (chain), audit-signing.md (sig, unchanged from #220-v2).

Full test sweep + gofmt + golangci-lint clean. Plan: once #220 merges, I'll close this PR (#222) and open the integration branch as the successor R5 PR against post-#220 main. If you'd prefer to review the integration directly against this branch's diff before that, I can open it now as a draft against the r6-v2 branch.

@naveen-kurra

Collaborator Author

Superseded by the integration branch feat/gov-r5-r6-integrated — opening as a fresh PR against post-#220 main. Same code, now with signature covering prev_hash + raw-line-bytes hashing (precision-hole fix) + deadlock-safe error logging under the chain-lock.

Development

Successfully merging this pull request may close these issues.

R5 (governance): add hash-chained audit for tamper evidence

2 participants