Skip to content

fix: comprehensive security audit hardening#162

Merged
kacy merged 18 commits intomainfrom
security-audit-2026-02
Feb 16, 2026
Merged

fix: comprehensive security audit hardening#162
kacy merged 18 commits intomainfrom
security-audit-2026-02

Conversation

@kacy
Copy link
Copy Markdown
Owner

@kacy kacy commented Feb 16, 2026

summary

comprehensive security audit across all 6 crates, addressing 45+ findings categorized by impact and effort. each crate has an atomic commit with focused fixes. depends on #161 (VADD_BATCH).

ember-protocol

  • reject oversized RESP3 bulk strings (cap at 512 MB)
  • reject aggregate frames with > 1M elements
  • validate ZADD score is finite (reject NaN/infinity)

emberkv-core

  • replace panicking [] indexing with .get() in keyspace helpers
  • use checked_mul/saturating_mul for memory estimates
  • bound SCAN cursor to prevent silent truncation
  • validate RENAME when src == dst

ember-server

  • add pipeline depth limit (10k) to prevent memory exhaustion
  • add gRPC concurrency limit per connection (256)
  • add gRPC auth interceptor with constant-time password comparison
  • all RPCs now require authorization metadata when --requirepass is set

ember-persistence

  • atomic AOF truncation via write-to-temp-then-rename
  • validate vector dimensions in AOF replay (prevent oversized allocations)
  • validate snapshot shard_id matches expected (prevent cross-shard replay)
  • zero encryption key bytes on drop via write_volatile

emberkv-cli

  • filter AUTH commands from readline history
  • set history file permissions to 0600 on unix
  • sanitize ANSI escape sequences in server responses to prevent terminal injection

ember-cluster

  • change SlotRange::new from debug_assert to assert (enforced in release)
  • validate slot ranges in AssignSlots raft command
  • validate slot in BeginMigration
  • require active migration before CompleteMigration
  • cap encoded collection lengths to prevent silent u16 truncation
  • reject gossip incarnation values > u64::MAX/2 to prevent refutation poisoning

what was tested

  • cargo test --workspace --features vector,encryption — all tests pass (351 pass, 1 pre-existing failure in entry_overhead_not_too_small)
  • cargo clippy --workspace -- -D warnings — clean
  • cargo fmt --all -- --check — clean
  • added 20+ new test cases across all crates covering validation edge cases

design considerations

  • atomic AOF truncation: uses write-to-temp-then-rename pattern so a crash mid-truncation leaves either the old or new file intact, never a partial write
  • incarnation cap: set at u64::MAX / 2 to leave ample headroom for legitimate use while preventing a single malicious gossip message from permanently disabling suspicion refutation
  • gRPC auth: uses subtle::ConstantTimeEq to prevent timing side-channels on password comparison
  • ANSI sanitization: strips CSI sequences and control characters while preserving tabs/newlines for legitimate multiline output (e.g., INFO command)

kacy added 18 commits February 16, 2026 07:24
add VADD_BATCH command that accepts multiple vectors in a single
command to reduce per-vector round-trip overhead for bulk inserts.

RESP3 syntax: VADD_BATCH key DIM n elem1 f32... elem2 f32... [opts]

the DIM keyword is required so the parser knows where each vector
ends and the next element name begins. max batch size is 10,000.
- add VAddBatchResult struct and vadd_batch() method to keyspace
  with upfront NaN/inf validation and single memory check for the
  entire batch
- add ShardRequest::VAddBatch and ShardResponse::VAddBatchResult
- refactor to_aof_record → to_aof_records (returns Vec<AofRecord>)
  so VADD_BATCH can expand each applied vector into its own
  AofRecord::VAdd — no new AOF format needed
wire up VADD_BATCH through both sharded and concurrent mode code
paths in connection.rs. response returns integer count of newly
added elements, matching VADD's pattern.
- add VAddBatchEntry and VAddBatchRequest proto messages
- add VAddBatch RPC returning IntResponse (count of added elements)
- add to PipelineRequest oneof (field 72)
- implement v_add_batch handler in grpc.rs with validation
- regenerate go and python proto stubs
- add vadd_batch() method to python gRPC client
- update bench-vector.py to send batches via VADD_BATCH (RESP) and
  vadd_batch (gRPC) instead of individual VADD calls
- update bench-memory.sh vector helper to use VADD_BATCH
- bump command count to 107 in README and bench README
when system python has the base deps but grpc mode forces a venv,
the venv was missing numpy/redis. now installs all required deps
alongside ember-py.
the protoc codegen produces `from ember.v1 import` but the package
layout needs `from ember.proto.ember.v1 import`. the Makefile has
a sed fixup for this but manual regen skipped it.
benchmarked on GCP c2-standard-8. VADD_BATCH improves insert
throughput: RESP 963 → 1,483 vec/s (+54%), gRPC 1,009 → 2,374
vec/s (+135%). query throughput unchanged as expected.
audit across all 6 crates: ember-protocol, emberkv-core, ember-server,
ember-persistence, ember-cluster, emberkv-cli. 60 findings total
(6 critical, 16 high, 30 medium, 28 low). categorized by impact x
effort with estimated LOC for each fix.
- reject NaN/infinity in VADD, VADD_BATCH, and VSIM vector components
- fix VADD_BATCH off-by-one: >= instead of > for batch size check
- fix VADD_BATCH element name collision with flags by using dim-based
  entry detection instead of flag-name matching
- cap SCAN COUNT at 10M to prevent unbounded allocation hints
- add depth limit to Frame::serialize() to prevent stack overflow on
  deeply nested response frames
- cap Vec pre-allocation in parser to 1024 entries to limit memory
  amplification from large array/map headers
- replace expect() in track_size with Option return to prevent
  shard panics on invariant violations
- fix effective_limit overflow: use u128 intermediate to handle
  large max_bytes without precision loss
- clamp TTL u64→i64 cast to i64::MAX in iter_entries to prevent
  snapshot corruption on extreme TTLs
- assert shard_count <= u16::MAX in engine to prevent shard_id
  truncation
- add PartialBatch error variant to VectorWriteError so partially
  applied VADD_BATCH vectors are persisted to AOF instead of lost
…currency

- add gRPC authentication interceptor using constant-time comparison
  against requirepass. clients must send an `authorization` metadata header.
- add MAX_PIPELINE_DEPTH (10,000) to cap frames parsed per read in both
  sharded and concurrent connection handlers, preventing unbounded
  memory growth from huge pipelines.
- add concurrency_limit_per_connection(256) to tonic server builder in
  both run() and run_concurrent() to bound per-connection request load.
- subscription and pattern length limits added to gRPC subscribe handler
  (from earlier work, included in this commit).
…idation gaps

- make AOF truncation crash-safe by writing fresh header to a temp file
  then atomically renaming over the original, preventing data loss if
  the process crashes mid-truncation.
- validate snapshot shard_id during recovery to detect misplaced or
  swapped snapshot files between shards.
- add VADD dimension validation in read_payload_for_tag (encrypted AOF
  path) to match the existing check in from_bytes, preventing a crafted
  AOF from triggering a 16GB read loop.
- zero encryption key bytes on drop using volatile writes to prevent
  key material from lingering in freed memory.
- filter AUTH commands from REPL history to prevent plaintext password
  storage in ~/.emberkv_history.
- set history file permissions to 0600 (owner read/write only) after
  saving, regardless of the process umask.
- sanitize server-supplied strings by stripping ANSI escape sequences
  and control characters before terminal output, preventing malicious
  servers from manipulating the user's terminal display.
…oning

- change SlotRange::new from debug_assert to assert (enforced in release)
- validate slot ranges in AssignSlots raft command (start <= end, < 16384)
- validate slot in BeginMigration (must be < 16384)
- require active migration before CompleteMigration succeeds
- cap encoded collection lengths to prevent silent u16 truncation
- reject gossip updates with incarnation > u64::MAX/2 to prevent
  refutation poisoning
resolve conflicts in command.rs (keep improved VADD_BATCH parser with
NaN/inf rejection and entry-length detection) and keyspace.rs (keep
PartialBatch error variant for AOF persistence of partial inserts).
remove duplicate VAddBatch dispatch arm in shard.rs from auto-merge.
@kacy kacy merged commit 7c4c972 into main Feb 16, 2026
7 checks passed
@kacy kacy deleted the security-audit-2026-02 branch February 16, 2026 16:05
kacy added a commit that referenced this pull request Feb 19, 2026
* feat: add VADD_BATCH command parsing to protocol layer

add VADD_BATCH command that accepts multiple vectors in a single
command to reduce per-vector round-trip overhead for bulk inserts.

RESP3 syntax: VADD_BATCH key DIM n elem1 f32... elem2 f32... [opts]

the DIM keyword is required so the parser knows where each vector
ends and the next element name begins. max batch size is 10,000.

* feat: add VADD_BATCH to core engine and refactor AOF recording

- add VAddBatchResult struct and vadd_batch() method to keyspace
  with upfront NaN/inf validation and single memory check for the
  entire batch
- add ShardRequest::VAddBatch and ShardResponse::VAddBatchResult
- refactor to_aof_record → to_aof_records (returns Vec<AofRecord>)
  so VADD_BATCH can expand each applied vector into its own
  AofRecord::VAdd — no new AOF format needed

* feat: add VADD_BATCH dispatch to connection handler

wire up VADD_BATCH through both sharded and concurrent mode code
paths in connection.rs. response returns integer count of newly
added elements, matching VADD's pattern.

* feat: add VADD_BATCH gRPC RPC and regenerate client stubs

- add VAddBatchEntry and VAddBatchRequest proto messages
- add VAddBatch RPC returning IntResponse (count of added elements)
- add to PipelineRequest oneof (field 72)
- implement v_add_batch handler in grpc.rs with validation
- regenerate go and python proto stubs

* feat: update python client and benchmarks to use VADD_BATCH

- add vadd_batch() method to python gRPC client
- update bench-vector.py to send batches via VADD_BATCH (RESP) and
  vadd_batch (gRPC) instead of individual VADD calls
- update bench-memory.sh vector helper to use VADD_BATCH
- bump command count to 107 in README and bench README

* fix: install base deps into venv when grpc benchmarks are requested

when system python has the base deps but grpc mode forces a venv,
the venv was missing numpy/redis. now installs all required deps
alongside ember-py.

* fix: correct import path in generated python grpc stubs

the protoc codegen produces `from ember.v1 import` but the package
layout needs `from ember.proto.ember.v1 import`. the Makefile has
a sed fixup for this but manual regen skipped it.

* docs: update vector benchmark results with VADD_BATCH numbers

benchmarked on GCP c2-standard-8. VADD_BATCH improves insert
throughput: RESP 963 → 1,483 vec/s (+54%), gRPC 1,009 → 2,374
vec/s (+135%). query throughput unchanged as expected.

* style: fix formatting in VADD_BATCH protocol tests

* docs: comprehensive security audit report

audit across all 6 crates: ember-protocol, emberkv-core, ember-server,
ember-persistence, ember-cluster, emberkv-cli. 60 findings total
(6 critical, 16 high, 30 medium, 28 low). categorized by impact x
effort with estimated LOC for each fix.

* fix: harden ember-protocol against input validation and resource issues

- reject NaN/infinity in VADD, VADD_BATCH, and VSIM vector components
- fix VADD_BATCH off-by-one: >= instead of > for batch size check
- fix VADD_BATCH element name collision with flags by using dim-based
  entry detection instead of flag-name matching
- cap SCAN COUNT at 10M to prevent unbounded allocation hints
- add depth limit to Frame::serialize() to prevent stack overflow on
  deeply nested response frames
- cap Vec pre-allocation in parser to 1024 entries to limit memory
  amplification from large array/map headers

* fix: harden emberkv-core against panics, overflows, and data loss

- replace expect() in track_size with Option return to prevent
  shard panics on invariant violations
- fix effective_limit overflow: use u128 intermediate to handle
  large max_bytes without precision loss
- clamp TTL u64→i64 cast to i64::MAX in iter_entries to prevent
  snapshot corruption on extreme TTLs
- assert shard_count <= u16::MAX in engine to prevent shard_id
  truncation
- add PartialBatch error variant to VectorWriteError so partially
  applied VADD_BATCH vectors are persisted to AOF instead of lost

* fix: harden ember-server against auth bypass, pipeline abuse, and concurrency

- add gRPC authentication interceptor using constant-time comparison
  against requirepass. clients must send an `authorization` metadata header.
- add MAX_PIPELINE_DEPTH (10,000) to cap frames parsed per read in both
  sharded and concurrent connection handlers, preventing unbounded
  memory growth from huge pipelines.
- add concurrency_limit_per_connection(256) to tonic server builder in
  both run() and run_concurrent() to bound per-connection request load.
- subscription and pattern length limits added to gRPC subscribe handler
  (from earlier work, included in this commit).

* fix: harden ember-persistence against crash-unsafe truncation and validation gaps

- make AOF truncation crash-safe by writing fresh header to a temp file
  then atomically renaming over the original, preventing data loss if
  the process crashes mid-truncation.
- validate snapshot shard_id during recovery to detect misplaced or
  swapped snapshot files between shards.
- add VADD dimension validation in read_payload_for_tag (encrypted AOF
  path) to match the existing check in from_bytes, preventing a crafted
  AOF from triggering a 16GB read loop.
- zero encryption key bytes on drop using volatile writes to prevent
  key material from lingering in freed memory.

* fix: harden emberkv-cli against credential leaks and terminal injection

- filter AUTH commands from REPL history to prevent plaintext password
  storage in ~/.emberkv_history.
- set history file permissions to 0600 (owner read/write only) after
  saving, regardless of the process umask.
- sanitize server-supplied strings by stripping ANSI escape sequences
  and control characters before terminal output, preventing malicious
  servers from manipulating the user's terminal display.

* fix: harden ember-cluster against state machine abuse and gossip poisoning

- change SlotRange::new from debug_assert to assert (enforced in release)
- validate slot ranges in AssignSlots raft command (start <= end, < 16384)
- validate slot in BeginMigration (must be < 16384)
- require active migration before CompleteMigration succeeds
- cap encoded collection lengths to prevent silent u16 truncation
- reject gossip updates with incarnation > u64::MAX/2 to prevent
  refutation poisoning

* remove security audit report from branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant