fix: comprehensive security audit hardening by kacy · Pull Request #162 · kacy/ember

kacy · 2026-02-16T15:31:49Z

summary

comprehensive security audit across all 6 crates, addressing 45+ findings categorized by impact and effort. each crate has an atomic commit with focused fixes. depends on #161 (VADD_BATCH).

ember-protocol

reject oversized RESP3 bulk strings (cap at 512 MB)
reject aggregate frames with > 1M elements
validate ZADD score is finite (reject NaN/infinity)
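
a minimal sketch of the finite-score and size-cap checks listed above. `parse_score`, `check_bulk_len` and `MAX_BULK_LEN` are illustrative names, not the crate's actual API; only the 512 MB figure and the NaN/infinity rule come from this PR.

```rust
/// Hypothetical constant mirroring the 512 MB bulk-string cap.
const MAX_BULK_LEN: usize = 512 * 1024 * 1024;

/// Parses a ZADD score, rejecting NaN and +/-infinity.
/// Rust's f64 parser accepts "nan" and "inf", so the explicit check matters.
fn parse_score(raw: &str) -> Result<f64, String> {
    let score: f64 = raw
        .trim()
        .parse()
        .map_err(|_| format!("score is not a valid float: {raw:?}"))?;
    if score.is_finite() {
        Ok(score)
    } else {
        Err(format!("score must be finite, got {score}"))
    }
}

/// Rejects a declared bulk-string length before any buffer is allocated.
fn check_bulk_len(declared: usize) -> Result<usize, String> {
    if declared > MAX_BULK_LEN {
        Err(format!("bulk string of {declared} bytes exceeds {MAX_BULK_LEN}"))
    } else {
        Ok(declared)
    }
}
```

checking the declared length before allocating is what keeps a hostile header from reserving memory the payload never fills.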

emberkv-core

replace panicking [] indexing with .get() in keyspace helpers
use checked_mul/saturating_mul for memory estimates
bound SCAN cursor to prevent silent truncation
validate RENAME when src == dst
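
the indexing and overflow items above can be sketched roughly as follows. the helper names (`arg_at`, `estimate_bytes`, `estimate_bytes_saturating`) are assumptions for illustration; the real keyspace helpers are not shown in this PR description.

```rust
/// Returns the argument at `idx`, or an error instead of panicking
/// when the index is out of bounds (the `[]` operator would panic).
fn arg_at<'a>(args: &'a [&'a str], idx: usize) -> Result<&'a str, String> {
    args.get(idx)
        .copied()
        .ok_or_else(|| format!("missing argument at position {idx}"))
}

/// Estimates bytes for `count` entries of `per_entry` bytes each.
/// `checked_mul` returns `None` on overflow rather than wrapping.
fn estimate_bytes(count: usize, per_entry: usize) -> Option<usize> {
    count.checked_mul(per_entry)
}

/// Variant that clamps to `usize::MAX`, which is enough when the estimate
/// only feeds a "does this exceed the limit?" comparison.
fn estimate_bytes_saturating(count: usize, per_entry: usize) -> usize {
    count.saturating_mul(per_entry)
}
```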

ember-server

add pipeline depth limit (10k) to prevent memory exhaustion
add gRPC concurrency limit per connection (256)
add gRPC auth interceptor with constant-time password comparison
all RPCs now require authorization metadata when --requirepass is set
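
to illustrate the constant-time comparison mentioned above: the PR uses `subtle::ConstantTimeEq`, while the sketch below is a hand-rolled stand-in using only std so it is self-contained. `constant_time_eq` and `is_authorized` are hypothetical names, and a production build should prefer the vetted crate, since the compiler gives no timing guarantees for code like this.

```rust
/// Compares two byte strings without exiting early on the first mismatch.
/// Illustrative only; not the implementation used by ember-server.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is not treated as secret here; only the contents are.
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences so every byte is always visited.
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical check an interceptor might run on the `authorization` value.
fn is_authorized(required: Option<&str>, supplied: Option<&str>) -> bool {
    match (required, supplied) {
        (None, _) => true,        // no --requirepass configured
        (Some(_), None) => false, // password required but none sent
        (Some(want), Some(got)) => constant_time_eq(want.as_bytes(), got.as_bytes()),
    }
}
```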

ember-persistence

atomic AOF truncation via write-to-temp-then-rename
validate vector dimensions in AOF replay (prevent oversized allocations)
validate snapshot shard_id matches expected (prevent cross-shard replay)
zero encryption key bytes on drop via write_volatile
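
the write-to-temp-then-rename pattern named above might look like this in outline. `truncate_atomically`, the `.tmp` suffix and the header bytes are assumptions; the actual AOF header format is not described here.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replaces the file at `path` with one containing only `header`.
/// A sketch of the pattern, not ember-persistence's real code.
fn truncate_atomically(path: &Path, header: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(header)?;
        // Flush to disk before the rename makes the new file visible.
        f.sync_all()?;
    }
    // On POSIX systems rename replaces the destination atomically,
    // so readers see either the old file or the new one.
    fs::rename(&tmp, path)
}
```

a crash before the rename leaves the original file untouched plus a stray temp file; a crash after it leaves the fresh file.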

emberkv-cli

filter AUTH commands from readline history
set history file permissions to 0600 on unix
sanitize ANSI escape sequences in server responses to prevent terminal injection
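
a rough sketch of the response sanitization described above, assuming it only needs to drop CSI sequences (ESC `[` ... final byte) and other control characters while keeping tabs and newlines. `sanitize` is a hypothetical name and the CLI's real sanitizer may handle more escape forms.

```rust
/// Strips CSI escape sequences and control characters, keeping '\t' and '\n'.
fn sanitize(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\u{1b}' {
            // ESC '[' starts a CSI sequence; skip through its final byte,
            // which lies in the range 0x40..=0x7E.
            if chars.peek() == Some(&'[') {
                chars.next();
                for p in chars.by_ref() {
                    if ('\u{40}'..='\u{7e}').contains(&p) {
                        break;
                    }
                }
            }
            continue; // the ESC itself is dropped in every case
        }
        if c == '\t' || c == '\n' || !c.is_control() {
            out.push(c);
        }
    }
    out
}
```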

ember-cluster

change SlotRange::new from debug_assert to assert (enforced in release)
validate slot ranges in AssignSlots raft command
validate slot in BeginMigration
require active migration before CompleteMigration
cap encoded collection lengths to prevent silent u16 truncation
reject gossip incarnation values > u64::MAX/2 to prevent refutation poisoning
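
the slot-range checks above can be sketched as below. the struct layout (inclusive `end`) and the `try_new` constructor are assumptions for illustration; only the `assert` switch and the `start <= end, < 16384` rule come from the commit notes.

```rust
/// Total number of hash slots; valid slots are 0..=16383.
const SLOT_COUNT: u16 = 16384;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SlotRange {
    start: u16,
    end: u16, // assumed inclusive
}

impl SlotRange {
    /// Panics in both debug and release builds, unlike `debug_assert!`,
    /// which is compiled out of release builds.
    fn new(start: u16, end: u16) -> Self {
        assert!(start <= end, "slot range start {start} exceeds end {end}");
        assert!(end < SLOT_COUNT, "slot {end} out of range");
        Self { start, end }
    }

    /// Non-panicking variant for untrusted input such as a raft command.
    fn try_new(start: u16, end: u16) -> Result<Self, String> {
        if start > end {
            return Err(format!("start {start} exceeds end {end}"));
        }
        if end >= SLOT_COUNT {
            return Err(format!("slot {end} must be < {SLOT_COUNT}"));
        }
        Ok(Self { start, end })
    }
}
```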

what was tested

cargo test --workspace --features vector,encryption: 351 pass, 1 pre-existing failure in entry_overhead_not_too_small
cargo clippy --workspace -- -D warnings: clean
cargo fmt --all -- --check: clean
added 20+ new test cases across all crates covering validation edge cases

design considerations

atomic AOF truncation: uses write-to-temp-then-rename pattern so a crash mid-truncation leaves either the old or new file intact, never a partial write
incarnation cap: set at u64::MAX / 2 to leave ample headroom for legitimate use while preventing a single malicious gossip message from permanently disabling suspicion refutation
gRPC auth: uses subtle::ConstantTimeEq to prevent timing side-channels on password comparison
ANSI sanitization: strips CSI sequences and control characters while preserving tabs/newlines for legitimate multiline output (e.g., INFO command)
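
to make the incarnation cap above concrete, here is a small sketch. the "higher incarnation wins" ordering is the usual SWIM-style convention and is an assumption about ember-cluster; `accept_update` and `refute` are hypothetical names.

```rust
/// Upper bound on accepted incarnation numbers, per the design note.
const MAX_INCARNATION: u64 = u64::MAX / 2;

/// Returns true when a gossip update should be applied to local state.
/// Values above the cap are rejected outright.
fn accept_update(local: u64, incoming: u64) -> bool {
    incoming <= MAX_INCARNATION && incoming > local
}

/// A node refutes suspicion by bumping its incarnation past the rumor.
/// Returns `None` when no higher in-range value exists.
fn refute(suspected_at: u64) -> Option<u64> {
    suspected_at.checked_add(1).filter(|n| *n <= MAX_INCARNATION)
}
```

without the cap, a single message claiming `u64::MAX` would leave no larger value for the target node to refute with.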

* feat: add VADD_BATCH command parsing to protocol layer

  add VADD_BATCH command that accepts multiple vectors in a single command to reduce per-vector round-trip overhead for bulk inserts.

  RESP3 syntax: VADD_BATCH key DIM n elem1 f32... elem2 f32... [opts]

  the DIM keyword is required so the parser knows where each vector ends and the next element name begins. max batch size is 10,000.

* feat: add VADD_BATCH to core engine and refactor AOF recording

  - add VAddBatchResult struct and vadd_batch() method to keyspace with upfront NaN/inf validation and single memory check for the entire batch
  - add ShardRequest::VAddBatch and ShardResponse::VAddBatchResult
  - refactor to_aof_record → to_aof_records (returns Vec<AofRecord>) so VADD_BATCH can expand each applied vector into its own AofRecord::VAdd; no new AOF format needed

* feat: add VADD_BATCH dispatch to connection handler

  wire up VADD_BATCH through both sharded and concurrent mode code paths in connection.rs. response returns integer count of newly added elements, matching VADD's pattern.

* feat: add VADD_BATCH gRPC RPC and regenerate client stubs

  - add VAddBatchEntry and VAddBatchRequest proto messages
  - add VAddBatch RPC returning IntResponse (count of added elements)
  - add to PipelineRequest oneof (field 72)
  - implement v_add_batch handler in grpc.rs with validation
  - regenerate go and python proto stubs

* feat: update python client and benchmarks to use VADD_BATCH

  - add vadd_batch() method to python gRPC client
  - update bench-vector.py to send batches via VADD_BATCH (RESP) and vadd_batch (gRPC) instead of individual VADD calls
  - update bench-memory.sh vector helper to use VADD_BATCH
  - bump command count to 107 in README and bench README

* fix: install base deps into venv when grpc benchmarks are requested

  when system python has the base deps but grpc mode forces a venv, the venv was missing numpy/redis. now installs all required deps alongside ember-py.

* fix: correct import path in generated python grpc stubs

  the protoc codegen produces `from ember.v1 import` but the package layout needs `from ember.proto.ember.v1 import`. the Makefile has a sed fixup for this but manual regen skipped it.

* docs: update vector benchmark results with VADD_BATCH numbers

  benchmarked on GCP c2-standard-8. VADD_BATCH improves insert throughput: RESP 963 → 1,483 vec/s (+54%), gRPC 1,009 → 2,374 vec/s (+135%). query throughput unchanged as expected.

* style: fix formatting in VADD_BATCH protocol tests

* docs: comprehensive security audit report

  audit across all 6 crates: ember-protocol, emberkv-core, ember-server, ember-persistence, ember-cluster, emberkv-cli. 60 findings total (6 critical, 16 high, 30 medium, 28 low). categorized by impact x effort with estimated LOC for each fix.

* fix: harden ember-protocol against input validation and resource issues

  - reject NaN/infinity in VADD, VADD_BATCH, and VSIM vector components
  - fix VADD_BATCH off-by-one: >= instead of > for batch size check
  - fix VADD_BATCH element name collision with flags by using dim-based entry detection instead of flag-name matching
  - cap SCAN COUNT at 10M to prevent unbounded allocation hints
  - add depth limit to Frame::serialize() to prevent stack overflow on deeply nested response frames
  - cap Vec pre-allocation in parser to 1024 entries to limit memory amplification from large array/map headers

* fix: harden emberkv-core against panics, overflows, and data loss

  - replace expect() in track_size with Option return to prevent shard panics on invariant violations
  - fix effective_limit overflow: use u128 intermediate to handle large max_bytes without precision loss
  - clamp TTL u64→i64 cast to i64::MAX in iter_entries to prevent snapshot corruption on extreme TTLs
  - assert shard_count <= u16::MAX in engine to prevent shard_id truncation
  - add PartialBatch error variant to VectorWriteError so partially applied VADD_BATCH vectors are persisted to AOF instead of lost

* fix: harden ember-server against auth bypass, pipeline abuse, and concurrency

  - add gRPC authentication interceptor using constant-time comparison against requirepass. clients must send an `authorization` metadata header.
  - add MAX_PIPELINE_DEPTH (10,000) to cap frames parsed per read in both sharded and concurrent connection handlers, preventing unbounded memory growth from huge pipelines.
  - add concurrency_limit_per_connection(256) to tonic server builder in both run() and run_concurrent() to bound per-connection request load.
  - subscription and pattern length limits added to gRPC subscribe handler (from earlier work, included in this commit).

* fix: harden ember-persistence against crash-unsafe truncation and validation gaps

  - make AOF truncation crash-safe by writing fresh header to a temp file then atomically renaming over the original, preventing data loss if the process crashes mid-truncation.
  - validate snapshot shard_id during recovery to detect misplaced or swapped snapshot files between shards.
  - add VADD dimension validation in read_payload_for_tag (encrypted AOF path) to match the existing check in from_bytes, preventing a crafted AOF from triggering a 16GB read loop.
  - zero encryption key bytes on drop using volatile writes to prevent key material from lingering in freed memory.

* fix: harden emberkv-cli against credential leaks and terminal injection

  - filter AUTH commands from REPL history to prevent plaintext password storage in ~/.emberkv_history.
  - set history file permissions to 0600 (owner read/write only) after saving, regardless of the process umask.
  - sanitize server-supplied strings by stripping ANSI escape sequences and control characters before terminal output, preventing malicious servers from manipulating the user's terminal display.

* fix: harden ember-cluster against state machine abuse and gossip poisoning

  - change SlotRange::new from debug_assert to assert (enforced in release)
  - validate slot ranges in AssignSlots raft command (start <= end, < 16384)
  - validate slot in BeginMigration (must be < 16384)
  - require active migration before CompleteMigration succeeds
  - cap encoded collection lengths to prevent silent u16 truncation
  - reject gossip updates with incarnation > u64::MAX/2 to prevent refutation poisoning

* remove security audit report from branch
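
two of the smaller hardening items in the commit notes above, the capped parser pre-allocation and the TTL clamp, can be sketched as follows. `MAX_PREALLOC`, `prealloc_for` and `clamp_ttl` are illustrative names and the `u64` element type is a placeholder; only the 1024 cap and the clamp to i64::MAX come from the notes.

```rust
/// Hypothetical constant matching the 1024-entry pre-allocation cap.
const MAX_PREALLOC: usize = 1024;

/// Reserves space for at most `MAX_PREALLOC` elements up front. The vector
/// still grows normally if the frame really contains more, so a large
/// declared length alone cannot force a large allocation.
fn prealloc_for(declared_len: usize) -> Vec<u64> {
    Vec::with_capacity(declared_len.min(MAX_PREALLOC))
}

/// Converts a u64 TTL to i64, clamping to `i64::MAX` instead of letting
/// an `as` cast wrap to a negative value.
fn clamp_ttl(ttl_ms: u64) -> i64 {
    ttl_ms.min(i64::MAX as u64) as i64
}
```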

kacy added 18 commits February 16, 2026 07:24

feat: add VADD_BATCH dispatch to connection handler

543d550

fix: install base deps into venv when grpc benchmarks are requested

cb882a4

fix: correct import path in generated python grpc stubs

3abd604

docs: update vector benchmark results with VADD_BATCH numbers

019bf48

style: fix formatting in VADD_BATCH protocol tests

b826ff9

docs: comprehensive security audit report

b4bcefd

remove security audit report from branch

225b797

kacy merged commit 7c4c972 into main Feb 16, 2026
7 checks passed

kacy deleted the security-audit-2026-02 branch February 16, 2026 16:05
