Harden embedded etcd against freelist bloat#733
Conversation
We hit a production outage where etcd's BoltDB file grew to 627 MB with only 53 MB of live data. Compaction was running (periodic/1h), but compaction just marks old revisions as deleted — BoltDB never actually releases pages without an explicit defrag. The freelist grew so large that write operations stalled. Two things here: switch the BoltDB freelist type from `array` (O(n) page allocation) to `map` (O(1)), and add a maintenance loop that watches db size vs live data and triggers defrag automatically when the bloat ratio exceeds 2x. The loop also logs health metrics every 5 minutes at appropriate severity levels so we have visibility into db growth over time. Closes MIR-968
📝 WalkthroughWalkthroughThe etcd component now persists a Comment |
The etcd container spec (env vars, args) is baked in at creation time. When miren upgrades and restarts, Start() would find the existing container and reuse it, meaning new env vars like the freelist type change would never take effect. Add a ConfigVersion to the persisted etcd state. When the version doesn't match, the container is torn down and recreated with the current spec. This generalizes the existing TLS mismatch detection so any future container spec change just needs a version bump.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
components/etcd/maintenance.go (1)
115-119: Keepbloat_rationumeric in the structured log.Formatting it with
fmt.Sprintf("%.2f", bloatRatio)turns the field into a string, which makes numeric filtering and aggregation harder in log backends.Possible cleanup
attrs := []any{ "db_size_bytes", dbSize, "db_size_in_use_bytes", dbSizeInUse, - "bloat_ratio", fmt.Sprintf("%.2f", bloatRatio), + "bloat_ratio", bloatRatio, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/etcd/maintenance.go` around lines 115 - 119, The structured log currently converts bloat_ratio to a string via fmt.Sprintf("%.2f", bloatRatio) which prevents numeric filtering—change the attrs entry to pass bloatRatio as a numeric value (e.g. bloatRatio or a rounded numeric value using math.Round(bloatRatio*100)/100) instead of fmt.Sprintf; update the attrs slice declaration that contains "db_size_bytes", dbSize, "db_size_in_use_bytes", dbSizeInUse, "bloat_ratio", ... to use the numeric bloatRatio, and remove the now-unused fmt import if applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/etcd/etcd.go`:
- Around line 254-255: The maintenance loop (e.StartMaintenanceLoop(ctx)) is
only started on fresh container creation, so containers that hit
restartExistingContainer() return early and never get health-check/auto-defrag;
update restartExistingContainer() to invoke e.StartMaintenanceLoop(ctx) on the
reuse/restart path as well (place the call before any early returns or at the
end of restartExistingContainer() so the loop is started for reused containers),
and ensure StartMaintenanceLoop is idempotent or guarded so calling it twice is
safe.
---
Nitpick comments:
In `@components/etcd/maintenance.go`:
- Around line 115-119: The structured log currently converts bloat_ratio to a
string via fmt.Sprintf("%.2f", bloatRatio) which prevents numeric
filtering—change the attrs entry to pass bloatRatio as a numeric value (e.g.
bloatRatio or a rounded numeric value using math.Round(bloatRatio*100)/100)
instead of fmt.Sprintf; update the attrs slice declaration that contains
"db_size_bytes", dbSize, "db_size_in_use_bytes", dbSizeInUse, "bloat_ratio", ...
to use the numeric bloatRatio, and remove the now-unused fmt import if
applicable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 71c63952-9098-4fb0-ba77-9d97441e5c7a
📒 Files selected for processing (2)
components/etcd/etcd.gocomponents/etcd/maintenance.go
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/etcd/etcd.go (1)
188-191:⚠️ Potential issue | 🟠 MajorStart maintenance loop on successful restart/reuse too.
A successful restart returns at Line 190, so the maintenance loop at Lines 259-260 is skipped. Reused containers then miss health logging and auto-defrag.
💡 Proposed fix
err = e.restartExistingContainer(ctx, existingContainer, config) if err == nil { + // Ensure maintenance runs for reuse/restart path as well. + e.StartMaintenanceLoop(ctx) return nil }Also applies to: 259-260
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/etcd/etcd.go` around lines 188 - 191, The code returns immediately after a successful call to restartExistingContainer, which skips starting the maintenance loop and causes reused containers to miss health logging and auto-defrag; change the flow so that after a successful restartExistingContainer(ctx, existingContainer, config) (i.e., when err == nil) you do not return but instead proceed to start the maintenance loop (the same logic currently at Lines 259-260), ensuring the maintenance goroutine/loop (the maintenance/health/auto-defrag routine referenced in this file) is invoked for restarted/reused containers as well. Locate the restartExistingContainer call and remove the early return on success, or move the maintenance loop startup into a shared code path executed on both restart and fresh container creation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/etcd/etcd.go`:
- Around line 162-166: When prevState is nil we must force a recreate to avoid
reusing containers with stale env/args; update the conditional in the block that
checks prevState (the logic setting configChanged) so that configChanged is set
to true whenever prevState == nil (not only when requestedTLSEnabled is true).
Locate the prevState check and change the assignment of configChanged variable
(and adjust the comment) so a missing persisted state always triggers a
recreate, ensuring migration of config/version changes (look for symbols
prevState, configChanged, requestedTLSEnabled).
---
Duplicate comments:
In `@components/etcd/etcd.go`:
- Around line 188-191: The code returns immediately after a successful call to
restartExistingContainer, which skips starting the maintenance loop and causes
reused containers to miss health logging and auto-defrag; change the flow so
that after a successful restartExistingContainer(ctx, existingContainer, config)
(i.e., when err == nil) you do not return but instead proceed to start the
maintenance loop (the same logic currently at Lines 259-260), ensuring the
maintenance goroutine/loop (the maintenance/health/auto-defrag routine
referenced in this file) is invoked for restarted/reused containers as well.
Locate the restartExistingContainer call and remove the early return on success,
or move the maintenance loop startup into a shared code path executed on both
restart and fresh container creation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f76399aa-ecfe-44f9-a005-8033c5f63270
📒 Files selected for processing (1)
components/etcd/etcd.go
…oat_ratio Start the maintenance loop when reusing an existing container via restartExistingContainer, not just on fresh creation. Without this, restarted containers would never get health logging or auto-defrag. Also pass bloat_ratio as a numeric value in structured logs instead of formatting it as a string, so log backends can filter and aggregate on it.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
components/etcd/maintenance.go (2)
27-29: Consider guarding against multiple concurrent maintenance loops.
StartMaintenanceLoopspawns a new goroutine each time it's called. IfStart()were ever called twice (e.g., due to a bug or future refactoring), this would create duplicate maintenance loops leading to redundant status checks, log spam, and potentially concurrent defrag attempts.A simple guard using
sync.Onceor an atomic flag would make this idempotent.♻️ Proposed fix using sync.Once
Add a field to
EtcdComponentinetcd.go:type EtcdComponent struct { *base.BaseComponent clientPort int peerPort int tlsEnabled bool config EtcdConfig maintenanceOnce sync.Once // guards StartMaintenanceLoop }Then in
maintenance.go:func (e *EtcdComponent) StartMaintenanceLoop(ctx context.Context) { - go e.maintenanceLoop(ctx) + e.maintenanceOnce.Do(func() { + go e.maintenanceLoop(ctx) + }) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/etcd/maintenance.go` around lines 27 - 29, StartMaintenanceLoop currently unconditionally starts a new goroutine which can create duplicate maintenance loops; add a sync.Once field (e.g., maintenanceOnce) to the EtcdComponent struct and use it in StartMaintenanceLoop to call maintenanceOnce.Do(func(){ go e.maintenanceLoop(ctx) }) so the maintenanceLoop is only spawned once; reference the EtcdComponent type, the new maintenanceOnce field, the StartMaintenanceLoop method, and the existing maintenanceLoop function when making the change.
95-98: Set TLS minimum version for defense in depth.The TLS config is missing
MinVersion. While etcd 3.5 defaults to TLS 1.2 on the server side, explicitly setting the minimum version hardens the client against downgrade attacks.🔒 Proposed fix to set MinVersion
return &tls.Config{ Certificates: []tls.Certificate{cert}, RootCAs: pool, + MinVersion: tls.VersionTLS12, }, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/etcd/maintenance.go` around lines 95 - 98, The TLS config returned by the function building the client TLS settings (the block that returns &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool,}, nil) lacks an explicit MinVersion; update the tls.Config in that function (the code that constructs and returns *tls.Config in maintenance.go) to set MinVersion: tls.VersionTLS12 to enforce a minimum TLS1.2 and prevent downgrade attacks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@components/etcd/maintenance.go`:
- Around line 27-29: StartMaintenanceLoop currently unconditionally starts a new
goroutine which can create duplicate maintenance loops; add a sync.Once field
(e.g., maintenanceOnce) to the EtcdComponent struct and use it in
StartMaintenanceLoop to call maintenanceOnce.Do(func(){ go
e.maintenanceLoop(ctx) }) so the maintenanceLoop is only spawned once; reference
the EtcdComponent type, the new maintenanceOnce field, the StartMaintenanceLoop
method, and the existing maintenanceLoop function when making the change.
- Around line 95-98: The TLS config returned by the function building the client
TLS settings (the block that returns &tls.Config{Certificates:
[]tls.Certificate{cert}, RootCAs: pool,}, nil) lacks an explicit MinVersion;
update the tls.Config in that function (the code that constructs and returns
*tls.Config in maintenance.go) to set MinVersion: tls.VersionTLS12 to enforce a
minimum TLS1.2 and prevent downgrade attacks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1ddb532-6dd9-47c1-9dfe-4c2991cf78b0
📒 Files selected for processing (2)
components/etcd/etcd.gocomponents/etcd/maintenance.go

We hit a production outage where etcd's BoltDB file ballooned to 627 MB with only 53 MB of live data. Hourly compaction was already configured, but compaction only marks old revisions as deleted. BoltDB never releases those pages back without an explicit defrag call, so the freelist just kept growing until write operations stalled and took down the cluster.
This does two things to prevent a recurrence. First, it switches the BoltDB freelist type from
arraytomap. The array freelist does O(n) page allocation, which is exactly the thing that wedged under pressure. The map freelist uses a hashmap for O(1) lookups. Despite theexperimental-prefix on the flag, this has been stable since etcd 3.4.9 and is widely used in production.Second, it adds a maintenance loop that ticks every 5 minutes, calls the etcd maintenance API to check
DbSizevsDbSizeInUse, and triggers defrag when the bloat ratio exceeds 2x. The same loop logs structured health metrics (db size, live data size, bloat ratio) at INFO/WARN/ERROR depending on how bloated things are, so we'll have early visibility into any future growth trends. Defrag briefly pauses the etcd server, but for our single-node embedded setup that's a sub-second blip.Config versioning for upgrades
Container specs (env vars, args) are baked in at creation time and survive across restarts. Without intervention, the freelist type change would never take effect on existing installs because
Start()reuses the existing container. This PR adds aConfigVersionfield to the persisted etcd state file. When the version doesn't match, the container is torn down and recreated with the current spec (data is preserved via bind mount). Existing installs will go from version 0 to 1 on their first restart after this upgrade.This is a pattern worth adopting for the other container-managed components (buildkit, victoriametrics, victorialogs). See RFD-70 for the broader discussion.
Closes MIR-968