Skip to content

Harden embedded etcd against freelist bloat#733

Merged
phinze merged 3 commits into
mainfrom
phinze/mir-968-harden-embedded-etcd-freelist-type-automated-defrag
Apr 6, 2026
Merged

Harden embedded etcd against freelist bloat#733
phinze merged 3 commits into
mainfrom
phinze/mir-968-harden-embedded-etcd-freelist-type-automated-defrag

Conversation

@phinze
Copy link
Copy Markdown
Contributor

@phinze phinze commented Apr 6, 2026

We hit a production outage where etcd's BoltDB file ballooned to 627 MB with only 53 MB of live data. Hourly compaction was already configured, but compaction only marks old revisions as deleted. BoltDB never releases those pages back without an explicit defrag call, so the freelist just kept growing until write operations stalled and took down the cluster.

This does two things to prevent a recurrence. First, it switches the BoltDB freelist type from array to map. The array freelist does O(n) page allocation, which is exactly the thing that wedged under pressure. The map freelist uses a hashmap for O(1) lookups. Despite the experimental- prefix on the flag, this has been stable since etcd 3.4.9 and is widely used in production.

Second, it adds a maintenance loop that ticks every 5 minutes, calls the etcd maintenance API to check DbSize vs DbSizeInUse, and triggers defrag when the bloat ratio exceeds 2x. The same loop logs structured health metrics (db size, live data size, bloat ratio) at INFO/WARN/ERROR depending on how bloated things are, so we'll have early visibility into any future growth trends. Defrag briefly pauses the etcd server, but for our single-node embedded setup that's a sub-second blip.

Config versioning for upgrades

Container specs (env vars, args) are baked in at creation time and survive across restarts. Without intervention, the freelist type change would never take effect on existing installs because Start() reuses the existing container. This PR adds a ConfigVersion field to the persisted etcd state file. When the version doesn't match, the container is torn down and recreated with the current spec (data is preserved via bind mount). Existing installs will go from version 0 to 1 on their first restart after this upgrade.

This is a pattern worth adopting for the other container-managed components (buildkit, victoriametrics, victorialogs). See RFD-70 for the broader discussion.

Closes MIR-968

We hit a production outage where etcd's BoltDB file grew to 627 MB
with only 53 MB of live data. Compaction was running (periodic/1h),
but compaction just marks old revisions as deleted — BoltDB never
actually releases pages without an explicit defrag. The freelist
grew so large that write operations stalled.

Two things here: switch the BoltDB freelist type from `array` (O(n)
page allocation) to `map` (O(1)), and add a maintenance loop that
watches db size vs live data and triggers defrag automatically when
the bloat ratio exceeds 2x. The loop also logs health metrics every
5 minutes at appropriate severity levels so we have visibility into
db growth over time.

Closes MIR-968
@phinze phinze requested a review from a team as a code owner April 6, 2026 16:35
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough

Walkthrough

The etcd component now persists a ConfigVersion and introduces currentConfigVersion; EtcdComponent.Start recreates the container when TLS enablement changes, when persisted ConfigVersion differs from currentConfigVersion, or when no prior state exists and TLS is requested. Start launches StartMaintenanceLoop(ctx) immediately after restart or creation. The maintenance loop (5‑minute tick) creates an etcd maintenance client (optional TLS via server.crt/server.key/ca.crt), computes bloatRatio = DbSize/DbSizeInUse, logs severity by thresholds, defragments if DbSize > 2.0 * DbSizeInUse, and logs reclaimed bytes. The container environment adds ETCD_EXPERIMENTAL_BACKEND_BBOLT_FREELIST_TYPE=map.


Comment @coderabbitai help to get the list of available commands and usage tips.

The etcd container spec (env vars, args) is baked in at creation time.
When miren upgrades and restarts, Start() would find the existing
container and reuse it, meaning new env vars like the freelist type
change would never take effect.

Add a ConfigVersion to the persisted etcd state. When the version
doesn't match, the container is torn down and recreated with the
current spec. This generalizes the existing TLS mismatch detection
so any future container spec change just needs a version bump.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
components/etcd/maintenance.go (1)

115-119: Keep bloat_ratio numeric in the structured log.

Formatting it with fmt.Sprintf("%.2f", bloatRatio) turns the field into a string, which makes numeric filtering and aggregation harder in log backends.

Possible cleanup
 	attrs := []any{
 		"db_size_bytes", dbSize,
 		"db_size_in_use_bytes", dbSizeInUse,
-		"bloat_ratio", fmt.Sprintf("%.2f", bloatRatio),
+		"bloat_ratio", bloatRatio,
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/etcd/maintenance.go` around lines 115 - 119, The structured log
currently converts bloat_ratio to a string via fmt.Sprintf("%.2f", bloatRatio)
which prevents numeric filtering—change the attrs entry to pass bloatRatio as a
numeric value (e.g. bloatRatio or a rounded numeric value using
math.Round(bloatRatio*100)/100) instead of fmt.Sprintf; update the attrs slice
declaration that contains "db_size_bytes", dbSize, "db_size_in_use_bytes",
dbSizeInUse, "bloat_ratio", ... to use the numeric bloatRatio, and remove the
now-unused fmt import if applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/etcd/etcd.go`:
- Around line 254-255: The maintenance loop (e.StartMaintenanceLoop(ctx)) is
only started on fresh container creation, so containers that hit
restartExistingContainer() return early and never get health-check/auto-defrag;
update restartExistingContainer() to invoke e.StartMaintenanceLoop(ctx) on the
reuse/restart path as well (place the call before any early returns or at the
end of restartExistingContainer() so the loop is started for reused containers),
and ensure StartMaintenanceLoop is idempotent or guarded so calling it twice is
safe.

---

Nitpick comments:
In `@components/etcd/maintenance.go`:
- Around line 115-119: The structured log currently converts bloat_ratio to a
string via fmt.Sprintf("%.2f", bloatRatio) which prevents numeric
filtering—change the attrs entry to pass bloatRatio as a numeric value (e.g.
bloatRatio or a rounded numeric value using math.Round(bloatRatio*100)/100)
instead of fmt.Sprintf; update the attrs slice declaration that contains
"db_size_bytes", dbSize, "db_size_in_use_bytes", dbSizeInUse, "bloat_ratio", ...
to use the numeric bloatRatio, and remove the now-unused fmt import if
applicable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 71c63952-9098-4fb0-ba77-9d97441e5c7a

📥 Commits

Reviewing files that changed from the base of the PR and between 8d8b403 and 8dd2d73.

📒 Files selected for processing (2)
  • components/etcd/etcd.go
  • components/etcd/maintenance.go

Comment thread components/etcd/etcd.go
Comment thread components/etcd/etcd.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/etcd/etcd.go (1)

188-191: ⚠️ Potential issue | 🟠 Major

Start maintenance loop on successful restart/reuse too.

A successful restart returns at Line 190, so the maintenance loop at Lines 259-260 is skipped. Reused containers then miss health logging and auto-defrag.

💡 Proposed fix
 			err = e.restartExistingContainer(ctx, existingContainer, config)
 			if err == nil {
+				// Ensure maintenance runs for reuse/restart path as well.
+				e.StartMaintenanceLoop(ctx)
 				return nil
 			}

Also applies to: 259-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/etcd/etcd.go` around lines 188 - 191, The code returns immediately
after a successful call to restartExistingContainer, which skips starting the
maintenance loop and causes reused containers to miss health logging and
auto-defrag; change the flow so that after a successful
restartExistingContainer(ctx, existingContainer, config) (i.e., when err == nil)
you do not return but instead proceed to start the maintenance loop (the same
logic currently at Lines 259-260), ensuring the maintenance goroutine/loop (the
maintenance/health/auto-defrag routine referenced in this file) is invoked for
restarted/reused containers as well. Locate the restartExistingContainer call
and remove the early return on success, or move the maintenance loop startup
into a shared code path executed on both restart and fresh container creation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/etcd/etcd.go`:
- Around line 162-166: When prevState is nil we must force a recreate to avoid
reusing containers with stale env/args; update the conditional in the block that
checks prevState (the logic setting configChanged) so that configChanged is set
to true whenever prevState == nil (not only when requestedTLSEnabled is true).
Locate the prevState check and change the assignment of configChanged variable
(and adjust the comment) so a missing persisted state always triggers a
recreate, ensuring migration of config/version changes (look for symbols
prevState, configChanged, requestedTLSEnabled).

---

Duplicate comments:
In `@components/etcd/etcd.go`:
- Around line 188-191: The code returns immediately after a successful call to
restartExistingContainer, which skips starting the maintenance loop and causes
reused containers to miss health logging and auto-defrag; change the flow so
that after a successful restartExistingContainer(ctx, existingContainer, config)
(i.e., when err == nil) you do not return but instead proceed to start the
maintenance loop (the same logic currently at Lines 259-260), ensuring the
maintenance goroutine/loop (the maintenance/health/auto-defrag routine
referenced in this file) is invoked for restarted/reused containers as well.
Locate the restartExistingContainer call and remove the early return on success,
or move the maintenance loop startup into a shared code path executed on both
restart and fresh container creation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f76399aa-ecfe-44f9-a005-8033c5f63270

📥 Commits

Reviewing files that changed from the base of the PR and between 8dd2d73 and 1c310c9.

📒 Files selected for processing (1)
  • components/etcd/etcd.go

Comment thread components/etcd/etcd.go
Copy link
Copy Markdown
Contributor

@evanphx evanphx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

…oat_ratio

Start the maintenance loop when reusing an existing container via
restartExistingContainer, not just on fresh creation. Without this,
restarted containers would never get health logging or auto-defrag.

Also pass bloat_ratio as a numeric value in structured logs instead
of formatting it as a string, so log backends can filter and
aggregate on it.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
components/etcd/maintenance.go (2)

27-29: Consider guarding against multiple concurrent maintenance loops.

StartMaintenanceLoop spawns a new goroutine each time it's called. If Start() were ever called twice (e.g., due to a bug or future refactoring), this would create duplicate maintenance loops leading to redundant status checks, log spam, and potentially concurrent defrag attempts.

A simple guard using sync.Once or an atomic flag would make this idempotent.

♻️ Proposed fix using sync.Once

Add a field to EtcdComponent in etcd.go:

type EtcdComponent struct {
	*base.BaseComponent

	clientPort      int
	peerPort        int
	tlsEnabled      bool
	config          EtcdConfig
	maintenanceOnce sync.Once // guards StartMaintenanceLoop
}

Then in maintenance.go:

 func (e *EtcdComponent) StartMaintenanceLoop(ctx context.Context) {
-	go e.maintenanceLoop(ctx)
+	e.maintenanceOnce.Do(func() {
+		go e.maintenanceLoop(ctx)
+	})
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/etcd/maintenance.go` around lines 27 - 29, StartMaintenanceLoop
currently unconditionally starts a new goroutine which can create duplicate
maintenance loops; add a sync.Once field (e.g., maintenanceOnce) to the
EtcdComponent struct and use it in StartMaintenanceLoop to call
maintenanceOnce.Do(func(){ go e.maintenanceLoop(ctx) }) so the maintenanceLoop
is only spawned once; reference the EtcdComponent type, the new maintenanceOnce
field, the StartMaintenanceLoop method, and the existing maintenanceLoop
function when making the change.

95-98: Set TLS minimum version for defense in depth.

The TLS config is missing MinVersion. While etcd 3.5 defaults to TLS 1.2 on the server side, explicitly setting the minimum version hardens the client against downgrade attacks.

🔒 Proposed fix to set MinVersion
 	return &tls.Config{
 		Certificates: []tls.Certificate{cert},
 		RootCAs:      pool,
+		MinVersion:   tls.VersionTLS12,
 	}, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/etcd/maintenance.go` around lines 95 - 98, The TLS config returned
by the function building the client TLS settings (the block that returns
&tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool,}, nil) lacks
an explicit MinVersion; update the tls.Config in that function (the code that
constructs and returns *tls.Config in maintenance.go) to set MinVersion:
tls.VersionTLS12 to enforce a minimum TLS1.2 and prevent downgrade attacks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@components/etcd/maintenance.go`:
- Around line 27-29: StartMaintenanceLoop currently unconditionally starts a new
goroutine which can create duplicate maintenance loops; add a sync.Once field
(e.g., maintenanceOnce) to the EtcdComponent struct and use it in
StartMaintenanceLoop to call maintenanceOnce.Do(func(){ go
e.maintenanceLoop(ctx) }) so the maintenanceLoop is only spawned once; reference
the EtcdComponent type, the new maintenanceOnce field, the StartMaintenanceLoop
method, and the existing maintenanceLoop function when making the change.
- Around line 95-98: The TLS config returned by the function building the client
TLS settings (the block that returns &tls.Config{Certificates:
[]tls.Certificate{cert}, RootCAs: pool,}, nil) lacks an explicit MinVersion;
update the tls.Config in that function (the code that constructs and returns
*tls.Config in maintenance.go) to set MinVersion: tls.VersionTLS12 to enforce a
minimum TLS1.2 and prevent downgrade attacks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a1ddb532-6dd9-47c1-9dfe-4c2991cf78b0

📥 Commits

Reviewing files that changed from the base of the PR and between 1c310c9 and 1823b0d.

📒 Files selected for processing (2)
  • components/etcd/etcd.go
  • components/etcd/maintenance.go

@phinze phinze enabled auto-merge April 6, 2026 21:08
@phinze phinze disabled auto-merge April 6, 2026 21:10
@phinze phinze enabled auto-merge April 6, 2026 21:19
@phinze phinze merged commit 9f06006 into main Apr 6, 2026
19 of 21 checks passed
@phinze phinze deleted the phinze/mir-968-harden-embedded-etcd-freelist-type-automated-defrag branch April 6, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants