Skip to content

feat: enable failsafe mode for single-host nodes#380

Merged
jason-lynch merged 2 commits intorelease/v0.8.0from
feat/PLAT-545/enable-failsafe-for-single-host-nodes
May 6, 2026
Merged

feat: enable failsafe mode for single-host nodes#380
jason-lynch merged 2 commits intorelease/v0.8.0from
feat/PLAT-545/enable-failsafe-for-single-host-nodes

Conversation

@jason-lynch
Copy link
Copy Markdown
Member

Summary

Patroni's failsafe mode prevents instances from entering read-only mode when they lose their connection to the DCS, as long as they can contact all other Patroni instances in the cluster. There could be edge cases to this behavior when there's more than one instance in the Patroni cluster; however, it's known to be safe if there's only one host in the cluster.

This commit enables failsafe mode when there's only a single host in the node (meaning the Patroni cluster only has a single instance) and disables it otherwise.

This change is backward-compatible and does not need a migration.

Testing

# run the compose dev environment
make dev-watch

# then in another terminal, create a database with two nodes: one with two
# hosts and another with one host
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-3"] },
      { "name": "n2", "host_ids": ["host-2"] }
    ]
  }
}
EOF

# in separate terminals, start connections to n1's primary and n2's instance using
# using the interactive prompts
cp-psql

# then, stop the control plane servers with ctrl+c

# wait a few seconds for Patroni to notice that it's lost its connection to Etcd
# then in your psql windows, query pg_is_in_recovery()
select pg_is_in_recovery();

# what you should see is that n1's primary entered recovery mode because it's
# configured with failsafe_mode: false
# n2's primary should _not_ have entered recovery mode and should still be
# writable because it only has a single host.

PLAT-545

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38464325-3f5f-4766-ad87-b420fd20c680

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added a public NodeSize field to InstanceSpec, populate it per-node by counting HostIDs, and use it to patch Patroni's dynamic config to enable/disable failsafe mode before opening the PostgreSQL connection; also imported utils and reused a cached patroniClient instance.

Changes

Per-Node Sizing and Failsafe Configuration

Layer / File(s) Summary
Data Shape
server/internal/database/spec.go
Added exported NodeSize int to InstanceSpec and updated Clone() to include it.
Data Population
server/internal/database/spec.go
When building per-node InstanceSpec in NodeInstances, compute nodeSize := len(node.HostIDs) and set NodeSize.
Core Behavior
server/internal/database/instance_resource.go
Cache patroniClient in a local variable and reuse it; insert logic to patch Patroni dynamic config to set failsafe (using utils.PointerTo for the boolean) based on NodeSize before establishing the PostgreSQL connection; add error handling for the patch.
Imports / Wiring
server/internal/database/instance_resource.go
Added import of the utils package required for pointer helper usage.

Poem

A rabbit counts each host with care,
NodeSize grows from HostIDs there,
Patroni patched before DB starts,
Failsafe set by tiny hearts,
Hopping changes done with flair 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning Description covers summary, changes, and testing with detailed instructions. Checklist sections are incomplete: missing marks for tests, documentation, issue linking, changelog entry, and breaking changes callout. Complete all checklist items by marking them as done or noting why they don't apply, and explicitly document any breaking changes if applicable.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly and specifically describes the main change: enabling failsafe mode for single-host nodes, which is the primary objective of this PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/PLAT-545/enable-failsafe-for-single-host-nodes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 5, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@jason-lynch
Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/internal/database/spec.go (1)

580-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clone drops InPlaceRestore state.

Line 580-601: InstanceSpec.Clone() does not copy InPlaceRestore, so cloned specs silently reset it to false.

Proposed fix
 return &InstanceSpec{
 	InstanceID:       s.InstanceID,
 	TenantID:         utils.ClonePointer(s.TenantID),
 	DatabaseID:       s.DatabaseID,
 	HostID:           s.HostID,
 	DatabaseName:     s.DatabaseName,
 	NodeName:         s.NodeName,
 	NodeOrdinal:      s.NodeOrdinal,
 	PgEdgeVersion:    s.PgEdgeVersion.Clone(),
 	Port:             utils.ClonePointer(s.Port),
 	PatroniPort:      utils.ClonePointer(s.PatroniPort),
 	CPUs:             s.CPUs,
 	MemoryBytes:      s.MemoryBytes,
 	DatabaseUsers:    users,
 	BackupConfig:     s.BackupConfig.Clone(),
 	RestoreConfig:    s.RestoreConfig.Clone(),
 	PostgreSQLConf:   maps.Clone(s.PostgreSQLConf),
 	ClusterSize:      s.ClusterSize,
 	NodeSize:         s.NodeSize,
+	InPlaceRestore:   s.InPlaceRestore,
 	OrchestratorOpts: s.OrchestratorOpts.Clone(),
 	AllHostIDs:       slices.Clone(s.AllHostIDs),
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/internal/database/spec.go` around lines 580 - 601, InstanceSpec.Clone
currently omits the InPlaceRestore field causing clones to reset that state;
update the Clone implementation (the constructor returning &InstanceSpec{...})
to include the InPlaceRestore field by adding InPlaceRestore: s.InPlaceRestore
(or the appropriate pointer clone if the field is a pointer) so the cloned
InstanceSpec preserves the original InPlaceRestore value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@server/internal/database/spec.go`:
- Around line 580-601: InstanceSpec.Clone currently omits the InPlaceRestore
field causing clones to reset that state; update the Clone implementation (the
constructor returning &InstanceSpec{...}) to include the InPlaceRestore field by
adding InPlaceRestore: s.InPlaceRestore (or the appropriate pointer clone if the
field is a pointer) so the cloned InstanceSpec preserves the original
InPlaceRestore value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fda37d05-3da5-4bd2-8d96-aaa1c3abfce8

📥 Commits

Reviewing files that changed from the base of the PR and between 21f23cb and 0fa0123.

📒 Files selected for processing (2)
  • server/internal/database/instance_resource.go
  • server/internal/database/spec.go

@jason-lynch jason-lynch marked this pull request as ready for review May 6, 2026 13:56
Patroni's failsafe mode prevents instances from going into read-only
mode when they lose their connection to the DCS as long as they're able
to contact all other Patroni instances in the cluster. There could be
edge-cases to this behavior when there's more than one instance in the
Patroni cluster, however it's known to be safe if there's only one host
in cluster.

This commit enables failsafe mode when there's only a single host in the
node (meaning the Patroni cluster only has a single instance) and
disables it otherwise.

This change is backward-compatible and does not need a migration.

PLAT-545
@jason-lynch jason-lynch force-pushed the feat/PLAT-545/enable-failsafe-for-single-host-nodes branch from 0fa0123 to 7d83404 Compare May 6, 2026 15:13
@jason-lynch jason-lynch changed the base branch from main to release/v0.8.0 May 6, 2026 15:14
@jason-lynch jason-lynch force-pushed the feat/PLAT-545/enable-failsafe-for-single-host-nodes branch from 7d83404 to 3a368c3 Compare May 6, 2026 15:36
@jason-lynch jason-lynch merged commit ffd8ddc into release/v0.8.0 May 6, 2026
3 checks passed
@jason-lynch jason-lynch deleted the feat/PLAT-545/enable-failsafe-for-single-host-nodes branch May 6, 2026 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants