AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains #8549

Draft
cssjr wants to merge 3 commits into openshift:main from cssjr:AROSLSRE-830/reduce-cluster-crashes

Conversation

cssjr commented May 19, 2026

Summary

Fixes critical etcd cascading failure issue (4.5% of clusters) by correcting PodDisruptionBudget configuration and increasing liveness probe tolerance.

Root Cause

When AKS drains a node, the current PDB (minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops.
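
The quorum arithmetic behind this claim can be sketched in a few lines of Go. This is only an illustration: the helper names (`quorum`, `maxVoluntaryEvictions`, `preservesQuorum`) are hypothetical and not part of the repository, and the sketch assumes all replicas are healthy when the drain starts.

```go
package main

import "fmt"

// quorum returns the number of members a cluster of the given size
// needs in order to keep a majority.
func quorum(members int) int {
	return members/2 + 1
}

// maxVoluntaryEvictions returns how many pods a PDB with the given
// minAvailable permits to be evicted at once, assuming every replica
// is currently healthy.
func maxVoluntaryEvictions(replicas, minAvailable int) int {
	if minAvailable >= replicas {
		return 0
	}
	return replicas - minAvailable
}

// preservesQuorum reports whether quorum survives the worst case the
// budget allows.
func preservesQuorum(replicas, minAvailable int) bool {
	remaining := replicas - maxVoluntaryEvictions(replicas, minAvailable)
	return remaining >= quorum(replicas)
}

func main() {
	for _, minAvailable := range []int{1, 2} {
		fmt.Printf("replicas=3 minAvailable=%d evictable=%d quorumSafe=%t\n",
			minAvailable,
			maxVoluntaryEvictions(3, minAvailable),
			preservesQuorum(3, minAvailable))
	}
}
```

With three replicas, `minAvailable: 1` leaves two pods evictable and quorum unprotected, while `minAvailable: 2` leaves one pod evictable and quorum intact.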

Analysis of prow-ci cluster logs shows:

  • 95.5% of clusters handle SIGTERM gracefully
  • 4.5% enter death spiral due to incorrect PDB allowing simultaneous evictions
  • Burst failures (7 clusters in 6 minutes) confirm management cluster events

Changes

  1. PDB fix (pdb.yaml): minAvailable: 1 → minAvailable: 2

    • Prevents Kubernetes from breaking quorum during voluntary disruptions (node drains, upgrades)
    • Industry best practice for 3-node etcd clusters
  2. Liveness probe (statefulset.yaml): failureThreshold: 5 → failureThreshold: 12

    • Increases tolerance from 25s to 60s before killing pods
    • Reduces false-positive restarts during transient issues
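
The 25s and 60s figures follow from multiplying the failure threshold by the probe period. A rough sketch, assuming a `periodSeconds` of 5 (inferred from 25s / 5 failures; the value itself is not quoted in this PR) and ignoring probe timeouts; `livenessTolerance` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"time"
)

// livenessTolerance approximates how long a container can keep failing
// its liveness probe before it is restarted: failureThreshold
// consecutive failures spaced periodSeconds apart.
func livenessTolerance(failureThreshold, periodSeconds int) time.Duration {
	return time.Duration(failureThreshold*periodSeconds) * time.Second
}

func main() {
	// Assumption: periodSeconds is 5; it is not shown in the PR text.
	const periodSeconds = 5
	fmt.Println("before:", livenessTolerance(5, periodSeconds))
	fmt.Println("after: ", livenessTolerance(12, periodSeconds))
}
```

If the probe in statefulset.yaml uses a different period, the tolerance scales accordingly.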

Expected Impact

  • Reduce failure rate from 4.5% to < 1%
  • Should eliminate ~80% of cascading failures
  • Zero performance or resource cost
  • Protects against voluntary disruptions

Test Plan

  • Deploy to prow-ci environment
  • Monitor failure rate over 7 days
  • Target: < 0.5% failure rate
  • Re-evaluate 5-replica option (Phase 2) only if failures persist above 1%
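
The numbers in Expected Impact and Test Plan can be cross-checked with simple arithmetic. The sketch below takes the 4.5% baseline and the ~80% reduction at face value; `residualRate` is a hypothetical helper used only for illustration.

```go
package main

import "fmt"

// residualRate returns the failure rate (in percent) that remains after
// removing the given fraction of failures from a baseline rate.
func residualRate(baselinePct, fractionEliminated float64) float64 {
	return baselinePct * (1 - fractionEliminated)
}

func main() {
	got := residualRate(4.5, 0.80)
	fmt.Printf("residual failure rate: %.2f%%\n", got)
	fmt.Printf("below 1%% (expected impact): %t\n", got < 1.0)
	fmt.Printf("below 0.5%% (test-plan target): %t\n", got < 0.5)
}
```

Under these figures the residual rate is roughly 0.9%, which satisfies the "< 1%" expectation but not the stricter "< 0.5%" target, so meeting the target would need a reduction larger than the ~80% estimate.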

Links

  • Jira: AROSLSRE-830
  • Analysis: dnsresolver/fix-evaluation.md

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Enhanced etcd availability by increasing the minimum number of pods required to remain available during disruptions, improving service continuity.
    • Increased resilience to temporary etcd pod responsiveness issues, reducing unnecessary restarts during brief connectivity fluctuations.

@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026

openshift-ci-robot commented May 19, 2026

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Fixes critical etcd cascading failure issue (4.5% of clusters) by correcting PodDisruptionBudget configuration and increasing liveness probe tolerance.

Root Cause

When AKS drains a node, the current PDB (minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops.

Analysis of prow-ci cluster logs shows:

  • 95.5% of clusters handle SIGTERM gracefully
  • 4.5% enter death spiral due to incorrect PDB allowing simultaneous evictions
  • Burst failures (7 clusters in 6 minutes) confirm management cluster events

Changes

  1. PDB fix (pdb.yaml): minAvailable: 1 → minAvailable: 2
    • Prevents Kubernetes from breaking quorum during voluntary disruptions (node drains, upgrades)
    • Industry best practice for 3-node etcd clusters
  2. Liveness probe (statefulset.yaml): failureThreshold: 5 → failureThreshold: 12
    • Increases tolerance from 25s to 60s before killing pods
    • Reduces false-positive restarts during transient issues
  3. Documentation (dnsresolver/fix-evaluation.md):
    • Comprehensive analysis of all fix options
    • Explains why PDB is the root cause
    • Phase 2 guidance if issues persist

Expected Impact

  • Reduce failure rate from 4.5% to < 1%
  • Should eliminate ~80% of cascading failures
  • Zero performance or resource cost
  • Protects against voluntary disruptions

Test Plan

  • Deploy to prow-ci environment
  • Monitor failure rate over 7 days
  • Target: < 0.5% failure rate
  • Re-evaluate 5-replica option (Phase 2) only if failures persist above 1%

Links

  • Jira: AROSLSRE-830
  • Analysis: dnsresolver/fix-evaluation.md

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026

openshift-ci Bot commented May 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

This PR adjusts two etcd resilience configuration parameters. The pod disruption budget's minimum available replicas is increased from 1 to 2, ensuring at least two etcd instances remain available during voluntary disruptions. The liveness probe failure threshold is raised from 5 to 12, allowing the etcd pod to tolerate more consecutive failed health checks before triggering a restart by Kubernetes.

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Topology-Aware Scheduling Compatibility | ⚠️ Warning | Static PDB YAML hardcodes minAvailable: 2 without topology awareness, which blocks voluntary disruptions on Two-Node clusters (DualReplica/HighlyAvailableArbiter with only 2 schedulable nodes). | Remove hardcoded minAvailable: 2 from pdb.yaml or parameterize based on replica count; rely solely on runtime AdaptPodDisruptionBudget adaptation function for topology handling. |
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the new etcd_test.go file are stable. No test titles contain dynamic values like pod names, timestamps, UUIDs, or node names. All use clear, static, descriptive strings. |
| Test Structure And Quality | ✅ Passed | PR's new etcd_test.go uses standard Go testing with Gomega, not Ginkgo. The custom check applies only to Ginkgo test code, not unit tests. |
| Microshift Test Compatibility | ✅ Passed | The PR modifies only YAML Kubernetes resource definitions (PodDisruptionBudget and StatefulSet) and documentation; no Ginkgo e2e tests are added, making the check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR modifies only YAML configuration files (pdb.yaml, statefulset.yaml), not Ginkgo e2e tests. The SNO test compatibility check does not apply. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only declarative YAML Kubernetes manifests (pdb.yaml, statefulset.yaml); no test code, executable code, or stdout operations that could violate OTE Binary Stdout Contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR; changes are limited to Kubernetes YAML resource files (pdb.yaml, statefulset.yaml) and documentation. The check is not applicable. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objective: mitigating etcd cascading quorum loss during node drains, which matches the core changes (PDB minAvailable increase and liveness probe threshold adjustment). |

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cssjr
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 19, 2026
Comment thread dnsresolver/fix-evaluation.md Outdated

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml (1)

6-6: 💤 Low value

Consider adding an explanatory comment.

To help future maintainers understand the relationship between this PDB and the replica count, consider adding a comment:

 spec:
+  # Ensures quorum (2/3) is maintained during voluntary disruptions for a 3-replica etcd cluster
   minAvailable: 2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`
at line 6, Add an inline explanatory comment to the etcd PodDisruptionBudget
(pdb.yaml) above the minAvailable: 2 setting explaining why minAvailable is set
to 2 (e.g., to preserve etcd quorum given the etcd replica count of X, tolerate
one node disruption while maintaining majority/quorum), and reference the
related resource (the etcd StatefulSet/Deployment that sets the replica count)
so future maintainers know to update this value when the replica count changes;
place the comment immediately above the minAvailable field in the pdb.yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`:
- Line 6: Add an inline explanatory comment to the etcd PodDisruptionBudget
(pdb.yaml) above the minAvailable: 2 setting explaining why minAvailable is set
to 2 (e.g., to preserve etcd quorum given the etcd replica count of X, tolerate
one node disruption while maintaining majority/quorum), and reference the
related resource (the etcd StatefulSet/Deployment that sets the replica count)
so future maintainers know to update this value when the replica count changes;
place the comment immediately above the minAvailable field in the pdb.yaml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bf47cd8b-5517-44d2-b5f9-976a0b794375

📥 Commits

Reviewing files that changed from the base of the PR and between 9e283ae and 6172a6c.

📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
  • dnsresolver/fix-evaluation.md

When AKS drains a node, the current PDB (minAvailable: 1) allows
Kubernetes to evict up to 2 etcd pods simultaneously, breaking
quorum (need 2/3 in a 3-pod cluster). This triggers cascading
shutdowns and endless restart loops.

Changes:
- PDB: minAvailable 1 → 2 to prevent voluntary quorum loss
- Liveness probe: failureThreshold 5 → 12 (25s → 60s) to
  reduce false-positive restarts during transient issues

Expected impact: reduce failure rate from 4.5% to < 1%.

Analysis shows 95.5% of prow-ci clusters handle SIGTERM
gracefully, but 4.5% enter death spiral due to incorrect PDB
allowing simultaneous evictions.

See: AROSLSRE-830
See-also: dnsresolver/fix-evaluation.md

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@cssjr cssjr force-pushed the AROSLSRE-830/reduce-cluster-crashes branch from 509cac9 to 05d664e Compare May 19, 2026 18:31

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`:
- Line 6: The PodDisruptionBudget currently hardcodes minAvailable: 2 which
assumes a 3-member etcd and breaks for other topologies; change the pdb.yaml to
compute/render minAvailable from the effective etcd replica count or
control-plane topology instead of the hardcoded value. Locate the minAvailable
entry in the etcd PDB asset and replace the literal with a
templated/parameterized value tied to the etcd replica count (for example a
template variable like .Values.etcd.replicas or a controlPlaneTopology switch),
and implement logic so minAvailable = 1 for single-replica topologies, = 2 for
3-replica HA, and otherwise compute a safe quorum-aware value (e.g.
floor(replicas/2)+1) so voluntary disruptions remain schedulable across
different topologies. Ensure the new template falls back to a sensible default
if the replica count is missing and add tests/manifest rendering checks to
validate generated minAvailable for 1, 3, and other replica counts.
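
One way the quorum-aware rule described in this comment might look in Go is sketched below. It is not the repository's implementation and does not model the runtime AdaptPodDisruptionBudget function mentioned in the pre-merge checks; `minAvailableFor` and `defaultEtcdReplicas` are hypothetical names, and treating a non-positive count as "missing" is an assumption.

```go
package main

import "fmt"

// defaultEtcdReplicas is a hypothetical fallback used when the replica
// count is missing (represented here as a value <= 0).
const defaultEtcdReplicas = 3

// minAvailableFor returns a quorum-aware minAvailable for an etcd PDB:
// floor(replicas/2)+1, which is 1 for a single replica and 2 for three.
func minAvailableFor(replicas int) int {
	if replicas <= 0 {
		replicas = defaultEtcdReplicas
	}
	return replicas/2 + 1
}

func main() {
	for _, replicas := range []int{0, 1, 3, 5} {
		fmt.Printf("replicas=%d minAvailable=%d\n",
			replicas, minAvailableFor(replicas))
	}
}
```

Note that when minAvailable equals the replica count, as in the single-replica case, a PodDisruptionBudget permits no voluntary evictions at all, so that topology may need separate handling.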

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 98e5e970-3cb9-460b-8f5f-b3a6ba931e64

📥 Commits

Reviewing files that changed from the base of the PR and between 6172a6c and 05d664e.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.34%. Comparing base (9e283ae) to head (3cecc2b).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8549   +/-   ##
=======================================
  Coverage   40.34%   40.34%           
=======================================
  Files         755      755           
  Lines       93167    93167           
=======================================
  Hits        37587    37587           
  Misses      52877    52877           
  Partials     2703     2703           
Flag Coverage Δ
cmd-support 34.30% <ø> (ø)
cpo-hostedcontrolplane 41.76% <ø> (ø)
cpo-other 40.14% <ø> (ø)
hypershift-operator 50.72% <ø> (ø)
other 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

cssjr and others added 2 commits May 19, 2026 11:42
…obe threshold

Updates test fixtures to reflect the failureThreshold change from 5 to 12
in the etcd liveness probe, which was part of the fix to prevent cascading
quorum loss during node drains.

See-also: AROSLSRE-830

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

cssjr commented May 20, 2026

/test all

openshift-ci Bot commented May 20, 2026

@cssjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@cssjr cssjr changed the title AROSLSRE-830: Fix etcd cascading quorum loss during node drains AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains May 20, 2026

hypershift-jira-solve-ci Bot commented May 21, 2026

Now I have the complete root cause analysis. Here is the final report:

Test Failure Analysis Complete

Job Information

  • Prow Job: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main
  • Build ID: hypershift-operator-main-enterprise-contract-29wgc
  • Second Job: Red Hat Konflux / hypershift-operator-enterprise-contract / hypershift-operator-main
  • Second Build ID: hypershift-operator-enterprise-contract-s8jxg
  • Snapshot: hypershift-operator-20260520-002927-000
  • PR: #8549 (AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains)
  • PR SHA: 3cecc2b

Test Failure Analysis

Error

Enterprise Contract verify task FAILURE: 254 success(es), 24 warning(s), 2 failure(s)

Summary

Both Konflux Enterprise Contract checks fail because PR #8549's branch is 20 commits behind main and is missing PR #8557 ("NO-JIRA: Update Konflux Tekton task bundles", merged 2026-05-20), which updated all Tekton task bundle digests and migrated build-image-index from v0.2 to v0.3. The PR branch still references outdated/deprecated task bundle digests that no longer pass Enterprise Contract verification, producing exactly 2 failures. This is unrelated to the PR's actual code changes (etcd PDB and StatefulSet YAML).

Root Cause

The Enterprise Contract (EC) verification enforces that container images were built using trusted, up-to-date Tekton task bundles. PR #8549's branch was forked from main before PR #8557 landed, so it carries stale .tekton/ pipeline configuration:

  1. build-image-index task v0.2 → v0.3 migration not applied: The PR branch still uses task-build-image-index:0.2@sha256:c7b0f7e1... and passes the deprecated COMMIT_SHA and IMAGE_EXPIRES_AFTER parameters. PR #8557 (NO-JIRA: Update Konflux Tekton task bundles) upgraded to task-build-image-index:0.3@sha256:b33bfa8d... and removed those parameters per the official migration guide. The old v0.2 digest is no longer in the trusted bundle allowlist, causing an EC failure.

  2. Outdated task bundle digests: 32+ other Tekton tasks (e.g., buildah-remote-oci-ta, clair-scan, clamav-scan, rpms-signature-scan, etc.) have stale SHA256 digests. When Konflux rotates trusted digests, the old digests eventually fall off the acceptable list, producing the second EC failure.

The 2 failures correspond to Enterprise Contract policy rules that validate task bundle provenance — specifically that the build pipeline used currently-trusted task bundle references. The PR's actual code changes (etcd pdb.yaml, statefulset.yaml, and test fixtures) are irrelevant to these failures.

Proof: On main (after PR #8557), the identical EC checks pass with 512 successes, 16 warnings, 0 failures. On the PR branch, they show 254 successes, 24 warnings, 2 failures.

Recommendations
  1. Rebase PR #8549 onto main; this is the only fix needed. Rebasing will pull in PR #8557's Tekton task bundle updates (.tekton/hypershift-operator-main-tag.yaml and .tekton/pipelines/common-operator-build.yaml), which will resolve both EC failures:

    git fetch origin main
    git rebase origin/main
    git push --force-with-lease
  2. No code changes required — the PR's actual changes (etcd PDB and StatefulSet modifications) are not related to the failure.

  3. Consider enabling Konflux auto-rebasing — if the repository doesn't already have it, enabling automatic rebasing of PR branches when .tekton/ files change on main would prevent this class of failure from blocking unrelated PRs in the future.

Evidence
| Evidence | Detail |
| --- | --- |
| PR branch status | 20 commits behind main, merge base 9e283aee |
| Missing commit | PR #8557 (ef9cde06) "Update Konflux Tekton task bundles" merged 2026-05-20T16:05:39Z |
| PR branch build-image-index | v0.2 (sha256:c7b0f7e1...) with deprecated COMMIT_SHA and IMAGE_EXPIRES_AFTER params |
| Main branch build-image-index | v0.3 (sha256:b33bfa8d...) without deprecated params |
| PR branch EC result | 254 successes, 24 warnings, 2 failures; conclusion: failure |
| Main branch EC result | 512 successes, 16 warnings, 0 failures; conclusion: neutral |
| Files changed by PR #8549 | Only etcd-related: pdb.yaml, statefulset.yaml, test fixtures (no .tekton/ files) |
| Files changed by PR #8557 | .tekton/hypershift-operator-main-tag.yaml, .tekton/pipelines/common-operator-build.yaml |
| EC check 1 pipeline | hypershift-operator-main-enterprise-contract-29wgc |
| EC check 2 pipeline | hypershift-operator-enterprise-contract-s8jxg |
| Both checks ran | 2026-05-20T00:29:29Z (~16 hours before PR #8557 merged) |

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
