AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains by cssjr · Pull Request #8549 · openshift/hypershift

cssjr · 2026-05-19T18:22:49Z

Summary

Fixes a critical etcd cascading-failure issue (affecting 4.5% of clusters) by correcting the PodDisruptionBudget configuration and increasing liveness probe tolerance.

Root Cause

When AKS drains a node, the current PDB (minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops.

Analysis of prow-ci cluster logs shows:

- 95.5% of clusters handle SIGTERM gracefully
- 4.5% enter a death spiral because the incorrect PDB allows simultaneous evictions
- Burst failures (7 clusters in 6 minutes) confirm that the trigger is a management-cluster event
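
The quorum arithmetic behind this root cause can be sketched in a few lines of Python. This is an illustrative model, not code from the PR: the function names are hypothetical, and it assumes all etcd pods are healthy when the drain starts, so the PDB permits `healthy - minAvailable` simultaneous evictions.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def allowed_disruptions(healthy: int, min_available: int) -> int:
    """Voluntary evictions a PDB with the given minAvailable permits right now."""
    return max(healthy - min_available, 0)


def drain_can_break_quorum(members: int, min_available: int) -> bool:
    """True if the PDB lets evictions drop the cluster below quorum."""
    remaining = members - allowed_disruptions(members, min_available)
    return remaining < quorum(members)


# Compare the old (1) and new (2) minAvailable values for a 3-member cluster.
for min_available in (1, 2):
    print(min_available, drain_can_break_quorum(3, min_available))
```

Under these assumptions, `minAvailable: 1` leaves a single member after a drain (below the quorum of 2), while `minAvailable: 2` permits only one eviction at a time.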

Changes

- PDB fix (pdb.yaml): minAvailable: 1 → minAvailable: 2
  - Prevents Kubernetes from breaking quorum during voluntary disruptions (node drains, upgrades)
  - Industry best practice for 3-node etcd clusters
- Liveness probe (statefulset.yaml): failureThreshold: 5 → failureThreshold: 12
  - Increases tolerance from 25s to 60s before killing pods
  - Reduces false-positive restarts during transient issues
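
The 25s and 60s figures are consistent with a probe period of 5 seconds, since the kubelet restarts a container after `failureThreshold` consecutive failed liveness probes. A rough sketch, assuming `periodSeconds: 5` (inferred from the numbers above, not quoted from statefulset.yaml) and ignoring probe timeouts; the function name is hypothetical:

```python
PERIOD_SECONDS = 5  # assumption: inferred from 25s / 5 failures, not read from the manifest


def restart_tolerance_seconds(failure_threshold: int,
                              period_seconds: int = PERIOD_SECONDS) -> int:
    """Approximate span of continuous probe failure before the pod is restarted."""
    return failure_threshold * period_seconds


print(restart_tolerance_seconds(5))   # old threshold
print(restart_tolerance_seconds(12))  # new threshold
```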

Expected Impact

- Reduce failure rate from 4.5% to < 1%
- Should eliminate ~80% of cascading failures
- Zero performance or resource cost
- Protects against voluntary disruptions

Test Plan

- Deploy to prow-ci environment
- Monitor failure rate over 7 days
- Target: < 0.5% failure rate
- Re-evaluate 5-replica option (Phase 2) only if failures persist above 1%
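
To relate the percentages in Expected Impact and Test Plan, here is a small worked calculation. It only restates the figures quoted above (4.5% baseline, ~80% of failures eliminated, 0.5% target); the function names are hypothetical.

```python
def residual_failure_rate(baseline_pct: float, fraction_eliminated: float) -> float:
    """Failure rate (in percent) left after removing a fraction of failures."""
    return baseline_pct * (1.0 - fraction_eliminated)


def required_elimination(baseline_pct: float, target_pct: float) -> float:
    """Fraction of failures that must be eliminated to reach the target rate."""
    return 1.0 - target_pct / baseline_pct


# Residual rate if ~80% of the 4.5% baseline failures are eliminated.
print(f"{residual_failure_rate(4.5, 0.80):.2f}%")
# Share of failures that would have to go away to reach the 0.5% target.
print(f"{required_elimination(4.5, 0.5):.1%}")
```

Eliminating ~80% of failures leaves roughly 0.9%, which sits under the 1% re-evaluation threshold; reaching the 0.5% target would take a reduction of roughly 89%.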

Links

Jira: AROSLSRE-830
Analysis: dnsresolver/fix-evaluation.md

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Enhanced etcd availability by increasing the minimum number of pods required to remain available during disruptions, improving service continuity.
- Increased resilience to temporary etcd pod responsiveness issues, reducing unnecessary restarts during brief connectivity fluctuations.

openshift-merge-bot · 2026-05-19T18:22:52Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-05-19T18:22:54Z

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

(The bot quotes the PR description above, with one additional entry under Changes:)

Documentation (dnsresolver/fix-evaluation.md):
- Comprehensive analysis of all fix options
- Explains why PDB is the root cause
- Phase 2 guidance if issues persist

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-19T18:22:57Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-19T18:23:06Z

Walkthrough

This PR adjusts two etcd resilience configuration parameters. The pod disruption budget's minimum available replicas is increased from 1 to 2, ensuring at least two etcd instances remain available during voluntary disruptions. The liveness probe failure threshold is raised from 5 to 12, allowing the etcd pod to tolerate more consecutive failed health checks before triggering a restart by Kubernetes.

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Topology-Aware Scheduling Compatibility | ⚠️ Warning | Static PDB YAML hardcodes minAvailable: 2 without topology awareness, which blocks voluntary disruptions on Two-Node clusters (DualReplica/HighlyAvailableArbiter with only 2 schedulable nodes). | Remove hardcoded minAvailable: 2 from pdb.yaml or parameterize based on replica count; rely solely on runtime AdaptPodDisruptionBudget adaptation function for topology handling. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the new etcd_test.go file are stable. No test titles contain dynamic values like pod names, timestamps, UUIDs, or node names. All use clear, static, descriptive strings. |
| Test Structure And Quality | ✅ Passed | PR's new etcd_test.go uses standard Go testing with Gomega, not Ginkgo. The custom check applies only to Ginkgo test code, not unit tests. |
| Microshift Test Compatibility | ✅ Passed | The PR modifies only YAML Kubernetes resource definitions (PodDisruptionBudget and StatefulSet) and documentation; no Ginkgo e2e tests are added, making the check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR modifies only YAML configuration files (pdb.yaml, statefulset.yaml), not Ginkgo e2e tests. The SNO test compatibility check does not apply. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only declarative YAML Kubernetes manifests (pdb.yaml, statefulset.yaml); no test code, executable code, or stdout operations that could violate OTE Binary Stdout Contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR; changes are limited to Kubernetes YAML resource files (pdb.yaml, statefulset.yaml) and documentation. The check is not applicable. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objective: mitigating etcd cascading quorum loss during node drains, which matches the core changes (PDB minAvailable increase and liveness probe threshold adjustment). |

openshift-ci · 2026-05-19T18:23:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cssjr
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

🧹 Nitpick comments (1)

control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml (1)

6-6: 💤 Low value

Consider adding an explanatory comment.

To help future maintainers understand the relationship between this PDB and the replica count, consider adding a comment:

 spec:
+  # Ensures quorum (2/3) is maintained during voluntary disruptions for a 3-replica etcd cluster
   minAvailable: 2

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`
at line 6, Add an inline explanatory comment to the etcd PodDisruptionBudget
(pdb.yaml) above the minAvailable: 2 setting explaining why minAvailable is set
to 2 (e.g., to preserve etcd quorum given the etcd replica count of X, tolerate
one node disruption while maintaining majority/quorum), and reference the
related resource (the etcd StatefulSet/Deployment that sets the replica count)
so future maintainers know to update this value when the replica count changes;
place the comment immediately above the minAvailable field in the pdb.yaml.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bf47cd8b-5517-44d2-b5f9-976a0b794375

📥 Commits

Reviewing files that changed from the base of the PR and between 9e283ae and 6172a6c.

📒 Files selected for processing (3)

control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
dnsresolver/fix-evaluation.md

When AKS drains a node, the current PDB (minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops.

Changes:
- PDB: minAvailable 1 → 2 to prevent voluntary quorum loss
- Liveness probe: failureThreshold 5 → 12 (25s → 60s) to reduce false-positive restarts during transient issues

Expected impact: reduce failure rate from 4.5% to < 1%. Analysis shows 95.5% of prow-ci clusters handle SIGTERM gracefully, but 4.5% enter death spiral due to incorrect PDB allowing simultaneous evictions.

See: AROSLSRE-830
See-also: dnsresolver/fix-evaluation.md
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`:
- Line 6: The PodDisruptionBudget currently hardcodes minAvailable: 2 which
assumes a 3-member etcd and breaks for other topologies; change the pdb.yaml to
compute/render minAvailable from the effective etcd replica count or
control-plane topology instead of the hardcoded value. Locate the minAvailable
entry in the etcd PDB asset and replace the literal with a
templated/parameterized value tied to the etcd replica count (for example a
template variable like .Values.etcd.replicas or a controlPlaneTopology switch),
and implement logic so minAvailable = 1 for single-replica topologies, = 2 for
3-replica HA, and otherwise compute a safe quorum-aware value (e.g.
floor(replicas/2)+1) so voluntary disruptions remain schedulable across
different topologies. Ensure the new template falls back to a sensible default
if the replica count is missing and add tests/manifest rendering checks to
validate generated minAvailable for 1, 3, and other replica counts.
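
The rule described in this comment can be sketched as a small function. This is a hypothetical illustration in Python, not HyperShift code: the repository is written in Go, and the AdaptPodDisruptionBudget function named in the pre-merge check is not shown in this thread. The function name and the fallback of 3 replicas are assumptions made for the example.

```python
from typing import Optional

DEFAULT_REPLICAS = 3  # assumption: fallback used when the replica count is missing


def min_available_for(replicas: Optional[int]) -> int:
    """Quorum-aware minAvailable: floor(replicas / 2) + 1."""
    if replicas is None or replicas < 1:
        replicas = DEFAULT_REPLICAS
    return replicas // 2 + 1


# Render the value for the replica counts the comment asks to validate.
for n in (1, 3, 5, None):
    print(n, min_available_for(n))
```

The single formula already yields 1 for a single replica and 2 for three replicas, so the special cases listed in the comment do not need separate branches in this sketch.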

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 98e5e970-3cb9-460b-8f5f-b3a6ba931e64

📥 Commits

Reviewing files that changed from the base of the PR and between 6172a6c and 05d664e.

📒 Files selected for processing (2)

control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml

codecov · 2026-05-19T18:37:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.34%. Comparing base (9e283ae) to head (3cecc2b).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8549   +/-   ##
=======================================
  Coverage   40.34%   40.34%           
=======================================
  Files         755      755           
  Lines       93167    93167           
=======================================
  Hits        37587    37587           
  Misses      52877    52877           
  Partials     2703     2703

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | `34.30% <ø> (ø)` |
| cpo-hostedcontrolplane | `41.76% <ø> (ø)` |
| cpo-other | `40.14% <ø> (ø)` |
| hypershift-operator | `50.72% <ø> (ø)` |
| other | `31.54% <ø> (ø)` |

…luster-crashes

…obe threshold

Updates test fixtures to reflect the failureThreshold change from 5 to 12 in the etcd liveness probe, which was part of the fix to prevent cascading quorum loss during node drains.

See-also: AROSLSRE-830
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

cssjr · 2026-05-20T13:51:07Z

/test all

openshift-ci · 2026-05-20T14:05:37Z

@cssjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci · 2026-05-21T07:47:05Z

Now I have the complete root cause analysis. Here is the final report:

Test Failure Analysis Complete

Job Information

Prow Job: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main
Build ID: hypershift-operator-main-enterprise-contract-29wgc
Second Job: Red Hat Konflux / hypershift-operator-enterprise-contract / hypershift-operator-main
Second Build ID: hypershift-operator-enterprise-contract-s8jxg
Snapshot: hypershift-operator-20260520-002927-000
PR: #8549 (AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains)
PR SHA: 3cecc2b

Test Failure Analysis

Error

Enterprise Contract verify task FAILURE: 254 success(es), 24 warning(s), 2 failure(s)

Summary

Both Konflux Enterprise Contract checks fail because PR #8549's branch is 20 commits behind main and is missing PR #8557 ("NO-JIRA: Update Konflux Tekton task bundles", merged 2026-05-20), which updated all Tekton task bundle digests and migrated build-image-index from v0.2 to v0.3. The PR branch still references outdated/deprecated task bundle digests that no longer pass Enterprise Contract verification, producing exactly 2 failures. This is unrelated to the PR's actual code changes (etcd PDB and StatefulSet YAML).

Root Cause

The Enterprise Contract (EC) verification enforces that container images were built using trusted, up-to-date Tekton task bundles. PR #8549's branch was forked from main before PR #8557 landed, so it carries stale .tekton/ pipeline configuration:

- build-image-index task v0.2 → v0.3 migration not applied: The PR branch still uses task-build-image-index:0.2@sha256:c7b0f7e1... and passes the deprecated COMMIT_SHA and IMAGE_EXPIRES_AFTER parameters. PR #8557 (NO-JIRA: Update Konflux Tekton task bundles) upgraded to task-build-image-index:0.3@sha256:b33bfa8d... and removed those parameters per the official migration guide. The old v0.2 digest is no longer in the trusted bundle allowlist, causing an EC failure.
- Outdated task bundle digests: 32+ other Tekton tasks (e.g., buildah-remote-oci-ta, clair-scan, clamav-scan, rpms-signature-scan) have stale SHA256 digests. When Konflux rotates trusted digests, the old digests eventually fall off the acceptable list, producing the second EC failure.

The 2 failures correspond to Enterprise Contract policy rules that validate task bundle provenance — specifically that the build pipeline used currently-trusted task bundle references. The PR's actual code changes (etcd pdb.yaml, statefulset.yaml, and test fixtures) are irrelevant to these failures.

Proof: On main (after PR #8557), the identical EC checks pass with 512 successes, 16 warnings, 0 failures. On the PR branch, they show 254 successes, 24 warnings, 2 failures.

Recommendations

Rebase PR #8549 onto main; this is the only fix needed. Rebasing will pull in the Tekton task bundle updates from PR #8557 (.tekton/hypershift-operator-main-tag.yaml and .tekton/pipelines/common-operator-build.yaml), which will resolve both EC failures:
```
git fetch origin main
git rebase origin/main
git push --force-with-lease
```
No code changes required — the PR's actual changes (etcd PDB and StatefulSet modifications) are not related to the failure.
Consider enabling Konflux auto-rebasing — if the repository doesn't already have it, enabling automatic rebasing of PR branches when .tekton/ files change on main would prevent this class of failure from blocking unrelated PRs in the future.

Evidence

| Evidence | Detail |
| --- | --- |
| PR branch status | 20 commits behind `main`, merge base `9e283aee` |
| Missing commit | PR #8557 (`ef9cde06`) "Update Konflux Tekton task bundles" merged 2026-05-20T16:05:39Z |
| PR branch `build-image-index` | v0.2 (`sha256:c7b0f7e1...`) with deprecated `COMMIT_SHA` and `IMAGE_EXPIRES_AFTER` params |
| Main branch `build-image-index` | v0.3 (`sha256:b33bfa8d...`) without deprecated params |
| PR branch EC result | 254 successes, 24 warnings, 2 failures → `conclusion: failure` |
| Main branch EC result | 512 successes, 16 warnings, 0 failures → `conclusion: neutral` |
| Files changed by PR #8549 | Only etcd-related: `pdb.yaml`, `statefulset.yaml`, test fixtures (no `.tekton/` files) |
| Files changed by PR #8557 | `.tekton/hypershift-operator-main-tag.yaml`, `.tekton/pipelines/common-operator-build.yaml` |
| EC check 1 pipeline | `hypershift-operator-main-enterprise-contract-29wgc` |
| EC check 2 pipeline | `hypershift-operator-enterprise-contract-s8jxg` |
| Both checks ran | 2026-05-20T00:29:29Z (~16 hours before PR #8557 merged) |

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 19, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 19, 2026

cssjr commented May 19, 2026

Comment thread dnsresolver/fix-evaluation.md Outdated

coderabbitai Bot reviewed May 19, 2026

cssjr force-pushed the AROSLSRE-830/reduce-cluster-crashes branch from 509cac9 to 05d664e Compare May 19, 2026 18:31

coderabbitai Bot reviewed May 19, 2026

Comment thread control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml

cssjr and others added 2 commits May 19, 2026 11:42

Merge remote-tracking branch 'origin/main' into AROSLSRE-830/reduce-c…

72c5113

…luster-crashes

cssjr changed the title ~~AROSLSRE-830: Fix etcd cascading quorum loss during node drains~~ AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains May 20, 2026

hypershift-jira-solve-ci Bot mentioned this pull request May 21, 2026

OCPBUGS-55621: Replace konnectivity Dial with DialContext in konnectivity-https-proxy/cmd.go #8550

Draft

4 tasks
