AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains#8549
AROSLSRE-830: Mitigate etcd cascading quorum loss during node drains#8549cssjr wants to merge 3 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughThis PR adjusts two etcd resilience configuration parameters. The pod disruption budget's minimum available replicas is increased from 1 to 2, ensuring at least two etcd instances remain available during voluntary disruptions. The liveness probe failure threshold is raised from 5 to 12, allowing the etcd pod to tolerate more consecutive failed health checks before triggering a restart by Kubernetes. 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cssjr The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml (1)
6-6: 💤 Low valueConsider adding an explanatory comment.
To help future maintainers understand the relationship between this PDB and the replica count, consider adding a comment:
spec: + # Ensures quorum (2/3) is maintained during voluntary disruptions for a 3-replica etcd cluster minAvailable: 2🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml` at line 6, Add an inline explanatory comment to the etcd PodDisruptionBudget (pdb.yaml) above the minAvailable: 2 setting explaining why minAvailable is set to 2 (e.g., to preserve etcd quorum given the etcd replica count of X, tolerate one node disruption while maintaining majority/quorum), and reference the related resource (the etcd StatefulSet/Deployment that sets the replica count) so future maintainers know to update this value when the replica count changes; place the comment immediately above the minAvailable field in the pdb.yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`:
- Line 6: Add an inline explanatory comment to the etcd PodDisruptionBudget
(pdb.yaml) above the minAvailable: 2 setting explaining why minAvailable is set
to 2 (e.g., to preserve etcd quorum given the etcd replica count of X, tolerate
one node disruption while maintaining majority/quorum), and reference the
related resource (the etcd StatefulSet/Deployment that sets the replica count)
so future maintainers know to update this value when the replica count changes;
place the comment immediately above the minAvailable field in the pdb.yaml.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bf47cd8b-5517-44d2-b5f9-976a0b794375
📒 Files selected for processing (3)
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yamldnsresolver/fix-evaluation.md
When AKS drains a node, the current PDB (minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops. Changes: - PDB: minAvailable 1 → 2 to prevent voluntary quorum loss - Liveness probe: failureThreshold 5 → 12 (25s → 60s) to reduce false-positive restarts during transient issues Expected impact: reduce failure rate from 4.5% to < 1%. Analysis shows 95.5% of prow-ci clusters handle SIGTERM gracefully, but 4.5% enter death spiral due to incorrect PDB allowing simultaneous evictions. See: AROSLSRE-830 See-also: dnsresolver/fix-evaluation.md Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
509cac9 to
05d664e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yaml`:
- Line 6: The PodDisruptionBudget currently hardcodes minAvailable: 2 which
assumes a 3-member etcd and breaks for other topologies; change the pdb.yaml to
compute/render minAvailable from the effective etcd replica count or
control-plane topology instead of the hardcoded value. Locate the minAvailable
entry in the etcd PDB asset and replace the literal with a
templated/parameterized value tied to the etcd replica count (for example a
template variable like .Values.etcd.replicas or a controlPlaneTopology switch),
and implement logic so minAvailable = 1 for single-replica topologies, = 2 for
3-replica HA, and otherwise compute a safe quorum-aware value (e.g.
floor(replicas/2)+1) so voluntary disruptions remain schedulable across
different topologies. Ensure the new template falls back to a sensible default
if the replica count is missing and add tests/manifest rendering checks to
validate generated minAvailable for 1, 3, and other replica counts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 98e5e970-3cb9-460b-8f5f-b3a6ba931e64
📒 Files selected for processing (2)
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/pdb.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8549 +/- ##
=======================================
Coverage 40.34% 40.34%
=======================================
Files 755 755
Lines 93167 93167
=======================================
Hits 37587 37587
Misses 52877 52877
Partials 2703 2703
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…obe threshold Updates test fixtures to reflect the failureThreshold change from 5 to 12 in the etcd liveness probe, which was part of the fix to prevent cascading quorum loss during node drains. See-also: AROSLSRE-830 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/test all |
|
@cssjr: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Now I have the complete root cause analysis. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth Konflux Enterprise Contract checks fail because PR #8549's branch is 20 commits behind Root CauseThe Enterprise Contract (EC) verification enforces that container images were built using trusted, up-to-date Tekton task bundles. PR #8549's branch was forked from
The 2 failures correspond to Enterprise Contract policy rules that validate task bundle provenance — specifically that the build pipeline used currently-trusted task bundle references. The PR's actual code changes (etcd Proof: On Recommendations
Evidence
|
Summary
Fixes critical etcd cascading failure issue (4.5% of clusters) by correcting PodDisruptionBudget configuration and increasing liveness probe tolerance.
Root Cause
When AKS drains a node, the current PDB (
minAvailable: 1) allows Kubernetes to evict up to 2 etcd pods simultaneously, breaking quorum (need 2/3 in a 3-pod cluster). This triggers cascading shutdowns and endless restart loops.Analysis of prow-ci cluster logs shows:
Changes
PDB fix (
pdb.yaml):minAvailable: 1→minAvailable: 2Liveness probe (
statefulset.yaml):failureThreshold: 5→failureThreshold: 12Expected Impact
Test Plan
Links
dnsresolver/fix-evaluation.md🤖 Generated with Claude Code
Summary by CodeRabbit