OCPBUGS-85258: 5.0 rebase 3.6.11#375

Merged
openshift-merge-bot[bot] merged 19 commits into openshift:main from dusk125:5.0-rebase-3.6.11
May 8, 2026
Conversation

@dusk125

@dusk125 dusk125 commented May 7, 2026

Rebase handled by Claude

❯ ./bin/etcd --version
etcd Version: 3.6.11
Git SHA: 821d95e
Go Version: go1.25.9
Go OS/Arch: darwin/arm64

Summary by CodeRabbit

  • New Features

    • Stronger transaction authorization checks for operations using previous values and leases.
    • Added end-to-end and integration tests validating member-add and auth/transaction behaviors.
  • Bug Fixes

    • Quorum connectivity check when adding members now requires connection to a majority of peers.
  • Chores

    • Bumped project version to 3.6.11, Go toolchain to 1.25.9, updated build base images and several indirect Go dependencies.

jonathan-albrecht-ibm and others added 18 commits April 5, 2026 20:24
In CI, the TestGateway and TestMixVersionsSnapshotByAddingMember
are flaky due to the TestMixVersionsSnapshotByAddingMember test
sometimes not closing the second etcd process. This happens if
the second process has not had enough time to become healthy according
to the logic in EtcdServer.mayRemoveMember.

Fix this by retrying member removal for twice the etcdserver.HealthInterval
in EtcdProcessCluster.CloseProc.

Signed-off-by: Jonathan Albrecht <jonathan.albrecht@ibm.com>
…ck-of-#20840-upstream-release-3.6

Automated cherry pick of etcd-io#20840
Signed-off-by: Wei Fu <fuweid89@gmail.com>
…r is down

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
…g member is down

Assume the new member is unavailable and check whether quorum is still preserved.

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
[release-3.6] Bump golang.org/x/image to v0.39.0 to resolve GO-2026-4962
[release-3.6] Fix the issue that cannot add a new member when one member is down, even if quorum is still satisfied
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
[release-3.6] Refactor auth check for Put requests in TXN
…rbac check issue

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
…XN bypass RBAC check

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
…ck issue

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
[release-3.6] Fix read access via PrevKv or lease attachment in a Put request in etcd transactions bypass RBAC authorization checks
Signed-off-by: Ivan Valdes <iv@a.ki>
@coderabbitai

coderabbitai Bot commented May 7, 2026

Walkthrough

Upgrades Go toolchain/dependencies and OpenShift base images; refactors txn authorization by moving auth checks into the apply layer (adding exported CheckTxnAuth); and relaxes member-add quorum gating from “connected to all peers” to “connected to a majority” with related test and helper changes.

Changes

Toolchain & Build Infrastructure

Layer / File(s) Summary
Build config
.ci-operator.yaml
build_root_image.tag updated from ...-openshift-4.23 to ...-openshift-5.0.
Docker images / Multi-stage builds
Dockerfile* (e.g. Dockerfile.art-cachi2, Dockerfile.installer, Dockerfile.installer.art-cachi2, Dockerfile.rhel)
All builder/runtime base image tags bumped from OCP 4.23 variants to 5.0 variants.
Go toolchain pins
.go-version, */go.mod, tools/*/go.mod, tests/go.mod
toolchain/.go-version bumped from go1.25.8 to go1.25.9 across modules and tools.
Module dependency bumps
go.mod, api/go.mod, client/*/go.mod, etcdctl/go.mod, etcdutl/go.mod, pkg/go.mod, server/go.mod, tests/go.mod, ...
Internal module versions advanced from v3.6.10 to v3.6.11 and multiple golang.org/x/* indirect deps bumped (net, sys, text, crypto, etc.).
Version constant
api/version/version.go
Exported Version constant bumped from 3.6.10 to 3.6.11.

Authorization & Transaction Handling (apply-layer takeover)

Layer / File(s) Summary
Core apply auth implementation
server/etcdserver/apply/apply_auth.go
Moved txn authorization into apply layer: added exported CheckTxnAuth(as auth.AuthStore, ai *auth.AuthInfo, lessor lease.Lessor, rt *pb.TxnRequest) error, plus helpers checkPutAuth, checkTxnPermission, checkTxnReqsPermission, checkLeasePuts, checkLeasePutsKeys; refactored Put, Txn, and LeaseRevoke to call helpers.
Tests for apply auth
server/etcdserver/apply/apply_auth_test.go
Refactored tests to call new helper signatures; added TestCheckTxnAuth table-driven cases and setupAuth helper; updated TestCheckLeasePutsKeys.
Removed auth from txn layer
server/etcdserver/txn/txn.go
Removed auth import and deleted previous CheckTxnAuth + related helpers from txn package (auth checks relocated).
Txn tests adjusted
server/etcdserver/txn/txn_test.go
Removed auth-focused tests and imports that exercised txn-layer auth checks; retained non-auth txn tests.
Call site updated
server/etcdserver/v3_server.go
Read-only txn auth check switched from txn.CheckTxnAuth to apply2.CheckTxnAuth.
Integration tests
tests/integration/v3_auth_test.go
Added perm field to test users and two tests: TestReadWithPrevKvInTXN and TestPutWithLeaseInTXN verifying permission-denied behavior for PrevKv and lease-attached puts.
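
The rows above describe authorization checks for puts inside a transaction that request the previous value or attach a lease. As a rough illustration only, the idea can be sketched with toy types. Everything here (`perm`, `putOp`, `checkPut`, `checkTxn`) is hypothetical and does not reproduce etcd's `auth.AuthStore`, `lease.Lessor`, or `pb.TxnRequest` APIs; the sketch assumes that PrevKv needs read permission on the key and that a lease-attached put needs write permission on every key already attached to that lease, as the summary and sequence diagram suggest.

```go
package main

import (
	"errors"
	"fmt"
)

// errPermissionDenied stands in for the permission-denied error (toy value).
var errPermissionDenied = errors.New("permission denied")

// perm is a hypothetical per-key permission set for a single user.
type perm struct{ read, write bool }

// putOp is a toy model of a Put inside a transaction.
type putOp struct {
	key       string
	prevKv    bool     // the response would return the previous value (a read)
	leaseKeys []string // keys already attached to the lease used by this put
}

// checkPut requires write permission on the key, read permission when PrevKv
// is requested, and write permission on every key attached to the lease.
func checkPut(perms map[string]perm, op putOp) error {
	if !perms[op.key].write {
		return errPermissionDenied
	}
	if op.prevKv && !perms[op.key].read {
		return errPermissionDenied
	}
	for _, k := range op.leaseKeys {
		if !perms[k].write {
			return errPermissionDenied
		}
	}
	return nil
}

// checkTxn applies checkPut to every put in the transaction.
func checkTxn(perms map[string]perm, ops []putOp) error {
	for _, op := range ops {
		if err := checkPut(perms, op); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	perms := map[string]perm{"foo": {write: true}} // write-only on "foo"
	fmt.Println(checkTxn(perms, []putOp{{key: "foo"}}))               // prints <nil>
	fmt.Println(checkTxn(perms, []putOp{{key: "foo", prevKv: true}})) // prints permission denied
}
```

In this toy model a write-only user can still put to the key, but the same put with PrevKv set is rejected because it would leak the previous value.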

Member Addition & Quorum Logic

Layer / File(s) Summary
Quorum helpers
server/etcdserver/util.go
Added isConnectedToQuorumAfterAddingNewMemberSince and quorum(num int); removed isConnectedFullySince; retained isConnectedToQuorumSince.
Member add gating
server/etcdserver/server.go
mayAddMember now uses quorum-majority check (isConnectedToQuorumAfterAddingNewMemberSince) instead of requiring full connectivity; log/error message updated accordingly.
E2E test
tests/e2e/ctl_v3_member_test.go
Added TestCtlV3MemberAddAsLearnerWithOneMemberDown covering member-add scenarios with members down across cluster sizes.
Test framework retry logic
tests/framework/e2e/cluster.go
CloseProc member-removal retry loop now uses time-derived retry count based on etcdserver.HealthInterval and reports tries in error message.
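
The change from "connected to all peers" to "connected to a majority" can be illustrated with a little arithmetic. This is a sketch under assumptions, not the code in `server/etcdserver/util.go`: the function names below are made up, and the majority is assumed to be computed over the current voters plus one, as the sequence diagram's `quorum(currentMembers + 1)` step indicates.

```go
package main

import "fmt"

// quorum returns the smallest majority of n voting members.
func quorum(n int) int { return n/2 + 1 }

// connectedToAll models the old gate: every voting member, including the
// local one, must be reachable.
func connectedToAll(connected, voters int) bool { return connected == voters }

// connectedToQuorumAfterAdd models the new gate: the reachable members must
// still form a majority once the cluster has grown by one member that is
// assumed to be unavailable.
func connectedToQuorumAfterAdd(connected, voters int) bool {
	return connected >= quorum(voters+1)
}

func main() {
	// connected counts the local member plus the peers it can reach.
	cases := []struct{ voters, connected int }{{3, 3}, {3, 2}, {5, 4}, {5, 3}}
	for _, c := range cases {
		fmt.Printf("voters=%d connected=%d old=%v new=%v\n",
			c.voters, c.connected,
			connectedToAll(c.connected, c.voters),
			connectedToQuorumAfterAdd(c.connected, c.voters))
	}
}
```

Under these assumptions a five-member cluster with one member down passes the new gate (4 of 6 is a majority) although it failed the old one, while a three-member cluster with one member down still does not (2 of 4 is not a majority).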

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as EtcdServer
    participant Apply as apply.CheckTxnAuth
    participant AuthStore
    participant Lessor

    Client->>Server: Txn(request)
    Server->>Apply: CheckTxnAuth(authStore, authInfo, lessor, txnReq)
    Apply->>AuthStore: Validate compare key permissions
    AuthStore-->>Apply: ok / denied
    alt compares authorized
        Apply->>Apply: checkTxnReqsPermission(successOps)
        Apply->>AuthStore: IsRangePermitted / IsPutPermitted per op
        AuthStore-->>Apply: ok / denied
    end
    alt lease-attached puts present
        Apply->>Lessor: Lookup(leaseID)
        Lessor-->>Apply: lease keys
        Apply->>AuthStore: IsPutPermitted for lease keys
        AuthStore-->>Apply: ok / denied
    end
    alt permission denied
        Apply-->>Server: ErrPermissionDenied
        Server-->>Client: PERMISSION_DENIED
    else all checks pass
        Apply-->>Server: nil
        Server->>Server: Execute Txn
        Server-->>Client: Txn result
    end
sequenceDiagram
    participant Admin
    participant Server as EtcdServer
    participant MemberAdd as mayAddMember
    participant Util as util.isConnectedToQuorumAfterAddingNewMemberSince
    participant Transporter

    Admin->>Server: AddMember(newMember)
    Server->>MemberAdd: mayAddMember(newMember)
    MemberAdd->>Util: isConnectedToQuorumAfterAddingNewMemberSince(transport, since, self, members)
    Util->>Util: compute quorum(currentMembers + 1)
    Util->>Transporter: check connectivity to peers
    Transporter-->>Util: active peers list
    alt connected to majority after add
        Util-->>MemberAdd: true
        MemberAdd-->>Server: proceed
        Server-->>Admin: Member added
    else not connected to majority
        Util-->>MemberAdd: false
        MemberAdd-->>Server: error - would break active quorum
        Server-->>Admin: FAILED - not connected to majority
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title references OCPBUGS-85258 and mentions '5.0 rebase 3.6.11', which directly corresponds to the PR's primary changes: updating to etcd v3.6.11 with OpenShift 5.0 base images and Go 1.25.9.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic. New tests use static strings with no dynamic information like timestamps, UUIDs, or generated identifiers.
Test Structure And Quality ✅ Passed Custom check is not applicable. PR contains only standard Go testing framework tests (func Test*), not Ginkgo tests. Check requires assessment of Ginkgo DSL test code.
Microshift Test Compatibility ✅ Passed The PR adds new tests but none use Ginkgo patterns (It, Describe, Context, When). The check applies only to Ginkgo e2e tests, so it is not applicable here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Custom check not applicable. PR adds etcd native tests using standard Go testing.T, not OpenShift Ginkgo e2e tests. The check specifically targets new Ginkgo tests.
Topology-Aware Scheduling Compatibility ✅ Passed This PR updates etcd to v3.6.11 with source code refactoring, build configuration updates, and dependency versions. No Kubernetes manifests, operator code, or scheduling constraints were introduced.
Ote Binary Stdout Contract ✅ Passed No stdout writes in process-level code. No fmt.Print, log.Print, println, or BeforeSuite functions found. Changes are config updates and standard tests only.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Custom check is for Ginkgo e2e tests only. New tests use standard Go testing.T, not Ginkgo. Repository has no Ginkgo framework usage.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 7, 2026
@openshift-ci-robot

@dusk125: This pull request references Jira Issue OCPBUGS-85258, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sandeepknd

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

❯ ./bin/etcd --version
etcd Version: 3.6.11
Git SHA: 821d95e
Go Version: go1.25.9
Go OS/Arch: darwin/arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from deads2k, sandeepknd and tjungblu May 7, 2026 15:28
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
@openshift-ci-robot

@dusk125: This pull request references Jira Issue OCPBUGS-85258, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sandeepknd

Details

In response to this:

❯ ./bin/etcd --version
etcd Version: 3.6.11
Git SHA: 821d95e
Go Version: go1.25.9
Go OS/Arch: darwin/arm64

Summary by CodeRabbit

Release Notes

  • New Features

  • Enhanced transaction authorization checks for operations with previous values and leases.

  • Added end-to-end test coverage for member addition resilience.

  • Bug Fixes

  • Improved quorum connectivity validation when adding new members; now requires majority connection instead of full connectivity.

  • Chores

  • Updated Go toolchain to 1.25.9 and updated build base images.

  • Bumped indirect Go dependencies (golang.org/x/crypto, golang.org/x/net, golang.org/x/sys, golang.org/x/text).

  • Version bumped to 3.6.11.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
etcdutl/go.mod (1)

73-81: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upgrade go.opentelemetry.io/otel to v1.41.0 to address HIGH severity DoS (GHSA-mh2q-q3fh-2475 / CVE-2026-29181).

go.opentelemetry.io/otel versions v1.36.0–v1.40.0 are affected; the fix is v1.41.0. The vulnerability allows attackers to amplify CPU and allocations by sending many baggage: header lines, even when each individual value is within the per-value parse limit. CVSS score is 7.5 HIGH (AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H).

This dependency appears at v1.40.0 in etcdutl/go.mod, go.mod, server/go.mod, and tests/go.mod. All affected go.mod files should be updated together to v1.41.0 or above.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@etcdutl/go.mod` around lines 73 - 81, The import entry for
go.opentelemetry.io/otel is pinned to v1.40.0 (e.g., the line
"go.opentelemetry.io/otel v1.40.0 // indirect") and must be upgraded to v1.41.0
to fix CVE-2026-29181; update that module line in all affected go.mod files
(etcdutl/go.mod, root go.mod, server/go.mod, tests/go.mod) to v1.41.0 (or
later), run "go get go.opentelemetry.io/otel@v1.41.0" and "go mod tidy" in each
module to refresh transitive deps, then run the project's tests/build to ensure
nothing breaks.
🧹 Nitpick comments (2)
tests/e2e/ctl_v3_member_test.go (2)

660-660: ⚡ Quick win

time.Sleep after Kill() does not wait for a new leader to be elected.

If the killed member(s) include the current leader, the remaining cluster must hold a new election before MemberAddAsLearner can succeed. etcdserver.HealthInterval + 2*time.Second (≈ 7 s with the 5 s HealthInterval) is usually enough, but it's unconditional and timing-sensitive. A WaitLeader call on the surviving members before proceeding would make the test deterministic without being slower on average.

🔧 Suggested fix
 			time.Sleep(etcdserver.HealthInterval + 2*time.Second)
+			epc.WaitLeader(t)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/ctl_v3_member_test.go` at line 660, Replace the unconditional
time.Sleep after Kill() with a deterministic wait for a new leader by invoking
WaitLeader on the surviving member(s) before calling MemberAddAsLearner;
specifically, remove the time.Sleep(etcdserver.HealthInterval + 2*time.Second)
and call the cluster/peer helper like survivingMember.WaitLeader(ctx, timeout)
(or the existing test helper used elsewhere) so the test only proceeds once a
leader is elected and MemberAddAsLearner is invoked reliably.

649-651: 💤 Low value

Consider logging or propagating epc.Close() errors.

Silently discarding the close error with _ = epc.Close() can hide resource-cleanup failures between sub-test iterations. Other defers in this file use require.NoError(t, epc.Close()). At minimum, a t.Logf on error would help diagnose flaky teardowns.

🔧 Suggested fix
-			defer func() {
-				_ = epc.Close()
-			}()
+			defer func() {
+				require.NoError(t, epc.Close())
+			}()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/ctl_v3_member_test.go` around lines 649 - 651, The defer is
silently discarding epc.Close() errors; replace the anonymous defer using `_ =
epc.Close()` with an explicit error check so teardown failures are
surfaced—e.g., in the defer func() call epc.Close(), capture its error and call
require.NoError(t, err) (or if outside a subtest context use t.Logf("epc.Close
error: %v", err) then fail as appropriate). Update the defer that wraps
epc.Close() to use that explicit check instead of silencing the return value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/etcdserver/apply/apply_auth_test.go`:
- Around line 1049-1052: The test hardcodes auth.AuthInfo{Username: "foo",
Revision: 8} which can mismatch the authorizer's current revision and produce
ErrAuthOldRevision; update the table test to use the authorizer's current
revision by replacing the literal revision with as.Revision() when calling
CheckTxnAuth (i.e., pass &auth.AuthInfo{Username: "foo", Revision:
as.Revision()}), so CheckTxnAuth and the permission logic are validated against
the current auth revision.

In `@server/etcdserver/server.go`:
- Around line 1431-1432: The quorum-safety check currently always treats the new
member as a voting member by calling
isConnectedToQuorumAfterAddingNewMemberSince(s.r.transport, ..., s.MemberID(),
s.cluster.VotingMembers()); change this so the "post-add quorum bump" is only
applied when the member being added is a voting member — for learner additions
do NOT call the "AfterAddingNewMember" variant and instead check quorum against
the current voting set (i.e., use the non-post-add quorum check such as
isConnectedToQuorumSince or the equivalent check with s.cluster.VotingMembers()
that does not include the new member). Ensure you detect learner vs voter using
the member type information available in the add path and switch between
isConnectedToQuorumAfterAddingNewMemberSince(...) for voters and the non-bumped
quorum check for learners.

In `@tests/framework/e2e/cluster.go`:
- Around line 849-863: The loop that calls memberCtl.MemberRemove is treating
only the "member not found" error as success and ignores the successful case
(err == nil); update the loop in the function containing memberCtl.MemberRemove
so that if MemberRemove returns nil you set memberRemoved = true and break
immediately, otherwise keep the existing check for strings.Contains(err.Error(),
"member not found") to mark success; this change ensures both a successful
removal (nil error) and an already-removed state mark memberRemoved and stop
retrying.

In `@tests/integration/v3_auth_test.go`:
- Around line 384-390: The test currently silently falls back to READWRITE when
user.perm is non-empty but not a valid key in authpb.Permission_Type_value;
change the logic in the test setup so that when len(user.perm) > 0 and the
lookup into authpb.Permission_Type_value yields ok == false, the test fails
immediately (e.g., call t.Fatalf or require.FailNow) with a clear message
referencing the invalid user.perm, instead of assigning permType :=
authpb.READWRITE; otherwise keep the default behavior when user.perm is empty.

---

Outside diff comments:
In `@etcdutl/go.mod`:
- Around line 73-81: The import entry for go.opentelemetry.io/otel is pinned to
v1.40.0 (e.g., the line "go.opentelemetry.io/otel v1.40.0 // indirect") and must
be upgraded to v1.41.0 to fix CVE-2026-29181; update that module line in all
affected go.mod files (etcdutl/go.mod, root go.mod, server/go.mod, tests/go.mod)
to v1.41.0 (or later), run "go get go.opentelemetry.io/otel@v1.41.0" and "go mod
tidy" in each module to refresh transitive deps, then run the project's
tests/build to ensure nothing breaks.

---

Nitpick comments:
In `@tests/e2e/ctl_v3_member_test.go`:
- Line 660: Replace the unconditional time.Sleep after Kill() with a
deterministic wait for a new leader by invoking WaitLeader on the surviving
member(s) before calling MemberAddAsLearner; specifically, remove the
time.Sleep(etcdserver.HealthInterval + 2*time.Second) and call the cluster/peer
helper like survivingMember.WaitLeader(ctx, timeout) (or the existing test
helper used elsewhere) so the test only proceeds once a leader is elected and
MemberAddAsLearner is invoked reliably.
- Around line 649-651: The defer is silently discarding epc.Close() errors;
replace the anonymous defer using `_ = epc.Close()` with an explicit error check
so teardown failures are surfaced—e.g., in the defer func() call epc.Close(),
capture its error and call require.NoError(t, err) (or if outside a subtest
context use t.Logf("epc.Close error: %v", err) then fail as appropriate). Update
the defer that wraps epc.Close() to use that explicit check instead of silencing
the return value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 577368ab-5095-4462-b0cf-12e5afe14f82

📥 Commits

Reviewing files that changed from the base of the PR and between 7fff975 and 821d95e.

⛔ Files ignored due to path filters (12)
  • api/go.sum is excluded by !**/*.sum
  • client/pkg/go.sum is excluded by !**/*.sum
  • client/v3/go.sum is excluded by !**/*.sum
  • etcdctl/go.sum is excluded by !**/*.sum
  • etcdutl/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • pkg/go.sum is excluded by !**/*.sum
  • server/go.sum is excluded by !**/*.sum
  • tests/go.sum is excluded by !**/*.sum
  • tools/mod/go.sum is excluded by !**/*.sum
  • tools/rw-heatmaps/go.sum is excluded by !**/*.sum
  • tools/testgrid-analysis/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .ci-operator.yaml
  • .go-version
  • Dockerfile.art-cachi2
  • Dockerfile.installer
  • Dockerfile.installer.art-cachi2
  • Dockerfile.rhel
  • api/go.mod
  • api/version/version.go
  • client/pkg/go.mod
  • client/v3/go.mod
  • etcdctl/go.mod
  • etcdutl/go.mod
  • go.mod
  • pkg/go.mod
  • server/etcdserver/apply/apply_auth.go
  • server/etcdserver/apply/apply_auth_test.go
  • server/etcdserver/server.go
  • server/etcdserver/txn/txn.go
  • server/etcdserver/txn/txn_test.go
  • server/etcdserver/util.go
  • server/etcdserver/v3_server.go
  • server/go.mod
  • tests/e2e/ctl_v3_member_test.go
  • tests/framework/e2e/cluster.go
  • tests/go.mod
  • tests/integration/v3_auth_test.go
  • tools/mod/go.mod
  • tools/rw-heatmaps/go.mod
  • tools/testgrid-analysis/go.mod
💤 Files with no reviewable changes (2)
  • server/etcdserver/txn/txn.go
  • server/etcdserver/txn/txn_test.go

Comment on lines +1049 to +1052
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			err := CheckTxnAuth(as, &auth.AuthInfo{Username: "foo", Revision: 8}, &lease.FakeLessor{}, tt.txnRequest)
			assert.Equal(t, tt.err, err)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the current auth revision in the table test.

Line 1051 hardcodes Revision: 8, so any extra setup write will flip these cases to ErrAuthOldRevision and stop validating the permission logic you actually care about. Use as.Revision() here instead.

Suggested fix
-			err := CheckTxnAuth(as, &auth.AuthInfo{Username: "foo", Revision: 8}, &lease.FakeLessor{}, tt.txnRequest)
+			err := CheckTxnAuth(as, &auth.AuthInfo{Username: "foo", Revision: as.Revision()}, &lease.FakeLessor{}, tt.txnRequest)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/etcdserver/apply/apply_auth_test.go` around lines 1049 - 1052, The
test hardcodes auth.AuthInfo{Username: "foo", Revision: 8} which can mismatch
the authorizer's current revision and produce ErrAuthOldRevision; update the
table test to use the authorizer's current revision by replacing the literal
revision with as.Revision() when calling CheckTxnAuth (i.e., pass
&auth.AuthInfo{Username: "foo", Revision: as.Revision()}), so CheckTxnAuth and
the permission logic are validated against the current auth revision.
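
To see why a hardcoded revision is fragile, here is a small toy model. It is an illustration under assumptions, not etcd's auth store: `toyAuthStore`, `addUser`, and `check` are hypothetical names, and the only behaviour modelled is that a request carrying a revision older than the store's current one is rejected before any permission logic runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errAuthOldRevision stands in for the stale-revision error (toy value).
var errAuthOldRevision = errors.New("auth revision in request is old")

// toyAuthStore is a hypothetical stand-in for an auth store whose revision
// increases on every auth-related write.
type toyAuthStore struct{ revision uint64 }

// addUser models one extra setup write.
func (s *toyAuthStore) addUser() { s.revision++ }

// check rejects requests carrying a revision older than the store's.
func (s *toyAuthStore) check(reqRevision uint64) error {
	if reqRevision < s.revision {
		return errAuthOldRevision
	}
	return nil
}

func main() {
	s := &toyAuthStore{revision: 8}
	fmt.Println(s.check(8))          // matches the store
	s.addUser()                      // one extra setup write
	fmt.Println(s.check(8))          // the hardcoded value is now stale
	fmt.Println(s.check(s.revision)) // reading the live revision still passes
}
```

Reading the revision from the store at call time keeps the test focused on permission results even if the setup helper later gains another write.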

Comment on lines +1431 to +1432
// Treat the new member as unavailable when checking quorum safety.
if !isConnectedToQuorumAfterAddingNewMemberSince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberID(), s.cluster.VotingMembers()) {

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't apply the post-add quorum bump to learner additions.

Line 1432 uses the “after adding a new member” quorum calculation for every add. That is correct for voting members, but not for learners: adding a learner does not increase the voting quorum. In a 3-voter cluster with one voter down, this now rejects an add-as-learner even though the current voting quorum is still intact.

Suggested fix
-	// Treat the new member as unavailable when checking quorum safety.
-	if !isConnectedToQuorumAfterAddingNewMemberSince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberID(), s.cluster.VotingMembers()) {
+	// Treat a new voting member as unavailable when checking quorum safety.
+	since := time.Now().Add(-HealthInterval)
+	connectedToRequiredQuorum := isConnectedToQuorumSince(s.r.transport, since, s.MemberID(), s.cluster.VotingMembers())
+	if !memb.IsLearner {
+		connectedToRequiredQuorum = isConnectedToQuorumAfterAddingNewMemberSince(s.r.transport, since, s.MemberID(), s.cluster.VotingMembers())
+	}
+	if !connectedToRequiredQuorum {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/etcdserver/server.go` around lines 1431 - 1432, The quorum-safety
check currently always treats the new member as a voting member by calling
isConnectedToQuorumAfterAddingNewMemberSince(s.r.transport, ..., s.MemberID(),
s.cluster.VotingMembers()); change this so the "post-add quorum bump" is only
applied when the member being added is a voting member — for learner additions
do NOT call the "AfterAddingNewMember" variant and instead check quorum against
the current voting set (i.e., use the non-post-add quorum check such as
isConnectedToQuorumSince or the equivalent check with s.cluster.VotingMembers()
that does not include the new member). Ensure you detect learner vs voter using
the member type information available in the add path and switch between
isConnectedToQuorumAfterAddingNewMemberSince(...) for voters and the non-bumped
quorum check for learners.
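
The reviewer's three-voter example can be checked with a short calculation. This is a sketch of the arithmetic only: `requiredConnections` is a hypothetical helper, not a function in etcd, and it assumes the post-add check uses a majority of the current voters plus one.

```go
package main

import "fmt"

// quorum returns the smallest majority of n voting members.
func quorum(n int) int { return n/2 + 1 }

// requiredConnections is a hypothetical helper: a learner does not vote, so
// the voting set is unchanged; a voter grows the voting set by one.
func requiredConnections(voters int, addingLearner bool) int {
	if addingLearner {
		return quorum(voters)
	}
	return quorum(voters + 1)
}

func main() {
	const voters, connected = 3, 2 // one of three voters is down
	fmt.Println("learner add allowed:", connected >= requiredConnections(voters, true))
	fmt.Println("voter add allowed:", connected >= requiredConnections(voters, false))
}
```

With two of three voters reachable, the learner case needs 2 connections and passes, while the voter case needs 3 and fails, which is the distinction the comment asks `mayAddMember` to make.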

Comment on lines +849 to +863
+sleepDuration := 500 * time.Millisecond
+maxRetries := int((2 * etcdserver.HealthInterval) / sleepDuration)
 memberRemoved := false
-for i := 0; i < 10; i++ {
+for i := 0; i < maxRetries; i++ {
 	_, err := memberCtl.MemberRemove(ctx, memberID)
 	if err != nil && strings.Contains(err.Error(), "member not found") {
 		memberRemoved = true
 		break
 	}
 
-	time.Sleep(500 * time.Millisecond)
+	time.Sleep(sleepDuration)
 }
 
 if !memberRemoved {
-	return errors.New("failed to remove member after 10 tries")
+	return fmt.Errorf("failed to remove member after %d tries", maxRetries)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mark successful member removal as success immediately.

At Line 853, a successful MemberRemove (err == nil) does not set memberRemoved, so the code retries unnecessarily and can fail later despite already removing the member.

💡 Proposed fix
 sleepDuration := 500 * time.Millisecond
 maxRetries := int((2 * etcdserver.HealthInterval) / sleepDuration)
 memberRemoved := false
 for i := 0; i < maxRetries; i++ {
 	_, err := memberCtl.MemberRemove(ctx, memberID)
-	if err != nil && strings.Contains(err.Error(), "member not found") {
+	if err == nil || (err != nil && strings.Contains(err.Error(), "member not found")) {
 		memberRemoved = true
 		break
 	}
 
 	time.Sleep(sleepDuration)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/framework/e2e/cluster.go` around lines 849 - 863, The loop that calls
memberCtl.MemberRemove is treating only the "member not found" error as success
and ignores the successful case (err == nil); update the loop in the function
containing memberCtl.MemberRemove so that if MemberRemove returns nil you set
memberRemoved = true and break immediately, otherwise keep the existing check
for strings.Contains(err.Error(), "member not found") to mark success; this
change ensures both a successful removal (nil error) and an already-removed
state mark memberRemoved and stop retrying.

Comment on lines +384 to +390
permType := authpb.READWRITE
if len(user.perm) > 0 {
val, ok := authpb.Permission_Type_value[strings.ToUpper(user.perm)]
if ok {
permType = authpb.Permission_Type(val)
}
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast on invalid permission strings in test setup.

Line 386 currently falls back to READWRITE when user.perm is invalid. That fallback can accidentally over-grant permissions and hide mistakes in the auth tests; make invalid values fail the test instead.

Proposed fix
 		permType := authpb.READWRITE
 		if len(user.perm) > 0 {
 			val, ok := authpb.Permission_Type_value[strings.ToUpper(user.perm)]
-			if ok {
-				permType = authpb.Permission_Type(val)
-			}
+			require.Truef(t, ok, "invalid permission type %q for user %s", user.perm, user.name)
+			permType = authpb.Permission_Type(val)
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-permType := authpb.READWRITE
-if len(user.perm) > 0 {
-	val, ok := authpb.Permission_Type_value[strings.ToUpper(user.perm)]
-	if ok {
-		permType = authpb.Permission_Type(val)
-	}
-}
+permType := authpb.READWRITE
+if len(user.perm) > 0 {
+	val, ok := authpb.Permission_Type_value[strings.ToUpper(user.perm)]
+	require.Truef(t, ok, "invalid permission type %q for user %s", user.perm, user.name)
+	permType = authpb.Permission_Type(val)
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/v3_auth_test.go` around lines 384 - 390, The test currently
silently falls back to READWRITE when user.perm is non-empty but not a valid key
in authpb.Permission_Type_value; change the logic in the test setup so that when
len(user.perm) > 0 and the lookup into authpb.Permission_Type_value yields ok ==
false, the test fails immediately (e.g., call t.Fatalf or require.FailNow) with
a clear message referencing the invalid user.perm, instead of assigning permType
:= authpb.READWRITE; otherwise keep the default behavior when user.perm is
empty.
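
The fail-fast lookup described above can be tried outside the test suite. This is a minimal sketch under stated assumptions: `permissionType`, `permissionTypeValue`, and `parsePerm` are hypothetical local stand-ins that mirror the shape of `authpb.Permission_Type` and `authpb.Permission_Type_value`, and the function returns an error where the test would call `require.Truef`.

```go
package main

import (
	"fmt"
	"strings"
)

// permissionType is a local stand-in for authpb.Permission_Type.
type permissionType int32

const (
	read permissionType = iota
	write
	readWrite
)

// permissionTypeValue is a local stand-in for the generated
// authpb.Permission_Type_value map (name to enum value).
var permissionTypeValue = map[string]int32{
	"READ":      0,
	"WRITE":     1,
	"READWRITE": 2,
}

// parsePerm keeps the READWRITE default for an empty string, but returns an
// error for a non-empty string that is not a known permission name instead
// of silently falling back to the default.
func parsePerm(perm string) (permissionType, error) {
	if perm == "" {
		return readWrite, nil
	}
	val, ok := permissionTypeValue[strings.ToUpper(perm)]
	if !ok {
		return 0, fmt.Errorf("invalid permission type %q", perm)
	}
	return permissionType(val), nil
}

func main() {
	for _, p := range []string{"", "read", "readwrite", "rw"} {
		pt, err := parsePerm(p)
		fmt.Printf("%q -> %d, err=%v\n", p, pt, err)
	}
}
```

In the actual test the error branch would be a `require.Truef` or `t.Fatalf` call, so a typo such as `"rw"` stops the test instead of granting READWRITE.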

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dusk125 dusk125 force-pushed the 5.0-rebase-3.6.11 branch from 821d95e to 67297a5 Compare May 7, 2026 15:40
@tjungblu

tjungblu commented May 8, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2026
@openshift-ci

openshift-ci Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tjungblu

tjungblu commented May 8, 2026

/retest

@tjungblu

tjungblu commented May 8, 2026

/verified by @tjungblu

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 8, 2026
@openshift-ci-robot

@tjungblu: This PR has been marked as verified by @tjungblu.

Details

In response to this:

/verified by @tjungblu

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot

/retest-required

Remaining retests: 0 against base HEAD 7fff975 and 2 for PR HEAD 67297a5 in total

@dusk125
Author

dusk125 commented May 8, 2026

/retest-required

@openshift-ci

openshift-ci Bot commented May 8, 2026

@dusk125: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/upstream-e2e | 67297a5 | link | false | /test upstream-e2e |
| ci/prow/upstream-integration | 67297a5 | link | false | /test upstream-integration |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit c543fe1 into openshift:main May 8, 2026
9 of 11 checks passed
@dusk125 dusk125 deleted the 5.0-rebase-3.6.11 branch May 8, 2026 19:46

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

7 participants