Fixing CurrentReplicas and CurrentRevision in completeRollingUpdate #120731

Merged
merged 1 commit into from Oct 20, 2023

Conversation

adilGhaffarDev
Contributor

@adilGhaffarDev adilGhaffarDev commented Sep 18, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

In the completeRollingUpdate function, we should not compare status.UpdatedReplicas and status.ReadyReplicas with status.Replicas, because status.Replicas does not hold the total number of desired replicas; it only counts the replicas that have been created. The comparison should be against the desired total (set.Spec.Replicas): only when set.Spec.Replicas == status.ReadyReplicas and set.Spec.Replicas == status.UpdatedReplicas should status.CurrentReplicas be set to status.UpdatedReplicas. The old check caused pods with an ordinal lower than the rolling update partition to come back with the updated image after being deleted.
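To make the difference between the two comparisons concrete, here is a minimal sketch. The types `statefulSetSpec` and `statefulSetStatus` and the functions `oldRolloutComplete` and `rolloutComplete` are hypothetical stand-ins written for this illustration, not the controller's actual code; only the field names come from the description above.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the real API types.
// Only the fields read by the check are included.
type statefulSetSpec struct {
	Replicas *int32 // desired number of replicas
}

type statefulSetStatus struct {
	Replicas        int32 // pods that have been created
	ReadyReplicas   int32
	UpdatedReplicas int32
}

// oldRolloutComplete mirrors the comparison described as buggy:
// it measures progress against the pods that currently exist.
func oldRolloutComplete(status statefulSetStatus) bool {
	return status.UpdatedReplicas == status.Replicas &&
		status.ReadyReplicas == status.Replicas
}

// rolloutComplete mirrors the proposed comparison:
// it measures progress against the desired replica count.
func rolloutComplete(spec statefulSetSpec, status statefulSetStatus) bool {
	if spec.Replicas == nil {
		return false // assumption: treat an unset count as "not complete"
	}
	desired := *spec.Replicas
	return status.UpdatedReplicas == desired && status.ReadyReplicas == desired
}

func main() {
	desired := int32(3)
	spec := statefulSetSpec{Replicas: &desired}

	// Illustrative state: one pod below the partition was deleted and has
	// not been recreated yet, so only the two updated pods exist.
	status := statefulSetStatus{Replicas: 2, ReadyReplicas: 2, UpdatedReplicas: 2}

	fmt.Println(oldRolloutComplete(status))    // true
	fmt.Println(rolloutComplete(spec, status)) // false
}
```

In this illustrative state the old comparison would declare the rollout finished while a pod is still missing, which is the window in which the current revision could be switched too early.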

Which issue(s) this PR fixes:

Fixes #119685 #119684

Special notes for your reviewer:

This takes the e2e test from PR #119759, where much of the discussion about the fix happened. Linked here for reference.

Does this PR introduce a user-facing change?

Fixed an issue where a pod with an ordinal lower than the rolling update partition came back up with the updated image after being deleted.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Thanks @liangyuanpeng for adding e2e tests for this. The integration test was added by @mimowo, who also helped fix the e2e test.
Co-authored-by: Lan Liang gcslyp@gmail.com. Michał Woźniak @mimowo

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2023
@k8s-ci-robot
Contributor

Hi @adilGhaffarDev. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Sep 18, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 18, 2023
@lentzi90
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2023
@aleksandra-malinowska
Contributor

Thank you Adil for picking this up!

I believe this is also going to fix #120700.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 18, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7e345dbea0711a4b27991d7686e63118c743f564

- status.UpdatedReplicas == status.Replicas &&
- status.ReadyReplicas == status.Replicas {
+ status.UpdatedReplicas == *set.Spec.Replicas &&
+ status.ReadyReplicas == *set.Spec.Replicas {
Contributor

Nit: I'm wondering if we might also need a check for status.Replicas == *set.Spec.Replicas and/or status.CurrentReplicas == 0?

I don't think we can get here now if we have condemned pods that haven't been successfully removed, but given the complexity of this controller's logic and the strong guarantee we need from this check (to avoid unexpected rollout progress), I think it might make it more resilient to check all conditions we care about.

Contributor Author

I agree, I will add this check too

Contributor Author

Added that check, please take a look.
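The stricter guard discussed in this review thread could look roughly like the sketch below. The type `setStatus` and the function `promoteIfComplete` are hypothetical names used only for illustration, and the revision strings are made up; the sketch assumes that promoting the revision means copying UpdatedReplicas into CurrentReplicas and UpdateRevision into CurrentRevision, as the PR title and description suggest.

```go
package main

import "fmt"

// setStatus is a hypothetical stand-in carrying only the status fields
// that the completion check reads or writes.
type setStatus struct {
	Replicas, ReadyReplicas, UpdatedReplicas, CurrentReplicas int32
	CurrentRevision, UpdateRevision                          string
}

// promoteIfComplete promotes the update revision to current only when
// every counter agrees with the desired replica count. The extra
// st.Replicas check is the additional condition proposed in review.
func promoteIfComplete(desired int32, st *setStatus) bool {
	if st.UpdatedReplicas != desired ||
		st.ReadyReplicas != desired ||
		st.Replicas != desired {
		return false
	}
	st.CurrentReplicas = st.UpdatedReplicas
	st.CurrentRevision = st.UpdateRevision
	return true
}

func main() {
	// Hypothetical state: an extra pod still exists beyond the desired 3.
	st := setStatus{
		Replicas: 4, ReadyReplicas: 3, UpdatedReplicas: 3,
		CurrentRevision: "rev-old", UpdateRevision: "rev-new",
	}
	fmt.Println(promoteIfComplete(3, &st), st.CurrentRevision) // false rev-old

	st.Replicas = 3 // the extra pod is gone
	fmt.Println(promoteIfComplete(3, &st), st.CurrentRevision) // true rev-new
}
```

Checking all three counters is redundant in the states the controller is expected to reach, but it keeps the promotion from firing if a future change makes one of them drift.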

@liangyuanpeng
Contributor

liangyuanpeng commented Sep 18, 2023

Some lint checks on the e2e test are failing; I can help fix them tomorrow.

@adilGhaffarDev adilGhaffarDev changed the title wip: Fixing CurrentReplicas and CurrentRevision in completeRollingUpdate Fixing CurrentReplicas and CurrentRevision in completeRollingUpdate Oct 19, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2023
Contributor

@mimowo mimowo left a comment

/lgtm
/assign @soltysh

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 19, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e7527d29c11bc9adaaba378a22cec6af16a2752a

@Vyom-Yadav
Member

Excellent, I will add it to the 1.29 milestone.

/milestone v1.29

@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 19, 2023
Contributor

@soltysh soltysh left a comment

There's still an issue with mismatching statuses, as I explained in Slack, but I'll also post it here for visibility:

While testing this PR I found that the STS status also isn't fully correct. I started logging the full status right after each wait command, and I'm seeing this:

// CurrentRevision is the only revision, and the CurrentReplicas matches Replicas, ie. 3
STEP: 1: Status v1.StatefulSetStatus{ObservedGeneration:1, Replicas:3, ReadyReplicas:3, CurrentReplicas:3, UpdatedReplicas:3, CurrentRevision:"ss2-7b6c9599d5", UpdateRevision:"ss2-7b6c9599d5", AvailableReplicas:3}
// after first update, with Partition=1, we need to update only 2 pods (1 and 2), but CurrentReplicas (ie. old) should be 0, UpdatedReplicas (ie. new) should be 2
STEP: 2: Status v1.StatefulSetStatus{ObservedGeneration:2, Replicas:3, ReadyReplicas:3, CurrentReplicas:2, UpdatedReplicas:0, CurrentRevision:"ss2-7b6c9599d5", UpdateRevision:"ss2-5459d8585b", AvailableReplicas:3}
// after the pod removal, we need to recreate the pod for CurrentRevision, so CurrentReplicas=1, and UpdatedReplicas=2
STEP: 3: Status v1.StatefulSetStatus{ObservedGeneration:2, Replicas:3, ReadyReplicas:3, CurrentReplicas:1, UpdatedReplicas:2, CurrentRevision:"ss2-7b6c9599d5", UpdateRevision:"ss2-5459d8585b", AvailableReplicas:3}

If one checks the descriptions in our API, we are clearly doing a bad job of reflecting the actual changes to Current(Replicas|Revision) and Updated(Replicas|Revisions): they don't reflect the actual state of the world, since the division between Current and Updated should be strictly tied to Partitions (as described in the original proposal).

Still, this PR improves the current situation and fixes parts of the problem we're seeing, and it adds sufficient tests to ensure that .spec.partition is working as expected.
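As a rough illustration of the partition-based division mentioned above, the sketch below computes how many pods would be expected on each revision. The helper `expectedSplit` is hypothetical and not part of the controller; it assumes ordinals start at 0 and that pods with an ordinal greater than or equal to the partition are updated while the rest stay on the current revision.

```go
package main

import "fmt"

// expectedSplit returns the number of pods expected on the current (old)
// revision and on the update (new) revision for a partitioned rolling
// update. Assumption: ordinals run from 0 to replicas-1, and only pods
// with ordinal >= partition are moved to the update revision.
func expectedSplit(replicas, partition int32) (current, updated int32) {
	if partition < 0 {
		partition = 0
	}
	if partition > replicas {
		partition = replicas // a partition past the end updates nothing
	}
	return partition, replicas - partition
}

func main() {
	// The scenario from the logged statuses: 3 replicas, Partition=1.
	current, updated := expectedSplit(3, 1)
	fmt.Printf("current=%d updated=%d\n", current, updated) // current=1 updated=2
}
```

For 3 replicas and Partition=1 this gives one pod on the current revision and two on the update revision, which matches the values the STEP 3 comment says are needed.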

/hold cancel
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2023
@soltysh
Contributor

soltysh commented Oct 20, 2023

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 20, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adilGhaffarDev, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2023
@soltysh
Contributor

soltysh commented Oct 20, 2023

Also, based on the discussion @mimowo, @adilGhaffarDev, and I had on Slack, we're going to backport this only to 1.28 and 1.27 due to limitations. More details can be found in this Slack thread. That said, I'd like to see us fix the status issue I described above before attempting the picks, since there's still time before the next patch releases.

@k8s-ci-robot k8s-ci-robot merged commit 568aee1 into kubernetes:master Oct 20, 2023
15 checks passed
k8s-ci-robot added a commit that referenced this pull request Oct 23, 2023
…31-upstream-release-1.28

Automated cherry pick of #120731: Fixing CurrentReplicas and CurrentRevision in
k8s-ci-robot added a commit that referenced this pull request Oct 23, 2023
…31-upstream-release-1.27

Automated cherry pick of #120731: Fixing CurrentReplicas and CurrentRevision in
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
10 participants