
In kubernetes V1.27.1, Image not rolling back to older version for pod with ordinal number 0, in case of upgrade failure. #119684

Closed
ankushhifi007 opened this issue Jul 31, 2023 · 29 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

Comments

@ankushhifi007

I am using a StatefulSet for my application with 2 replicas, and I am updating the pods with a partitioned rolling update via Helm, using the following configuration:

```yaml
updateStrategy:
  rollingUpdate:
    partition: 1
  type: RollingUpdate
```

The upgrade proceeds from pod 1 down to pod 0.
In one scenario the pod-0 upgrade failed, and I tried to roll back using helm rollback one revision at a time, but the image did not update on any pod.
```
REVISION  UPDATED                   STATUS      CHART      APP VERSION  DESCRIPTION
1         Fri Jul 28 12:25:31 2023  superseded  1.7.2+65   app 5.0      Install complete
2         Fri Jul 28 13:14:43 2023  superseded  1.8.0-203  app 5.0      Upgrade complete
3         Fri Jul 28 13:36:35 2023  superseded  1.8.0-203  app 5.0      Upgrade complete
4         Fri Jul 28 14:12:20 2023  superseded  1.7.2+65   app 5.0      Rollback to 2
```

The same procedure works up to v1.26.1.

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 31, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ankushhifi007
Author

/SIG Apps

@ankushhifi007 ankushhifi007 changed the title In kubernetes V1.27.1, Image not rolling back to older version for pod with ordinal number1 In kubernetes V1.27.1, Image not rolling back to older version for pod with ordinal number 0, in case of upgrade failure. Jul 31, 2023
@ankushhifi007
Author

kind/bug

@liangyuanpeng
Contributor

Some steps to reproduce it would be great.

@ankushhifi007
Author

sts.zip (sample YAML attached)

1 - Deploy the STS using the attached YAML, with the replica count set to 2 and updateStrategy.rollingUpdate.partition set to 1.

2 - Edit the STS and update the image to 1.15. At this stage, pod-1 is updated to image 1.15.

3 - Now delete pod-0 and check its image tag: pod-0 comes back up with the new image, which is not the expected behavior (a minimal manifest approximating this setup is sketched below).
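For readers without the attachment, a minimal StatefulSet approximating the setup above; the name, labels, service name, and base image tag are assumptions, not taken from the attached sts.zip:

```yaml
# Approximate repro manifest (not the attached sts.zip); only the replica count,
# the partition, and the use of an nginx image follow the steps above.
# Assumes a headless Service named "web" already exists.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 2
  selector:
    matchLabels:
      app: web
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.14   # step 2 changes this to nginx:1.15
```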

@aojea
Member

aojea commented Aug 1, 2023

/sig apps
/kind bug

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 1, 2023
@ankushhifi007
Author

Any workaround for this issue?

@aojea
Member

aojea commented Aug 1, 2023

> 3 - Now delete pod-0 and check its image tag: pod-0 comes back up with the new image, which is not the expected behavior.

What do you mean by "pod-0 came up with the new image"? Pod-0 has to have image 1.15; that is the one you have updated to.

@ankushhifi007
Author

Yes, it has the new image 1.15.
But I only updated the image for pod-1, using partition 1 in the StatefulSet.

@liangyuanpeng
Contributor

liangyuanpeng commented Aug 2, 2023

I'm interested in this, let me check it out.

/assign

@ankushhifi007
Author

Hi liangyuanpeng,
Any update? Do you have any workaround for this issue?

@aojea
Member

aojea commented Aug 3, 2023

Oh, I missed that part. It does indeed sound like a bug, and we should have an e2e test verifying that behavior; it seems a simple e2e test to add. @liangyuanpeng, please add an e2e test reproducing the issue if you are going to work on this.

/cc @soltysh

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates

> Partitioned rolling updates
>
> The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be recreated at the previous version.
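Applied to the setup in this issue (a sketch, assuming replicas: 2 and partition: 1 as in the report above), the documented behavior works out as follows:

```yaml
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 1   # ordinals >= 1 follow the updated .spec.template
# Expected outcome after updating .spec.template:
#   pod-1 (ordinal 1 >= partition 1): recreated at the new revision
#   pod-0 (ordinal 0 <  partition 1): kept at the old revision, and if deleted,
#                                     recreated at the old revision
```

The report above is that in v1.27.1 a deleted pod-0 instead comes back with the new revision's image.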

@ankushhifi007
Author

Hi @liangyuanpeng,
I need one piece of information about your code changes, with respect to the actual problem in my application upgrade.
I have attached an nginx Helm chart modeled on my application, along with the steps to reproduce the issue.
Can you please check and share why it works up to v1.26.1, and whether it will be solved by your code change?

nginx.tar.gz
fallback issue step.txt

@liangyuanpeng
Contributor

@ankushhifi007
In my test, this problem exists in 1.27.x. I packaged a patched version; it may be worth a try:

ghcr.io/liangyuanpeng/kube-controller-manager-amd64:v1.27-patch

I will try to test again with your files.

@ankushhifi007
Author

@liangyuanpeng
Any further findings with my steps? And is there any possible workaround for me?

@ankushhifi007
Author

@liangyuanpeng
Any further findings with my steps?

@vkatabat

@aleksandra-malinowska Could #119096 be a potential cause for this issue, which is seen only in 1.27.1 but not in 1.26?

@ankushhifi007
Author

ankushhifi007 commented Aug 18, 2023

@liangyuanpeng,
I have tested your patch. The pod image change issue is fixed during pod restart.

But now the upgrade procedure is breaking.
I tested my scenario, where the StatefulSet has 2 replicas.
I started the upgrade from the pod with ordinal number 1; that works fine.
But the image is not updated when upgrading the pod with ordinal number zero.

The same upgrade procedure works with v1.27.0, but after applying the patch it does not.

@ankushhifi007
Author

@liangyuanpeng,
One additional point: this behaviour occurs only with helm upgrade. When upgrading by editing the StatefulSet directly, it works fine.
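For comparison, a minimal sketch of the direct StatefulSet edit that reportedly works; the container name and image tag are illustrative, the relevant fields being the partition and the template image:

```yaml
# Final rollout step when editing the StatefulSet directly (illustrative values):
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0           # allow pod-0 to be updated as well
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:1.15  # new image; pod-0 should now be recreated with it
```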

@lowang-bh
Member

Maybe it is by design: other replicas keep the old version until the first upgraded one finishes upgrading.

@ankushhifi007
Author

> Maybe it is by design: other replicas keep the old version until the first upgraded one finishes upgrading.

Yes, you are right that the other replicas keep the older version until the first one is upgraded. But when I upgrade the other pod after the first pod has finished, it should come up with the new image; with @liangyuanpeng's patch, however, a helm upgrade with partition 0 does not update pod-0 to the new image.

@lowang-bh
Member

> with partition 0 is not updating new image in pod-0

I think you should check the actual partition value in the YAML. Kubernetes updates pods with ordinals from replicas-1 down to the partition when a partition is set (so with replicas=2 and partition=1, only pod-1 is processed; pod-0 is only updated once the partition is lowered to 0).

```go
// we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
updateMin := 0
if set.Spec.UpdateStrategy.RollingUpdate != nil {
    updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)
}
// we terminate the Pod with the largest ordinal that does not match the update revision.
for target := len(replicas) - 1; target >= updateMin; target-- {
    // delete the Pod if it is not already terminating and does not match the update revision.
    if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
        logger.V(2).Info("Pod of StatefulSet is terminating for update",
            "statefulSet", klog.KObj(set), "pod", klog.KObj(replicas[target]))
        if err := ssc.podControl.DeleteStatefulPod(set, replicas[target]); err != nil {
            if !errors.IsNotFound(err) {
                return &status, err
            }
        }
        status.CurrentReplicas--
        return &status, err
    }
    // wait for unhealthy Pods on update
    if !isHealthy(replicas[target]) {
        logger.V(4).Info("StatefulSet is waiting for Pod to update",
            "statefulSet", klog.KObj(set), "pod", klog.KObj(replicas[target]))
        return &status, nil
    }
}
return &status, nil
```

@ankushhifi007
Author

> with partition 0 is not updating new image in pod-0
>
> I think you should check the actual partition value in the YAML. Kubernetes updates pods with ordinals from replicas-1 down to the partition when a partition is set.
>
> [controller snippet quoted above]

I am checking and setting partition values as per my upgrade requirements, but my point is that with @liangyuanpeng's patch
(ghcr.io/liangyuanpeng/kube-controller-manager-amd64:v1.27-patch)

a Helm upgrade is no longer working as Helm is designed to; that is the issue, apart from the originally reported one.

@ankushhifi007
Author

@liangyuanpeng
Did you test the scenario with my Helm chart?

@aleksandra-malinowska
Contributor

> @aleksandra-malinowska Could #119096 be a potential cause for this issue, which is seen only in 1.27.1 but not in 1.26?

#119096 was cherry-picked to 1.27.4, it's not in 1.27.1

@liangyuanpeng
Contributor

liangyuanpeng commented Nov 6, 2023

@ankushhifi007 I believe that it's fixed by #120731

/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2024
@adilGhaffarDev
Contributor

closing this because it is fixed in #120731 and backported to 1.27 and 1.28
/close

@k8s-ci-robot
Contributor

@adilGhaffarDev: Closing this issue.

In response to this:

closing this because it is fixed in #120731 and backported to 1.27 and 1.28
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
