Replicaset and Deployment are out-of-sync #117008

Open
akunszt opened this issue Mar 30, 2023 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@akunszt

akunszt commented Mar 30, 2023

What happened?

This is a very elusive bug that we started to see after we upgraded to 1.26. Everything works great, but after a few days the kube-controller-manager stops syncing/reconciling(?) the current replicas between the Deployment and the ReplicaSet. When this happens it affects all the Deployments in the cluster, and our only workaround is to kill the active kube-controller-manager process on the master node. The new active kube-controller-manager that takes over syncs the values and everything is fine again, for a few days.
It affects all of our clusters running Kubernetes 1.26, but we don't know how to reproduce it or how long we need to wait for it to happen.
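
For reference, the restart workaround amounts to roughly the following; the exact commands depend on how the control plane is deployed, but with containerd (see the crictl output further down) something like this works:

# On the node running the active kube-controller-manager: find its container and
# stop it, so that another instance can take over leadership.
$ crictl stop $(crictl ps --name kube-controller-manager -q)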

What did you expect to happen?

The Replicaset and Deployment replica values are in-sync and we can scale our Deployments safely.

How can we reproduce it (as minimally and precisely as possible)?

It looks like a race condition issue which we cannot reproduce.

Anything else we need to know?

We are using HPA with external metrics; this might be totally unrelated, though.
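
For completeness, an HPA driven by an external metric like ours looks roughly like the manifest below. The metric name and selector are taken from the controller logs that follow; the minReplicas/maxReplicas and the target are only illustrative, not our real values.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-access-log-worker
  namespace: nisp-gateway-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway-access-log-worker
  minReplicas: 1    # illustrative
  maxReplicas: 10   # illustrative
  metrics:
  - type: External
    external:
      metric:
        name: kafka_topic_throughput_desired_and_actual_consumer_count_ratio
        selector:
          matchLabels:
            group: prod.gateway-access-log-worker.v1
            topic: log.nisp-gateway-prod.hiya-edge-gateway
      target:
        type: Value   # illustrative; AverageValue is also common here
        value: "1"    # illustrative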

The kube-controller-manager logs related to a Deployment which failed to scale:

I0330 07:39:23.843138       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set gateway-access-log-worker-7665f9978c to 3 from 1"
I0330 07:39:25.180900       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker-7665f9978c" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: gateway-access-log-worker-7665f9978c-7lkjj"
I0330 07:39:25.464273       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker-7665f9978c" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: gateway-access-log-worker-7665f9978c-djg7x"
I0330 07:40:08.309182       1 horizontal.go:801] Successful rescale of gateway-access-log-worker, old size: 1, new size: 2, reason: external metric kafka_topic_throughput_desired_and_actual_consumer_count_ratio(&LabelSelector{MatchLabels:map[string]string{group: prod.gateway-access-log-worker.v1,topic: log.nisp-gateway-prod.hiya-edge-gateway,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
I0330 07:40:08.309287       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Normal" reason="SuccessfulRescale" message="New size: 2; reason: external metric kafka_topic_throughput_desired_and_actual_consumer_count_ratio(&LabelSelector{MatchLabels:map[string]string{group: prod.gateway-access-log-worker.v1,topic: log.nisp-gateway-prod.hiya-edge-gateway,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
E0330 07:41:34.734786       1 deployment_controller.go:496] Operation cannot be fulfilled on deployments.apps "gateway-access-log-worker": the object has been modified; please apply your changes to the latest version and try again
I0330 07:56:58.114749       1 horizontal.go:801] Successful rescale of gateway-access-log-worker, old size: 3, new size: 2, reason: All metrics below target
I0330 07:56:58.114904       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Normal" reason="SuccessfulRescale" message="New size: 2; reason: All metrics below target"
I0330 08:00:59.619488       1 horizontal.go:452] Horizontal Pod Autoscaler gateway-access-log-worker has been deleted in nisp-gateway-prod

As a test we deleted the HPA, but it didn't resolve the issue; we still had to terminate the kube-controller-manager process. That said, this can be explained by the kube-controller-manager already being in a bad state by that point.

This is what the Deployment status looked like:

status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-04-25T21:46:13Z"
    lastUpdateTime: "2022-07-27T18:39:51Z"
    message: ReplicaSet "gateway-access-log-worker-7665f9978c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-03-29T22:57:08Z"
    lastUpdateTime: "2023-03-29T22:57:08Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 638
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Meanwhile, the ReplicaSets showed a different number of replicas:

gateway-access-log-worker-5987d87b4f     0         0         0       301d
gateway-access-log-worker-68dc74bfc8     0         0         0       335d
gateway-access-log-worker-69b7c79699     0         0         0       337d
gateway-access-log-worker-6fdd54b878     0         0         0       338d
gateway-access-log-worker-7665f9978c     3         3         3       245d

And there were 3 running pods too.

gateway-access-log-worker-7665f9978c-7lkjj    1/1     Running   0          16m
gateway-access-log-worker-7665f9978c-djg7x    1/1     Running   0          16m
gateway-access-log-worker-7665f9978c-lzxnr    1/1     Running   0          23d
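
For reference, the mismatch is easy to see by comparing the replica counts on the Deployment against the live ReplicaSet; jsonpath is just one way to pull the fields out:

# What the Deployment object reports (desired vs. status replicas).
$ kubectl -n nisp-gateway-prod get deployment gateway-access-log-worker \
    -o jsonpath='{.spec.replicas} {.status.replicas}{"\n"}'
# What the current ReplicaSet reports.
$ kubectl -n nisp-gateway-prod get rs gateway-access-log-worker-7665f9978c \
    -o jsonpath='{.spec.replicas} {.status.replicas}{"\n"}'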

As a test I scaled another Deployment in a different namespace, which also failed, and I didn't find any logs related to the scale operation I initiated in the kube-controller-manager logs, which is strange.

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"archive", BuildDate:"2023-03-20T09:46:21Z", GoVersion:"go1.20.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:22Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

We are running in AWS using the aws-cloud-controller-manager.

OS version

# On Linux:
$ cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3374.2.4
VERSION_ID=3374.2.4
BUILD_ID=2023-02-15-1824
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3374.2.4:*:*:*:*:*:*:*"
$ uname -a
Linux ip-10-22-8-235 5.15.89-flatcar #1 SMP Wed Feb 15 18:00:42 -00 2023 x86_64 AMD EPYC 7R32 AuthenticAMD GNU/Linux

Install tools

N/A

Container runtime (CRI) and version (if applicable)

# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  1.6.8
RuntimeApiVersion:  v1

Related plugins (CNI, CSI, ...) and versions (if applicable)

I don't think these are relevant:

  • aws-ebs-csi 1.13.0
  • aws-efs-csi 1.4.3
  • aws-vpc-cni 1.12.0
@akunszt akunszt added the kind/bug Categorizes issue or PR as related to a bug. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 30, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@HirazawaUi
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 30, 2023
@jutley

jutley commented Apr 20, 2023

I work with @akunszt, and we have been continuing to try to figure this issue out. We're still unsure of exactly what is causing this behavior, but we have some new observations:

  • The Deployment controller is still making updates. When we reach this state, if we manually update the controlled ReplicaSets, the Deployment controller reverts these manual changes (see the sketch after this list). This indicates that the issue is specifically that the Deployment controller is operating with stale data.

  • This issue doesn't seem to affect all controllers; it is specific to Deployments. When the issue is active, we can see that the ReplicaSet, Service, CronJob, and Job controllers continue to operate as expected. Furthermore, we have alerts based on the time since CronJob completion which would fire if CronJobs were affected by this behavior, and those alerts have not fired recently. I'm not sure this is actually relevant: with the Deployment we see the controller working but updates to the Deployment not taking effect, whereas these CronJobs are not being updated, so it isn't a fair comparison. However, the fact that manual updates to ReplicaSets are being reverted indicates that some watches are staying up-to-date.

  • Manually checking the Deployments shows correct information, so it doesn't seem like kube-apiserver is returning stale data. Rather, it seems the Deployment watch is going stale and not getting refreshed.

  • This one is spicy. Terminating the kube-controller-manager container does not resolve the issue. It may temporarily resolve the issue if another instance takes over as leader, but if leadership returns to the original instance, the problem manifests again. This seems to suggest that the issue is not tied to the kube-controller-manager, but rather to the kube-apiserver (or a lower layer).
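
The ReplicaSet-reversion test mentioned in the first bullet looked roughly like this (the replica count is arbitrary):

# Scale the ReplicaSet away from what the Deployment controller wants...
$ kubectl -n nisp-gateway-prod scale rs gateway-access-log-worker-7665f9978c --replicas=5
# ...and watch the controller scale it straight back, which shows it is still
# reconciling, just against a stale copy of the Deployment.
$ kubectl -n nisp-gateway-prod get rs gateway-access-log-worker-7665f9978c -w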

@jutley

jutley commented Apr 24, 2023

With more investigation we have learned that this behavior is specifically tied to watch requests. With a cluster in the degraded state, get requests for a test resource return up-to-date information, but watch requests for the same resource return stale data.

I believe that somehow kube-apiserver ended up in a state where it stopped updating its watch cache. We don't have any custom configuration of the watch cache. I'm surprised that others are not running into this issue, so I suspect there must be something about our environment that is triggering some rare edge case.
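
One relatively simple way to compare the two paths (the resource and label names here are just examples): requests without a resourceVersion are answered with a quorum read from etcd, while requests with resourceVersion=0 are normally answered from the apiserver's watch cache, the same cache that feeds watches.

# Change something on the object so there is fresh data to look for.
$ kubectl -n nisp-gateway-prod label deployment gateway-access-log-worker stale-test=1 --overwrite
# No resourceVersion: read from etcd.
$ kubectl get --raw '/apis/apps/v1/namespaces/nisp-gateway-prod/deployments/gateway-access-log-worker' | grep stale-test
# resourceVersion=0: normally served from the watch cache.
$ kubectl get --raw '/apis/apps/v1/namespaces/nisp-gateway-prod/deployments/gateway-access-log-worker?resourceVersion=0' | grep stale-test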

@jutley

jutley commented May 11, 2023

We've now seen the behavior with Secrets as well. Further investigation seems to point to our load balancer between kube-apiserver and etcd making a big difference.

We use an AWS ALB in between these components. This stackoverflow thread has reports that ALBs don't forward HTTP/2 pings, and we suspect that this is the mechanism responsible for keeping the watches to etcd alive when they are otherwise idle.
https://stackoverflow.com/questions/66818645/http2-ping-frames-over-aws-alb-grpc-keepalive-ping

When we do an etcd watch from our dev machines via a direct connection to an etcd instance, via an ALB, and via an NLB, only the ALB drops the connection after 10 minutes or so. With the direct connection and with the NLB, the watch stays alive for hours despite there being no data to send through the connection.
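
For anyone who wants to repeat the test, an idle etcdctl watch against each endpoint is enough to see the difference; the endpoint and certificate paths below are placeholders:

# Watch a key that will stay quiet and leave the connection idle.
$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://<etcd-member-or-load-balancer>:2379 \
    --cacert=ca.crt --cert=client.crt --key=client.key \
    watch /registry/deployments/nisp-gateway-prod/gateway-access-log-worker
# Repeat with the endpoint pointed directly at an etcd member, at the ALB, and at
# the NLB; only the ALB-fronted connection drops after roughly 10 minutes.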

Normally this doesn't seem to be a problem. It seems like when the connection closes, kube-apiserver will recognize this and recreate the connection. However, in some rare scenarios this doesn't happen. This could be a networking issue (the client doesn't see the connection being closed) or some kind of bug in kube-apiserver where it doesn't recreate the connection.

I'm not sure which is the real scenario. However, I suspect it is a bug in kube-apiserver because we started observing this behavior when we upgraded to Kubernetes v1.26. We didn't upgrade etcd or change anything about the networking in between kube-apiserver and etcd.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 20, 2024
@HirazawaUi
Contributor

/reopen

@k8s-ci-robot
Contributor

@HirazawaUi: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Mar 20, 2024
@HirazawaUi
Contributor

HirazawaUi commented Mar 20, 2024

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 20, 2024
@HirazawaUi
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 20, 2024
@leilajal
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 26, 2024