Replicaset and Deployment are out-of-sync #117008

Open
akunszt opened this issue Mar 30, 2023 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@akunszt

akunszt commented Mar 30, 2023

What happened?

This is a very elusive bug that we started to see after we upgraded to 1.26. Everything works great, but after a few days the kube-controller-manager stops syncing/reconciling(?) the current replicas between the Deployment and the ReplicaSet. When this happens it affects all the Deployments in the cluster, and our only workaround is to kill the active kube-controller-manager process on the master node. The new active kube-controller-manager that takes over syncs the values and everything is fine again, for a few days.
It affects all of our clusters running Kubernetes 1.26, but we don't know how to reproduce it or how long we need to wait for it to happen.
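
For reference, the restart workaround amounts to roughly the following; the exact commands depend on how the control plane is deployed, but with containerd (see the crictl output further down) something like this works:

# On the node running the active kube-controller-manager: find its container and
# stop it, so that another instance can take over leadership.
$ crictl stop $(crictl ps --name kube-controller-manager -q)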

What did you expect to happen?

The Replicaset and Deployment replica values are in-sync and we can scale our Deployments safely.

How can we reproduce it (as minimally and precisely as possible)?

It looks like a race condition issue which we cannot reproduce.

Anything else we need to know?

We are using HPA with external metrics; this might be totally unrelated, though.
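
For completeness, an HPA driven by an external metric like ours looks roughly like the manifest below. The metric name and selector are taken from the controller logs that follow; the minReplicas/maxReplicas and the target are only illustrative, not our real values.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-access-log-worker
  namespace: nisp-gateway-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway-access-log-worker
  minReplicas: 1    # illustrative
  maxReplicas: 10   # illustrative
  metrics:
  - type: External
    external:
      metric:
        name: kafka_topic_throughput_desired_and_actual_consumer_count_ratio
        selector:
          matchLabels:
            group: prod.gateway-access-log-worker.v1
            topic: log.nisp-gateway-prod.hiya-edge-gateway
      target:
        type: Value   # illustrative; AverageValue is also common here
        value: "1"    # illustrative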

The kube-controller-manager logs related to a Deployment which failed to scale:

I0330 07:39:23.843138       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set gateway-access-log-worker-7665f9978c to 3 from 1"
I0330 07:39:25.180900       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker-7665f9978c" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: gateway-access-log-worker-7665f9978c-7lkjj"
I0330 07:39:25.464273       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker-7665f9978c" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: gateway-access-log-worker-7665f9978c-djg7x"
I0330 07:40:08.309182       1 horizontal.go:801] Successful rescale of gateway-access-log-worker, old size: 1, new size: 2, reason: external metric kafka_topic_throughput_desired_and_actual_consumer_count_ratio(&LabelSelector{MatchLabels:map[string]string{group: prod.gateway-access-log-worker.v1,topic: log.nisp-gateway-prod.hiya-edge-gateway,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
I0330 07:40:08.309287       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Normal" reason="SuccessfulRescale" message="New size: 2; reason: external metric kafka_topic_throughput_desired_and_actual_consumer_count_ratio(&LabelSelector{MatchLabels:map[string]string{group: prod.gateway-access-log-worker.v1,topic: log.nisp-gateway-prod.hiya-edge-gateway,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
E0330 07:41:34.734786       1 deployment_controller.go:496] Operation cannot be fulfilled on deployments.apps "gateway-access-log-worker": the object has been modified; please apply your changes to the latest version and try again
I0330 07:56:58.114749       1 horizontal.go:801] Successful rescale of gateway-access-log-worker, old size: 3, new size: 2, reason: All metrics below target
I0330 07:56:58.114904       1 event.go:294] "Event occurred" object="nisp-gateway-prod/gateway-access-log-worker" fieldPath="" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Normal" reason="SuccessfulRescale" message="New size: 2; reason: All metrics below target"
I0330 08:00:59.619488       1 horizontal.go:452] Horizontal Pod Autoscaler gateway-access-log-worker has been deleted in nisp-gateway-prod

As a test we deleted the HPA, but it didn't resolve the issue; we still had to terminate the kube-controller-manager process. That said, this can be explained by the kube-controller-manager already being in a bad state by that point.

This is what the Deployment status looked like:

status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-04-25T21:46:13Z"
    lastUpdateTime: "2022-07-27T18:39:51Z"
    message: ReplicaSet "gateway-access-log-worker-7665f9978c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-03-29T22:57:08Z"
    lastUpdateTime: "2023-03-29T22:57:08Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 638
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Meanwhile, the ReplicaSets showed a different number of replicas:

gateway-access-log-worker-5987d87b4f     0         0         0       301d
gateway-access-log-worker-68dc74bfc8     0         0         0       335d
gateway-access-log-worker-69b7c79699     0         0         0       337d
gateway-access-log-worker-6fdd54b878     0         0         0       338d
gateway-access-log-worker-7665f9978c     3         3         3       245d

And there were 3 running pods too.

gateway-access-log-worker-7665f9978c-7lkjj    1/1     Running   0          16m
gateway-access-log-worker-7665f9978c-djg7x    1/1     Running   0          16m
gateway-access-log-worker-7665f9978c-lzxnr    1/1     Running   0          23d
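
For reference, the mismatch is easy to see by comparing the replica counts on the Deployment against the live ReplicaSet; jsonpath is just one way to pull the fields out:

# What the Deployment object reports (desired vs. status replicas).
$ kubectl -n nisp-gateway-prod get deployment gateway-access-log-worker \
    -o jsonpath='{.spec.replicas} {.status.replicas}{"\n"}'
# What the current ReplicaSet reports.
$ kubectl -n nisp-gateway-prod get rs gateway-access-log-worker-7665f9978c \
    -o jsonpath='{.spec.replicas} {.status.replicas}{"\n"}'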

As a test I scaled another Deployment in a different namespace, which also failed, and I didn't find any logs related to the scale operation I initiated in the kube-controller-manager logs, which is strange.

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"archive", BuildDate:"2023-03-20T09:46:21Z", GoVersion:"go1.20.2", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:22Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

We are running in AWS using the aws-cloud-controller-manager.

OS version

# On Linux:
$ cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3374.2.4
VERSION_ID=3374.2.4
BUILD_ID=2023-02-15-1824
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3374.2.4:*:*:*:*:*:*:*"
$ uname -a
Linux ip-10-22-8-235 5.15.89-flatcar #1 SMP Wed Feb 15 18:00:42 -00 2023 x86_64 AMD EPYC 7R32 AuthenticAMD GNU/Linux

Install tools

N/A

Container runtime (CRI) and version (if applicable)

# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  1.6.8
RuntimeApiVersion:  v1

Related plugins (CNI, CSI, ...) and versions (if applicable)

I don't think these are relevant:

  • aws-ebs-csi 1.13.0
  • aws-efs-csi 1.4.3
  • aws-vpc-cni 1.12.0
@akunszt akunszt added the kind/bug Categorizes issue or PR as related to a bug. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 30, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@HirazawaUi
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 30, 2023
@jutley

jutley commented Apr 20, 2023

I work with @akunszt, and we have been continuing to try to figure this issue out. We're still unsure of exactly what is causing this behavior, but we have some new observations:

  • The Deployment controller is still making updates. When we reach this state, if we manually update the controlled ReplicaSets, the Deployment controller reverts these manual changes (see the sketch after this list). This indicates that the issue is specifically that the Deployment controller is operating with stale data.

  • This issue doesn't seem to affect all controllers; it is specific to Deployments. When the issue is active, we can see that the ReplicaSet, Service, CronJob, and Job controllers continue to operate as expected. Furthermore, we have alerts based on the time since CronJob completion which would fire if CronJobs were affected by this behavior, and those alerts have not fired recently. I'm not sure this is actually relevant: with the Deployment we see the controller working but updates to the Deployment not taking effect, whereas these CronJobs are not being updated, so it isn't a fair comparison. However, the fact that manual updates to ReplicaSets are being reverted indicates that some watches are staying up-to-date.

  • Manually checking the Deployments shows correct information, so it doesn't seem like kube-apiserver is returning stale data. Rather, it seems the Deployment watch is going stale and not getting refreshed.

  • This one is spicy. Terminating the kube-controller-manager container does not resolve the issue. It may temporarily resolve the issue if another instance takes over as leader, but if leadership returns to the original instance, the problem manifests again. This seems to suggest that the issue is not tied to the kube-controller-manager, but rather to the kube-apiserver (or a lower layer).
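
The ReplicaSet-reversion test mentioned in the first bullet looked roughly like this (the replica count is arbitrary):

# Scale the ReplicaSet away from what the Deployment controller wants...
$ kubectl -n nisp-gateway-prod scale rs gateway-access-log-worker-7665f9978c --replicas=5
# ...and watch the controller scale it straight back, which shows it is still
# reconciling, just against a stale copy of the Deployment.
$ kubectl -n nisp-gateway-prod get rs gateway-access-log-worker-7665f9978c -w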

@jutley

jutley commented Apr 24, 2023

With more investigation we have learned that this behavior is specifically tied to watch requests. With a cluster in the degraded state, get requests for a test resource return up-to-date information, but watch requests for the same resource return stale data.

I believe that somehow kube-apiserver ended up in a state where it stopped updating its watch cache. We don't have any custom configuration of the watch cache. I'm surprised that others are not running into this issue, so I suspect there must be something about our environment that is triggering some rare edge case.
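
One relatively simple way to compare the two paths (the resource and label names here are just examples): requests without a resourceVersion are answered with a quorum read from etcd, while requests with resourceVersion=0 are normally answered from the apiserver's watch cache, the same cache that feeds watches.

# Change something on the object so there is fresh data to look for.
$ kubectl -n nisp-gateway-prod label deployment gateway-access-log-worker stale-test=1 --overwrite
# No resourceVersion: read from etcd.
$ kubectl get --raw '/apis/apps/v1/namespaces/nisp-gateway-prod/deployments/gateway-access-log-worker' | grep stale-test
# resourceVersion=0: normally served from the watch cache.
$ kubectl get --raw '/apis/apps/v1/namespaces/nisp-gateway-prod/deployments/gateway-access-log-worker?resourceVersion=0' | grep stale-test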

@jutley

jutley commented May 11, 2023

We've now seen the behavior with Secrets as well. Further investigation seems to point to our load balancer between kube-apiserver and etcd making a big difference.

We use an AWS ALB in between these components. This stackoverflow thread has reports that ALBs don't forward HTTP/2 pings, and we suspect that this is the mechanism responsible for keeping the watches to etcd alive when they are otherwise idle.
https://stackoverflow.com/questions/66818645/http2-ping-frames-over-aws-alb-grpc-keepalive-ping

When we do an etcd watch from our dev machines via a direct connection to an etcd instance, via an ALB, and via an NLB, only the ALB drops the connection after 10 minutes or so. With the direct connection and with the NLB, the watch stays alive for hours despite there being no data to send through the connection.
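
For anyone who wants to repeat the test, an idle etcdctl watch against each endpoint is enough to see the difference; the endpoint and certificate paths below are placeholders:

# Watch a key that will stay quiet and leave the connection idle.
$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://<etcd-member-or-load-balancer>:2379 \
    --cacert=ca.crt --cert=client.crt --key=client.key \
    watch /registry/deployments/nisp-gateway-prod/gateway-access-log-worker
# Repeat with the endpoint pointed directly at an etcd member, at the ALB, and at
# the NLB; only the ALB-fronted connection drops after roughly 10 minutes.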

Normally this doesn't seem to be a problem. It seems like when the connection closes, kube-apiserver will recognize this and recreate the connection. However, in some rare scenarios this doesn't happen. This could be a networking issue (the client doesn't see the connection being closed) or some kind of bug in kube-apiserver where it doesn't recreate the connection.

I'm not sure which is the real scenario. However, I suspect it is a bug in kube-apiserver because we started observing this behavior when we upgraded to Kubernetes v1.26. We didn't upgrade etcd or change anything about the networking in between kube-apiserver and etcd.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 20, 2024
@HirazawaUi
Contributor

/reopen

@k8s-ci-robot
Contributor

@HirazawaUi: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Mar 20, 2024
@HirazawaUi
Contributor

HirazawaUi commented Mar 20, 2024

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 20, 2024
@HirazawaUi
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 20, 2024
@leilajal
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Mar 26, 2024