
Waiting for app ReplicaSet be marked available indefinitely #1978

Status: Closed
Labels: kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)

jsravn opened this issue Apr 26, 2022 · 11 comments
jsravn commented Apr 26, 2022

What happened?

A completely healthy deployment gets stuck on "Waiting for app ReplicaSet be marked available" despite all replicas being fully available.

Steps to reproduce

It's not yet clear to me how to reproduce this reliably; it seems to happen sporadically. I also see quite a few existing issues describing the same problem - I guess it's not completely fixed?

Expected Behavior

It should notice the deployment is healthy and proceed.

Actual Behavior

```
Do you want to perform this update? yes
Updating (infra.pre-dev):
     Type                                    Name                 Status       Info
     pulumi:pulumi:Stack                     infra-infra.pre-dev  running      read kubernetes:core/v1:Secret gke-cluster-addons-feed-original-tls
     └─ core_gcp:infra:cluster_addons        gke-cluster-addons
        ├─ core:gcp:lib:traefik              traefik-internal
 ~      │  └─ kubernetes:apps/v1:Deployment  traefik-internal     updating...  [diff: ~spec]; [1/2] Waiting for app ReplicaSet be marked available (3/3 Pods available)
        └─ core:gcp:lib:traefik              traefik-external
 ~         └─ kubernetes:apps/v1:Deployment  traefik-external     updating     [diff: ~spec]; [1/2] Waiting for app ReplicaSet be marked available (1/1 Pods available)
```

ReplicaSet status:

```yaml
status:
  availableReplicas: 3
  fullyLabeledReplicas: 3
  observedGeneration: 11
  readyReplicas: 3
  replicas: 3
```

Deployment status:

```yaml
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2022-04-26T10:41:05Z"
    lastUpdateTime: "2022-04-26T10:41:05Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-04-13T09:10:26Z"
    lastUpdateTime: "2022-04-26T10:41:26Z"
    message: ReplicaSet "traefik-internal-b28bb768-6547995858" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 24
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
```
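By Kubernetes' own criteria, these statuses look healthy. For comparison, here is a quick client-side check of the same signals (a sketch using @kubernetes/client-node; this is not Pulumi's actual await implementation, and the namespace and name are placeholders):

```typescript
import * as k8s from "@kubernetes/client-node";

// Sketch: evaluate the same readiness signals shown in the status dumps above.
// Not Pulumi's await logic - just the standard Deployment health signals.
async function isDeploymentAvailable(ns: string, name: string): Promise<boolean> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const apps = kc.makeApiClient(k8s.AppsV1Api);

  const dep = (await apps.readNamespacedDeployment(name, ns)).body;
  const status = dep.status;
  if (!status || !dep.spec) return false;

  // The controller has observed the latest spec,
  const observed = (status.observedGeneration ?? 0) >= (dep.metadata?.generation ?? 0);
  // all replicas are updated and available,
  const replicasOk =
    status.updatedReplicas === dep.spec.replicas &&
    status.availableReplicas === dep.spec.replicas;
  // and the Available condition is True.
  const available = (status.conditions ?? []).some(
    (c) => c.type === "Available" && c.status === "True",
  );
  return observed && replicasOk && available;
}

// Placeholder namespace/name; substitute your own.
isDeploymentAvailable("my-namespace", "traefik-internal").then(console.log);
```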

Versions used

```
CLI
Version      3.30.0
Go Version   go1.17.9
Go Compiler  gc

Plugins
NAME        VERSION
gcp         6.20.0
gcp         6.20.0
kubernetes  3.18.2
kubernetes  3.18.2
nodejs      unknown
random      4.4.2
random      4.4.2

Host
OS       nixos
Version  21.11 (Porcupine)
Arch     x86_64

This project is written in nodejs (/nix/store/46g0dmf6rcpikbzs22y7w4amyg0ciksi-nodejs-16.14.2/bin/node v16.14.2)
```

Additional context

I tried running `pulumi refresh` beforehand, but it made no difference. The cluster is on Kubernetes 1.22.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

jsravn added the kind/bug label Apr 26, 2022
jsravn commented Apr 26, 2022

Unsure why, but after updating my pulumi-kubernetes provider from 3.18.2 to 3.18.3, it started working.

jsravn closed this as completed Apr 26, 2022
jsravn commented May 6, 2022

This is still happening to us. Not sure what triggers it yet.

jsravn reopened this May 6, 2022
stack72 added and removed the kind/bug label May 9, 2022
stack72 added this to the 0.72 milestone May 9, 2022
jsravn commented May 25, 2022

This keeps happening to us and we haven't quite figured out what triggers it. I think it may be related to performing a refresh before doing `pulumi up`, since it happens pretty reliably when I run `pulumi up --refresh --skip-preview --yes`, whereas it never seems to happen if I skip the refresh.

lukehoban modified the milestones: 0.72, 0.73 Jun 3, 2022
stack72 removed this from the 0.73 milestone Jun 27, 2022
Shrooblord commented

Hey there! Thanks for the insight into pulumi refresh. I routinely do pulumi refresh before a pulumi up and seem to be running into the same problem. However, if I do pulumi refresh --> pulumi up --> error --> pulumi up, I still get the same errors you describe in the first post.

Is there anything you do to 'reset' pulumi so the refresh bug doesn't come up?

thanhtoan1196 commented

Any update?

kralikba commented

I'm having the same problem with Pulumi.Kubernetes 3.23.1 (a C#/.NET 7 project). Any ideas or updates?

jsravn commented Jan 16, 2023

It's been a while since I looked at this, but I believe after a lot of debugging we found that this happens if something in the cluster modifies the pod spec after deployment. That seems to break Pulumi's ability to perform an upgrade when the live spec no longer matches what it last applied. It should be easy to reproduce by manually modifying a pod spec and seeing whether Pulumi can still update it.
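If something like a mutating admission webhook is rewriting the pod spec, one escape hatch in the meantime is the provider's `pulumi.com/skipAwait` annotation, which opts the resource out of readiness waiting entirely (or `pulumi.com/timeoutSeconds`, which caps the wait instead). A sketch with placeholder names and image, not the actual traefik config from this stack:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Workaround sketch: opt this Deployment out of Pulumi's await logic.
// Names and image are placeholders.
const labels = { app: "traefik-internal" };

new k8s.apps.v1.Deployment("traefik-internal", {
  metadata: {
    annotations: {
      "pulumi.com/skipAwait": "true",
      // Alternatively, cap the wait instead of skipping it:
      // "pulumi.com/timeoutSeconds": "120",
    },
  },
  spec: {
    replicas: 3,
    selector: { matchLabels: labels },
    template: {
      metadata: { labels },
      spec: { containers: [{ name: "traefik", image: "traefik:v2.6" }] },
    },
  },
});
```

Note that with `skipAwait`, `pulumi up` reports success without waiting for the rollout to complete.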

lblackstone self-assigned this Jul 14, 2023
lblackstone added the resolution/fixed label Jul 14, 2023
lblackstone commented

We fixed some bugs related to refresh in #2445 so I'm hoping this is fixed in v4. I'm going to close as resolved, but please let us know if you're still seeing the error after upgrading.

RouxAntoine commented

Hello,
I'm using github.com/pulumi/pulumi-kubernetes/sdk/v4 v4.5.3 and I'm seeing the same behavior.

[screenshot omitted]

The ReplicaSet status looks good:

```yaml
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
```

There are no events on the ReplicaSet:

```
Events:                            <none>
```

The Deployment reports these conditions:

```
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  ampere-1-deployment-66221e9a-5d8ff544c6 (0/0 replicas created), ampere-1-deployment-66221e9a-d447578c7 (0/0 replicas created), ampere-1-deployment-66221e9a-6f5f89f6f5 (0/0 replicas created), ampere-1-deployment-66221e9a-659c47cc96 (0/0 replicas created), ampere-1-deployment-66221e9a-7996b77fd9 (0/0 replicas created), ampere-1-deployment-66221e9a-5bf5546fd6 (0/0 replicas created)
NewReplicaSet:   ampere-1-deployment-66221e9a-8f74b44bb (1/1 replicas created)
Events:          <none>
```

Hope this helps.

jtmarmon commented

hey @lblackstone @jsravn, I'm running into this very reliably, and I think your assessment

> after a lot of debugging we found this happens if something in the cluster is modifying the pod spec after deployment

is likely accurate. I'm on GCP GKE using Autopilot mode, and I run into this only when I modify my deployment's pod spec resource limits. I can modify resource requests with no problem, but changing resource limits reliably causes the operation to hang. I presume this has something to do with Autopilot magic around resource limits causing the pod spec modification you mentioned.
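For what it's worth, Autopilot is documented to mutate container resources at admission (it generally sets limits equal to requests), so one mitigation to try is declaring them equal up front, which should make that mutation a no-op. A sketch with placeholder names, image, and sizes:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Mitigation sketch for GKE Autopilot: declare limits equal to requests so
// Autopilot's admission-time mutation leaves the pod spec exactly as applied.
// Names, image, and sizes are placeholders.
const resources = {
  requests: { cpu: "500m", memory: "512Mi" },
  limits: { cpu: "500m", memory: "512Mi" }, // intentionally identical
};

const labels = { app: "my-app" };
new k8s.apps.v1.Deployment("my-app", {
  spec: {
    replicas: 1,
    selector: { matchLabels: labels },
    template: {
      metadata: { labels },
      spec: { containers: [{ name: "app", image: "nginx:1.25", resources }] },
    },
  },
});
```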

lblackstone commented

Thanks for the update. I opened #2662 to track the bug with the new info you provided.
