Pulumi gets stuck waiting on a deployment to be updated while it has updated #1502

Closed
monchier opened this issue Mar 19, 2021 · 11 comments

Labels: impact/performance (Something is slower than expected), kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)

Comments

@monchier

monchier commented Mar 19, 2021

We are running Pulumi to release to our Kubernetes cluster. It is a fairly standard setup, but we have many Deployments in the same namespace, approximately 5,000. We only manage some of them with Pulumi; the other Deployments are application-level (the app creates and removes them dynamically).

Expected behavior

Once the Deployment has finished updating in Kubernetes, Pulumi should progress. This is the status of the Deployment when Pulumi gets stuck:

  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2021-03-18T20:31:42Z"
    lastUpdateTime: "2021-03-18T20:31:42Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-03-16T22:19:53Z"
    lastUpdateTime: "2021-03-18T22:21:05Z"
    message: ReplicaSet "frontend-3t2wvsc8-5584694cb5" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 15
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3

Current behavior

Pulumi is stuck indefinitely, with output like the following:

I0318 18:20:40.336010   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: resources={map[limits:{map[cpu:{1} memory:{1Gi}]} requests:{map[cpu:{100m} memory:{100Mi}]}]}
I0318 18:20:40.336019   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: limits={map[cpu:{1} memory:{1Gi}]}
I0318 18:20:40.336025   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: cpu={1}
I0318 18:20:40.336029   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: memory={1Gi}
I0318 18:20:40.336050   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: requests={map[cpu:{100m} memory:{100Mi}]}
I0318 18:20:40.336056   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: cpu={100m}
I0318 18:20:40.336060   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: memory={100Mi}
I0318 18:20:40.336065   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: volumeMounts={[]}
I0318 18:20:40.336098   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: enableServiceLinks={false}
I0318 18:20:40.336103   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: nodeSelector={map[tier:{platform}]}
I0318 18:20:40.336111   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: tier={platform}
I0318 18:20:40.336115   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: volumes={[]}
⠂I0318 18:20:42.322325   99357 eventsink.go:70] eventSink::Info(<{%reset%}>[1/2] Waiting for app ReplicaSet be marked available<{%reset%}>)
[1/2] Waiting for app ReplicaSet be marked available
✨ updating...⠴I0318 18:20:54.115652   99357 eventsink.go:70] eventSink::Info(<{%reset%}>[1/2] Waiting for app ReplicaSet be marked available (0/1 Pods available)<{%reset%}>)
[1/2] Waiting for app ReplicaSet be marked available (0/1 Pods available)
✨ updating...⠂

As a workaround, we interrupt Pulumi and restart it; on the second run, Pulumi progresses just fine. Since we have multiple Deployments, this happens for every Deployment.
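
For concreteness, a managed Deployment of roughly this shape could be declared as in the following sketch (pulumi-kubernetes Go SDK, v3 import paths; the names, image, and replica count are illustrative and not taken from the actual program):

package main

import (
  appsv1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/apps/v1"
  corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/core/v1"
  metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/meta/v1"
  "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
  pulumi.Run(func(ctx *pulumi.Context) error {
    appLabels := pulumi.StringMap{"app": pulumi.String("frontend")}
    // One Pulumi-managed Deployment in the "default" namespace; the
    // thousands of application-managed Deployments live alongside it.
    _, err := appsv1.NewDeployment(ctx, "frontend", &appsv1.DeploymentArgs{
      Metadata: &metav1.ObjectMetaArgs{
        Namespace: pulumi.String("default"),
      },
      Spec: appsv1.DeploymentSpecArgs{
        Replicas: pulumi.Int(3),
        Selector: &metav1.LabelSelectorArgs{MatchLabels: appLabels},
        Template: &corev1.PodTemplateSpecArgs{
          Metadata: &metav1.ObjectMetaArgs{Labels: appLabels},
          Spec: &corev1.PodSpecArgs{
            Containers: corev1.ContainerArray{
              corev1.ContainerArgs{
                Name:  pulumi.String("frontend"),
                Image: pulumi.String("nginx:1.19"), // placeholder image
              },
            },
          },
        },
      },
    })
    return err
  })
}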

@monchier monchier added the kind/bug (Some behavior is incorrect or out of spec) label Mar 19, 2021
@tusharshahrs

From the #kubernetes channel in the Pulumi Community Slack:
https://pulumi-community.slack.com/archives/CRFURDVQB/p1615870653002400?thread_ts=1615855744.002100&cid=CRFURDVQB

Pulumi CLI version 2.22.0
Pulumi-Kubernetes plugin version 2.8.2

@tusharshahrs

tusharshahrs commented Mar 23, 2021

Saw the issue in CI/CD, and we were able to reproduce it by running the same scripts locally.
Running with --parallel 1 to limit memory usage.
It can be reproduced fairly reliably.
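
For reference, the parallelism limit is passed as a flag to the CLI:

pulumi up --parallel 1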

@monchier
Author

monchier commented Mar 24, 2021 via email

@tusharshahrs

Workaround for the issue:

If you do the following:

kc describe deploy frontend-3t2wvsc8-5584694cb5

there is a line that shows up:

Annotations: kubectl.kubernetes.io/restartedAt: “2021-03-04...”

Removing this line and then running pulumi up allows the update to continue.
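
One possible way to remove that annotation directly (an illustrative command, assuming the annotation was added to the pod template by kubectl rollout restart, and using the Deployment name from the logs above):

kubectl patch deployment frontend-3t2wvsc8 --type=json -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/kubectl.kubernetes.io~1restartedAt"}]'

Note the ~1 escape for the "/" in the annotation key, as required by JSON Pointer syntax.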

@leezen leezen added this to the 0.56 milestone Apr 28, 2021
@lblackstone
Member

It looks like this annotation is caused by running kubectl rollout restart deployment on 1.15+ k8s clusters. The current readiness logic doesn't handle this annotation, which I expect is leading to the stuck updates. I'm investigating now.

@lblackstone
Member

I haven't been able to reproduce this locally. Here's what I've tried:

  1. Create Deployment with Pulumi
  2. Run kubectl rollout restart deployment <deployment> to add the restartedAt annotation
  3. Update the Deployment spec and run pulumi up

I’ve also tried a few combinations of pulumi refresh and updating the Deployment metadata rather than the spec, but everything seems to be proceeding as expected.


@monchier Are you still hitting this issue, and if so, do you have any more ideas on what I could do to repro?

@lblackstone
Member

I got some more context on this issue from the user. Initially, they were running a very large number of Deployments (5000+) in a single Namespace. Most of these Deployments were managed externally (not by Pulumi), but a small number were managed with Pulumi. These managed Deployments were the ones getting stuck. Moving the small number of managed Deployments to a separate Namespace resolved the issue. (It appears that the restartedAt annotation may have been a red herring.)

This leads me to believe that the issue is related to the scale rather than a bug in the await logic itself. I have two initial theories:

  1. The large number of Deployments in a single Namespace is delaying status updates on the cluster itself.
  2. The await logic uses client-go to watch Events for the Namespace. In this case, the number of Events is much larger than we anticipated/tested, which could be delaying processing beyond the default timeout.

Based on the Deployment status given in the initial issue report, I think we can eliminate option 1.

For option 2, processing occurs serially and has not been optimized for performance. It seems likely that this would be a bottleneck, since it would be processing Events from every Deployment in the Namespace, not only the ones managed by Pulumi.
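
As a rough illustration of that pattern (a client-go sketch, not the provider's actual code): a watch scoped to the whole Namespace delivers a notification for every change to every Deployment in it, and each notification is handled one at a time.

package main

import (
  "context"
  "fmt"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
)

// watchNamespaceDeployments illustrates a namespace-wide watch: with ~5000
// Deployments in the namespace, the result channel carries events for all of
// them, not just the one being awaited, and they are processed serially.
func watchNamespaceDeployments(ctx context.Context, clientset kubernetes.Interface, ns string) error {
  w, err := clientset.AppsV1().Deployments(ns).Watch(ctx, metav1.ListOptions{})
  if err != nil {
    return err
  }
  defer w.Stop()

  for event := range w.ResultChan() {
    // Serial processing: every event is examined in turn, even when it
    // belongs to a Deployment that Pulumi does not manage.
    fmt.Printf("observed %s event\n", event.Type)
  }
  return nil
}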


I'll need to do some more testing to confirm this theory, but in the meantime, it sounds like moving the Pulumi-managed Deployments into a separate Namespace solved the problem for the user.

@lblackstone lblackstone added the impact/performance (Something is slower than expected) and kind/bug (Some behavior is incorrect or out of spec) labels, and removed the kind/bug (Some behavior is incorrect or out of spec) label, May 3, 2021
@lblackstone
Member

I reviewed the relevant code and also wasn't able to reproduce locally, so this appears to be a performance issue caused by a large number of Deployments in the same Namespace. The await logic serially processes every incoming Event for the target Namespace, which is likely not a viable solution for more extreme cases like this.

It may be worth optimizing the await logic in the future, but I'm going to close this issue for now since we have a viable workaround and the problem seems to be a scaling issue rather than a correctness issue.

@lblackstone lblackstone added the resolution/wont-fix (This issue won't be fixed) label May 3, 2021
@joeduffy joeduffy reopened this May 6, 2021
@joeduffy
Member

joeduffy commented May 6, 2021

There is a performance problem in our code here, due to the way we have implemented the await logic. The scenario that led to this is perfectly valid and should work with our provider, just as I assume kubectl can deploy to such a large cluster. The fact that there's a difficult-to-find workaround is not sufficient justification to close this out as won't-fix. I am also honestly surprised that we are O(N) in the size of the cluster rather than the size of the deployment being performed here, and furthermore 5,000 really isn't that large a number to start failing at. I am reopening this. If we have a performance problem to track, let's either use this issue or file one that's more targeted, with a more thorough description of the problem.

@viveklak
Contributor

viveklak commented May 28, 2021

There are a few intersecting issues here:

  1. The await logic, especially for higher-order resources like Deployment, is fairly convoluted and makes somewhat unreasonable consistency assumptions. The change in Fix hanging updates for deployments #1596 should help alleviate the immediate problem identified in this issue. However, a refactor of the await logic is long overdue. See Simplify Deployment awaiter #451 (and other related issues: Simplify StatefulSet awaiter #452, Simplify Service awaiter #453, Simplify Ingress awaiter #454).
  2. Currently the await logic creates a watch for each updated resource (and, in the case of higher-order resources, for the underlying resources such as ReplicaSets, Pods, PVCs, etc.). Since these watches are placed on the entire namespace, especially with shared namespaces (as is the case in this bug), we incur a large amount of redundant traffic and unnecessarily load the Kubernetes control plane. We should strongly consider restructuring the await logic to use shared informers instead; see Restructure await logic to use shared informers instead of relying on fine-grained watches #1598. A rough sketch follows at the end of this comment.

I am planning to target #1596 to fix this specific issue and prioritize the other linked issues subsequently.
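
As a rough sketch of the shared-informer approach (illustrative only, not the actual implementation; the resync period and the readiness check are placeholders):

package main

import (
  "time"

  appsv1 "k8s.io/api/apps/v1"
  "k8s.io/client-go/informers"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/cache"
)

// runSharedDeploymentInformer shares a single namespace-scoped informer (and
// therefore a single underlying watch and cache) among all awaiters, instead
// of opening one fine-grained watch per awaited resource.
func runSharedDeploymentInformer(clientset kubernetes.Interface, ns string, stopCh <-chan struct{}) {
  factory := informers.NewSharedInformerFactoryWithOptions(
    clientset,
    10*time.Minute, // placeholder resync period
    informers.WithNamespace(ns),
  )

  informer := factory.Apps().V1().Deployments().Informer()
  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
      d, ok := newObj.(*appsv1.Deployment)
      if !ok {
        return
      }
      // An awaiter would check readiness here, e.g. compare
      // d.Status.ReadyReplicas against the desired replica count.
      _ = d
    },
  })

  factory.Start(stopCh)
  factory.WaitForCacheSync(stopCh)
  <-stopCh
}

Because the informer is backed by a local cache and a single watch per resource type and namespace, adding more Pulumi-managed resources in the same namespace no longer multiplies the watch traffic against the API server.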

@leezen leezen modified the milestones: 0.57, 0.58 Jun 8, 2021
@lblackstone lblackstone added the resolution/fixed (This issue was fixed) label Jun 15, 2021
@lblackstone
Copy link
Member

Fixed by #1598
