Pulumi gets stuck waiting on a deployment to be updated while it has updated #1502

Closed
monchier opened this issue Mar 19, 2021 · 11 comments

Labels: impact/performance (Something is slower than expected), kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)

Comments

@monchier

monchier commented Mar 19, 2021

We are running Pulumi to release to our Kubernetes cluster. It is a fairly standard setup, but we have many Deployments in the same namespace, approximately 5,000. We only manage some of them with Pulumi; the other Deployments are application-level (the app creates and removes them dynamically).

Expected behavior

Once the Deployment has finished updating in Kubernetes, Pulumi should progress. This is the status of the Deployment when Pulumi gets stuck:

  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2021-03-18T20:31:42Z"
    lastUpdateTime: "2021-03-18T20:31:42Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-03-16T22:19:53Z"
    lastUpdateTime: "2021-03-18T22:21:05Z"
    message: ReplicaSet "frontend-3t2wvsc8-5584694cb5" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 15
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3

Current behavior

Pulumi is stuck indefinitely, with output like the following:

I0318 18:20:40.336010   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: resources={map[limits:{map[cpu:{1} memory:{1Gi}]} requests:{map[cpu:{100m} memory:{100Mi}]}]}
I0318 18:20:40.336019   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: limits={map[cpu:{1} memory:{1Gi}]}
I0318 18:20:40.336025   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: cpu={1}
I0318 18:20:40.336029   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: memory={1Gi}
I0318 18:20:40.336050   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: requests={map[cpu:{100m} memory:{100Mi}]}
I0318 18:20:40.336056   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: cpu={100m}
I0318 18:20:40.336060   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: memory={100Mi}
I0318 18:20:40.336065   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: volumeMounts={[]}
I0318 18:20:40.336098   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: enableServiceLinks={false}
I0318 18:20:40.336103   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: nodeSelector={map[tier:{platform}]}
I0318 18:20:40.336111   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: tier={platform}
I0318 18:20:40.336115   99357 rpc.go:72] Marshaling property for RPC[Provider[kubernetes, 0xc003c0d860].Update(default/frontend-3t2wvsc8,urn:pulumi:prod::services::kubernetes:apps/v1:Deployment::frontend).news]: volumes={[]}
⠂I0318 18:20:42.322325   99357 eventsink.go:70] eventSink::Info(<{%reset%}>[1/2] Waiting for app ReplicaSet be marked available<{%reset%}>)
[1/2] Waiting for app ReplicaSet be marked available
✨ updating...⠴I0318 18:20:54.115652   99357 eventsink.go:70] eventSink::Info(<{%reset%}>[1/2] Waiting for app ReplicaSet be marked available (0/1 Pods available)<{%reset%}>)
[1/2] Waiting for app ReplicaSet be marked available (0/1 Pods available)
✨ updating...⠂

As a workaround, we interrupt Pulumi and restart it; on the second run, Pulumi progresses just fine. Since we have multiple Deployments, this happens for every Deployment.
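
For concreteness, a managed Deployment of roughly this shape could be declared as in the following sketch (pulumi-kubernetes Go SDK, v3 import paths; the names, image, and replica count are illustrative and not taken from the actual program):

package main

import (
  appsv1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/apps/v1"
  corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/core/v1"
  metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes/meta/v1"
  "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
  pulumi.Run(func(ctx *pulumi.Context) error {
    appLabels := pulumi.StringMap{"app": pulumi.String("frontend")}
    // One Pulumi-managed Deployment in the "default" namespace; the
    // thousands of application-managed Deployments live alongside it.
    _, err := appsv1.NewDeployment(ctx, "frontend", &appsv1.DeploymentArgs{
      Metadata: &metav1.ObjectMetaArgs{
        Namespace: pulumi.String("default"),
      },
      Spec: appsv1.DeploymentSpecArgs{
        Replicas: pulumi.Int(3),
        Selector: &metav1.LabelSelectorArgs{MatchLabels: appLabels},
        Template: &corev1.PodTemplateSpecArgs{
          Metadata: &metav1.ObjectMetaArgs{Labels: appLabels},
          Spec: &corev1.PodSpecArgs{
            Containers: corev1.ContainerArray{
              corev1.ContainerArgs{
                Name:  pulumi.String("frontend"),
                Image: pulumi.String("nginx:1.19"), // placeholder image
              },
            },
          },
        },
      },
    })
    return err
  })
}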

@monchier monchier added the kind/bug (Some behavior is incorrect or out of spec) label Mar 19, 2021
@tusharshahrs

From the #kubernetes channel in the Pulumi Community Slack:
https://pulumi-community.slack.com/archives/CRFURDVQB/p1615870653002400?thread_ts=1615855744.002100&cid=CRFURDVQB

Pulumi CLI version 2.22.0
Pulumi-Kubernetes plugin version 2.8.2

@tusharshahrs

tusharshahrs commented Mar 23, 2021

Saw the issue in CI/CD, and we were able to reproduce it by running the same scripts locally.
Running with --parallel 1 to limit memory usage.
It can be reproduced fairly reliably.
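
For reference, the parallelism limit is passed as a flag to the CLI:

pulumi up --parallel 1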

@monchier
Author

monchier commented Mar 24, 2021 via email

@tusharshahrs

Workaround for the issue:

If you do the following:

kc describe deploy frontend-3t2wvsc8-5584694cb5

there is a line that shows up:

Annotations: kubectl.kubernetes.io/restartedAt: “2021-03-04...”

Removing this line and then running pulumi up allows the update to continue.
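
One possible way to remove that annotation directly (an illustrative command, assuming the annotation was added to the pod template by kubectl rollout restart, and using the Deployment name from the logs above):

kubectl patch deployment frontend-3t2wvsc8 --type=json -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/kubectl.kubernetes.io~1restartedAt"}]'

Note the ~1 escape for the "/" in the annotation key, as required by JSON Pointer syntax.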

@leezen leezen added this to the 0.56 milestone Apr 28, 2021
@lblackstone
Member

It looks like this annotation is caused by running kubectl rollout restart deployment on 1.15+ k8s clusters. The current readiness logic doesn't handle this annotation, which I expect is leading to the stuck updates. I'm investigating now.

@lblackstone
Member

I haven't been able to reproduce this locally. Here's what I've tried:

  1. Create Deployment with Pulumi
  2. Run kubectl rollout restart deployment <deployment> to add the restartedAt annotation
  3. Update the Deployment spec and run pulumi up

I’ve also tried a few combinations of pulumi refresh and updating the Deployment metadata rather than the spec, but everything seems to be proceeding as expected.


@monchier Are you still hitting this issue, and if so, do you have any more ideas on what I could do to repro?

@lblackstone
Member

I got some more context on this issue from the user. Initially, they were running a very large number of Deployments (5000+) in a single Namespace. Most of these Deployments were managed externally (not by Pulumi), but a small number were managed with Pulumi. These managed Deployments were the ones getting stuck. Moving the small number of managed Deployments to a separate Namespace resolved the issue. (It appears that the restartedAt annotation may have been a red herring.)

This leads me to believe that the issue is related to the scale rather than a bug in the await logic itself. I have two initial theories:

  1. The large number of Deployments in a single Namespace is delaying status updates on the cluster itself.
  2. The await logic uses client-go to watch Events for the Namespace. In this case, the number of Events is much larger than we anticipated/tested, which could be delaying processing beyond the default timeout.

Based on the Deployment status given in the initial issue report, I think we can eliminate option 1.

For option 2, processing occurs serially and has not been optimized for performance. It seems likely that this would be a bottleneck, since it would be processing Events from every Deployment in the Namespace, not only the ones managed by Pulumi.
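
As a rough illustration of that pattern (a client-go sketch, not the provider's actual code): a watch scoped to the whole Namespace delivers a notification for every change to every Deployment in it, and each notification is handled one at a time.

package main

import (
  "context"
  "fmt"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
)

// watchNamespaceDeployments illustrates a namespace-wide watch: with ~5000
// Deployments in the namespace, the result channel carries events for all of
// them, not just the one being awaited, and they are processed serially.
func watchNamespaceDeployments(ctx context.Context, clientset kubernetes.Interface, ns string) error {
  w, err := clientset.AppsV1().Deployments(ns).Watch(ctx, metav1.ListOptions{})
  if err != nil {
    return err
  }
  defer w.Stop()

  for event := range w.ResultChan() {
    // Serial processing: every event is examined in turn, even when it
    // belongs to a Deployment that Pulumi does not manage.
    fmt.Printf("observed %s event\n", event.Type)
  }
  return nil
}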


I'll need to do some more testing to confirm this theory, but in the meantime, it sounds like moving the Pulumi-managed Deployments into a separate Namespace solved the problem for the user.

@lblackstone lblackstone added the impact/performance (Something is slower than expected) and kind/bug (Some behavior is incorrect or out of spec) labels, and removed the kind/bug (Some behavior is incorrect or out of spec) label, May 3, 2021
@lblackstone
Member

I reviewed the relevant code and also wasn't able to reproduce locally, so this appears to be a performance issue caused by a large number of Deployments in the same Namespace. The await logic serially processes every incoming Event for the target Namespace, which is likely not a viable solution for more extreme cases like this.

It may be worth optimizing the await logic in the future, but I'm going to close this issue for now since we have a viable workaround and the problem seems to be a scaling issue rather than a correctness issue.

@lblackstone lblackstone added the resolution/wont-fix (This issue won't be fixed) label May 3, 2021
@joeduffy joeduffy reopened this May 6, 2021
@joeduffy
Member

joeduffy commented May 6, 2021

There is a performance problem in our code here, due to the way we have implemented the await logic. The scenario that led to this is perfectly valid and should work with our provider, just as I assume kubectl can deploy to such a large cluster. The fact that there's a difficult-to-find workaround is not sufficient justification to close this out as won't-fix. I am also honestly surprised that we are O(N) in the size of the cluster rather than the size of the deployment being performed here, and furthermore 5,000 really isn't that large a number to start failing at. I am reopening this. If we have a performance problem to track, let's either use this issue or file one that's more targeted, with a more thorough description of the problem.

@viveklak
Contributor

viveklak commented May 28, 2021

There are a few intersecting issues here:

  1. The await logic, especially for higher-order resources like Deployment, is fairly convoluted and makes somewhat unreasonable consistency assumptions. The change in Fix hanging updates for deployments #1596 should help alleviate the immediate problem identified in this issue. However, a refactor of the await logic is long overdue. See Simplify Deployment awaiter #451 (and other related issues: Simplify StatefulSet awaiter #452, Simplify Service awaiter #453, Simplify Ingress awaiter #454).
  2. Currently the await logic creates a watch for each updated resource (and, in the case of higher-order resources, for the underlying resources such as ReplicaSets, Pods, PVCs, etc.). Since these watches are placed on the entire namespace, especially with shared namespaces (as is the case in this bug), we incur a large amount of redundant traffic and unnecessarily load the Kubernetes control plane. We should strongly consider restructuring the await logic to use shared informers instead; see Restructure await logic to use shared informers instead of relying on fine-grained watches #1598. A rough sketch follows at the end of this comment.

I am planning to target #1596 to fix this specific issue and prioritize the other linked issues subsequently.
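
As a rough sketch of the shared-informer approach (illustrative only, not the actual implementation; the resync period and the readiness check are placeholders):

package main

import (
  "time"

  appsv1 "k8s.io/api/apps/v1"
  "k8s.io/client-go/informers"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/cache"
)

// runSharedDeploymentInformer shares a single namespace-scoped informer (and
// therefore a single underlying watch and cache) among all awaiters, instead
// of opening one fine-grained watch per awaited resource.
func runSharedDeploymentInformer(clientset kubernetes.Interface, ns string, stopCh <-chan struct{}) {
  factory := informers.NewSharedInformerFactoryWithOptions(
    clientset,
    10*time.Minute, // placeholder resync period
    informers.WithNamespace(ns),
  )

  informer := factory.Apps().V1().Deployments().Informer()
  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: func(oldObj, newObj interface{}) {
      d, ok := newObj.(*appsv1.Deployment)
      if !ok {
        return
      }
      // An awaiter would check readiness here, e.g. compare
      // d.Status.ReadyReplicas against the desired replica count.
      _ = d
    },
  })

  factory.Start(stopCh)
  factory.WaitForCacheSync(stopCh)
  <-stopCh
}

Because the informer is backed by a local cache and a single watch per resource type and namespace, adding more Pulumi-managed resources in the same namespace no longer multiplies the watch traffic against the API server.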

@leezen leezen modified the milestones: 0.57, 0.58 Jun 8, 2021
@lblackstone lblackstone added the resolution/fixed (This issue was fixed) label Jun 15, 2021
@lblackstone
Copy link
Member

Fixed by #1598
