pulumi up stuck on "Waiting for app ReplicaSet be marked available" #1628

Closed
eirikur-grid opened this issue Jun 22, 2021 · 22 comments · Fixed by #1639

Labels: emergent (an issue that was added to the current milestone after planning), kind/bug (some behavior is incorrect or out of spec), resolution/fixed (this issue was fixed)

Comments

@eirikur-grid

eirikur-grid commented Jun 22, 2021

I've recently had issues with pulumi hanging (or timing out) when attempting to deploy changes.
Our stack has 8 deployments. Some of them get updated, others fail. The number varies. Usually 2-4 are successfully updated.

On the Kubernetes side, the deployments appear to have rolled out successfully.

Versions info

OS: macos
Pulumi CLI: 3.5.1
Python: 3.7.3
Python package versions

  • pulumi: 3.5.1
  • pulumi-kubernetes: 3.4.0

k8s: 1.17 on AWS EKS

Here's the output after a failed deployment:

Updating (dev-delta-api):
     Type                                 Name                                  Status                  Info
     pulumi:pulumi:Stack                  grid-api-dev-delta-api                **failed**              1 error
 ~   ├─ pulumi-python:dynamic:Resource    grid-api-ecr-image                    updated                 [diff: ~image_name]
 ~   ├─ kubernetes:batch/v1beta1:CronJob  grid-api                              updated                 [diff: ~spec]
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api                              updated                 [diff: ~spec]
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-thumbnail-worker             updated                 [diff: ~spec]
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-regenerate-thumbnail-worker  **updating failed**     [diff: ~spec]; 1 error
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-gdrive-monitor-worker        **updating failed**     [diff: ~spec]; 1 error
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-cloud-drive-refresh-worker   **updating failed**     [diff: ~spec]; 1 error
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-document-analytics-writer    **updating failed**     [diff: ~spec]; 1 error
 ~   ├─ kubernetes:apps/v1:Deployment     grid-api-chargebee-webhook-worker     **updating failed**     [diff: ~spec]; 1 error
 ~   └─ kubernetes:apps/v1:Deployment     grid-api-orphaned-workbook-remover    **updating failed**     [diff: ~spec]; 1 error

Diagnostics:
  kubernetes:apps/v1:Deployment (grid-api-document-analytics-writer):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-document-analytics-writer" failed to fully initialize or become live: 'grid-api-document-analytics-writer' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

  kubernetes:apps/v1:Deployment (grid-api-chargebee-webhook-worker):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-chargebee-webhook-worker" failed to fully initialize or become live: 'grid-api-chargebee-webhook-worker' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

  pulumi:pulumi:Stack (grid-api-dev-delta-api):
    error: update failed

  kubernetes:apps/v1:Deployment (grid-api-orphaned-workbook-remover):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-orphaned-workbook-remover" failed to fully initialize or become live: 'grid-api-orphaned-workbook-remover' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

  kubernetes:apps/v1:Deployment (grid-api-regenerate-thumbnail-worker):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-regenerate-thumbnail-worker" failed to fully initialize or become live: 'grid-api-regenerate-thumbnail-worker' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

  kubernetes:apps/v1:Deployment (grid-api-gdrive-monitor-worker):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-gdrive-monitor-worker" failed to fully initialize or become live: 'grid-api-gdrive-monitor-worker' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

  kubernetes:apps/v1:Deployment (grid-api-cloud-drive-refresh-worker):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "grid-api-cloud-drive-refresh-worker" failed to fully initialize or become live: 'grid-api-cloud-drive-refresh-worker' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live

Resources:
    ~ 4 updated
    4 unchanged

Duration: 10m17s

While I was waiting for the deployment to complete, I ran kubectl get deploy and kubectl get rs:

(⎈ |dev-delta)~❯ kubectl get deploy
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
document-thumbnailer                   2/2     2            2           270d
excel-processor                        2/2     2            2           270d
grid-api                               3/3     3            3           269d
grid-api-chargebee-webhook-worker      1/1     1            1           98d
grid-api-cloud-drive-refresh-worker    1/1     1            1           115d
grid-api-document-analytics-writer     1/1     1            1           109d
grid-api-gdrive-monitor-worker         1/1     1            1           153d
grid-api-orphaned-workbook-remover     1/1     1            1           40d
grid-api-regenerate-thumbnail-worker   1/1     1            1           5d20h
grid-api-thumbnail-worker              1/1     1            1           269d
grid-api2                              1/1     1            1           248d
grid-api2-thumbnail-worker             0/0     0            0           248d
grid-client                            2/2     2            2           269d
nginx-proxy                            1/1     1            1           272d
traefik-router                         2/2     2            2           272d
(⎈ |dev-delta)~❯ kubectl get rs
NAME                                              DESIRED   CURRENT   READY   AGE
document-thumbnailer-5787bb49b5                   0         0         0       6d17h
document-thumbnailer-5d4d8f55bc                   0         0         0       63d
document-thumbnailer-6559d8649c                   2         2         2       6d16h
document-thumbnailer-659f5c67bf                   0         0         0       52d
document-thumbnailer-668f44f684                   0         0         0       214d
document-thumbnailer-74b8bd4d59                   0         0         0       53d
document-thumbnailer-75477cd55                    0         0         0       222d
document-thumbnailer-7974f57c89                   0         0         0       6d17h
document-thumbnailer-7c584ff45                    0         0         0       203d
document-thumbnailer-9c77fc8c6                    0         0         0       12d
document-thumbnailer-bff5fc75c                    0         0         0       221d
excel-processor-54fcd89bc6                        2         2         2       5d22h
excel-processor-5867f648dd                        0         0         0       6d17h
excel-processor-5fd65b7db7                        0         0         0       34d
excel-processor-64f6d866                          0         0         0       6d17h
excel-processor-65f68d8f69                        0         0         0       12d
excel-processor-6cb97d9d74                        0         0         0       13d
excel-processor-6fc6599f96                        0         0         0       12d
excel-processor-6fc76887fd                        0         0         0       11d
excel-processor-786b55c779                        0         0         0       20d
excel-processor-86697b4cb4                        0         0         0       20d
excel-processor-f58f854c5                         0         0         0       12d
grid-api-58c8d4cb7c                               0         0         0       5d21h
grid-api-58d57688c8                               0         0         0       5d20h
grid-api-5b4b666bb                                0         0         0       2d16h
grid-api-5c45dd78c9                               0         0         0       3d18h
grid-api-5cf8458c84                               0         0         0       4d15h
grid-api-6cfcdbbc64                               0         0         0       10h
grid-api-6dc5c7559f                               0         0         0       3d20h
grid-api-6dd9cc7889                               3         3         3       15h
grid-api-7644cd5475                               0         0         0       21h
grid-api-chargebee-webhook-worker-586c9c9754      0         0         0       18h
grid-api-chargebee-webhook-worker-5b7d7c6b67      0         0         0       5d22h
grid-api-chargebee-webhook-worker-5d86974455      0         0         0       5d21h
grid-api-chargebee-webhook-worker-5df6b68db7      0         0         0       16h
grid-api-chargebee-webhook-worker-65cd958846      0         0         0       10h
grid-api-chargebee-webhook-worker-69ccc7b4bd      0         0         0       2d16h
grid-api-chargebee-webhook-worker-6f5ff7459       1         1         1       15h
grid-api-chargebee-webhook-worker-6f8dc59f77      0         0         0       21h
grid-api-chargebee-webhook-worker-7c9dcb5966      0         0         0       4d15h
grid-api-chargebee-webhook-worker-7d97d9f876      0         0         0       5d20h
grid-api-chargebee-webhook-worker-7f8fdcd66f      0         0         0       3d20h
grid-api-cloud-drive-refresh-worker-549bc7f995    1         1         1       15h
grid-api-cloud-drive-refresh-worker-58cb8fb777    0         0         0       3d20h
grid-api-cloud-drive-refresh-worker-598974547     0         0         0       16h
grid-api-cloud-drive-refresh-worker-65945bdd95    0         0         0       5d21h
grid-api-cloud-drive-refresh-worker-6c79cc4cf8    0         0         0       18h
grid-api-cloud-drive-refresh-worker-6cbd645cf4    0         0         0       4d15h
grid-api-cloud-drive-refresh-worker-7b648bcdd     0         0         0       5d20h
grid-api-cloud-drive-refresh-worker-7c845f85bd    0         0         0       2d16h
grid-api-cloud-drive-refresh-worker-7d65d75c46    0         0         0       10h
grid-api-cloud-drive-refresh-worker-986fdf449     0         0         0       21h
grid-api-cloud-drive-refresh-worker-fc67f56c      0         0         0       5d22h
grid-api-ddb49db4f                                0         0         0       16h
grid-api-document-analytics-writer-5484d944db     0         0         0       21h
grid-api-document-analytics-writer-549f4f8567     0         0         0       10h
grid-api-document-analytics-writer-57b65bd65f     0         0         0       18h
grid-api-document-analytics-writer-57cb9fb9b9     0         0         0       2d16h
grid-api-document-analytics-writer-59c46d8d48     0         0         0       5d22h
grid-api-document-analytics-writer-5df8fc4cd8     0         0         0       16h
grid-api-document-analytics-writer-696b6fccb5     0         0         0       5d20h
grid-api-document-analytics-writer-6bc5c84988     1         1         1       15h
grid-api-document-analytics-writer-74fc7b76ff     0         0         0       4d15h
grid-api-document-analytics-writer-7954646894     0         0         0       3d20h
grid-api-document-analytics-writer-84f5ff7ddd     0         0         0       5d21h
grid-api-ff5fb75dd                                0         0         0       18h
grid-api-gdrive-monitor-worker-54494b89f7         0         0         0       5d22h
grid-api-gdrive-monitor-worker-5587bc8695         0         0         0       21h
grid-api-gdrive-monitor-worker-57fcb8495d         0         0         0       18h
grid-api-gdrive-monitor-worker-66bdd688cc         0         0         0       5d21h
grid-api-gdrive-monitor-worker-6cc69dfd6c         0         0         0       3d20h
grid-api-gdrive-monitor-worker-767ddb57c          1         1         1       15h
grid-api-gdrive-monitor-worker-768796d7cb         0         0         0       2d16h
grid-api-gdrive-monitor-worker-7879cdc974         0         0         0       10h
grid-api-gdrive-monitor-worker-7b75b8b557         0         0         0       16h
grid-api-gdrive-monitor-worker-84bdc587b          0         0         0       5d20h
grid-api-gdrive-monitor-worker-99b4dd598          0         0         0       4d15h
grid-api-orphaned-workbook-remover-556586c88      0         0         0       2d16h
grid-api-orphaned-workbook-remover-556697cfdc     0         0         0       21h
grid-api-orphaned-workbook-remover-59d4c6588f     0         0         0       5d21h
grid-api-orphaned-workbook-remover-5c8db59dbc     0         0         0       3d20h
grid-api-orphaned-workbook-remover-6588cc77fc     0         0         0       5d22h
grid-api-orphaned-workbook-remover-6b76cbd7b6     0         0         0       18h
grid-api-orphaned-workbook-remover-6f9b74c6d8     0         0         0       5d20h
grid-api-orphaned-workbook-remover-78758c47c      1         1         1       15h
grid-api-orphaned-workbook-remover-78b84b8bcd     0         0         0       4d15h
grid-api-orphaned-workbook-remover-7d8dd98cc8     0         0         0       10h
grid-api-orphaned-workbook-remover-9cb68c67c      0         0         0       16h
grid-api-regenerate-thumbnail-worker-58d947559f   0         0         0       18h
grid-api-regenerate-thumbnail-worker-64487cbd98   0         0         0       3d20h
grid-api-regenerate-thumbnail-worker-69b58765d9   0         0         0       2d16h
grid-api-regenerate-thumbnail-worker-6d9d4d9db7   0         0         0       10h
grid-api-regenerate-thumbnail-worker-74844bf9c6   0         0         0       16h
grid-api-regenerate-thumbnail-worker-777c4fd55b   1         1         1       15h
grid-api-regenerate-thumbnail-worker-7d8655496c   0         0         0       4d15h
grid-api-regenerate-thumbnail-worker-bf7fcb849    0         0         0       5d20h
grid-api-regenerate-thumbnail-worker-db8f65c4     0         0         0       21h
grid-api-thumbnail-worker-574f59674d              0         0         0       10h
grid-api-thumbnail-worker-5987d4956b              0         0         0       2d16h
grid-api-thumbnail-worker-59c6cc7cf6              0         0         0       18h
grid-api-thumbnail-worker-5d4f79c85               0         0         0       5d22h
grid-api-thumbnail-worker-5d6d95dffc              1         1         1       15h
grid-api-thumbnail-worker-687f9669f               0         0         0       21h
grid-api-thumbnail-worker-6f48ff74fc              0         0         0       5d21h
grid-api-thumbnail-worker-767f7dd649              0         0         0       16h
grid-api-thumbnail-worker-76f8865894              0         0         0       5d20h
grid-api-thumbnail-worker-7f985d7d88              0         0         0       4d15h
grid-api-thumbnail-worker-8c8c87847               0         0         0       3d20h
grid-api2-54879b9bf9                              0         0         0       223d
grid-api2-5c66fdddb4                              0         0         0       228d
grid-api2-657f59c995                              0         0         0       223d
grid-api2-664688b765                              0         0         0       236d
grid-api2-694bc66649                              1         1         1       185d
grid-api2-69c8449d9c                              0         0         0       236d
grid-api2-74dd495fdc                              0         0         0       185d
grid-api2-845ccb8f55                              0         0         0       248d
grid-api2-85f4c6b789                              0         0         0       222d
grid-api2-97675df68                               0         0         0       223d
grid-api2-bfc9fcc99                               0         0         0       223d
grid-api2-thumbnail-worker-59479ccd5f             0         0         0       236d
grid-api2-thumbnail-worker-5b7649ffc7             0         0         0       236d
grid-api2-thumbnail-worker-5bbdc7fff4             0         0         0       236d
grid-api2-thumbnail-worker-658898d85              0         0         0       248d
grid-api2-thumbnail-worker-6798b9b984             0         0         0       185d
grid-api2-thumbnail-worker-7d456fb6f8             0         0         0       222d
grid-api2-thumbnail-worker-7f85445b48             0         0         0       223d
grid-api2-thumbnail-worker-7f96f8bc8f             0         0         0       228d
grid-api2-thumbnail-worker-857c7698d9             0         0         0       223d
grid-api2-thumbnail-worker-859cbb6b6d             0         0         0       223d
grid-api2-thumbnail-worker-97b4b84bd              0         0         0       223d
grid-client-56cc6557db                            0         0         0       5d14h
grid-client-585856bb66                            0         0         0       5d15h
grid-client-5ff555c88c                            0         0         0       15h
grid-client-66458dcf47                            0         0         0       3d18h
grid-client-6d4f7d7f4c                            0         0         0       16h
grid-client-6d69478b9b                            2         2         2       9h
grid-client-7db7979fcf                            0         0         0       16h
grid-client-7f894448cd                            0         0         0       5d15h
grid-client-7f9f4c7b8                             0         0         0       3d20h
grid-client-9dd8bdfcd                             0         0         0       15h
grid-client-f555db5f7                             0         0         0       20h
nginx-proxy-566f4bb4f5                            0         0         0       12d
nginx-proxy-5b5cdf87c8                            0         0         0       5d17h
nginx-proxy-6894ccf656                            0         0         0       35d
nginx-proxy-6d76459864                            0         0         0       126d
nginx-proxy-755fd9bc95                            0         0         0       122d
nginx-proxy-7cc8c9d7df                            0         0         0       146d
nginx-proxy-7d69f64bbc                            0         0         0       157d
nginx-proxy-846c44f5c4                            1         1         1       15h
nginx-proxy-854d95c659                            0         0         0       122d
nginx-proxy-85755b9dcc                            0         0         0       126d
nginx-proxy-85d59cd96b                            0         0         0       16h
traefik-router-67cbbd787f                         2         2         2       91d
traefik-router-86c7fd6f89                         0         0         0       272d
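
For anyone who wants to cross-check what the cluster reports while pulumi up is still waiting, here is a rough sketch that parses `kubectl get deploy` output and lists deployments whose READY count falls short of the desired count. The function name and the trimmed sample text are illustrative only; nothing here is part of Pulumi or kubectl.

```python
from typing import List


def not_ready_deployments(kubectl_output: str) -> List[str]:
    """Return names of deployments whose READY column shows ready < desired."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    not_ready = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, ready = fields[0], fields[1]
        ready_count, desired = (int(part) for part in ready.split("/"))
        if ready_count < desired:
            not_ready.append(name)
    return not_ready


# A few rows copied from the output above.
SAMPLE = """\
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
grid-api                               3/3     3            3           269d
grid-api-chargebee-webhook-worker      1/1     1            1           98d
grid-api2-thumbnail-worker             0/0     0            0           248d
"""

print(not_ready_deployments(SAMPLE))  # prints []
```

An empty list here, while Pulumi still reports "timed out waiting to be Ready", matches the mismatch described above.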
eirikur-grid added the kind/bug label Jun 22, 2021
@leezen
Contributor

leezen commented Jun 22, 2021

@viveklak This sounds a lot like #1502 -- any thoughts here?

viveklak self-assigned this Jun 22, 2021
@viveklak
Contributor

Assigned myself. Will take a look. Thanks for reporting!

@roderik

roderik commented Jun 23, 2021

Chiming in here: I recreated my EKS cluster and am facing the same issue.

[Screenshot: CleanShot 2021-06-23 at 07 22 10@2x]

[Screenshot: CleanShot 2021-06-23 at 07 22 43@2x]

Same versions, EKS cluster version 1.19, but using the Node Automation API.
Updating to 1.20 had no effect.

I downgraded pulumi/kubernetes and got the same thing.
I then upgraded again, and it worked.
So I guess it is not "broken"; there is something else at play. Rate limiting, maybe?

I destroyed it again and hit the same thing; I'm still waiting for the one time it works. However,
doing a refresh updates the status, and afterwards updates work again.

@eirikur-grid
Author

Had this happen again today. Saw very high CPU utilization from the 'pulumi-resource-kubernetes' process while waiting for the pulumi up command to finally time out.

@lkt82

lkt82 commented Jun 24, 2021

I am seeing this on an Azure AKS cluster. It is making our Pulumi pipelines very unstable, as sometimes it works and sometimes it doesn't.

@viveklak viveklak added emergent An issue that was added to the current milestone after planning p1 A bug severe enough to be the next item assigned to an engineer labels Jun 24, 2021
@viveklak
Contributor

viveklak commented Jun 24, 2021

@lkt82 and @roderik - could you confirm if you were seeing this with pulumi-kubernetes 3.4.0?

@eirikur-grid (and others) - are there additional deployments in the namespace with the stuck deployments not controlled by Pulumi? Could you provide an estimate of how many such deployments/pods (not controlled by pulumi/same namespace) are there?

Does the problem reduce/go away if the Pulumi controlled deployments are put in a dedicated namespace?

@lkt82

lkt82 commented Jun 24, 2021

@viveklak

Yes, we are using pulumi-kubernetes 3.4.0 in C#.

And a little output as well:

[1/2] Waiting for app ReplicaSet be marked available

[1/2] Waiting for app ReplicaSet be marked available (0/2 Pods available)

warning: [MinimumReplicasUnavailable] Deployment does not have minimum availability.

[1/2] Waiting for app ReplicaSet be marked available (1/2 Pods available)

error: 2 errors occurred:
* resource aad-pod-identity/aad-pod-identity-mic was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: 'aad-pod-identity-mic' timed out waiting to be Ready
* Minimum number of live Pods was not attained

@lkt82

lkt82 commented Jun 24, 2021

I am also seeing some "Throttling request" messages for the control plane endpoint in the log.

@viveklak
Contributor

@lkt82 Thanks. Actively looking into this. The throttling requests may be a red herring, but if it helps you temporarily, you can add the new skipAwait flag to the Helm chart.

@viveklak
Contributor

viveklak commented Jun 25, 2021

Actually, I do believe there is an impact from throttled requests. It looks like our change to fix #1502 makes throttling more likely (probably because of quorum reads when setting all our watches with a revision number?). Our await logic doesn't handle dropped watches, which causes the tight CPU loop that @eirikur-grid saw. That is probably what @lkt82 saw too.
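
To make the failure mode concrete, here is a toy sketch of an await loop that treats a closed watch as a signal to back off and re-open it, rather than spinning. This is not the provider's code (which is written in Go); `open_watch`, the event dicts with a `ready` key, and every parameter default are hypothetical, chosen only for illustration.

```python
import time
from typing import Callable, Iterable


def await_ready(open_watch: Callable[[], Iterable[dict]],
                timeout_s: float = 600.0,
                base_delay_s: float = 0.5,
                max_delay_s: float = 30.0,
                sleep: Callable[[float], None] = time.sleep,
                clock: Callable[[], float] = time.monotonic) -> bool:
    """Wait for a 'ready' event, re-opening the watch with backoff if it drops."""
    deadline = clock() + timeout_s
    delay = base_delay_s
    while clock() < deadline:
        for event in open_watch():
            if event.get("ready"):
                return True
            delay = base_delay_s  # the watch is delivering events, so reset backoff
        # The iterator ended without a ready event: the watch was closed or dropped.
        # Sleep before re-opening it instead of retrying immediately in a tight loop.
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
    return False
```

Without the sleep between re-opens, a watch that the server keeps closing would turn this into exactly the kind of busy loop that shows up as high CPU until the timeout fires.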

We have another bug #1635 that makes partial failures hard to recover from.

In general, #1598 seems increasingly important to fix this the right way. @lblackstone and I will prioritize fixing that in short order. For the moment, we will revert #1596 and cut a hotfix to reduce the likelihood of this.

This was referenced Jun 25, 2021
@viveklak
Contributor

v3.4.1 is out with #1596 reverted. I am working on #1598 right now. In the meantime, could folks running into this issue try with 3.4.1 and report if things improve? Thanks for your patience.

@eirikur-grid
Author

@eirikur-grid (and others) - are there additional deployments in the namespace with the stuck deployments not controlled by Pulumi? Could you provide an estimate of how many such deployments/pods (not controlled by pulumi/same namespace) are there?

There are two of them; nginx-proxy and traefik-router, with 1 and 2 replicas respectively.

@roderik

roderik commented Jun 25, 2021

0 for me

@viveklak
Contributor

@eirikur-grid @roderik thanks! Sounds good. I would expect things to get better with 3.4.1. Could you try it out and let us know?

@eirikur-grid
Author

@viveklak I just upgraded to pulumi-kubernetes 3.4.1 and attempted a deployment to our staging environment. Unfortunately, the issue is not resolved, at least not for me.
[Screenshot 2021-06-25 at 22 25 01]

@eirikur-grid
Author

@viveklak Let me know if you'd like me to enable verbose logging/tracing of some sort. I acknowledge that I would probably have to censor sensitive information in it before sending it to you.

@lkt82

lkt82 commented Jun 27, 2021

It seems that I am getting a better result. I will do some more testing in the next few days.

+  pulumi:pulumi:Stack ApplicationPlatform-prod creating I0627 06:49:30.473330    2261 request.go:655] Throttling request took 7.386626923s, request: GET:############/apis/scheduling.k8s.io/v1beta1?timeout=32s
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic creating Retry #1; creation failed: no matches for kind "AzurePodIdentityException" in version "aadpodidentity.k8s.io/v1"
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureidentitybindings.aadpodidentity.k8s.io created 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azurepodidentityexceptions.aadpodidentity.k8s.io created 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureidentities.aadpodidentity.k8s.io created 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureassignedidentities.aadpodidentity.k8s.io created 
+  kubernetes:core/v1:ServiceAccount cluster-aad-pod-identity/aad-pod-identity-mic created 
+  kubernetes:core/v1:ServiceAccount cluster-aad-pod-identity/aad-pod-identity-nmi created 
+  kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic creating [1/2] Waiting for app ReplicaSet be marked available (1/2 Pods available)
+  kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic creating Deployment initialization complete
+  kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic created Deployment initialization complete
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-kube-system/aks-addon-exception creating 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic creating 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic created 
+  kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-kube-system/aks-addon-exception created
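
To get a feel for how often and how long requests are being throttled, one could pull the delays out of log lines like the first one above. A minimal sketch follows, assuming the "Throttling request took <duration>" wording shown in that line and handling only plain `s` and `ms` durations; the function name is made up for this example.

```python
import re
from typing import List

# Matches e.g. "Throttling request took 7.386626923s" or "... took 250ms".
THROTTLE_RE = re.compile(r"Throttling request took (\d+(?:\.\d+)?)(ms|s)\b")


def throttle_delays_seconds(log_text: str) -> List[float]:
    """Return every throttling delay found in the log text, in seconds."""
    delays = []
    for value, unit in THROTTLE_RE.findall(log_text):
        seconds = float(value) / 1000.0 if unit == "ms" else float(value)
        delays.append(seconds)
    return delays


LOG = ("I0627 06:49:30.473330    2261 request.go:655] Throttling request took "
       "7.386626923s, request: GET:############/apis/scheduling.k8s.io/v1beta1?timeout=32s")

print(throttle_delays_seconds(LOG))  # prints [7.386626923]
```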

@viveklak
Contributor

@eirikur-grid I am curious how much load you are seeing on your API servers. Do you see frequent leader re-elections? We don't handle throttled or prematurely closed watches well at the moment, so if you are consistently running into the high-CPU hang, that is pretty indicative of a closed watch in my experience. If so, you might have to use the skipAwait annotation for the moment. We are actively working on eliminating the use of low-level watches, which should definitely help with this class of problems (#1639), but that needs some more baking/testing.
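
For reference, the skipAwait annotation lives under metadata.annotations with the key pulumi.com/skipAwait. Below is a minimal sketch in plain Python; the helper name and the sample metadata are made up, and using the resulting dict as the metadata of a Deployment in a pulumi_kubernetes program is assumed here rather than shown.

```python
import copy
from typing import Any, Dict

SKIP_AWAIT_ANNOTATION = "pulumi.com/skipAwait"


def with_skip_await(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the metadata with the skipAwait annotation set to "true"."""
    result = copy.deepcopy(metadata)  # leave the caller's dict untouched
    annotations = dict(result.get("annotations") or {})
    annotations[SKIP_AWAIT_ANNOTATION] = "true"
    result["annotations"] = annotations
    return result


meta = with_skip_await({"name": "grid-api-thumbnail-worker", "labels": {"app": "grid-api"}})
print(meta["annotations"])  # prints {'pulumi.com/skipAwait': 'true'}
```

Keep in mind that with this annotation Pulumi no longer waits for the Deployment to become ready, so a rollout that genuinely fails will not be reported by pulumi up.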

@lkt82 Thanks. Please keep us posted on your experience.

@eirikur-grid
Author

eirikur-grid commented Jun 28, 2021

@viveklak

Do you see frequent leader re-elections?

In the past week, I can see two spurts of LeaderElection events for our staging environment. The latter occurred on Friday at 14:34 UTC, roughly 8 hours before I tested pulumi-kubernetes 3.4.1 for that environment.

I've successfully deployed twice to our production cluster today using v3.4.1. There may be an improvement there over 3.4.0.
While this is anecdotal, I have a feeling that I more frequently have issues deploying from home than from the office.

viveklak removed the p1 label Jun 30, 2021
pulumi-bot added the resolution/fixed label Jun 30, 2021
@viveklak
Contributor

viveklak commented Jul 1, 2021

@eirikur-grid curious to see if 3.5.0 seems to unblock you?

@eirikur-grid
Author

@eirikur-grid curious to see if 3.5.0 seems to unblock you?

I've performed two deployments to our staging cluster and both went smoothly. v3.5.0 is looking 👌 so far.

@lkt82

lkt82 commented Jul 2, 2021

I can say it works for me as well. Deletion, however, is extremely slow on v3.5.0: it can take 10 more minutes for the Helm resources to be deleted.
