
Invalid stack state stored when a K8s cluster is not responding during refresh? #1485

Closed
dotansimha opened this issue Mar 4, 2021 · 4 comments · Fixed by #1522
Labels: kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)

@dotansimha

We are using Pulumi for our deployment pipeline. We had a failed deployment today and noticed that during the refresh phase the K8s cluster was not available (timeout). It seems that Pulumi did not treat this as an error, leading to an invalid stack state being stored. Looking at the checkpoint files, we noticed that about 400K of the state file is missing.

This led to a subsequent pulumi up treating the cluster as empty, so some parts of the deployment failed.

The logs from the refresh look like this:

2021-03-04T15:53:41.6652021Z   kubernetes:rbac.authorization.k8s.io/v1beta1:ClusterRole (kube-system/fluentd):
2021-03-04T15:53:41.6653122Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6653785Z  
2021-03-04T15:53:41.6654197Z   kubernetes:core/v1:Namespace (ingress-nginx):
2021-03-04T15:53:41.6655236Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6655935Z  
2021-03-04T15:53:41.6656408Z   kubernetes:core/v1:ConfigMap (ingress-nginx/ingress-nginx-controller):
2021-03-04T15:53:41.6657486Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6658140Z  
2021-03-04T15:53:41.6658620Z   kubernetes:batch/v1:Job (ingress-nginx/ingress-nginx-admission-create):
2021-03-04T15:53:41.6659694Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6660331Z  

And the exit code was 0.

Expected behavior

Pulumi should throw an error, failing the refresh command (exit code > 0).

Current behavior

The network error is treated as an empty value, leading to an empty stack state.

Steps to reproduce

  1. Deploy any stack using the K8s provider (a minimal example is sketched below).
  2. Take your cluster down.
  3. Deploy a new version of the stack.
  4. Check the state file.
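
For step 1, a minimal Pulumi program is enough. The sketch below uses the Kubernetes Go SDK purely as an illustration (our own stack is TypeScript, see the versions further down); the "repro-ns" name is made up:

    package main

    import (
        corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v2/go/kubernetes/core/v1"
        metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v2/go/kubernetes/meta/v1"
        "github.com/pulumi/pulumi/sdk/v2/go/pulumi"
    )

    func main() {
        pulumi.Run(func(ctx *pulumi.Context) error {
            // A single Namespace is enough: once the API server stops responding,
            // refreshing this resource triggers the warning-and-delete behavior above.
            _, err := corev1.NewNamespace(ctx, "repro-ns", &corev1.NamespaceArgs{
                Metadata: &metav1.ObjectMetaArgs{
                    Name: pulumi.String("repro-ns"),
                },
            })
            return err
        })
    }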

Context (Environment)

We are using a K8s cluster (1.18.10) on Azure AKS with Pulumi CLI v2.22.0. Our cluster management API is behind a firewall, and we modify the whitelist during deployment before running Pulumi. In some cases it takes a few seconds for the firewall rules to propagate, leading to a timeout while running Pulumi.

leezen (Contributor) commented Mar 4, 2021

What version of the pulumi-kubernetes provider are you using? I believe this was an issue that was fixed in #1379 to avoid deleting state when the cluster is unreachable.

leezen transferred this issue from pulumi/pulumi on Mar 4, 2021
leezen added the kind/bug label on Mar 4, 2021
dotansimha (Author) commented Mar 5, 2021

Hi @leezen,
Thank you for the quick reply (and for transferring to the right place).

My package.json depends on these versions:

    "@pulumi/azure": "3.49.0",
    "@pulumi/kubernetes": "2.8.2",
    "@pulumi/pulumi": "2.22.0",

I also checked the yarn.lock; these versions are fetched correctly.

The issue you mentioned does seem very similar. In our case, I noticed that the stack wasn't completely deleted: the state file shrank from around 580KB to 100KB, and some resources were still there (but most of them are gone). Maybe it's related to the fact that we are running it as pulumi refresh --yes?

If someone else is hitting a similar issue, our recovery was: save a local backup of the corrupted state (pulumi stack export), take the last valid checkpoint file from our backups (we use Azure Blob Storage to store the state), copy its checkpoint.latest section over the corrupted state, and then import the result with pulumi stack import, as sketched below.
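
The splice step can be scripted if needed. This is only a sketch of what we did by hand, and it assumes the usual file layout (a checkpoint.latest section in the backup checkpoint, and a top-level deployment key in the exported state); the file names are placeholders:

    package main

    import (
        "encoding/json"
        "log"
        "os"
    )

    func readJSON(path string) map[string]interface{} {
        data, err := os.ReadFile(path)
        if err != nil {
            log.Fatal(err)
        }
        var m map[string]interface{}
        if err := json.Unmarshal(data, &m); err != nil {
            log.Fatal(err)
        }
        return m
    }

    func main() {
        // Placeholder names: the last valid checkpoint pulled from blob storage,
        // and the corrupted state saved with `pulumi stack export`.
        backup := readJSON("backup-checkpoint.json")
        exported := readJSON("corrupted-export.json")

        // Assumption: the checkpoint nests the deployment under checkpoint.latest,
        // while the exported state keeps it under a top-level "deployment" key.
        checkpoint, ok := backup["checkpoint"].(map[string]interface{})
        if !ok {
            log.Fatal("backup file has no checkpoint section")
        }
        exported["deployment"] = checkpoint["latest"]

        out, err := json.MarshalIndent(exported, "", "    ")
        if err != nil {
            log.Fatal(err)
        }
        // Re-import the result with `pulumi stack import --file repaired-state.json`.
        if err := os.WriteFile("repaired-state.json", out, 0o644); err != nil {
            log.Fatal(err)
        }
    }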

As a temporary workaround, we added a kubectl cluster-info dump before deployment, just to make sure the cluster is really available before running Pulumi.

leezen added this to the 0.54 milestone on Mar 16, 2021
infin8x modified the milestones: 0.54, 0.55 on Apr 5, 2021
lblackstone (Member) commented:

Thanks for the detailed report! I tracked down the cause of this behavior to the following check in the provider:

    // If the cluster is unreachable, consider the resource deleted and inform the user.
    if k.clusterUnreachable {
        _ = k.host.Log(ctx, diag.Warning, urn, fmt.Sprintf(
            "configured Kubernetes cluster is unreachable: %s", k.clusterUnreachableReason))
        return deleteResponse, nil
    }

I think the most reasonable fix is to return an error instead of deleting resources from the state in case of an unreachable cluster. I'll do some more testing to make sure that fix doesn't cause unintended consequences, but otherwise, the fix would look something like this:

     Type                              Name                 Plan        Info
     pulumi:pulumi:Stack               pulumi-k8s-test-dev              1 error
 ~   └─ kubernetes:apps/v1:Deployment  foo                  refresh     1 error; 1 warning

Diagnostics:
  kubernetes:apps/v1:Deployment (foo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "https://kubernetes.docker.internal:6444/openapi/v2?timeout=32s": dial tcp 127.0.0.1:6444: connect: connection refused
    error: Preview failed: failed to read resource state due to unreachable cluster
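
As a sketch only (the final implementation may differ), the change to the snippet above would be to surface an error rather than returning the delete response:

    // Sketch: keep the state intact and fail the read when the cluster is unreachable,
    // rather than treating the resource as deleted.
    if k.clusterUnreachable {
        _ = k.host.Log(ctx, diag.Warning, urn, fmt.Sprintf(
            "configured Kubernetes cluster is unreachable: %s", k.clusterUnreachableReason))
        return nil, fmt.Errorf("failed to read resource state due to unreachable cluster")
    }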

lblackstone commented Apr 9, 2021

I've got a fix up for this in #1522

To clarify the reason for the loss of state data: the provider was assuming that an unreachable cluster meant the resource had been deleted, and was deleting the related resource from the state as a result. This primarily affected the pulumi refresh operation, which would delete the resource from the state with only a warning. As @dotansimha mentioned, the remedy for resources that were unintentionally deleted from the state like this is to revert to the previous checkpoint state. (Only the state is affected, not the actual k8s resources.)

The state file was changed from around 580KB to 100KB, and some resources were still there (but most of them are gone)

Only k8s resources related to an unreachable k8s cluster are affected, so any other resources in the stack would have remained.
