
Invalid stack state stored when a K8s cluster is not responding during refresh? #1485

Closed
dotansimha opened this issue Mar 4, 2021 · 4 comments · Fixed by #1522
Labels: kind/bug (Some behavior is incorrect or out of spec), resolution/fixed (This issue was fixed)

@dotansimha

We are using Pulumi for our deployment pipeline. We had a failed deployment today and noticed that during the refresh phase the K8s cluster was not available (timeout). It seems that Pulumi did not treat this as an error, leading to an invalid stack state being stored. Looking at the checkpoint files, we noticed that about 400K of the state file is missing.

This led to a subsequent pulumi up treating the cluster as empty, so some parts of the deployment failed.

The logs from the refresh look like this:

2021-03-04T15:53:41.6652021Z   kubernetes:rbac.authorization.k8s.io/v1beta1:ClusterRole (kube-system/fluentd):
2021-03-04T15:53:41.6653122Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6653785Z  
2021-03-04T15:53:41.6654197Z   kubernetes:core/v1:Namespace (ingress-nginx):
2021-03-04T15:53:41.6655236Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6655935Z  
2021-03-04T15:53:41.6656408Z   kubernetes:core/v1:ConfigMap (ingress-nginx/ingress-nginx-controller):
2021-03-04T15:53:41.6657486Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6658140Z  
2021-03-04T15:53:41.6658620Z   kubernetes:batch/v1:Job (ingress-nginx/ingress-nginx-admission-create):
2021-03-04T15:53:41.6659694Z     warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "k8s/openapi/v2?timeout=32s": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
2021-03-04T15:53:41.6660331Z  

And the exit code was 0.

Expected behavior

Pulumi should throw an error, failing the refresh command (exit code > 0).

Current behavior

The network error is treated as an empty value, leading to an empty stack state.

Steps to reproduce

  1. Deploy any stack using the K8s provider (a minimal example is sketched below).
  2. Take your cluster down.
  3. Deploy a new version of the stack.
  4. Check the state file.
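
For step 1, a minimal Pulumi program is enough. The sketch below uses the Kubernetes Go SDK purely as an illustration (our own stack is TypeScript, see the versions further down); the "repro-ns" name is made up:

    package main

    import (
        corev1 "github.com/pulumi/pulumi-kubernetes/sdk/v2/go/kubernetes/core/v1"
        metav1 "github.com/pulumi/pulumi-kubernetes/sdk/v2/go/kubernetes/meta/v1"
        "github.com/pulumi/pulumi/sdk/v2/go/pulumi"
    )

    func main() {
        pulumi.Run(func(ctx *pulumi.Context) error {
            // A single Namespace is enough: once the API server stops responding,
            // refreshing this resource triggers the warning-and-delete behavior above.
            _, err := corev1.NewNamespace(ctx, "repro-ns", &corev1.NamespaceArgs{
                Metadata: &metav1.ObjectMetaArgs{
                    Name: pulumi.String("repro-ns"),
                },
            })
            return err
        })
    }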

Context (Environment)

We are using a K8s cluster (1.18.10) on Azure AKS with Pulumi CLI v2.22.0. Our cluster management API is behind a firewall, and we modify the whitelist during deployment before running Pulumi. In some cases it takes a few seconds for the firewall rules to propagate, leading to a timeout while running Pulumi.

leezen (Contributor) commented Mar 4, 2021

What version of the pulumi-kubernetes provider are you using? I believe this was an issue that was fixed in #1379 to avoid deleting state when the cluster is unreachable.

leezen transferred this issue from pulumi/pulumi on Mar 4, 2021
leezen added the kind/bug label on Mar 4, 2021
dotansimha (Author) commented Mar 5, 2021

Hi @leezen,
Thank you for the quick reply (and for transferring to the right place).

My package.json depends on these versions:

    "@pulumi/azure": "3.49.0",
    "@pulumi/kubernetes": "2.8.2",
    "@pulumi/pulumi": "2.22.0",

I also checked the yarn.lock; these versions are fetched correctly.

The issue you mentioned does seem very similar. In our case, I noticed that the stack wasn't completely deleted: the state file shrank from around 580KB to 100KB, and some resources were still there (but most of them are gone). Maybe it's related to the fact that we are running it as pulumi refresh --yes?

If someone else is hitting a similar issue, our recovery was: save a local backup of the corrupted state (pulumi stack export), take the last valid checkpoint file from our backups (we use Azure Blob Storage to store the state), copy its checkpoint.latest section over the corrupted state, and then import the result with pulumi stack import, as sketched below.
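
The splice step can be scripted if needed. This is only a sketch of what we did by hand, and it assumes the usual file layout (a checkpoint.latest section in the backup checkpoint, and a top-level deployment key in the exported state); the file names are placeholders:

    package main

    import (
        "encoding/json"
        "log"
        "os"
    )

    func readJSON(path string) map[string]interface{} {
        data, err := os.ReadFile(path)
        if err != nil {
            log.Fatal(err)
        }
        var m map[string]interface{}
        if err := json.Unmarshal(data, &m); err != nil {
            log.Fatal(err)
        }
        return m
    }

    func main() {
        // Placeholder names: the last valid checkpoint pulled from blob storage,
        // and the corrupted state saved with `pulumi stack export`.
        backup := readJSON("backup-checkpoint.json")
        exported := readJSON("corrupted-export.json")

        // Assumption: the checkpoint nests the deployment under checkpoint.latest,
        // while the exported state keeps it under a top-level "deployment" key.
        checkpoint, ok := backup["checkpoint"].(map[string]interface{})
        if !ok {
            log.Fatal("backup file has no checkpoint section")
        }
        exported["deployment"] = checkpoint["latest"]

        out, err := json.MarshalIndent(exported, "", "    ")
        if err != nil {
            log.Fatal(err)
        }
        // Re-import the result with `pulumi stack import --file repaired-state.json`.
        if err := os.WriteFile("repaired-state.json", out, 0o644); err != nil {
            log.Fatal(err)
        }
    }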

As a temporary workaround, we added a kubectl cluster-info dump before deployment, just to make sure the cluster is really available before running Pulumi.

leezen added this to the 0.54 milestone on Mar 16, 2021
infin8x modified the milestones: 0.54, 0.55 on Apr 5, 2021
lblackstone (Member) commented:

Thanks for the detailed report! I tracked down the cause of this behavior to the following check in the provider:

    // If the cluster is unreachable, consider the resource deleted and inform the user.
    if k.clusterUnreachable {
        _ = k.host.Log(ctx, diag.Warning, urn, fmt.Sprintf(
            "configured Kubernetes cluster is unreachable: %s", k.clusterUnreachableReason))
        return deleteResponse, nil
    }

I think the most reasonable fix is to return an error instead of deleting resources from the state in case of an unreachable cluster. I'll do some more testing to make sure that fix doesn't cause unintended consequences, but otherwise, the fix would look something like this:

     Type                              Name                 Plan        Info
     pulumi:pulumi:Stack               pulumi-k8s-test-dev              1 error
 ~   └─ kubernetes:apps/v1:Deployment  foo                  refresh     1 error; 1 warning

Diagnostics:
  kubernetes:apps/v1:Deployment (foo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "https://kubernetes.docker.internal:6444/openapi/v2?timeout=32s": dial tcp 127.0.0.1:6444: connect: connection refused
    error: Preview failed: failed to read resource state due to unreachable cluster
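
As a sketch only (the final implementation may differ), the change to the snippet above would be to surface an error rather than returning the delete response:

    // Sketch: keep the state intact and fail the read when the cluster is unreachable,
    // rather than treating the resource as deleted.
    if k.clusterUnreachable {
        _ = k.host.Log(ctx, diag.Warning, urn, fmt.Sprintf(
            "configured Kubernetes cluster is unreachable: %s", k.clusterUnreachableReason))
        return nil, fmt.Errorf("failed to read resource state due to unreachable cluster")
    }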

lblackstone commented Apr 9, 2021

I've got a fix up for this in #1522

To clarify the reason for the loss of state data: the provider was assuming that an unreachable cluster meant the resource had been deleted, and was deleting the related resource from the state as a result. This primarily affected the pulumi refresh operation, which would delete the resource from the state with only a warning. As @dotansimha mentioned, the remedy for resources that were unintentionally deleted from the state like this is to revert to the previous checkpoint state. (Only the state is affected, not the actual k8s resources.)

The state file was changed from around 580KB to 100KB, and some resources were still there (but most of them are gone)

Only k8s resources related to an unreachable k8s cluster are affected, so any other resources in the stack would have remained.
