Invalid stack state stored when a K8s cluster is not responding during refresh
#1485
Comments
What version of the …
Hi @leezen, My …
The issue you mentioned does seem very similar. In our case, I noticed that the stack wasn't completely deleted: the state file shrank from around 580 KB to about 100 KB, and some resources were still there (but most of them were gone). Maybe it's related to the fact that we are running it as …

If someone else is having a similar issue, our solution was to save a local backup of the corrupted state (…). As a temporary workaround, we added a …
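In case the exact commands are useful to others, here is a minimal sketch of backing up and restoring stack state with the Pulumi CLI. The comment above doesn't say which commands were used, so this is only one way to do it; the stack name `dev` and the file name are placeholders:

```sh
# Save a local backup of the (possibly corrupted) stack state as JSON.
pulumi stack export --stack dev --file state-backup.json

# After inspecting or repairing the JSON, load it back into the stack.
pulumi stack import --stack dev --file state-backup.json
```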
Thanks for the detailed report! I tracked down the cause of this behavior to pulumi-kubernetes/provider/pkg/provider/provider.go, lines 1597 to 1602 (at commit 80656f0).
I think the most reasonable fix is to return an error instead of deleting resources from the state in case of an unreachable cluster. I'll do some more testing to make sure that fix doesn't cause unintended consequences, but otherwise, the fix would look something like this:
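The proposed snippet was embedded from the repository and didn't survive here. Below is a rough, self-contained sketch of the behavior change being discussed (return an error when the cluster is unreachable instead of an empty read result); all type and function names are illustrative and not the provider's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// resourceState stands in for a resource's entry in the stack checkpoint;
// the real provider works with pulumirpc.ReadResponse instead.
type resourceState struct {
	ID         string
	Properties map[string]string
}

var errClusterUnreachable = errors.New("configured Kubernetes cluster is unreachable")

// readResource sketches the proposed fix: when the cluster cannot be reached,
// return an error so `pulumi refresh` fails, rather than returning an empty
// result that the engine interprets as "the resource no longer exists" and
// removes from the stack state.
func readResource(clusterReachable bool, live resourceState) (*resourceState, error) {
	if !clusterReachable {
		return nil, errClusterUnreachable
	}
	return &live, nil
}

func main() {
	if _, err := readResource(false, resourceState{ID: "my-deployment"}); err != nil {
		// With the fix, refresh surfaces the error and exits non-zero,
		// leaving the existing stack state untouched.
		fmt.Println("refresh failed:", err)
	}
}
```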
I've got a fix up for this in #1522. Just to clarify the reason for the loss of state data: the provider was assuming that an unreachable cluster meant the cluster had been deleted, and it was deleting the related resources from the state as a result. This primarily affected the …

Only k8s resources related to an unreachable cluster are affected, so any other resources in the stack would have remained.
We are using Pulumi for our deployment pipeline. We had a failed deployment today and noticed that during the `refresh` phase, the K8s cluster was not available (timeout). Pulumi did not seem to handle that as an error, which led to an invalid stack state being stored: in the checkpoint files, we noticed that roughly 400 KB was missing from the state file. This led to a `pulumi up` run treating the cluster as empty, so some parts of the deployment failed.

The logs from `refresh` looked like this: … and the exit code was `0`.

Expected behavior
Pulumi should throw an error, failing the refresh command (exit code > 0).
Current behavior
A network error is treated as an empty value, leading to an empty stack state.
Steps to reproduce
Context (Environment)
We are using a K8s cluster (1.18.10) in Azure AKS, with Pulumi CLI `v2.22.0`. Our cluster management API is behind a firewall, and we modify the whitelist during deployment before running Pulumi. In some cases it takes a few seconds for the firewall to propagate the rules, leading to a timeout while running Pulumi.