Regression: Resources can get into a bad state if resource client init fails #364

Closed
lblackstone opened this issue Jan 22, 2019 · 6 comments · Fixed by #664
Assignees: lblackstone
Labels: area/resource-management (Issues related to Kubernetes resource provisioning, management, await logic, and semantics generally), kind/bug (Some behavior is incorrect or out of spec)
Milestone: 0.26

Comments

@lblackstone (Member) commented Jan 22, 2019

Following #348, I ran into a bug where the provider failed to initialize resource clients, with the following errors:

kubernetes:extensions:Deployment (nginx-ingress-default-backend):
    error: Plan apply failed: Could not make client to watch Deployment "nginx-ingress-default-backend": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: not found

  pulumi:pulumi:Stack (gke-ingress-gke-ingress-dev):
    E0122 16:13:15.160706   20968 memcache.go:135] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

After this failure, I tried running `pulumi up` again, but got an error: the resource had actually been created successfully, but that status was never registered with the engine:

  kubernetes:extensions:Deployment (nginx-ingress-default-backend):
    error: Plan apply failed: deployments.extensions "nginx-ingress-default-backend" already exists

Also, a pulumi refresh did not correct the state mismatch.
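
For context on where the first error comes from: client-go's discovery walks every API group the server advertises, and a single unhealthy aggregated group (here `metrics.k8s.io/v1beta1`, typically served by metrics-server) poisons the whole listing. A minimal sketch of that behavior, not the provider's actual code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	disco, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}

	// ServerPreferredResources queries every registered API group. If one
	// aggregated group (e.g. metrics.k8s.io/v1beta1 backed by an unreachable
	// metrics-server) fails, it returns an error alongside the partial
	// results. Treating that error as fatal fails client construction even
	// though the Deployment's own API group is healthy.
	lists, err := disco.ServerPreferredResources()
	if err != nil {
		fmt.Printf("partial discovery failure: %v\n", err)
	}
	fmt.Printf("still discovered %d API resource lists\n", len(lists))
}
```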

@lblackstone added this to the 0.21 milestone Jan 22, 2019
@lblackstone self-assigned this Jan 22, 2019
@hausdorff added the kind/bug and area/resource-management labels Jan 24, 2019
@lblackstone (Member, Author) commented:

This bug is still present, but it is much less likely to be triggered now that the client logic has been cleaned up further in #367 and #414.

Removing the P1 label; I expect to fix this as part of the ongoing await logic refactoring.

@lblackstone modified the milestones: 0.21, 0.22 Feb 25, 2019
@lblackstone removed this from the 0.22 milestone Mar 18, 2019
@hausdorff (Contributor) commented:

I am confused about how this could happen. Do we have a repro, or a timeline/set of logs, to indicate how it happened? Until then I'm reluctant to believe that it's actually less likely because of the client logic cleanup.

Marking as Q3 -- can always remove it later if we decide it's not super high priority.

@lblackstone (Member, Author) commented:

@hausdorff IIRC, the problem was that the resource create request succeeded, but the watch client used by the await logic failed to initialize. We were not returning a partial failure in this case, so Pulumi had no record of the resource being created. This can cause two error cases (see the sketch after this list):

  1. auto-named resource - orphans a copy of the resource
  2. manually-named resource or cluster-scoped resource - subsequent operations fail because the resource already exists but is not tracked by the engine
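
A minimal sketch of the partial-failure idea, with hypothetical names (`partialError`, `awaitCreation`, and `watchUntilReady` are illustrations, not the provider's actual API): the create succeeds, the watch step fails, and the returned error carries the live object so the engine can still checkpoint it:

```go
package provider

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

// partialError carries the created object's live state alongside the await
// failure, so the caller can record the resource even though init failed.
type partialError struct {
	object *unstructured.Unstructured
	err    error
}

func (p *partialError) Error() string { return p.err.Error() }

// watchUntilReady stands in for the await logic; building its watch client
// is the step that failed in this issue.
func watchUntilReady(live *unstructured.Unstructured) error {
	return fmt.Errorf("could not make client to watch %q", live.GetName())
}

// awaitCreation sketches the create path: if the POST succeeds but the watch
// fails, return the live object with a partial error rather than nil with a
// bare error, so the engine still learns that the resource exists.
func awaitCreation(ctx context.Context, client dynamic.ResourceInterface, obj *unstructured.Unstructured) (*unstructured.Unstructured, error) {
	live, err := client.Create(ctx, obj, metav1.CreateOptions{})
	if err != nil {
		return nil, err // the create itself failed; nothing to track
	}
	if err := watchUntilReady(live); err != nil {
		return live, &partialError{object: live, err: err}
	}
	return live, nil
}
```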

@hausdorff added this to backlog in Q3 Kubernetes Jul 22, 2019
@lukehoban added this to the 0.26 milestone Jul 25, 2019
@lukehoban (Member) commented:

@lblackstone Do you have a repro for this?

@lblackstone (Member, Author) commented:

I don't have a repro, and the original cause was buggy resource client logic. I think we should still audit the error path here, but I don't think it's likely to occur in practice at this point.

hausdorff added a commit that referenced this issue Jul 29, 2019
This commit fixes an almost-theoretical object leak in the code that
awaits resource creation, update, or read. The issue in each case is
that the final line of each of these `await` functions is a call to
`Get`, which will fail if the cluster is unreachable, returning `nil`
instead of an object. This causes Pulumi to believe the object was not
successfully created.

This commit returns an old version of the live object instead.

Fixes #364.
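
In code terms, the fix described in the commit message looks roughly like this (`finishCreate` is a hypothetical name; the real change lives in #664's await functions): keep the last live object on hand and fall back to it when the trailing `Get` fails:

```go
package provider

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

// finishCreate sketches the pattern: the await functions end with a Get to
// refresh the live object. If the cluster is unreachable at that point,
// return the older live object we already hold instead of nil, so Pulumi
// still records the object as created.
func finishCreate(ctx context.Context, client dynamic.ResourceInterface, live *unstructured.Unstructured) (*unstructured.Unstructured, error) {
	refreshed, err := client.Get(ctx, live.GetName(), metav1.GetOptions{})
	if err != nil {
		return live, nil // stale but real; better than leaking the object
	}
	return refreshed, nil
}
```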
@hausdorff (Contributor) commented:

I'm still skeptical that this is "really a bug," but I've put up #664, which fixes a (theoretical?) object leak that could have caused something like this.

hausdorff added a commit that referenced this issue Jul 30, 2019 (same commit message as above)