Regression: Resources can get into a bad state if resource client init fails #364

Closed
lblackstone opened this issue Jan 22, 2019 · 6 comments · Fixed by #664
Assignees: lblackstone
Labels: area/resource-management (Issues related to Kubernetes resource provisioning, management, await logic, and semantics generally), kind/bug (Some behavior is incorrect or out of spec)
Milestone: 0.26

Comments

@lblackstone (Member) commented Jan 22, 2019

Following #348, I ran into a bug where the provider failed to initialize resource clients, with the following errors:

kubernetes:extensions:Deployment (nginx-ingress-default-backend):
    error: Plan apply failed: Could not make client to watch Deployment "nginx-ingress-default-backend": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: not found

  pulumi:pulumi:Stack (gke-ingress-gke-ingress-dev):
    E0122 16:13:15.160706   20968 memcache.go:135] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

After this failure, I tried running `pulumi up` again, but got an error: the resource had actually been created successfully, but that status was never registered with the engine:

  kubernetes:extensions:Deployment (nginx-ingress-default-backend):
    error: Plan apply failed: deployments.extensions "nginx-ingress-default-backend" already exists

Also, a pulumi refresh did not correct the state mismatch.
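
For context on where the first error comes from: client-go's discovery walks every API group the server advertises, and a single unhealthy aggregated group (here `metrics.k8s.io/v1beta1`, typically served by metrics-server) poisons the whole listing. A minimal sketch of that behavior, not the provider's actual code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl does.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	disco, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}

	// ServerPreferredResources queries every registered API group. If one
	// aggregated group (e.g. metrics.k8s.io/v1beta1 backed by an unreachable
	// metrics-server) fails, it returns an error alongside the partial
	// results. Treating that error as fatal fails client construction even
	// though the Deployment's own API group is healthy.
	lists, err := disco.ServerPreferredResources()
	if err != nil {
		fmt.Printf("partial discovery failure: %v\n", err)
	}
	fmt.Printf("still discovered %d API resource lists\n", len(lists))
}
```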

@lblackstone added this to the 0.21 milestone Jan 22, 2019
@lblackstone self-assigned this Jan 22, 2019
@hausdorff added the kind/bug and area/resource-management labels Jan 24, 2019
@lblackstone (Member, Author) commented:

This bug is still present, but it is much less likely to be triggered now that the client logic has been cleaned up further in #367 and #414.

Removing the P1 label; I expect to fix this as part of the ongoing await logic refactoring.

@lblackstone modified the milestones: 0.21, 0.22 Feb 25, 2019
@lblackstone removed this from the 0.22 milestone Mar 18, 2019
@hausdorff (Contributor) commented:

I am confused about how this could happen. Do we have a repro, or a timeline/set of logs, to indicate how it happened? Until then I'm reluctant to believe that it's actually less likely because of the client logic cleanup.

Marking as Q3 -- can always remove it later if we decide it's not super high priority.

@lblackstone (Member, Author) commented:

@hausdorff IIRC, the problem was that the resource create request succeeded, but the watch client used by the await logic failed to initialize. We were not returning a partial failure in this case, so Pulumi had no record of the resource being created. This can cause two error cases (see the sketch after this list):

  1. auto-named resource - orphans a copy of the resource
  2. manually-named resource or cluster-scoped resource - subsequent operations fail because the resource already exists but is not tracked by the engine
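
A minimal sketch of the partial-failure idea, with hypothetical names (`partialError`, `awaitCreation`, and `watchUntilReady` are illustrations, not the provider's actual API): the create succeeds, the watch step fails, and the returned error carries the live object so the engine can still checkpoint it:

```go
package provider

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

// partialError carries the created object's live state alongside the await
// failure, so the caller can record the resource even though init failed.
type partialError struct {
	object *unstructured.Unstructured
	err    error
}

func (p *partialError) Error() string { return p.err.Error() }

// watchUntilReady stands in for the await logic; building its watch client
// is the step that failed in this issue.
func watchUntilReady(live *unstructured.Unstructured) error {
	return fmt.Errorf("could not make client to watch %q", live.GetName())
}

// awaitCreation sketches the create path: if the POST succeeds but the watch
// fails, return the live object with a partial error rather than nil with a
// bare error, so the engine still learns that the resource exists.
func awaitCreation(ctx context.Context, client dynamic.ResourceInterface, obj *unstructured.Unstructured) (*unstructured.Unstructured, error) {
	live, err := client.Create(ctx, obj, metav1.CreateOptions{})
	if err != nil {
		return nil, err // the create itself failed; nothing to track
	}
	if err := watchUntilReady(live); err != nil {
		return live, &partialError{object: live, err: err}
	}
	return live, nil
}
```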

@hausdorff added this to backlog in Q3 Kubernetes Jul 22, 2019
@lukehoban added this to the 0.26 milestone Jul 25, 2019
@lukehoban (Member) commented:

@lblackstone Do you have a repro for this?

@lblackstone (Member, Author) commented:

I don't have a repro, and the original cause was buggy resource client logic. I think we should still audit the error path here, but I don't think it's likely to occur in practice at this point.

hausdorff added a commit that referenced this issue Jul 29, 2019
This commit fixes an almost-theoretical object leak in the code that
awaits resource creation, update, or read. The issue in each case is
that the final line of each of these `await` functions is a call to
`Get`, which will fail if the cluster is unreachable, returning `nil`
instead of an object. This causes Pulumi to believe the object was not
successfully created.

This commit returns an old version of the live object instead.

Fixes #364.
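
In code terms, the fix described in the commit message looks roughly like this (`finishCreate` is a hypothetical name; the real change lives in #664's await functions): keep the last live object on hand and fall back to it when the trailing `Get` fails:

```go
package provider

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

// finishCreate sketches the pattern: the await functions end with a Get to
// refresh the live object. If the cluster is unreachable at that point,
// return the older live object we already hold instead of nil, so Pulumi
// still records the object as created.
func finishCreate(ctx context.Context, client dynamic.ResourceInterface, live *unstructured.Unstructured) (*unstructured.Unstructured, error) {
	refreshed, err := client.Get(ctx, live.GetName(), metav1.GetOptions{})
	if err != nil {
		return live, nil // stale but real; better than leaking the object
	}
	return refreshed, nil
}
```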
@hausdorff (Contributor) commented:

I'm still skeptical that this is "really a bug," but I've put up #664, which fixes a (theoretical?) object leak that could have caused something like this.

hausdorff added a commit that referenced this issue Jul 30, 2019 (same commit message as above)