Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error: snapshot integrity failure; refusing to use it: duplicate resource <urn> (not marked for deletion) #13848

Closed
justinvp opened this issue Aug 31, 2023 · 2 comments · Fixed by #13883
Assignees
Labels
area/engine Pulumi engine customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs impact/regression Something that used to work, but is now broken kind/bug Some behavior is incorrect or out of spec p0 Bugs severe enough to interrupt existing work resolution/fixed This issue was fixed
Milestone

Comments

@justinvp
Copy link
Member

justinvp commented Aug 31, 2023

Customer has an existing stack where a resource was going to get deleted and a new one created, the stack went thru half the creation then failed because of a prometheus deployment so it seems to have never gotten to the delete calls.

They added an alias to the code, because the original delete/create shouldn't have happened then upon next run it looks like it duplicated the resource, and now the state is corrupted:

Previewing update (dev):
error: snapshot integrity failure; refusing to use it: duplicate resource urn:pulumi:dev::dataflow-services-dev::aws:ec2/vpcEndpointServiceAllowedPrinciple:VpcEndpointServiceAllowedPrinciple::name (not marked for deletion)
@justinvp justinvp added kind/bug Some behavior is incorrect or out of spec p0 Bugs severe enough to interrupt existing work area/engine Pulumi engine labels Aug 31, 2023
@justinvp justinvp added this to the 0.93 milestone Aug 31, 2023
@justinvp justinvp self-assigned this Aug 31, 2023
@justinvp justinvp added customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs labels Sep 1, 2023
@justinvp
Copy link
Member Author

justinvp commented Sep 5, 2023

I've reproduced this behavior with the following Python program:

from typing import Optional
import uuid
from pulumi import Alias, ComponentResource, Output, ResourceOptions
from pulumi.dynamic import CreateResult, Resource, ResourceProvider


class MyComponent(ComponentResource):
    def __init__(self, name):
        super().__init__("my:index:MyComponent", name)


class Echo(Resource):
    value: Output[str]

    class __EchoProvider(ResourceProvider):
        def create(self, inputs):
            return CreateResult(id_=str(uuid.uuid4()), outs=inputs)

    def __init__(self, name: str, value: str, opts: Optional[ResourceOptions] = None) -> None:
        super().__init__(Echo.__EchoProvider(), name, {"value": value}, opts)


class FailOnCreate(Resource):
    class __FailOnCreateProvider(ResourceProvider):
        def create(self, inputs):
            raise Exception("create failed")

    def __init__(self, name: str) -> None:
        super().__init__(FailOnCreate.__FailOnCreateProvider(), name, {})


component = MyComponent("component")


# Step 1
res1 = Echo("res1", value="hi", opts=ResourceOptions(parent=component))

# Step 2 (comment out previous step and uncomment below)
# res1 = Echo("res1", value="hi") # unparent
# res2 = FailOnCreate("res2")

# Step 3 (comment out previous steps and uncomment below)
# res1 = Echo("res1", value="hi", opts=ResourceOptions(aliases=[Alias(parent=component)])) # add alias

First run pulumi up for Step 1.

Step 2 unparents a resource without an alias, so a new one will be created and the old one deleted. We’ve also added a new resource that deliberately fails during create, so the update ends before the delete runs, but the create of the new unparented resource did run. This leaves the state of the stack having both the original resource and the new unparented resource. The original parented resource wasn’t deleted.


$ pulumi up
Previewing update (dev)

     Type                                  Name              Plan
     pulumi:pulumi:Stack                   pycreatefail-dev
 +   ├─ pulumi-python:dynamic:Resource     res2              create
 +   ├─ pulumi-python:dynamic:Resource     res1              create
     └─ my:index:MyComponent               component
 -      └─ pulumi-python:dynamic:Resource  res1              delete


Resources:
    + 2 to create
    - 1 to delete
    3 changes. 2 unchanged

Do you want to perform this update? yes
Updating (dev)

     Type                               Name              Status                  Info
     pulumi:pulumi:Stack                pycreatefail-dev  **failed**              1 error
 +   ├─ pulumi-python:dynamic:Resource  res2              **creating failed**     1 error
 +   └─ pulumi-python:dynamic:Resource  res1              created (0.49s)


Diagnostics:
  pulumi:pulumi:Stack (pycreatefail-dev):
    error: update failed

  pulumi-python:dynamic:Resource (res2):
    error: Exception calling application: create failed

Resources:
    + 1 created
    2 unchanged

Duration: 2s
{
    "urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
    "custom": true,
    "id": "6ef81de0-1586-4eb0-9383-8fcecd847860",
    "type": "pulumi-python:dynamic:Resource",
    "inputs": {},
    "outputs": {},
    "parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
    "provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
},
{
    "urn": "urn:pulumi:dev::pycreatefail::my:index:MyComponent$pulumi-python:dynamic:Resource::res1",
    "custom": true,
    "id": "91e6a499-df8e-4032-95b8-0e4b9e7ac64c",
    "type": "pulumi-python:dynamic:Resource",
    "inputs": {},
    "outputs": {},
    "parent": "urn:pulumi:dev::pycreatefail::my:index:MyComponent::component",
    "provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
}

Step 3 adds the alias (to prevent the whole create/delete of that resource), and when running the update, we see the snapshot integrity failure; refusing to use it: duplicate resource error.

$ pulumi up
Previewing update (dev)

     Type                                  Name              Plan
     pulumi:pulumi:Stack                   pycreatefail-dev
     └─ my:index:MyComponent               component
 -      └─ pulumi-python:dynamic:Resource  res1              delete


Resources:
    - 1 to delete
    3 unchanged

Do you want to perform this update? yes
Updating (dev)

     Type                               Name              Status         Info
     pulumi:pulumi:Stack                pycreatefail-dev  **failed**     1 error
     └─ pulumi-python:dynamic:Resource  res1              **failed**     1 error


Diagnostics:
  pulumi:pulumi:Stack (pycreatefail-dev):
    error: update failed

  pulumi-python:dynamic:Resource (res1):
    error: post-step event returned an error: failed to verify snapshot: duplicate resource urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1 (not marked for deletion)

Resources:
    3 unchanged

Duration: 1s
{
    "urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
    "custom": true,
    "id": "91e6a499-df8e-4032-95b8-0e4b9e7ac64c",
    "type": "pulumi-python:dynamic:Resource",
    "inputs": {},
    "outputs": {},
    "parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
    "provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
    "aliases": [
        "urn:pulumi:dev::pycreatefail::my:index:MyComponent$pulumi-python:dynamic:Resource::res1"
    ],
},
{
    "urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
    "custom": true,
    "id": "6ef81de0-1586-4eb0-9383-8fcecd847860",
    "type": "pulumi-python:dynamic:Resource",
    "inputs": {},
    "outputs": {},
    "parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
    "provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
}

Step 1 and Step 2 are behaving as expected. Step 3 is where there appears to be a case we're not handling correctly:

  1. There are two resources with different URNs in the state file
  2. The user program is asking to (a) alias one of them into the name of the other and (b) delete the original one with that name
  3. In this case, we currently allow the rename to happen without removing the other one from state - this leads to a corrupt checkpoint

We should error on this, asking the user to complete the delete before attempting to alias a resource into this existing name.

We do have handling for this for related cases, such as a single update that has two resources A and B that attempts to alias A to B and B to A. We do catch that case and return an error on it (including in the preview):

  pulumi-python:dynamic:Resource (res1):
    error: Duplicate resource URN 'urn:pulumi:dev2::pendingdelete::pulumi-python:dynamic:Resource::res1' conflicting with alias on resource with URN 'urn:pulumi:dev2::pendingdelete::pulumi-python:dynamic:Resource::res2'
    error: Duplicate resource alias 'urn:pulumi:dev2::pendingdelete::pulumi-python:dynamic:Resource::res2' applied to resource with URN 'urn:pulumi:dev2::pendingdelete::pulumi-python:dynamic:Resource::res1' conflicting with resource with URN 'urn:pulumi:dev2::pendingdelete::pulumi-python:dynamic:Resource::res1'

It seems a similar check and error should apply in this case, which will prevent the corrupt stack state.

@justinvp justinvp added the impact/regression Something that used to work, but is now broken label Sep 6, 2023
@justinvp
Copy link
Member Author

justinvp commented Sep 6, 2023

As I was working on the fix for this, I found that we had changed the behavior in #10819 when looking up old resources (regression), and that behavior change is the ultimate root cause that leads to the stack corruption in this case. With the behavior fixed, we no longer get into this situation where we would have to return an error to avoid the corruption.

The change: https://github.com/pulumi/pulumi/pull/10819/files#diff-23931859ebe7b762c1c12b464bc738bc5a4d59a8b549ddf9784065ffbfee7fdbL353-R451

In short:

-	for _, urnOrAlias := range append([]resource.URN{urn}, goal.Aliases...) {
+	aliases[urn] = struct{}{}
+	for urnOrAlias := range aliases {

Previously, we looked for old resources first by the resource's URN, then by any aliases:

// Check for an old resource so that we can figure out if this is a create, delete, etc., and/or
// to diff. We look up first by URN and then by any provided aliases. If it is found using an
// alias, record that alias so that we do not delete the aliased resource later.

#10819 changed this. Rather than tracking aliases as a slice it was changed to a map (a set). The resource URN was added to the map, but that doesn't guarantee that it will be looked at first when enumerating the map. If there are any aliases in the map, they will likely be looked at first (although the enumeration order for maps is unspecified).

In this case, it's causing the wrong resource to be chosen. We end up looking for an old resource using the alias, which chooses the original resource. That resource's URN is updated to the new URN and the alias is recorded. But we're still left with the new resource with the same URN, i.e. duplicate resource, corrupt stack.

With the behavior fixed, we look for the old resource using the URN, which chooses the new resource, and the original resource is deleted. No stack corruption.

github-merge-queue bot pushed a commit that referenced this issue Sep 6, 2023
This change fixes a regression in the step generator when looking for
old resources. When generating steps for a register resource event, we
previously looked for old resources first by the resource's URN and then
by aliases.

This regressed with #10819:

```diff
-	for _, urnOrAlias := range append([]resource.URN{urn}, goal.Aliases...) {
+	aliases[urn] = struct{}{}
+	for urnOrAlias := range aliases {
```

Previously, aliases were in a slice and we always looked for the URN
first, then aliases.

With #10819, aliases changed to being stored in a map (a set). The URN
was added to the map before iterating over it, but there's no guarantee
it will be looked at first (iteration order for maps is unspecified),
and with the current behavior when there are aliases in the map, the URN
very likely won't come first.

This can lead to duplicate resources in the state (stack corruption)
when the wrong old resource is chosen.

The fix is to move back to always checking for old resources using the
URN first. We also move back to maintaining aliases in a slice for
consistent ordering.

Fixes #13848
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Sep 6, 2023
github-merge-queue bot pushed a commit that referenced this issue Sep 11, 2023
Add a comment closer to where we actually lookup an old resource, noting
that we should check the URN first before aliases and why.

Follow-up from postmortem discussion on
#13848
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/engine Pulumi engine customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs impact/regression Something that used to work, but is now broken kind/bug Some behavior is incorrect or out of spec p0 Bugs severe enough to interrupt existing work resolution/fixed This issue was fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants