New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
error: snapshot integrity failure; refusing to use it: duplicate resource <urn> (not marked for deletion) #13848
Comments
I've reproduced this behavior with the following Python program: from typing import Optional
import uuid
from pulumi import Alias, ComponentResource, Output, ResourceOptions
from pulumi.dynamic import CreateResult, Resource, ResourceProvider
class MyComponent(ComponentResource):
def __init__(self, name):
super().__init__("my:index:MyComponent", name)
class Echo(Resource):
value: Output[str]
class __EchoProvider(ResourceProvider):
def create(self, inputs):
return CreateResult(id_=str(uuid.uuid4()), outs=inputs)
def __init__(self, name: str, value: str, opts: Optional[ResourceOptions] = None) -> None:
super().__init__(Echo.__EchoProvider(), name, {"value": value}, opts)
class FailOnCreate(Resource):
class __FailOnCreateProvider(ResourceProvider):
def create(self, inputs):
raise Exception("create failed")
def __init__(self, name: str) -> None:
super().__init__(FailOnCreate.__FailOnCreateProvider(), name, {})
component = MyComponent("component")
# Step 1
res1 = Echo("res1", value="hi", opts=ResourceOptions(parent=component))
# Step 2 (comment out previous step and uncomment below)
# res1 = Echo("res1", value="hi") # unparent
# res2 = FailOnCreate("res2")
# Step 3 (comment out previous steps and uncomment below)
# res1 = Echo("res1", value="hi", opts=ResourceOptions(aliases=[Alias(parent=component)])) # add alias First run Step 2 unparents a resource without an alias, so a new one will be created and the old one deleted. We’ve also added a new resource that deliberately fails during create, so the update ends before the delete runs, but the create of the new unparented resource did run. This leaves the state of the stack having both the original resource and the new unparented resource. The original parented resource wasn’t deleted.
{
"urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
"custom": true,
"id": "6ef81de0-1586-4eb0-9383-8fcecd847860",
"type": "pulumi-python:dynamic:Resource",
"inputs": {},
"outputs": {},
"parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
"provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
},
{
"urn": "urn:pulumi:dev::pycreatefail::my:index:MyComponent$pulumi-python:dynamic:Resource::res1",
"custom": true,
"id": "91e6a499-df8e-4032-95b8-0e4b9e7ac64c",
"type": "pulumi-python:dynamic:Resource",
"inputs": {},
"outputs": {},
"parent": "urn:pulumi:dev::pycreatefail::my:index:MyComponent::component",
"provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
} Step 3 adds the alias (to prevent the whole create/delete of that resource), and when running the update, we see the
{
"urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
"custom": true,
"id": "91e6a499-df8e-4032-95b8-0e4b9e7ac64c",
"type": "pulumi-python:dynamic:Resource",
"inputs": {},
"outputs": {},
"parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
"provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
"aliases": [
"urn:pulumi:dev::pycreatefail::my:index:MyComponent$pulumi-python:dynamic:Resource::res1"
],
},
{
"urn": "urn:pulumi:dev::pycreatefail::pulumi-python:dynamic:Resource::res1",
"custom": true,
"id": "6ef81de0-1586-4eb0-9383-8fcecd847860",
"type": "pulumi-python:dynamic:Resource",
"inputs": {},
"outputs": {},
"parent": "urn:pulumi:dev::pycreatefail::pulumi:pulumi:Stack::pycreatefail-dev",
"provider": "urn:pulumi:dev::pycreatefail::pulumi:providers:pulumi-python::default::4d2fb4d6-f75b-4db2-b012-0f30f35cd187",
} Step 1 and Step 2 are behaving as expected. Step 3 is where there appears to be a case we're not handling correctly:
We should error on this, asking the user to complete the delete before attempting to alias a resource into this existing name. We do have handling for this for related cases, such as a single update that has two resources A and B that attempts to alias A to B and B to A. We do catch that case and return an error on it (including in the preview):
It seems a similar check and error should apply in this case, which will prevent the corrupt stack state. |
As I was working on the fix for this, I found that we had changed the behavior in #10819 when looking up old resources (regression), and that behavior change is the ultimate root cause that leads to the stack corruption in this case. With the behavior fixed, we no longer get into this situation where we would have to return an error to avoid the corruption. In short: - for _, urnOrAlias := range append([]resource.URN{urn}, goal.Aliases...) {
+ aliases[urn] = struct{}{}
+ for urnOrAlias := range aliases { Previously, we looked for old resources first by the resource's URN, then by any aliases: pulumi/pkg/resource/deploy/step_generator.go Lines 478 to 480 in 00e8fb6
#10819 changed this. Rather than tracking aliases as a slice it was changed to a map (a set). The resource URN was added to the map, but that doesn't guarantee that it will be looked at first when enumerating the map. If there are any aliases in the map, they will likely be looked at first (although the enumeration order for maps is unspecified). In this case, it's causing the wrong resource to be chosen. We end up looking for an old resource using the alias, which chooses the original resource. That resource's URN is updated to the new URN and the alias is recorded. But we're still left with the new resource with the same URN, i.e. duplicate resource, corrupt stack. With the behavior fixed, we look for the old resource using the URN, which chooses the new resource, and the original resource is deleted. No stack corruption. |
This change fixes a regression in the step generator when looking for old resources. When generating steps for a register resource event, we previously looked for old resources first by the resource's URN and then by aliases. This regressed with #10819: ```diff - for _, urnOrAlias := range append([]resource.URN{urn}, goal.Aliases...) { + aliases[urn] = struct{}{} + for urnOrAlias := range aliases { ``` Previously, aliases were in a slice and we always looked for the URN first, then aliases. With #10819, aliases changed to being stored in a map (a set). The URN was added to the map before iterating over it, but there's no guarantee it will be looked at first (iteration order for maps is unspecified), and with the current behavior when there are aliases in the map, the URN very likely won't come first. This can lead to duplicate resources in the state (stack corruption) when the wrong old resource is chosen. The fix is to move back to always checking for old resources using the URN first. We also move back to maintaining aliases in a slice for consistent ordering. Fixes #13848
Add a comment closer to where we actually lookup an old resource, noting that we should check the URN first before aliases and why. Follow-up from postmortem discussion on #13848
Customer has an existing stack where a resource was going to get deleted and a new one created, the stack went thru half the creation then failed because of a prometheus deployment so it seems to have never gotten to the delete calls.
They added an alias to the code, because the original delete/create shouldn't have happened then upon next run it looks like it duplicated the resource, and now the state is corrupted:
The text was updated successfully, but these errors were encountered: