Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for idempotent replacements #9903

Closed
wants to merge 9 commits into from
Closed

Conversation

Frassle
Copy link
Member

@Frassle Frassle commented Jun 21, 2022

Description

Some provider Create calls are idempotent, so when we ask for some state to be created if it already exists the provider just returns OK. This confuses the engine as it thinks a new resource has been created, so when it goes to delete the old resource being replaced it actually deletes the cloud resource that was also backing the new logical resource.

This changes the engine to error in these cases (where it can see the old and new ID are the same), an alternative would be to accept this and just logically delete the old resource from state but not issue the provider delete call for it (i.e. set delete=true and retainOnDelete=true). I think that might be more user friendly but there might be some more obscure corner cases it could trip up on that I haven't thought of.

Part of fixing pulumi/pulumi-aws#2009

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrants bumping the Pulumi Service API version

@AaronFriel
Copy link
Member

This changes the engine to error in these cases (where it can see the old and new ID are the same), an alternative would be to accept this and just logically delete the old resource from state but not issue the provider delete call for it (i.e. set delete=true and retainOnDelete=true). I think that might be more user friendly but there might be some more obscure corner cases it could trip up on that I haven't thought of.

If we never call Delete, it's fine for idempotent creates, but for non-idempotent creates it seems likely to result in a resource that's orphaned and outside of Pulumi's management, and the create will fail.

I think erroring on overlapping IDs is OK, but I wonder if introducing the error here is a breaking change for any consumers whose code is working fine somehow but relies on these overlapping IDs.

Hypothetically, and not necessarily a blocker for this: Is there a way to make this a warning and add a no-op engine event that contains the resource type that ends up being relayed to the service when users are using the SaaS backend? Could we then query our engine events to determine how often these warnings occur and on what resource types?

@Frassle
Copy link
Member Author

Frassle commented Jun 21, 2022

If we never call Delete, it's fine for idempotent creates, but for non-idempotent creates it seems likely to result in a resource that's orphaned and outside of Pulumi's management, and the create will fail.
I think erroring on overlapping IDs is OK, but I wonder if introducing the error here is a breaking change for any consumers whose code is working fine somehow but relies on these overlapping IDs.

Yes both of these depend on if every instance of returning the same ID is because of idempotent creates or if there are other resources where the ID can be the same but they refer to different resource objects. The latter seems incredibly unlikely to me, and would probably be the source of a multitude of bugs (how would import work for such a resource for example).

Hypothetically, and not necessarily a blocker for this: Is there a way to make this a warning and add a no-op engine event that contains the resource type that ends up being relayed to the service when users are using the SaaS backend? Could we then query our engine events to determine how often these warnings occur and on what resource types?

Maybe, but it would take some engine re-jigging to allow one step to issue a new step.

@t0yv0
Copy link
Member

t0yv0 commented Jun 21, 2022

I'm a bit uneasy with throwing an error. Can this break working programs, or those were broken already by actually deleting the cloud resources but believing them to be intact? Is there a way to make the error more actionable to the user? Can we explain why we're recreating the resource and what the user can do to either avoid recreating or make progress otherwise?

an alternative would be to accept this and just logically delete the old resource from state but not issue the provider delete call for it

This feels right-er to me, i.e. in pseudo-code the higher level logical operation is replace, that'd be right to not call provider.Delete but accept the in-place version as the "replaced" one e.g.:

     def recreate(provider, oldId):
       newId = provider.Create(...)
       if oldId != newId:
         provider.Delete(oldId)
       return newId

I appreciate this might be a deeper surgery here though so wary of obscure corner cases it could trip up also.

but for non-idempotent creates it seems likely to result in a resource that's orphaned and outside of Pulumi's management, and the create will fail.

I'd like to unroll this to understand better as I don't see how this happens.

@@ -252,6 +252,11 @@ func (s *CreateStep) Apply(preview bool) (resource.Status, StepCompleteFunc, err
s.new.Outputs = outs
}

// If we've created a new resource as a replacement for an old one check that the ID is a new ID.
if s.replacing && s.new.Custom && !preview && s.new.ID == s.old.ID {
Copy link
Member

@t0yv0 t0yv0 Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious does it need !preview.

Also curious below s.replacing && s.pendingDelete check. Will it not try to delete unless s.old.Delete = true runs? Then maybe we should restrict the error further to s.pendingDelete as well as s.replacing?

Copy link
Member Author

@Frassle Frassle Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious does it need !preview.

Probably not we won't have an ID at preview time.

Will it not try to delete unless s.old.Delete = true runs

Correct. This ensure that the resource doesn't get deleted in the case where ID comes back the same, we error out of the engine first.

Then maybe we should restrict the error further to s.pendingDelete as well as s.replacing?

Ah yes we probably should. If this is a CreateReplecement but "old" has already been deleted than s.pendingDelete will be false.

@Frassle
Copy link
Member Author

Frassle commented Jun 21, 2022

Can this break working programs, or those were broken already by actually deleting the cloud resources but believing them to be intact?

I believe this to be the case. If anyone was hitting this before they were probably deleting things be accident.

Is there a way to make the error more actionable to the user?

We could tell them to make sure DeleteBeforeCreate is set. Not sure if there's a CLI option for that as well for --targeted-replace.

Can we explain why we're recreating the resource and what the user can do to either avoid recreating or make progress otherwise?

By the time we get here we don't know why we're doing a replace. But see above I think we can tell the user about DeleteBeforeCreate.

@Frassle
Copy link
Member Author

Frassle commented Jun 21, 2022

This feels right-er to me, i.e. in pseudo-code the higher level logical operation is replace, that'd be right to not call provider.Delete but accept the in-place version as the "replaced" one e.g.:
I appreciate this might be a deeper surgery here though so wary of obscure corner cases it could trip up also.

I think we can just use retainOnDelete here to delete the state for the old resource, but not issue an actual provider Delete command for it. But yeh I need to think if there's anything this could trip up on.

@t0yv0
Copy link
Member

t0yv0 commented Jun 21, 2022

Given we think this does not reduce the set of working programs (they were broken already), I'm OK with taking this change for now, in lieu of the more invasive approach.

// If we've created a new resource as a replacement for an old one before we've deleted the old one check
// that the ID is a new ID.
if s.replacing && s.pendingDelete && s.new.Custom && s.new.ID == s.old.ID {
return resourceStatus, nil, fmt.Errorf("provider returned existing ID (%s) for replacement resource", s.new.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very uncomfortable with this error. From a user's point of view, I can't read it any other way than "Pulumi got really confused about what it's doing and decided to blow up". I suspect we need to improve the workflow in some other way that would be net-beneficial for users and not just bail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the choices are:

  1. Improve the error to suggest setting DeleteBeforeReplace.
  2. Try the idea above about accepting the create and eliding the provider delete for the "old" resource.

@Frassle
Copy link
Member Author

Frassle commented Jun 24, 2022

Talking to Mikhail we don't think this is safe either. There may be resources with non-unique IDs which uses other properties to distinguish deletes (nothing in our provider contracts seems to exclude doing this). There's also the problem that by the time we hit this error the cloud will differ from pulumi state so we'd need a refresh to fix things. So I'm going to close this off as well. We'll need to come up with a new plan to fix pulumi/pulumi-aws#2009.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants