
Cleanup, sync and commit actions for Providers #63

Open
mongrelion opened this issue Apr 20, 2016 · 14 comments
Labels
enhancement New feature or request upstream-protocol Requires change of protocol specification, i.e. can't be done under the current protocol

Comments

@mongrelion

Hi!

The Provider API could use a TeardownFunc method, just like it offers a ConfigureFunc method.
Basically, some use cases need to make a final call to an API after the whole Terraform run is done. For example, if the client used to communicate with the API does not support token-based authentication but only a user/password combination, it must make a call at the end to kill the session; some systems start to show issues if you leave too many connections open, because the session timeout is quite long and the developer may have no control whatsoever over that value.
Another use case I can think of at the moment is the Cobbler provider: after creating a bunch of systems you need to call its sync method. You can call it from within each resource in Terraform, but that has led to odd behaviour, because Cobbler is not thread safe and calling sync several times in a short period of time makes things look funky (cobbler/cobbler#1570 & hashicorp/terraform#5969).

Something like this is what I have in mind:

schema.Provider{
  Schema: ...,
  ResourcesMap: ...,
  ConfigureFunc: func(d *schema.ResourceData) (interface{}, error) {},
  TeardownFunc: func(meta interface{}) error {},
}

And the teardown method would receive the value previously returned by the ConfigureFunc.
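
For illustration, here's a rough, hypothetical sketch of how a provider for a session-based API might use the pair. The sessionClient type, the login helper, and the TeardownFunc field itself are all made up for the example (TeardownFunc is exactly what's being proposed here, so this wouldn't compile today):

package provider

import (
    "github.com/hashicorp/terraform/helper/schema"
)

// sessionClient is a made-up client for a user/password, session-based API.
type sessionClient struct{ token string }

func login(user, pass string) (*sessionClient, error) {
    // ... open a session against the API and keep the resulting token ...
    return &sessionClient{}, nil
}

func (c *sessionClient) Logout() error {
    // ... kill the session so it doesn't linger on the server side ...
    return nil
}

func Provider() *schema.Provider {
    return &schema.Provider{
        Schema: map[string]*schema.Schema{
            "username": {Type: schema.TypeString, Required: true},
            "password": {Type: schema.TypeString, Required: true},
        },
        ResourcesMap: map[string]*schema.Resource{},

        // Log in once; the returned client is passed to every resource as meta.
        ConfigureFunc: func(d *schema.ResourceData) (interface{}, error) {
            return login(d.Get("username").(string), d.Get("password").(string))
        },

        // Proposed (does not exist today): called once the whole run is done,
        // with the value ConfigureFunc returned, so the session can be closed.
        TeardownFunc: func(meta interface{}) error {
            return meta.(*sessionClient).Logout()
        },
    }
}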

What do you guys think?

@apparentlymart
Member

This seems like a reasonable idea to me!

In the core Terraform layer we have an interface ResourceProviderCloser which a provider can implement so that Close() will be called on it once it's no longer needed, but as far as I can tell the helper/schema layer doesn't actually implement it and so the real-world providers we have today (which all use the schema helper) don't currently support it.

However, I expect we could extend schema.Provider to implement a Close method in addition to its existing Configure method -- probably with a very similar implementation -- to get what you're suggesting here. We stash a pointer to the meta value inside the provider instance so it should be easy to support the interface you suggested.

I would suggest that the attribute name be CloseFunc rather than TeardownFunc, since the Close method is already named by the layer below and it would be consistent with the existing Configure/ConfigureFunc pairing.

We might consider also exposing the *schema.ResourceData for the config to CloseFunc, in case there are some settings in there that only apply on close. I'm ambivalent about it since the provider could in principle save the relevant settings within the arbitrary meta object, but for many providers meta is just a direct pointer to some object from the underlying client library and so doesn't have a place to stash extra book-keeping information like this.
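
As a very rough sketch of what I mean (none of this exists yet -- CloseFunc is hypothetical, and the shape just mirrors the existing Configure/ConfigureFunc pairing and the meta value the provider already keeps):

// Sketch only: CloseFunc is not a real field today. It would be declared on
// schema.Provider alongside ConfigureFunc:
//
//     CloseFunc func(meta interface{}) error
//
// and this method would let *schema.Provider satisfy the core
// terraform.ResourceProviderCloser interface, reusing the meta value that
// Configure already stashed on the provider instance.
func (p *Provider) Close() error {
    if p.CloseFunc == nil {
        return nil
    }
    return p.CloseFunc(p.Meta())
}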

@srinarayanant

+1
Is the teardown method available in some form? It would be ideal to log in in Configure and log out in Teardown.

@mavogel

mavogel commented Jan 26, 2018

@apparentlymart are there any updates or an ETA on this issue?
Is it as straightforward as I imagine: given the graph, is it possible to determine the last resource created by a provider and then call CloseFunc after that resource's final operation, depending on whether the run was a plan/apply/destroy/show? If so, I'd be keen to contribute as a way to get deeper into Terraform.

@apparentlymart
Member

Oh my... this is an old issue! Sorry for the long silence here. Before I respond, I just want to be explicit that my comment above was written long before I was a HashiCorp employee, when I was just participating in this project as a community contributor, and therefore I didn't have the full context of the long- and short-term product goals.


We've actually ended up returning to this topic recently, since we've been working on revamping the provider RPC protocol to work with the new capabilities we get from introducing the improved version of HCL. My above comment was long enough ago that I don't have a good sense of what I was thinking at that time, but I do have context from our discussion this week.

A key problem we identified while discussing the possibility of teardown-type actions is how to handle failure. How best to handle failure depends on what sort of operation the provider is attempting to do:

  • In the case of terminating a stateful session, producing an error that explains to the user how they might manually terminate their session seems like a reasonable outcome, since Terraform will create a new session on the next run anyway and so the current session is effectively abandoned.
  • For the use-case of performing some sort of "commit" action things are more complex. For any operation that affects the persistent state of remote objects, we always try to retain an accurate picture in Terraform's state file of the outcome of an action, even if it fails partway, so that on the next run Terraform can attempt to retry or conclude any pending operations. If Terraform leaves some changes uncommitted without tracking that, then a subsequent Terraform run may inadvertently apply those changes even though they are not listed in the plan, which violates Terraform's primary promise.

So with that said, we'd concluded that we need to think some more about the design of a feature like this to ensure that it can be safe and explicit in its behavior. We did an initial brainstorm of an idea of making Terraform Core have an explicit representation of actions that need to be "committed" in some way, with these indicated in plan output so the user can see where in the process commit actions may occur. This builds on some other design work we've started around supporting batch requests (discussed in hashicorp/terraform#7388) but is different because our conception of batch requests was as a hidden optimization rather than as something explicitly shown to the user.

We're not able to do detailed research and prototyping on this problem at present because our current focus is on configuration language improvements, but we know there are several providers that would benefit from explicit modelling of a "deploy" or "commit" action (the Fastly provider being a key example, and hypothetical providers for various network equipment similarly) and so do intend to return to this issue at some point.

From reading the linked issues it seems like the Cobbler provider in particular has since been fixed with a workaround: syncing manually after each operation, but using a mutex to avoid concurrent syncs. The Fastly provider currently works around this by managing all of the versioned API objects as a single, huge "service" resource. Neither of these approaches is ideal by any means, but hopefully similar patterns can be employed for other providers in the meantime, until we're able to devote more time to a solution in Terraform Core.
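
For reference, that mutex workaround amounts to roughly the following -- a minimal sketch with a hypothetical syncCobbler helper standing in for the real client call, not the provider's actual code:

package cobbler

import "sync"

// cobblerSyncMu serializes calls to Cobbler's sync endpoint, since Cobbler is
// not thread safe and concurrent syncs misbehave.
var cobblerSyncMu sync.Mutex

// syncCobbler is a hypothetical stand-in for the real client's sync call.
func syncCobbler() error { return nil }

// Called at the end of each create/update/delete so that Terraform's parallel
// resource operations never trigger two syncs at once.
func syncAfterOperation() error {
    cobblerSyncMu.Lock()
    defer cobblerSyncMu.Unlock()
    return syncCobbler()
}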

Sorry for the delay here and the long lack of response. I'll try to come back here and update this issue again when we're ready to do more in-depth work here; to help us find this again I'm going to change the summary to something more likely to come up in search when we go looking for issues on this topic.

@apparentlymart changed the title from "Provider teardown" to "Cleanup, sync and commit actions for Providers" on Jan 28, 2018
@mavogel

mavogel commented Jan 29, 2018

Thank you @apparentlymart for your detailed answer. I was too optimistic in thinking it could be that straightforward. The Docker provider also uses a temporary workaround.

@shinmog
Contributor

shinmog commented May 6, 2018

I've mentioned this during the Palo Alto Networks panos provider validation process, but we will eventually need commit type functionality for our provider as well. When you guys circle back to this issue, we would definitely like to be involved so we can explain our use case in more depth.

@jonshern

Yeah, we are using the Pan OS provider for Palo Alto, and this functionality would be very important for full automation.

@phamann

phamann commented Nov 29, 2018

The Dyn provider would also benefit from such functionality. The Dynect API is stateful and uses a session token, which the provider currently creates during ConfigureFunc by calling the Dyn Go client's login method. However, the API's logout method is never called once the provider has finished, leaving open sessions in the Dyn API. This causes unforeseen side effects, for example when you have multiple Terraform runs in CI environments.

@hashibot hashibot transferred this issue from hashicorp/terraform Sep 26, 2019
@hashibot hashibot added the enhancement New feature or request label Oct 2, 2019
@radeksimko radeksimko added the upstream-protocol Requires change of protocol specification, i.e. can't be done under the current protocol label Dec 8, 2019
@paultyng
Contributor

paultyng commented Apr 2, 2020

The protocol has a Close which we plan to expose in the future. That being said, it will strictly be for cleanup, session logout, etc., not for commit-type operations. Terraform's state model doesn't have any concept of a "commit" or of which resources are committed, and as far as I know adding one is not on the roadmap. If you need to support a commit model, there are potential resource design changes you could make to make it work, and I'm happy to discuss those on specific providers, but without a significant change to Terraform upstream, I don't think this will be added anytime soon.

@jwatte

jwatte commented Apr 21, 2020

That is unfortunate. I, too, have the need for stashing a bunch of changes, and then committing them at the end (which may fail if the whole batch isn't overall consistent.)

The workaround of "one giant synthetic resource that sources everything else" is imperfect at best.

@AlexHunterCodes

AlexHunterCodes commented Aug 26, 2020

I do a lot of work involving DNS records, and it's currently very easy for Terraform to temporarily break these during a large plan because there's no guarantee on how the changes will be scheduled, particularly when these records relate to resources the plan doesn't know about.

From the plan, the start and end states can be reasoned about, but the intermediate state while resources are deploying currently cannot. It's also impossible to batch lots of changes together and commit them as one atomic operation, which causes API rate limiting issues (see #7) in addition to broken infrastructure if a plan fails to apply.

@poroping

poroping commented Sep 6, 2021

I also have a use case for this with an IOS XE Restconf provider I'm creating. It's not as much of a deal breaker as PanOS needing to commit changes, since the changes are actually applied on the fly. However, 'committing' the running config to NVRAM (copy run start) is needed after all the changes are made. Doing this after every change is not really workable, as it takes a few seconds to complete. For now I will need to create a workaround resource, which is not ideal.
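
For what it's worth, the workaround resource I have in mind is roughly the following shape; saveRunningConfig is a hypothetical stand-in for the real Restconf call, and the exact SDK details may differ:

package iosxe

import "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

// saveRunningConfig is a hypothetical stand-in for the Restconf request that
// copies the running config to startup (copy run start).
func saveRunningConfig(meta interface{}) error { return nil }

func resourceSaveConfig() *schema.Resource {
    return &schema.Resource{
        // Creating (or replacing) this resource performs the save; there is
        // nothing real to read or delete afterwards.
        Create: func(d *schema.ResourceData, meta interface{}) error {
            if err := saveRunningConfig(meta); err != nil {
                return err
            }
            d.SetId("save-config")
            return nil
        },
        Read:   func(d *schema.ResourceData, meta interface{}) error { return nil },
        Delete: func(d *schema.ResourceData, meta interface{}) error { return nil },

        Schema: map[string]*schema.Schema{
            // Change this value to force a new save after other resources change.
            "trigger": {Type: schema.TypeString, Optional: true, ForceNew: true},
        },
    }
}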

@bflad
Contributor

bflad commented Jun 3, 2022

Drive-by note relating to "commit" behaviors: there have been lengthy discussions on this topic between the Terraform core and providers teams. The main challenges continue to be how to potentially provide configuration for these behaviors, how to handle errors during applies, and how practitioners could trigger this behavior without configuration changes.

For this use case, introducing a RPC that always gets called before providers are stopped means that core needs to return some information about any potential errors, so that providers can choose their appropriate action. Otherwise, there are risks of providers performing a "commit" against an undesirable system state. Furthermore, practitioners may wish to control this behavior. Given that provider configurations should not be specified within callable modules, there could be some friction between practitioner desires and real world usability in that scenario.

For this use case, introducing a RPC that only gets called on success means that this RPC loses its effectiveness for use cases such as closing sessions.

The solutions for all of these tend to point towards providers relying on the existing Terraform resource lifecycle model. Resource creation can perform the action, while other actions can be ignored or implemented as appropriate. More succinctly, configuration of behaviors is already well defined (e.g. resource configuration/state is tied to the resource lifecycle and provider developers have the choice of in-place updates or replacement), error behaviors are well defined (e.g. a failing dependency will prevent the resource from performing its action), and practitioners can trigger resource behaviors manually if necessary (e.g. terraform taint / terraform apply -replace).

The biggest challenge with using a resource then becomes the configuration needed to properly set up the desired behavior, since every dependency must be explicitly configured. Terraform CLI 1.2 has introduced the lifecycle configuration block's replace_triggered_by argument, which can be used to automatically replace a resource if any of the references change.

This means it's possible to create something like the following:

resource "example_thing" "example" {}

resource "example_widget" "example" {}

resource "example_commit" "example" {
  lifecycle {
    replace_triggered_by = [
      example_thing.example,
      example_widget.example,
      # ...
    ]
  }
}

If this solution is usable at smaller scales (please let us know if it is!), then this problem space could theoretically shift to thinking about other ways to define "groupings" of resources in a configuration beyond just a module, to help reduce the amount of configuration necessary.

@alexandrubordei

I can't speak for all commit-needing use cases, but at least with the MetalSoft provider we ended up applying a similar strategy with a "deployer" resource (see https://registry.terraform.io/providers/metalsoft-io/metalcloud/latest/docs/resources/infrastructure_deployer), working within the limitations of the current model, so perhaps that might help answer some of the questions.

Indeed, dependencies are a weak point of the model. Unless users specify all of the depends_on entries properly, you'll miss resource changes, which means all users of the provider need to be instructed to do that in their manifests; that is of course prone to all sorts of preventable user mistakes. What makes this even more complicated is the fact that you cannot use wildcards. An option we sometimes use to simplify things is to put all the dependent resources in a module and the deployer resource at some upper level, but it is far from ideal.

Perhaps support for "depends on: all resources" might help, or better yet some way to determine the dependencies internally (by specifying classes of resources, for example). The need for a "drive" or an "instance" to deploy via the "deployer" would not change regardless of what the user wants, so I wouldn't even give the user a choice.

The second issue we've seen is that the deployer needs to run not just at the initial deploy but also during edit operations on a dependent resource. Since the deployer itself is not edited, it will not be deployed again. We ended up having to force a property to always be "edited", which of course screws up the plan when nothing is actually edited, but there was no other way at the time. I guess using replace_triggered_by instead of depends_on might help with this issue, but it wasn't available when we implemented this strategy.

The third (minor) issue is that this resource is not "natural" from a user's perspective. It feels forced (because, of course, it is). All other resources are true resources reflected on the server side, whereas this resource is something that exists only at the Terraform level.

So, to cut a long story short, I think the answer to the question is that the solution would work, since we're already doing it to some degree, but if we could get some help to make it more elegant (perhaps a way to set implicit dependencies, a way to force the deploy of the deployer even if the deployer resource itself is not edited, and perhaps the concept of a singleton), it would be brilliant.
