Preconditions and postconditions don't run during apply when their associated resource instance has no planned changes #31261

Closed
apparentlymart opened this issue Jun 17, 2022 · 4 comments · Fixed by #31491
Labels
bug · confirmed (a Terraform Core team member has reproduced this issue) · core · custom-conditions (Feedback about the "variable_validation" experiment) · explained (a Terraform Core team member has described the root cause of this issue in code) · v1.2 (Issues, primarily bugs, reported against v1.2 releases)

Comments

@apparentlymart
Member

Terraform Version

Terraform v1.3.0-dev
on linux_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.1

I'm using a build from source in my local work tree here, but I've also minimally confirmed that this is reproducible with the v1.2.3 release build.

Terraform Configuration Files

The following is the final configuration that exhibits the bug, but see "Steps to Reproduce" below because this bug is only visible if we reach this configuration gradually over multiple steps:

resource "null_resource" "a" {
  triggers = {
    hello = "Hello!"
  }
}

resource "null_resource" "b" {
  lifecycle {
    precondition {
      condition     = null_resource.a.id == ""
      error_message = "The other resource should have an empty ID, for some iexplicable reason."
    }
  }
}

Debug Output

I've already root-caused this, so I'm going to skip this step and will post a follow-up comment after I open the issue explaining what's going on here.

Steps to Reproduce

Let's start with the following contrived configuration:

resource "null_resource" "a" {

}

resource "null_resource" "b" {
  lifecycle {
    precondition {
      condition     = null_resource.a.id == ""
      error_message = "The other resource should have an empty ID, for some iexplicable reason."
    }
  }
}

This models a situation where a precondition on one resource depends on an attribute of another resource that can't be known until the apply step. null_resource fakes this by having id appear as unknown during planning and then filling in a timestamp during apply. I intentionally wrote the precondition above to fail in order to demonstrate this issue.

If we plan and apply this all at once then we can see Terraform check the precondition at the appropriate time:

$ terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # null_resource.a will be created
  + resource "null_resource" "a" {
      + id = (known after apply)
    }

  # null_resource.b will be created
  + resource "null_resource" "b" {
      + id = (known after apply)
    }

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

null_resource.a: Creating...
null_resource.a: Creation complete after 0s [id=2015836518349445544]
╷
│ Error: Resource precondition failed
│ 
│   on checks.tf line 41, in resource "null_resource" "b":
│   41:       condition     = null_resource.a.id == ""
│     ├────────────────
│     │ null_resource.a.id is "2015836518349445544"
│ 
│ The other resource should have an empty ID, for some iexplicable reason.
╵

However, things get more interesting if we arrive at this destination over multiple steps.

Let's remove the terraform.tfstate file to start fresh here and then use the following simpler configuration as the first step:

resource "null_resource" "a" {

}

resource "null_resource" "b" {
}

$ terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # null_resource.a will be created
  + resource "null_resource" "a" {
      + id = (known after apply)
    }

  # null_resource.b will be created
  + resource "null_resource" "b" {
      + id = (known after apply)
    }

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

null_resource.b: Creating...
null_resource.a: Creating...
null_resource.b: Creation complete after 0s [id=5480810909147783652]
null_resource.a: Creation complete after 0s [id=3935487483259785894]

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

So far so good! We have two useless resource instances.

Now let's return to the configuration I started these steps with, which adds the precondition to null_resource.b:

resource "null_resource" "a" {

}

resource "null_resource" "b" {
  lifecycle {
    precondition {
      condition     = null_resource.a.id == ""
      error_message = "The other resource should have an empty ID, for some iexplicable reason."
    }
  }
}

$ terraform apply
null_resource.a: Refreshing state... [id=3935487483259785894]
null_resource.b: Refreshing state... [id=5480810909147783652]
╷
│ Error: Resource precondition failed
│ 
│   on checks.tf line 41, in resource "null_resource" "b":
│   41:       condition     = null_resource.a.id == ""
│     ├────────────────
│     │ null_resource.a.id is "3935487483259785894"
│ 
│ The other resource should have an empty ID, for some iexplicable reason.
╵

This time Terraform was able to catch the problem during the planning phase, because we already know from the prior state that the id value is not the empty string. This is also expected behavior: Terraform eagerly checks the conditions as soon as it has enough information to do so, aiming to raise a problem during the plan phase whenever possible so that we can avoid bailing out partway through apply.
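The eager-checking behavior described above can be modelled as a three-way result. The following is only a minimal sketch: `Value`, `CheckResult`, and `checkIDIsEmpty` are hypothetical names invented for illustration and are not Terraform's real types or functions.

```go
package main

import "fmt"

// Value is a hypothetical stand-in for a value that may be unknown
// until apply. It is not Terraform's real value type.
type Value struct {
	Known bool
	Str   string
}

// CheckResult is a hypothetical three-way outcome for a condition.
type CheckResult int

const (
	Passed CheckResult = iota
	Failed
	Deferred // cannot be decided yet; must be re-checked during apply
)

// checkIDIsEmpty models the precondition `null_resource.a.id == ""`.
func checkIDIsEmpty(id Value) CheckResult {
	if !id.Known {
		return Deferred
	}
	if id.Str == "" {
		return Passed
	}
	return Failed
}

func main() {
	// The prior state already holds an id, so planning can fail early.
	fromState := Value{Known: true, Str: "3935487483259785894"}
	fmt.Println(checkIDIsEmpty(fromState) == Failed)

	// A planned replacement makes the id unknown, so the check is deferred.
	fmt.Println(checkIDIsEmpty(Value{}) == Deferred)
}
```

The deferred case is the one that matters for this issue: a deferred check is only useful if something actually re-evaluates it during apply.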

However, now let's see what happens if I also add triggers to null_resource.a at the same time, which simulates my having changed the configuration of that resource in a way that can only be resolved by replacing the remote object with a fresh one:

resource "null_resource" "a" {
  triggers = {
    hello = "Hello!"
  }
}

resource "null_resource" "b" {
  lifecycle {
    precondition {
      condition     = null_resource.a.id == ""
      error_message = "The other resource should have an empty ID, for some iexplicable reason."
    }
  }
}

$ terraform apply
null_resource.a: Refreshing state... [id=3935487483259785894]
null_resource.b: Refreshing state... [id=5480810909147783652]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # null_resource.a must be replaced
-/+ resource "null_resource" "a" {
      ~ id       = "3935487483259785894" -> (known after apply)
      + triggers = {
          + "hello" = "Hello!"
        } # forces replacement
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

null_resource.a: Destroying... [id=3935487483259785894]
null_resource.a: Destruction complete after 0s
null_resource.a: Creating...
null_resource.a: Creation complete after 0s [id=1106553121951240691]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

Expected Behavior

Terraform should've checked the precondition on null_resource.b during the apply step, once the new null_resource.a.id became known, and raised an error about it not being an empty string.

Actual Behavior

Terraform didn't check the condition in either the plan phase or the apply phase. Instead, I need to re-run terraform apply to catch the problem during the next plan:

$ terraform apply
null_resource.a: Refreshing state... [id=1106553121951240691]
null_resource.b: Refreshing state... [id=5480810909147783652]
╷
│ Error: Resource precondition failed
│ 
│   on checks.tf line 43, in resource "null_resource" "b":
│   43:       condition     = null_resource.a.id == ""
│     ├────────────────
│     │ null_resource.a.id is "1106553121951240691"
│ 
│ The other resource should have an empty ID, for some iexplicable reason.
╵

If this condition were checking something real that affects the behavior of my infrastructure, I might have a problem I'm unaware of. That could confuse someone downstream who tries to make another change, because their plan will fail for a reason unrelated to what they modified.

@apparentlymart apparentlymart added bug core custom-conditions Feedback about the "variable_validation" experiment confirmed a Terraform Core team member has reproduced this issue v1.2 Issues (primarily bugs) reported against v1.2 releases labels Jun 17, 2022
@apparentlymart
Member Author

I happen to know why this happens because I discovered the problem while reading the code as part of working on something else, and noticed the architectural problem before confirming that it led to this bug.

The root problem is that Terraform tries to optimize the apply step by only including graph nodes for resource instances that have actual changes (not "no-op" changes) in the plan. However, that doesn't take into account the fact that some resource behaviors ought to happen even if there isn't a pending change to a particular object, because that object must react to some changes made upstream that aren't reflected in the resource's own configuration arguments.
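As a rough illustration of that optimization, the sketch below filters no-op changes out of a list of apply nodes. It assumes heavily simplified types: `Action`, `Change`, and `applyNodes` are hypothetical names for this example, not the actual graph-builder code in Terraform Core.

```go
package main

import "fmt"

// Action and Change are simplified, hypothetical stand-ins for the planned
// action recorded per resource instance. They are not Terraform's real types.
type Action string

const (
	NoOp    Action = "no-op"
	Replace Action = "replace"
)

type Change struct {
	Addr   string
	Action Action
}

// applyNodes mimics the optimization described above: instances whose
// planned action is a no-op get no node in the apply graph.
func applyNodes(plan []Change) []string {
	var nodes []string
	for _, c := range plan {
		if c.Action == NoOp {
			continue // skipped, along with any condition checks it declares
		}
		nodes = append(nodes, c.Addr)
	}
	return nodes
}

func main() {
	plan := []Change{
		{Addr: "null_resource.a", Action: Replace},
		{Addr: "null_resource.b", Action: NoOp},
	}
	fmt.Println(applyNodes(plan)) // prints [null_resource.a]
}
```

With the plan from the reproduction steps, null_resource.b never gets a node, so nothing is left to evaluate its precondition once the new null_resource.a.id is known.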

I think we could address this by just always putting every resource instance from the plan into the graph (even the ones marked as "no-op") and then handling the no-op-ness of the action during the evaluation of the graph node itself, skipping over the actions that would actually modify the remote object but still running all of the ancillary logic which deals with concerns like preconditions and postconditions.
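One way to picture that approach is sketched below, under the assumption of a much simpler node model than the real one. `Node` and `execute` are hypothetical names for illustration only; the real change would live in Terraform Core's apply node evaluation.

```go
package main

import (
	"errors"
	"fmt"
)

type Action string

const (
	NoOp   Action = "no-op"
	Create Action = "create"
)

// Node is a hypothetical apply-graph node, not Terraform's real node type.
type Node struct {
	Addr         string
	Action       Action
	Precondition func() error // nil when the resource declares none
}

// execute runs the ancillary checks for every node and reports whether the
// remote object would actually have been modified.
func execute(n Node) (bool, error) {
	if n.Precondition != nil {
		if err := n.Precondition(); err != nil {
			return false, fmt.Errorf("%s: %w", n.Addr, err)
		}
	}
	if n.Action == NoOp {
		return false, nil // checks ran, but the object is left untouched
	}
	// A real implementation would ask the provider to apply the change here.
	return true, nil
}

func main() {
	newID := "1106553121951240691" // id of null_resource.a after replacement
	b := Node{
		Addr:   "null_resource.b",
		Action: NoOp,
		Precondition: func() error {
			if newID != "" {
				return errors.New("resource precondition failed")
			}
			return nil
		},
	}
	_, err := execute(b)
	fmt.Println(err)
}
```

In this model the no-op node for null_resource.b still surfaces the precondition failure during apply, without touching the remote object.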

However, our current apply node evaluation process wasn't designed to skip the real action so surgically, so I expect it'll require at least a little refactoring to pull that off. I've not yet investigated exactly what that might look like.

My current work exploring some new condition-related capabilities also requires resolving this, so I may develop a possible fix as part of that. However, I'm currently working in a prototyping capacity in a significantly modified Terraform Core, so it may take some work to adapt my prototype solution into something we could backport in isolation into the v1.2 series.

@apparentlymart apparentlymart added the explained a Terraform Core team member has described the root cause of this issue in code label Jun 17, 2022
@apparentlymart
Member Author

I mentioned I would need at least a hacky solution to this bug for another thing I was working on, and that other thing turned out to be #31268. There are now two commits in that branch which seem to address this problem, though at the expense of a non-trivial change to the separation of concerns for who deals with a resource instance having plans.NoOp as its action in the plan:

  • 1e75266 (core: Create apply graph nodes even for no-op "changes")
  • f8e3286 (core: Do everything except the actual action for plans.NoOp)

At the very least we'd need to add some tests to these if we want to use them as the basis for a direct solution to this bug. If we intend to backport the fix to v1.2 then we'll probably also want to look for a less invasive way to get there, since the solution I used here is probably a bit too risky for a v1.2 patch release.

@apparentlymart
Member Author

I pulled the changes I mentioned in my previous comment, along with some new test cases, into a new PR #31491 so that we can consider it separately from the checks work, which is still in an exploratory phase.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2022