v1.3 apply
performance regressions with large numbers of instances
#32071
Comments
Hi @bob-rohan, thanks for filing the issue. There have been no changes in the area you are referring to between v1.2 and v1.3, so this is probably a side-effect of something else, possibly entirely unrelated. The given types are also present during the transformation in both versions, so that should not be a factor. What are you considering to be a fairly large input here? Can you give an idea of the number of resource instances being managed? The debug output alone isn't going to give enough information to be useful here; setting TF_LOG=trace would capture more detail. Thanks!
Thanks, will pull that together. Regarding the relativeness of "large":

resource instances: 54448
distinct resources: 677

As mentioned above, I'm refraining from passing too much judgement on the composition, at least until we can explain the root cause. If HashiCorp has rough guidelines for a recommended upper limit of distinct resources/instances, that might help me convey this point to the user when the time comes.
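For a sense of how instance counts on this scale arise, here is a minimal, hypothetical sketch (not the reporter's actual configuration; the resource type and variable names are invented for illustration) in which a single resource block fans a large input map out into tens of thousands of resource instances:

```terraform
# Hypothetical sketch: one resource block producing many instances.
# With ~54k entries in var.entries, Terraform tracks ~54k resource
# instances from a handful of lines of configuration.
variable "entries" {
  type = map(string)
}

resource "aws_ssm_parameter" "example" {
  for_each = var.entries

  name  = each.key
  value = each.value
  type  = "String"
}
```

Each key in the map becomes a distinct instance address (aws_ssm_parameter.example["key"]), so the instance count is bounded by the input data rather than by the amount of HCL written.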
TF_LOG=trace output, 1.3.0 vs 1.2.9.

Perhaps related to this change?
While that's definitely more resource instances than we generally see, the problem with configurations of that size is usually associated more with the sheer number of API calls that must be made to plan than with the handling of the data.

Is that the final log output from v1.3? (I would not use v1.3.0; a number of issues were solved in the first few patch releases.) That would be an odd place for the log stream to stop, unless the system became unresponsive at that exact moment due to resource exhaustion. That many resource instances is going to consume quite a bit of memory to handle; have you checked that memory isn't the constraint here?
You're correct, that's not the end of the log. It's a snippet based on where the output between the two versions differs materially (in my novice view). This is basically the start of the apply phase after plan confirms no changes are required; the difference is quite stark and hopefully indicative of the future processing journeys these two paths would take.

I'm tailing the logs while this processes, and it looks to be continuing, although there could be a buffer/lag which is misleading. This would suggest the machine isn't exhausted of resources as I originally theorised, but an output of that size is certainly indicative of a problem and in stark contrast between the two versions. Does the PR I've linked above hold any material relevance to this change in behaviour?
That PR is related to the number of nodes being tracked during apply. The addition of preconditions and postconditions in v1.3 requires that we track all resource instances through their entire lifecycle, so that conditions can be validated during apply. That does mean adding placeholders for all NoOp changes, because any of those resources could have condition checks which need validation. This is usually not a problem, because Terraform needs to be able to handle the complete graph at some point; otherwise one would not be able to destroy the resources, or restore the configuration from scratch.

I think we need to figure out where the actual bottleneck is here. How is the state being stored for this configuration? Operating on local state would cause severe issues with a configuration of this size due to #32060. Resolving that may help, because we can also try to avoid some of the unnecessary overhead in handling the state for NoOp changes. I'm leaning towards the state handling being the bigger problem even without local state: because state is stored as a single blob, verifying writes could be taking considerable time with this many resources.

Since you're watching the logs (which could be slowing things down quite a bit on their own, just because of the log volume for a configuration of this size), is v1.3 getting past the graph-building steps to the point of evaluating the resources? If so, I think the state handling really is the culprit; otherwise we'll have to do some more investigation into the graph handling.
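To illustrate the mechanism described above: a condition attached to a resource's lifecycle block is evaluated during apply, which is why v1.3 must carry placeholder nodes even for instances with NoOp changes. A hypothetical example (resource and variable names invented, not taken from this issue):

```terraform
resource "aws_instance" "example" {
  ami           = var.ami_id   # hypothetical input variable
  instance_type = "t3.micro"

  lifecycle {
    # Evaluated during apply; any instance could declare a check like
    # this, so v1.3 keeps every instance in the apply graph, including
    # those whose planned change is NoOp.
    postcondition {
      condition     = self.public_dns != ""
      error_message = "Instance must have a public DNS name."
    }
  }
}
```

Since Terraform cannot know at graph-build time which instances declare such checks without visiting them, every instance gets a node, which is the growth in tracked nodes discussed in the linked PR.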
Backend is s3. Yesterday (DEBUG, no tailing of logs); today (TRACE, tailing logs). I agree that's likely impacted performance whilst we've attempted to gain more insight.

Last statement from the DEBUG run yesterday:

Last statement from the TRACE run today (abridged):

Some grepping, which may or may not help:
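For reference, the s3 backend mentioned above stores the entire state as a single object in a bucket, which is why verifying state writes can become expensive at this instance count. A minimal configuration of that shape (bucket, key, and table names are hypothetical) looks like:

```terraform
terraform {
  backend "s3" {
    bucket         = "example-tf-state"            # hypothetical bucket
    key            = "envs/prod/terraform.tfstate" # single state object
    region         = "eu-west-1"
    dynamodb_table = "example-tf-locks"            # optional state locking
  }
}
```

Because the whole state is one blob, every apply-time persist rewrites and re-verifies the full object regardless of how few instances actually changed.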
Thanks @jbardin, tested with 1.3.4 and this is resolved.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Terraform Version
Terraform Configuration Files
Judging by which parts of our estate have been impacted, I suspect this is going to require a fairly large input array and collection of resources to iterate in order to exhaust most available resources on the machine from which Terraform is invoked. A minimal example should be capable of demonstrating the material differences I've noted below. If a bit of feedback could be provided on this change in behaviour between non-breaking versions, it would help direct the replication for me to provide.
Debug Output
Terraform 1.3.2 DEBUG output
Terraform 1.2.6 DEBUG output
Expected Behavior
Apply should complete in a similar time frame to that seen in version 1.2.6.
Actual Behavior
Apply seems to hang; TF_LOG output suggests it may exhaust machine resources. DEBUG output suggests the material difference is in *terraform.NodeApplyableResourceInstance (1.3.2; see debug statement suggesting this is per array element, and it fails, hangs, or is so resource demanding that success is unlikely) vs *terraform.nodeExpandApplyableResource (1.2.6; see debug statement suggesting this is per resource type, and it succeeds in a reasonable time frame).

Steps to Reproduce
terraform apply
Additional Context
No response
References
No response