Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.3 apply performance regressions with large numbers of instances #32071

Closed
bob-rohan opened this issue Oct 24, 2022 · 9 comments · Fixed by #32123
Closed

v1.3 apply performance regressions with large numbers of instances #32071

bob-rohan opened this issue Oct 24, 2022 · 9 comments · Fixed by #32123
Assignees

Comments

@bob-rohan
Copy link

bob-rohan commented Oct 24, 2022

Terraform Version

1.3.2

Terraform Configuration Files

Judging by which parts of our estate have been impacted, I suspect this is going to require a fairly large input array and collection of resources to iterate, in order to exhaust the machine from which Terraform is invoked of most available resources. A minimal example should be capable of detailing the material differences I've noted below. If a bit of feedback could be provided on this change in behaviour between non-breaking versions it would help direct the replication for me to provide.

Debug Output

Terraform 1.3.2 DEBUG output

2022-10-24T13:09:46.304+0100 [DEBUG] no planned changes, skipping apply graph check

[0m[1m[32mNo changes.[0m[1m Your infrastructure matches the configuration.[0m

[0mTerraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.
2022-10-24T13:09:46.856+0100 [INFO]  backend/local: apply calling Apply
2022-10-24T13:09:46.856+0100 [DEBUG] Building and walking apply graph for NormalMode plan
2022-10-24T13:09:58.977+0100 [DEBUG] ProviderTransformer: "module.product_attachment[\"somename\"].aws_servicecatalog_constraint.launch[\"NAT-103\"]" (*terraform.NodeApplyableResourceInstance) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:09:58.977+0100 [DEBUG] ProviderTransformer: "module.product_attachment[\"somename\"].aws_servicecatalog_constraint.notification_constraint[\"NAT-110\"]" (*terraform.NodeApplyableResourceInstance) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:09:58.977+0100 [DEBUG] ProviderTransformer: "module.product_attachment[\"somename\"].aws_servicecatalog_product_portfolio_association.product_attachment[\"NAT-028\"]" (*terraform.NodeApplyableResourceInstance) needs provider["registry.terraform.io/hashicorp/aws"]

Terraform 1.2.6 DEBUG output

2022-10-24T13:36:27.310+0100 [DEBUG] no planned changes, skipping apply graph check

[0m[1m[32mNo changes.[0m[1m Your infrastructure matches the configuration.[0m

[0mTerraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.
2022-10-24T13:36:27.826+0100 [INFO]  backend/local: apply calling Apply
2022-10-24T13:36:27.826+0100 [DEBUG] Building and walking apply graph for NormalMode plan
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.common_tag_associations.aws_servicecatalog_tag_option_resource_association.this (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.servie_catalog_cnb_products.aws_s3_bucket_object.products (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.nat_product_attachment.aws_servicecatalog_constraint.template (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "aws_servicecatalog_tag_option.cnb (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.cnb_product_attachment.aws_servicecatalog_constraint.notification_constraint (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.product_attachment.aws_servicecatalog_product_portfolio_association.product_attachment (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.product_attachment.aws_servicecatalog_constraint.notification_constraint (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.cnb_product_attachment.aws_servicecatalog_constraint.template (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "data.aws_region.current (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.nat_product_attachment.aws_servicecatalog_constraint.notification_constraint (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.nat_product_attachment.aws_servicecatalog_constraint.launch (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.product_attachment.aws_servicecatalog_constraint.launch (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.cnb_product_attachment.aws_servicecatalog_constraint.launch (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "aws_s3_bucket_object.products (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]
2022-10-24T13:36:27.946+0100 [DEBUG] ProviderTransformer: "module.product_attachment.aws_servicecatalog_constraint.template (expand)" (*terraform.nodeExpandApplyableResource) needs provider["registry.terraform.io/hashicorp/aws"]

Expected Behavior

Apply should complete in similar time frame to seen in version 1.2.6.

Actual Behavior

Apply seem to hang, TF_LOG output suggests may exhaust machine resources. DEBUG output suggest the material difference is in *terraform.NodeApplyableResourceInstance (1.3.2, see debug statement suggesting this is per array element, and fails or hangs or is so resource demanding that success is unlikely), vs *terraform.nodeExpandApplyableResource (1.2.6, see debug statement suggesting this is per resource type, and succeeds in reasonable time frame).

Steps to Reproduce

terraform apply

Additional Context

No response

References

No response

@bob-rohan bob-rohan added bug new new issue not yet triaged labels Oct 24, 2022
@jbardin
Copy link
Member

jbardin commented Oct 24, 2022

Hi @bob-rohan,

Thanks for filing the issue. There have been no changes in the area you are referring to between v1.2 and v1.3, so this is probably a side-effect of something else, probably entirely unrelated. The given types are also present during the transformation in both versions, so that should not be a factor.

What are you considering to be a fairly large input here? Can you give an idea on the number of resource instances being managed? The debug output alone isn't going to give enough information to be useful here, and setting TF_LOG=trace may have more clues. If you are killing the process to stop the execution anyway, you could use SIGQUIT which will cause it to print a stack trace as well which can pinpoint exactly what is executing at that time.

Thanks!

@jbardin jbardin added core waiting for reproduction unable to reproduce issue without further information waiting-response An issue/pull request is waiting for a response from the community and removed new new issue not yet triaged labels Oct 24, 2022
@bob-rohan
Copy link
Author

Thanks, will pull that together. wrt relativeness of "large"

jq '.resources[].instances[].index_key' state | grep -v null | wc -l

54448

jq '.resources | length' state

677

As mentioned above, I'm refraining from passing too much judgement on the composition at least until can explain root cause. If Hashicorp have rough guidelines for recommended upper limit of distinct resources/instances, that might help me convey this point to the user when the time comes.

@bob-rohan
Copy link
Author

TF_LOG=trace output

1.3.0

2022-10-25T11:43:41.047+0100 [TRACE] DiffTransformer: found NoOp change for module.product_attachment["somename"].aws_servicecatalog_constraint.notification_constraint["NAT-058"]
2022-10-25T11:43:41.047+0100 [TRACE] DiffTransformer: module.product_attachment["somename"].aws_servicecatalog_constraint.notification_constraint["NAT-058"] will be represented by module.product_attachment["somename"].aws_servicecatalog_constraint.notification_constraint["NAT-058"]
2022-10-25T11:43:41.047+0100 [TRACE] DiffTransformer complete
2022-10-25T11:43:41.262+0100 [TRACE] Completed graph transform *terraform.DiffTransformer with new graph:
  aws_s3_bucket_object.cnb (expand) - *terraform.nodeExpandApplyableResource
  aws_s3_bucket_object.cnb["BASE-AlmaLinux"] - *terraform.NodeApplyableResourceInstance

vs

1.2.9

2022-10-25T11:27:30.715+0100 [TRACE] DiffTransformer: found NoOp change for module.product_attachment["somename"].aws_servicecatalog_constraint.template["NAT-121"]
2022-10-25T11:27:30.715+0100 [TRACE] DiffTransformer: found NoOp change for module.product_attachment["somename"].aws_servicecatalog_constraint.template["NAT-101"]
2022-10-25T11:27:30.715+0100 [TRACE] DiffTransformer: found NoOp change for module.product_attachment["somename"].aws_servicecatalog_constraint.template["NAT-098"]
2022-10-25T11:27:30.715+0100 [TRACE] DiffTransformer complete
2022-10-25T11:27:30.715+0100 [TRACE] Completed graph transform *terraform.DiffTransformer (no changes)
2022-10-25T11:27:30.715+0100 [TRACE] Executing graph transform *terraform.AttachStateTransformer
2022-10-25T11:27:30.716+0100 [TRACE] Completed graph transform *terraform.AttachStateTransformer (no changes)
2022-10-25T11:27:30.716+0100 [TRACE] Executing graph transform *terraform.OrphanOutputTransformer
2022-10-25T11:27:30.716+0100 [TRACE] Completed graph transform *terraform.OrphanOutputTransformer (no changes)
2022-10-25T11:27:30.716+0100 [TRACE] Executing graph transform *terraform.AttachResourceConfigTransformer
2022-10-25T11:27:30.716+0100 [TRACE] AttachResourceConfigTransformer: attaching to "aws_servicecatalog_tag_option.common (expand)" (*terraform.nodeExpandApplyableResource) config from populate_tag_options_library.tf:1,1-50

Perhaps related to this change?

9e27703

@jbardin
Copy link
Member

jbardin commented Oct 25, 2022

While that's definitely more resource instances than we generally see, the problem with configurations of that size is usually more associated with the sheer number of API calls that must be made to plan, rather than the handling of the data.

Is that the final log output from v1.3? (I would not use v1.3.0, there were a number of issues solved in the first few patch releases) That would be an odd place for the log stream to stop, unless the system became unresponsive at that exact moment due to resource exhaustion. That many resource instances is going to consume quite a bit of memory to handle, have you checked that memory isn't the constraint here?

@bob-rohan
Copy link
Author

You're correct, that's not the end of the log. It's a snippet based on where the output between the two versions differs materially (in my novice view). This is basically the start of the apply phase after plan confirms no changes required, the difference is quite stark and hopefully indicative to the future processing journies these two paths would take.

While 1.3.2 and 1.3.0 have been resource heavy, the most startling difference is in the size of the output

 1409032 -rw-r--r--  1 bobrohan  staff   683M 25 Oct 11:27 129-apply.out
20545544 -rw-r--r--  1 bobrohan  staff   9.8G 25 Oct 11:47 130-apply.out

I'm tailing the logs while this processes which looks to be continuing, although there could be a buffer/lag which is misleading. This would suggest the machine isn't exhausted of resources as I originally theorised, but an output of that size is certainly indicative of a problem and in stark contrast between the two versions.

Does the PR I've linked above hold any material relevance to this change in behaviour?

@jbardin
Copy link
Member

jbardin commented Oct 25, 2022

That PR is related to the number of nodes being tracked during apply. The addition of preconditions and postconditions in v1.3 requires that we track all resource instances through their entire lifecycle, so that conditions can be validated during apply. That does mean adding in placeholders for all NoOp changes, because any of those resources could have condition checks which need validation.

This is usually not a problem, because Terraform needs to be able to handle the complete graph at some point, otherwise one would not be able to destroy the resources, or restore the configuration from scratch. I think we need to figure out where the actual bottleneck is here.

How is the state being stored for this configuration? Operating on a local state would cause severe issues with this size configuration due to #32060. Resolving that may help, because we can also try to avoid some of the unnecessary overhead with handling the state for NoOp changes. I'm leaning towards the state handling being the bigger problem even without local state, because state is stored as a single blob, verifying writes could be taking considerable time with this many resources.

Since you're watching the logs (which could be slowing things down quite a bit on its own, just because of the log volume for this size config), is v1.3 getting past the graph building steps to the point of evaluating the resources? If so, I think the state handling really is the culprit, otherwise we'll have to do some more investigation in the graph handling.

@jbardin jbardin closed this as completed Oct 25, 2022
@jbardin jbardin reopened this Oct 25, 2022
@bob-rohan
Copy link
Author

bob-rohan commented Oct 25, 2022

Backend is s3.

Yesterday (DEBUG, no tailing logs), today (TRACE, tailing logs), I agree that's likely impacted performance whilst we've attempted to gain more insight.

Last statement from DEBUG run yesterday

2022-10-24T13:14:37.939+0100 [DEBUG] ReferenceTransformer: "module.product_attachment[\"somename\"].aws_servicecatalog_constraint.notification_constraint[\"NAT-037\"]" references:

Last statement from TRACE run today (abbridged)

2022-10-25T11:47:29.777+0100 [TRACE] ReferenceTransformer: skipping &{%!s(*terraform.NodeAbstractResourceInstance=&{{{{} [product_attachment] {{} 77 aws_servicecatalog_constraint template}} 0xc0121be720 0 0xc00b2f9800 map[] map[] [] [] false {[] {aws hashicorp registry.terraform.io} } []} {{} [{product_attachment somename}]

Some grepping, may may not help grep 'Completed graph' 130-apply-part-a*

130-apply-part-aa:2022-10-25T11:43:11.245+0100 [TRACE] Completed graph transform *terraform.ResourceCountTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:11.246+0100 [TRACE] Completed graph transform *terraform.OrphanResourceInstanceCountTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:11.258+0100 [TRACE] Completed graph transform *terraform.AttachStateTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:11.258+0100 [TRACE] Completed graph transform *terraform.TargetsTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:11.265+0100 [TRACE] Completed graph transform *terraform.ReferenceTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:11.266+0100 [TRACE] Completed graph transform *terraform.RootTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:40.397+0100 [TRACE] Completed graph transform *terraform.ConfigTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:40.397+0100 [TRACE] Completed graph transform *terraform.RootVariableTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:40.398+0100 [TRACE] Completed graph transform *terraform.ModuleVariableTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:40.398+0100 [TRACE] Completed graph transform *terraform.LocalTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:40.398+0100 [TRACE] Completed graph transform *terraform.OutputTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:41.262+0100 [TRACE] Completed graph transform *terraform.DiffTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:51.383+0100 [TRACE] Completed graph transform *terraform.AttachStateTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:51.535+0100 [TRACE] Completed graph transform *terraform.OrphanOutputTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:52.358+0100 [TRACE] Completed graph transform *terraform.AttachResourceConfigTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:52.552+0100 [TRACE] (graphTransformerMulti) Completed graph transform *terraform.ProviderConfigTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:52.818+0100 [TRACE] (graphTransformerMulti) Completed graph transform *terraform.MissingProviderTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:54.385+0100 [TRACE] (graphTransformerMulti) Completed graph transform *terraform.ProviderTransformer with new graph:
130-apply-part-aa:2022-10-25T11:43:54.708+0100 [TRACE] (graphTransformerMulti) Completed graph transform *terraform.PruneProviderTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:55.073+0100 [TRACE] Completed graph transform *terraform.graphTransformerMulti with new graph:
130-apply-part-aa:2022-10-25T11:43:55.348+0100 [TRACE] Completed graph transform *terraform.RemovedModuleTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:55.880+0100 [TRACE] Completed graph transform *terraform.AttachSchemaTransformer (no changes)
130-apply-part-aa:2022-10-25T11:43:57.026+0100 [TRACE] Completed graph transform *terraform.ModuleExpansionTransformer with new graph:
``

@jbardin jbardin removed waiting-response An issue/pull request is waiting for a response from the community waiting for reproduction unable to reproduce issue without further information labels Oct 25, 2022
@jbardin jbardin self-assigned this Oct 25, 2022
@jbardin jbardin changed the title ProviderTransformer *terraform.NodeApplyableResourceInstance vs *terraform.nodeExpandApplyableResource v1.3 apply performance regressions with large numbers of instances Oct 25, 2022
@bob-rohan
Copy link
Author

thanks @jbardin , tested with 1.3.4 and this is resolved.

@github-actions
Copy link

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants