v1.3 apply
performance regressions with large numbers of instances
#32071
Comments
Hi @bob-rohan, thanks for filing the issue. There have been no changes in the area you are referring to between v1.2 and v1.3, so this is probably a side-effect of something else, possibly entirely unrelated. The given types are also present during the transformation in both versions, so that should not be a factor. What are you considering to be a fairly large input here? Can you give an idea of the number of resource instances being managed? The debug output alone isn't going to give enough information to be useful here; setting TF_LOG=trace would capture more detail. Thanks!
Thanks, will pull that together. Regarding the relativeness of "large":

resource instances: 54448
distinct resources: 677

As mentioned above, I'm refraining from passing too much judgement on the composition, at least until we can explain the root cause. If HashiCorp has rough guidelines for a recommended upper limit of distinct resources/instances, that might help me convey this point to the user when the time comes.
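For a sense of how instance counts on this scale arise, here is a minimal, hypothetical sketch (not the reporter's actual configuration; the resource type and variable names are invented for illustration) in which a single resource block fans a large input map out into tens of thousands of resource instances:

```terraform
# Hypothetical sketch: one resource block producing many instances.
# With ~54k entries in var.entries, Terraform tracks ~54k resource
# instances from a handful of lines of configuration.
variable "entries" {
  type = map(string)
}

resource "aws_ssm_parameter" "example" {
  for_each = var.entries

  name  = each.key
  value = each.value
  type  = "String"
}
```

Each key in the map becomes a distinct instance address (aws_ssm_parameter.example["key"]), so the instance count is bounded by the input data rather than by the amount of HCL written.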
TF_LOG=trace output, 1.3.0 vs 1.2.9.

Perhaps related to this change?
While that's definitely more resource instances than we generally see, the problem with configurations of that size is usually associated more with the sheer number of API calls that must be made to plan than with the handling of the data.

Is that the final log output from v1.3? (I would not use v1.3.0; a number of issues were solved in the first few patch releases.) That would be an odd place for the log stream to stop, unless the system became unresponsive at that exact moment due to resource exhaustion. That many resource instances is going to consume quite a bit of memory to handle; have you checked that memory isn't the constraint here?
You're correct, that's not the end of the log. It's a snippet based on where the output between the two versions differs materially (in my novice view). This is basically the start of the apply phase after plan confirms no changes are required; the difference is quite stark and hopefully indicative of the future processing journeys these two paths would take.

I'm tailing the logs while this processes, and it looks to be continuing, although there could be a buffer/lag which is misleading. This would suggest the machine isn't exhausted of resources as I originally theorised, but an output of that size is certainly indicative of a problem and in stark contrast between the two versions. Does the PR I've linked above hold any material relevance to this change in behaviour?
That PR is related to the number of nodes being tracked during apply. The addition of preconditions and postconditions in v1.3 requires that we track all resource instances through their entire lifecycle, so that conditions can be validated during apply. That does mean adding placeholders for all NoOp changes, because any of those resources could have condition checks which need validation. This is usually not a problem, because Terraform needs to be able to handle the complete graph at some point; otherwise one would not be able to destroy the resources, or restore the configuration from scratch.

I think we need to figure out where the actual bottleneck is here. How is the state being stored for this configuration? Operating on local state would cause severe issues with a configuration of this size due to #32060. Resolving that may help, because we can also try to avoid some of the unnecessary overhead in handling the state for NoOp changes. I'm leaning towards the state handling being the bigger problem even without local state: because state is stored as a single blob, verifying writes could be taking considerable time with this many resources.

Since you're watching the logs (which could be slowing things down quite a bit on their own, just because of the log volume for a configuration of this size), is v1.3 getting past the graph-building steps to the point of evaluating the resources? If so, I think the state handling really is the culprit; otherwise we'll have to do some more investigation into the graph handling.
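To illustrate the mechanism described above: a condition attached to a resource's lifecycle block is evaluated during apply, which is why v1.3 must carry placeholder nodes even for instances with NoOp changes. A hypothetical example (resource and variable names invented, not taken from this issue):

```terraform
resource "aws_instance" "example" {
  ami           = var.ami_id   # hypothetical input variable
  instance_type = "t3.micro"

  lifecycle {
    # Evaluated during apply; any instance could declare a check like
    # this, so v1.3 keeps every instance in the apply graph, including
    # those whose planned change is NoOp.
    postcondition {
      condition     = self.public_dns != ""
      error_message = "Instance must have a public DNS name."
    }
  }
}
```

Since Terraform cannot know at graph-build time which instances declare such checks without visiting them, every instance gets a node, which is the growth in tracked nodes discussed in the linked PR.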
Backend is s3. Yesterday (DEBUG, no tailing of logs); today (TRACE, tailing logs). I agree that's likely impacted performance whilst we've attempted to gain more insight.

Last statement from the DEBUG run yesterday:

Last statement from the TRACE run today (abridged):

Some grepping, which may or may not help:
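For reference, the s3 backend mentioned above stores the entire state as a single object in a bucket, which is why verifying state writes can become expensive at this instance count. A minimal configuration of that shape (bucket, key, and table names are hypothetical) looks like:

```terraform
terraform {
  backend "s3" {
    bucket         = "example-tf-state"            # hypothetical bucket
    key            = "envs/prod/terraform.tfstate" # single state object
    region         = "eu-west-1"
    dynamodb_table = "example-tf-locks"            # optional state locking
  }
}
```

Because the whole state is one blob, every apply-time persist rewrites and re-verifies the full object regardless of how few instances actually changed.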
Thanks @jbardin, tested with 1.3.4 and this is resolved.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Terraform Version
Terraform Configuration Files
Judging by which parts of our estate have been impacted, I suspect this is going to require a fairly large input array and collection of resources to iterate in order to exhaust most available resources on the machine from which Terraform is invoked. A minimal example should be capable of demonstrating the material differences I've noted below. If a bit of feedback could be provided on this change in behaviour between non-breaking versions, it would help direct the replication for me to provide.
Debug Output
Terraform 1.3.2 DEBUG output
Terraform 1.2.6 DEBUG output
Expected Behavior
Apply should complete in a similar time frame to that seen in version 1.2.6.
Actual Behavior
Apply seems to hang; TF_LOG output suggests it may exhaust machine resources. DEBUG output suggests the material difference is in *terraform.NodeApplyableResourceInstance (1.3.2; see debug statement suggesting this is per array element, and it fails, hangs, or is so resource demanding that success is unlikely) vs *terraform.nodeExpandApplyableResource (1.2.6; see debug statement suggesting this is per resource type, and it succeeds in a reasonable time frame).

Steps to Reproduce
terraform apply
Additional Context
No response
References
No response