The provider is slow when creating/destroying resources #929

Closed
zhenrong-wang opened this issue Nov 27, 2023 · 17 comments
@zhenrong-wang

zhenrong-wang commented Nov 27, 2023

Hi opentofu developers,

I am switching my workload from Terraform to OpenTofu. With OpenTofu 1.6.0-alpha5 and the aliyun cloud provider 1.213.0, the provider is really slow to orchestrate resources.

I enabled DEBUG logging, and the following messages stream out continuously. It usually takes ~100 seconds before resource creation or destruction starts.

Is there anything wrong with the provider or with OpenTofu? Thanks a lot!

2023-11-27T19:09:42.665+0800 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 pid=30472
2023-11-27T19:09:42.665+0800 [DEBUG] provider: plugin exited
2023-11-27T19:09:42.665+0800 [DEBUG] created provider logger: level=debug
2023-11-27T19:09:42.665+0800 [INFO]  provider: configuring client automatic mTLS
2023-11-27T19:09:42.675+0800 [DEBUG] provider: starting plugin: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 args=[".terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0"]
2023-11-27T19:09:42.676+0800 [DEBUG] provider: plugin started: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 pid=30479
2023-11-27T19:09:42.676+0800 [DEBUG] provider: waiting for RPC address: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0
2023-11-27T19:09:42.762+0800 [INFO]  provider.terraform-provider-alicloud_v1.213.0: configuring server automatic mTLS: timestamp="2023-11-27T19:09:42.761+0800"
2023-11-27T19:09:42.793+0800 [DEBUG] provider.terraform-provider-alicloud_v1.213.0: plugin address: address=/tmp/plugin3192086639 network=unix timestamp="2023-11-27T19:09:42.793+0800"
2023-11-27T19:09:42.793+0800 [DEBUG] provider: using plugin: version=5
2023-11-27T19:09:42.925+0800 [DEBUG] No provider meta schema returned
2023-11-27T19:09:43.056+0800 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
2023-11-27T19:09:43.059+0800 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 pid=30479
2023-11-27T19:09:43.059+0800 [DEBUG] provider: plugin exited
2023-11-27T19:09:43.059+0800 [DEBUG] created provider logger: level=debug
2023-11-27T19:09:43.060+0800 [INFO]  provider: configuring client automatic mTLS
2023-11-27T19:09:43.068+0800 [DEBUG] provider: starting plugin: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 args=[".terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0"]
2023-11-27T19:09:43.069+0800 [DEBUG] provider: plugin started: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0 pid=30486
2023-11-27T19:09:43.069+0800 [DEBUG] provider: waiting for RPC address: path=.terraform/providers/registry.opentofu.org/aliyun/alicloud/1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0
2023-11-27T19:09:43.158+0800 [INFO]  provider.terraform-provider-alicloud_v1.213.0: configuring server automatic mTLS: timestamp="2023-11-27T19:09:43.158+0800"
2023-11-27T19:09:43.188+0800 [DEBUG] provider: using plugin: version=5
2023-11-27T19:09:43.189+0800 [DEBUG] provider.terraform-provider-alicloud_v1.213.0: plugin address: address=/tmp/plugin1095469596 network=unix timestamp="2023-11-27T19:09:43.188+0800"
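
For anyone reading a log like the excerpt above, one way to quantify the restart loop is to count the "starting plugin" lines and measure the gap between them. The following is a minimal Python sketch, not part of OpenTofu or the provider. The names plugin_starts and restart_gaps are made up for illustration, and the regular expression assumes only the line layout visible in the excerpt.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2023-11-27T19:09:42.675+0800 [DEBUG] provider: starting plugin: path=... args=[...]
START_RE = re.compile(r"^(\S+) \[DEBUG\] provider: starting plugin: path=(\S+)")


def plugin_starts(lines):
    """Return (timestamp, provider_path) for every 'starting plugin' line."""
    starts = []
    for line in lines:
        match = START_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
            starts.append((ts, match.group(2)))
    return starts


def restart_gaps(starts):
    """Seconds between consecutive provider launches."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(starts, starts[1:])]


# Three lines taken from the excerpt above (provider path factored out).
PROVIDER = (
    ".terraform/providers/registry.opentofu.org/aliyun/alicloud/"
    "1.213.0/linux_amd64/terraform-provider-alicloud_v1.213.0"
)
sample = [
    f'2023-11-27T19:09:42.675+0800 [DEBUG] provider: starting plugin: path={PROVIDER} args=["{PROVIDER}"]',
    "2023-11-27T19:09:43.059+0800 [DEBUG] provider: plugin exited",
    f'2023-11-27T19:09:43.068+0800 [DEBUG] provider: starting plugin: path={PROVIDER} args=["{PROVIDER}"]',
]

starts = plugin_starts(sample)
gaps = restart_gaps(starts)
print(f"{len(starts)} provider launches, gaps in seconds: {gaps}")
```

Run against a full log file (for example by passing an open file object as lines), the launch count gives a rough idea of how often the provider process is being restarted before any resource work begins.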
@kislerdm
Contributor

@zhenrong-wang Hey Zhenrong! Thanks for raising the issue.

It usually takes ~100 seconds before the start of creating/destroying resources.

My guess is that the refresh step accounts for most of that time. That is expected for large state files, because many API calls have to be made over the network to detect drift between the state file and the real infrastructure.

Could you please share the following:

  • What was the typical/expected duration of the operation for that specific infra config before? If you used Terraform before, which Terraform version and which provider version did you use?
  • How does the refresh time change if you rerun the plan operation?
  • Could you please share the output of the commands:
    • TF_LOG=TRACE tofu init
    • TF_LOG=TRACE tofu plan

Thanks!
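
If the commands are launched from a wrapper rather than typed by hand, the same trace logging can be enabled through the environment of the child process. Below is a hedged Python sketch. trace_env and run_with_trace are hypothetical helper names, and writing the log to a file via TF_LOG_PATH is an assumption about how one might want to collect the output, not something requested in this thread.

```python
import os
import subprocess


def trace_env(log_path, base_env=None):
    """Copy the environment and enable TRACE logging written to log_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_LOG"] = "TRACE"         # verbosity level asked for above
    env["TF_LOG_PATH"] = log_path   # assumption: send the log to a file
    return env


def run_with_trace(binary, subcommand, log_path):
    """Run e.g. `tofu plan` with trace logging and return its exit code."""
    result = subprocess.run([binary, subcommand], env=trace_env(log_path))
    return result.returncode


# Example (needs the tofu binary on PATH, so it is left commented out):
# run_with_trace("tofu", "init", "tofu-trace-init.log")
# run_with_trace("tofu", "plan", "tofu-trace-plan.log")
```

Keeping one log file per subcommand makes it easier to attach the init and plan traces separately.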

@kislerdm kislerdm added the question Further information is requested label Nov 27, 2023
@zhenrong-wang
Author

zhenrong-wang commented Nov 27, 2023

Thanks @kislerdm for your reply.

Here is my project HPC-NOW. It depends on Terraform or openTofu to orchestrate cloud resources.

1. Versions:

  • openTofu - 1.6.0-alpha5, with the provider terraform-provider-alicloud_1.213.0_linux_amd64.zip
  • Terraform - 1.6.2, with the provider terraform-provider-alicloud_1.203.0_linux_amd64.zip

2. How did I run:

Instead of running the terraform/tofu commands directly, the HPC-NOW project runs them through a "wrapper" in the hpcopr CLI. I therefore built 2 versions of the hpcopr CLI, one using OpenTofu and one using Terraform. The provider stays the same.

I didn't wrap the plan command of tofu/terraform into the hpcopr CLI, only init, apply and destroy.

Then I ran hpcopr init -b and hpcopr destroy -b with both versions of the CLI. Both should create and destroy the same resource stack in the cloud (aliyun).

  • hpcopr init -b covers both the terraform/tofu init and the terraform/tofu apply steps
  • hpcopr destroy -b is equivalent to terraform/tofu destroy

3. The Results:

a. hpcopr with openTofu:

Creating a stack took 440 seconds.

Destroying the stack took 380 seconds.

b. hpcopr with Terraform

Creating a stack took 161 seconds.

Destroying the stack took 108 seconds

4. The DEBUG logs

For each tool, the logs of the create and destroy runs are saved in a single file. Sorry, I forgot to rename the first log after the stack was created, so the destroy run appended its logs to the same file.

hpcopr with openTofu

tofu.log

hpcopr with Terraform
terraform.log

5. Summary:

Creating a stack: openTofu - 440 seconds, Terraform - 168 seconds
Destroying a stack: openTofu - 380 seconds, Terraform - 108 seconds
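
As a quick sanity check on the size of the gap, the ratios can be computed directly from the two summary lines. This is only an illustrative Python sketch using the figures exactly as given in the summary (which lists 168 seconds for Terraform creation, while the results section lists 161). The slowdown helper is a made-up name.

```python
# Durations in seconds, copied from the summary lines above.
timings = {
    "create": {"opentofu": 440, "terraform": 168},
    "destroy": {"opentofu": 380, "terraform": 108},
}


def slowdown(timings):
    """Ratio of OpenTofu duration to Terraform duration per operation."""
    return {
        op: round(t["opentofu"] / t["terraform"], 2)
        for op, t in timings.items()
    }


ratios = slowdown(timings)
for op, ratio in ratios.items():
    print(f"{op}: OpenTofu took {ratio}x as long as Terraform")
```

With these numbers, creation comes out at roughly 2.6 times and destruction at roughly 3.5 times the Terraform duration.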

My local network remained unchanged during the 2 tests. Judging from the OpenTofu DEBUG log, the provider does not seem to be running as smoothly as expected.

Looking forward to your support, thanks so much!

Zhenrong

@kislerdm
Contributor

Hey Zhenrong! Thanks for your prompt reply!

Did I get it right that the provider version used with OpenTofu differs from the one used with Terraform? If that's the case, could you please rerun your flow using identical provider versions, e.g. "terraform-provider-alicloud_1.203.0_linux_amd64.zip" in both cases? Thanks!

@zhenrong-wang
Author

Hi @kislerdm ,

I upgraded the provider from 1.203.0 to the latest version (1.213.0) for OpenTofu because the same behavior (low speed and a ~100-second wait before starting) also occurs with 1.203.0.

Sure, I can rerun the test within minutes, but the result will probably be the same. Please give me a few minutes.

@zhenrong-wang
Author

zhenrong-wang commented Nov 27, 2023

Hi @kislerdm
Here are the results with provider version 1.203.0:

  1. Creating a stack took 446 seconds.

  2. Destroying the stack took 338 seconds.

  3. Logs are here.

tofu-1.203.0-provider.log

Thanks a lot!

@kislerdm
Contributor

@zhenrong-wang Hey! Thanks a lot for sharing the details!

Could you please share the logs at the TRACE verbosity level (as requested in the first comment) to help us identify the root cause: TF_LOG=trace tofu apply? Thanks!

@zhenrong-wang
Author

Sure, let me modify the wrapper to generate a new set of logs.

For comparison, I will keep using provider version 1.203.0.

Please give me a few minutes.

@zhenrong-wang
Author

Hi @kislerdm

Just ran the test again. Creating the stack took 408 seconds; and destroying it took 405 seconds.

Here are the logs. I generated 2 files. Please take a look.

tofu-trace-creating-1.203.0-provider.log

tofu-trace-destroying-1.203.0-provider.log

Hope it helps. Thanks!

@kislerdm
Contributor

Hi @zhenrong-wang! Thanks a lot for your collaboration!

May I kindly ask you to confirm that your setup was identical except for the binary you executed, i.e. tofu vs. terraform? In other words, did you run the tofu/terraform apply/destroy commands on the same machine in the same VPC/availability zone/region?

Also, would it be possible to share the logs from running TF_LOG=trace terraform apply, so we can compare tofu vs. terraform side by side? Thanks!

For context, we have a couple of guesses that we would like to verify, but reproducing the issue would take us extra time because we have no experience with the hpcopr "wrapper" you used, nor do we have much experience with Alibaba Cloud.

Thank you very much for your support! 🙏🏻

@zhenrong-wang
Author

Sure, I will run the terraform version and post the log here.

One thing is for sure: only the binary (executable) changed. Everything else stayed the same.

As for the "wrapper", it is just the way the hpcopr CLI runs the terraform/tofu commands. It is no different from running terraform/tofu directly.

Please wait for my log with terraform.

Thanks!

@zhenrong-wang
Author

zhenrong-wang commented Nov 27, 2023

terraform-trace-creating-1.203.0-provider.log
terraform-trace-destroying-1.203.0-provider.log

@kislerdm Please check these out. This time with Terraform, destroying the stack took 240 seconds, a bit longer than before, while creating it took 137 seconds.

Here is how hpcopr runs terraform/tofu. I used -parallelism=1000 to allow as much concurrency as possible.

Screenshot_2023-11-28-01-45-17-761_com.github.android-edit.jpg
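
Since the screenshot may not survive outside GitHub, here is a rough Python sketch of what a wrapper invocation along these lines could look like, with a timer around it. It is not the hpcopr implementation. build_command and timed_run are hypothetical names, and the use of -auto-approve is an assumption for a non-interactive run. Only -parallelism=1000 comes from the comment above.

```python
import subprocess
import time


def build_command(binary, action, parallelism=1000):
    """Build an argument list such as ['tofu', 'apply', '-auto-approve', '-parallelism=1000']."""
    if action not in ("apply", "destroy"):
        raise ValueError(f"unsupported action: {action}")
    # -auto-approve is assumed here so that the run does not wait for input.
    return [binary, action, "-auto-approve", f"-parallelism={parallelism}"]


def timed_run(cmd, runner=subprocess.run):
    """Run cmd and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = runner(cmd)
    return result.returncode, time.monotonic() - start


# Example (needs the binaries on PATH, so it is left commented out):
# for binary in ("tofu", "terraform"):
#     code, seconds = timed_run(build_command(binary, "apply"))
#     print(binary, code, round(seconds))
```

Timing both binaries through the same code path keeps the comparison limited to the executable itself.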

@zhenrong-wang zhenrong-wang changed the title The provider is slow to creating/destroying resources The provider is slow when creating/destroying resources Nov 27, 2023
@kislerdm
Contributor

@zhenrong-wang Thank you very much for supporting us with additional details - we appreciate your contribution a lot! We will continue digging into the issue on our side tomorrow, and will keep you posted. Thanks!

@zhenrong-wang
Author

Hi @kislerdm

I just tested 1.6.0-beta1 with the same version of the alicloud provider. The problem still seems to be there: it took 410 seconds to create the stack.

@Yantrio
Member

Yantrio commented Dec 4, 2023

Hi @zhenrong-wang, the proposed fix for this (#954) did not make it into beta1. Beta2 is coming really soon (hopefully within the next hour) which should resolve this issue for you.

Thanks for your patience on this one.

@zhenrong-wang
Author

Thanks for your reply. Looking forward to the next beta version.

@cube2222
Collaborator

cube2222 commented Dec 4, 2023

This should be fixed now, @zhenrong-wang. I'll be closing this issue, but please feel free to reopen it if you hit any further problems.

@cube2222 cube2222 closed this as completed Dec 4, 2023
@zhenrong-wang
Author

Thanks, OpenTofu team! With the beta2 version, the problem reported in this issue is resolved.

Creating an HPC stack in Alicloud is now as smooth as with Terraform.

Fantastic job! We are a step closer to switching to OpenTofu.
