Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache terraform configurations used to fetch dependency outputs #2202

Open
JohannesRudolph opened this issue Jul 19, 2022 · 8 comments
Open
Labels
enhancement New feature or request

Comments

@JohannesRudolph
Copy link

I noticed one of my configurations being very slow to actually start running terraform apply. After inspecting terragrunt's debug output, I traced it back to a call to terraform init -get=false of a dependency configuration taking up most of the init time. I noticed on my system activity monitor that this command was pulling down quite a few MiBs from the internet (assumingly terraform backend plugins?) and this was slowing down everything.

I then looked for why terragrunt was doing this and discovered this recent discussion -
#2119 introduced new behavior to optimize fetching the state of dependency blocks by bypassing terraform alltogether and fetching directly from an S3 bucket.

However, I'm using GCS backend and that optimization isn't implemented for GCS yet. The issue also discusses the challenges of that optimization and the robust fallback behavior.

Now, based on this comment:

But this cache only works for the current process, so it is only used when you do a run-all.

Originally posted by @yorinasub17 in #2119 (comment)

I wonder if not a "naive" but more robust implementation could (persistently) cache the terraform configuration that's used to get the terraform output of the dependency configuration? This way terragrunt would still use terraform output to fetch the dependency but can avoid expensive terraform init calls on that dependency configuration.

This optimization could easily shave of 10s and up to 60s of "init" time from some of my more complex terragrunt configurations and yield a huge boost for interactive development.

@denis256 denis256 added the enhancement New feature or request label Jul 22, 2022
@parmouraly
Copy link

parmouraly commented Aug 1, 2022

We are experiencing a similar "lag" when running plans on modules with 4-5 dependency blocks, and we're using AWS s3.
More specifically, planning with mock_ouptuts and skip_outputs flag on each dependency block results in a 26s plan, while not using these leads to a 1min40secs plan.
That may not sound a lot but often you're trying to iterate on the main module (not on the dependencies) and this slowness builds up.

To mitigate this I had in mind a similar suggestion to @JohannesRudolph

More specifically, it would be great if there was an optional TG CLI flag for caching dependency outputs so they're available for subsequent plans.
In this way plan speeds could be improved significantly instead of constantly re-calculating ouputs that may never change (and even if they did we wouldn't care) while testing changes in the main module.

@fenos
Copy link

fenos commented Aug 8, 2022

I have the exact same issue, terragrunt is taking a minute to run and iterating on it is frustrating 😢

@parmouraly
Copy link

@fenos maybe add a 👍 on this issue descr

@Ido-DY
Copy link
Contributor

Ido-DY commented Aug 15, 2022

We are experiencing a similar "lag" when running plans on modules with 4-5 dependency blocks, and we're using AWS s3. More specifically, planning with mock_ouptuts and skip_outputs flag on each dependency block results in a 26s plan, while not using these leads to a 1min40secs plan. That may not sound a lot but often you're trying to iterate on the main module (not on the dependencies) and this slowness builds up.

To mitigate this I had in mind a similar suggestion to @JohannesRudolph

More specifically, it would be great if there was an optional TG CLI flag for caching dependency outputs so they're available for subsequent plans. In this way plan speeds could be improved significantly instead of constantly re-calculating ouputs that may never change (and even if they did we wouldn't care) while testing changes in the main module.

@parmouraly have you tried the new CLI flag? https://terragrunt.gruntwork.io/docs/reference/cli-options/#terragrunt-fetch-dependency-output-from-state

Since this one was added, we are using it to speed up all of our commands without any issues.
It is currently only for AWS S3 but, if this is your backend, it should work.

And, BTW, as originally discussed in the issue #2119 Terragrunt has cache for dependencies included in the same run-all, I think that caching the output locally could be risky since the dependency output could change by someone else between runs, especially when knowing that fetching the dependency output from the state directly only takes a few milliseconds.

@JohannesRudolph I think that it might be better to expand this feature to GCS instead.

@stevenpitts
Copy link
Contributor

stevenpitts commented Aug 18, 2022

@Ido-DY Could you please clarify whether you're suggesting you suspect this new terragrunt-fetch-dependency-output-from-state flag is related to OP's original issue, or if it is a new non-causal feature that you think could resolve the slow Terragrunt commands from a different angle?

@Ido-DY
Copy link
Contributor

Ido-DY commented Aug 19, 2022

@stevenpitts If I got that right, the issue is all about the slowness of Terragrunt mostly when handling dependencies. I actually investigated it myself before opening #2119, and I found out that it actually terraform init and terraform output -json that takes 99% of the time, then I compared the output of the Terraform command with the output stored within the state file and confirmed it's identical. At this point I wonder why it takes Terraform more than 10 seconds to generate the exact output that can be fetched from the state directly with a few milliseconds, and I'm still not sure.
Caching the Terragrunt configuration is constructed from a few components:

  1. The wrapping configuration (such as the backend and credentials preparations)
  2. The Terraform plugins
  3. The output of packages

The 1st component is generated super fast and there is no point to my mind to cache it.
The seconed one can be cached using TF_CACHE_DIR but it'll lead to other issues when having concurrent executions.

And the last one is cached by default only within the memory of the same run.
Caching the last layer locally without knowing if there was another execution on the same resource is not safe.

It takes a few seconds for Terragrunt to run on a single component and most of the time is spent on the Terraform commands themselves, however when resolving dependencies throughout Terraform commands this time usually multiplied.

Is this answer your question?

@parmouraly have you find this flag useful for your case?

@stevenpitts
Copy link
Contributor

@stevenpitts If I got that right, the issue is all about the slowness of Terragrunt mostly when handling dependencies. I actually investigated it myself before opening #2119, and I found out that it actually terraform init and terraform output -json that takes 99% of the time, then I compared the output of the Terraform command with the output stored within the state file and confirmed it's identical. At this point I wonder why it takes Terraform more than 10 seconds to generate the exact output that can be fetched from the state directly with a few milliseconds, and I'm still not sure. Caching the Terragrunt configuration is constructed from a few components:

  1. The wrapping configuration (such as the backend and credentials preparations)
  2. The Terraform plugins
  3. The output of packages

The 1st component is generated super fast and there is no point to my mind to cache it. The seconed one can be cached using TF_CACHE_DIR but it'll lead to other issues when having concurrent executions.

And the last one is cached by default only within the memory of the same run. Caching the last layer locally without knowing if there was another execution on the same resource is not safe.

It takes a few seconds for Terragrunt to run on a single component and most of the time is spent on the Terraform commands themselves, however when resolving dependencies throughout Terraform commands this time usually multiplied.

Is this answer your question?

I think this makes sense, thank you!
It's felt like the time it takes to "start up" Terragrunt (number 3 in your post) had increased recently, which was the reason I originally searched for this issue, but that may just be my imagination or something unrelated.

But doesn't it make sense that generating output of all dependent modules would take longer than fetching the output from state, since it involves many provider API calls, rather than a single call to the authoritative state file (local or remote)?
I get the sense that this is a different conversation, and I don't want to derail this issue, so I'll leave it as something I don't quite understand yet about the Terragrunt flow 😁

Thanks for the clarification!

@JohannesRudolph
Copy link
Author

Thanks everyone who joined in on the discussion so far. I'll summarize my understanding of the issue and will try to clarify my original enhancement suggestion.

The key problem is that when terragrunt encounters a dependency it needs to resolve the output of the dependency terraform configuration. The correct approach is to perform a terraform init -get=false (we don't need modules, only providers) followed by terraform output -json on the dependency terraform configuration.

Since this is slow, various forms of caching have been proposed here. Any approach needs to maintain correctness though. The solutions and their respective limitations are

  1. Bypassing terraform alltogether and fetching state directly from the state backend via terragrunt-fetch-dependency-output-from-state cli flag. Only supported for S3.
  2. Speed up terraform init using TF_CACHE_DIR. This is a terraform feature, but is not safe with concurrent execution and can cause issues with lockfiles
  3. Cache the initialized dependency terraform configuration inside the .terragrunt-cache dir, speeding up terraform init because the configuration is already initialized (except on the first run). This could use the same terragrunt behavior already present for "auto init".
  4. Cache the output of terraform output -json. This is already done inside a single execution of terragrunt run-all and helps speeding up executions with common shared dependency configurations.

@Ido-DY suggested extending 1) to GCS. While that's possible, my next feature request would be to extend that to Azure as well as we have some state backends there. I expect sooner or later there will be further requests to add first-party support in terragrunt for more state backends.

Another variant is extending 4) to persist the cache between different terragrunt run-all configurations and somehow maintain correctness.

My personal favorite would be 3) as this builds on terragrunt behavior that is already there i.e. maintaining the initialized terraform configuration inside a .terragrunt-cache dir and auto-init. While this will still incur the cost of running terraform output -json, it will save the terraform init step which is super expensive without TF_CACHE_DIR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants