Add guidance/features for reducing disk space and bandwidth usage #2920

Open
brikis98 opened this issue Jan 29, 2024 · 48 comments · Fixed by #3001
Labels: enhancement (New feature or request)

Comments

@brikis98 (Member)

The problem

Many of our customers struggle with TG eating up a ton of disk space and bandwidth: hundreds of gigabytes in some cases! I think this comes from a few sources (note: #2919 may help provide the data we need to understand this better):

  1. Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
  2. Repos. When you set a source URL in TG, it downloads the whole repo into a .terragrunt-cache folder. If you have 50 terragrunt.hcl files with source URLs, and do run-all, it will download the repos 50 times—even if all 50 repos are the same!
  3. Git clone. I think TG is doing a "full" Git clone for the code in the source URL. We should consider doing a shallow clone, as that would be much faster/smaller.
  4. Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
  5. Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.

Goals

We should have the following goals:

  1. Each unique provider should only ever be downloaded once on a given computer.
  2. Each unique repo should only ever be cloned once on a given computer.
  3. Each unique module should only ever be cloned once on a given computer.
  4. git clone should be as efficient as possible: e.g., use shallow clones.

The solution

The solution will need to be some mix of:

  1. Provide guidance. One big part of this should be adding new docs to the TG docs site that provide clear, step-by-step instructions to optimize disk space and bandwidth usage. This could include:
    1. How to use TF features such as provider plugin caching and the provider_installation block to achieve the goals in the previous section (a brief sketch follows this list).
    2. How to configure K8S and other ephemeral systems to re-use previously downloaded data.
  2. Add new features. We should consider what new features could be added to TG to help. Some initial ideas:
    1. A new command to automatically configure the current system for "optimal" provider caching as per the guidance in the previous point.
    2. Switch to using shallow clones everywhere. See Speed up git clones and catalog command #2893 for some experimentation with adding a depth=1 param to tell go-getter to do a shallow clone.
    3. Use some sort of repo/module cache to avoid re-downloading modules over and over again.
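
As a rough illustration of point 1.1 above, here is a minimal sketch of a Terraform CLI configuration file (for example ~/.terraformrc) that turns on provider plugin caching; the cache path is only an example:

# Each provider version is downloaded into this directory once and then linked
# or copied into each module's .terraform directory on init.
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"

The provider_installation block mentioned in the same point is configured in this same CLI configuration file.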

Notes

  • We can implement these improvements incrementally. We don't have to fix everything in a single giant PR.
  • The providers seem to take up the most disk space and bandwidth, so we should prioritize that first: in fact, we should do the research and write up our guidance as the very first step, and then consider if new features are merited in follow-up steps.
  • We might not be able to achieve all the goals in the goals section, as some may be dependent on internal TF implementation details. But we should strive to get as close as we can.
@levkohimins (Contributor) commented Feb 5, 2024

  1. Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

We can point users to TF features such as provider plugin caching, but there is a drawback: Terraform does not guarantee safe concurrency (Terraform issue #31964 and Terragrunt issue #1875), so we cannot perform initialization in parallel across, for example, 50 terragrunt.hcl files, and the --terragrunt-parallelism 1 flag becomes mandatory. The solution may be to implement our own provider loader, by reviewing the Terraform code and integrating similar logic into the Terragrunt code, taking into account:

  1. First scan the Terragrunt modules and create a list of the necessary Terraform providers.
  2. Download all providers in parallel into a shared cache directory with the provider_installation layout HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET (see the sketch after this list).
  3. Create .terraform/providers/ directories with symbolic links for each module, or use provider_installation, which effectively does the same thing.
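
To make the proposed layout concrete, here is a rough sketch of a CLI configuration that points Terraform at such a shared cache via a filesystem mirror; the path is hypothetical, and the directory beneath it is assumed to follow the unpacked HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET layout:

provider_installation {
  filesystem_mirror {
    # Shared cache laid out as HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET, e.g.
    #   registry.terraform.io/hashicorp/aws/5.36.0/linux_amd64/
    path    = "/opt/terragrunt/providers"
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    # Providers from other hosts are still installed from their origin registries.
    exclude = ["registry.terraform.io/*/*"]
  }
}
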
  1. Repos. When you set a source URL in TG, it downloads the whole repo into a .terragrunt-cache folder. If you have 50 terragrunt.hcl files with source URLs, and do run-all, it will download the repos 50 times—even if all 50 repos are the same!

The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:

  1. Since the .terragrunt-cache directories are where Terraform creates the .terraform directories, we again run into a concurrency issue.
  2. Terragrunt copies its *.tf/*.hcl files into the modules of these downloaded repositories. We need to implement a different approach in which the downloaded repositories from .terragrunt-cache are left in their original state to avoid conflicts.
  1. Git clone. I think TG is doing a "full" Git clone for the code in the source URL. We should consider doing a shallow clone, as that would be much faster/smaller.

This will be easy. I checked: a shallow clone is roughly 1/3 to 1/2 smaller.

  1. Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

As far as I know, Terraform does not have a module caching feature. But we can implement it the same way as with providers: download modules into a common cache directory and then create symbolic links. I checked, and Terraform works well with module directories that refer to other directories. To summarize, I would suggest implementing something like terraform get, but as our own run-all get implementation based on the Terraform code.
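
A hypothetical sketch of what this could look like from a single module's point of view; the module name and local path are placeholders, and the path is assumed to be a symlink into a shared, Terragrunt-managed module cache:

module "vpc" {
  # "./modules/vpc" is assumed to be a symlink that points into the shared cache.
  source = "./modules/vpc"
}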

  1. Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.

I don't know how to solve this issue. Any suggestions?

@levkohimins (Contributor) commented Feb 5, 2024

I just took a quick look at the Terraform code, and unfortunately the code we need is located in the internal directory, so we cannot use it as a Go package without copying it into our codebase.

Of course, the obvious disadvantage is that if they suddenly make radical changes to module and provider loading, we will also need to update our code. But given that this can happen, we can deliver the new caching feature to users not as a default option but as a deliberate choice. In other words, they should explicitly run run-all init / run-all get / run-all cache / ... (I'm not sure what the best command name for this feature would be), which creates the shared cache.

@brikis98 (Member, Author) commented Feb 5, 2024

  1. Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

We can point users to TF features such as provider plugin caching, but there is a drawback: Terraform does not guarantee safe concurrency (Terraform issue #31964 and Terragrunt issue #1875), so we cannot perform initialization in parallel across, for example, 50 terragrunt.hcl files, and the --terragrunt-parallelism 1 flag becomes mandatory. The solution may be to implement our own provider loader, by reviewing the Terraform code and integrating similar logic into the Terragrunt code, taking into account:

  1. First scan the Terragrunt modules and create a list of the necessary Terraform providers.
  2. Download all providers in parallel into a shared cache directory with the provider_installation layout HOSTNAME/NAMESPACE/TYPE/VERSION/TARGET.
  3. Create .terraform/providers/ directories with symbolic links for each module, or use provider_installation, which effectively does the same thing.

I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading providers. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.

Here's a bit of a zany idea that leverages their public API: in the provider_installation configuration, you can specify a network_mirror. What if when you run Terragrunt, it:

  1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).
    1. Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
    2. The server should only listen on localhost (not 0.0.0.0) so it's accessible only from the local computer
    3. We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to localhost.
  2. Configures Terraform to use that localhost URL as its network_mirror.
  3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:
    1. If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
    2. If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.

We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.
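
For reference, a rough sketch of the CLI configuration Terragrunt could generate to point Terraform at such a local mirror; the port is arbitrary, and note that Terraform expects network mirrors to be served over HTTPS, which is one of the details a prototype would need to work through:

provider_installation {
  network_mirror {
    # Hypothetical mirror run by Terragrunt on the local machine.
    url = "https://localhost:5758/providers/"
  }
}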

The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:

  1. Since the .terragrunt-cache directories are where Terraform creates the .terraform directories, we again run into a concurrency issue.
  2. Terragrunt copies its *.tf/*.hcl files into the modules of these downloaded repositories. We need to implement a different approach in which the downloaded repositories from .terragrunt-cache are left in their original state to avoid conflicts.

Yea, both of these are valid issues. Any ideas on solutions? Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks offer a potential solution?

  1. Git clone. I think TG is doing a "full" Git clone for the code in the source URL. We should consider doing a shallow clone, as that would be much faster/smaller.

This will be easy. I checked: a shallow clone is roughly 1/3 to 1/2 smaller.

It's easy for most things, but as I found out in #2893, one issue we hit with shallow clones is that the catalog command uses the Git repo to look up tags, which you can't do with a shallow clone. So we may need some conditional logic where we use shallow clones by default, but if something needs to do a look up in Git history, we swap to a full clone.

  1. Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

As far as I know, Terraform does not have a module caching feature. But we can implement it the same way as with providers: download modules into a common cache directory and then create symbolic links. I checked, and Terraform works well with module directories that refer to other directories. To summarize, I would suggest implementing something like terraform get, but as our own run-all get implementation based on the Terraform code.

I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading modules. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.

Are there any hooks for downloading modules? For example, how does Terraform work in an air gapped environment?

  1. Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.

I don't know how to solve this issue. Any suggestions?

This would mostly be about documenting how to persist data, such as a provider cache, in a K8S cluster: e.g., with persistent volumes.

@levkohimins (Contributor) commented Feb 6, 2024

  1. Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading providers. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change.

Agree.

Here's a bit of a zany idea that leverages their public API: in the provider_installation configuration, you can specify a network_mirror. What if when you run Terragrunt, it:

  1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).

    1. Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
    2. The server should only listen on localhost (not 0.0.0.0) so it's accessible only from the local computer
    3. We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to localhost.
  2. Configures Terraform to use that localhost URL as its network_mirror.

  3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:

    1. If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
    2. If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.

We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.

Great idea. Yes, there are; one of them, https://github.com/terralist/terralist, supports both module and provider registries.
But I am not sure this will solve all our issues. We can reduce the amount of traffic by reading an already-downloaded plugin from disk, but each Terraform process will still store the plugin received from the network mirror in its own .terraform directory. So the disk usage will be the same, plus one more copy for the (proxy) private registry.

The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:

  1. Since the .terragrunt-cache directories are where Terraform creates the .terraform directories, we again run into a concurrency issue.
  2. Terragrunt copies its *.tf/*.hcl files into the modules of these downloaded repositories. We need to implement a different approach in which the downloaded repositories from .terragrunt-cache are left in their original state to avoid conflicts.

Yea, both of these are valid issues. Any ideas on solutions?

  1. We can store .terraform data separately from repositories by changing the path with TF_DATA_DIR
  2. I cannot say now, but I'm sure there is a solution.

Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks offer a potential solution?

In the case of npm this is justified, since npm itself understands where to get which files from. In our case, Terraform needs to be given a regular file structure, and for that we would have to create hundreds or thousands of symlinks. Creating such a store is itself a non-trivial task, and considering that modules are not as big as plugins, I'm not sure it's worth spending that much time and effort on it.

  1. Git clone. I think TG is doing a "full" Git clone for the code in the source URL. We should consider doing a shallow clone, as that would be much faster/smaller.

This will be easy. I checked: a shallow clone is roughly 1/3 to 1/2 smaller.

It's easy for most things, but as I found out in #2893, one issue we hit with shallow clones is that the catalog command uses the Git repo to look up tags, which you can't do with a shallow clone. So we may need some conditional logic where we use shallow clones by default, but if something needs to do a look up in Git history, we swap to a full clone.

Ah, will keep this in mind, thanks.

  1. Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

Are there any hooks for downloading modules? For example, how does Terraform work in an air gapped environment?

We can run a private registry locally, but then we would need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be as critical.

  1. Ephemeral storage. Many users run TG in places with ephemeral storage—e.g., in a K8S cluster—where the disk is totally empty/fresh on each run. So all the stuff you downloaded in previous runs isn't even available, and you have to download everything from scratch each time.

I don't know how to solve this issue. Any suggestions?

This would mostly be about documenting how to persist data, such as a provider cache, in a K8S cluster: e.g., with persistent volumes.

Ah, understood.

@brikis98 (Member, Author) commented Feb 6, 2024

Here's a bit of a zany idea that leverages their public API: in the provider_installation configuration, you can specify a network_mirror. What if when you run Terragrunt, it:

  1. Fires up a web server listening on localhost that implements the provider mirror network protocol (maybe there's even open source Go code that does this already).

    1. Actually, we might first ping the URL to see if a TG server is already running there, and if so, just use that one. This handles the case where you have multiple instances of TG running concurrently.
    2. The server should only listen on localhost (not 0.0.0.0) so it's accessible only from the local computer
    3. We should also configure randomly-generated credentials for the server so random websites you open on your computer can't make requests to localhost.
  2. Configures Terraform to use that localhost URL as its network_mirror.

  3. When Terraform queries the localhost server for a provider, the server, in memory, can maintain locks on a per-provider basis:

    1. If no one has the lock for this provider already, Terragrunt checks the local disk (in a predictable file path) to see if the provider is already there. If it is, it gives Terraform a local file path. If it's not already on disk, it tells Terraform to download the provider from whatever the original URL was to the predictable file path on disk (I think this is doable with the provider mirror network protocol, but not 100% sure).
    2. If someone already has the lock, then Terragrunt has Terraform wait, and then looks up the provider from disk.

We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.

Great idea. Yes, there are; one of them, https://github.com/terralist/terralist, supports both module and provider registries. But I am not sure this will solve all our issues. We can reduce the amount of traffic by reading an already-downloaded plugin from disk, but each Terraform process will still store the plugin received from the network mirror in its own .terraform directory. So the disk usage will be the same, plus one more copy for the (proxy) private registry.

If we also enable the plugin cache (which TG could enable via env var automatically when executing terraform), I think TF will use a symlink to the cache, rather than copying the whole thing again. Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?

The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:

  1. Since the .terragrunt-cache directories are where Terraform creates the .terraform directories, we again run into a concurrency issue.
  2. Terragrunt copies its *.tf/*.hcl files into the modules of these downloaded repositories. We need to implement a different approach in which the downloaded repositories from .terragrunt-cache are left in their original state to avoid conflicts.

Yea, both of these are valid issues. Any ideas on solutions?

  1. We can store .terraform data separately from repositories by changing the path with TF_DATA_DIR
  2. I cannot say now, but I'm sure there is a solution.

Alright, keep thinking about it in the background to see if you can come up with something. One thing I stumbled across recently that may be of use: hashicorp/terraform#28309

Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks offer a potential solution?

In the case of npm this is justified, since npm itself understands where to get which files from. In our case, Terraform needs to be given a regular file structure, and for that we would have to create hundreds or thousands of symlinks. Creating such a store is itself a non-trivial task, and considering that modules are not as big as plugins, I'm not sure it's worth spending that much time and effort on it.

Fair enough.

  1. Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

Are there any hooks for downloading modules? For example, how does Terraform work in an air gapped environment?

We can run a private registry locally, but then we would need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be as critical.

I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.

@lorengordon (Contributor) commented Feb 6, 2024

I'm not sure how useful this is as far as addressing the issue from within terragrunt, but figured I'd share in the sense that there are approaches users could take to address the issues within their own pipelines/workflows... It is complicated though, and also maybe abuses some implementation details of terraform. Definitely welcome the conversation and would appreciate any features within terragrunt that address these issues more directly!

One thing I started doing is maintaining a single terraform config of all the providers and modules that are in use across the whole project. I call it the vendor config. This vendor config provides no inputs to any module, and is used only for running terraform init. For example:

terraform {
  required_version = "1.6.5" # terraform-version

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.35.0"
    }
  }
}

module "foo" {
  source = "git::https://url/to/foo/module?ref=1.0.0"
}

module "bar" {
  source = "git::https://url/to/bar/module?ref=1.0.0"
}

Then in the "real" terragrunt and terraform configs, the module source points to the relative path to the .terraform directory that was initialized. E.g.

source = "../../vendor/.terraform/modules/bar"

source = "../../vendor/.terraform/modules/foo"

Before running any terragrunt commands, we run terraform -chdir=vendor init -backend=false -lock=false to populate the provider and module cache. Combined with TF_PLUGIN_CACHE_DIR, this setup ensures the providers and modules are only downloaded the one time over the network. We also use this setup to manage all module versions in a single place.

(The provider versions in that vendor config also update the resulting .terraform.lock.hcl, which we use across all stacks in the project. I think it's a nice and clean way to manage the lock file, but I don't think that is quite as relevant to this particular issue.)

@levkohimins (Contributor) commented Feb 6, 2024

Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

If we also enable the plugin cache (which TG could enable via env var automatically when executing terraform), I think TF will use a symlink to the cache, rather than copying the whole thing again.

I took a quick look at the terraform code https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go

Briefly, step by step, how the Terraform plugin cache works:

	// Step 1: Which providers might we need to fetch a new version of?
	// This produces the subset of requirements we need to ask the provider
	// source about. If we're in the normal (non-upgrade) mode then we'll
	// just ask the source to confirm the continued existence of what
	// was locked, or otherwise we'll find the newest version matching the
	// configured version constraint.

	// Step 2: Query the provider source for each of the providers we selected
	// in the first step and select the latest available version that is
	// in the set of acceptable versions.
	//
	// This produces a set of packages to install to our cache in the next step.

	// Step 3: For each provider version we've decided we need to install,
	// install its package into our target cache (possibly via the global cache).

So the idea with the locks might work, since Terraform first queries which versions exist in the registry and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait for the private registry until a plugin is downloaded, so a timeout may occur when the user's Internet speed is low and the plugin is large.

@lorengordon suggested an interesting idea. Thanks @lorengordon! On the one hand, we don't need any private registries, which eliminates a huge number of issues that we don't yet know about; on the other hand, we would have to implement logic that generates such a config on the fly and replaces the source in the other configs. But there is a drawback too: this will not work correctly with modules, since, unlike providers, only a specific version of a module is stored in .terraform/modules, and with 50 terragrunt.hcl files there may be cases where the module is the same but the versions are different. A solution could be: instead of replacing the source in configurations, we can create symlinks for each module's .terraform/providers, so that the plugins are shared but the modules are downloaded individually.

By the way, I don't know if the symlink approach is workable on Windows OS at all. Should I check it or does someone already know the answer? :)

Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?

Of course I can, but can you please confirm that this request is still relevant?

Repos. When you set a source URL in TG, it downloads the whole repo into a .terragrunt-cache folder. If you have 50 terragrunt.hcl files with source URLs, and do run-all, it will download the repos 50 times—even if all 50 repos are the same!

  1. We can store .terraform data separately from repositories by changing the path with TF_DATA_DIR
  2. I cannot say now, but I'm sure there is a solution.

Alright, keep thinking about it in the background to see if you can come up with something.

Sure, will do.

One thing I stumbled across recently that may be of use: hashicorp/terraform#28309

Ah, very interesting. I don't know yet whether this will be useful to us, but I'll keep it in mind. Thanks.

Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
We can run a private registry locally, but then we need to change the module links to point to this private register. Perhaps we could do this automatically, after cloning repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins, this may not be so critical.

I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.

Agree. What should the decision be?

  1. If we use a private registry for providers, should we also use a private registry for modules to reduce traffic?
  2. Or should we just do nothing?

@lorengordon (Contributor)

But there is a drawback too: this will not work correctly with modules, since, unlike providers, only a specific version of a module is stored in .terraform/modules, and with 50 terragrunt.hcl files there may be cases where the module is the same but the versions are different.

While we see supporting a single version of a module as a bonus, if I had to support multiple versions, I would do it by changing the module label. That label is what maps to the path in the .terraform/modules directory. For example:

module "foo_1.0.0" {
  source = "git::https://url/to/foo/module?ref=1.0.0"
}

module "foo_1.0.1" {
  source = "git::https://url/to/foo/module?ref=1.0.1"
}

and then referencing paths like so:

source = "../../vendor/.terraform/modules/foo_1.0.0"

source = "../../vendor/.terraform/modules/foo_1.0.1"

@lorengordon (Contributor) commented Feb 6, 2024

One place I know of where my approach does fall over for modules, though, is nested modules. If a vendor module itself references another remote module, there is no way I've yet figured out within terraform to capture and overwrite that nested remote source.

@levkohimins (Contributor) commented Feb 6, 2024

While we see supporting a single version of a module as a bonus, if I had to support multiple versions, I would do it by changing the module label. That label is what maps to the path in the .terraform/modules directory.

Ah, right, this might work :) We need to weigh whether it's worth parsing all the configs to change the names of the modules and their sources, or accepting duplication of modules as a compromise.

One place I know of where my approach does fall over for modules, though, is nested modules. If a vendor module is itself referencing another remote module, there is no way I know of within terraform to capture and overwrite that nested remote source.

Oh really, this idea won't work with nested terraform modules, since each terraform module creates its own .terraform folder in its root.

@lorengordon (Contributor) commented Feb 6, 2024

For example, how does Terraform work in an air gapped environment?

I also support air-gapped environments. We only use modules that use source = git::https://..., and then we mirror modules to an internally accessible host and use the git "insteadOf" option to rewrite git URLs in our shell configs.

For providers, we host an accessible provider mirror and use the network_mirror option in the .terraformrc file.

@lorengordon (Contributor) commented Feb 7, 2024

Oh really, this idea won't work with nested terraform modules, since each terraform module creates its own .terraform folder in its root.

Modules are tricky overall anyway, since the terraform network mirror only supports providers, not modules. You'd have to use something like the host option of .terraformrc as suggested earlier, and a localhost implementation of the module registry. That's probably the easiest. Otherwise, you're parsing through the .terraform directory for module blocks, swapping out remote source for a filesystem location, running init, and recursively repeating that until everything is resolved to a local on-filesystem location.
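
For illustration, a rough sketch of the kind of host block in .terraformrc being referred to; the hostname, port, and path are placeholders, and this assumes something implementing the module registry protocol is actually listening at that address:

host "modules.terragrunt.local" {
  services = {
    "modules.v1" = "https://localhost:8081/v1/modules/"
  }
}

Module sources would then reference addresses under that hostname, e.g. modules.terragrunt.local/<namespace>/<name>/<provider>.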

@levkohimins (Contributor) commented Feb 7, 2024

@lorengordon, yeah, you are right! I think we shouldn’t bother so much with modules, since they usually take up several megabytes. I wouldn't touch them.

@brikis98, here is a suggestion on how to resolve the issue of duplicated providers, based on @lorengordon's suggestion.

  1. Parse all tf configs to create a single config of all providers.
  2. Use provider plugin caching.
  3. Run terraform init for this config.
  4. Continue run-all ... .

This way, one Terraform process downloads all the providers at the same time, eliminating the concurrency issue, and in the end we have a cache with all the necessary providers.
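
As a rough illustration, the single generated config from step 1 might look something like this; the provider names and version constraints are placeholders:

terraform {
  required_providers {
    # Union of the providers referenced across all of the scanned modules.
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = ">= 3.0"
    }
  }
}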

@lorengordon (Contributor) commented Feb 7, 2024

  1. Parse all tf configs to create a single config of all providers.

One sticking point with that step, even for providers, is any config that uses a module with a remote source. Remote modules may have provider requirements also. "Parsing all tf configs" to figure out all the providers in use and their version constraints, necessarily involves retrieving all remote modules. And so we're now reinventing a lot of the plumbing around terraform init.

There may be an optimization available, though, if the .terraform.lock.hcl files are checked in / available locally. Parse all of those for the provider requirements....

@levkohimins (Contributor) commented Feb 7, 2024

One sticking point with that step, even for providers, is any config that uses a module with a remote source. Remote modules may have provider requirements also. "Parsing all tf configs" to figure out all the providers in use and their version constraints, necessarily involves retrieving all remote modules. And so we're now reinventing a lot of the plumbing around terraform init.

Ok, but if we also include the modules in the single config, it will also download the providers of those modules, right? After that we can just discard the modules as garbage.

@brikis98 (Member, Author) commented Feb 9, 2024

Providers. By default, Terraform downloads providers from scratch every time you run init. This isn't a problem for a single module, but if you do run-all in a repo that has, say, 50 terragrunt.hcl files, each one runs init on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!

If we also enable the plugin cache (which TG could enable via env var automatically when executing terraform), I think TF will use a symlink to the cache, rather than copying the whole thing again.

I took a quick look at the terraform code https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go

Briefly, step by step, how the Terraform plugin cache works:

	// Step 1: Which providers might we need to fetch a new version of?
	// This produces the subset of requirements we need to ask the provider
	// source about. If we're in the normal (non-upgrade) mode then we'll
	// just ask the source to confirm the continued existence of what
	// was locked, or otherwise we'll find the newest version matching the
	// configured version constraint.

	// Step 2: Query the provider source for each of the providers we selected
	// in the first step and select the latest available version that is
	// in the set of acceptable versions.
	//
	// This produces a set of packages to install to our cache in the next step.

	// Step 3: For each provider version we've decided we need to install,
	// install its package into our target cache (possibly via the global cache).

So the idea with the locks might work, since Terraform first queries which versions exist in the registry and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait for the private registry until a plugin is downloaded, so a timeout may occur when the user's Internet speed is low and the plugin is large.

Did you actually test this out and see a timeout issue? Or are you just guessing that it might be an issue?

By the way, I don't know if the symlink approach is workable on Windows OS at all. Should I check it or does someone already know the answer? :)

AFAIK, symlinks work more or less as you'd expect on Win 10/11.

Can you think of a quick & dirty way to test out these hypotheses and see if this is a viable path forward?

Of course I can, but can you please confirm that this request is still relevant?

I'll create a separate comment shortly to summarize the options on the table and address this there.

Modules. When you run init, Terraform downloads repos to a .modules folder. If you have 50 terragrunt.hcl files, each of which has a source URL pointing to TF code that contains, say, 10 modules, then when you do run-all, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!
We can run a private registry locally, but then we would need to change the module links to point to this private registry. Perhaps we could do this automatically, after cloning repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be as critical.

I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there.

Agree. What should the decision be?

  1. If we use a private registry for providers, should we also use a private registry for modules to reduce traffic?
  2. Or should we just do nothing?

For now, let's gather all ideas, and then decide which ones to test out, and in which order. Reducing provider downloads is definitely a higher priority than the module stuff, so that should be the first thing to focus on.

@brikis98 (Member, Author) commented Feb 9, 2024

I'm not sure how useful this is as far as addressing the issue from within terragrunt, but figured I'd share in the sense that there are approaches users could take to address the issues within their own pipelines/workflows... It is complicated though, and also maybe abuses some implementation details of terraform. Definitely welcome the conversation and would appreciate any features within terragrunt that address these issues more directly!

One thing I started doing is maintaining a single terraform config of all the providers and modules that are in use across the whole project. I call it the vendor config. This vendor config provides no inputs to any module, and is used only for running terraform init. For example:

terraform {
  required_version = "1.6.5" # terraform-version

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.35.0"
    }
  }
}

module "foo" {
  source = "git::https://url/to/foo/module?ref=1.0.0"
}

module "bar" {
  source = "git::https://url/to/bar/module?ref=1.0.0"
}

Then in the "real" terragrunt and terraform configs, the module source points to the relative path to the .terraform directory that was initialized. E.g.

source = "../../vendor/.terraform/modules/bar"

source = "../../vendor/.terraform/modules/foo"

Before running any terragrunt commands, we run terraform -chdir=vendor init -backend=false -lock=false to populate the provider and module cache. Combined with TF_PLUGIN_CACHE_DIR, this setup ensures the providers and modules are only downloaded the one time over the network. We also use this setup to manage all module versions in a single place.

(The provider versions in that vendor config also update the resulting .terraform.lock.hcl, which we use across all stacks in the project. I think it's a nice and clean way to manage the lock file, but I don't think that is quite as relevant to this particular issue.)

Thanks for sharing this approach! Definitely a cool idea.

As pointed out in subsequent comments, this approach doesn't quite seem to handle nested modules... And we have a lot of those. So it feels promising, but not quite complete.

@levkohimins (Contributor) commented Feb 9, 2024

So the idea with the locks might work, since Terraform first queries which versions exist in the registry and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait for the private registry until a plugin is downloaded, so a timeout may occur when the user's Internet speed is low and the plugin is large.

Did you actually test this out and see a timeout issue? Or are you just guessing that it might be an issue?

So far, only guesses.

@levkohimins (Contributor) commented Feb 9, 2024

Thanks for sharing this approach! Definitely a cool idea.

As pointed out in subsequent comments, this approach doesn't quite seem to handle nested modules... And we have a lot of those. So it feels promising, but not quite complete.

I could be wrong, but doesn't terraform init ensure that all providers are downloaded for nested, or even doubly nested, modules? (see comment)

@brikis98 (Member, Author) commented Feb 9, 2024

OK, let me summarize the ideas on the table so far:

Problem 1: Providers

Reducing bandwidth and disk space usage with providers is the highest priority and should be the thing we focus on first.

Idea 1: network mirror running on localhost

As described here:

  1. TG runs a server on localhost.
  2. TG configures that server as a network_mirror for downloading providers.
  3. This server does in-memory locking to ensure there are no issues with downloading providers concurrently.
  4. TG also enables plugin caching. This ensures each plugin is only ever downloaded once.

There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.

Idea 2: simple pre-process

Loosely based on @lorengordon's approach, as @levkohimins wrote up here:

  1. Parse all tf configs to create a single config of all providers.
  2. Use provider plugin caching.
  3. Run terraform init for this config.
  4. Continue run-all ... .

This is promising, but I've crossed it out because this approach doesn't handle nested modules. That is, the parsing in step (1) would only find the top-level modules, but after running init on those, they may contain nested modules, which define further providers and other nested modules.

Idea 3: more complicated pre-process

This is a slight tweak on idea 2:

  1. Parse all TF configs to extract all required_providers and module blocks.
  2. Copy all the required_providers blocks into a single main.tf.
  3. Copy all module blocks into a single main.tf, but (a) only copy the source and version parameters from within the body of a module block, ignoring all other parameters so we don't have to deal with variables, resources, etc., and (b) give each module block a unique ID to ensure they don't clash (a sketch of such a file follows this list).
  4. Run terraform get. This will just download all the modules including nested modules into the .terraform folder.
  5. Next, walk the tree from top-level modules in main.tf to the underlying code in .terraform, and then for each module in .terraform, repeat the process recursively to find all nested modules. As you do this walk, parse out all required_providers blocks and copy them into a totally new main.tf.
  6. Run init on that totally new main.tf.
  7. Continue run-all...
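
For illustration, a hypothetical fragment of the aggregated main.tf described in steps 2 and 3; the provider, module labels, and sources are placeholders, and each module block keeps only its source (and version, for registry modules) under a unique label:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

module "networking_0001" {
  source = "git::https://example.com/org/networking.git?ref=v1.2.0"
}

module "networking_0002" {
  source  = "example-org/networking/aws"
  version = "2.0.0"
}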

This seemed like an approach that would allow us to fix the weaknesses of idea 2, but as I wrote it out, I realized this approach also has problems:

  1. You don't have to specify a provider in required_providers. For providers in the TF registry, it's enough to include a provider block or any resource or data source, and Terraform will automatically figure out which provider you want, and download it. So just extracting required_providers and module blocks is not enough!
  2. The approach above doesn't take into account lock files. Each module may have a different lock file, and we'd need to respect it.

When I realized problem 1, I thought it might be solvable by pulling all resources, data sources, provider blocks, etc into our mega main.tf, but once I saw problem 2, I more or less gave up. This approach feels like a dead end. We'd be recreating so much TF logic, that we're almost certain to have weird bugs and difficulty maintaining this code.

Recommendation: small prototype of idea 1

Unless anyone has other ideas to consider, I recommend that we build the smallest prototype of idea 1 that we can. In fact, perhaps we should build a tiny web server that just hangs indefinitely (doesn't actually download providers or do any locking or anything else) solely to see if the timeout thing is going to be a real problem. If it is, we'll need new ideas. If not, we can proceed with having the prototype actually do some work.

Problem 2: TG source URLs

The next priority is the source URLs in TG, which are downloaded multiple times, and for which we do a full clone. We should only invest time in this after making improvements to problem 1 above.

I'd recommend:

  1. Switch to a shallow clone.
  2. Zany idea to consider (somewhat similar to Proposal: Create Content Adressable Store for Terragrunt #2923): maintain a system-wide TG module cache, perhaps in ~/.terragrunt/cache. In this cache, we would only ever download a single unique source URL just once. Then, when TG is running, what it puts into ~/.terragrunt-cache is a bunch of symlinks pointing to ~/.terragrunt/cache, plus any files it copies from the current working dir and any generated files. Generating a symlink for every file in the source URL is probably a bit tedious, but having to download each repo only once saves time and bandwidth, and I'm guessing the symlinks will save some disk space over newly downloaded copies.

Problem 3: TF module downloads

This is the next priority: Terraform re-downloading the same modules into .terraform folders over and over again. We should only invest time in this after making improvements to problems 1 and 2 above.

I haven't heard any working ideas for how to improve on this yet, so please toss out ideas.

Problem 4: Ephemeral caches

The final priority is explaining best practices for using TG in a place with ephemeral storage, such as K8S. We should only invest time in this after making improvements to problems 1, 2, and 3 above.

I think this is mainly documenting the need to use persistent disk stores.

@lorengordon (Contributor)

doesn't terraform init ensure that all providers are downloaded for nested, or even doubly nested, modules?

Yes, it does. The "single config" option, using what I called the vendor config, does retrieve all providers. And it will generate a lock file that contains all the provider constraints.

It also retrieves all modules, including nested ones. The problem with nested modules are those specifically with remote sources. If the source is local within the nested module, no problem, the local relative path is fine. But a remote source will re-download the remote module when init is executed in the "real" config.

However, one thing that just occurred to me to address that, would be to pre-populate the .terraform/modules directory of the "real" config, using the content previously retrieved using the vendor config. Basically copy the modules from one directory to another, or maybe symlink if possible. Then terraform init in the "real" config would see that all the remote modules are already present. Unfortunately, lining up the module label names and the directory names would take quite a bit more parsing....

@yhakbar (Collaborator) commented Feb 12, 2024

Regarding pre-fetching provider binaries, what if we just have an opt-in configuration (like an environment variable named PRE_FETCH_BINARIES) to support pre-fetching the provider binaries with a naive assumption that the providercache will never change, then create an RFC requesting that OpenTofu move the providercache package out of internal?

If OpenTofu accepts the RFC, we can switch to an opt-out configuration, and rely on the public package to handle the logic for pre-fetching provider binaries in OpenTofu and use our naive custom logic for Terraform.

Would this handle your concerns regarding the potential changing logic in the internal providercache package, @levkohimins ?

@brikis98
Would it be a simpler first pass to only support concurrent pre-fetching for directories that contain a .terraform.lock.hcl file? I've seen the file lock providers for nested modules in addition to the topmost module. Those can be safely downloaded concurrently, as the providers in the .terraform.lock.hcl files can be merged, then deduplicated, to provide a list of plugins to pre-fetch. For timeouts that may occur due to bandwidth issues, we can make the maximum concurrent plugin downloads and download retries configurable for low-bandwidth environments.

For directories without .terraform.lock.hcl files, they can be initialized in series, which should be fast on repeat runs (and on runs after the initial init that downloads the latest version of a given provider plugin) if TF_PLUGIN_CACHE_DIR is populated and the directory is persisted between runs.

@levkohimins (Contributor) commented Feb 12, 2024

Would this handle your concerns regarding the potential changing logic in the internal providercache package, @levkohimins ?

@yhakbar, the concern is not only that the code we are interested in is located in the internal/ directory, but that if the provider loading logic changes, we will have to rewrite the corresponding code in Terragrunt as well. I think such changes are unlikely to completely break provider loading, since the developers always try to maintain compatibility with older versions, but either way it puts more of a workload on us than just running the terraform init command.

@levkohimins (Contributor) commented Feb 12, 2024

Idea 1: network mirror running on localhost

As described here:

  1. TG runs a server on localhost.
  2. TG configures that server as a network_mirror for downloading providers.
  3. This server does in-memory locking to ensure there are no issues with downloading providers concurrently.
  4. TG also enables plugin caching. This ensures each plugin is only ever downloaded once.

There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.

@brikis98,
I found out that providers must be present in .terraform.lock.hcl, otherwise Terraform re-downloads providers even when they are already present in the cache. This means that, one way or another, Terraform functionality must be partially implemented inside Terragrunt in order to generate this file, otherwise it simply will not work. Here is what it looks like:

provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.36.0"
  constraints = "5.36.0"
  hashes = [
    "h1:54QgAU2vY65WZsiZ9FligQfIf7hQUvwse4ezMwVMwgg=",
    "zh:0da8409db879b2c400a7d9ed1311ba6d9eb1374ea08779eaf0c5ad0af00ac558",
    "zh:1b7521567e1602bfff029f88ccd2a182cdf97861c9671478660866472c3333fa",
    "zh:1cab4e6f3a1d008d01df44a52132a90141389e77dbb4ec4f6ac1119333242ecf",
    "zh:1df9f73595594ce8293fb21287bcacf5583ae82b9f3a8e5d704109b8cf691646",
    "zh:2b5909268db44b6be95ff6f9dc80d5f87ca8f63ba530fe66723c5fdeb17695fc",
    "zh:37dd731eeb0bc1b20e3ec3a0cb5eb7a730edab425058ff40f2243438acc82830",
    "zh:3e94c76a2b607a1174d10f5712aed16cb32216ac1c91bd6f21749d61a14045ac",
    "zh:40e6ba3184d2d3bf283a07feed8b79c1bbc537a91215cac7b3521b9ccb3e503e",
    "zh:67e52353fea47eb97825f6eb6fddd1935e0ff3b53a8861d23a70c2babf83ae51",
    "zh:6d2e2f390e0c7b2cd2344b1d5d6eec8a1c11cf35d19f1d6f341286f2449e9e10",
    "zh:7005483c43926800fad5bb18e27be883dac4339edb83a8f18ccdc7edf86fafc2",
    "zh:7073fa7ccaa9b07c2cf7b24550a90e11f4880afd5c53afd51278eff0154692a0",
    "zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
    "zh:a6d48620e526c766faec9aeb20c40a98c1810c69b6699168d725f721dfe44846",
    "zh:e29b651b5f39324656f466cd24a54861795cc423a1b58372f4e1d2d2112d10a0",
  ]
}

About the connection timeout concern: Terraform terminates connections if the registry does not respond within 10-15 seconds and exits with the error Error: Failed to install provider. So it turns out that the idea with locks is not feasible. In any case, the idea of a private registry was not very promising, since issues could occur with firewalls, and we would also have to check that the port the registry listens on is not busy, etc.

A workaround could be to skip the private registry entirely and run terraform init non-parallel/sequentially for all terragrunt.hcl files before the target command, which can then run in parallel, but we would still have to generate the .terraform.lock.hcl files.
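
A rough sketch of what that workaround could look like:

# Init every module one at a time so lock files and the plugin cache are populated without races...
terragrunt run-all init --terragrunt-parallelism 1
# ...then run the target command with normal parallelism.
terragrunt run-all plan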

🤷‍♂️ Honestly, I don’t see any other way than for Terragrunt itself to implement fetching of providers and modules, to ensure maximum performance and predictable behavior.

@brikis98
Copy link
Member Author

Thanks for looking into this.

Here's one more silly idea to try:

  • TG runs a server on localhost and configures it as a network_mirror, as before.
  • As that mirror gets requests, it forwards them to the real underlying registry and proxies through that registry's response. Responses are just as fast as normal, so we don't hit timeout issues.
  • When Terraform tries to download the actual providers from our localhost network_mirror, we do not proxy the files; we just return a 4xx or 5xx. So Terraform will fail. That's OK; we can hide this failure message from the user.
  • However, the localhost server has recorded all the providers that Terraform tried to download... So now we have the full list of all requested providers. We de-dupe the list, get the whole thing downloaded concurrently and added to the cache.
  • Now we can run the run-all commands as necessary and everything should run from the cache.

In short, we're letting Terraform figure out what providers it needs, and the network_mirror is just there to get that information from Terraform. We can then use that to efficiently fetch all the providers we need, and then let Terraform run off the cache.
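
For reference, the CLI configuration handed to Terraform in such a setup would look roughly like the following sketch (the address and paths are purely illustrative):

# Sketch of a generated Terraform CLI config (e.g. passed via TF_CLI_CONFIG_FILE).
plugin_cache_dir = "/home/user/.terraform.d/plugin-cache"

provider_installation {
  network_mirror {
    # Illustrative address of the local mirror run by TG.
    url = "https://127.0.0.1:5758/"
  }
}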

I'm skipping over a bunch of details, but at a high level, WDYT?

@levkohimins
Copy link
Contributor

levkohimins commented Feb 16, 2024

Interesting idea! But we still have to generate the .terraform.lock.hcl file before running the terraform command, otherwise, if our terragrunt.hcl files have multiple identical providers, we will run into the following issues:

  1. Terraform will not take providers already present in the cache into account and will download them again and again.
  2. One instance overwrites an existing file in the cache while other instances may already be using it, and then an error like this occurs:
╷
│ Error: Failed to install provider from shared cache
│ 
│ Error while importing hashicorp/google v5.9.0 from the shared cache
│ directory: the provider cache at .terraform/providers has a copy of
│ registry.terraform.io/hashicorp/google 5.9.0 that doesn't match any of the
│ checksums recorded in the dependency lock file.
╵

This is why we must specify the --terragrunt-parallelism 1 flag when using the terraform cache, at least for now.

To generate this .terraform.lock.hcl file, I think we need to bring over some Terraform logic. I'm not sure how much, but I can figure it out. WDYT?

@brikis98
Copy link
Member Author

Let's assume for now that for any module without a lock file, we run init sequentially to generate it.

If the network_mirror approach works at all, then perhaps we can generate the lock file as part of that same process.

@levkohimins
Copy link
Contributor

Resolved in v0.56.4 release. Make sure to read Provider Caching.
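
For anyone landing here later: the feature is opt-in, and enabling it looks roughly like this (see the Provider Caching docs for the exact flags):

# Either via the CLI flag...
terragrunt run-all plan --terragrunt-provider-cache
# ...or via the environment variable.
TERRAGRUNT_PROVIDER_CACHE=1 terragrunt run-all plan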

@brikis98
Copy link
Member Author

@levkohimins I think that only resolved the provider caching part. There are many other tasks in this issue, so I'm going to reopen it.

@brikis98 brikis98 reopened this Apr 11, 2024
@amontalban
Copy link

Joining the party because we are after a solution to the problem described here.

Today I tested the latest Terragrunt (v0.57.2) in an Atlantis setup and I'm having some mixed results, which I think are due to the cache server being spun up for each thread and potentially causing a race condition.

Have you considered offering the cache server as a standalone service that I can spin up on instance boot and share among all processes?

Thank you for working on this!

@levkohimins
Copy link
Contributor

levkohimins commented Apr 19, 2024

@amontalban, That's true. Each Terragrunt instance runs its own cache server. We use file locking to prevent conflicts when multiple Terragrunt instances try to cache the same provider. What do you mean by

I'm having some mixed results

Thinking out loud: for a standalone server, we would need connections (e.g. gRPC) between the Terragrunt instances and the Terragrunt Cache Server itself, so that the instances receive notifications from the cache server when the cache is ready.

@brikis98, Interesting what you think about this.

@amontalban
Copy link

What do you mean by

I'm having some mixed results

Hi @levkohimins!

Some of the plans work and some don't on the same Atlantis PR, and I think it is because all threads (we have Atlantis configured to run up to 10 in parallel) are trying to lock/download providers at the same time. For example, a working one:

time=2024-04-18T22:43:57Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:36425
time=2024-04-18T22:43:57Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T22:43:59Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.30.0 into /home/atlantis/.cache/terragrunt/modules/4eoLS_PnCDG--fz0b0bUcb6_sjY/Z_nexO2qqCg5RPmJa_gkAX4ynAY
time=2024-04-18T22:44:13Z level=info msg=Provider "registry.terraform.io/hashicorp/aws/5.45.0" is cached

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Finding hashicorp/aws versions matching ">= 5.43.0"...

A non-working one:

time=2024-04-18T23:16:12Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:39541
time=2024-04-18T23:16:12Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T23:16:16Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.37.1 into /home/atlantis/.cache/terragrunt/modules/CZJOkESJyj2a2n17Ph1lBtPy7p8/Z_nexO2qqCg5RPmJa_gkAX4ynAY prefix=[/home/atlantis/.atlantis/repos/ACME/terraform/838/provider_aws_dev__global_iam_roles_sre-role/provider/aws/security/_global/iam/policies/sre-assume-role] 
time=2024-04-18T23:16:16Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/ACME/tf-aws-iam-saml-provider.git?ref=v1.0.1 into /home/atlantis/.cache/terragrunt/modules/1-d3ZTASqfksjn_orsKPDUGD7ks/lPrDQ0wT1dtsjgNOnzZoq0oCtyE prefix=[/home/atlantis/.atlantis/repos/ACME/terraform/838/provider_aws_dev__global_iam_roles_sre-role/provider/aws/security/_global/iam/identity_providers/X] 
╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider
│ hashicorp/aws: host registry.terraform.io rejected the given authentication
│ credentials

Another error:

time=2024-04-18T22:43:57Z level=info msg=Terragrunt Cache server is listening on 127.0.0.1:38393
time=2024-04-18T22:43:57Z level=info msg=Start Terragrunt Cache server
time=2024-04-18T22:43:58Z level=info msg=Downloading Terraform configurations from git::ssh://git@github.com/terraform-aws-modules/terraform-aws-iam.git?ref=v5.30.0 into /home/atlantis/.cache/terragrunt/modules/BBCsdcAtPBWyKxcKxrWGSmPnyMI/Z_nexO2qqCg5RPmJa_gkAX4ynAY

Error: Could not retrieve providers for locking

Terraform failed to fetch the requested providers for cache_provider in order
to calculate their checksums: some providers could not be installed:
- registry.terraform.io/hashicorp/aws: host registry.terraform.io rejected
the given authentication credentials.

And we have the following settings:

TERRAGRUNT_DOWNLOAD="$HOME/.cache/terragrunt/modules"
TERRAGRUNT_FETCH_DEPENDENCY_OUTPUT_FROM_STATE="true"
TERRAGRUNT_PROVIDER_CACHE=1
TERRAGRUNT_NON_INTERACTIVE="true"
TERRAGRUNT_INCLUDE_EXTERNAL_DEPENDENCIES="true"

Let me know if you want me to open an issue for this.

Thanks!

@levkohimins
Copy link
Contributor

levkohimins commented Apr 19, 2024

Hi @amontalban, thanks for the detailed explanation.
Terragrunt Provider Cache is concurrency safe. Based on your log, I see an authentication issue.

rejected the given authentication credentials

Please create a new issue and indicate there the terraform version, your CLI Configuration, and also check if you are using any credentials. Thanks.

@amontalban
Copy link

Hi @amontalban, thanks for the detailed explanation. Terragrunt Provider Cache is concurrency safe. Based on your log, I see an authentication issue.

rejected the given authentication credentials

Please create a new issue and indicate there the terraform version, your CLI Configuration, and also check if you are using any credentials. Thanks.

Thanks I will open an issue then.

Regarding the Terragrunt Provider Cache being concurrency safe: I understand that it is when used within a single terragrunt process, like a terragrunt run-all plan/apply or terragrunt plan/apply, but what happens if I have multiple terragrunt processes using the same directory at the same time (this is what Atlantis does in the background)?

Thanks!

@levkohimins
Copy link
Contributor

By safe concurrency I meant multiple Terragrunt processes running at the same time.

@tuananh
Copy link

tuananh commented May 9, 2024

@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods?

@levkohimins
Copy link
Contributor

@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods?

You can specify a different cache directory with --terragrunt-provider-cache-dir

@tuananh
Copy link

tuananh commented May 9, 2024

@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods?

You can specify a different cache directory with --terragrunt-provider-cache-dir

Does that mean that if I do that, I will have problems? Should each job have its own cache?

@levkohimins
Copy link
Contributor

@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods?

You can specify a different cache directory with --terragrunt-provider-cache-dir

Does that mean that if I do that, I will have problems? Should each job have its own cache?

The Terragrunt Provider Cache is concurrency safe, so you can run multiple Terragrunt processes with one shared cache directory. The only requirement is that the file system must support File locking.
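
For example, pointing several pods or jobs at one mounted cache directory could look roughly like this (the mount path is just an example):

# Every pod/job enables the provider cache and shares the same cache directory on the mounted volume.
terragrunt run-all plan --terragrunt-provider-cache --terragrunt-provider-cache-dir /mnt/shared/terragrunt/provider-cache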

@tuananh
Copy link

tuananh commented May 10, 2024

If anyone, like me, is looking to use this with AWS EFS, it should work, since EFS supports flock.

@RaagithaGummadi
Copy link

Hi @brikis98 @levkohimins,
From TG 0.55.19 and 0.55.20 up to the latest version, we are having trouble in our Terragrunt execution environment while trying to download the Terraform source URLs.

While downloading the Terraform source URLs, https:// is getting replaced by file:/// and the workflow fails to download the module zips.

It was working fine up to TG 0.55.13. Because of this issue, we are not able to use any of the recently delivered features. Can you please look into this as a priority?

@tomaaron
Copy link

Hey there! I have a question regarding how to handle multi platform with lock files in order to reduce disk & bandwidth usage? It seems to me that all the caching functionality only works for your own platform.

@levkohimins
Copy link
Contributor

levkohimins commented May 24, 2024

Hi @RaagithaGummadi, this issue is not related to this subject, if the issue still exists please let me know there #3141

@levkohimins
Copy link
Contributor

levkohimins commented May 24, 2024

Hey there! I have a question regarding how to handle multi platform with lock files in order to reduce disk & bandwidth usage? It seems to me that all the caching functionality only works for your own platform.

Hi @tomaaron, Could you please describe in detail how you create lock files for multiple platforms in your workflow when you do not use Terragrunt Provider Cache feature?

@tomaaron
Copy link

Hey there! I have a question regarding how to handle multi platform with lock files in order to reduce disk & bandwidth usage? It seems to me that all the caching functionality only works for your own platform.

Hi @tomaaron, Could you please describe in detail how you create lock files for multiple platforms in your workflow when you do not use Terragrunt Provider Cache feature?

That's actually what I'm trying to figure out. So far I have unsuccessfully tried the following:

terragrunt run-all providers lock -platform=linux_amd64 -platform=darwin_arm64 --terragrunt-provider-cache

But this seems to download the providers over and over again.

@levkohimins
Copy link
Contributor

Hi @tomaaron, Could you please describe in detail how you create lock files for multiple platforms in your workflow when you do not use Terragrunt Provider Cache feature?

That's actually what I'm trying to figure out. So far I have unsuccessfully tried the following:

terragrunt run-all providers lock -platform=linux_amd64 -platform=darwin_arm64 --terragrunt-provider-cache

But this seems to download the providers over and over again.

Yeah it won't work. I'll look into what we can do to make this work through the Terragrunt Provider Cache.
