
Terragrunt apply fails (could not find AWS credentials) #2730

Open
skkc2 opened this issue Sep 22, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@skkc2

skkc2 commented Sep 22, 2023

Hi All,

We use Terraform and Terragrunt to manage AWS infrastructure. When I run Terragrunt locally it works fine and there are no issues deploying infrastructure, but it errors out when deploying through Jenkins, saying no AWS credentials were found. It only happens in some of the folders; all the other services in other folders deploy successfully. It was working fine until a week ago, but all of a sudden there is an issue. Not sure what went wrong, any suggestions please?

Previously we used to save .terraform.lock.hcl in SCM along with terragrunt.hcl, but we removed it in some folders and there was an inconsistency, so we've reinitialised and saved .terraform.lock.hcl in those folders. Is that causing issues?

Exact Errors

time=2023-09-22T11:41:56Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/service-discovery-services has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors prefix=[/home/ec2-user/workspace/CI-CD Infrastructure/nft/service-discovery-services] 
time=2023-09-22T11:41:59Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors prefix=[/home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource] 
time=2023-09-22T11:42:03Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.

locals {
  account_vars      = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars       = read_terragrunt_config(find_in_parent_folders("region.hcl"))
  environment_vars  = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
  account_name      = local.account_vars.locals.account_name
  account_name_abbr = local.account_vars.locals.account_name_abbr
  account_id        = local.account_vars.locals.aws_account_id
  aws_region        = local.region_vars.locals.aws_region
  environment_name  = local.environment_vars.locals.environment
  default_tags = {
    Name        = local.environment_name
    Environment = local.environment_name
    Terraform   = true
  }
}

# Generate an AWS provider block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"
  # version             = "= 3.30.0"
  # Only these AWS Account IDs may be operated on by this template
  allowed_account_ids = ["${local.account_id}"]

  # default_tags {
  #   tags = {
  #     Name        = "${local.environment_name}"
  #     Environment = "${local.environment_name}"
  #     Terraform   = true
  #   }
  # }
}
EOF
}

# Configure Terragrunt to automatically store tfstate files in an S3 bucket
remote_state {
  backend = "s3"
  config = {
    encrypt = true
    bucket  = "tfstate-apps-${local.account_id}-${local.aws_region}"
    key     = "${local.environment_name}/${path_relative_to_include()}/terraform.tfstate"
    region  = local.aws_region
    # dynamodb_table = "terraform-locks"
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
}

inputs = merge(
  local.account_vars.locals,
  local.region_vars.locals,
  local.environment_vars.locals,
)

Versions

  • Terragrunt version: v0.38.7
  • Terraform version:
  • Environment details (Ubuntu 20.04, Windows 10, etc.):

Any suggestions please?

@skkc2 skkc2 added the bug Something isn't working label Sep 22, 2023
@denis256
Member

Hello,
I wanted to confirm: was the Terragrunt version updated, or is it the same as before?
I suspect that the AWS credentials were removed from the environment variables used in the Jenkins job.
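
A quick way to confirm what the Jenkins job actually sees is to run a check step before Terragrunt (this assumes the AWS CLI is available on the agent; the second command prints only variable names, not values):

# errors with a credential-chain message if nothing resolves
aws sts get-caller-identity
# list which AWS_* variables are set, without exposing secrets
env | grep -oE '^AWS_[A-Z_]+' || echo "no AWS_* environment variables set"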

@skkc2 skkc2 closed this as completed Sep 26, 2023
@skkc2 skkc2 reopened this Sep 26, 2023
@skkc2
Author

skkc2 commented Sep 26, 2023

Hi denis256,

Terragrunt and Terraform remained the same version on local machines and Jenkins.

I don't think the AWS credentials were removed; if they had been, it shouldn't execute any modules, but some modules are being executed.

@mimadrone

I've also been encountering this. The Jenkins job does a run-all init, validate, plan on many directories in parallel, and some of them (not the same ones, and not necessarily at the same point in the process) error out saying there are no credentials. I suspect AWS's behavior has changed (rate limiting, maybe?) because the Terragrunt version hasn't. Trying to see if auto-retry for this error helps now.

@mimadrone

Update: auto-retry tuning is dicey. I got it to work sometimes by also setting the number of retries to 5, but occasionally that wasn't enough, so I also increased the delay, and then it started failing the job after only one error. So I haven't been able to come up with a consistent method to avoid this.
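
For reference, a minimal sketch of that tuning in the root terragrunt.hcl; the attempt count and delay below are illustrative values, not known-good numbers:

retry_max_attempts       = 5
retry_sleep_interval_sec = 30

# Note: setting retryable_errors replaces Terragrunt's default list, so the
# default patterns have to be repeated alongside any custom ones.
retryable_errors = [
  "(?s).*NoCredentialProviders: no valid providers in chain.*",
]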

@skkc2
Author

skkc2 commented Dec 1, 2023

Update: auto-retry tuning is dicey. I got it to work sometimes by also setting the number of retries to 5, but occasionally that wasn't enough, so I also increased the delay, and then it started failing the job after only one error. So I haven't been able to come up with a consistent method to avoid this.

What version of Terraform and Terragrunt are you using?
A recent version of Terraform seems to acknowledge this issue and they've rolled out an update; I tried with the latest version as well, still the same.
The only way I could reduce the number of AWS credentials errors is by executing the shared directory first (about 10 services) and then the applications directory (which has 10+ folders, each with multiple services).
(screenshot attached)

@skkc2 skkc2 closed this as completed Dec 1, 2023
@skkc2 skkc2 reopened this Dec 1, 2023
@mimadrone

mimadrone commented Dec 1, 2023

Changing auto-retry doesn't seem to work, which is probably because the error Terragrunt surfaces is its own and not caught from elsewhere? I have:

retry_sleep_interval_sec = 10
retryable_errors = [ 
  # Default list
  "(?s).*Failed to load state.*tcp.*timeout.*",
  "(?s).*Failed to load backend.*TLS handshake timeout.*",
  "(?s).*Creating metric alarm failed.*request to update this alarm is in progress.*",
  "(?s).*Error installing provider.*TLS handshake timeout.*",
  "(?s).*Error configuring the backend.*TLS handshake timeout.*",
  "(?s).*Error installing provider.*tcp.*timeout.*",
  "(?s).*Error installing provider.*tcp.*connection reset by peer.*",
  "NoSuchBucket: The specified bucket does not exist",
  "(?s).*Error creating SSM parameter: TooManyUpdates:.*",
  "(?s).*app.terraform.io.*: 429 Too Many Requests.*",
  "(?s).*ssh_exchange_identification.*Connection closed by remote host.*",
  "(?s).*Client\\.Timeout exceeded while awaiting headers.*",
  "(?s).*Could not download module.*The requested URL returned error: 429.*",
  # Tests hit erroneous NoCredentialProviders errors because of some kind of rate limiting AWS-side
  "(?s).*NoCredentialProviders: no valid providers in chain.*",
]

but it doesn't retry at all.

@mimadrone

mimadrone commented Dec 4, 2023

Contacted AWS support, who told me that they don't publish the throttling/rate limiting numbers because "they're internal" (so, they don't publish the numbers because they don't publish the numbers?) and that Terragrunt should implement a retry with exponential backoff.

The AWS support person indicated that the limit might change at any point, which I suspect means they did recently change it. Experimentally: we've got about 150 modules and we hit a few denials each time; setting TERRAGRUNT_PARALLELISM to 100 seemed to prevent the failures, though I haven't got many runs to prove it. UPDATE: no, we still see failures at 100. I think the limit must be under 70.
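
For anyone trying the same thing, the cap can be set either through the environment or per invocation; 50 here is an illustrative value, not a known-safe limit:

# via the environment
export TERRAGRUNT_PARALLELISM=50
terragrunt run-all plan

# or per invocation
terragrunt run-all plan --terragrunt-parallelism 50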

@gitsstewart

@denis256 given the latest information, is there anything that should be looked at from this point?

@denis256
Member

I will do more tests, but so far I have been thinking about:

  • retries when AWS API errors happen
  • automatically adjusting TERRAGRUNT_PARALLELISM (if not configured) based on the number of modules

@denis256 denis256 self-assigned this Dec 18, 2023
@skkc2 skkc2 closed this as completed Dec 19, 2023
@skkc2 skkc2 reopened this Dec 19, 2023
@skkc2
Author

skkc2 commented Dec 19, 2023

I will do more tests, but so far I have been thinking about:

* retries when AWS API errors happen

* automatically adjusting `TERRAGRUNT_PARALLELISM` (if not configured) based on the number of modules

It would be a great help, denis; we've been facing this issue for a while.

@denis256
Member

Hi,
I wanted to check if the issue still appears after upgrading to https://github.com/gruntwork-io/terragrunt/releases/tag/v0.54.13

@skkc2
Author

skkc2 commented Jan 16, 2024

Hi @denis256
Unfortunately no, the issue still remains.
(screenshot attached)

@denis256
Member

It is still difficult on my side to reproduce this issue. I tried to set something up in https://github.com/denis256/terragrunt-tests/tree/master/aws-rate-limit but am still not getting the same error as reported.

It would be helpful if you could share an example repository where this error happens.

@skkc2
Author

skkc2 commented Jan 18, 2024

Hi denis,
Sorry, I don't have any samples to share due to restrictions. I've looked at your sample repo; I think having multiple modules like rate1, each with a similar or somewhat different main.tf, would generate this issue. I have ~145 modules.
Thanks
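
A rough way to stress-test that in the sample repo might be to stamp out many copies of the existing module before a run-all; the directory names and count below are illustrative assumptions, not something taken from the repo:

# clone the sample module ~145 times so run-all hits the AWS credential
# chain from many terraform processes at once
cd aws-rate-limit
for i in $(seq 2 145); do
  cp -r rate1 "rate$i"
done
terragrunt run-all plan --terragrunt-non-interactive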

@skkc2
Author

skkc2 commented Feb 6, 2024

@denis256

I got this working: my issue was resolved by updating Terragrunt and also increasing the RAM. It would be nice if it highlighted the memory error and also limited Terragrunt/Terraform memory usage.

Articles/blogs online about limiting RAM usage show that quite a few people experience this issue because of module and provider sizes. The problem is not with the module API calls but with the AWS provider processes, which are heavy because they support a lot of AWS services at once. In our case, environments like NFT and others have a lot of resources to deploy, which require a lot of provider versions, and doing all of that at once requires a good bit of RAM, so 8 GB would crash Terraform.
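
For what it's worth, a quick check on the Jenkins agent can confirm whether the kernel OOM killer was involved when a run dies with an unclear error (this assumes a Linux agent where dmesg is readable):

# show current memory headroom on the agent
free -h
# look for recent OOM-killer activity
sudo dmesg -T | grep -iE 'out of memory|oom-killer' | tail -n 20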

Any plans in the future to throttle it without breaking it?
