ECS container fails to retrieve secrets from AWS Secret Manager due to hop limit #610

stuart-storypark · 2023-04-27T01:41:52Z

I am running an R script in a Docker container on an ECS instance, and I am using the paws.security.identity package to retrieve secrets from AWS Secret Manager. However, the container is failing to retrieve secrets from Secret Manager, and the connection seems to hang indefinitely.

After investigating the issue, I discovered that the hop limit setting on the ECS instance was causing requests to the metadata endpoint to fail. Specifically, the hop limit was set to 1, which was preventing the R script from successfully retrieving secrets from Secret Manager.

To resolve the issue, I increased the hop limit from 1 to 2, and the R script was able to successfully retrieve secrets from Secret Manager. The cause appears to be the extra network hop between the container and the instance metadata endpoint, which a hop limit of 1 does not allow for.

I am submitting this issue (which can be closed immediately) in the hope the solution helps someone else.

Steps to Reproduce:

1. Run an R script in a Docker container on an ECS instance.
2. Attempt to retrieve secrets from AWS Secret Manager using the paws.security.identity package.
3. Observe that the connection appears to hang indefinitely, and the container fails to retrieve secrets from Secret Manager.

Expected Result:
The R script should be able to successfully retrieve secrets from AWS Secret Manager using the paws.security.identity package.

Actual Result:
The R script fails to retrieve secrets from Secret Manager, and the connection appears to hang indefinitely.

Solution:
Increase the hop limit setting on the ECS instance to 2. This will allow requests to the metadata endpoint to succeed, and should resolve the issue with retrieving secrets from AWS Secret Manager in a Docker container running on an ECS instance.

I am using Terraform to provision these resources, and I made the following change to my Terraform manifest:

resource "aws_launch_template" "asg_launch_template" {
  # <snip>
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }
}

DyfanJones · 2023-04-27T17:57:44Z

Not familiar with hop limit, I will have to do some reading up. Do you believe there is some issue with the IP protocol when in ECS tasks?

DyfanJones · 2023-04-27T18:04:24Z

Possible similar issue with older version of botocore: boto/botocore#1892

DyfanJones · 2023-04-27T18:11:24Z

Related issue: boto/botocore#1897

stuart-storypark · 2023-04-27T20:38:40Z

I believe it was a result of moving to IMDSv2 here: #552

You can read a bit about the hop limit here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/instance-metadata-v2-how-it-works.html. Specifically the last paragraph on the page.

I don't think any changes need to be made to paws, or any action taken. I was raising this issue in the hope that someone else taking the same journey I did would find this issue and get a quicker resolution. I went down a red herring of thinking I needed to set up a secrets manager endpoint. :)
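
As a rough illustration of the hop limit behaviour described on that AWS page, the sketch below treats the response to the token request as a packet whose TTL starts at the configured hop limit and drops by one at each forwarding step on the way to the caller. The function name and the hop counts (0 for a process on the host, 1 for a container behind a Docker bridge) are assumptions made for illustration; this is a simplified model, not code from paws or AWS.

```python
def token_response_arrives(hop_limit: int, forwarding_hops: int) -> bool:
    """Simplified model: the PUT response starts with TTL == hop_limit.

    Each forwarding step (for example a Docker bridge) decrements the TTL
    before passing the packet on; a packet whose TTL reaches 0 is dropped.
    """
    ttl = hop_limit
    for _ in range(forwarding_hops):
        ttl -= 1
        if ttl <= 0:
            return False  # dropped before it reaches the caller
    return True
```

Under these assumptions, token_response_arrives(1, 1) is False (the hang seen from the container), while token_response_arrives(2, 1) is True, which matches the fix of raising the hop limit to 2.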

joakibo · 2023-06-26T16:25:57Z

We have also encountered this issue with paws, and with curl under the hood: the request would just hang with no response, not respecting the set timeout. We spent a lot of time digging in to understand what was wrong.

For curl you could get around it by setting --max-time, but without it, even a timeout set with --connect-timeout would not be respected. I guess the issue is that it's not a connection timeout: the connection is established, and the request is just waiting for a final response that never arrives because of the PUT response hop limit.

As mentioned above, this is related to IMDSv2. paws tries IMDSv2 first to get EC2 metadata credentials and then falls back to IMDSv1 if that doesn't work. However, the issue here is that paws gets stuck on IMDSv2 due to the above and is not able to fall back. The 1-second timeout set by paws is not respected. AWS CLI does not have the same issue; it cuts the connection if there is no response in x seconds.

So a possible improvement to paws here would be to extend the calls to the EC2 metadata endpoints with some --max-time or some other property when calling these such that it will correctly fallback to IMDSv1. Checked httr package, no immediate candidate for such property even though curl supports --max-time.

@stuart-storypark's solution anyhow works, setting the metadata properties for EC2 instances.
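
To make the fallback idea above concrete, here is a minimal Python sketch (not paws code, and not how paws implements it) of an IMDSv2 token request that gives up after a bounded wait and lets the caller continue with IMDSv1. The function names and the base_url parameter are hypothetical; the two header names are the standard IMDSv2 ones. Note that urllib's timeout bounds each blocking socket operation rather than the whole transfer, so it is only an approximation of curl's --max-time.

```python
import urllib.request
from typing import Optional

IMDS_BASE = "http://169.254.169.254"  # instance metadata address used in this thread


def fetch_imds_token(base_url: str = IMDS_BASE, timeout: float = 1.0) -> Optional[str]:
    """Try the IMDSv2 token request; return None rather than hanging or raising."""
    request = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode()
    except OSError:
        # Covers connection errors, HTTP errors and socket timeouts.
        return None


def fetch_metadata(path: str, base_url: str = IMDS_BASE, timeout: float = 1.0) -> str:
    """GET a metadata path with a token if one was obtained, else IMDSv1 style."""
    token = fetch_imds_token(base_url, timeout)
    headers = {"X-aws-ec2-metadata-token": token} if token else {}
    request = urllib.request.Request(base_url + path, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode()
```

A call such as fetch_metadata("/latest/meta-data/iam/security-credentials") would then wait at most roughly the timeout for the token before trying the plain GET. The plain GET only succeeds if IMDSv1 is still allowed on the instance, which is not the case when http_tokens is set to "required".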

DyfanJones · 2023-06-26T16:38:28Z

@joakibo thanks for the extra insight. This might be something we can fix. I don't have an environment to fully test this out yet. If anyone is happy to work with me to test this, I would be more than grateful. Or even provide their Dockerfile so I can test it in the paws AWS account.

Correct me if I am wrong, but paws is stuck at this part when it attempts to get instance metadata?

https://github.com/paws-r/paws/blob/main/paws.common/R/config.R#L173-L193

joakibo · 2023-06-26T16:44:51Z

@DyfanJones Correct, and as mentioned using curl in the terminal yielded same results:

curl -X PUT --connect-timeout 1 http://169.254.169.254/latest/api/token
would just hang, and
curl -X PUT --connect-timeout 1 --max-time 5 http://169.254.169.254/latest/api/token
would quit after five seconds as unsuccessful.

It seems paws gets stuck in the first scenario. The issue is also explained at https://stackoverflow.com/questions/62324816/ecs-container-hangs-when-calling-ssm-api-endpoint/62326320#62326320 with the same resolution as above. When I tested paws interactively, I hit ctrl+c, after which it continued, fell back to IMDSv1 and eventually worked. But in a context where no one can hit ctrl+c, it would hang indefinitely.

AWS CLI worked fine in the same context.

So it just needs some way of respecting the timeout when running this line: https://github.com/paws-r/paws/blob/main/paws.common/R/config.R#L188C7-L188C36

An ECS container on EC2 is possibly sufficient for reproducing the issue, e.g. with rocker/r-ver or something like that.

DyfanJones · 2023-06-26T17:09:15Z

After a little more digging it looks like there are many different curl timeout options

library(httr)
resp1 = VERB("GET", "http://httpbin.org", config(timeout_ms = 1000))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1002 milliseconds with 0 bytes received
resp2 = VERB("GET", "http://httpbin.org", config(timeout = 1))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1001 milliseconds with 0 bytes received
resp3 = VERB("GET", "http://httpbin.org", config(connecttimeout = 1))

Created on 2023-06-26 with reprex v2.0.2

We are currently using connect-timeout, similar to the above. I wonder if we could resolve this by adding the standard timeout curl option when calling IMDSv2.

joakibo · 2023-06-26T17:40:56Z

Thanks for some code samples, I'll do a test on my side since I have access to the problematic setup and come back to you if any of those works.

DyfanJones · 2023-06-26T17:55:42Z

@joakibo I created a branch to try and address this. It allows connect_timeout and timeout to be used. Are you able to test this? I would be super grateful if you can.

remotes::install_github("DyfanJones/paws", ref="timeout")

joakibo · 2023-06-26T17:56:59Z

Results. Of those three only timeout_ms actually resulted in it stopping the request after one second, I had to ctrl+c the others.

DyfanJones · 2023-06-26T17:58:25Z

ah thanks for this. I can switch from timeout to timeout_ms. I will update the branch accordingly.

joakibo · 2023-06-26T17:59:45Z

Hold on a second, let's see, I may have been too quick there. Seems like both timeout and timeout_ms work (which makes sense, they should be the same just on different scales), while connecttimeout certainly does not work.

I put timeout = 1000 there, so makes sense that it didn't kill it. Setting it to 1 and it works.

Blinking cursor of death.

joakibo · 2023-06-26T18:02:58Z

So given this, connecttimeout doesn't kick in since it is able to establish connection but it is supposed to "hop" on further for the put request, and hence just stands there awaiting response.

timeout seems like more general timeouts on receiving the actual end response, same as --max-time for curl.
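
The difference between the two limits can be reproduced locally without IMDS. The following Python sketch is an analogue only: it does not use httr or curl, and the helper names are made up for illustration. It opens a listener that completes TCP handshakes but never answers, so the connection phase succeeds almost immediately and only a separate limit on waiting for the reply ends the request.

```python
import socket
import time


def open_silent_listener() -> socket.socket:
    """Listen on a free local port without ever accepting or answering.

    The kernel still completes TCP handshakes for queued connections, which
    stands in for an endpoint that is reachable but never sends a response.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    return listener


def wait_for_reply(host, port, connect_timeout, reply_timeout):
    """Return (seconds spent connecting, outcome of waiting for a reply)."""
    started = time.monotonic()
    # connect_timeout bounds only the connection phase here.
    conn = socket.create_connection((host, port), timeout=connect_timeout)
    connect_seconds = time.monotonic() - started
    try:
        conn.sendall(b"PUT /latest/api/token HTTP/1.1\r\nHost: local\r\n\r\n")
        # Apply a separate limit for the reply. With settimeout(None) instead,
        # recv() would block indefinitely, mirroring the hang in this thread.
        conn.settimeout(reply_timeout)
        try:
            data = conn.recv(1024)
            return connect_seconds, "got a reply" if data else "connection closed"
        except socket.timeout:
            return connect_seconds, "no reply before reply_timeout"
    finally:
        conn.close()
```

Calling wait_for_reply("127.0.0.1", port, connect_timeout=1.0, reply_timeout=0.3) against the port of open_silent_listener() connects well within the connect limit and then reports that no reply arrived, which is the role timeout (or --max-time) plays for the token request.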

DyfanJones · 2023-06-26T18:04:06Z

ah I will revert the last change :P hahaha my bad for being eager :P

DyfanJones · 2023-06-26T18:05:36Z

Yes you are right, from looking at the libcurl documentation timeout is this: https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html

CURLOPT_TIMEOUT - maximum time the transfer is allowed to complete

joakibo · 2023-06-26T18:09:02Z

Good stuff 👍

DyfanJones · 2023-06-26T18:09:30Z

@joakibo if possible are you able to test the branch?

remotes::install_github("DyfanJones/paws", ref="timeout")

I would be super grateful if you are able to :)

DyfanJones · 2023-06-26T18:19:10Z

Mini test script:

remotes::install_github("DyfanJones/paws", ref="timeout")

s3 <- paws::s3()
s3$list_buckets()

Assuming ecs has s3 list buckets permissions :) If this hangs please let me know :D

joakibo · 2023-06-26T18:21:08Z

Tried now but got this

> remotes::install_github("DyfanJones/paws", ref="timeout")
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  Not Found

  Did you spell the repo owner (`DyfanJones`) and repo name (`paws`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

It's not public possibly?

EDIT: Tried to setup PAT but that doesn't seem to work. Problem is that this runs in a context where I don't have setup SSH etc. But access is anyway the issue. Will see if I can find a way.

DyfanJones · 2023-06-26T18:31:28Z

Ah my bad, I didn't add the package directory. Try this instead 😛

remotes::install_github("DyfanJones/paws/paws.common", ref="timeout")

joakibo · 2023-06-26T18:40:27Z

@DyfanJones Seems like that worked. I tested this first, but it hung:
httr::VERB("PUT", "http://169.254.169.254/latest/api/token", httr::config(connecttimeout = 1))

Then, after running remotes::install_github, I tested paws.security.identity::sts()$get_caller_identity(), which worked. Here it is run with options(paws.log_level = 3):

> paws.security.identity::sts()$get_caller_identity()
INFO [2023-06-26 18:37:23.572]: Unable to locate credentials file
INFO [2023-06-26 18:37:23.580]: Unable to get credentials from config file.
INFO [2023-06-26 18:37:23.584]: Unable to obtain access_key_id, secret_access_key or session_token
INFO [2023-06-26 18:37:23.593]: -> PUT /latest/api/token HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
-> X-aws-ec2-metadata-token-ttl-seconds: 21600
-> Content-Length: 0
->
INFO [2023-06-26 18:37:24.597]: -> GET /latest/meta-data/iam/security-credentials HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
->
INFO [2023-06-26 18:37:24.598]: <- HTTP/1.1 200 OK
INFO [2023-06-26 18:37:24.598]: <- Content-Type: text/plain
INFO [2023-06-26 18:37:24.598]: <- Accept-Ranges: none
INFO [2023-06-26 18:37:24.598]: <- Last-Modified: Mon, 26 Jun 2023 17:50:37 GMT
INFO [2023-06-26 18:37:24.598]: <- Content-Length: 64
INFO [2023-06-26 18:37:24.598]: <- Date: Mon, 26 Jun 2023 18:37:24 GMT
INFO [2023-06-26 18:37:24.599]: <- Server: EC2ws
INFO [2023-06-26 18:37:24.599]: <- Connection: close
INFO [2023-06-26 18:37:24.599]: <-
INFO [2023-06-26 18:37:24.613]: -> PUT /latest/api/token HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
-> X-aws-ec2-metadata-token-ttl-seconds: 21600
-> Content-Length: 0
->
INFO [2023-06-26 18:37:25.618]: -> GET /latest/meta-data/iam/security-credentials/[masked] HTTP/1.1
-> Host: 169.254.169.254
-> User-Agent: libcurl/7.68.0 r-curl/5.0.1 httr/1.4.6
-> Accept-Encoding: deflate, gzip, br
-> Accept: application/json, text/xml, application/xml, */*
->
INFO [2023-06-26 18:37:25.619]: <- HTTP/1.1 200 OK
INFO [2023-06-26 18:37:25.619]: <- Content-Type: text/plain
INFO [2023-06-26 18:37:25.619]: <- Accept-Ranges: none
INFO [2023-06-26 18:37:25.620]: <- Last-Modified: Mon, 26 Jun 2023 17:50:37 GMT
INFO [2023-06-26 18:37:25.620]: <- Content-Length: 1582
INFO [2023-06-26 18:37:25.620]: <- Date: Mon, 26 Jun 2023 18:37:25 GMT
INFO [2023-06-26 18:37:25.620]: <- Server: EC2ws
INFO [2023-06-26 18:37:25.620]: <- Connection: close
INFO [2023-06-26 18:37:25.620]: <-
INFO [2023-06-26 18:37:25.873]: -> POST / HTTP/1.1
.
.
.

DyfanJones · 2023-06-26T18:59:43Z

This is great news, it is working :) I will merge the PR for the next paws.common release.

DyfanJones · 2023-06-26T19:01:08Z

Thanks for all the testing and the extra insight; this issue wouldn't have been resolved without your help.

joakibo · 2023-06-26T19:09:28Z

Sure, no problems, happy to assist 👍

DyfanJones added a commit to DyfanJones/paws that referenced this issue Jun 26, 2023

split timeout and connect_timeout (paws-r#610)

ab5275d

DyfanJones mentioned this issue Jun 26, 2023

split timeout and connect_timeout (#610) #637

Merged

DyfanJones linked a pull request Jun 26, 2023 that will close this issue

split timeout and connect_timeout (#610) #637

Merged

DyfanJones closed this as completed in #637 Jun 26, 2023

DyfanJones added a commit that referenced this issue Jun 26, 2023

split timeout and connect_timeout (#610)

8033cbb

DyfanJones added the bug 🐞 Something isn't working label Mar 5, 2024
