ECS container fails to retrieve secrets from AWS Secrets Manager due to hop limit #610
Comments
Not familiar with the hop limit, I will have to do some reading up. Do you believe there is some issue with the IP protocol when running in ECS tasks?
Possibly a similar issue with an older version of botocore: boto/botocore#1892
Related issue: boto/botocore#1897
I believe it was a result of moving to IMDSv2 here: #552. You can read a bit about the hop limit here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/instance-metadata-v2-how-it-works.html, specifically the last paragraph on the page. I don't think any changes need to be made to paws, or any action taken. I raised this issue in the hope that someone else taking the same journey would find it and get a quicker resolution. I went down the red herring of thinking I needed to set up a Secrets Manager endpoint. :)
We have also encountered this issue with paws, and with curl under the hood: the request would just hang with no response, not respecting the set timeout. We spent a lot of time digging in to understand what was wrong. For curl you can work around it by setting --max-time, but without that, even setting a timeout with --connect-timeout is not respected. I guess the issue is that it is not a connection timeout: the request gets an intermediate response and is just waiting for the final response due to this PUT response hop limit. As mentioned above, this is related to IMDSv2. paws tries IMDSv2 first to get EC2 metadata credentials and then falls back to IMDSv1 if that doesn't work. However, the issue here is that paws gets stuck on IMDSv2 due to the above and is not able to fall back; the 1-second timeout set by paws is not respected. The AWS CLI does not have the same issue, it cuts the connection if there is no response within a few seconds. So a possible improvement to paws would be to extend the calls to the EC2 metadata endpoints with --max-time or a similar option so that they correctly fall back to IMDSv1. I checked the httr package and found no immediate candidate for such a property, even though curl supports it. @stuart-storypark's solution works anyhow: setting the metadata properties for EC2 instances.
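To make the connect-timeout vs. total-timeout difference concrete, here is a small illustrative httr snippet, a sketch only, using httpbin's /delay endpoint as a stand-in for the stalled metadata response:

```r
library(httr)

# The TCP connection to httpbin is accepted immediately, so a connect timeout
# of 1 second never fires; the request then waits ~10 seconds for the delayed
# body, which is analogous to the hanging IMDSv2 call.
slow <- VERB("GET", "http://httpbin.org/delay/10", config(connecttimeout = 1))

# A total timeout (the httr/curl analogue of `curl --max-time`) aborts the
# whole request after ~1 second, regardless of whether the connection succeeded.
fast <- try(
  VERB("GET", "http://httpbin.org/delay/10", config(timeout_ms = 1000)),
  silent = TRUE
)
```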
@joakibo thanks for the extra insight. This might be something we can fix. I don't have an environment to fully test this out yet. If anyone is happy to work with me to test this I would be more than grateful, or even to provide their Dockerfile so I can test it in the paws AWS account. Correct me if I am wrong, but paws is stuck at this part when it attempts to read the instance metadata? https://github.com/paws-r/paws/blob/main/paws.common/R/config.R#L173-L193
@DyfanJones Correct, and as mentioned, using curl in the terminal yielded the same results:
It seems like paws gets stuck in the first scenario there. The issue is also explained at https://stackoverflow.com/questions/62324816/ecs-container-hangs-when-calling-ssm-api-endpoint/62326320#62326320, with the same resolution as above. When I tested this interactively with paws, I hit Ctrl+C and it then continued, fell back to IMDSv1 and eventually worked. But in a context where no one can hit Ctrl+C, it would hang indefinitely. The AWS CLI worked fine in the same context. So, just having some way of getting paws to respect the timeout should help. An ECS container on EC2 is possibly sufficient for reproducing the issue, e.g. with rocker:r-ver or something like that.
After a little more digging it looks like there are many different curl timeout options:

```r
library(httr)

resp1 = VERB("GET", "http://httpbin.org", config(timeout_ms = 1000))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1002 milliseconds with 0 bytes received

resp2 = VERB("GET", "http://httpbin.org", config(timeout = 1))
#> Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [httpbin.org] Operation timed out after 1001 milliseconds with 0 bytes received

resp3 = VERB("GET", "http://httpbin.org", config(connecttimeout = 1))
```

Created on 2023-06-26 with reprex v2.0.2

We are currently using connect-timeout similar to the above. I wonder if we could resolve this by adding the standard timeout curl option when calling IMDSv2.
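Roughly, a sketch of what that might look like for the IMDSv2 token request. This is not the actual paws implementation; the endpoint and header are the standard IMDSv2 ones, and the fallback logic is only illustrative:

```r
library(httr)

# Request an IMDSv2 token with both a connect timeout and a total timeout,
# so the call fails fast instead of hanging when the PUT response hop limit is 1.
token_resp <- try(
  PUT(
    "http://169.254.169.254/latest/api/token",
    add_headers(`X-aws-ec2-metadata-token-ttl-seconds` = "21600"),
    config(connecttimeout = 1, timeout_ms = 1000)
  ),
  silent = TRUE
)

if (inherits(token_resp, "try-error") || http_error(token_resp)) {
  # IMDSv2 unreachable within the time budget: fall back to IMDSv1,
  # i.e. call the metadata paths without the token header.
  message("IMDSv2 token request failed or timed out; falling back to IMDSv1")
}
```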
Thanks for some code samples, I'll do a test on my side since I have access to the problematic setup and come back to you if any of those works.
@joakibo I created a branch to try and address this; it allows connect_timeout and timeout to be used. Are you able to test it? I would be super grateful if you can. `remotes::install_github("DyfanJones/paws", ref="timeout")`
Ah, thanks for this. I can switch from timeout to timeout_ms. I will update the branch accordingly.
So given this,
Ah, I will revert the last change :P hahaha my bad for being eager :P
Yes, you are right. From looking at the libcurl documentation, timeout is this: https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html
Good stuff 👍
@joakibo if possible, are you able to test the branch? `remotes::install_github("DyfanJones/paws", ref="timeout")` I would be super grateful if you are able to :)
Mini test script:

```r
remotes::install_github("DyfanJones/paws", ref="timeout")

s3 <- paws::s3()
s3$list_buckets()
```

Assuming the ECS task has S3 list-buckets permissions :) If this hangs please let me know :D
Tried now but got this:

Is it not public, possibly? EDIT: Tried to set up a PAT but that doesn't seem to work. The problem is that this runs in a context where I don't have SSH etc. set up. But access is the issue anyway. Will see if I can find a way.
Ah, my bad, I didn't add the package directory. Try this instead 😛 `remotes::install_github("DyfanJones/paws/paws.common", ref="timeout")`
@DyfanJones Seems like that worked. Tested this first but it hung: Then after having done
This is great news that it is working :) I will merge the PR for the next paws.common release.
Thanks for all the testing and the extra insight, this issue wouldn't have been resolved without your help.
Sure, no problems, happy to assist 👍
I am running an R script in a Docker container on an ECS instance, and I am using the paws.security.identity package to retrieve secrets from AWS Secrets Manager. However, the container is failing to retrieve secrets from Secrets Manager, and the connection seems to hang indefinitely.
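For reference, a minimal sketch of the kind of call that hangs in this setup (the secret name is made up for illustration):

```r
library(paws.security.identity)

# Creating the client and making the first call resolves credentials via the
# EC2 instance metadata service, which is where the request can stall when
# the metadata hop limit is 1.
sm <- secretsmanager()

# Hypothetical secret name, purely for illustration.
secret <- sm$get_secret_value(SecretId = "my-app/db-password")
cat(secret$SecretString, "\n")
```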
After investigating the issue, I discovered that the hop limit setting on the ECS instance was causing requests to the instance metadata endpoint to fail. Specifically, the hop limit was set to 1, which was preventing the R script from successfully retrieving secrets from Secrets Manager.
To resolve the issue, I increased the hop limit from 1 to 2, and the R script was then able to retrieve secrets from Secrets Manager. It appears that the issue was related to network connectivity and routing, and increasing the hop limit resolved it.
I am submitting this issue (which can be closed immediately) in the hope that the solution helps someone else.
Steps to Reproduce:
Run an R script in a Docker container on an ECS instance.
Attempt to retrieve secrets from AWS Secrets Manager using the paws.security.identity package.
Observe that the connection appears to hang indefinitely, and the container fails to retrieve secrets from Secrets Manager.
Expected Result:
The R script should be able to successfully retrieve secrets from AWS Secrets Manager using the paws.security.identity package.
Actual Result:
The R script fails to retrieve secrets from Secrets Manager, and the connection appears to hang indefinitely.
Solution:
Increase the hop limit setting on the ECS instance to 2. This will allow requests to the metadata endpoint to succeed, and should resolve the issue with retrieving secrets from AWS Secrets Manager in a Docker container running on an ECS instance.
I am using Terraform to provision these resources; I made the following change to my Terraform manifest:
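A sketch of that kind of change follows; the resource name is hypothetical and assumes the container host is defined as an aws_instance (a launch template takes the same metadata_options block):

```hcl
resource "aws_instance" "ecs_host" {
  # ... existing instance configuration ...

  metadata_options {
    http_endpoint = "enabled"

    # Default is 1. Requests from inside a container traverse an extra network
    # hop, so the IMDSv2 PUT response never makes it back unless this is raised.
    http_put_response_hop_limit = 2
  }
}
```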