fix: increase hop limit to access AWS instance metadata endpoint #1382
Conversation
Allow containers to access certain AWS instance metadata endpoints. Fix #1226. Signed-off-by: Simão Reis <sreis@opstrace.com>
Need to check how we can apply the fix during an upgrade.
This is beautiful @sreis. Thank you for this great debugging effort. This topic/issue has it all, that's so nice.
Based on this it is indeed pretty much definite that aws/aws-sdk-go#2972 is about the issue at hand. But I was a bit puzzled: why do session credential requests sometimes work and sometimes not? And when they work, why would they take a lot of time? This is puzzling in view of what the hop limit does. So, a word on what the hop limit is. The official documentation
(https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html) describes it, roughly, as limiting how far instance metadata requests can travel. Based on my understanding of networks and HTTP this didn't really make sense: HTTP does not know hops, and the lower-level network does not know "requests". Turns out this sentence in the docs is bogus. I found https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/, which explains that the hop limit is enforced via the TTL of the IP packets that carry the metadata response.
That is, the system knows exactly which IP packet(s) (it's likely just one packet) corresponding to the HTTP response contain the sensitive information, and these IP packets have a special TTL hop limit configured.
That is, if a network path involves 2 hops then the data does not arrive. If a network path involves 1 hop then the data is expected to arrive immediately. Macroscopically, that matches what we observed: requests stalling or timing out.
The people in aws/aws-sdk-go#2972 observed their requests succeed, they just got "slow". This non-deterministic nature of things makes me think that the following conditions are all met:
That allows for the following scenarios to happen:
Thanks for adding this. I was also super curious about which change in our system caused this issue to surface, especially given that this new ec2 metadata firewall thingy was introduced in 2019. About this hypothesis:
https://github.com/aws/aws-sdk-go/blob/v1.38.68/aws/ec2metadata/service.go#L82

```go
func NewClient(cfg aws.Config, handlers request.Handlers, endpoint, signingRegion string, opts ...func(*client.Client)) *EC2Metadata {
	if !aws.BoolValue(cfg.EC2MetadataDisableTimeoutOverride) && httpClientZero(cfg.HTTPClient) {
		// If the http client is unmodified and this feature is not disabled
		// set custom timeouts for EC2Metadata requests.
		cfg.HTTPClient = &http.Client{
			// use a shorter timeout than default because the metadata
			// service is local if it is running, and to fail faster
			// if not running on an ec2 instance.
			Timeout: 1 * time.Second,
		}
		// max number of retries on the client operation
		cfg.MaxRetries = aws.Int(2)
	}
```

Super tight timeout criterion: makes sense. So, how can the long delays happen? I suppose if an outer caller retries the higher-level operation many more times (i.e. a retry loop around the inner two-attempt, two-second retry loop that the code above shows). Well. Seen enough. This is pure joy, and we don't need to understand every detail if, after all, the symptoms we saw go away.
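To make that compounding concrete, here is a purely illustrative sketch (none of this is code from this repo, Loki, or the AWS SDK; the outer retry policy and backoff are made up) of how an inner two-attempt, 1-second-timeout loop can add up to long, variable delays once it is wrapped in higher-level retries:

```go
// Purely illustrative: how an outer retry loop around an operation that
// internally makes two 1-second metadata attempts can turn a "fast fail"
// into a long, variable delay.
package main

import (
	"errors"
	"fmt"
	"time"
)

// fetchCredentials stands in for the higher-level operation; internally the
// SDK's ec2metadata client would try twice with a 1-second timeout each.
func fetchCredentials() error {
	time.Sleep(2 * time.Second) // two attempts x 1s timeout, both lost to the firewall
	return errors.New("metadata request timed out")
}

func main() {
	start := time.Now()
	var err error
	for attempt := 1; attempt <= 5; attempt++ { // hypothetical outer retry policy
		if err = fetchCredentials(); err == nil {
			break
		}
		time.Sleep(time.Duration(attempt) * time.Second) // made-up linear backoff
	}
	fmt.Printf("gave up after %s: %v\n", time.Since(start).Round(time.Second), err)
}
```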
@jgehrcke Something I found on this quest is that Loki overrides the AWS Go HTTP client, meaning it overrides the client you linked to earlier. I still don't know why it stalls for 120s though.
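For illustration (a sketch, not Loki's actual code): supplying a custom `http.Client` on `aws.Config` makes the `httpClientZero(cfg.HTTPClient)` check in the snippet above evaluate to false, so the ec2metadata client keeps that client and its timeout instead of getting the SDK's tight 1-second override. The 90-second value below is made up.

```go
// Sketch: a custom HTTP client on aws.Config bypasses the EC2Metadata
// 1-second timeout override shown above.
package main

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	cfg := aws.NewConfig().WithHTTPClient(&http.Client{
		Timeout: 90 * time.Second, // hypothetical; a zero value would mean no timeout at all
	})
	sess := session.Must(session.NewSession(cfg))

	// This metadata client now inherits the custom HTTP client above,
	// not the SDK's 1-second EC2Metadata default.
	md := ec2metadata.New(sess)
	_ = md
}
```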
OK, now I understand where the time is spent when things are 'slow' or ultimately time out.
but
The Go AWS SDK implements a fallback from IMDSv2 to IMDSv1 if it fails to obtain the secret token required for IMDSv2. In aws/aws-sdk-go#2972 (comment) we can see how long this can take.
That's 20 seconds lost against the firewall. 🔥 And of course in practice this is a distribution around 20 seconds: enough room for race conditions and, after all, the non-deterministic fallout we had seen. The change in this pull request will make the IMDSv2 way of doing things succeed immediately, getting rid of the spurious and dubious 20-second delays.
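For concreteness, a minimal sketch of the knob in question: `HttpPutResponseHopLimit` in the instance metadata options, raised from the default of 1 to 2 so the IMDSv2 token response survives the extra hop into the container. This PR presumably applies it through the cluster's launch configuration rather than a direct SDK call; the AMI ID and instance type below are placeholders.

```go
// Sketch: launching an EC2 instance with a hop limit of 2 in its
// instance metadata options. Not the actual change in this PR.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder
		InstanceType: aws.String("t3.xlarge"),             // placeholder
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		MetadataOptions: &ec2.InstanceMetadataOptionsRequest{
			HttpEndpoint:            aws.String("enabled"),
			HttpPutResponseHopLimit: aws.Int64(2), // default is 1
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Instances[0].InstanceId))
}
```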
I hope that this only overrides the HTTP client used for S3 interaction, and it looks like it does:
The HTTP client used for interacting with the metadata service is probably still the one in the Go SDK.
Of course a super interesting question. If changing the launch configuration for the auto-scaling group on EC2 requires spawning new EC2 instances then this is of course not so cool (given our current ambitious upgrade strategy, haha :D). It looks like there is at least one approach that is promising for 'fixing' existing EC2 instances, based on this comment:
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyInstanceMetadataOptions.html
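A minimal sketch of what "fixing" a running instance could look like with the Go SDK (the instance ID is a placeholder; whether the upgrade path would actually use the SDK, the CLI, or something else is exactly the open question here):

```go
// Sketch: raising the hop limit on an already-running instance via
// ModifyInstanceMetadataOptions, so existing nodes can be fixed without
// being replaced.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	_, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
		InstanceId:              aws.String("i-0123456789abcdef0"), // placeholder
		HttpPutResponseHopLimit: aws.Int64(2),                      // allow one extra hop for containers
	})
	if err != nil {
		log.Fatal(err)
	}
}
```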
Back in #1226 (comment), we saw that the S3 bucket list request was stalling for 2+ minutes. It turns out it was one of the requests, an `ec2metadata/GetToken` that makes a `PUT /latest/api/token HTTP/1.1`, that was hanging (a sketch of that request is included after the notes below). This led me to find aws/aws-sdk-go#2972, which mentions that containers running on EC2 instances have issues accessing the AWS instance metadata endpoint due to a network knob not being configured properly. I've run a build in the draft PR #1351 with this fix applied and the stalling AWS instance metadata requests seem to be fixed.
A few notes from that build:
- The `default` tenant would time out because the `startupProbe` we added could only check one tenant.

It's most likely we started seeing an increased error rate in #1226 due to a recent Loki bump that includes grafana/loki#4188, which bumped the aws-sdk-go from 1.38.60 to 1.38.68. This particular PR https://github.com/aws/aws-sdk-go/pull/3962/files might have led to this timeout.
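For reference, here is a minimal sketch (not the SDK's code) of the request behind `ec2metadata/GetToken`: the IMDSv2 token `PUT` mentioned above. The 2-second timeout is just for illustration; from inside a container with the default hop limit of 1, the response packets never make it back and the call runs into its timeout.

```go
// Sketch of the IMDSv2 token request that ec2metadata/GetToken performs:
// a PUT to the instance metadata endpoint asking for a session token.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second} // illustrative timeout

	req, err := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for a token valid for 6 hours (21600 seconds).
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err) // inside a container with hop limit 1, this is where it fails
	}
	defer resp.Body.Close()

	token, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("got IMDSv2 token:", string(token))
}
```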
Fix #1226