fix: increase hop limit to access AWS instance metadata endpoint #1382
Conversation
Allow containers to access certain AWS instance metadata endpoints. Fix #1226. Signed-off-by: Simão Reis <sreis@opstrace.com>
Need to check how we can apply the fix during an upgrade.
This is beautiful @sreis. Thank you for this great debugging effort. This topic/issue has it all, that's so nice.
Based on this it is indeed pretty much definite that aws/aws-sdk-go#2972 is about the issue at hand. But I was a bit puzzled: why do session credential requests sometimes work and sometimes not? And when they work, why would they take a lot of time? This is puzzling in view of what the hop limit does. So, a word on what the hop limit is. The official documentation
(https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html) describes it, roughly, as limiting how far instance metadata requests can travel. Based on my understanding of networks and HTTP this didn't really make sense: HTTP does not know hops, and the lower-level network does not know "requests". Turns out this sentence in the docs is bogus. I found https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/, which explains that the hop limit is enforced via the TTL of the IP packets that carry the metadata response.
That is, the system knows exactly which IP packet(s) (it's likely just one packet) corresponding to the HTTP response contain the sensitive information, and these IP packets have a special TTL hop limit configured.
That is, if a network path involves 2 hops then the data does not arrive. If a network path involves 1 hop then the data is expected to arrive immediately. Macroscopically, that matches what we observed: requests stalling or timing out.
The people in aws/aws-sdk-go#2972 observed their requests succeed, they just got "slow". This non-deterministic nature of things makes me think that the following conditions are all met:
That allows for the following scenarios to happen:
Thanks for adding this. I was also super curious about which change in our system caused this issue to surface, especially given that this new ec2 metadata firewall thingy was introduced in 2019. About this hypothesis:
https://github.com/aws/aws-sdk-go/blob/v1.38.68/aws/ec2metadata/service.go#L82

```go
func NewClient(cfg aws.Config, handlers request.Handlers, endpoint, signingRegion string, opts ...func(*client.Client)) *EC2Metadata {
	if !aws.BoolValue(cfg.EC2MetadataDisableTimeoutOverride) && httpClientZero(cfg.HTTPClient) {
		// If the http client is unmodified and this feature is not disabled
		// set custom timeouts for EC2Metadata requests.
		cfg.HTTPClient = &http.Client{
			// use a shorter timeout than default because the metadata
			// service is local if it is running, and to fail faster
			// if not running on an ec2 instance.
			Timeout: 1 * time.Second,
		}
		// max number of retries on the client operation
		cfg.MaxRetries = aws.Int(2)
	}
```

Super tight timeout criterion: makes sense. So, how can the long delays happen? I suppose if an outer caller retries the higher-level operation many more times (i.e. a retry loop around the inner two-attempt, two-second retry loop that the code above shows). Well. Seen enough. This is pure joy, and we don't need to understand every detail if, after all, the symptoms we saw go away.
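To make that compounding concrete, here is a purely illustrative sketch (none of this is code from this repo, Loki, or the AWS SDK; the outer retry policy and backoff are made up) of how an inner two-attempt, 1-second-timeout loop can add up to long, variable delays once it is wrapped in higher-level retries:

```go
// Purely illustrative: how an outer retry loop around an operation that
// internally makes two 1-second metadata attempts can turn a "fast fail"
// into a long, variable delay.
package main

import (
	"errors"
	"fmt"
	"time"
)

// fetchCredentials stands in for the higher-level operation; internally the
// SDK's ec2metadata client would try twice with a 1-second timeout each.
func fetchCredentials() error {
	time.Sleep(2 * time.Second) // two attempts x 1s timeout, both lost to the firewall
	return errors.New("metadata request timed out")
}

func main() {
	start := time.Now()
	var err error
	for attempt := 1; attempt <= 5; attempt++ { // hypothetical outer retry policy
		if err = fetchCredentials(); err == nil {
			break
		}
		time.Sleep(time.Duration(attempt) * time.Second) // made-up linear backoff
	}
	fmt.Printf("gave up after %s: %v\n", time.Since(start).Round(time.Second), err)
}
```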
@jgehrcke Something I found on this quest is that Loki overrides the AWS Go HTTP client, meaning it overrides the client you linked to earlier. I still don't know why it stalls for 120s though.
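For illustration (a sketch, not Loki's actual code): supplying a custom `http.Client` on `aws.Config` makes the `httpClientZero(cfg.HTTPClient)` check in the snippet above evaluate to false, so the ec2metadata client keeps that client and its timeout instead of getting the SDK's tight 1-second override. The 90-second value below is made up.

```go
// Sketch: a custom HTTP client on aws.Config bypasses the EC2Metadata
// 1-second timeout override shown above.
package main

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	cfg := aws.NewConfig().WithHTTPClient(&http.Client{
		Timeout: 90 * time.Second, // hypothetical; a zero value would mean no timeout at all
	})
	sess := session.Must(session.NewSession(cfg))

	// This metadata client now inherits the custom HTTP client above,
	// not the SDK's 1-second EC2Metadata default.
	md := ec2metadata.New(sess)
	_ = md
}
```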
OK, now I understand where the time is spent when things are 'slow' or ultimately time out.
but
The Go AWS SDK implements a fallback from IMDSv2 to IMDSv1 if it fails to obtain the secret token required for IMDSv2. In aws/aws-sdk-go#2972 (comment) we can see how long this can take.
That's 20 seconds lost against the firewall. 🔥 And of course in practice this is a distribution around 20 seconds: enough room for race conditions and, after all, the non-deterministic fallout we had seen. The change in this pull request will make the IMDSv2 way of doing things succeed immediately, getting rid of the spurious and dubious 20-second delays.
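For concreteness, a minimal sketch of the knob in question: `HttpPutResponseHopLimit` in the instance metadata options, raised from the default of 1 to 2 so the IMDSv2 token response survives the extra hop into the container. This PR presumably applies it through the cluster's launch configuration rather than a direct SDK call; the AMI ID and instance type below are placeholders.

```go
// Sketch: launching an EC2 instance with a hop limit of 2 in its
// instance metadata options. Not the actual change in this PR.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder
		InstanceType: aws.String("t3.xlarge"),             // placeholder
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		MetadataOptions: &ec2.InstanceMetadataOptionsRequest{
			HttpEndpoint:            aws.String("enabled"),
			HttpPutResponseHopLimit: aws.Int64(2), // default is 1
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Instances[0].InstanceId))
}
```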
I hope that this only overrides the HTTP client used for S3 interaction, and it looks like it does:
The HTTP client used for interacting with the metadata service is probably still the one in the Go SDK.
Of course a super interesting question. If changing the launch configuration for the auto-scaling group on EC2 requires spawning new EC2 instances then this is of course not so cool (given our current ambitious upgrade strategy, haha :D). It looks like there is at least one approach that is promising for 'fixing' existing EC2 instances, based on this comment:
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_ModifyInstanceMetadataOptions.html
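A minimal sketch of what "fixing" a running instance could look like with the Go SDK (the instance ID is a placeholder; whether the upgrade path would actually use the SDK, the CLI, or something else is exactly the open question here):

```go
// Sketch: raising the hop limit on an already-running instance via
// ModifyInstanceMetadataOptions, so existing nodes can be fixed without
// being replaced.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	_, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
		InstanceId:              aws.String("i-0123456789abcdef0"), // placeholder
		HttpPutResponseHopLimit: aws.Int64(2),                      // allow one extra hop for containers
	})
	if err != nil {
		log.Fatal(err)
	}
}
```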
Back in #1226 (comment), we saw that the S3 bucket list request was stalling for 2+ minutes. It turns out it was one of the requests, an `ec2metadata/GetToken` that makes a `PUT /latest/api/token HTTP/1.1`, that was hanging (a sketch of that request is included after the notes below). This led me to find aws/aws-sdk-go#2972, which mentions that containers running on EC2 instances have issues accessing the AWS instance metadata endpoint due to a network knob not being configured properly. I've run a build in the draft PR #1351 with this fix applied and the stalling AWS instance metadata requests seem to be fixed.
A few notes from that build:
- The `default` tenant would time out because the `startupProbe` we added could only check one tenant.

It's most likely we started seeing an increased error rate in #1226 due to a recent Loki bump that includes grafana/loki#4188, which bumped the aws-sdk-go from 1.38.60 to 1.38.68. This particular PR https://github.com/aws/aws-sdk-go/pull/3962/files might have led to this timeout.
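For reference, here is a minimal sketch (not the SDK's code) of the request behind `ec2metadata/GetToken`: the IMDSv2 token `PUT` mentioned above. The 2-second timeout is just for illustration; from inside a container with the default hop limit of 1, the response packets never make it back and the call runs into its timeout.

```go
// Sketch of the IMDSv2 token request that ec2metadata/GetToken performs:
// a PUT to the instance metadata endpoint asking for a session token.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second} // illustrative timeout

	req, err := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for a token valid for 6 hours (21600 seconds).
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err) // inside a container with hop limit 1, this is where it fails
	}
	defer resp.Body.Close()

	token, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("got IMDSv2 token:", string(token))
}
```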
Fix #1226