Cluster-autoscaler does not work with VPC-Endpoint (AWS/EKS) #2829

Closed
maust opened this issue Feb 13, 2020 · 17 comments
@maust
Contributor

maust commented Feb 13, 2020

We have a fairly locked-down network in AWS (no internet connectivity at all). AWS services are reachable only via VPC endpoints, and on-premises systems only via Direct Connect. At the same time, we would like to use IAM roles.

Kubernetes: 1.14
Cluster-Autoscaler: 1.14.7

When using cluster-autoscaler, it cannot fetch credentials via STS using the IAM role. To my understanding, the issue is caused by cluster-autoscaler using the global STS endpoint (https://sts.amazonaws.com) instead of the regional one (https://sts.eu-central-1.amazonaws.com). With VPC endpoints it is not possible to replace the global endpoint.

github.com/aws/aws-sdk-go v1.25.18
(see https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md) introduced configuration of regional STS endpoints via the env variable AWS_STS_REGIONAL_ENDPOINTS=regional.

I tried setting the region via the env variable AWS_REGION and also setting AWS_STS_REGIONAL_ENDPOINTS, but the global endpoint is still used.
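For illustration, here is a minimal Python sketch of the endpoint choice described above. It is a simplified model, not the SDK's actual resolver: the function name `sts_endpoint` is hypothetical, the "legacy" default is treated as "always global", and only the standard `amazonaws.com` domain is assumed.

```python
import os

# Global endpoint seen in the logs of this issue.
GLOBAL_STS = "https://sts.amazonaws.com"


def sts_endpoint(region, env=None):
    """Return the STS URL a client would use under this simplified model.

    Assumption: anything other than AWS_STS_REGIONAL_ENDPOINTS=regional
    falls back to the global endpoint.
    """
    env = os.environ if env is None else env
    mode = env.get("AWS_STS_REGIONAL_ENDPOINTS", "legacy").lower()
    if mode == "regional" and region:
        return f"https://sts.{region}.amazonaws.com"
    return GLOBAL_STS


if __name__ == "__main__":
    # Without the variable the global endpoint is chosen.
    print(sts_endpoint("eu-central-1", {}))
    # With the variable the regional endpoint is chosen.
    print(sts_endpoint("eu-central-1", {"AWS_STS_REGIONAL_ENDPOINTS": "regional"}))
```

An SDK version that predates the setting ignores the variable entirely, which would match the behaviour reported here.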

After looking at the 1.14 branch of cluster-autoscaler, it looks like aws-sdk-go v1.23.22 is used (see
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.14/cluster-autoscaler/vendor/github.com/aws/aws-sdk-go/CHANGELOG.md).

I also checked the other cluster-autoscaler branches:

  • 1.14 github.com/aws/aws-sdk-go v1.23.22
  • 1.15 github.com/aws/aws-sdk-go v1.16.26
  • 1.16 github.com/aws/aws-sdk-go v1.23.12
  • 1.17 github.com/aws/aws-sdk-go v1.23.18
  • master github.com/aws/aws-sdk-go v1.28.2

So I would assume that supporting such a use case would be possible by upgrading the aws-sdk-go version to >= v1.25.18 - let me know if I can be of help.
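The version comparison above can be sketched in a few lines of Python. The branch-to-version table is copied from the list in this comment; the helper names (`parse_version`, `supports_regional_sts`) are hypothetical and only illustrate the ">= v1.25.18" check.

```python
# First aws-sdk-go release with AWS_STS_REGIONAL_ENDPOINTS, per this issue.
MIN_SDK = (1, 25, 18)

# Vendored aws-sdk-go versions per cluster-autoscaler branch, as listed above.
VENDORED = {
    "1.14": "v1.23.22",
    "1.15": "v1.16.26",
    "1.16": "v1.23.12",
    "1.17": "v1.23.18",
    "master": "v1.28.2",
}


def parse_version(tag):
    """Turn a tag such as 'v1.23.22' into a comparable tuple (1, 23, 22)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def supports_regional_sts(tag):
    """True if the given aws-sdk-go tag is at least v1.25.18."""
    return parse_version(tag) >= MIN_SDK


if __name__ == "__main__":
    for branch, tag in VENDORED.items():
        print(branch, tag, supports_regional_sts(tag))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "1.9" sorting after "1.25".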

logs:
E0213 16:05:54.390164 1 aws_manager.go:259] Failed to regenerate ASG cache: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: i/o timeout
F0213 16:05:54.390200 1 aws_cloud_provider.go:330] Failed to create AWS Manager: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: i/o timeout
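To confirm which STS endpoint is actually being dialed, the timeout lines can be pulled out of the logs. The following is a small sketch; the regular expression is written against the log format shown above only, and `timed_out_endpoints` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: i/o timeout
DIAL_RE = re.compile(
    r"Post (https://[^/\s]+)/?: dial tcp ([\d.]+):(\d+): i/o timeout"
)


def timed_out_endpoints(log_text):
    """Return the sorted, de-duplicated endpoint URLs that hit a dial timeout."""
    return sorted({match.group(1) for match in DIAL_RE.finditer(log_text)})


if __name__ == "__main__":
    sample = (
        "caused by: RequestError: send request failed\n"
        "caused by: Post https://sts.amazonaws.com/: "
        "dial tcp 54.239.29.25:443: i/o timeout\n"
    )
    for endpoint in timed_out_endpoints(sample):
        kind = "global" if endpoint == "https://sts.amazonaws.com" else "other"
        print(endpoint, kind)
```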

Attached you can find the kubernetes deployment yaml.
deployment.txt

@ajohnstone

ajohnstone commented Mar 10, 2020

This was updated in master 3 months ago, but there has been no release since. Any idea when a new release will be cut?

af6f325

@Jeffwan
Contributor

Jeffwan commented Mar 12, 2020

/assign

@Jeffwan
Contributor

Jeffwan commented Mar 12, 2020

Thanks. I can help resolve this issue and request a newer release. I think what we can do is bump the SDK version; users can then set the env variable AWS_STS_REGIONAL_ENDPOINTS=regional. The SDK client will pick up the env variable and resolve the right endpoint. Is that correct?

@ajohnstone

ajohnstone commented Mar 13, 2020

@Jeffwan That is correct. It's resolved in the version that is already in master.

-	github.com/aws/aws-sdk-go v1.23.18
+	github.com/aws/aws-sdk-go v1.28.2

See go.mod in master here.
af6f325#diff-5b1211f36242f6afe85bdb0062369dc3R16

The upstream fix was here https://github.com/aws/aws-sdk-go/pull/2779/files
Resolved in aws-sdk-go "Release v1.25.18 (2019-10-23)"
https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md#sdk-enhancements-13
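As a rough sketch of the workaround discussed here (an image built with aws-sdk-go >= v1.25.18 plus the env variables), the following Python adds the two variables to every container of a Deployment manifest held as a plain dict. The function name `with_regional_sts` and the sample manifest are hypothetical; only the variable names and the region come from this thread.

```python
import copy


def with_regional_sts(deployment, region):
    """Return a copy of a Deployment dict with the regional STS env vars set.

    Existing entries for the same variable names are replaced, so the
    function can be applied more than once without creating duplicates.
    """
    patched = copy.deepcopy(deployment)
    wanted = {"AWS_REGION": region, "AWS_STS_REGIONAL_ENDPOINTS": "regional"}
    for container in patched["spec"]["template"]["spec"]["containers"]:
        env = [e for e in container.get("env", []) if e["name"] not in wanted]
        env.extend({"name": name, "value": value} for name, value in wanted.items())
        container["env"] = env
    return patched


if __name__ == "__main__":
    # Hypothetical minimal manifest; a real one has many more fields.
    manifest = {
        "spec": {"template": {"spec": {"containers": [
            {"name": "cluster-autoscaler", "env": [{"name": "AWS_REGION", "value": "us-east-1"}]}
        ]}}}
    }
    print(with_regional_sts(manifest, "eu-central-1"))
```

The variables have an effect only if the cluster-autoscaler image was built against an SDK version that reads them.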

@ajohnstone

Fixes #2532

@Jeffwan
Contributor

Jeffwan commented Mar 13, 2020

@ajohnstone Thanks. I plan to do a few cherry-picks soon; I will make the change and include it in the new release.

@Jeffwan
Contributor

Jeffwan commented Mar 24, 2020

Just made the changes on 1.15. I will make the changes for the rest of the versions.

@Jeffwan
Contributor

Jeffwan commented Mar 31, 2020

Sorry, so far only the following versions support this case:

https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.15.6
https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.18.1

The 1.14, 1.16 and 1.17 changes are not included in these releases. The changes will be merged, and you can build your own image in the short term. If you need any help, let me know.

@maust
Contributor Author

maust commented Apr 14, 2020

We are just migrating to 1.15, thanks a lot. If I can find time I will extend the AWS documentation.

@maust
Contributor Author

maust commented Apr 14, 2020

Successfully tested with 1.15.6, thanks again.

I opened #3052

From my side this could be closed, not sure if you want to keep it open until the other versions support it.

@Jeffwan
Contributor

Jeffwan commented Apr 14, 2020

I will leave it open to track changes in other branches. @maust Thanks for the contribution. I will review the doc change

@haofeif

haofeif commented Jun 4, 2020

Hi @Jeffwan, any plan to fix this in the other versions (1.16.x -> 1.17.x)?
Thanks.

@Jeffwan
Contributor

Jeffwan commented Jun 10, 2020

The 1.17.x change has been merged. I got some feedback on 1.16.x and will fix it before the next release. @haofeif

I also updated the PR to address the feedback on the 1.16 change. Once it gets merged, the next release will pick it up. #3003

@ramdesh

ramdesh commented Jul 22, 2020

Hello, is there any info on when this will be released in a 1.16.x or 1.17.x release? Thanks

@Jeffwan
Contributor

Jeffwan commented Aug 5, 2020

1.16.6 and 1.17.3 have been released. Please download the latest version. I will close the issue. Thanks, everyone, for all your feedback.

@Jeffwan
Contributor

Jeffwan commented Aug 5, 2020

/close

@k8s-ci-robot
Contributor

@Jeffwan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

6 participants