
AWS Cluster Autoscaler Permissions #113

Closed
pluttrell opened this issue Jun 9, 2017 · 17 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider

Comments

@pluttrell

Using v0.5.4 of the aws-cluster-autoscaler, we're getting this error:

E0609 23:20:59.162974       1 static_autoscaler.go:108] Failed to update node registry: Unable to get first autoscaling.Group for node-us-west-2a.dev.clusters.mydomain.io

It certainly looks like a permissions problem, but per the instructions I have the following policy attached to my instance role, nodes.dev.clusters.mydomain.io:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Resource": "*"
        }
    ]
}

Without this policy attached, I get a different error:

E0609 23:05:48.475214       1 static_autoscaler.go:108] Failed to update node registry: AccessDenied: User: arn:aws:sts::11111111111:assumed-role/nodes.dev.clusters.mydomain.io/i-0472257b3f8d4ec43 is not authorized to perform: autoscaling:DescribeAutoScalingGroups
	status code: 403, request id: 2cf17af0-4d68-11e7-825c-73c99354b20d

So we believe we have the necessary permissions.

For reference, here's our execution config:

./cluster-autoscaler
--cloud-provider=aws
--nodes=1:10:node-us-west-2a.dev.clusters.mydomain.io
--nodes=1:10:node-us-west-2b.dev.clusters.mydomain.io
--nodes=1:10:node-us-west-2c.dev.clusters.mydomain.io
--scale-down-delay=10m
--skip-nodes-with-local-storage=false
--skip-nodes-with-system-pods=true
--v=4

Any ideas on what to do?
Is there any strategy for debugging this?

@zaa

zaa commented Jun 12, 2017

Judging by the code from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L114 it looks like you've passed an incorrect group name.

@mwielgus
Contributor

@pluttrell Was it a problem with the group name?

@pluttrell
Author

Nope, the group names were identical to what was in AWS.

We do, however, have the aws-cluster-autoscaler working perfectly using the Kubernetes resource files directly, without Helm, so we've gone with that option for now.

@mwielgus
Copy link
Contributor

Great :). Closing the bug.

@7chenko

7chenko commented Sep 8, 2017

I'm getting a similar error with kops 1.7.0, Kubernetes 1.7.5, and cluster-autoscaler 0.6.1, but only when trying to scale from 0 nodes. According to this, as of CA 0.6.1 I should be able to scale to/from 0. I'm getting errors like this:

E0908 03:18:13.511590       1 static_autoscaler.go:118] Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-west-2.amazonaws.com/: dial tcp: i/o timeout

I'm using a deployment similar to this one, and it works as long as there is at least one node up:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=0:10:nodes.uswest2.metamoto.net
          env:
            - name: AWS_REGION
              value: us-west-2
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule

@7chenko

7chenko commented Sep 8, 2017

Figured this out: the problem was that no kube-dns pod was running on the master node. To get one running there, I had to add the master toleration to the kube-dns deployment (the same toleration as in the cluster-autoscaler deployment above). Once kube-dns was running on the master, the autoscaler was able to use it to resolve the AWS autoscaling endpoint, fetch ASG info, and scale up from 0 nodes.
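
For reference, a minimal sketch of that kube-dns change, assuming the default deployment name and namespace (only the fields relevant to the change are shown):

# Sketch: add the master toleration to the kube-dns Deployment so a replica can
# be scheduled onto the master, mirroring the cluster-autoscaler deployment above.
# The deployment name/namespace are assumptions and may differ in your cluster.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  template:
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule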

@MrHohn
Member

MrHohn commented Nov 1, 2017

Out of curiosity, does cluster-autoscaler depend on the in-cluster DNS service? Probably not?

Instead of putting kube-dns on the master, what about setting dnsPolicy: Default for cluster-autoscaler so that name resolution does not go through kube-dns?

Using dnsPolicy: ClusterFirst on pods that run on the master node might not work unless a kube-proxy pod also runs on the master (for Service VIP -> backend Pods routing), which isn't always the case (e.g. in GCE kube-up it doesn't).
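
For illustration, a minimal sketch of that change applied to the cluster-autoscaler deployment above (only the relevant pod-spec fields are shown; the image tag is copied from the earlier manifest):

# Sketch: dnsPolicy: Default makes the pod use the node's resolv.conf directly,
# so lookups of autoscaling.<region>.amazonaws.com do not go through kube-dns.
spec:
  template:
    spec:
      dnsPolicy: Default
      containers:
        - name: cluster-autoscaler
          image: gcr.io/google_containers/cluster-autoscaler:v0.6.1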

@shiv9012

shiv9012 commented Mar 14, 2018

@MrHohn @7chenko @StevenACoffman I have tried:

  1. running both cluster-autoscaler and kube-dns on the master
  2. using dnsPolicy: Default for cluster-autoscaler

I'm still getting this error:

Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp ...*:443: i/o timeout

Please advise.

@MrHohn
Member

MrHohn commented Apr 9, 2018

Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp ...*:443: i/o timeout

This looks like a routing or firewall issue instead.

@srossross-tableau

srossross-tableau commented May 9, 2018

I'm getting the original error posted above: Failed to update node registry: Unable to get first autoscaling.Group nodes.public-prod.k8s.local

What steps can I take to debug and fix this?

@srossross-tableau

I think I have the correct AWS permissions to describe the autoscaling groups.

If I exec into the cluster-autoscaler pod and install the AWS CLI, I can run:

aws --region us-west-2 autoscaling describe-auto-scaling-groups | grep nodes
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:***:autoScalingGroup:****:autoScalingGroupName/nodes.public-prod.k8s.local", 

@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider labels May 10, 2018
@aleksandra-malinowska
Contributor

Briefly looking at the code, it seems that AWS returns no groups with this name, even though, judging by the error message, the method is called with the correct group name.

I'm unable to replicate or debug this, but if you get different results for requests made by the Go library and by the command-line tool, the maintainers of those tools may be better able to help.

@christopherhein
Member

christopherhein commented May 16, 2018

@srossross-tableau can you confirm that the original request includes the region, the same way your aws call from inside the container does?

You might need to make sure your env is set correctly.

env:
- name: AWS_REGION
  value: us-west-2

@srossross-tableau

Thanks @christopherhein, that was the issue.

@dthomason

I tested the dnsPolicy: Default approach suggested above and feel it's the best one. It keeps you from having to modify the kube-dns deployment while keeping your masters clean. Thanks!!

ingvagabund pushed a commit to ingvagabund/autoscaler that referenced this issue Aug 26, 2019
UPSTREAM: <carry>: openshift: add custom nodeset comparator
@gazal-k

gazal-k commented Oct 21, 2019

Setting dnsPolicy: Default for cluster-autoscaler, as suggested above, worked for me too on EKS 1.13.

@waterdrops

I hit the same error on EKS 1.13, and setting dnsPolicy: Default fixed it for me as well. Thank you very much @gazal-k.
