
Problem Deploying Autoscaler with v1.5.1 #1796

Closed
pluttrell opened this issue Feb 6, 2017 · 27 comments
@pluttrell

Using Kops v1.5.0-beta2, if I deploy the Cluster Autoscaler on AWS as described here, it appears to fail. Here's exactly what I ran:

CLOUD_PROVIDER=aws
IMAGE=gcr.io/google_containers/cluster-autoscaler:v0.4.0
MIN_NODES=3
MAX_NODES=24
AWS_REGION=us-east-1
GROUP_NAME="k8s-worker"
SSL_CERT_PATH="/etc/ssl/certs/ca-certificates.crt" # (/etc/ssl/certs for gce)

addon=cluster-autoscaler.yml
wget -O ${addon} https://raw.githubusercontent.com/kubernetes/kops/master/addons/cluster-autoscaler/v1.4.0.yaml

sed -i -e "s@{{CLOUD_PROVIDER}}@${CLOUD_PROVIDER}@g" "${addon}"
sed -i -e "s@{{IMAGE}}@${IMAGE}@g" "${addon}"
sed -i -e "s@{{MIN_NODES}}@${MIN_NODES}@g" "${addon}"
sed -i -e "s@{{MAX_NODES}}@${MAX_NODES}@g" "${addon}"
sed -i -e "s@{{GROUP_NAME}}@${GROUP_NAME}@g" "${addon}"
sed -i -e "s@{{AWS_REGION}}@${AWS_REGION}@g" "${addon}"
sed -i -e "s@{{SSL_CERT_PATH}}@${SSL_CERT_PATH}@g" "${addon}"

kubectl apply -f ${addon}
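(As an aside, the `{{PLACEHOLDER}}` substitution pattern above can be sanity-checked locally against a throwaway template, no cluster needed; this sketch assumes GNU sed, as in the original script:)

```shell
# Local sanity check of the {{PLACEHOLDER}} substitution pattern used above.
# Assumes GNU sed (BSD/macOS sed would need `sed -i '' -e ...`).
tmpl=$(mktemp)
printf 'provider: {{CLOUD_PROVIDER}}\nnodes: {{MIN_NODES}}:{{MAX_NODES}}\n' > "$tmpl"

CLOUD_PROVIDER=aws
MIN_NODES=3
MAX_NODES=24

sed -i -e "s@{{CLOUD_PROVIDER}}@${CLOUD_PROVIDER}@g" "$tmpl"
sed -i -e "s@{{MIN_NODES}}@${MIN_NODES}@g" "$tmpl"
sed -i -e "s@{{MAX_NODES}}@${MAX_NODES}@g" "$tmpl"

rendered=$(cat "$tmpl")
echo "$rendered"
rm -f "$tmpl"
```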

Here is the log from the pod itself:

2017-02-06T21:53:09.516651243Z I0206 21:53:09.516516       1 cluster_autoscaler.go:353] Cluster Autoscaler 0.4.0
2017-02-06T21:53:09.833039609Z E0206 21:53:09.832856       1 event.go:257] Could not construct reference to: '&api.Endpoints{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"cluster-autoscaler", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, Subsets:[]api.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' '%v became leader' 'cluster-autoscaler-362589257-qvjps'
2017-02-06T21:53:09.833083521Z I0206 21:53:09.832940       1 leaderelection.go:215] sucessfully acquired lease kube-system/cluster-autoscaler
2017-02-06T21:55:10.236022480Z E0206 21:55:10.235891       1 aws_manager.go:81] Error while regenerating Asg cache: RequestError: send request failed
2017-02-06T21:55:10.236074713Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T21:57:10.654286671Z W0206 21:57:10.654164       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-04dab8e1abd2eeadf}, error: RequestError: send request failed
2017-02-06T21:57:10.654433665Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T21:59:20.945923291Z W0206 21:59:20.945820       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T21:59:20.945959787Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:01:31.312714356Z W0206 22:01:31.312578       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-045dfa57161b3f669}, error: RequestError: send request failed
2017-02-06T22:01:31.312762096Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:03:41.648172881Z W0206 22:03:41.648044       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:03:41.648207274Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:05:51.955455693Z W0206 22:05:51.955355       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:05:51.955490454Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:08:02.268974568Z W0206 22:08:02.268861       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:08:02.269019618Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: lookup autoscaling.us-east-1.amazonaws.com on 100.64.0.10:53: dial udp 100.64.0.10:53: i/o timeout
2017-02-06T22:10:12.709737594Z W0206 22:10:12.709623       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T22:10:12.709795552Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:12:22.947293286Z W0206 22:12:22.947191       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-04dab8e1abd2eeadf}, error: RequestError: send request failed
2017-02-06T22:12:22.947324194Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:14:33.216621196Z W0206 22:14:33.216505       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T22:14:33.216661881Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
@pluttrell
Author

@ese Do I have the config correct?

@yissachar
Contributor

@pluttrell Is the group name correct? I believe it has to be the name of your node ASG, which likely is not k8s-worker.

@pluttrell
Author

pluttrell commented Feb 6, 2017

@yissachar Thanks for the suggestion. I deleted the old deployment and recreated it, but this time using the name of my Nodes ASG, as follows:

GROUP_NAME="nodes.${NAME}"

But still see the same problem:

2017-02-06T23:14:01.681430960Z I0206 23:14:01.681296       1 cluster_autoscaler.go:353] Cluster Autoscaler 0.4.0
2017-02-06T23:14:01.979278933Z I0206 23:14:01.979177       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:05.432683960Z I0206 23:14:05.432589       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:09.776690773Z I0206 23:14:09.776600       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:13.373972813Z I0206 23:14:13.373852       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:16.476140474Z I0206 23:14:16.476027       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:19.620142570Z E0206 23:14:19.619954       1 event.go:257] Could not construct reference to: '&api.Endpoints{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"cluster-autoscaler", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, Subsets:[]api.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' '%v became leader' 'cluster-autoscaler-3428038825-s49yu'
2017-02-06T23:14:19.620201936Z I0206 23:14:19.620108       1 leaderelection.go:215] sucessfully acquired lease kube-system/cluster-autoscaler
2017-02-06T23:16:20.435300340Z E0206 23:16:20.435030       1 aws_manager.go:81] Error while regenerating Asg cache: RequestError: send request failed
2017-02-06T23:16:20.435344746Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T23:18:20.772820200Z W0206 23:18:20.772702       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-06f447adf256f4e82}, error: RequestError: send request failed
2017-02-06T23:18:20.772863488Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout

@ese
Contributor

ese commented Feb 6, 2017

@pluttrell At a quick review, it seems to be a problem with your DNS.

@pluttrell
Author

@ese I'm using Route53. I have also deployed the Dashboard addon, which is fully accessible at https://api.${NAME}/ui, as are all of the kubectl commands.

@dmcnaught
Contributor

Was that a typo @pluttrell - GROUP_NAME="nodes.${NAME}" - you have $ after { in your comment.
It's working for me with 1.5.0-beta2.

@pluttrell
Author

@dmcnaught Thanks for pointing out the typo in my comment above; I've just corrected it. What I actually ran earlier used the full ASG name, which I copied and pasted, so I'm sure it was correct.

@ese
Contributor

ese commented Feb 7, 2017

@pluttrell I can't reproduce it. Works fine for me with the 1.5.1 kops release.

@pluttrell
Author

After upgrading to 1.5.1, I worked on reproducing the problem and found that using the exact same steps to create 12 clusters, only 2 of them experienced this problem. The other 10 worked fine.

@pluttrell pluttrell changed the title Problem Deploying Autoscaler with v1.5.0-beta2 Problem Deploying Autoscaler with v1.5.1 Feb 8, 2017
@pluttrell
Author

The problem may also come and go, or not be triggered until there's a scaling event.

In a cluster that had previously not reported any errors, I intentionally deployed an exorbitant number of replicas to trigger a scaling event, but it failed while trying to scale up.

I0208 00:37:08.293007       1 scale_down.go:163] No candidates for scale down
I0208 00:37:18.540502       1 scale_down.go:163] No candidates for scale down
I0208 00:37:28.882205       1 scale_down.go:163] No candidates for scale down
I0208 00:37:39.139642       1 scale_down.go:163] No candidates for scale down
I0208 00:37:49.430254       1 scale_down.go:163] No candidates for scale down
I0208 00:37:59.891498       1 scale_down.go:163] No candidates for scale down
W0208 00:40:10.266085       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-091092f5959263783}, error: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout

@justinsb justinsb added the P1 label Feb 8, 2017
@justinsb justinsb added this to the 1.5.2 milestone Feb 8, 2017
@mark-grimes

mark-grimes commented Mar 9, 2017

Was this ever resolved? I have exactly the same problem. Oddly I get the same error if I put a nonsense name for the ASG and/or remove the IAM autoscaling permissions. Could someone confirm what error is displayed if these are incorrect (even though I'm confident they are correct)?

kubernetes v1.5.3 server and client
kops git-723f3bc (built from source from the release branch)

Also, I can confirm the Error while looking for ASG for instance error is only triggered on a scaling event.

@mark-grimes

Finally figured it out. Following the template at https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#1-asg-setup-min-1-max-10-asg-name-k8s-worker-asg-1 works, but the template at https://github.com/kubernetes/kops/tree/master/addons/cluster-autoscaler doesn't.

Correcting for indentation, the diff is (the working version is '<'):

7c7
<     app: cluster-autoscaler
---
>     k8s-app: cluster-autoscaler
12c12
<       app: cluster-autoscaler
---
>       k8s-app: cluster-autoscaler
16c16,18
<         app: cluster-autoscaler
---
>         k8s-app: cluster-autoscaler
>       annotations:
>         scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"master"}]'
19,20c21,22
<         - image: gcr.io/google_containers/cluster-autoscaler:v0.4.0
<           name: cluster-autoscaler
---
>         - name: cluster-autoscaler
>           image: gcr.io/google_containers/cluster-autoscaler:v0.4.0
30d31
<             - --v=4
32d32
<             - --skip-nodes-with-local-storage=false
41d40
<           imagePullPolicy: "Always"
45c44,46
<             path: "/etc/ssl/certs/ca-certificates.crt"
---
>             path: /etc/ssl/certs/ca-certificates.crt
>       nodeSelector:
>         kubernetes.io/role: master

Which of those is the crucial difference I don't know.

@Tibingeo

Tibingeo commented Jun 28, 2017

I get this error in my autoscaler log

2017-06-28T17:15:18.034832321Z I0628 17:15:18.034555       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:20.110544607Z I0628 17:15:20.110304       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:22.116750413Z I0628 17:15:22.116519       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:24.123511893Z I0628 17:15:24.123260       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:26.210659327Z I0628 17:15:26.210364       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:26.714239374Z E0628 17:15:26.713979       1 static_autoscaler.go:108] Failed to update node registry: RequestError: send request failed
2017-06-28T17:15:26.714267753Z caused by: Post https://autoscaling.ap-southeast-1a.amazonaws.com/: dial tcp: lookup autoscaling.ap-southeast-1a.amazonaws.com on 100.64.0.10:53: no such host

What does this error mean? It appears to be failing a DNS lookup against 100.64.0.10:53.

@7chenko

7chenko commented Sep 8, 2017

Getting a similar error, with kops 1.7.0, kubernetes 1.7.5, cluster-autoscaler 0.6.1, but only when trying to scale from 0 nodes. According to this, as of CA 0.6.1 I should be able to scale to/from 0. I'm getting errors like this:

E0908 03:18:13.511590       1 static_autoscaler.go:118] Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-west-2.amazonaws.com/: dial tcp: i/o timeout

I'm using a deployment similar to this one, and it works as long as there is at least one node up:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=0:10:nodes.uswest2.metamoto.net
          env:
            - name: AWS_REGION
              value: us-west-2
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule

@7chenko

7chenko commented Sep 8, 2017

Figured this out: a kube-dns pod was not running on the master node. To run one there, I had to add the master toleration to the kube-dns deployment (same as in the cluster-autoscaler deployment above). Once kube-dns was running on the master, the autoscaler was able to use it to get ASG info from AWS and scale up from 0 nodes.
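For reference, a sketch of what that toleration looks like in the kube-dns Deployment's pod template; the key/effect values here just mirror the cluster-autoscaler manifest above, so treat this as an assumption about your kube-dns manifest rather than a verified patch:

```yaml
# Sketch: master toleration added to the kube-dns pod template,
# mirroring the cluster-autoscaler deployment shown earlier.
spec:
  template:
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule
```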

@chrislovecnm
Contributor

chrislovecnm commented Sep 10, 2017

@andrewsykim we need kube-dns on the master? See above comment.

@andrewsykim
Member

Makes more sense to set dnsPolicy: Default for cluster-autoscaler, as it shouldn't depend on kube-dns anyway.
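A minimal sketch of what that change would look like in the cluster-autoscaler pod spec (with dnsPolicy: Default, the pod inherits the node's resolver instead of going through kube-dns):

```yaml
# Sketch: cluster-autoscaler pod template using the node's DNS config
# rather than kube-dns, per the suggestion above.
spec:
  template:
    spec:
      dnsPolicy: Default   # "Default" = inherit the node's resolv.conf
      containers:
        - name: cluster-autoscaler
          image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
```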

@andrewsykim
Member

^ @7chenko any chance you can open a PR for this?

@kyleu

kyleu commented Sep 10, 2017

I'm seeing the same thing. What's the advice, kube-dns on master or dnsPolicy? How do I accomplish either?

@chrislovecnm
Contributor

@kyleu We need the manifest for the autoscaler to be changed; it should not be using cluster DNS. Also, kube-dns probably should not live on the master.

@kyleu

kyleu commented Sep 11, 2017

Mine turned out to be caused by specifying the AZ (us-east-1a) instead of the region (us-east-1). The URL in the log showed my error, but I overlooked it.
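(The autoscaling endpoint is per-region, not per-AZ; a quick shell sketch of the difference — the trailing letter is what turns a region name into an AZ name:)

```shell
# The autoscaling endpoint wants a region (us-east-1), not an AZ (us-east-1a):
#   https://autoscaling.<region>.amazonaws.com/
AZ="us-east-1a"
REGION="${AZ%[a-z]}"   # drop the trailing AZ letter -> us-east-1
ENDPOINT="https://autoscaling.${REGION}.amazonaws.com/"
echo "$ENDPOINT"
```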

@KernCheh

KernCheh commented Sep 11, 2017

I faced the same issue and changed dnsPolicy from ClusterFirst to Default on the cluster-autoscaler deployment, as per @andrewsykim's suggestion, and it seemed to work: the cluster started to scale down.

@7chenko

7chenko commented Sep 11, 2017

I run kube-dns on the master because I also find that when it runs on the nodes it prevents scale-down (along with kube-dns-autoscaler pod). What's the right way to avoid that? I do have --skip-nodes-with-system-pods=false.

@pluttrell
Author

For 1.7.x clusters, should we set dnsPolicy: Default or not?

@andrewsykim
Member

From the start, cluster-autoscaler should run with dnsPolicy: Default. If someone can open a PR to address that, it would be great; otherwise, I'll push a patch sometime next week when I have some bandwidth. Thanks!

@lgg42

lgg42 commented Mar 16, 2020

Mine turned out to be specifying the AZ (us-east-1a), and not the region (us-east-1). The URL showed my error, but I overlooked it.

For anyone reading this issue, take this into account; it also happened to me.
