
vertical-pod-autoscaler 0.3.0 on AWS EKS - admission controller doesn't kick in #1547

Closed
piontec opened this issue Jan 2, 2019 · 13 comments
Labels: area/vertical-pod-autoscaler, lifecycle/stale

@piontec
Contributor

piontec commented Jan 2, 2019

Hi!
I'm running VPA on an EKS cluster in AWS. EKS supports mutating webhooks, as claimed by AWS. Now, I have the following configuration (to test the "hamster" deployment in "Initial" mode):

$ ksdev get deploy,pod | grep vpa
deployment.extensions/vpa-admission-controller             1         1         1            1           1d
deployment.extensions/vpa-recommender                      1         1         1            1           1d
deployment.extensions/vpa-updater                          1         1         1            1           1d

pod/vpa-admission-controller-58977d995f-knwrr             1/1       Running     0          29m
pod/vpa-recommender-6bf9f87f85-6zz86                      1/1       Running     0          29m
pod/vpa-updater-6df84c89dd-pfb29                          1/1       Running     0          28m

The webhook is registered and seems to be in place:

$ ksdev get mutatingwebhookconfiguration.v1beta1.admissionregistration.k8s.io -o yaml       
apiVersion: v1
items:
- apiVersion: admissionregistration.k8s.io/v1beta1
  kind: MutatingWebhookConfiguration
  metadata:
    creationTimestamp: 2019-01-02T09:39:29Z
    generation: 1
    name: vpa-webhook-config
    namespace: ""
    resourceVersion: "39148745"
    selfLink: /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/vpa-webhook-config
    uid: 4cec7d97-0e72-11e9-889f-127fc02963b2
  webhooks:
  - clientConfig:
      caBundle: [CUT]
      service:
        name: vpa-webhook
        namespace: kube-system
    failurePolicy: Ignore
    name: vpa.k8s.io
    namespaceSelector: {}
    rules:
    - apiGroups:
      - ""
      apiVersions:
      - v1
      operations:
      - CREATE
      resources:
      - pods
    - apiGroups:
      - autoscaling.k8s.io
      apiVersions:
      - v1beta1
      operations:
      - CREATE
      - UPDATE
      resources:
      - verticalpodautoscalers
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The hamster pods are running, and the VPA object is created and successfully updated by the recommender:

$ kdev get verticalpodautoscalers.autoscaling.k8s.io -o yaml
apiVersion: v1
items:
- apiVersion: autoscaling.k8s.io/v1beta1
  kind: VerticalPodAutoscaler
  metadata:   
    clusterName: ""
    creationTimestamp: 2019-01-02T09:50:32Z
    generation: 1
    name: hamster-vpa
    namespace: default
    resourceVersion: "39155158"
    selfLink: /apis/autoscaling.k8s.io/v1beta1/namespaces/default/verticalpodautoscalers/hamster-vpa
    uid: d8733e00-0e73-11e9-9516-0a01d9a5380e
  spec:
    selector:
      matchLabels:
        app: hamster
    updatePolicy:
      updateMode: Initial
  status:
    conditions:
    - lastTransitionTime: 2019-01-02T09:51:05Z
      status: "True"
      type: RecommendationProvided
    recommendation:
      containerRecommendations:
      - containerName: hamster
        lowerBound:
          cpu: 560m
          memory: 262144k
        target:
          cpu: 587m
          memory: 262144k
        uncappedTarget:
          cpu: 587m
          memory: 262144k
        upperBound:
          cpu: 15428m
          memory: "282975409"

But the admission controller itself seems to do nothing: the only logs I get (repeated over and over) are:

I0102 10:05:21.020578       1 reflector.go:357] k8s.io/autoscaler/vertical-pod-autoscaler/pkg/utils/vpa/api.go:89: Watch close - *v1beta1.VerticalPodAutoscaler total 8 items received
I0102 10:05:21.020906       1 round_trippers.go:383] GET https://172.20.0.1:443/apis/autoscaling.k8s.io/v1beta1/verticalpodautoscalers?resourceVersion=39153926&timeoutSeconds=431&watch=true
I0102 10:05:21.021195       1 round_trippers.go:390] Request Headers:
I0102 10:05:21.021284       1 round_trippers.go:393]     Accept: application/json, */*
I0102 10:05:21.021460       1 round_trippers.go:393]     User-Agent: admission-controller/v0.0.0 (linux/amd64) kubernetes/$Format
I0102 10:05:21.021566       1 round_trippers.go:393]     Authorization: Bearer [XXX]
I0102 10:05:21.029808       1 round_trippers.go:408] Response Status: 200 OK in 8 milliseconds
I0102 10:05:21.029968       1 round_trippers.go:411] Response Headers:
I0102 10:05:21.030170       1 round_trippers.go:414]     Audit-Id: a812bbcd-1c26-49a2-9f7e-da07a60b7d51
I0102 10:05:21.030291       1 round_trippers.go:414]     Content-Type: application/json
I0102 10:05:21.030468       1 round_trippers.go:414]     Date: Wed, 02 Jan 2019 10:05:21 GMT

When new pods matching the selector are created, their default resources are not changed, and nothing shows up in the logs. How can I investigate this problem?

@bskiba
Member

bskiba commented Jan 2, 2019

Which Kubernetes version?
You can try curling the VPA admission webhook service from within the cluster and see if any requests appear in the admission-controller logs.
If you have access to the master, you can also take a look at the API server logs - they should note any errors when calling the webhook. I think the log to look at is kube-controller-manager.log.
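For reference, a minimal sketch of that curl check (not from the thread): the service name and namespace come from the webhook configuration dumped above, the curl image choice is arbitrary, and -k is needed because the webhook serves a self-signed certificate.

$ kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
    curl -k -v https://vpa-webhook.kube-system.svc:443/

Any HTTP response here, even an error, only proves the service is reachable from inside the cluster; the interesting part is whether the same call from the API server ever shows up in the admission-controller logs.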

@piontec
Contributor Author

piontec commented Jan 2, 2019

Kubernetes version is "v1.10.11-eks". I did just that in the meantime. I'm pretty sure it's a wrong EKS config - the in-cluster service URL works fine, but I get no calls from the API server when pods are created (checked with tcpdump - nothing, so it's not just a lack of log entries). I'm in contact with AWS support and will update this issue once I learn more. Currently, there's no way to get control plane logs in EKS :|
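For reference, a hedged sketch of such a tcpdump check (not the exact commands from the thread), run on the worker node hosting the admission-controller pod; the pod label and the target port 8000 are assumptions based on the default VPA manifests:

$ POD_IP=$(kubectl -n kube-system get pod -l app=vpa-admission-controller \
    -o jsonpath='{.items[0].status.podIP}')
# then, on the node hosting that pod:
$ sudo tcpdump -i any "host ${POD_IP} and tcp port 8000"

If the API server were calling the webhook, inbound TLS connections to the pod would show up here.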

@bskiba
Member

bskiba commented Jan 2, 2019

I see, that's a bummer :( One thing you can also try in the meantime is to change the failurePolicy of the VPA webhook to Fail (instead of Ignore). This should cause pod creation to fail if the API server fails to call the webhook, and it might surface some cause for that failure (though I wouldn't expect anything too verbose).
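A minimal sketch of that change (not from the thread); the object name and the single-webhook index come from the configuration dumped above. Note that the admission controller registers this configuration itself at startup, so a manual patch may not survive a restart.

$ kubectl patch mutatingwebhookconfiguration vpa-webhook-config --type=json \
    -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'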

@d-nishi

d-nishi commented Jan 7, 2019

/sig aws

@safanaj
Contributor

safanaj commented Jan 30, 2019

Apparently, on AWS EKS the admission controller pod has to listen on 443 (no matter whether the service forwards to some other port). It looks like they resolve the endpoint in a non-standard way (maybe they are not using this https://github.com/kubernetes/kubernetes/blob/release-1.11/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/config/serviceresolver.go).

Applying this https://github.com/kubernetes/autoscaler/pull/1613/files#diff-741c9c09f72b481cf3cb277a6a2ee929 and passing --port=443, it works.
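A hedged sketch of that workaround (on top of the linked PR), assuming the default VPA manifests in kube-system; the JSON-patch paths (first container, first port) are assumptions about the deployment layout, and the first patch replaces any existing container args:

$ kubectl -n kube-system patch deployment vpa-admission-controller --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--port=443"]},
    {"op": "replace", "path": "/spec/template/spec/containers/0/ports/0/containerPort", "value": 443}
  ]'
$ kubectl -n kube-system patch service vpa-webhook \
    -p '{"spec": {"ports": [{"port": 443, "targetPort": 443}]}}'

With the container listening on 443 and the service targetPort pointing at it, the API server's call to https://vpa-webhook.kube-system.svc:443/ lands on the same port either way.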

@bskiba
Member

bskiba commented Jan 30, 2019

@safanaj Thanks for the update! I'll take a look at your PR today hopefully.

@brycecarman

Verify the rules on the security groups you use for the cluster control plane and for the worker nodes. In particular, verify that the control plane security group allows egress to the worker node security group on port 8000 and the worker nodes allow ingress on 8000 from the control plane.

The default node group template allows port 8000 by default.
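For reference, a hedged sketch of adding such a rule with the AWS CLI (not from the thread); both security group IDs are placeholders, and the matching egress rule on the control-plane group is analogous:

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0workernodes0000000 \
    --protocol tcp --port 8000 \
    --source-group sg-0controlplane000000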

@piontec
Contributor Author

piontec commented Feb 28, 2019

Yes, we have checked our security groups; it seems you have to use port 443, as @safanaj mentioned above.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 29, 2019.
@bskiba
Member

bskiba commented May 29, 2019

I think this is fixed already, since the change by @safanaj has been released.
/close

@k8s-ci-robot
Contributor

@bskiba: Closing this issue.

In response to this:

I think this is fixed already, since the change by @safanaj has been released.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@5cat

5cat commented Mar 12, 2023

I faced this issue as well and was seeing the following in the kube-apiserver logs:

failed calling webhook "vpa.k8s.io": failed to call webhook: Post "https://vpa-webhook.kube-system.svc:443/?timeout=30s": context deadline exceeded

Changing the port from 8000 to 10250 fixed the issue in my EKS cluster:
https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-0.13.0/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml#L58,L42

This also requires adding --port=10250, since the default port in the admission-controller container is 8000.

@5cat

5cat commented Mar 12, 2023

Verify the rules on the security groups you use for the cluster control plane and for the worker nodes. In particular, verify that the control plane security group allows egress to the worker node security group on port 8000 and the worker nodes allow ingress on 8000 from the control plane.

The default node group template allows port 8000 by default.

@brycecarman Actually you were right: instead of using port 10250 I can use the default 8000, but I needed to add a security group rule to allow the traffic, which wasn't there by default. I used the EKS Terraform module, and 8000 isn't among its default security group rules.

I couldn't use --port=443; for some reason it told me it couldn't bind to that port.
