VPA recommends 100T for memory and 100G for CPU #5569

Closed
yogeek opened this issue Mar 6, 2023 · 4 comments
yogeek commented Mar 6, 2023

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

k8s.gcr.io/autoscaling/vpa-recommender:0.11.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-21T12:18:10Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-21T12:12:26Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

VPA deployed as a subchart of goldilocks in a Kubernetes cluster on AWS (not EKS; deployed with kubeadm on EC2 instances)

The issue seems to appear when Prometheus is used for recommender history via the extraArgs field (cf. the values below)

What did you expect to happen?:

I would expect coherent values in the recommendations

What happened instead?:

Incoherent values are shown in the recommendations, such as 100T for the memory limit or 100G for the CPU limit:

status:
  conditions:
  - lastTransitionTime: "2023-03-06T13:24:55Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: goldilocks
      lowerBound:
        cpu: 15m
        memory: "104857600"
      target:
        cpu: 23m
        memory: "163378051"
      uncappedTarget:
        cpu: 23m
        memory: "163378051"
      upperBound:
        cpu: 100G
        memory: 100T


How to reproduce it (as minimally and precisely as possible):

Deploy goldilocks with the VPA using the commands below:

cat <<EOF >> kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: goldilocks

helmCharts:
- name: goldilocks
  namespace: goldilocks
  includeCRDs: true
  releaseName: goldilocks
  repo: https://charts.fairwinds.com/stable
  valuesInline:
    # see all values in https://github.com/FairwindsOps/charts/blob/master/stable/goldilocks/values.yaml
    image:
      repository: us-docker.pkg.dev/fairwinds-ops/oss/goldilocks
      tag: v4.7.0
    vpa:
      # vpa.enabled -- If true, the vpa will be installed as a sub-chart
      enabled: true
      # see all values in https://github.com/FairwindsOps/charts/blob/master/stable/vpa/values.yaml
      recommender:
        image:
          repository: k8s.gcr.io/autoscaling/vpa-recommender
        # In order to utilize prometheus for recommender history, we need to pass some extra flags to the recommender
        extraArgs:
          prometheus-address: |
            http://prometheus-k8s.monitoring.svc.cluster.local:9090
          storage: prometheus
      updater:
        enabled: false
EOF

kubectl create ns goldilocks
kustomize build --enable-helm . | kubectl apply -n goldilocks -f -

# enable goldilocks on its own namespace to see some recommendations coming from the VPAs generated by goldilocks
kubectl label ns goldilocks goldilocks.fairwinds.com/enabled=true
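
After applying, a quick way to confirm that the stack is up and that goldilocks started creating VPA objects (the deployment name in the last command is only a guess based on the chart naming, adjust it to what the first command shows):

kubectl get pods -n goldilocks        # goldilocks dashboard/controller + vpa recommender
kubectl get vpa -n goldilocks         # VPAs generated by goldilocks for the labelled namespace
# deployment name below is an assumption from the chart naming
kubectl logs deployment/goldilocks-vpa-recommender -n goldilocks --tail=20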

Additional information

I found a similar issue here but no further information: https://bugzilla.redhat.com/show_bug.cgi?id=1935794

yogeek added the kind/bug label on Mar 6, 2023

voelzmo commented Mar 6, 2023

Hey @yogeek, thanks for raising this issue!

To me, this looks like a UX problem with goldilocks, not so much an issue with VPA itself. Let me try to elaborate why:

The VPA recommendation is what's presented in the status.recommendation.containerRecommendations.target field. If you were using the VPA in updateMode: Auto, this is what your container's resources would be adjusted to: ~163.4MB of memory and 23m of CPU.
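
For reference, a minimal VPA manifest in that mode would look roughly like this – the names are illustrative (goldilocks creates these objects for you), and the --dry-run flag only prints what would be applied:

cat <<EOF | kubectl apply --dry-run=client -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa              # illustrative name
  namespace: goldilocks
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: goldilocks-dashboard   # illustrative target
  updatePolicy:
    updateMode: "Auto"           # updater evicts Pods so they restart with the target recommendation
EOF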

The containerRecommendations entries also have other fields, for example upperBound and lowerBound. They are used by the updater component of the VPA to decide whether a new recommendation should be applied by evicting a Pod. When the currently configured resources of a Pod are greater than the upperBound, the Pod is evicted to be scaled down; when they are smaller than the lowerBound, the Pod is evicted to be scaled up.
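
If you want to see that comparison from the outside, something like this works (pod and VPA names are placeholders):

kubectl get pod <pod-name> -n goldilocks \
  -o jsonpath='{.spec.containers[0].resources.requests}{"\n"}'
kubectl get vpa <vpa-name> -n goldilocks \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].lowerBound}{"\n"}{.status.recommendation.containerRecommendations[0].upperBound}{"\n"}'
# requests below lowerBound -> updater evicts the Pod to scale it up
# requests above upperBound -> updater evicts the Pod to scale it down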

(diagram: vpa-recommendation-bounds)

The statistical model used in the VPA recommender needs 8 days of historic data to produce recommendations and upper/lower boundaries with maximum accuracy. Between starting a workload for the first time and reaching those 8 days, the boundaries become more and more accurate. The lowerBound converges to maximum accuracy much more quickly than the upperBound: the idea is that upscaling can be done much more liberally than downscaling. In your example data above, the upperBound is incredibly high: it is the maximum possible value in the VPA recommender's statistical model, so my assumption is that this snapshot was taken very shortly after starting your workload?
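
If you want to check how much usage history the recommender has already accumulated for a container, the VPA checkpoint objects are one place to look (the exact status fields can vary between versions, and with --storage=prometheus the initial history is loaded from Prometheus instead):

kubectl get verticalpodautoscalercheckpoints -n goldilocks
kubectl get verticalpodautoscalercheckpoints -n goldilocks -o yaml \
  | grep -E 'firstSampleStart|lastSampleStart|totalSamplesCount'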

What goldilocks shows you here are resource configurations for two possible QoS classes: guaranteed and burstable. In the former case, requests need to equal limits, so goldilocks correctly suggests setting CPU requests and limits to the above-mentioned 23m and memory requests and limits to ~164MB. In the latter case, limits can be greater than requests, so goldilocks chooses to show the upperBound values as the "maximum possible setting" for the limits – technically this could be any value that's greater than the `requests` configuration. So this is indeed confusing (the issue you're linking is a good indication that you're not alone with this!), but I think it is a question better addressed to the goldilocks devs, even though it seems to be documented.
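
Spelled out as container resources, the two readings of the same recommendation would look roughly like this (values copied from the status above; this is only a sketch of what goldilocks displays, not output taken from it):

# guaranteed QoS: requests == limits, both taken from target
resources:
  requests: {cpu: 23m, memory: "163378051"}
  limits:   {cpu: 23m, memory: "163378051"}

# burstable QoS: limits > requests; goldilocks fills the limits from upperBound,
# which this early after startup is still the model's maximum (100G CPU / 100T memory)
resources:
  requests: {cpu: 23m, memory: "163378051"}
  limits:   {cpu: 100G, memory: 100T}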

PS: The goldilocks controller itself is deployed with limits == requests, so if you were to enable the VPA's updateMode: Auto, it would keep limits and requests at the same value all the time – i.e. you would end up with the guaranteed settings shown on the left side of the goldilocks dashboard.

Does this make sense?

/remove-kind bug
/kind support

k8s-ci-robot added the kind/support label and removed the kind/bug label on Mar 6, 2023

sudermanjr commented Mar 6, 2023

Thanks for the detailed info here. I actually learned a thing or two about the VPA, which is rather helpful.

Just wanted to mention that this PR changes the behavior of the equals sign in Goldilocks. We also have some planned changes to improve usability.

To me it seems the original issue here is related to your statement:

The statistical model used in the VPA recommender needs 8 days of historic data to produce recommendations and upper/lower boundaries with maximum accuracy.

I have definitely seen the VPA produce extremely large recommended values on the initial recommendation when first setting up a demo. I don't think that's a bug, just something to be aware of about initial recommendations. I will have to keep the 8-day window in mind when talking about this in the future.

TL;DR - Thank you for the details, and I 100% agree with your statements :-D


yogeek commented Mar 6, 2023

@voelzmo thanks for the detailed explanation 👍

Indeed, your assumption is correct: this situation happened very shortly after deploying my workload.

I have notified the goldilocks team about your feedback in the Slack channel.

I guess this issue can be closed, and maybe a warning could be added to the goldilocks documentation so users are aware of this behavior with initial recommendations.


voelzmo commented Mar 7, 2023

Hey @sudermanjr and @yogeek,

Great to see this has been useful for you! One more aside: if you want to go into the nitty-gritty details of the VPA algorithm and how the boundaries evolve with more data, we can take a look at the code documentation for that part:

  • the status.recommendation.containerRecommendations[*].upperBound gets to a reasonable value after 12 hours (the confidence multiplier is 3 by then – see the small sketch after this list) – your mileage for "reasonable" may vary, though ;)
  • the base for upperBound is the 95th percentile, so only 5% of all observed usage samples have been higher than this upperBound
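
To get a feel for how quickly that multiplier shrinks, here is a tiny sketch based on my reading of the estimator code (upper-bound multiplier ≈ 1 + 1/history-length-in-days; treat the formula as an approximation, not as documented behavior):

for days in 0.02 0.5 1 8; do
  awk -v d="$days" 'BEGIN { printf "after %5.2f days: upperBound ≈ target * %.2f\n", d, 1 + 1/d }'
done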

A caveat: when I mentioned the 8 days above, I was referring to the memory histogram's DefaultMemoryAggregationIntervalCount, which just means that for each container the VPA will keep the peak memory usage in a window of 8 days. As you can see from the numbers above, even having a few hours of usage data will vastly improve the numbers.
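
For completeness: that window is configurable on the recommender via flags, which in the chart setup above would go through the same extraArgs block – flag names as I remember them from the recommender's --help, with the defaults shown, so double-check before relying on this:

recommender:
  extraArgs:
    memory-aggregation-interval: 24h0m0s     # length of one aggregation bucket (default)
    memory-aggregation-interval-count: "8"   # number of buckets kept, i.e. the 8-day window (default)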
