Add autoscaling #2

Closed
bradenmacdonald opened this issue Dec 6, 2022 · 11 comments

@bradenmacdonald
Contributor

No description provided.

@adzuci

adzuci commented Jan 10, 2023

Hey @bradenmacdonald, in case it's helpful, this is an example of how 2U defines an HPA in a Django IDA chart:


{{ if and .Values.app.enabled .Values.app.autoscaling.enabled }}
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/instance: {{ .Values.app.name }}
    app.kubernetes.io/name: {{ .Values.app.name }}
  name: {{ .Values.app.name }}
spec:
  minReplicas: {{ .Values.app.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.app.autoscaling.maxReplicas }}
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Values.app.name }}
  targetCPUUtilizationPercentage: {{ .Values.app.autoscaling.targetCPUUtilizationPercentage }}

{{ end }}
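For reference, a `values.yaml` fragment that would satisfy this template might look like the following. The names and numbers here are illustrative placeholders, not 2U's actual defaults:

```yaml
# Illustrative values consumed by the HPA template above
app:
  enabled: true
  name: my-ida          # placeholder application name
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```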

@antoviaque

@jfavellar90 Could you post a status update on this task here? This way we could follow & discuss here async, ahead of the next meeting.

@antoviaque

@jfavellar90 Are you still interested in working on this task?

@jfavellar90
Contributor

jfavellar90 commented Jan 24, 2023

@antoviaque I'm sorry for the late answer; I was a bit busy. However, I'm still interested in working on this one.

It's important to distinguish between the two main mechanisms to implement autoscaling in Kubernetes:

  • Pod-based scaling: here we can mention the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). The first spins up new pods according to the value of a metric (CPU, memory); the second aims to stabilize each pod's resource consumption so it stays between the limits and requests specified in the initial pod configuration. Both are meant to be applied to a Deployment. Following this reasoning, these resources are namespace-specific, considering there's an installation per namespace. Do you agree?
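For comparison with the HPA template posted above, a minimal VPA object looks roughly like this. It assumes the VPA operator and its CRDs are already installed on the cluster, and the `lms` Deployment name is just an example target:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: lms-vpa          # example name; one per namespace/installation
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms            # example target Deployment
  updatePolicy:
    updateMode: "Off"    # "Off" = recommendations only; "Auto" lets VPA evict and resize pods
```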

Regarding Tutor support for enabling these mechanisms, there was an effort to include the feature in the Tutor core. However, the approach was changed to use plugins; the grove plugin is a good example, with many configuration settings for HPA. In our plugin Drydock we implement similar logic.
I also checked the HPA implementation in @lpm0073's cookie-cutter repo via Terraform, which defines the scaling behavior to prevent abrupt changes in the number of pods.

It would be interesting to run some load testing to fine-tune consumption limits for an HPA.

  • Node-based scaling: this mechanism allows the addition of new nodes to the K8s cluster so new workloads can be scheduled. The tools explored in the community are:
    1. Cluster-autoscaler (CA): this tool makes sure there are enough resources to satisfy the current cluster needs (like scheduling new pods). In the case of AWS EKS, it modifies the number of instances in a node group (ASG) to provide enough capacity to schedule the current workloads. It supports different cloud providers, though it requires the creation of extra resources in the cloud provider to work properly. It is used by Grove as a node autoscaler.
    2. Karpenter (https://karpenter.sh/): this tool only supports AWS for now. It's more flexible than cluster-autoscaler in the sense that it allows diverse compute requirements, and it is more tolerant of very high demand.
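As a sketch of what the cluster-autoscaler setup tends to look like on EKS (following the general pattern from the upstream project; the image version, tag values, and `<cluster-name>` are placeholders, not a tested configuration), the relevant container arguments are along these lines:

```yaml
# Fragment of a cluster-autoscaler Deployment's container spec (illustrative only)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2  # pick a version matching the cluster
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups
      # Auto-discover ASGs tagged for this cluster; <cluster-name> is a placeholder
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```

The IAM permissions allowing the autoscaler to describe and resize ASGs are the "extra resources in the cloud provider" mentioned above.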

Questions:

  • Are HPA and VPA in the scope of this plugin, given that these resources are tied to deployments in every installation? Or should we define a community plugin that only handles HPA and VPA (something like this repo by Aulasneo, plus the VPA feature)?
  • Node-based solutions make sense to me in the global context we are trying to cover. How should we handle the external cloud-provider-specific resources required to make these tools work properly?

@antoviaque

@jfavellar90 Thank you for the update! What would be your timeline for this work?

@bradenmacdonald How would you go about answering Jhony's questions? Who should be involved in this discussion?

@jfavellar90
Contributor

@antoviaque @bradenmacdonald here are some notes to move forward on this task:

  • As I mentioned earlier, I consider the best approach for pod autoscaling to be a plugin that adjusts HPA and VPA per environment. I created this repo: https://github.com/eduNEXT/tutor-contrib-pod-autoscaling in order to condense this logic. It will be inspired by the existing repo https://github.com/aulasneo/tutor-contrib-hpa plus the VPA logic. The problem I see here is that VPA requires a Helm chart to be installed globally, but that one can be managed in our tutor-contrib-multi chart.
  • I already created a PR, which is still WIP, to add a couple of chart dependencies required to enable pod autoscaling. I was wondering if I could install those dependencies in different namespaces, but it seems that's not possible. Would it be OK to install all these charts in the same namespace, even when they're installed for different purposes (ingress, autoscaling, etc.)?
  • I'll start the implementation of cluster-autoscaler as the node autoscaler solution. From there, we can consider support for other solutions like Karpenter.

@antoviaque

Copying the relevant notes from Keith from the meeting:

  • Decided to step forward; there are two ways to scale: pods and nodes.
  • A PR was created in the repo, although a couple of dependencies are required.
  • Still a work in progress though.
  • Hopes to add a couple more details in the next few days.
  • Will add HPA, but not VPA, based on the existing plugin.
  • Although we can merge both into one plugin, it will need to be configured for each instance deployed with Tutor.
  • Node autoscaling hasn't started yet, but he hopes to work on a cluster autoscaling solution.
  • From that point, we can then implement Karpenter.
  • But he will start with the tool that works across cloud providers.
  • Next step: the PR needs to be moved from Draft to Review, and then proceed with the cluster autoscaling implementation.

@bradenmacdonald
Contributor Author

Here's my understanding:

  • Scaling of nodes is usually configured at the cluster level, with your cloud provider, or managed with something like Karpenter. It may depend on having metrics-server installed on the cluster. In any case, since it happens outside of the cluster and changes the cluster, it feels out of scope for this helm chart for now.
  • We definitely need HPA to scale each individual Open edX instance dynamically. e.g. if lots of students are taking an exam on instance A, the instance A LMS pods should have more replicas scaled up.
  • It is impossible, as far as I know, to set up auto-scaling rules on the cluster as a whole (e.g. say "all pods tagged with the label lms should be scaled between 1 and 15 replicas with CPU usage between 20-70%"). Instead, we need to install a HorizontalPodAutoscaler on the cluster for each pod that is auto-scaled, which means that we need to implement HPA as a Tutor plugin. It could be a separate Tutor plugin or built into the Tutor plugin that's included with this repo. Either way, we want to have a good HPA experience out of the box.
  • Vertical Pod Auto-scaling is a different story, and I'm not as familiar with it. It makes sense to use if you have some stateful deployment type where the resource usage can vary wildly from deployment to deployment. My impression of Open edX is that it tends to be somewhat stable in how much memory it uses per LMS worker after the worker has warmed up, so I'm not sure if it would be very helpful for LMS pods. It may be helpful for something stateful like Redis or MySQL. Do people here have experience with which Open edX things it is well suited to?
    • My instinct is for now to focus on setting "good" requests/limits for the Open edX pods by default rather than VPA, but I'm open to other approaches if that's not feasible.
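To illustrate the "good defaults" idea, the `resources` stanza on (say) an LMS container would look something like the fragment below. The numbers are placeholders to be refined through load testing, not recommended values:

```yaml
# Illustrative resource requests/limits for an LMS container (numbers are placeholders)
resources:
  requests:
    cpu: 250m
    memory: 512Mi   # what the scheduler reserves for the pod
  limits:
    cpu: "1"
    memory: 1Gi     # pod is CPU-throttled or OOM-killed above its limits
```

Note that HPA's CPU-utilization target is computed relative to the request, so these defaults and the HPA thresholds need to be tuned together.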

@bradenmacdonald
Contributor Author

Oh and thanks for the nice start on this @jfavellar90 :)

@jfavellar90
Contributor

Hi @bradenmacdonald. I added changes to the repo https://github.com/eduNEXT/tutor-contrib-pod-autoscaling. This repo supports HPA and VPA (VPA with automatic suggestions). I still need to complete the docs in the repo by adding every configuration variable. It was tested on Tutor 15 and works fine. Things I think we can improve:

  • Support HPA behaviors (this allows controlling how the deployment will scale).
  • Better VPA support.
  • Tune the HPA values based on load tests to suggest configurations according to instance size.
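Regarding the first point, the `autoscaling/v2` API exposes a `behavior` stanza for exactly this. A sketch of what it could look like for the LMS (the name, replica bounds, and policy values are illustrative, not tested recommendations):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lms-hpa                         # example name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms                           # example target Deployment
  minReplicas: 1
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1                      # remove at most 1 pod per minute
          periodSeconds: 60
```

This addresses the same concern as the Terraform-based HPA mentioned earlier in the thread: preventing abrupt changes in the number of pods.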

@felipemontoya
Member

@bradenmacdonald is reviewing this and it's very close to being merged.
