Add autoscaling #2

Closed
bradenmacdonald opened this issue Dec 6, 2022 · 11 comments

@bradenmacdonald
Contributor

No description provided.

@adzuci

adzuci commented Jan 10, 2023

Hey @bradenmacdonald, in case it's helpful, this is an example of how 2U defines an HPA in a Django IDA chart:


{{ if and .Values.app.enabled .Values.app.autoscaling.enabled }}
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app.kubernetes.io/instance: {{ .Values.app.name }}
    app.kubernetes.io/name: {{ .Values.app.name }}
  name: {{ .Values.app.name }}
spec:
  minReplicas: {{ .Values.app.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.app.autoscaling.maxReplicas }}
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Values.app.name }}
  targetCPUUtilizationPercentage: {{ .Values.app.autoscaling.targetCPUUtilizationPercentage }}

{{ end }}
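For reference, a `values.yaml` fragment that would satisfy this template might look like the following. The names and numbers here are illustrative placeholders, not 2U's actual defaults:

```yaml
# Illustrative values consumed by the HPA template above
app:
  enabled: true
  name: my-ida          # placeholder application name
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
```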

@antoviaque

@jfavellar90 Could you post a status update on this task here? This way we could follow & discuss here async, ahead of the next meeting.

@antoviaque

@jfavellar90 Are you still interested in working on this task?

@jfavellar90
Contributor

jfavellar90 commented Jan 24, 2023

@antoviaque I'm sorry for the late answer; I was a bit busy. However, I'm still interested in working on this one.

It's important to distinguish between the two main mechanisms to implement autoscaling in Kubernetes:

  • Pod-based scaling: here we can mention the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). The first spins up new pods according to the value of a metric (CPU, memory); the second aims to stabilize each pod's resource consumption so it stays between the limits and requests specified in the initial pod configuration. Both are meant to be applied to a Deployment. Following this reasoning, these resources are namespace-specific, considering there's an installation per namespace. Do you agree?
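For comparison with the HPA template posted above, a minimal VPA object looks roughly like this. It assumes the VPA operator and its CRDs are already installed on the cluster, and the `lms` Deployment name is just an example target:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: lms-vpa          # example name; one per namespace/installation
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms            # example target Deployment
  updatePolicy:
    updateMode: "Off"    # "Off" = recommendations only; "Auto" lets VPA evict and resize pods
```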

Regarding Tutor support for enabling these mechanisms, there was an effort to include the feature in the Tutor core. However, the approach was changed to use plugins; the grove plugin is a good example, with many configuration settings for HPA. In our plugin Drydock we implement similar logic.
I also checked the HPA implementation in @lpm0073's cookie-cutter repo via Terraform, which defines the scaling behavior to prevent abrupt changes in the number of pods.

It would be interesting to run some load testing to fine-tune consumption limits for an HPA.

  • Node-based scaling: this mechanism allows the addition of new nodes to the K8s cluster so new workloads can be scheduled. The tools explored in the community are:
    1. Cluster-autoscaler (CA): this tool makes sure there are enough resources to satisfy the current cluster needs (like scheduling new pods). In the case of AWS EKS, it modifies the number of instances in a node group (ASG) to provide enough capacity to schedule the current workloads. It supports different cloud providers, though it requires the creation of extra resources in the cloud provider to work properly. It is used by Grove as a node autoscaler.
    2. Karpenter (https://karpenter.sh/): this tool only supports AWS for now. It's more flexible than cluster-autoscaler in the sense that it allows diverse compute requirements, and it is more tolerant of very high demand.
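As a sketch of what the cluster-autoscaler setup tends to look like on EKS (following the general pattern from the upstream project; the image version, tag values, and `<cluster-name>` are placeholders, not a tested configuration), the relevant container arguments are along these lines:

```yaml
# Fragment of a cluster-autoscaler Deployment's container spec (illustrative only)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2  # pick a version matching the cluster
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups
      # Auto-discover ASGs tagged for this cluster; <cluster-name> is a placeholder
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```

The IAM permissions allowing the autoscaler to describe and resize ASGs are the "extra resources in the cloud provider" mentioned above.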

Questions:

  • Are HPA and VPA in the scope of this plugin, given that these resources are tied to deployments in every installation? Or should we define a community plugin that only handles HPA and VPA (something like this repo by Aulasneo, plus the VPA feature)?
  • Node-based solutions make sense to me in the global context we are trying to cover. How should we handle the external cloud-provider-specific resources required to make these tools work properly?

@antoviaque

@jfavellar90 Thank you for the update! What would be your timeline for this work?

@bradenmacdonald How would you go about answering Jhony's questions? Who should be involved in this discussion?

@jfavellar90
Contributor

@antoviaque @bradenmacdonald here are some notes to move forward on this task:

  • As I mentioned earlier, I consider the best approach for pod autoscaling to be a plugin that adjusts HPA and VPA per environment. I created this repo: https://github.com/eduNEXT/tutor-contrib-pod-autoscaling in order to condense this logic. It will be inspired by the existing repo https://github.com/aulasneo/tutor-contrib-hpa plus the VPA logic. The problem I see here is that VPA requires a Helm chart to be installed globally, but that one can be managed in our tutor-contrib-multi chart.
  • I already created a PR, which is still WIP, to add a couple of chart dependencies required to enable pod autoscaling. I was wondering if I could install those dependencies in different namespaces, but it seems that's not possible. Would it be OK to install all these charts in the same namespace, even when they're installed for different purposes (ingress, autoscaling, etc.)?
  • I'll start the implementation of cluster-autoscaler as the node autoscaler solution. From there, we can consider support for other solutions like Karpenter.

@antoviaque

Copying the relevant notes from Keith from the meeting:

  • Decided to step forward; there are two ways to scale: pods and nodes.
  • A PR was created in the repo, although a couple of dependencies are required.
  • Still a work in progress though.
  • Hopes to add a couple more details in the next few days.
  • Will add HPA, but not VPA, based on the existing plugin.
  • Although we can merge both into one plugin, it will need to be configured for each instance deployed with Tutor.
  • Node autoscaling hasn't started yet, but he hopes to work on a cluster autoscaling solution.
  • From that point, we can then implement Karpenter.
  • But he will start with the tool that works across cloud providers.
  • Next step: the PR needs to be moved from Draft to Review, and then proceed with the cluster autoscaling implementation.

@bradenmacdonald
Contributor Author

Here's my understanding:

  • Scaling of nodes is usually configured at the cluster level, with your cloud provider, or managed with something like Karpenter. It may depend on having metrics-server installed on the cluster. In any case, since it happens outside of the cluster and changes the cluster, it feels out of scope for this helm chart for now.
  • We definitely need HPA to scale each individual Open edX instance dynamically. e.g. if lots of students are taking an exam on instance A, the instance A LMS pods should have more replicas scaled up.
  • It is impossible, as far as I know, to set up auto-scaling rules on the cluster as a whole (e.g. say "all pods tagged with the label lms should be scaled between 1 and 15 replicas with CPU usage between 20-70%"). Instead, we need to install a HorizontalPodAutoscaler on the cluster for each pod that is auto-scaled, which means that we need to implement HPA as a Tutor plugin. It could be a separate Tutor plugin or built into the Tutor plugin that's included with this repo. Either way, we want to have a good HPA experience out of the box.
  • Vertical Pod Auto-scaling is a different story, and I'm not as familiar with it. It makes sense to use if you have some stateful deployment type where the resource usage can vary wildly from deployment to deployment. My impression of Open edX is that it tends to be somewhat stable in how much memory it uses per LMS worker after the worker has warmed up, so I'm not sure if it would be very helpful for LMS pods. It may be helpful for something stateful like Redis or MySQL. Do people here have experience with which Open edX things it is well suited to?
    • My instinct is for now to focus on setting "good" requests/limits for the Open edX pods by default rather than VPA, but I'm open to other approaches if that's not feasible.
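To illustrate the "good defaults" idea, the `resources` stanza on (say) an LMS container would look something like the fragment below. The numbers are placeholders to be refined through load testing, not recommended values:

```yaml
# Illustrative resource requests/limits for an LMS container (numbers are placeholders)
resources:
  requests:
    cpu: 250m
    memory: 512Mi   # what the scheduler reserves for the pod
  limits:
    cpu: "1"
    memory: 1Gi     # pod is CPU-throttled or OOM-killed above its limits
```

Note that HPA's CPU-utilization target is computed relative to the request, so these defaults and the HPA thresholds need to be tuned together.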

@bradenmacdonald
Contributor Author

Oh and thanks for the nice start on this @jfavellar90 :)

@jfavellar90
Contributor

Hi @bradenmacdonald. I added changes to the repo https://github.com/eduNEXT/tutor-contrib-pod-autoscaling. This repo supports HPA and VPA (VPA with automatic suggestions). I still need to complete the docs in the repo by adding every configuration variable. It was tested on Tutor 15 and works fine. Things I think we can improve:

  • Support HPA behaviors (this allows controlling how the deployment will scale).
  • Better VPA support.
  • Tune the HPA values based on load tests to suggest configurations according to instance size.
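Regarding the first point, the `autoscaling/v2` API exposes a `behavior` stanza for exactly this. A sketch of what it could look like for the LMS (the name, replica bounds, and policy values are illustrative, not tested recommendations):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lms-hpa                         # example name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lms                           # example target Deployment
  minReplicas: 1
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1                      # remove at most 1 pod per minute
          periodSeconds: 60
```

This addresses the same concern as the Terraform-based HPA mentioned earlier in the thread: preventing abrupt changes in the number of pods.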

@felipemontoya
Member

@bradenmacdonald is reviewing this and it's very close to being merged.
