
Document autoscaling of preemptible GPU resources #131

Open
sotte opened this Issue Jun 20, 2018 · 14 comments

sotte (Contributor) commented Jun 20, 2018

I would like to be able to define a pool of preemptible GPUs that is automatically used whenever I run experiments or do a hyperparameter search. The resources should be freed after training finishes.

I think this would be a great feature and it would save a lot of money! :)

mouradmourafiq (Member) commented Jun 20, 2018

@sotte I think that's the current behavior: pods used for training are removed once the experiment is done. Or did you mean something else?

sotte (Contributor, Author) commented Jun 20, 2018

Oh, when I was going through the docs I did not get the impression that this is the current behaviour. If it is, that's really great! (But the docs could/should be clearer.)

Actually, I think this feature deserves a separate tutorial because it's so amazing! I renamed the ticket to better reflect that.

sotte changed the title from "Use and autoscale preemptible GPUs" to "Document autoscaling of preemptible GPU resources" on Jun 20, 2018

jorgemf commented Jun 20, 2018

@sotte, @mouradmourafiq I think you are misunderstanding each other. As I understand it, Polyaxon runs inside a cluster you have already provisioned, launching jobs in pods as required. What @sotte wants is for nodes to be added to the cluster when jobs are launched, so that he can bring in preemptible instances on demand, since they are cheaper. You start with a one-node cluster, and when a new experiment requires more pods, new nodes are added to the cluster so that Polyaxon can run the experiment's jobs.

sotte (Contributor, Author) commented Jun 20, 2018

Yes, @jorgemf describes what I want: a minimal cluster that scales up, with preemptible instances and GPUs, as the workload increases.

mouradmourafiq (Member) commented Jun 21, 2018

@sotte @jorgemf This could be a nice feature. I think there are already tutorials describing such a setup, at least for AWS.

@sotte I just realized that you will be at PyData on Friday the 6th. If you are going to be around the whole day, let me know and we can catch up after my presentation.

sotte (Contributor, Author) commented Jun 22, 2018

@mouradmourafiq yes I'm at PyData and I'm really looking forward to your talk! I'll be there for the whole conference, so let's catch up.

jorgemf commented Jul 9, 2018

sotte (Contributor, Author) commented Jul 9, 2018

That looks interesting! Sadly we're using Google Cloud, but doing something similar should work with GKE as well. I'll try to do some research over the weekend.

Also: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
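
For reference, on GKE the cluster autoscaler can apparently be enabled per node pool; a minimal sketch, assuming a cluster named my-cluster and a pool named gpu-pool (both placeholders):

    # enable autoscaling on an existing GKE node pool (names and zone are examples)
    gcloud container clusters update my-cluster \
        --zone europe-west1-b \
        --node-pool gpu-pool \
        --enable-autoscaling --min-nodes 0 --max-nodes 3

With --min-nodes 0, the pool should scale down to zero nodes when nothing is scheduled on it.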

wbuchwalter (Contributor) commented Jul 13, 2018

https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler is what is used under the hood by GKE, and it also works for AWS and Azure (for Kubernetes 1.10+, I believe).
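
On AWS, the autoscaler is typically deployed inside the cluster and told which auto-scaling groups it may resize; a rough sketch of the relevant Deployment excerpt, where the image tag and the ASG name gpu-workers-asg are only illustrative:

    # excerpt of a cluster-autoscaler Deployment (AWS); names and versions are placeholders
    containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/cluster-autoscaler:v1.3.3
        command:
          - ./cluster-autoscaler
          - --cloud-provider=aws
          - --nodes=0:4:gpu-workers-asg      # min:max:name of the GPU auto-scaling group
          - --scale-down-unneeded-time=10m   # how long a node must be idle before removal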

rogerzklotz commented Jul 28, 2018

From the Google docs it looks like we can use the built-in node scheduling to accomplish something like this. I haven't had the chance to try it out; does anyone see anything obvious that would make this a bad idea?
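
If that means relying on the labels and taints GKE already puts on such nodes, a pod could presumably target preemptible GPU nodes roughly like this sketch (a plain pod spec, not Polyaxon-specific; the image name is a placeholder):

    # sketch: schedule a pod onto preemptible GPU nodes via GKE's built-in labels/taints
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"   # label GKE adds to preemptible nodes
      tolerations:
        - key: nvidia.com/gpu                      # taint GKE adds to GPU node pools
          operator: Equal
          value: present
          effect: NoSchedule
      containers:
        - name: train
          image: tensorflow/tensorflow:1.10.1-gpu
          resources:
            limits:
              nvidia.com/gpu: 1                    # a pending GPU request is what triggers scale-up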

vfdev-5 (Contributor) commented Oct 26, 2018

My 2 cents on the topic: I'm trying to accomplish something similar. Here is what I've done, mostly following the Polyaxon tutorial on GKE; it works more or less as intended.

I create a cluster with several node pools, for example (see the gcloud sketch after this list):

  • default-pool with the option --node-labels=polyaxon=core
  • gpu-preempt for GPU experiments, with the options:
      --node-labels=polyaxon=exp-gpu
      --preemptible
      --enable-autoscaling
      --num-nodes "0"
      --min-nodes "0"
      --max-nodes "2"
    This pool automatically gets the taint nvidia.com/gpu=present.
  • cpu-preempt without GPU for builds and jobs, with the options:
      --node-labels=polyaxon=build_job
      --preemptible
      --enable-autoscaling
      --num-nodes "0"
      --min-nodes "0"
      --max-nodes "2"
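
The equivalent gcloud commands for creating the cluster and the two preemptible pools would look roughly like this (cluster name, zone, machine/accelerator types are only examples):

    # sketch only; adjust cluster name, zone, machine types, and accelerator type
    gcloud container clusters create my-cluster --zone europe-west1-b \
        --num-nodes 1 --node-labels=polyaxon=core

    gcloud container node-pools create gpu-preempt \
        --cluster my-cluster --zone europe-west1-b \
        --node-labels=polyaxon=exp-gpu \
        --accelerator type=nvidia-tesla-k80,count=1 \
        --preemptible \
        --enable-autoscaling --num-nodes "0" --min-nodes "0" --max-nodes "2"

    gcloud container node-pools create cpu-preempt \
        --cluster my-cluster --zone europe-west1-b \
        --node-labels=polyaxon=build_job \
        --preemptible \
        --enable-autoscaling --num-nodes "0" --min-nodes "0" --max-nodes "2"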

For Polyaxon itself I then use the following polyaxon-config.yml:

nodeSelectors:
  core:
    polyaxon: core
  experiments:
    polyaxon: exp-gpu   # only one value per label key can be set here (e.g. exp-gpu or exp-small-gpu)
  builds:
    polyaxon: build_job
  jobs:
    polyaxon: build_job

tolerations:
  experiments:
    - key: nvidia.com/gpu
      operator: Equal
      value: present

When I start an experiment, the Docker image build runs on a cpu-preempt node and the experiment itself runs on a gpu-preempt node. All nodes are scaled back down roughly 15 minutes after the builds/experiments have finished and the nodes are no longer needed.
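
With this setup, an experiment presumably only needs to request a GPU for the autoscaler to bring up a gpu-preempt node; a sketch of a polyaxonfile under that assumption (the exact schema depends on the Polyaxon version, and the image and command are placeholders):

    # illustrative polyaxonfile; schema may differ between Polyaxon versions
    ---
    version: 1
    kind: experiment

    build:
      image: tensorflow/tensorflow:1.10.1-gpu-py3   # image build runs on the cpu-preempt pool

    environment:
      resources:
        gpu:
          requests: 1   # the pending GPU request makes the autoscaler add a gpu-preempt node
          limits: 1

    run:
      cmd: python train.py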

HTH

mouradmourafiq (Member) commented Oct 29, 2018

@vfdev-5 This is a nice resource; maybe we should add it as a guide in the docs or a blog post for future reference.

vfdev-5 (Contributor) commented Oct 29, 2018

@mouradmourafiq thanks! If you want, I can add some of this info to "Kubernetes cluster On GKE".

mouradmourafiq (Member) commented Oct 29, 2018

@vfdev-5 sure, that would be a nice thing to add there.
