Better GPU support #6
Comments
Hey!
Hi! So I think it would be the responsibility of the "ops" person who deploys the TfJob operator to specify the location of the drivers on the host machine and the appropriate mount points in the cluster. This assumes that all GPU nodes in the cluster use the same driver version and have the drivers installed in the same location. Supporting multiple driver versions really depends on whether K8s eventually supports this.

For the TfJob operator, the goal is really just to cut down on some of the boilerplate when specifying GPU jobs. So with the current operator you can write a TfJob spec to use GPUs like so:
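A sketch of what such a spec might look like with manual GPU plumbing. All field names, the image, and the driver paths below are illustrative assumptions, not the exact TfJob CRD schema:

```yaml
# Hypothetical TfJob spec where the user wires up GPU access by hand.
# Field names and host driver paths are assumptions for illustration.
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-gpu-job
spec:
  replica_specs:
    - replicas: 1
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              env:
                # Point TensorFlow at the driver libraries mounted below.
                - name: LD_LIBRARY_PATH
                  value: /usr/local/nvidia/lib64
              volumeMounts:
                - name: nvidia-libraries
                  mountPath: /usr/local/nvidia/lib64
          volumes:
            # Mount the NVIDIA driver libraries from the host into the pod.
            - name: nvidia-libraries
              hostPath:
                path: /home/kubernetes/bin/nvidia/lib64
          restartPolicy: OnFailure
```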
Since the mount paths would be the same for all TfJobs there's no reason to make the user specify it when creating individual jobs. The user could just specify the following.
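One hypothetical shape for that simplified spec; the `gpus` field is invented here purely for illustration, and the operator would fill in the mounts and scheduling details:

```yaml
# Hypothetical simplified spec: the operator injects the driver volume
# mounts, environment variables, and scheduling constraints itself.
# The `gpus` field is an assumption, not a real TfJob field.
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-gpu-job
spec:
  replica_specs:
    - replicas: 1
      tf_replica_type: MASTER
      gpus: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
          restartPolicy: OnFailure
```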
The TfJob operator would be instantiated with the information it needs to add to the actual JobController specs to use GPUs. This would include adding the volume mounts shown above and scheduling constraints so the job gets scheduled on GPU nodes.
PR #9 is out for review. It's pretty close to what I suggested above. The main difference is that we look at container resources and limits to determine whether GPUs are required, rather than introducing new fields to indicate when GPUs are desired.
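Under that approach, the user would signal the need for GPUs through standard container resource limits rather than an operator-specific field. A sketch, assuming the alpha-era Kubernetes GPU resource name of the time:

```yaml
# Hypothetical spec under the resources/limits approach: the operator
# inspects container resource limits to decide whether to inject GPU
# mounts and scheduling constraints. The resource name below is the
# Kubernetes alpha-era GPU resource and is an assumption here.
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-gpu-job
spec:
  replica_specs:
    - replicas: 1
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
```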
PR #9 is merged. |
We should make it easier to use GPUs.
Right now, to use GPUs the user has to add the appropriate volume mounts to the PodSpec in the TfJob to mount the GPU devices from the host, and set other specs, such as environment variables, if needed.
I think we should have a higher-level API. For example:
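A minimal sketch of what such a higher-level API could look like; the `gpus` field and surrounding field names are hypothetical, invented here for illustration:

```yaml
# Hypothetical higher-level API: the user states only how many GPUs
# each replica needs, and the controller handles the plumbing.
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-gpu-job
spec:
  replica_specs:
    - replicas: 1
      gpus: 1      # invented field: "give each replica one GPU"
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
```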
The TfJob controller could then be instantiated with the necessary information to add the appropriate volume mounts and scheduling information to the pods.