Is sharing a GPU among multiple containers feasible? #52757

Open
tianshapjq opened this Issue Sep 20, 2017 · 49 comments

tianshapjq (Contributor) commented Sep 20, 2017

Is this a BUG REPORT or FEATURE REQUEST?: feature request
/kind feature

What happened:
So far we do not support sharing a GPU among multiple containers; one GPU can only be assigned to one container at a time. But we do have requirements for this: is it feasible to manage GPUs just like CPU or memory?

What you expected to happen:
Sharing a GPU among multiple containers, just like CPU and memory.

tianshapjq (Contributor, Author) commented Sep 20, 2017

@vishh @cmluciano is it workable?

tbchj commented Sep 20, 2017

+1

jianzhangbjz (Contributor) commented Sep 20, 2017

/cc

huzhengchuan (Contributor) commented Sep 20, 2017

/cc

RenaudWasTaken (Member) commented Sep 20, 2017

/sig node
until we have a wg-resource-management label

From @flx42:

By default, kernels from different processes can't run on one GPU simultaneously (concurrency but not parallelism), they are time sliced. The Pascal architecture brings instruction-level preemption instead of block-level preemption, but context switches are not free.

Also, there is no way of partitioning GPU resources (SMs, memory), or even assigning priorities when sharing a card.

You also have MPS, which is another problem :D

But I suppose you only mean sharing NVIDIA devices between multiple containers?

Currently we are focusing on making sure GPU enablement through Device Plugin is done right in 1.8 but it could be a goal for 1.9.

tianshapjq (Contributor, Author) commented Sep 20, 2017

@RenaudWasTaken thanks! But another question: where should the GPU enablement code live if we separate GPU support from the kubelet? It seems it's no longer appropriate to place the GPU code in the vendor pkg; do we have to create a new repo related to Kubernetes?

RenaudWasTaken (Member) commented Sep 20, 2017

@tianshapjq see the device plugin design document for 1.8 which is how we plan to support GPUs in the future: kubernetes/community#695

linyouchong (Member) commented Sep 20, 2017

@RenaudWasTaken Do you mean that sharing NVIDIA devices between multiple containers could be a goal for 1.9 ?

dixudx (Member) commented Sep 27, 2017

/cc

vishh (Member) commented Oct 5, 2017

Sharing GPUs is out of scope for the foreseeable future (at least until v1.11). Our current focus is to get GPUs per container working in production.

vishh self-assigned this Oct 5, 2017

reverson commented Oct 20, 2017

/cc

ScorpioCPH (Member) commented Dec 6, 2017

/cc

flx42 commented Feb 14, 2018

FWIW, I'm seeing more and more users/customers asking for a way to share a single GPU across a pod.

mindprince (Member) commented Feb 28, 2018

@flx42 Did you mean sharing a single GPU between different containers belonging to the same pod? What isolation do your users/customers expect in such scenarios?

tianshapjq (Contributor, Author) commented Feb 28, 2018

@flx42 yes, it seems isolation is the blocker at the moment. GPUs don't support production-grade secure isolation, which could cause serious damage if we simply assigned one GPU to multiple containers, IMO. If there is any news about GPU isolation, please let me know :)

WIZARD-CXY (Contributor) commented Feb 28, 2018

FWIW, I'm seeing more and more users/customers asking for a way to share a single GPU across a pod.

I'm one of the many users.

tianshapjq (Contributor, Author) commented Feb 28, 2018

@flx42 @mindprince BTW, now that device plugins are counted as extended resources, does that mean sharing would not be acceptable at present?

flx42 commented Feb 28, 2018

What isolation do your users/customers expect in such scenarios?

Isolation doesn't matter in this case.

allxone commented Mar 1, 2018

+1

brucechou1983 commented Mar 26, 2018

+1

YuxiJin-tobeyjin (Contributor) commented Apr 13, 2018

+1

ericjee commented Jul 3, 2018

@vishh thx for reply:)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 0000:00:08.0     Off |                    0 |
| N/A   34C    P0    30W / 250W |  13691MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     11165    C   uwsgi                                         2021MiB |
|    0     12210    C   /usr/bin/python3                               643MiB |
|    0     13131    C   /usr/local/bin/uwsgi                          4493MiB |
|    0     15411    C   uwsgi                                         2035MiB |
|    0     16215    C   /usr/local/bin/uwsgi                          4493MiB |
+-----------------------------------------------------------------------------+

Currently we just cap memory usage via tf.GPUOptions:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)

We have a bunch of services handling lower-frequency requests, deployed for internal users (VC, test users, etc.). If there were a way to allocate GPU memory and let multiple pods request it like other system resources in k8s, my users could build their apps more efficiently. And yes, over-provisioning is OK.
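For reference, in the TF 1.x API that fraction is passed to the session through a ConfigProto. A minimal sketch, reusing the 0.2 fraction from above:

```python
import tensorflow as tf  # TF 1.x API

# Cap this process at ~20% of the card's memory so that several such
# processes (e.g. several containers sharing the GPU) can coexist.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant([1.0, 2.0]) * 2))
```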

vishh (Member) commented Jul 3, 2018

@ericjee So your workload is TensorFlow. Is it training or inference? Do you care only about memory, or about GPU core isolation as well? I assume you are aware that TF support for unified memory is still experimental. Also, Nvidia does not yet support memory isolation in their GPU stack, and it is not yet clear whether MPS can and should be used for sharing.
When it comes to inference, TF Serving can already handle bin-packing models, so k8s-level support is a nice-to-have that lets k8s avoid two-level scheduling.
For sharing a GPU, it needs to be split into cores and memory bytes. The same GPU can be consumed either as cores, as memory, or as a combination of both. As of now, k8s only supports independent resources. This use case calls for supporting dependent resources, which will not be trivial.
If a workload only needs a small fraction of a GPU, is it a candidate for running on CPUs alone? Have you put some thought into alternatives to supporting GPU sharing natively in k8s?

samedguener commented Jul 5, 2018

Hey guys,

we have the use case of running multiple inference models (with low request rates) on a single GPU, while inference services with high request volumes should be assigned dedicated GPUs.

We have investigated: while it is possible to share one GPU among multiple containers (using nvidia-docker2), it is not possible to do this on K8s (using the Nvidia device plugin, nvidia-docker2, etc. pre-delivered by https://github.com/NVIDIA/kubernetes/, deployed on GCP). K8s returns errors such as 1 Insufficient nvidia.com/gpu or something like overcommitting is not allowed [..]. This happens both for multiple containers in one pod and for multiple pods trying to use the same GPU.

For our full understanding:
  • Where are the exact limitations? (K8s? The device plugin?)
  • Is there any trick we could still achieve that? (At KubeCon 2016, page 21, Rudi Chiarito was talking about a weird trick: https://schd.ws/hosted_files/cnkc16/84/StateOfTheGPUnion.pdf)
  • Is priority-based GPU scheduling an option? (This would all happen on the K8s side. Pods would get a priority, and depending on that, devices would be mounted as needed. K8s would be responsible for this. This would not be real sharing, but it would still solve the problem of assigning multiple GPUs to multiple low-request pods.)

I appreciate any help in this regard.

Best,
Samed

vishh (Member) commented Jul 5, 2018

RenaudWasTaken (Member) commented Jul 5, 2018

Is there any trick we could still achieve that?

If you don't have any security or isolation "requirements" (i.e., the nvidia-docker way of sharing GPUs), then there is a trick.

You could fork the nvidia device plugin to advertise more GPUs to Kubernetes. You would then maintain a map of "fake GPUs" that maps to "real GPUs". When a pod lands on Kubelet you'd inject the real GPU.

e.g.:
On your machine you have GPU-AAAA and GPU-FFFF; your device plugin advertises GPU-1, GPU-2, GPU-3, GPU-4 to the kubelet.
GPU-1 and GPU-2 map to GPU-AAAA, and GPU-3 and GPU-4 map to GPU-FFFF.

When a request for GPU-1 or GPU-2 comes in your forked device plugin will inject NVIDIA_VISIBLE_DEVICES=GPU-AAAA in the environment.

This has a few downsides as it's very likely you'll have to manually manage your infrastructure to avoid running out of memory on the GPU or context switching a lot.
Additionally this means you'll have to maintain your own device plugin.

In any case I hope this helps you :)
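A minimal sketch of that fake-to-real mapping idea, for illustration only: the names and the shares-per-GPU count below are made up, and a real device plugin implements the ListAndWatch/Allocate gRPC calls (usually in Go) rather than plain functions.

```python
SHARES_PER_GPU = 2  # assumed number of containers allowed to share one physical GPU

def advertise(physical_gpus):
    """Build the map of fake device IDs (reported to the kubelet) to real GPU UUIDs."""
    fake_to_real = {}
    for uuid in physical_gpus:
        for i in range(SHARES_PER_GPU):
            fake_to_real[f"{uuid}-shared-{i}"] = uuid
    return fake_to_real

def allocate(requested_fake_ids, fake_to_real):
    """On allocation, inject the real GPU(s) behind the requested fake IDs."""
    real = sorted({fake_to_real[f] for f in requested_fake_ids})
    return {"NVIDIA_VISIBLE_DEVICES": ",".join(real)}

mapping = advertise(["GPU-AAAA", "GPU-FFFF"])    # advertises four fake devices
print(allocate(["GPU-AAAA-shared-1"], mapping))  # {'NVIDIA_VISIBLE_DEVICES': 'GPU-AAAA'}
```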

thomasjungblut commented Jul 5, 2018

You could fork the nvidia device plugin to advertise more GPUs to Kubernetes. You would then maintain a map of "fake GPUs" that maps to "real GPUs". When a pod lands on Kubelet you'd inject the real GPU.

That's a smart idea

straydata commented Aug 4, 2018

@samedguener

solve the problem of assigning multiple GPUs to multiple low-request pods

If this is your issue, have you considered removing the nvidia.com/gpu limit from those containers? https://github.com/NVIDIA/k8s-device-plugin#nvidia-device-plugin-for-kubernetes - Even with the plugin (and by default without it), any container that does not explicitly set an nvidia.com/gpu resource limit will have access to all GPU devices on the node it is scheduled to, and it will share GPU time with any other containers or processes running there.
Note that for this GPU sharing to work, you should still be using nvidia-docker2 (specifically nvidia-runc) in your k8s cluster.

ddysher (Contributor) commented Aug 8, 2018

@samedguener we have exactly the same use case of sharing GPUs across low-QPS inference services. Our current workaround is to run a single service to orchestrate the services; it works but is not ideal. (We've summarized this here, in the section Blocking Issues.)

@straydata we recently found this out as well (somewhat unexpectedly). Are you already leveraging this to share GPUs? From the Kubernetes resource management perspective, this is totally out of control and could potentially harm the system.

vishh (Member) commented Aug 8, 2018

@ddysher is the model you are expecting one where you intend to deploy each model as a microservice and let k8s handle bin-packing even within a single GPU boundary?
If you instead bin pack at the application level (TF serving for example), you can already achieve this today, except that the bin packing is manual.

thomasjungblut commented Aug 9, 2018

Since @samedguener is on vacation, let me pitch in on what we've done so far.

@straydata we had a team that was using this approach internally; before device plugins it needed privileged permissions to work, so we abandoned it quickly.
As @ddysher already figured, this also causes some issues since the GPU capacity is not managed by K8s anymore: in a shared workload a batch job might come in and take the whole GPU. So you'll end up using specific taints and node selectors to maintain pools of these machines. In the end this proved too cumbersome and too much manual effort.

We resorted to a technique we call "model stuffing", which basically stuffs as many models as possible into one Docker container and then does the sharing in the application logic. It is very similar to TensorFlow Serving (TFS) in that regard, but the models are not using TensorFlow: we have a huge number of other frameworks using GPUs for inference, so the general problem of GPU sharing can't be solved with TFS alone. We obviously want to avoid writing a TFS per deep learning framework :-)

A couple of things we lose by using TFS or model stuffing are cgroup limits for CPU and memory per model, and the ability to do request rate limiting per model (separate ingresses etc.).

We then followed @RenaudWasTaken's idea and report more GPUs to Kubernetes than are physically available (we call them VGPUs, 'v' for virtual) through the device plugin. We had huge success with a very simple version that just exposed a static number of VGPUs to Kubernetes, say five per physical GPU. In one of our benchmarks we found that we could run an InceptionV3 model with just 228 MB of VRAM; that's 50 theoretical Inception models we could fit onto one K80. In reality we were limited by CPU on the p2.xlarge instance we had in AWS, so we could only fit 30, but that saving is already amazing.

We then obviously also ran into memory over-commit issues, because we can't really tell up front how much VRAM a given model will use, ultimately leading to the bin-packing issue that @vishh just highlighted. We then resorted to dynamically offering the VGPUs based on free memory (NVML is really helpful here and has some amazing metrics), but that doesn't really fix the underlying issue, since over-commit scenarios can still happen under various conditions and workloads. There are a couple of assumptions we can make about our workload, especially its memory limits, which makes this work for us, but it is a long way from a general solution to the problem. Let's call it what it is: a gross hack.
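As an illustration of that dynamic-offering idea (not the actual plugin code): a sketch using pynvml, the Python NVML bindings, where the 512 MiB slice size per VGPU is an arbitrary assumption.

```python
import pynvml

VGPU_MEMORY_MIB = 512  # assumed memory "slice" backing one virtual GPU

def vgpus_to_advertise():
    """Return {gpu_uuid: number_of_VGPUs} based on currently free VRAM."""
    pynvml.nvmlInit()
    try:
        offers = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .total/.free/.used, in bytes
            offers[uuid] = int(mem.free // (1024 * 1024) // VGPU_MEMORY_MIB)
        return offers
    finally:
        pynvml.nvmlShutdown()

print(vgpus_to_advertise())
```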

To make this more relevant for this particular forum, a couple of things I'd like to see from the device plugin API going forward:

  • per device sub-resources (VRAM and duty cycles), akin to CPU and memory and therefore also support bin packing through the k8s scheduler
  • on top of that, the same pattern of resource request and limits for each of these
  • the protocol should also tell me which devices are currently in use on a given node; we had to do some really nasty hacking around device plugin restarts in order to figure out which pods were running with which VGPUs

I'd like to see cgroup-like enforcement of the above device sub-resources from Nvidia. From our tests we found that there is some fair scheduling of duty cycles underneath, so there must be a way to enable this from the driver's side. We also have multi-tenancy security concerns with this approach, since we could potentially host models of different customers on the same GPU and there is no real per-process VRAM isolation AFAIK. I realize that this is a more invasive change and cuts deep into the hardware architecture, but it would be great to have!

If the demand here is big enough and you can accept the limitations, we can go through SAP's open-sourcing process and make it available to you as a fork of the official Nvidia device plugin. I'd also be looking for contributors for that project, so please let me know if you would be interested. We'll be contributing patches for NVML and Prometheus metrics to the official device plugin in any case.

ddysher (Contributor) commented Aug 9, 2018

@vishh No, I don't expect k8s to handle this for me. By "service", what I mean is application logic that handles GPU sharing (all models are in the same container), exactly the same as the "model stuffing" technique from @thomasjungblut. Sorry for using an overloaded term...

@ddysher is the model you are expecting one where you intend to deploy each model as a microservice and let k8s handle bin-packing even within a single GPU boundary?

jiayingz (Member) commented Aug 9, 2018

@thomasjungblut you may want to take a look at these two KEP proposals:
"Compute Device Assignment": kubernetes/community#2454
"New Resource API proposal": kubernetes/community#2265
They may help address some of the pain points you mentioned in #52757 (comment)

Certain requirements, like resource isolation guarantees, still need to be driven by vendors. I feel many devices we have today may not provide isolation guarantees as strong as CPU and memory do. That is why we are wondering whether IsolationGuarantee can be surfaced by device plugins as a special resource attribute, with workloads specifying their isolation requirements through the ResourceClass API. This way, even though certain device resources may not provide strong isolation guarantees today, pods not requiring multi-tenancy isolation can still share a device through bin packing. It may still take a long time to provide native support for device sharing in k8s, and the complexity of supporting per-device sub-resource scheduling is perhaps too much to add to the default scheduler. If we can surface duty cycles, GPU memory, etc. as resource attributes through the ComputeResource interface, perhaps you can consider building a custom scheduler to perform the special scheduling logic?

thomasjungblut commented Aug 10, 2018

Thanks for the overview @jiayingz. Indeed, a similar project by @sanjeevm0 already exists: https://github.com/Microsoft/KubeGPU

cvaldit commented Aug 17, 2018

Thanks @RenaudWasTaken #52757 (comment)
I forked the plugin and successfully tested "sharing GPUs" with your idea (https://hub.docker.com/r/cvaldit/nvidia-k8s-device-plugin/). It is working, but I hope an official release is coming soon.

jiaxuanzhou (Contributor) commented Aug 20, 2018

mark

warmchang (Contributor) commented Sep 30, 2018

#52757 (comment)
Smart trick. 👍

lcwxz1989 commented Oct 15, 2018

@RenaudWasTaken
Hello, I have another problem: I need to find out which GPU device a given pod is using, and this info cannot be found in the k8s API or in etcd. So is it possible to get this info or not? I think you know k8s well, so thank you.

RenaudWasTaken (Member) commented Oct 15, 2018

Hello @lcwxz1989 !

This is currently not exposed, but there is a design being discussed that would allow it.
For now, if you want to do that, you would need to have your GPU containers expose the GPU they are consuming (maybe via a /gpu HTTP endpoint).

dimm0 commented Oct 15, 2018

@lcwxz1989 I'm running "printenv NVIDIA_VISIBLE_DEVICES" inside containers for now..
Would love to see this capability too.

thomasjungblut commented Oct 16, 2018

@lcwxz1989 if you're into more nasty hacks, you can run the device plugin as privileged and read /proc/<pid>/environ, which contains a variable called HOSTNAME. That is the pod id; I did not find the namespace though, so be careful! You can get the PIDs that are using the GPU from nvmlDevice.Status().
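A rough Python sketch of that hack, assuming a privileged container in the host PID namespace with the host's /proc visible, and using pynvml in place of the Go NVML bindings:

```python
import pynvml

def pod_hostname_for_pid(pid, proc_root="/proc"):
    """Read HOSTNAME (inside a pod this defaults to the pod name) from a process's environment."""
    try:
        with open(f"{proc_root}/{pid}/environ", "rb") as f:
            entries = f.read().split(b"\0")
    except OSError:
        return None
    for entry in entries:
        if entry.startswith(b"HOSTNAME="):
            return entry.split(b"=", 1)[1].decode()
    return None

def gpu_pids_to_pods():
    """Map every PID with a compute context on a GPU to the pod it belongs to."""
    pynvml.nvmlInit()
    try:
        result = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                result[proc.pid] = pod_hostname_for_pid(proc.pid)
        return result
    finally:
        pynvml.nvmlShutdown()

print(gpu_pids_to_pods())
```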

angao (Contributor) commented Nov 23, 2018

/cc

rafmonteiro commented Jan 8, 2019

@cvaldit I was trying your solution and it wasn't working. Then I realized that I got it wrong at first and need to fork the device plugin myself. The question is: how can I do that? Could you share some documentation, please?

cheyang (Contributor) commented Feb 18, 2019

We have open-sourced a solution for GPU sharing in Kubernetes. You are welcome to try it and comment. @RenaudWasTaken

The advantage of this GPU-sharing scheduling extension is that the implementation is Kubernetes-native: it fully utilizes the extension mechanisms of the k8s scheduler and device plugin, so you don't need to rebuild the scheduler or the kubelet.

It also makes it possible to control GPU usage in the application layer. The end user has the flexibility to decide whether to use per_process_gpu_memory_fraction in TensorFlow or Nvidia MPS.

GPU Sharing Scheduler Extender: https://github.com/AliyunContainerService/gpushare-scheduler-extender
GPU Sharing Device Plugin: https://github.com/AliyunContainerService/gpushare-device-plugin

Thanks!
