
Custom GPU inference image cannot auto-scale across multiple GPUs #924

Closed
rtrobin opened this issue Jul 7, 2020 · 20 comments

rtrobin commented Jul 7, 2020

/kind bug

What steps did you take and what happened:
I have a custom image that serves a prediction API. When I run the auto-scaling sample, all pods attempt to allocate the same GPU, and most pods fail to launch and keep restarting.

What did you expect to happen:
Custom inference pods should auto-scale across multiple GPUs, like the pre-built TF/PyTorch predictors.

Environment:

  • Istio Version:
  • Knative Version:
  • KFServing Version: 0.3
  • Kubeflow version:
  • Kfdef:[k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
  • Minikube version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
issue-label-bot commented

Issue-Label Bot is automatically applying the labels:

Label: area/inference (probability 0.53)

Please mark this comment with 👍 or 👎 to give our bot feedback!


yuzisun (Member) commented Jul 7, 2020

@rtrobin Can you elaborate on how all pods could possibly be allocated to the same GPU? Can you paste the custom InferenceService yaml and the errors you saw?

rtrobin (Author) commented Jul 7, 2020

@yuzisun Hi, I don't know the exact code implementation. The Docker image is shipped by another team and provides a web service. The DL model is compiled from TensorFlow into a dynamic library that is called by the main program.

There are two errors in this case. First, not all pods are scheduled onto GPU worker nodes; some land on CPU worker nodes or even the CPU master node, and those pods fail to launch because no GPU card is found and cuDNN and other NVIDIA libraries fail to load. Second, I guess the code loads the TF model onto the default GPU device, which is card 0. I'm not sure how KFServing assigns each pod to a device; maybe it needs to set the environment variable CUDA_VISIBLE_DEVICES to restrict the devices the pod can see.
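A hypothetical sketch of setting such a variable on the custom container (note that when the nvidia.com/gpu limit is placed correctly under container, the NVIDIA device plugin exposes only the allocated GPU to each container, so a manual override like this is normally unnecessary):

      custom:
        container:
          image: custom-image
          env:
            - name: CUDA_VISIBLE_DEVICES   # hypothetical manual override; usually not needed
              value: "0"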

My yaml file is shown below.

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: custom-service
spec:
  default:
    predictor:
      custom:
        container:
          image: custom-image
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"

salanki (Contributor) commented Jul 7, 2020

Your resources block is at the wrong indentation level; it should be under container. I am surprised KFServing allowed this, tbh. That is why it's not getting scheduled in the right places and can see all GPUs.
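For reference, a minimal sketch of the corrected spec from above, with resources moved under container and nothing else changed:

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: custom-service
spec:
  default:
    predictor:
      custom:
        container:
          image: custom-image
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"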

yuzisun (Member) commented Jul 8, 2020

@salanki @rtrobin In that case those fields are treated as unknown fields. What we really need here is pruning of unknown fields, which is only available in Kubernetes 1.15+.

rtrobin (Author) commented Jul 8, 2020

@salanki Thanks! Stupid typo...

Btw, do we have full docs of which fields can be used and how to set them? Currently only samples are provided, so I have to go through them to find whether any fits my need. For example, I want to define an inference service with limited GPU memory, and I can't tell from the samples whether that is possible. @yuzisun

salanki (Contributor) commented Jul 8, 2020

You mean select a specific GPU type? That you have to do with:

metadata:
  annotations:
    serving.kubeflow.org/gke-accelerator: Tesla_V100

You need to have your nodes tagged with the gke-accelerator label. You are welcome to hit us up on #kfserving on the Kubeflow slack as well for some more interactive discussions.

For GPU inference, I have some additional examples that might be helpful in my own repo: https://github.com/coreweave/kubernetes-cloud/tree/master/online-inference.
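For context, a minimal sketch of how this annotation fits into the InferenceService from earlier in this thread (assuming the GPU nodes are already labeled with the matching accelerator type, as noted above):

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: custom-service
  annotations:
    serving.kubeflow.org/gke-accelerator: Tesla_V100
spec:
  default:
    predictor:
      custom:
        container:
          image: custom-image
          resources:
            limits:
              nvidia.com/gpu: "1"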

salanki (Contributor) commented Jul 8, 2020

@yuzisun: The UX of being able to set fields that take no effect is pretty bad :/

rtrobin (Author) commented Jul 8, 2020

I just realized that a GPU device is occupied exclusively by an individual pod. However, with my earlier (wrong) configuration the device could be shared, which could give better performance in some scenarios, and the cluster could deploy more models than with the current strategy. I was wondering whether GPU memory could be restricted in the yaml file, so that pods could pick a device that still has enough memory free in a "shared" mode. But it seems such a device-sharing mode is not supported yet?

Your repo looks helpful; I will check it out later. Appreciated! @salanki

salanki (Contributor) commented Jul 8, 2020

NVIDIA explicitly does not support sharing the same GPU across multiple containers in the NVIDIA device plugin. NVIDIA/k8s-device-plugin#134 (comment)

yuzisun (Member) commented Jul 8, 2020

> @yuzisun: The UX of being able to set fields that take no effect is pretty bad :/

That's unfortunately a pretty common issue for Kubernetes CRDs; once Kubeflow bumps the Kubernetes minimum requirement we should be able to use the v1 CRD to prune unknown fields.
kubernetes-sigs/kubebuilder#1174
GoogleCloudPlatform/flink-on-k8s-operator#85

yuzisun (Member) commented Jul 8, 2020

> @salanki Thanks! Stupid typo...
>
> Btw, do we have full docs of which fields can be used and how to set them? Currently only samples are provided, so I have to go through them to find whether any fits my need. For example, I want to define an inference service with limited GPU memory, and I can't tell from the samples whether that is possible. @yuzisun

@rtrobin The full API doc is here: https://github.com/kubeflow/kfserving/blob/master/docs/apis/README.md. For custom, it is the Kubernetes container spec.
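For example, since the custom predictor container is a standard Kubernetes Container, fields such as env, ports and resources can be set on it directly. A minimal sketch (the env name and values are placeholders, and which Container fields are honored may depend on the KFServing version):

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: custom-service
spec:
  default:
    predictor:
      custom:
        container:
          image: custom-image
          env:
            - name: MODEL_PATH          # hypothetical variable read by the custom server
              value: /mnt/models
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: "1"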

rtrobin (Author) commented Jul 8, 2020

In my case, at inference time most models don't use more than half of the GPU memory, so it would be very useful to deploy multiple models on one device. I found a few GPU-sharing device plugins, such as gpushare-device-plugin and shared-gpu-nvidia-k8s-device-plugin. Did you, by any chance, try this kind of plugin before? @salanki

rtrobin (Author) commented Jul 8, 2020

> @rtrobin The full API doc is here: https://github.com/kubeflow/kfserving/blob/master/docs/apis/README.md. For custom, it is the Kubernetes container spec.

Thanks!

yuzisun (Member) commented Jul 8, 2020

@rtrobin Instead of using a low-level device plugin, KFServing is working on a solution for co-hosting multiple models in the same container; you can check out the detailed proposal linked on this issue.

rtrobin (Author) commented Jul 9, 2020

> @rtrobin Instead of using a low-level device plugin, KFServing is working on a solution for co-hosting multiple models in the same container; you can check out the detailed proposal linked on this issue.

I briefly worked through the discussion in that issue. Hosting multiple models in the same container is a good idea, especially for the sklearn and xgboost frameworks. As someone mentioned in the thread, Triton has a similar feature for hosting multiple models, and it also uses multiple CUDA streams to improve GPU utilization. For TF, TensorRT or ONNX models, Triton may already be a good option.

That said, for my specific use case this solution can't solve my problem. The inference service shipped to me is more of a custom service than a model to integrate. As the service provider, I can't ask the algorithm developers to use the same framework version; in most cases I don't even know how the service is implemented or which framework it uses. Beyond the framework, there are lots of other packages involved too: maybe one service uses OpenCV 4 while another uses OpenCV 3. It is hard to put two such services in the same container.

yuzisun (Member) commented Jul 10, 2020

@rtrobin You can check out https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md, @k82cn is the expert on this.

ukclivecox (Contributor) commented

We are building the successor to the current Python kfserver in https://github.com/SeldonIO/mlserver
It will support the new V2 dataplane and be a plug-in replacement for the current server for sklearn, xgboost and custom models in KFServing.
Would be happy to discuss how we can ensure this is part of the roadmap.

rtrobin (Author) commented Aug 10, 2020

> @rtrobin You can check out https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md, @k82cn is the expert on this.

Thanks for the info. The original issue was caused by a typo in the config file. I'm closing this issue now.
