
Enable GPU Memory as resource requirement for InferenceService #947

Open

Svendegroote91 opened this issue Jul 14, 2020 · 8 comments

@Svendegroote91

Svendegroote91 commented Jul 14, 2020

/kind feature

Describe the solution you'd like
Would it be possible to add GPU memory as a resource requirement, similar to the GPU count?
For example:

apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
        runtimeVersion: "1.13.0-gpu"
        resources:
          limits:
            aliyun.com/gpu-mem: 3

Is it technically already possible to try this if you have the GPUshare scheduler extender installed on your cluster?

I noticed that they added something related in the Arena repo - kubeflow/arena#211

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label            Probability
area/inference   0.74

@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@yuzisun
Member

yuzisun commented Jul 14, 2020

@Svendegroote91 technically the inference service spec already allows aliyun.com/gpu-mem: 3, since resource limits is a map. Would you like to try it out and let us know if it works out of the box?

@Svendegroote91
Author

@yuzisun OK, I can give it a try and let you know.

If resource limits is a map, I can see that it will work from a scheduling perspective.
However, I wonder whether the memory constraint will be correctly translated into the --per_process_gpu_memory_fraction command-line argument for TensorFlow Serving (see the similar PR kubeflow/arena#211). If not, I would be happy to help with this. Any thoughts?
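
To make this concrete, here is roughly the translation I have in mind, sketched as a custom predictor that could be deployed today (the image tag, the 0.3 fraction, and the flag wiring are illustrative assumptions on my side, not something KFServing currently generates):

apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu-mem"
spec:
  default:
    predictor:
      custom:
        container:
          image: "tensorflow/serving:1.13.0-gpu"
          args:
            # Illustrative only: a limit of aliyun.com/gpu-mem: 3 on a
            # 10 GiB card would correspond to roughly a 0.3 fraction.
            - "--per_process_gpu_memory_fraction=0.3"
            - "--model_name=flowers-sample"
            - "--model_base_path=gs://kfserving-samples/models/tensorflow/flowers"
          resources:
            limits:
              aliyun.com/gpu-mem: 3

The feature request is essentially that the controller derives the fraction from the gpu-mem limit automatically instead of it being hard-coded like this.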

@yuzisun
Member

yuzisun commented Jul 14, 2020

@Svendegroote91 Looks like we will need that argument; feel free to send a PR for it.

@Svendegroote91
Author

Svendegroote91 commented Jul 20, 2020

@yuzisun I tested the setup, but it doesn't work out of the box; a few additional steps will likely need to be implemented.

I'll see if I can file a PR for this somewhere soon.

@yuzisun
Member

yuzisun commented Aug 1, 2020

@Svendegroote91 Sorry for the late reply. KFServing does support GPU images for TFServing; you can specify runtimeVersion: 1.14-gpu, for example. Our v1beta1 API should actually allow you to specify container fields on prepackaged model servers like TFServing. It is worth noting that you can currently achieve all of the above with a KFServing custom predictor.
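
For example, a GPU-backed TFServing predictor would look roughly like this (a sketch only: whole-GPU scheduling via the standard nvidia.com/gpu resource, reusing the model URI from the first example in this issue; the exact runtime tag may vary):

apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      runtimeVersion: "1.14.0-gpu"
      resources:
        limits:
          # whole-GPU scheduling through the standard NVIDIA device plugin;
          # GPU-memory sharing would still rely on an extender like GPUshare
          nvidia.com/gpu: 1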

@631068264

631068264 commented Feb 7, 2023

I use Kubeflow 1.6.1 + k8s-gpushare-schd-extender:1.11-d170d8a (the newest version):

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "firesmoke"
spec:
  predictor:
    containers:
      - name: kserve-container
        image: harbor.xxx.cn/library/model/firesmoke:v1
        env:
          - name: MODEL_NAME
            value: firesmoke
        resources:
          limits:
            aliyun.com/gpu-mem: 1

The error:

Status:
  Components:
    Predictor:
      Latest Created Revision:  firesmoke-predictor-default-00001
  Conditions:
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Revision "firesmoke-predictor-default-00001" failed with message: binding rejected: failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500.
    Reason:                RevisionFailed
    Severity:              Info
    Status:                False
    Type:                  PredictorConfigurationReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  PredictorReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Severity:              Info
    Status:                False
    Type:                  PredictorRouteReady
    Last Transition Time:  2023-02-07T16:30:17Z
    Message:               Configuration "firesmoke-predictor-default" does not have any ready Revision.
    Reason:                RevisionMissing
    Status:                False
    Type:                  Ready
Events:                    <none>
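
A first debugging step for the 500 from the extender's /bind endpoint would be to check the extender's logs and the node's advertised resource (the deployment name and namespace below assume the gpushare-scheduler-extender repo's default install):

  # inspect the scheduler extender's logs around the failed bind
  kubectl logs -n kube-system deploy/gpushare-schd-extender
  # confirm the node actually advertises the shared-GPU resource
  kubectl describe node <node-name> | grep aliyun.com/gpu-mem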

Is there any other GPU-share option for Kubeflow? @yuzisun
