
Assign device to input tensors in huggingface server with huggingface backend #3657

Merged: 2 commits into kserve:master on Apr 30, 2024

Conversation

@saileshd1402 (Contributor) commented Apr 30, 2024

What this PR does / why we need it:
Currently, when we run the huggingface server on a GPU, the input torch tensors are placed on the CPU instead of the GPU, which causes the following error:

Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of /mnt/models to local
INFO:kserve:Loading generative model for task 'text_generation'
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
INFO:kserve:Successfully loaded tokenizer
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.58s/it]
INFO:kserve:Successfully loaded huggingface model from path /mnt/models
INFO:kserve:Registering model: gemmagpu
INFO:kserve:Setting max asyncio worker threads as 16
INFO:kserve:Starting uvicorn with 1 workers
2024-04-29 20:13:40.683 uvicorn.error INFO:     Started server process [1]
2024-04-29 20:13:40.683 uvicorn.error INFO:     Waiting for application startup.
2024-04-29 20:13:40.686 1 kserve INFO [start():63] Starting gRPC server on [::]:8081
2024-04-29 20:13:40.687 uvicorn.error INFO:     Application startup complete.
2024-04-29 20:13:40.687 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py:1460: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-2 (_process_requests):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/huggingfaceserver/huggingfaceserver/generative_model.py", line 308, in _process_requests
    self._handle_request(req)
  File "/huggingfaceserver/huggingfaceserver/generative_model.py", line 289, in _handle_request
    outputs = self._model.generate(**kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1105, in forward
    outputs = self.model(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 875, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

To fix this, we can use the ".to()" method from the torch library to move the input tensors onto the same device as the model.
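For illustration, a minimal sketch of the idea (not the exact diff in this PR; the model name and variable names below are assumptions made for the example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical setup for illustration; the real server loads the model from /mnt/models.
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")

    # The model may have been placed on a GPU (e.g. via device_map/accelerate),
    # so read its device instead of hard-coding "cuda".
    device = model.device

    # Tokenize the prompt and move every input tensor (input_ids, attention_mask, ...)
    # onto the same device as the model before calling generate().
    inputs = tokenizer("Hello give me a hello world python program", return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

    outputs = model.generate(**inputs)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Reading the device from the model itself (rather than assuming "cuda") keeps the same code working on CPU-only deployments.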

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  1. Locally ran the huggingface server as mentioned here with the '--backend=huggingface' flag and the meta-llama/Llama-2-7b-chat-hf model on an NVIDIA A100 40GB GPU.
  2. Tested with the following requests, which all responded successfully:
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model":"llama", "prompt":"Hello give me a hello world python program"}'
--------------------------------------------------------------------------------------------
curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello, how are you?"}]}'
--------------------------------------------------------------------------------------------
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model":"llama", "prompt":[1,2,3,4,5,6]}'
  3. Then I created an image from my branch and tested it with a KServe deployment after updating the image field of the InferenceService manifest.

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
@sivanantha321 (Member):

/lgtm

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
oss-prow-bot (bot) removed the lgtm label Apr 30, 2024
@cmaddalozzo (Contributor):

/lgtm

@terrytangyuan (Member) left a comment:

/lgtm

@yuzisun (Member) commented Apr 30, 2024:

/approve
thanks @saileshd1402 !!

oss-prow-bot (bot) commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cmaddalozzo, saileshd1402, sivanantha321, terrytangyuan, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuzisun yuzisun merged commit 0fe5d3f into kserve:master Apr 30, 2024
57 of 58 checks passed
asd981256 pushed a commit to asd981256/kserve that referenced this pull request May 14, 2024
… backend (kserve#3657)

* Assign device of input tensors

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>

* lint fix

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>

---------

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
Signed-off-by: asd981256 <asd981256@gmail.com>