
Assign device to input tensors in huggingface server with huggingface backend #3657

Merged: 2 commits into kserve:master on Apr 30, 2024

Conversation

@saileshd1402 (Contributor) commented Apr 30, 2024

What this PR does / why we need it:
Currently, when we run the huggingface server on a GPU, the input torch tensors are placed on the CPU instead of the GPU, which causes the following error:

Defaulted container "kserve-container" out of: kserve-container, queue-proxy
INFO:root:Copying contents of /mnt/models to local
INFO:kserve:Loading generative model for task 'text_generation'
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
INFO:kserve:Successfully loaded tokenizer
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.58s/it]
INFO:kserve:Successfully loaded huggingface model from path /mnt/models
INFO:kserve:Registering model: gemmagpu
INFO:kserve:Setting max asyncio worker threads as 16
INFO:kserve:Starting uvicorn with 1 workers
2024-04-29 20:13:40.683 uvicorn.error INFO:     Started server process [1]
2024-04-29 20:13:40.683 uvicorn.error INFO:     Waiting for application startup.
2024-04-29 20:13:40.686 1 kserve INFO [start():63] Starting gRPC server on [::]:8081
2024-04-29 20:13:40.687 uvicorn.error INFO:     Application startup complete.
2024-04-29 20:13:40.687 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py:1460: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-2 (_process_requests):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/huggingfaceserver/huggingfaceserver/generative_model.py", line 308, in _process_requests
    self._handle_request(req)
  File "/huggingfaceserver/huggingfaceserver/generative_model.py", line 289, in _handle_request
    outputs = self._model.generate(**kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/prod_venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1105, in forward
    outputs = self.model(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 875, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/prod_venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

To fix this, we can use the ".to()" method from the torch library to move the input tensors onto the same device as the model.
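For illustration, a minimal sketch of the idea (not the exact diff in this PR; the model name and variable names below are assumptions made for the example):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical setup for illustration; the real server loads the model from /mnt/models.
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")

    # The model may have been placed on a GPU (e.g. via device_map/accelerate),
    # so read its device instead of hard-coding "cuda".
    device = model.device

    # Tokenize the prompt and move every input tensor (input_ids, attention_mask, ...)
    # onto the same device as the model before calling generate().
    inputs = tokenizer("Hello give me a hello world python program", return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

    outputs = model.generate(**inputs)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Reading the device from the model itself (rather than assuming "cuda") keeps the same code working on CPU-only deployments.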

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  1. Locally ran the huggingface server as mentioned here with the '--backend=huggingface' flag and the meta-llama/Llama-2-7b-chat-hf model on an NVIDIA A100 40GB GPU.
  2. Tested with the following requests, which all responded successfully:
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model":"llama", "prompt":"Hello give me a hello world python program"}'
--------------------------------------------------------------------------------------------
curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello, how are you?"}]}'
--------------------------------------------------------------------------------------------
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model":"llama", "prompt":[1,2,3,4,5,6]}'
  3. Then I created an image from my branch and tested it with a KServe deployment after updating the image field of the InferenceService manifest.

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
@sivanantha321 (Member):

/lgtm

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
oss-prow-bot (bot) removed the lgtm label Apr 30, 2024
@cmaddalozzo (Contributor):

/lgtm

@terrytangyuan (Member) left a comment:

/lgtm

@yuzisun (Member) commented Apr 30, 2024:

/approve
thanks @saileshd1402 !!

oss-prow-bot (bot) commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cmaddalozzo, saileshd1402, sivanantha321, terrytangyuan, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuzisun yuzisun merged commit 0fe5d3f into kserve:master Apr 30, 2024
57 of 58 checks passed
asd981256 pushed a commit to asd981256/kserve that referenced this pull request May 14, 2024
… backend (kserve#3657)

* Assign device of input tensors

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>

* lint fix

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>

---------

Signed-off-by: sailgpu <sailesh.duddupudi@nutanix.com>
Signed-off-by: asd981256 <asd981256@gmail.com>