
Enable dtype support for huggingface server #3613

Merged: 14 commits merged into kserve:master on May 9, 2024
Conversation

@Datta0 (Contributor) commented Apr 19, 2024:

What this PR does / why we need it: Currently the HF runtime does not support switching dtypes, while vLLM does. vLLM loads models in 16-bit by default, whereas HF runs them at 32-bit, causing inconsistencies between the two backends.

Type of changes

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Pass the --dtype=float16 (or bfloat16 or float32) argument when running huggingfaceserver via python -m huggingfaceserver --model_id=meta-llama/Llama-2-7b-chat-hf --model_name=llamachat --backend=huggingface. Check the dtype of the weights and the memory usage (a verification sketch follows below).
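A minimal verification sketch, assuming the transformers library and a CUDA device (the model id mirrors the command above; this snippet is not part of the PR itself):

import torch
from transformers import AutoModelForCausalLM

# Load in half precision to mirror --dtype=float16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
).to("cuda")

# Every parameter should now report torch.float16.
print(next(model.parameters()).dtype)

# Allocated GPU memory should be roughly half the FP32 footprint
# (on the order of 13 GiB instead of ~26 GiB for a 7B model).
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB")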

@@ -98,6 +98,7 @@ def __init__(
self.vllm_engine_args = engine_args
self.use_vllm = not kwargs.get("disable_vllm", False) if _vllm else False
self.ready = False
self.dtype = kwargs.get("dtype","bfloat16")
Member commented:

Typically BF16 is used for training; for inference FP16 is more acceptable.
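As an aside (not part of the PR discussion), the trade-off behind this comment is visible in the formats' numeric limits: BF16 keeps roughly FP32's dynamic range but with far less precision, while FP16 has finer precision but a much smaller range:

import torch

for dt in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dt)
    # eps is the gap between 1.0 and the next representable value (precision);
    # max is the largest finite value (dynamic range).
    print(dt, "eps:", info.eps, "max:", info.max)

# torch.float16  eps ~9.77e-04  max 65504.0
# torch.bfloat16 eps ~7.81e-03  max ~3.39e+38
# torch.float32  eps ~1.19e-07  max ~3.40e+38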

Member commented:

Is this the argument defined by vLLM?

@Datta0 (Contributor, Author) commented Apr 20, 2024:

Yes, it is defined here. I verified the model's dtype after engine creation and it matches the input.
Memory usage matches up too (before initialising the cache and reserving memory for paged-attention blocks).
(screenshot attached)

Member commented:

Can we add a comment noting that this argument is used by both HF and vLLM?

@yuzisun (Member) commented Apr 20, 2024:

Also, should the default be auto?

Contributor Author commented:

In the case of vLLM, auto defaults to FP16 even if the model is FP32, but HF doesn't; it sticks to FP32. The HF config also doesn't always carry information about the dtype of the model weights. So for FP32 models (which are rare but do exist, e.g. GPT-J, OLMo), this would again cause an inconsistency.
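To make the point concrete (a sketch, not code from this PR; the model ids are just examples), the torch_dtype recorded in a checkpoint's config.json can be inspected directly, and for older FP32 checkpoints it may be float32 or missing entirely:

from transformers import AutoConfig

# Many recent checkpoints record their storage dtype in config.json...
print(AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf").torch_dtype)

# ...while older FP32 checkpoints may report float32 or no torch_dtype at all,
# in which case HF loads FP32 but vLLM's "auto" still picks a 16-bit dtype.
print(getattr(AutoConfig.from_pretrained("EleutherAI/gpt-j-6b"), "torch_dtype", None))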

@yuzisun (Member) commented Apr 20, 2024:

@Datta0 Can you help sign off your commit?

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@Datta0 (Contributor, Author) commented Apr 20, 2024:

@yuzisun Done, signed off the commits :)

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@Datta0 (Contributor, Author) commented Apr 22, 2024:

@yuzisun can you please check this :)

@yuzisun (Member) commented May 1, 2024:

> @yuzisun can you please check this :)

@Datta0 can you help rebase the PR to resolve the conflicts?

Datta0 added 2 commits May 2, 2024 05:56
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@datta-nimmaturi commented:
@yuzisun Done :)

@yuzisun (Member) commented May 3, 2024:

@Datta0 the e2e test failed here

  =====================================  Logs for Predictor Pod: hf-t5-small-predictor-00001-deployment-c54b8fc6-tbcmp  =========================================
  INFO:kserve:Loading generative model for task 'text2text_generation'
  ERROR:kserve:Failed to start model server: HuggingfaceGenerativeModel.__init__() got an unexpected keyword argument 'dtype'
  ================================================================================================================
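For context, this is the standard Python failure when a constructor is passed a keyword it does not declare; a minimal reproduction (class and argument names are illustrative, not the actual KServe code):

class GenerativeModel:
    def __init__(self, model_id: str):  # no dtype parameter accepted yet
        self.model_id = model_id

kwargs = {"model_id": "t5-small", "dtype": "float16"}
try:
    GenerativeModel(**kwargs)
except TypeError as err:
    # GenerativeModel.__init__() got an unexpected keyword argument 'dtype'
    print(err)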

@@ -0,0 +1,419 @@
# Copyright 2024 The KServe Authors.
Member commented:

I think the merge conflict is not resolved correctly

Member commented:

this file should no longer exist

Contributor Author commented:

Ooopsie...
Updated it now

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Comment on lines 113 to 116
if (
    "dtype" in args and args.dtype == "auto"
):  # Replacing auto with float16 to make it consistent with HF (vLLM defaults to BF16 for BF16 models but HF doesn't)
    args.dtype = "float16"
Member commented:

This logic seems like it could be folded into the mapping later on line 147:

dtype = kwargs.get("dtype", "auto")
hf_dtype_map = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "half": torch.float16,
    "auto": torch.float16,
    "float": torch.float32,
}
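For illustration only (the real wiring lives in the server code this PR touches; the stand-in kwargs and model id below are hypothetical), such a map would typically feed the torch_dtype argument of from_pretrained:

import torch
from transformers import AutoModelForCausalLM

hf_dtype_map = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "half": torch.float16,
    "auto": torch.float16,
    "float": torch.float32,
}

kwargs = {"dtype": "float16"}  # stand-in for parsed CLI arguments
dtype = hf_dtype_map[kwargs.get("dtype", "auto")]

# Load the weights directly in the requested precision.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=dtype)
print(next(model.parameters()).dtype)  # torch.float16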

Contributor Author commented:

In the case of vLLM, auto defaults to FP16 even if the model is FP32. So if we do the mapping only for HF, then for an FP32 model HF would load in FP32 while vLLM would load in FP16.

@yuzisun (Member) commented May 3, 2024:

But here auto maps to FP16 if dtype is not provided; did I miss anything?

Contributor Author commented:

Yeah, so if the base model itself is BF16, vLLM would load it in BF16 while HF (with this mapping) would load it in FP16. Memory-wise both are pretty much the same, but it is still an inconsistency, right?

@yuzisun (Member) commented May 3, 2024:

I am not sure why vLLM defaults to BF16, which mainly makes sense for training convergence; for inference we should use FP16, which performs more calculations per unit time.

Member commented:

Oh, I think I get your point now: you are trying to set the default to FP16 for vLLM.

@yuzisun (Member) commented May 3, 2024:

Based on the HF doc https://huggingface.co/docs/transformers/en/main_classes/model, users might actually expect the model to be loaded using the dtype from config.json:

> "auto" - A torch_dtype entry in the config.json file of the model will be attempted to be used.

Contributor Author commented:

Yes... but auto being BF16 applies only to BF16 models, i.e. where the weight tensors themselves are stored in BF16, like Llama 3.
There is a compelling reason to default to FP16 (at least for inference, to support older architectures); I don't know why vLLM doesn't do that, and the conversion is straightforward anyway.

Contributor Author commented:

Should we then allow only FP16, BF16 and FP32 as dtypes for both HF and vLLM? That would avoid the confusion.

@yuzisun (Member) commented May 3, 2024:

I think we still want auto; it is fine as long as we clearly document the behavior.

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@@ -185,6 +204,7 @@ def load_model():
do_lower_case=not kwargs.get("disable_lower_case", False),
add_special_tokens=not kwargs.get("disable_special_tokens", False),
max_length=kwargs["max_length"],
dtype=dtype,
Member commented:

For encoder models we probably do not want to default to FP16, see

2024-05-03 18:13:23.626 1 kserve ERROR [inference_error_handler():104] Exception:
  Traceback (most recent call last):
    File "/huggingfaceserver/huggingfaceserver/encoder_model.py", line 242, in predict
      outputs = self._model(**input_batch).logits
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 1360, in forward
      outputs = self.bert(
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 1006, in forward
      embedding_output = self.embeddings(
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 239, in forward
      embeddings = self.LayerNorm(embeddings)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 196, in forward
      return F.layer_norm(
    File "/prod_venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2543, in layer_norm
      return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
  RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Member commented:

Actually, a lot of CPU-based operations in PyTorch are not implemented for FP16, so we probably need to check GPU availability.
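A quick way to see this limitation on CPU (behaviour depends on the PyTorch version; newer releases have added more FP16 CPU kernels):

import torch

x = torch.randn(2, 8, dtype=torch.float16)  # FP16 input on CPU
layer_norm = torch.nn.LayerNorm(8).half()   # FP16 LayerNorm on CPU

try:
    layer_norm(x)
except RuntimeError as err:
    # On affected PyTorch builds this prints:
    # "LayerNormKernelImpl" not implemented for 'Half'
    print(err)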

Contributor Author commented:

That's a fair point. But people would probably still want to run models at lower precision on CPU, given that there isn't much to lose. We should at least support that, right? And it would be on those users to verify compatibility/support.

Contributor Author commented:

So should we default to float32/float16 based on GPU availability? Something like:

default_dtype = 'auto' if torch.cuda.is_available() else 'float32'
dtype = kwargs.get("dtype", default_dtype)

Edit: OK, I just checked the vLLM CPU support documentation. On CPU, vLLM casts FP16 to BF16...
https://arc.net/l/quote/nujjbyig
(screenshot attached)
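A sketch of the direction the PR eventually took ("use FP16 for GPU and FP32 for CPU" per the later commit), assuming an argparse-style CLI; the flag name matches the PR, the rest is illustrative:

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dtype",
    choices=["auto", "float16", "bfloat16", "float32", "half", "float"],
    # FP16 mainly pays off on GPU; many FP16 kernels are missing on CPU.
    default="float16" if torch.cuda.is_available() else "float32",
)

args = parser.parse_args([])  # no CLI flags: fall back to the computed default
print(args.dtype)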

Datta0 added 2 commits May 4, 2024 05:10
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@spolti (Contributor) left a comment:

/lgtm

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
oss-prow-bot (bot) removed the lgtm label May 7, 2024
Datta0 added 3 commits May 8, 2024 16:48
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
@yuzisun (Member) commented May 9, 2024:

/lgtm
/approve

oss-prow-bot (bot) commented May 9, 2024:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Datta0, spolti, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

oss-prow-bot (bot) added the approved label May 9, 2024
@yuzisun merged commit a30d402 into kserve:master on May 9, 2024
57 of 58 checks passed
asd981256 pushed a commit to asd981256/kserve that referenced this pull request May 14, 2024
* Enable dtype for huggingface server

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Set float16 as default. Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Add small comment to make the changes understandable

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Adapt to new huggingfacemodel

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup merge :)

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Explicitly mention the behaviour of dtype flag on auto.

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Default to FP32 for encoder models

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Selectively add --dtype to parser. Use FP16 for GPU and FP32 for CPU

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Update poetry

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Use torch.float32 forr tests explicitly

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

---------

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: asd981256 <asd981256@gmail.com>
oss-prow-bot bot pushed a commit that referenced this pull request May 18, 2024
* upgrade vllm/transformers version (#3671)

upgrade vllm version

Signed-off-by: Johnu George <johnugeorge109@gmail.com>

* Add openai models endpoint (#3666)

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>

* feat: Support customizable deployment strategy for RawDeployment mode. Fixes #3452 (#3603)

* feat: Support customizable deployment strategy for RawDeployment mode

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* regen

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* lint

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Correctly apply rollingupdate

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* address comments

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Add validation

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Enable dtype support for huggingface server (#3613)

* Enable dtype for huggingface server

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Set float16 as default. Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Add small comment to make the changes understandable

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Adapt to new huggingfacemodel

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup merge :)

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Explicitly mention the behaviour of dtype flag on auto.

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Default to FP32 for encoder models

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Selectively add --dtype to parser. Use FP16 for GPU and FP32 for CPU

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Fixup linter

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Update poetry

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Use torch.float32 forr tests explicitly

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

---------

Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>

* Add method for checking model health/readiness (#3673)

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>

* fix for extract zip from gcs (#3510)

* fix for extract zip from gcs

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* initial commit for gcs model download unittests

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* unittests for model download from gcs

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* black format fix

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* code verification

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

---------

Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>

* Update Dockerfile and Readme (#3676)

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* Update huggingface readme (#3678)

* update wording for huggingface README

small update to make readme easier to understand

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* Update README.md

Signed-off-by: Alexa Griffith agriffith50@bloomberg.net

* Update python/huggingfaceserver/README.md

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>

* update vllm

Signed-off-by: alexagriffith <agriffith50@bloomberg.net>

* Update README.md

---------

Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>
Signed-off-by: Alexa Griffith agriffith50@bloomberg.net
Signed-off-by: alexagriffith <agriffith50@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>

* fix: HPA equality check should include annotations (#3650)

* fix: HPA equality check should include annotations

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Only watch related autoscalerclass annotation

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* simplify

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Add missing delete action

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* fix logic

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Fix:  huggingface runtime in helm chart (#3679)

fix huggingface runtime in chart

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Fix: model id and model dir check order (#3680)

* fix huggingface runtime in chart

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Allow model_dir to be specified on template

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Default model_dir to /mnt/models for HF

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Lint format

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

---------

Signed-off-by: Dan Sun <dsun20@bloomberg.net>

* Fix:vLLM Model Supported check throwing circular dependency (#3688)

* Fix:vLLM Model Supported check throwing circular dependency

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* remove unwanted comments

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* remove unwanted comments

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* fix return case

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* fix to check all arch in model config forr vllm support

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* fixlint

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

---------

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

* Fix: Allow null in Finish reason streaming response in vLLM (#3684)

Fix: allow null in Finish reason

Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>

---------

Signed-off-by: Johnu George <johnugeorge109@gmail.com>
Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: Andrews Arokiam <andrews.arokiam@ideas2it.com>
Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
Signed-off-by: Alexa Griffith  <agriffith50@bloomberg.net>
Signed-off-by: Alexa Griffith agriffith50@bloomberg.net
Signed-off-by: alexagriffith <agriffith50@bloomberg.net>
Signed-off-by: Dan Sun <dsun20@bloomberg.net>
Co-authored-by: Curtis Maddalozzo <cmaddalozzo@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Datta Nimmaturi <39181234+Datta0@users.noreply.github.com>
Co-authored-by: Andrews Arokiam <87992092+andyi2it@users.noreply.github.com>
Co-authored-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
Co-authored-by: Alexa Griffith <agriffith50@bloomberg.net>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>