
Add OpenAI API support to Huggingfaceserver #3582

Merged

19 commits merged into kserve:master on Apr 25, 2024

Conversation

cmaddalozzo
Contributor

@cmaddalozzo cmaddalozzo commented Apr 8, 2024

What this PR does / why we need it:
This PR adds support for the OpenAI completion and chat completion endpoints to the HuggingfaceServer.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3419, #3580

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

Support OpenAI completion and chat completion endpoints in huggingfaceserver.
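The completion and chat completion endpoints this PR adds follow the OpenAI request schema. A minimal sketch of the two request bodies a client would POST to the server (the model name and any endpoint URL here are illustrative assumptions, not defined by this PR):

```python
# Minimal OpenAI-style request bodies for the two endpoints this PR adds.
# The model name "gpt2" and the URL in the comment below are assumptions
# for illustration only.
import json


def chat_completion_payload(model: str, user_message: str) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 32,
    }


def completion_payload(model: str, prompt: str) -> dict:
    """Build a minimal OpenAI-style text completion request body."""
    return {"model": model, "prompt": prompt, "max_tokens": 32}


if __name__ == "__main__":
    body = chat_completion_payload("gpt2", "Hello!")
    # POST this JSON to the server's chat completions route,
    # e.g. something like http://localhost:8080/openai/v1/chat/completions
    print(json.dumps(body))
```

Both shapes mirror the public OpenAI API, which is what lets existing OpenAI client libraries talk to the huggingfaceserver unchanged.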

@cmaddalozzo cmaddalozzo force-pushed the huggingfaceserver-openai branch 6 times, most recently from 68ecc53 to b8a7ed8 Compare April 8, 2024 19:03
Member

@terrytangyuan terrytangyuan left a comment


The commits seem a bit messy since it's based on #3477

Would you like to clean it up?

kwargs=vars(args),
)
engine_args = build_vllm_engine_args(args)
model = VLLMModel(args.model_name, engine_args)
Contributor


I'm just curious: if the vLLM load fails here, should we fall back to HF?
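The fallback being asked about could be sketched roughly as below. The loader callables are hypothetical stand-ins, not KServe APIs; the actual code builds `VLLMModel(args.model_name, engine_args)` directly.

```python
# Sketch of a vLLM -> HF fallback, as the reviewer suggests. `load_vllm`
# and `load_hf` are hypothetical stand-ins for the respective model
# constructors, not actual KServe functions.

def load_model(model_name: str, load_vllm, load_hf):
    """Try the vLLM backend first; fall back to the HF runtime on load error."""
    try:
        return load_vllm(model_name)
    except Exception as exc:  # e.g. unsupported architecture, missing CUDA
        print(f"vLLM load failed ({exc}); falling back to the HF runtime")
        return load_hf(model_name)
```

The trade-off is that a silent fallback can mask real configuration errors, so logging the vLLM failure loudly matters if this is adopted.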

@cmaddalozzo cmaddalozzo force-pushed the huggingfaceserver-openai branch 2 times, most recently from f5921a2 to 503aefe Compare April 22, 2024 15:29
@cmaddalozzo cmaddalozzo changed the title Huggingfaceserver openai Add OpenAI API support to Huggingfaceserver Apr 22, 2024
@cmaddalozzo

This comment was marked as outdated.

@spolti
Contributor

spolti commented Apr 23, 2024

You might need to rebase: @yuzisun merged a PR yesterday, I believe, to pin the ray version to 2.10 to avoid this issue for now.

self._tokenizer = AutoTokenizer.from_pretrained(
str(model_id_or_path),
revision=self.tokenizer_revision,
do_lower_case=self.do_lower_case,
Contributor


Can you verify the tokenizer args once more? The tokenizer also has a device_map setting.

Contributor Author


I am having a hard time finding any reference to device_map in the HF transformers code. There's also no mention of tokenizers supporting device_map in the docs. This comment suggests it's not needed/supported: huggingface/transformers#16359 (comment)
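The usual pattern that makes `device_map` unnecessary for tokenizers: tokenization happens on CPU, and the caller moves the encoded tensors to the model's device afterwards. A hedged sketch (the device string is an assumption for illustration):

```python
# Sketch of why a tokenizer needs no device_map: it produces plain CPU
# tensors, and the caller moves them to the model's device afterwards.
# The tokenizer and device name here are illustrative.

def encode_and_move(tokenizer, text: str, device: str) -> dict:
    """Tokenize on CPU, then move the resulting tensors to `device`."""
    batch = tokenizer(text, return_tensors="pt")
    return {k: v.to(device) for k, v in batch.items()}
```

With a real HF tokenizer this would be `encode_and_move(AutoTokenizer.from_pretrained(model_id), "hello", "cuda:0")`; `device_map` only applies to model weights loaded via `from_pretrained` on model classes.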

Contributor


Let me try this again tomorrow. Btw, model used was gemma-2b

Member


I also cannot find anything for gemma-2b.

tessapham and others added 5 commits April 24, 2024 16:01
Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

more components for OpenAI endpoints

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

add OpenAI endpoints to router

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

modify generate() in data plane

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

class OpenAIModel

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

delete and rename files

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

add create_chat_completion() to OpenAIModel

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>

update routers and lint

Signed-off-by: Tessa Pham <hpham111@bloomberg.net>
Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Fix tests.

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Pass loop as argument to the background request handler.

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
@@ -351,7 +350,7 @@ async def generate(

     Args:
         model_name (str): Model name.
-        request (bytes|GenerateRequest): Generate Request body data.
+        request (bytes|GenerateRequest): Generate Request / ChatCompletion Request body data.
Member


Is this generate function still used? I think it uses the OpenAI data plane now, right?
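The split the reviewer is describing — OpenAI-shaped requests going to the new data plane while raw generate requests keep the legacy path — could be dispatched along these lines. The handler names and the `"messages"`-key heuristic are illustrative assumptions, not KServe's actual routing:

```python
# Hypothetical dispatch sketch: chat-style bodies go to the OpenAI data
# plane handler, plain generate bodies keep the legacy generate path.
# Handler names and the routing heuristic are illustrative only.

def dispatch(request: dict, openai_handler, generate_handler):
    """Route OpenAI chat-shaped request bodies separately from generate bodies."""
    if "messages" in request:  # OpenAI chat completion request shape
        return openai_handler(request)
    return generate_handler(request)
```

In practice the routing is done per URL path rather than by inspecting the body, which is why the legacy generate docstring above still mentions both shapes.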

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Don't try to load table question answering models as they are not
supported.

Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
Signed-off-by: Curtis Maddalozzo <cmaddalozzo@bloomberg.net>
@yuzisun
Member

yuzisun commented Apr 25, 2024

/lgtm
/approve


oss-prow-bot bot commented Apr 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cmaddalozzo, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuzisun yuzisun merged commit a9d747e into kserve:master Apr 25, 2024
56 of 57 checks passed
cmaddalozzo added a commit to cmaddalozzo/kserve that referenced this pull request Apr 26, 2024
* master:
  Add OpenAI API support to Huggingfaceserver (kserve#3582)
  Allow rerunning failed workflows by comment (kserve#3550)
  Fix CVE-2023-45288 for qpext (kserve#3618)
  chore: v0.12.1 install files (kserve#3619)
  build: Fix CRD copying in generate-install.sh (kserve#3620)
  Fix Pydantic 2 warnings (kserve#3622)
  Fix make deploy-dev-storage-initializer not working (kserve#3617)
Successfully merging this pull request may close these issues.

Support OpenAI Schema for KServe LLM runtime
8 participants