VLLM support for OpenAI Completions in HF server #3589
Conversation
538fe1f to 72920dd (Compare)
Should we just add the license to this file so the check passes?
return self.create_error_response("suffix is not currently supported") | ||
|
||
model_name = request.model | ||
request_id = f"cmpl-{random_uuid()}" |
The request id may need to be fetched from the x-request-id header if the user sets it on the client side; we need to think about how we can pass the headers through, like the KServe model does.
Will need to pass through here, right?
https://github.com/kserve/kserve/blob/master/python/kserve/kserve/protocol/rest/openai/openai_model.py#L62
Since CompletionCreateParams is auto-generated and can't easily have properties added to it, maybe we should introduce a new class to wrap it, i.e.:

class CompletionRequest(BaseModel):
    request_id: str
    params: CompletionCreateParams
    context: Dict  # headers can go in here
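A rough sketch of how such a wrapper could be populated, tying in the x-request-id idea from above (the stand-in CompletionCreateParams and the build_completion_request helper below are illustrative, not part of this PR):

from typing import Any, Dict
from uuid import uuid4

from pydantic import BaseModel


class CompletionCreateParams(BaseModel):  # stand-in for the auto-generated type
    model: str
    prompt: Any = None


class CompletionRequest(BaseModel):
    request_id: str
    params: CompletionCreateParams
    context: Dict  # headers can go in here


def build_completion_request(
    params: CompletionCreateParams, headers: Dict[str, str]
) -> CompletionRequest:
    # Honor a client-supplied x-request-id; otherwise generate one.
    request_id = headers.get("x-request-id") or f"cmpl-{uuid4()}"
    return CompletionRequest(request_id=request_id, params=params, context=dict(headers))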
Yes this looks good
Updated with the changes
prompt_token_ids=input_ids,
)
)
except ValueError as e:
I think we can consider registering an OpenAI error response handler like the v1/v2 endpoints do:

InvalidInput: invalid_input_handler,
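Purely as a sketch of the wiring (the OpenAIError class and handler below are illustrative, not KServe's actual implementation; the flat error body mirrors the one shown later in this thread):

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()


class OpenAIError(Exception):
    """Stand-in for an OpenAI-style serving error."""


async def openai_error_handler(_, exc: OpenAIError) -> JSONResponse:
    # Shape the body like the OpenAI error object so the openai client can surface it in e.body.
    return JSONResponse(
        status_code=500,
        content={"code": "500", "message": str(exc), "param": "", "type": "OpenAIError"},
    )


# Register the handler with FastAPI; KServe wires up its v1/v2 handlers in a similar way.
app.add_exception_handler(OpenAIError, openai_error_handler)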
def apply_chat_template(self, messages: Iterable[ChatCompletionMessageParam]):
    return self.tokenizer.apply_chat_template(conversation=messages, tokenize=False)

def create_error_response(
Can we move the OpenAI error responses to openai_errors.py?

kserve/python/kserve/kserve/errors.py (line 98 in 9fd262f):

async def invalid_input_handler(_, exc):
response = self.request_output_to_completion_response(
    final_res_batch, request, request_id, created_time, model_name
)
except ValueError as e:
Same here; we can reuse KServe's error handler.
return response

async def completion_stream_generator(
Can we add a test for this function?
In its current state, I would need to install vllm for the test since the class depends on it; otherwise I'd have to move it out of the class as a standalone method. What do you suggest?
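One option, purely as a sketch: stub out vllm in sys.modules before importing the server code, so the class can be imported in a test environment without vllm installed (the names below are assumptions, not this PR's actual test setup):

import sys
import types
from unittest.mock import MagicMock

# Register a fake vllm module so `import vllm` succeeds without the real package.
fake_vllm = types.ModuleType("vllm")
fake_vllm.AsyncLLMEngine = MagicMock()
sys.modules.setdefault("vllm", fake_vllm)

import vllm  # resolves to the stub when the real package is absent

assert hasattr(vllm.AsyncLLMEngine, "from_engine_args")

Alternatively, pulling the generator out of the class as a standalone function would make it testable without any stubbing.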
return self.create_error_response(
    message=f"The model `{request.model}` does not exist.",
    err_type="NotFoundError",
    status_code=HTTPStatus.NOT_FOUND,
We can raise the ModelNotFound exception instead.
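A rough sketch of that (assuming ModelNotFound is importable from kserve.errors and that the registered handler maps it to a 404; validate_model is just an illustrative helper):

from kserve.errors import ModelNotFound


def validate_model(requested_model: str, served_model: str) -> None:
    # Raising lets the exception handler translate this into a 404 error response.
    if requested_model != served_model:
        raise ModelNotFound(requested_model)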
@@ -137,6 +145,9 @@ def load(self) -> bool:
    self.vllm_engine = AsyncLLMEngine.from_engine_args(
        self.vllm_engine_args
    )
    self.openai_serving_completion = OpenAIServingCompletion(
        self.vllm_engine, self.model_name
Suggested change:
- self.vllm_engine, self.model_name
+ self.vllm_engine, self.name
Quick test:

from openai import OpenAI

client = OpenAI(api_key="nokey")
client.base_url = "https://llama2-predictor.domain.net/openai/v1"
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Hello world!"
        },
    ],
    model="llama2",
)

2024-04-15 22:15:29.670 1 kserve ERROR [generic_exception_handler():111] Exception:
Traceback (most recent call last):
File "/prod_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/prod_venv/lib/python3.10/site-packages/timing_asgi/middleware.py", line 70, in __call__
await self.app(scope, receive, send_wrapper)
File "/prod_venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
raise e
File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/kserve/kserve/protocol/rest/openai/endpoints.py", line 116, in create_chat_completion
completion = await self.dataplane.create_chat_completion(
File "/kserve/kserve/protocol/rest/openai/dataplane.py", line 77, in create_chat_completion
return await model.create_chat_completion(request)
File "/kserve/kserve/protocol/rest/openai/openai_model.py", line 249, in create_chat_completion
chat_prompt = self.apply_chat_template(params.messages)
File "/huggingfaceserver/huggingfaceserver/model.py", line 339, in apply_chat_template
prompt=self.openai_serving_completion.apply_chat_template(messages)
File "/huggingfaceserver/huggingfaceserver/vllm_completions.py", line 339, in apply_chat_template
return self.tokenizer.apply_chat_template(conversation=messages, tokenize=False)
File "/prod_venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1783, in apply_chat_template
rendered = compiled_template.render(
File "/prod_venv/lib/python3.10/site-packages/jinja2/environment.py", line 1301, in render
self.environment.handle_exception()
File "/prod_venv/lib/python3.10/site-packages/jinja2/environment.py", line 936, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 1, in top-level template code
File "/prod_venv/lib/python3.10/site-packages/jinja2/sandbox.py", line 393, in call
return __context.call(__obj, *args, **kwargs)
File "/prod_venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1828, in raise_exception
raise TemplateError(message)
jinja2.exceptions.TemplateError: Conversation roles must alternate user/assistant/user/assistant/...
2024-04-15 22:15:29.670 uvicorn.access INFO: 10.0.0.85:0 1 - "POST /openai/v1/chat/completions HTTP/1.1" 500 Internal Server Error
Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
f82e18c to ec65ff5 (Compare)
Noticed this. This is due to the Union types having |
/lgtm
@gavrishp just built with fresh code, it works, thank you!
async def openai_error_handler(_, exc):
    logger.error("Exception:", exc_info=exc)
    return JSONResponse(
I think we need to return the OpenAI-defined ErrorResponse type; otherwise it will not be compatible with the OpenAI client?
Made changes to send content of the OpenAI error object type. For error scenarios the body looks like this:

try:
    completion = openai.completions.create(model="llama2", prompt="")
    if stream:
        for chunk in completion:
            print(chunk.choices[0].text, datetime.datetime.now())
    else:
        print(completion.choices[0].text)
except openai.APIStatusError as e:
    print("Another non-200-range status code was received")
    print(e.body)

{'code': '500', 'message': 'Either prompt or prompt_ids should be provided.', 'param': '', 'type': 'OpenAIError'}
OpenAI ErrorObject - https://github.com/openai/openai-python/blob/15e7ac157dcc80d952107f596bb9c38cfa06685d/src/openai/types/shared/error_object.py#L10
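For what it's worth, that body also validates against the OpenAI client's ErrorObject model (a quick check, assuming the openai package from the link above is installed):

from openai.types.shared import ErrorObject

body = {
    "code": "500",
    "message": "Either prompt or prompt_ids should be provided.",
    "param": "",
    "type": "OpenAIError",
}
err = ErrorObject.model_validate(body)
print(err.type, err.message)  # -> OpenAIError Either prompt or prompt_ids should be provided.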
@@ -42,9 +32,6 @@ def get_open_ai_models(repository: ModelRepository) -> List[Model]:


def maybe_register_openai_endpoints(app: FastAPI, model_registry: ModelRepository):
    # Check if the openai package is available before continuing so we don't run into any import errors
    if not openai_is_available():
We might still need this check; otherwise the OpenAI endpoints get registered for other model serving runtimes?
But we can no longer use the approach of checking whether the openai dependency exists.
Signed-off-by: Gavrish Prabhu <gavrish.prabhu@nutanix.com>
@@ -42,9 +32,6 @@ def get_open_ai_models(repository: ModelRepository) -> List[Model]:


def maybe_register_openai_endpoints(app: FastAPI, model_registry: ModelRepository):
    # Check if the openai package is available before continuing so we don't run into any import errors
    if not openai_is_available():
        return
    open_ai_models = get_open_ai_models(model_registry)
    # If no OpenAI models then no need to add the endpoints
    if len(open_ai_models) == 0:
@yuzisun we only register the OpenAI endpoints if one or more of the registered models extends OpenAIModel.
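Roughly, the guard amounts to the following (import paths and the get_models() call are assumptions based on the snippet above, not verified against this PR):

from typing import List

from kserve import Model
from kserve.model_repository import ModelRepository
from kserve.protocol.rest.openai.openai_model import OpenAIModel


def get_open_ai_models(repository: ModelRepository) -> List[Model]:
    # Only models extending OpenAIModel are exposed; when this list is empty,
    # maybe_register_openai_endpoints returns early and no routes are mounted.
    return [
        model
        for model in repository.get_models().values()
        if isinstance(model, OpenAIModel)
    ]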
Thanks @gavrishp!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cmaddalozzo, gavrishp, yuzisun

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #3419

Type of changes

Please delete options that are not relevant.

Checklist: