VLLM support for OpenAI Completions in HF server #3589

Merged
merged 24 commits into from
Apr 19, 2024

Conversation

Contributor

@gavrishp commented Apr 9, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #3419

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?
Support OpenAI completion endpoint in vLLM.

Contributor

Should we just add the license to this file so the check passes?

return self.create_error_response("suffix is not currently supported")

model_name = request.model
request_id = f"cmpl-{random_uuid()}"
Member

The request id may need to be fetched from the x-request-id header if the user sets it on the client side; we need to think about how we can pass the headers through, like the KServe model does.
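A minimal sketch of that fallback, assuming the raw request headers can be threaded down to the completion handler; the helper name is illustrative, not code from this PR:

import uuid
from typing import Mapping

def resolve_request_id(headers: Mapping[str, str]) -> str:
    # Prefer a client-supplied x-request-id; otherwise mint a cmpl- id,
    # similar to the random_uuid() call in the snippet above.
    client_id = headers.get("x-request-id")
    return client_id if client_id else f"cmpl-{uuid.uuid4().hex}"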

Contributor

Since CompletionCreateParams is auto-generated and thus can't easily have properties added to it, maybe we should introduce a new class to wrap it, i.e.:

from typing import Dict
from pydantic import BaseModel

class CompletionRequest(BaseModel):
    request_id: str
    params: CompletionCreateParams  # the auto-generated OpenAI params type
    context: Dict  # headers can go in here

Contributor Author

Yes, this looks good.

Contributor Author

Updated with the changes

prompt_token_ids=input_ids,
)
)
except ValueError as e:
Member
@yuzisun Apr 15, 2024

I think we can consider registering an OpenAI error response handler, like the v1/v2 endpoints do:

InvalidInput: invalid_input_handler,
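A rough sketch of what registering such a handler could look like on the FastAPI app, mirroring the v1/v2 handler mapping above; the exception and handler names here are assumptions, not the exact identifiers in this PR:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

class OpenAIError(Exception):
    """Illustrative error type for OpenAI endpoint failures."""

async def openai_error_handler(_, exc: Exception) -> JSONResponse:
    # Surface the failure as a JSON body instead of an unhandled 500.
    return JSONResponse(
        status_code=500,
        content={"error": {"message": str(exc), "type": "OpenAIError"}},
    )

def register_openai_error_handlers(app: FastAPI) -> None:
    app.add_exception_handler(OpenAIError, openai_error_handler)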

def apply_chat_template(self, messages: Iterable[ChatCompletionMessageParam]):
return self.tokenizer.apply_chat_template(conversation=messages, tokenize=False)

def create_error_response(
Member
@yuzisun Apr 15, 2024

Could we move the OpenAI error responses to openai_errors.py?

async def invalid_input_handler(_, exc):

response = self.request_output_to_completion_response(
final_res_batch, request, request_id, created_time, model_name
)
except ValueError as e:
Member

Same here, we can reuse KServe's error handler.


return response

async def completion_stream_generator(
Member

Can we add a test for this function?

Contributor Author

In the current state I would need to install vLLM, since the class needs it; otherwise I would need to remove it as a class method. What do you suggest?
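One possible option, sketched below under the assumption that the test only needs the pure-Python pieces of the class: stub the vllm modules in sys.modules before importing the serving code, so the generator can be unit tested without a real vLLM install. The submodule names stubbed here are guesses at what the file imports.

import sys
from unittest.mock import MagicMock

# Stub out vllm before the module under test is imported, so `import vllm`
# and `from vllm.* import ...` succeed even though the package is absent.
for mod in ("vllm", "vllm.utils", "vllm.outputs", "vllm.sampling_params"):
    sys.modules.setdefault(mod, MagicMock())

# After stubbing, vllm_completions can be imported and completion_stream_generator
# exercised with hand-built fake request outputs.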

Comment on lines 382 to 385
return self.create_error_response(
message=f"The model `{request.model}` does not exist.",
err_type="NotFoundError",
status_code=HTTPStatus.NOT_FOUND,
Member

We can raise the ModelNotFound exception
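Something along these lines, mirroring the check in the snippet above and assuming KServe's ModelNotFound error is importable here (import path not verified against this PR):

from kserve.errors import ModelNotFound

if request.model != self.model_name:
    # Let KServe's registered exception handler turn this into a 404 response.
    raise ModelNotFound(request.model)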

@@ -137,6 +145,9 @@ def load(self) -> bool:
self.vllm_engine = AsyncLLMEngine.from_engine_args(
self.vllm_engine_args
)
self.openai_serving_completion = OpenAIServingCompletion(
self.vllm_engine, self.model_name
Member

Suggested change
self.vllm_engine, self.model_name
self.vllm_engine, self.name

@pbasov commented Apr 15, 2024

Quick /openai/v1/chat/completions test with https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ (regular /v2/generate works fine):

from openai import OpenAI

client = OpenAI(api_key="nokey")
client.base_url = "https://llama2-predictor.domain.net/openai/v1"
chat_completion = client.chat.completions.create(
    messages=[
        {   
            "role": "user",
            "content": "Hello world!"
        },
    ],
    model="llama2",
)
2024-04-15 22:15:29.670 1 kserve ERROR [generic_exception_handler():111] Exception:
Traceback (most recent call last):
  File "/prod_venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/prod_venv/lib/python3.10/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/prod_venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/prod_venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/prod_venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/prod_venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/kserve/kserve/protocol/rest/openai/endpoints.py", line 116, in create_chat_completion
    completion = await self.dataplane.create_chat_completion(
  File "/kserve/kserve/protocol/rest/openai/dataplane.py", line 77, in create_chat_completion
    return await model.create_chat_completion(request)
  File "/kserve/kserve/protocol/rest/openai/openai_model.py", line 249, in create_chat_completion
    chat_prompt = self.apply_chat_template(params.messages)
  File "/huggingfaceserver/huggingfaceserver/model.py", line 339, in apply_chat_template
    prompt=self.openai_serving_completion.apply_chat_template(messages)
  File "/huggingfaceserver/huggingfaceserver/vllm_completions.py", line 339, in apply_chat_template
    return self.tokenizer.apply_chat_template(conversation=messages, tokenize=False)
  File "/prod_venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1783, in apply_chat_template
    rendered = compiled_template.render(
  File "/prod_venv/lib/python3.10/site-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/prod_venv/lib/python3.10/site-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
  File "/prod_venv/lib/python3.10/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/prod_venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1828, in raise_exception
    raise TemplateError(message)
jinja2.exceptions.TemplateError: Conversation roles must alternate user/assistant/user/assistant/...
2024-04-15 22:15:29.670 uvicorn.access INFO:     10.0.0.85:0 1 - "POST /openai/v1/chat/completions HTTP/1.1" 500 Internal Server Error

@gavrishp
Contributor Author

(quoting @pbasov's chat-template traceback above)

Noticed this. It was caused by the Union types having root as the parent parameter. I updated this in the generated OpenAI types file, so it should be fixed now. Let me know if there's anything else.

@cmaddalozzo
Contributor

/lgtm

@pbasov commented Apr 16, 2024

@gavrishp just built with fresh code, it works, thank you!


async def openai_error_handler(_, exc):
logger.error("Exception:", exc_info=exc)
return JSONResponse(
Member
@yuzisun Apr 17, 2024

I think we need to return the OpenAI-defined ErrorResponse type, otherwise it will not be compatible with the OpenAI client?

Contributor Author
@gavrishp Apr 17, 2024

Made changes to send the contents as an OpenAI error object.

For error scenarios the body looks like this:

import datetime

import openai

stream = False  # toggle to exercise the streaming branch

try:
    completion = openai.completions.create(model="llama2", prompt="", stream=stream)

    if stream:
        for chunk in completion:
            print(chunk.choices[0].text, datetime.datetime.now())
    else:
        print(completion.choices[0].text)

except openai.APIStatusError as e:
    print("Another non-200-range status code was received")
    print(e.body)

{'code': '500', 'message': 'Either prompt or prompt_ids should be provided.', 'param': '', 'type': 'OpenAIError'}

OpenAI ErrorObject - https://github.com/openai/openai-python/blob/15e7ac157dcc80d952107f596bb9c38cfa06685d/src/openai/types/shared/error_object.py#L10
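A hedged sketch of producing a body in that ErrorObject shape (code/message/param/type), so that e.body in the client carries the fields shown above; this is illustrative, not the exact helper in the PR:

from fastapi.responses import JSONResponse

def to_openai_error_response(message: str, status_code: int = 500) -> JSONResponse:
    # Mirror OpenAI's ErrorObject fields so clients can read e.body consistently.
    return JSONResponse(
        status_code=status_code,
        content={
            "code": str(status_code),
            "message": message,
            "param": "",
            "type": "OpenAIError",
        },
    )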

@@ -42,9 +32,6 @@ def get_open_ai_models(repository: ModelRepository) -> List[Model]:


def maybe_register_openai_endpoints(app: FastAPI, model_registry: ModelRepository):
# Check if the openai package is available before continuing so we don't run into any import errors
if not openai_is_available():
Member

We might still need this check, otherwise the OpenAI endpoints get registered for other model serving runtimes?

Member

But we can no longer use the approach of checking whether the openai dependency exists.

@oss-prow-bot removed the lgtm label Apr 17, 2024
@@ -42,9 +32,6 @@ def get_open_ai_models(repository: ModelRepository) -> List[Model]:


def maybe_register_openai_endpoints(app: FastAPI, model_registry: ModelRepository):
# Check if the openai package is available before continuing so we don't run into any import errors
if not openai_is_available():
return
open_ai_models = get_open_ai_models(model_registry)
# If no OpenAI models then no need to add the endpoints
if len(open_ai_models) == 0:
Contributor

@yuzisun we only register the OpenAI endpoints if one or more of the registered models extends OpenAIModel.
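A simplified sketch of that guard, with helper bodies and import paths assumed rather than copied from the PR:

from typing import List

from kserve.model_repository import ModelRepository
from kserve.protocol.rest.openai.openai_model import OpenAIModel  # path inferred from the traceback above

def get_open_ai_models(repository: ModelRepository) -> List[OpenAIModel]:
    # Keep only the models that implement the OpenAI-style interface.
    return [m for m in repository.get_models().values() if isinstance(m, OpenAIModel)]

def maybe_register_openai_endpoints(app, model_registry: ModelRepository) -> None:
    open_ai_models = get_open_ai_models(model_registry)
    if not open_ai_models:
        return  # no OpenAI-capable models registered, so the /openai routes are never added
    # ... register the /openai/v1/* routes on `app` here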

@yuzisun
Member

yuzisun commented Apr 19, 2024

Thanks @gavrishp !
/lgtm
/approve


oss-prow-bot bot commented Apr 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cmaddalozzo, gavrishp, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@oss-prow-bot merged commit 9cc4eed into kserve:master Apr 19, 2024
56 checks passed
Development

Successfully merging this pull request may close these issues.

Support OpenAI Schema for KServe LLM runtime
5 participants