[Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule #276
Labels: status: tracking (Tracking work in progress)
Status: Closed

Comments
@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling.
This was referenced Feb 8, 2024.

CharlieFRuan added a commit that referenced this issue on Feb 21, 2024:
This PR adds an OpenAI-like API to `ChatModule`, specifically the `chatCompletion` API. See `examples/openai-api` for example usage, and the [OpenAI reference](https://platform.openai.com/docs/api-reference/chat) for its original usage. Changes include:

- Implement `chatCompletion()` in `ChatModule`
- Expose conversation manipulation methods in `llm_chat.ts` so that `request.messages` can override existing chat history or system prompt in `chat_module.ts`
- Implement `src/openai_api_protocols`, which represents all OpenAI-related data structures; largely based on [openai-node](https://github.com/openai/openai-node/blob/master/src/resources/chat/completions.ts)
- Add `examples/openai-api`, which demonstrates `chatCompletion()` for both streaming and non-streaming usage, without a web worker
- Support both streaming and non-streaming `chatCompletion()` in `web_worker.ts`, with example usage added to `examples/web-worker`
- For streaming with a web worker, users have access to an async generator whose `next()` sends/receives messages with the worker, which has an underlying async generator that does the actual decoding

Existing gaps from full OpenAI compatibility are listed in #276; some may be unavoidable (e.g. selecting `model` in the request) while others are WIP.

Benchmarked performance via `{WebWorker, SingleThread} X {OAI-Stream, OAI-NonStream, Generate}`: virtually no degradation, with ±1 tok/s variation. Specifically, on M3 Max with Llama 2 7B q4f32_1, decoding 128 tokens with a 12-token prompt yields:

- Prefill: 182 tok/s
- Decode: 48.3 tok/s
- End-to-end: 38.5 tok/s (measured from the time we create the request until everything finishes; the time recorded at the highest level)
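To make the streaming-with-web-worker flow described above concrete, here is a minimal sketch. The class name, import path, and model id below are assumptions, not confirmed by this issue; `examples/web-worker` in the repository is the authoritative reference.

```typescript
// Sketch only: names and the model id are illustrative assumptions.
import { ChatWorkerClient } from "@mlc-ai/web-llm";

async function main() {
  // The client forwards chatCompletion() calls to a worker that owns the actual ChatModule.
  const chat = new ChatWorkerClient(
    new Worker(new URL("./worker.ts", import.meta.url), { type: "module" })
  );
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1"); // illustrative model id

  // With stream: true, the returned async generator yields OpenAI-style chunks;
  // each next() call round-trips a message to the worker's own async generator.
  for await (const chunk of await chat.chatCompletion({
    stream: true,
    messages: [{ role: "user", content: "Explain streaming in one sentence." }],
  })) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) console.log(delta);
  }
}

main();
```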
Overview
The goal of this task is to implement APIs that are OpenAI API compatible. Existing APIs like `generate()` will still be kept. Essentially we want JSON-in and JSON-out, for both non-streaming and streaming usage, as sketched below.
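A rough sketch of the intended JSON-in/JSON-out usage follows. Field names mirror OpenAI's chat-completions schema; the model id and exact supported fields are assumptions here, and `examples/openai-api` holds the actual examples.

```typescript
// Sketch only: model id and exact supported fields are assumptions.
import { ChatModule } from "@mlc-ai/web-llm";

async function demo() {
  const chat = new ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1"); // model chosen here, not per request

  // JSON-in, JSON-out (non-streaming): one request object in, one ChatCompletion out.
  const reply = await chat.chatCompletion({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Hello!" },
    ],
    temperature: 0.7,
  });
  console.log(reply.choices[0].message.content);

  // Streaming: the same request with stream: true yields incremental delta chunks.
  for await (const chunk of await chat.chatCompletion({
    stream: true,
    messages: [{ role: "user", content: "Hello!" }],
  })) {
    console.log(chunk.choices[0]?.delta?.content ?? "");
  }
}

demo();
```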
Action items

- Implement `chat_completion()` (both streaming and non-streaming); support configs/features that we currently do not have inside `llm_chat.ts`
- Function calling (`tools`); a sketch of the OpenAI-style request shape follows this list
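Since function calling is still an action item, the following only illustrates the OpenAI-style `tools` request shape being targeted; whether WebLLM ends up accepting exactly this shape is an assumption.

```typescript
// Target shape for the function-calling action item, following OpenAI's `tools` schema.
// Whether chat_completion() will accept exactly this shape is an assumption.
const request = {
  messages: [{ role: "user" as const, content: "What's the weather in Paris?" }],
  tools: [
    {
      type: "function" as const,
      function: {
        name: "get_weather", // hypothetical tool
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  tool_choice: "auto" as const,
};
```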
Existing gaps

There are some fields/features that are not yet supported in WebLLM compared to OpenAI's `openai-node`.

Fields in `ChatCompletionRequest`:
- `model`: in WebLLM, we need to call `reload(model)` instead of making it an argument in `ChatCompletionRequest` (see the sketch after this list)
- `response_format` (json-formatting)
- `tool_choice`
- `tools`
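A brief sketch of the `model` gap noted above, contrasting OpenAI's per-request `model` field with WebLLM's `reload()`-based model selection; the model ids are illustrative assumptions.

```typescript
// OpenAI (openai-node): the model is part of every request, e.g.
//   await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages });

// WebLLM: select the model once via reload(), then omit `model` from requests.
import { ChatModule } from "@mlc-ai/web-llm";

const chat = new ChatModule();
await chat.reload("Llama-2-7b-chat-hf-q4f32_1"); // illustrative model id
const completion = await chat.chatCompletion({
  messages: [{ role: "user", content: "Hi" }],
});
```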
Fields in the `ChatCompletion` response:

- `system_fingerprint`: not applicable in our case (OpenAI needs it because they perform the request remotely on servers)

Others
- `n > 1` when streaming, since `llm_chat.ts` does not support maintaining multiple sequences. We have to finish one sequence and then start generating another, conflicting with the goal of streaming in chunks.

Future Items