
[Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule #276

Closed · 1 of 3 tasks
CharlieFRuan opened this issue Jan 28, 2024 · 1 comment
Labels: status: tracking (Tracking work in progress)

Comments


CharlieFRuan commented Jan 28, 2024

Overview

The goal of this task is to implement APIs that are compatible with the OpenAI API. Existing APIs like generate() will still be kept. Essentially we want JSON-in and JSON-out, resulting in usage like:

import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");

  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    // optional generative configs here
  });

  console.log(completion.choices[0]);
}

main();

If streaming:

  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    stream: true,
    // optional generative configs here
  });

  for await (const chunk of completion) {
    console.log(chunk.choices[0].delta.content);
  }

Action items

Existing gaps

There are some fields/features that are not yet supported in WebLLM compared to OpenAI's openai-node.

Fields in ChatCompletionRequest

  • model: in WebLLM, we need to call reload(model) instead of passing it as an argument in ChatCompletionRequest
  • response_format (JSON formatting)
  • function calling related (see the sketch after this list):
    • tool_choice
    • tools
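
For reference, a hypothetical request in openai-node's shape is sketched below. The tools/tool_choice fields follow openai-node and are exactly what this item tracks; WebLLM does not accept them yet, and the function name here is made up for illustration:

  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "What is the weather in Paris?" }],
    // Not yet supported by WebLLM; shape follows openai-node's ChatCompletionTool
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",  // hypothetical function
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
    tool_choice: "auto",  // let the model decide whether to call the tool
  });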

Fields in the ChatCompletion response

  • system_fingerprint: not applicable in our case (OpenAI needs it because they perform requests remotely on their servers)

Others

  • We do not support n > 1 when streaming, since llm_chat.ts does not support maintaining multiple sequences. We would have to finish one sequence and then start generating another, which conflicts with the goal of streaming in chunks (see the sketch below).
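
To illustrate, a request with n > 1 (assuming n follows OpenAI's semantics) is fine without streaming, because the sequences can simply be generated one after another before the response is returned:

  // Sketch assuming OpenAI-style n; not streamed, so the two sequences can
  // be generated back to back before the full response is assembled.
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "Tell me a joke." }],
    n: 2,
  });
  console.log(completion.choices[0].message.content);
  console.log(completion.choices[1].message.content);

With stream: true, however, chunks for choice 0 and choice 1 would need to be produced concurrently, which the single-sequence llm_chat.ts cannot do.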

Future Items

@CharlieFRuan CharlieFRuan added the status: tracking Tracking work in progress label Jan 28, 2024
Kartik14 commented Feb 2, 2024

@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling.

@CharlieFRuan CharlieFRuan changed the title [Tracking] OpenAI-API Compatible APIs in ChatModule [Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule Feb 15, 2024
CharlieFRuan added a commit that referenced this issue Feb 21, 2024
This PR adds an OpenAI-like API to `ChatModule`, specifically the
`chatCompletion` API. See `examples/openai-api` for example usage, and the
[OpenAI reference](https://platform.openai.com/docs/api-reference/chat)
for its original usage.

Changes include:
- Implement `chatCompletion()` in `ChatModule`
- Expose conversation manipulation methods in `llm_chat.ts` so that
`request.messages` can override existing chat history or system prompt
in `chat_module.ts`
- Implement `src/openai_api_protocols`, which represents all
OpenAI-related data structures; largely based on
[openai-node](https://github.com/openai/openai-node/blob/master/src/resources/chat/completions.ts)
- Add `examples/openai-api` that demonstrates `chatCompletion()` for
both streaming and non-streaming usages, without web worker
- Support both streaming and non-streaming `chatCompletion()` in
`web_worker.ts` with example usage added to `examples/web-worker`
- For streaming with a web worker, users have access to an async generator
whose `next()` exchanges messages with the worker, which has an
underlying async generator that performs the actual decoding (see the
sketch after this list)
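
A rough sketch of the caller's side of this streaming path (the worker-client
construction below is approximate; see `examples/web-worker` for the actual
usage):

import * as webllm from "@mlc-ai/web-llm";

// Sketch only: the worker-client class/constructor here is approximate.
const chat = new webllm.ChatWorkerClient(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" })
);
await chat.reload("Llama-2-7b-chat-hf-q4f32_1");

// With stream: true, chatCompletion returns an async generator; each next()
// round-trips a message to the worker, whose own async generator performs
// the actual decoding of one chunk.
const chunks = await chat.chatCompletion({
  stream: true,
  messages: [{ "role": "user", "content": "Hello!" }],
});

for await (const chunk of chunks) {
  console.log(chunk.choices[0].delta.content);
}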

Existing gaps from full OpenAI compatibility are listed in
#276; some may be unavoidable
(e.g. selecting `model` in the request) while others are WIP.

Benchmarked performance via `{WebWorker, SingleThread} X {OAI-Stream,
OAI-NonStream, Generate}`: virtually no degradation, with ±1 tok/s
variation. Specifically, on an M3 Max with Llama 2 7B q4f32_1, decoding 128
tokens with a 12-token prompt yields:
- Prefill: 182 tok/s
- Decode: 48.3 tok/s
- End-to-end: 38.5 tok/s
- where end-to-end is measured from the time we create the request until
everything finishes, i.e. the time recorded at the highest level.
@tqchen tqchen closed this as completed May 31, 2024