
[Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule #276

Closed · 1 of 3 tasks
CharlieFRuan opened this issue Jan 28, 2024 · 1 comment
Labels: status: tracking (Tracking work in progress)

Comments


CharlieFRuan commented Jan 28, 2024

Overview

The goal of this task is to implement APIs that are compatible with the OpenAI API. Existing APIs like generate() will still be kept. Essentially we want JSON-in and JSON-out, resulting in usage like:

import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");

  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    // optional generative configs here
  });

  console.log(completion.choices[0]);
}

main();

If streaming:

  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    stream: true,
    // optional generative configs here
  });

  for await (const chunk of completion) {
    console.log(chunk.choices[0].delta.content);
  }

Action items

Existing gaps

There are some fields/features that are not yet supported in WebLLM compared to OpenAI's openai-node.

Fields in ChatCompletionRequest

  • model: in WebLLM, we need to call reload(model) instead of passing it as an argument in ChatCompletionRequest
  • response_format (JSON formatting)
  • function calling related (see the sketch after this list):
    • tool_choice
    • tools
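
For reference, a hypothetical request in openai-node's shape is sketched below. The tools/tool_choice fields follow openai-node and are exactly what this item tracks; WebLLM does not accept them yet, and the function name here is made up for illustration:

  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "What is the weather in Paris?" }],
    // Not yet supported by WebLLM; shape follows openai-node's ChatCompletionTool
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",  // hypothetical function
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
    tool_choice: "auto",  // let the model decide whether to call the tool
  });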

Fields in the ChatCompletion response

  • system_fingerprint: not applicable in our case (OpenAI needs it because they perform requests remotely on their servers)

Others

  • We do not support n > 1 when streaming, since llm_chat.ts does not support maintaining multiple sequences. We would have to finish one sequence and then start generating another, which conflicts with the goal of streaming in chunks (see the sketch below).
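
To illustrate, a request with n > 1 (assuming n follows OpenAI's semantics) is fine without streaming, because the sequences can simply be generated one after another before the response is returned:

  // Sketch assuming OpenAI-style n; not streamed, so the two sequences can
  // be generated back to back before the full response is assembled.
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "Tell me a joke." }],
    n: 2,
  });
  console.log(completion.choices[0].message.content);
  console.log(completion.choices[1].message.content);

With stream: true, however, chunks for choice 0 and choice 1 would need to be produced concurrently, which the single-sequence llm_chat.ts cannot do.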

Future Items

@CharlieFRuan CharlieFRuan added the status: tracking Tracking work in progress label Jan 28, 2024
Kartik14 commented Feb 2, 2024

@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling.

@CharlieFRuan CharlieFRuan changed the title [Tracking] OpenAI-API Compatible APIs in ChatModule [Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule Feb 15, 2024
CharlieFRuan added a commit that referenced this issue Feb 21, 2024
This PR adds an OpenAI-like API to `ChatModule`, specifically the
`chatCompletion` API. See `examples/openai-api` for example usage, and the
[OpenAI reference](https://platform.openai.com/docs/api-reference/chat)
for its original usage.

Changes include:
- Implement `chatCompletion()` in `ChatModule`
- Expose conversation manipulation methods in `llm_chat.ts` so that
`request.messages` can override existing chat history or system prompt
in `chat_module.ts`
- Implement `src/openai_api_protocols`, which represents all
OpenAI-related data structures; largely based on
[openai-node](https://github.com/openai/openai-node/blob/master/src/resources/chat/completions.ts)
- Add `examples/openai-api` that demonstrates `chatCompletion()` for
both streaming and non-streaming usages, without web worker
- Support both streaming and non-streaming `chatCompletion()` in
`web_worker.ts` with example usage added to `examples/web-worker`
- For streaming with a web worker, users have access to an async generator
whose `next()` exchanges messages with the worker, which has an
underlying async generator that performs the actual decoding (see the
sketch after this list)
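
A rough sketch of the caller's side of this streaming path (the worker-client
construction below is approximate; see `examples/web-worker` for the actual
usage):

import * as webllm from "@mlc-ai/web-llm";

// Sketch only: the worker-client class/constructor here is approximate.
const chat = new webllm.ChatWorkerClient(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" })
);
await chat.reload("Llama-2-7b-chat-hf-q4f32_1");

// With stream: true, chatCompletion returns an async generator; each next()
// round-trips a message to the worker, whose own async generator performs
// the actual decoding of one chunk.
const chunks = await chat.chatCompletion({
  stream: true,
  messages: [{ "role": "user", "content": "Hello!" }],
});

for await (const chunk of chunks) {
  console.log(chunk.choices[0].delta.content);
}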

Existing gaps from full OpenAI compatibility are listed in
#276; some may be unavoidable
(e.g. selecting `model` in the request) while others are WIP.

Benchmarked performance via `{WebWorker, SingleThread} X {OAI-Stream,
OAI-NonStream, Generate}`: virtually no degradation, with ±1 tok/s
variation. Specifically, on an M3 Max with Llama 2 7B q4f32_1, decoding 128
tokens with a 12-token prompt yields:
- Prefill: 182 tok/s
- Decode: 48.3 tok/s
- End-to-end: 38.5 tok/s
- where end-to-end is measured from the time we create the request until
everything finishes, i.e. the time recorded at the highest level.
@tqchen tqchen closed this as completed May 31, 2024