[OpenAI] Support stateful chat completion #330

Merged on Mar 12, 2024 (1 commit)

CharlieFRuan
Contributor

This PR adds the `stateful` option to `ChatCompletionRequest`. When it is set to true, we preserve the previous chat history across requests, which allows multi-round chat and makes the call behave essentially like `generate()`. Note that a stateful chat can only have `n=1`.

In addition, we expose `getMessage()` on `ChatInterface`. This lets streaming chat completion requests retrieve the final response more easily, rather than manually concatenating the deltas.
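
For context, the manual concatenation mentioned above could look roughly like the following. This is a minimal sketch, not code from this PR: it assumes chunks follow the OpenAI streaming shape (`choices[0].delta.content`), and `ChunkLike`, `collectReply`, and `mockStream` are hypothetical names used only for illustration.

```typescript
// Minimal chunk shape, modeled on the OpenAI chat-completion streaming format.
interface ChunkLike {
  choices: { delta: { content?: string | null } }[];
}

// Manually concatenate the deltas of the first choice into the final reply.
async function collectReply(chunks: AsyncIterable<ChunkLike>): Promise<string> {
  let reply = "";
  for await (const chunk of chunks) {
    // A chunk may carry no content (e.g. the final one), so fall back to "".
    reply += chunk.choices[0]?.delta?.content ?? "";
  }
  return reply;
}

// Hypothetical stand-in for a streaming response; not part of WebLLM.
async function* mockStream(): AsyncGenerator<ChunkLike> {
  for (const piece of ["Hello", ", ", "world!"]) {
    yield { choices: [{ delta: { content: piece } }] };
  }
  // Final chunk without content.
  yield { choices: [{ delta: {} }] };
}
```

With `getMessage()` exposed, a caller should be able to skip a helper like this and read the final message once the stream has finished. The exact signature of `getMessage()` is not stated in this description.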

@CharlieFRuan CharlieFRuan merged commit 212ae18 into mlc-ai:main Mar 12, 2024
CharlieFRuan added a commit that referenced this pull request Mar 14, 2024
Changes in WebLLM:
- Stateful chat completion: #330
- OpenAI's `logit_bias`: #331
- OpenAI's `logprobs` and `top_logprobs`: #333

Changes in TVMjs:
- apache/tvm#16650
- Fix param download issues (already reflected in 0.2.26, but at the
time this PR was not merged yet)
  - Expose `sampleTopPFromProb` to support `logprobs` (new in 0.2.27)
CharlieFRuan added a commit that referenced this pull request Apr 3, 2024
…tCompletion (#359)

We introduced the field `stateful` in `chatCompletion()` earlier to
allow easier multi-round chatting in
#330.

However, this is not ideal: we prefer APIs that behave functionally,
where each request carries the full chat history it depends on. This
brings benefits such as better fault tolerance for future use cases.

Therefore, in this PR:
- We disable `chatCompletionRequest.stateful`, and ask users to maintain
the chat history explicitly
- Instead, we introduce implicit KVCache reuse for multi-round chatting
  - When we detect users are doing multi-round chatting, we will not
    reset the KV cache, so only the new message will be prefilled
  - To detect multi-round chatting, we instantiate a `Conversation`
    instance for each request, and compare it with the current internal
    `Conversation`. If they match, it means that we can safely not reset
    the internal state, and only prefill the new input.
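
From the caller's side, maintaining the chat history explicitly could look like the sketch below. It is illustrative only: `Complete`, `runRound`, and `echoModel` are hypothetical names, and `echoModel` is a stub standing in for a real chat-completion call.

```typescript
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string; }

// Hypothetical stateless completion function: it sees only what it is given.
type Complete = (messages: readonly Message[]) => string;

// Run one round: send the full history plus the new input, and return a new
// history that also contains the reply. The input array is not mutated.
function runRound(
  history: readonly Message[],
  userInput: string,
  complete: Complete,
): Message[] {
  const request = [...history, { role: "user" as const, content: userInput }];
  const reply = complete(request);
  return [...request, { role: "assistant" as const, content: reply }];
}

// Stub "model" that reports how many messages it received.
const echoModel: Complete = (messages) => `seen ${messages.length} message(s)`;

let history: Message[] = [
  { role: "system", content: "You are a helpful assistant." },
];
history = runRound(history, "Hi", echoModel);
history = runRound(history, "Tell me more", echoModel);
```

Because every request repeats the earlier messages verbatim, the engine has what it needs to recognize a continuation and skip re-prefilling the shared history.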

To see the behavior, check out `mainMultiroundChat()` in
`examples/openai-api/src/openai_api.ts`.

Implementation details:
- Instantiate `Conversation` object in `ChatModule.prefill()`, since
this is the place where various workflows meet (streaming,
non-streaming, n > 1, etc.)
- The object's state is determined by system prompt, message history,
and function calling usages
- Inside `prefill()`, we then compare the two objects with
`compareConversationObject()`, and reset all internal state if it
returns false
- Another detail is that, instead of overriding
`conversation.config.system_message`, we add a field
`conversation.override_system_message`, making `conversation.config`
protected
- We further remove all methods in `ChatModule` that override
`this.getPipeline().conversation` by changing
`updateConversationWithChatCompletionMessages()` to
`getConversationFromChatCompletionRequest()`, keeping things more
functional internally
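
The compare-then-reset flow described in these bullets can be sketched as below. This is a toy model under stated assumptions, not the actual `ChatModule` code: `ConversationState`, `stateFromRequest`, `sameState`, and `ToyPipeline` are hypothetical names, and "prefilling" is reduced to counting messages.

```typescript
type Role = "system" | "user" | "assistant";
interface Message { role: Role; content: string; }

// Hypothetical snapshot of what determines a conversation's state.
interface ConversationState {
  systemPrompt: string;
  history: Message[]; // everything before the newest input
  usesFunctionCalling: boolean;
}

// Build a fresh state from a request instead of mutating the pipeline's own.
function stateFromRequest(
  messages: readonly Message[],
  usesFunctionCalling: boolean,
): ConversationState {
  const system = messages.find((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return {
    systemPrompt: system ? system.content : "",
    history: rest.slice(0, -1).map((m) => ({ ...m })), // drop the newest input
    usesFunctionCalling,
  };
}

function sameState(a: ConversationState, b: ConversationState): boolean {
  return (
    a.systemPrompt === b.systemPrompt &&
    a.usesFunctionCalling === b.usesFunctionCalling &&
    a.history.length === b.history.length &&
    a.history.every(
      (m, i) => m.role === b.history[i].role && m.content === b.history[i].content,
    )
  );
}

class ToyPipeline {
  private current: ConversationState | null = null;

  // Returns how many non-system messages had to be prefilled.
  prefill(messages: readonly Message[], usesFunctionCalling = false): number {
    const incoming = stateFromRequest(messages, usesFunctionCalling);
    const reuse = this.current !== null && sameState(incoming, this.current);
    const newest = messages[messages.length - 1];
    // On a match only the newest message is prefilled; otherwise the state
    // is reset and the whole history is prefilled again.
    const prefilled = reuse ? 1 : incoming.history.length + 1;
    this.current = { ...incoming, history: [...incoming.history, newest] };
    return prefilled;
  }

  // Record the generated reply so the next request's history can match.
  recordReply(content: string): void {
    if (this.current === null) throw new Error("prefill() must be called first");
    this.current.history.push({ role: "assistant", content });
  }
}
```

In this sketch, a request whose earlier messages match the stored state costs one prefilled message, while any change to the system prompt, the history, or the function-calling flag triggers a full reset.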