[KVCache] Add implicit KVCache reuse, disable stateful option for chatCompletion (#359)

We introduced the field `stateful` in `chatCompletion()` earlier to allow easier multi-round chatting in #330. However, this is not ideal, since we prefer APIs that are functional in behavior, which brings various benefits (e.g. better fault tolerance for future use cases). Therefore, in this PR:

- We disable `chatCompletionRequest.stateful` and ask users to maintain the chat history explicitly
- Instead, we introduce implicit KVCache reuse for multi-round chatting
  - When we detect that the user is doing multi-round chatting, we do not reset the KV cache, so only the new message is prefilled
  - To detect multi-round chatting, we instantiate a `Conversation` instance for each request and compare it with the current internal `Conversation`. If they match, we can safely keep the internal state and prefill only the new input.
- To see the behavior, check out `mainMultiroundChat()` in `examples/openai-api/src/openai_api.ts`

Implementation details:

- We instantiate the `Conversation` object in `ChatModule.prefill()`, since this is where the various workflows meet (streaming, non-streaming, n > 1, etc.)
- The object's state is determined by the system prompt, the message history, and function calling usage
- Inside `prefill()`, we compare the two objects with `compareConversationObject()` and reset all internal states if they do not match
- Another detail: instead of overriding `conversation.config.system_message`, we add a field `conversation.override_system_message`, keeping `conversation.config` protected
- We further remove all methods in `ChatModule` that override `this.getPipeline().conversation` by changing `updateConversationWithChatCompletionMessages()` to `getConversationFromChatCompletionRequest()`, keeping things more functional internally
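The multi-round detection described above can be sketched as follows. This is a minimal, hypothetical stand-in, not the actual web-llm implementation: the simplified `Conversation` shape and the `canReuseKVCache` helper are assumptions made for illustration, playing the role that the internal `Conversation` comparison (`compareConversationObject()`) plays in `prefill()`.

```typescript
// Hypothetical, simplified stand-in for the internal Conversation state.
interface Conversation {
  systemMessage: string;
  messages: Array<{ role: "user" | "assistant"; content: string }>;
}

// Returns true when `next` extends `current` by exactly one new user message.
// In that case the engine can keep its KV cache untouched and prefill only
// the new input; otherwise all internal state must be reset.
function canReuseKVCache(current: Conversation, next: Conversation): boolean {
  if (current.systemMessage !== next.systemMessage) return false;
  if (next.messages.length !== current.messages.length + 1) return false;
  // Every existing message must match exactly (shared prefix check).
  for (let i = 0; i < current.messages.length; i++) {
    const a = current.messages[i];
    const b = next.messages[i];
    if (a.role !== b.role || a.content !== b.content) return false;
  }
  // The single new message must be user input awaiting a reply.
  return next.messages[next.messages.length - 1].role === "user";
}
```

Because the check is a pure comparison of two request-derived objects, no mutable flag like `stateful` is needed: the caller resends the full history each round, and reuse falls out of the prefix match.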
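The explicit chat-history pattern that callers are asked to adopt (now that `chatCompletionRequest.stateful` is disabled) could look like the sketch below. The `chatCompletion` stub here is an assumption standing in for the real engine call, which would run prefill and decode; only the history-ownership pattern is the point.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Stub standing in for the real engine's chatCompletion; it just echoes.
async function chatCompletion(request: { messages: Message[] }): Promise<string> {
  const last = request.messages[request.messages.length - 1];
  return `reply to: ${last.content}`;
}

async function multiRoundChat(): Promise<Message[]> {
  // The caller owns the full history and resends it every round. Because
  // each round only appends one new user message, the engine can detect
  // the shared prefix and reuse its KV cache, prefilling only the new input.
  const history: Message[] = [{ role: "system", content: "You are helpful." }];
  for (const question of ["What is 2+2?", "And times 3?"]) {
    history.push({ role: "user", content: question });
    const reply = await chatCompletion({ messages: history });
    history.push({ role: "assistant", content: reply });
  }
  return history;
}
```

This keeps the public API functional: the request alone fully determines the response, while cache reuse remains an internal optimization.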