Implement LLM response streaming #859

Merged 10 commits into jupyterlab:main on Jun 27, 2024
Conversation

@dlqqq (Collaborator) commented on Jun 25, 2024

Description

This is a large PR that implements LLM response streaming in Jupyter AI. For LangChain LLM classes that implement the _stream() or _astream() method, the response is now rendered incrementally in chunks, allowing users to watch the response being built token by token.
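
As a rough illustration of the mechanism (a sketch only, not the PR's actual handler code; `send_chunk` is a hypothetical transport callback standing in for whatever pushes updates to the client):

```python
# Minimal sketch: consume a LangChain model's streaming interface. Models that
# override _stream()/_astream() yield many small chunks via astream(); models
# that don't yield the full response as a single chunk, so this works either way.
from typing import Callable

from langchain_core.language_models import BaseChatModel


async def respond_streaming(
    llm: BaseChatModel,
    prompt: str,
    send_chunk: Callable[[str, bool], None],  # hypothetical callback: (text, done)
) -> None:
    async for chunk in llm.astream(prompt):
        # Each chunk is an AIMessageChunk; forward its text as a partial update.
        send_chunk(chunk.content, False)
    # Signal completion so the client can finalize the rendered message.
    send_chunk("", True)
```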

Also fixes #858 by adding a version ceiling on faiss-cpu.

Demo of single-user scenario

Screen.Recording.2024-06-25.at.3.58.25.PM.mov

Demo of multi-user scenario, showing stream completeness & consistency

Screen.Recording.2024-06-25.at.4.02.43.PM.mov

Extended description (for developers & reviewers)

  • The "Generating response..." pending message is still shown to the user while waiting to receive the first chunk from the LLM.
  • How chat history is retrieved & managed by the client has been fully reworked: the initial ConnectionMessage object sent to the client by the server extension on connection now includes a ChatHistory object containing the full message history (see the first sketch after this list).
    • Clients no longer need to call the GetHistory REST API endpoint to retrieve the history separately; the entire chat history can now be obtained by listening to the Chat websocket connection.
    • This change ensures that new clients connecting while Jupyter AI is streaming a response do not "miss" any chunks. Since the chat history is received as soon as the WebSocket connection is established, clients remain complete and consistent even if they join mid-stream, as demonstrated in the "Demo of multi-user scenario" section above.
  • This PR also introduces a jupyter_ai_test package that includes a TestProvider and a TestStreamingProvider (a rough sketch of a streaming test provider follows this list).
    • For reviewers: jupyter_ai_test can be installed in your dev environment simply by running jlpm dev-uninstall && jlpm dev-install.
    • Since it is not listed under .jupyter_releaser.toml, this package will not be released to NPM or PyPI; it is intended for local development and testing workflows only.
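
To illustrate the idea behind the reworked history handling (class and field names below are assumptions, not the exact jupyter_ai models): the connection handshake itself carries a history snapshot, so a client joining mid-stream starts from a complete state and only needs to apply the chunk updates that arrive after it connects.

```python
# Illustrative only: simplified stand-ins for the real jupyter_ai message models.
from typing import List, Literal

from pydantic import BaseModel


class ChatMessage(BaseModel):
    id: str
    body: str


class ChatHistory(BaseModel):
    messages: List[ChatMessage]


class ConnectionMessage(BaseModel):
    type: Literal["connection"] = "connection"
    client_id: str
    # Full history snapshot, sent once, immediately after the WebSocket opens;
    # later updates (including streaming chunks) arrive as separate messages.
    history: ChatHistory
```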
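
Below is a rough sketch of what a streaming test provider can look like. This is not the actual jupyter_ai_test code and it omits provider registration; it only shows a fake LangChain LLM that implements _stream(), so the streaming UI can be exercised without network calls or API keys.

```python
# Sketch of a fake streaming LLM for local testing (names are illustrative).
import time
from typing import Any, Iterator, List, Optional

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.outputs import GenerationChunk


class FakeStreamingLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "fake-streaming"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        # Non-streaming fallback: return the whole canned response at once.
        return "Hello! This is a canned streaming test response."

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        # Emit the canned response word by word to simulate token streaming.
        for word in self._call(prompt).split():
            time.sleep(0.05)
            yield GenerationChunk(text=word + " ")
```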

Follow-up work

  • The "Include selection" and "Replace selection" checkboxes need to be removed prior to the next release, as they likely do not work anymore. To avoid a breaking change, they must be replaced by alternative features in the UI when removed:
    • "Include selection": This should be implemented by replacing the current "Send" button with a dropdown button that allows users to send a prompt with a text/cell selection.
    • "Replace selection": This should be implemented by:
      1. Supporting replacing text selection in the code action toolbar
      2. Adding a hamburger menu on each message that contains an option for replacing the current text/cell selection with the entire Markdown contents of a message.
    • However, I argue this should be done in a future PR prior to the next release, as these features lie outside the scope of this PR.

@dlqqq added the enhancement label on Jun 25, 2024
@3coins (Collaborator) left a comment

@dlqqq
Kudos on adding streaming so quickly 🚀; the UX is much better with these changes. Added some minor suggestions, but looks good otherwise.
Noticed that /ask is still using the non-streaming messages; do you plan to handle streaming for /ask in a separate PR?

Review threads on files:
  • packages/jupyter-ai-test/README.md (outdated, resolved)
  • packages/jupyter-ai-test/package.json (resolved)
  • packages/jupyter-ai/jupyter_ai/handlers.py (outdated, resolved)
  • packages/jupyter-ai/src/chat_handler.ts (resolved)
@krassowski linked an issue on Jun 26, 2024 that may be closed by this pull request
@JasonWeill (Collaborator) left a comment

This looks and works great! Little to add beyond @3coins's earlier comments. Also, could you please file issues to cover the proposed additional enhancements that are out of scope for this PR?

@dlqqq (Collaborator, Author) commented on Jun 27, 2024

@3coins @JasonWeill Thank you both for the thoughtful review. I've addressed your comments in the latest revision.

Noticed that /ask is still using the non-streaming messages; do you plan to handle streaming for /ask in a separate PR?

Yes. There were some issues on the langchain-ai/langchain repo about LCEL streaming when using retrievers, so I decided it would be best to implement streaming for /ask in a separate, future PR.

@3coins (Collaborator) commented on Jun 27, 2024

@dlqqq
Here is a reference to building RAG with LCEL, but ok to do this in a separate PR.
https://python.langchain.com/v0.1/docs/use_cases/question_answering/chat_history/#tying-it-together
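
For reference, a hedged sketch of the LCEL RAG pattern described in the linked docs, not the actual /ask implementation: the retriever is replaced by a RunnableLambda stub, and ChatOpenAI is just one example of a streaming-capable chat model.

```python
# LCEL RAG chain that streams its answer despite running a retrieval step first.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

# Stand-in for a real vector-store retriever.
retriever = RunnableLambda(
    lambda question: "Jupyter AI now streams LLM responses token by token."
)

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # any chat model with streaming support

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# LCEL chains expose .stream()/.astream(), so the answer arrives in chunks.
for chunk in rag_chain.stream("What changed in this PR?"):
    print(chunk, end="", flush=True)
```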

@dlqqq (Collaborator, Author) commented on Jun 27, 2024

Opened a new issue to track /ask streaming: #863

@dlqqq merged commit 5183bc9 into jupyterlab:main on Jun 27, 2024
8 checks passed
@dlqqq deleted the streaming branch on June 27, 2024 at 23:17
@dlqqq added this to the v2.19.0 milestone on Jun 27, 2024
@jtpio mentioned this pull request on Jul 1, 2024
Successfully merging this pull request may close these issues:
  • Jupyter AI installation failing on x86 macOS
  • Stream textual responses token-by-token