
Very slow inference (with local LLM) during agent work #28

Closed
Pitachoo11 opened this issue Dec 6, 2023 · 2 comments

Comments

@Pitachoo11

Describe the bug
Inference during agent work is very slow compared to normal LLM interaction.
I'm using a local setup with an API connection to TextGen WebUI on the local network.
Each iteration of TaskWeaver is extremely slow: generation speed drops to around 1-2 t/s, while the usual speed on the same setup is 15-20 t/s.

At this rate the tool is not very useful; a simple coding task like printing numbers takes 20-30 minutes to complete.
Is there any tweak to solve this? I guess it could be because of the relatively large context sent in each request.
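
As a rough way to test the large-context hypothesis, a timing sketch like the one below compares a short prompt with an artificially padded one against the same OpenAI-compatible TextGen WebUI endpoint. The URL, prompt sizes, and token counts are placeholders, not values taken from this setup.

```python
# Compare generation throughput for a short vs. a long prompt against a local
# OpenAI-compatible endpoint (e.g. TextGen WebUI). URL and sizes are placeholders.
import time

import requests

API_BASE = "http://127.0.0.1:5000/v1"  # assumed local TextGen WebUI address


def tokens_per_second(prompt: str, max_tokens: int = 128) -> float:
    """Time one completion request and return generated tokens per second."""
    start = time.time()
    resp = requests.post(
        f"{API_BASE}/completions",
        # Some servers also require a "model" field; add one if yours does.
        json={"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0},
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # OpenAI-compatible responses usually include usage; fall back to max_tokens.
    generated = resp.json().get("usage", {}).get("completion_tokens", max_tokens)
    return generated / elapsed


short_prompt = "Write a Python loop that prints the numbers 1 to 10."
long_prompt = ("You are a helpful assistant. " * 400) + short_prompt  # ~large context

print(f"short prompt: {tokens_per_second(short_prompt):.1f} t/s")
print(f"long prompt:  {tokens_per_second(long_prompt):.1f} t/s")
```

If the long prompt is dramatically slower, the slowdown is dominated by prompt (re)processing in the backend rather than by anything TaskWeaver-specific, and backend-side options such as prompt caching or faster prompt evaluation are likely to help more than any TaskWeaver tweak.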

To Reproduce
Steps to reproduce the behavior:

  1. Start the service
  2. Type any user query listed in the example description
  3. Wait for the response forever

Expected behavior
Inference speed similar to AutoGen

Environment Information (please complete the following information):

  • OS: macOS
  • Python version: 3.11
  • LLM that you're using: a number of different 7B models
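
For completeness, the TextGen WebUI connection described above is wired into TaskWeaver through its project config. The sketch below writes such a config; the key names, file path, URL, and model name are assumptions to verify against the TaskWeaver docs.

```python
# Hypothetical sketch: point TaskWeaver at a local OpenAI-compatible endpoint
# (TextGen WebUI with its OpenAI-style API enabled). Key names, project path,
# URL, and model name are assumptions; check TaskWeaver's documentation.
import json
from pathlib import Path

config = {
    "llm.api_type": "openai",
    "llm.api_base": "http://127.0.0.1:5000/v1",  # local TextGen WebUI endpoint (assumed)
    "llm.api_key": "not-needed-for-local",
    "llm.model": "local-7b-model",               # placeholder model name
}

cfg_path = Path("project") / "taskweaver_config.json"
cfg_path.parent.mkdir(exist_ok=True)
cfg_path.write_text(json.dumps(config, indent=2))
print(f"wrote {cfg_path}")
```

TaskWeaver is then started against that project directory; when only one model is loaded in TextGen WebUI, the server may ignore the model name.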
@WillianXu117

Hi, how do you run this with a local LLM?

@liqul
Contributor

liqul commented Feb 4, 2024

Closing inactive issues.

@liqul liqul closed this as completed Feb 4, 2024