
Very slow generation with gpt4all #50

Closed
imwide opened this issue Apr 9, 2023 · 16 comments

Comments

@imwide

imwide commented Apr 9, 2023

Using gpt4all through the file shown in the attached image:

[image attachment]

works really well and it is very fast, even though I am running on a laptop with Linux Mint: about 0.2 seconds per token. But when running gpt4all through pyllamacpp, it takes up to 10 seconds to generate one token. Why is that, and how do I speed it up?

@clauslang

Having the same problem over here. Mac M1, 8 GB RAM. Chat works really fast, like in the gif in the README, but pyllamacpp is painfully slow. The output is also very different, with lower quality. Might this have to do with the new ggml weights (#40)?

Tried both the directly downloaded gpt4all-lora-quantized-ggml.bin and a version I converted myself from gpt4all-lora-quantized.bin.

@abdeladim-s
Collaborator

The gpt4all binary is based on an old commit of llama.cpp, so you might get different results when running pyllamacpp.

You might need to build the package yourself, because the build process takes the target CPU into account. Or, as @clauslang said, it might be related to the new ggml format; people are reporting similar issues there.

So what you need to do is build llama.cpp and compare it to pyllamacpp. If they have the same speed, then the slowdown is probably related to the new format.
If llama.cpp behaves normally, please try to build pyllamacpp as stated in the README and let us know if that solves the issue.
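
For the comparison step, a rough timing sketch like the following gives a seconds-per-token figure on the pyllamacpp side (the model path is an assumption; point it at the same ggml file you use with the chat binary):

import time
from pyllamacpp.model import Model

generated = []

def collect(text: str):
    # Each callback invocation corresponds to one piece of generated text.
    generated.append(text)

# Assumed model path; use the same ggml file you run with the chat binary.
model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

start = time.time()
model.generate("Once upon a time, ", n_predict=32,
               new_text_callback=collect, n_threads=4)
elapsed = time.time() - start

# Rough seconds per generated piece, to compare against the chat binary.
print(f"{len(generated)} pieces in {elapsed:.1f}s "
      f"(~{elapsed / max(len(generated), 1):.2f} s each)")

If the chat binary runs at roughly 0.2 s per token and this number is an order of magnitude higher, the slowdown is on the pyllamacpp side rather than in the model itself.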

@clauslang

clauslang commented Apr 10, 2023

Thanks @abdeladim-s. Not 100% sure if that's what you mean by building llama.cpp, but here's what I tried:

  • Ran the chat version of gpt4all (like in the README) --> works as expected: fast and fairly good output
  • Built and ran the chat version of alpaca.cpp (like in the README) --> works as expected: fast and fairly good output

Then I tried building pyllamacpp (like in the README):

git clone --recursive https://github.com/nomic-ai/pyllamacpp && cd pyllamacpp
pip install .

and ran the sample script:

from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback, n_threads=8)

--> very slow, with no or poor output
So building from source does not seem to solve the issue for me.

@imwide
Author

imwide commented Apr 10, 2023

I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

@Naugustogi

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

Are you sure it doubles it? Threads refers to CPU cores.

@Naugustogi

Batch size is the most important thing for speed; don't make it too large or too small.

@imwide
Author

imwide commented Apr 10, 2023

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

> Are you sure it doubles it? Threads refers to CPU cores.

I found out what the relationship is: threads can't be more than the number of cores shown in system info, otherwise it becomes REALLY slow. Don't know why, since this is easily preventable.
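
A minimal sketch of that workaround, reusing the sample script from earlier in the thread: clamp the requested thread count to what the machine actually reports instead of hard-coding 8.

import os
from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

# Never request more threads than the machine has; oversubscribing the CPU
# makes generation drastically slower, as described above.
n_threads = min(8, os.cpu_count() or 1)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55,
               new_text_callback=new_text_callback, n_threads=n_threads)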

@Naugustogi

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

> Are you sure it doubles it? Threads refers to CPU cores.

> I found out what the relationship is: threads can't be more than the number of cores shown in system info, otherwise it becomes REALLY slow. Don't know why, since this is easily preventable.

It's still as slow as a turtle.

@abdeladim-s
Collaborator

@imwide, what I meant by building llama.cpp is to follow this.

Yes, increasing the number of threads causes some issues; not sure why. Use 4 by default.

You are having a similar issue to this one; please go over it and let us know if you find any insights.

@mattorp

mattorp commented Apr 14, 2023

Increasing the thread count may cause it to include efficiency cores. For me, changing from 8 to 6 on an M1 Pro with 6 performance cores fixed it.
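
A small helper along those lines for Apple Silicon; the sysctl key hw.perflevel0.physicalcpu is an assumption (it reports the performance-core count on recent macOS releases), with a fallback to the total core count elsewhere:

import os
import subprocess

def performance_core_count() -> int:
    # On Apple Silicon, count only the performance cores so that efficiency
    # cores are not included in n_threads; fall back to all cores otherwise.
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return os.cpu_count() or 1

print(performance_core_count())  # e.g. 6 on an M1 Pro; pass this as n_threads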

@abdeladim-s
Collaborator

Hi @mattorp,
Is pyllamacpp working on your Mac M1? Could you please help solve issue #57?

@mattorp

mattorp commented Apr 24, 2023

Works fine on mine, @abdeladim-s, so I'm not much help for that issue. But hopefully @shivam-singhai's response indicates that the package manager version is the culprit.

@abdeladim-s
Collaborator

No problem @mattorp.
Yes, @shivam-singhai's response seems to be the solution to that problem.
Thanks :)

@bsbhaskartp

I am having the same problem: gpt4all-lora-quantized-OSX-m1 is very fast (< 1 sec) on my Mac, but running with pyllamacpp is very slow. Typical queries with pyllamacpp take > 30 sec. I tried the couple of things suggested above, but that didn't change the response time.

@abdeladim-s
Collaborator

@bsbhaskartp, if it is slow then you just need to build it from source.
@Naugustogi was having the same issue and he succeeded in solving it. Please take a look at this issue; it might help.

@bsbhaskartp

bsbhaskartp commented May 3, 2023 via email
