
Very slow generation with gpt4all #50

Closed
imwide opened this issue Apr 9, 2023 · 16 comments

Comments

@imwide

imwide commented Apr 9, 2023

Using gpt4all through the file shown in the attached image:

[image attachment]

works really well and it is very fast, even though I am running on a laptop with Linux Mint: about 0.2 seconds per token. But when running gpt4all through pyllamacpp, it takes up to 10 seconds to generate one token. Why is that, and how do I speed it up?

@clauslang

Having the same problem over here. Mac M1, 8 GB RAM. Chat works really fast, like in the gif in the README, but pyllamacpp is painfully slow. The output is also very different, with lower quality. Might this have to do with the new ggml weights (#40)?

Tried both the directly downloaded gpt4all-lora-quantized-ggml.bin and a version I converted myself from gpt4all-lora-quantized.bin.

@abdeladim-s
Collaborator

The gpt4all binary is based on an old commit of llama.cpp, so you might get different results when running pyllamacpp.

You might need to build the package yourself, because the build process takes the target CPU into account. Or, as @clauslang said, it might be related to the new ggml format; people are reporting similar issues there.

So what you need to do is build llama.cpp and compare it to pyllamacpp. If they have the same speed, then the slowdown is probably related to the new format.
If llama.cpp behaves normally, please try to build pyllamacpp as stated in the README and let us know if that solves the issue.
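
For the comparison step, a rough timing sketch like the following gives a seconds-per-token figure on the pyllamacpp side (the model path is an assumption; point it at the same ggml file you use with the chat binary):

import time
from pyllamacpp.model import Model

generated = []

def collect(text: str):
    # Each callback invocation corresponds to one piece of generated text.
    generated.append(text)

# Assumed model path; use the same ggml file you run with the chat binary.
model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

start = time.time()
model.generate("Once upon a time, ", n_predict=32,
               new_text_callback=collect, n_threads=4)
elapsed = time.time() - start

# Rough seconds per generated piece, to compare against the chat binary.
print(f"{len(generated)} pieces in {elapsed:.1f}s "
      f"(~{elapsed / max(len(generated), 1):.2f} s each)")

If the chat binary runs at roughly 0.2 s per token and this number is an order of magnitude higher, the slowdown is on the pyllamacpp side rather than in the model itself.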

@clauslang

clauslang commented Apr 10, 2023

Thanks @abdeladim-s. Not 100% sure if that's what you mean by building llama.cpp, but here's what I tried:

  • Ran the chat version of gpt4all (like in the README) --> works as expected: fast and fairly good output
  • Built and ran the chat version of alpaca.cpp (like in the README) --> works as expected: fast and fairly good output

Then I tried building pyllamacpp (like in the README):

git clone --recursive https://github.com/nomic-ai/pyllamacpp && cd pyllamacpp
pip install .

and ran the sample script:

from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback, n_threads=8)

--> very slow, with no or poor output
So building from source does not seem to solve the issue for me.

@imwide
Author

imwide commented Apr 10, 2023

I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

@Naugustogi

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

Are you sure it doubles it? Threads refers to CPU cores.

@Naugustogi

Batch size is the most important thing for speed; don't make it too large or too small.

@imwide
Author

imwide commented Apr 10, 2023

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

> Are you sure it doubles it? Threads refers to CPU cores.

I found out what the relationship is: threads can't be more than the number of cores shown in system info, otherwise it becomes REALLY slow. Don't know why, since this is easily preventable.
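
A minimal sketch of that workaround, reusing the sample script from earlier in the thread: clamp the requested thread count to what the machine actually reports instead of hard-coding 8.

import os
from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

# Never request more threads than the machine has; oversubscribing the CPU
# makes generation drastically slower, as described above.
n_threads = min(8, os.cpu_count() or 1)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55,
               new_text_callback=new_text_callback, n_threads=n_threads)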

@Naugustogi

> I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

> Are you sure it doubles it? Threads refers to CPU cores.

> I found out what the relationship is: threads can't be more than the number of cores shown in system info, otherwise it becomes REALLY slow. Don't know why, since this is easily preventable.

It's still as slow as a turtle.

@abdeladim-s
Collaborator

@imwide, what I meant by building llama.cpp is to follow this.

Yes, increasing the number of threads causes some issues; not sure why. Use 4 by default.

You are having a similar issue to this one; please go over it and let us know if you find any insights.

@mattorp

mattorp commented Apr 14, 2023

Increasing the thread count may cause it to include efficiency cores. For me, changing from 8 to 6 on an M1 Pro with 6 performance cores fixed it.
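
A small helper along those lines for Apple Silicon; the sysctl key hw.perflevel0.physicalcpu is an assumption (it reports the performance-core count on recent macOS releases), with a fallback to the total core count elsewhere:

import os
import subprocess

def performance_core_count() -> int:
    # On Apple Silicon, count only the performance cores so that efficiency
    # cores are not included in n_threads; fall back to all cores otherwise.
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return os.cpu_count() or 1

print(performance_core_count())  # e.g. 6 on an M1 Pro; pass this as n_threads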

@abdeladim-s
Collaborator

Hi @mattorp,
Is pyllamacpp working on your Mac M1? Could you please help solve issue #57?

@mattorp

mattorp commented Apr 24, 2023

Works fine on mine, @abdeladim-s, so I'm not much help for that issue. But hopefully @shivam-singhai's response indicates that the package manager version is the culprit.

@abdeladim-s
Collaborator

No problem @mattorp.
Yes, @shivam-singhai's response seems to be the solution to that problem.
Thanks :)

@bsbhaskartp

I am having the same problem: gpt4all-lora-quantized-OSX-m1 is very fast (< 1 sec) on my Mac, but running with pyllamacpp is very slow. Typical queries with pyllamacpp take > 30 sec. I tried the couple of things suggested above, but that didn't change the response time.

@abdeladim-s
Collaborator

@bsbhaskartp, if it is slow then you just need to build it from source.
@Naugustogi was having the same issue and he succeeded in solving it. Please take a look at this issue; it might help.

@bsbhaskartp

bsbhaskartp commented May 3, 2023 via email
