
a bit slow on my mbp 16 m1 #3

Open
cpietsch opened this issue Aug 9, 2023 · 15 comments

@cpietsch

cpietsch commented Aug 9, 2023

I downloaded the https://huggingface.co/coreml-projects/Llama-2-7b-chat-coreml model and compiled the chat with Xcode. When running the example prompt it takes around 15 minutes to complete. I am not sure what I did wrong, but the performance should be better, right?

```
2023-08-09 12:01:55.346753+0200 SwiftChat[27414:583595] Metal API Validation Enabled
```

@jsj
Contributor

jsj commented Aug 9, 2023

I believe it's because this is the unquantized version; if you compress it you will get better perf.
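
For example, a sketch with the huggingface/exporters converter (the `exported/` output directory is just a placeholder):

```sh
# Convert to Core ML with float16 weights; this halves the on-disk and
# in-memory size versus float32.
python -m exporters.coreml -m=./Llama-2-7b-hf --quantize=float16 exported/
```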

@pcuenca
Member

pcuenca commented Aug 9, 2023

Hi @cpietsch! It sounds to me as if the model was running on CPU only. Could you maybe try to run it again with the "GPU History" window of Activity Monitor open at the same time? It should show very clear GPU activity if it's in use.
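
If it does turn out to be CPU-only, one thing worth checking in the app code is which compute units the model is loaded with. A minimal sketch; `modelURL` is a placeholder for wherever swift-chat keeps the compiled `.mlmodelc`:

```swift
import CoreML

// Ask Core ML to consider all compute units when loading the model.
let config = MLModelConfiguration()
config.computeUnits = .all            // CPU + GPU + Neural Engine
// config.computeUnits = .cpuAndGPU   // variant that rules out the ANE

do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    print(model.modelDescription)
} catch {
    print("Failed to load model: \(error)")
}
```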

Also, what computer are you using?

@cpietsch
Author

cpietsch commented Aug 9, 2023

Hi @pcuenca, I am running it on an Apple M1 Pro with 16 GB and macOS 13.4.1.
I checked the perf history and it actually does not show significant activity on either the GPU or the CPU.
[Screenshot 2023-08-09 at 17:21:59: GPU/CPU history showing no significant activity]

@jsj
Contributor

jsj commented Aug 9, 2023

Interesting. Remember that Activity Monitor does not show Neural Engine usage; perhaps https://github.com/tlkh/asitop could provide more insight.
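
For reference, asitop is a pip package and needs sudo to read the powermetrics counters:

```sh
pip install asitop
sudo asitop   # shows CPU/GPU/ANE utilization and power on Apple Silicon
```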

@pcuenca
Member

pcuenca commented Aug 11, 2023

My suspicion is that the computer is swapping because of memory pressure.
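
The float16 7b weights are roughly 13–14 GB, which leaves very little headroom on a 16 GB machine. One way to confirm swapping from the terminal:

```sh
# "used" climbing while the model generates means the weights are
# being paged out to disk
sysctl vm.swapusage
```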

@awmartin

Same experience. I have a MacBook Pro M1 Max with 32 GB of RAM, and I get 0.39 tokens/s. It's even worse with Falcon 7b.

[Screen recording: swift-chat-llama-2-slow]

@cpietsch
Author

cpietsch commented Aug 16, 2023

> I believe it's because this is the unquantized version; if you compress it you will get better perf.

Maybe we need to convert the model ourselves. But 0.39 t/s is not that bad...

@awmartin

Whelp, just closing all other apps, restarting, and running the SwiftChat build without Xcode has resulted in 4.96 tokens/s. Woohoo!
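
Worth noting: launching from Xcode also turns on Metal API validation (visible in the log at the top of the issue), which adds overhead. A standalone Release build can be produced with something like this, assuming the scheme is named `SwiftChat`:

```sh
# build the Release configuration, then run the resulting .app outside Xcode
xcodebuild -scheme SwiftChat -configuration Release build
```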

@cpietsch
Author

So @pcuenca was right about the memory pressure.

@cpietsch
Author

Here are some profiling images that show a low workload:
[Screenshot 2023-08-21 at 12:47:24]
[Screenshot 2023-08-21 at 12:34:28]
It seems that others have the same problem.

@longseespace

I have the same problem. One thing I don't understand: I was able to get fast responses using [ollama](https://ollama.ai). Any idea why? I can see that the default model used in ollama is the 7b model 🤔
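
For what it's worth, ollama runs a llama.cpp backend and pulls a 4-bit quantized build of the model by default (worth verifying for your tag), so the 7b weights fit in RAM without swapping. Reproducing the comparison is a one-liner:

```sh
# pulls the default (quantized) llama2 model on first run, then chats
ollama run llama2
```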

@cpietsch
Author

Nice, ollama worked for me too, right out of the box.
I tried to convert llama2 for swift-chat myself with `python -m exporters.coreml -m=./Llama-2-7b-hf --quantize=float16 --compute_units=cpu_and_gpu ll` but it always crashes without an error after around 15 minutes. 🤔

@markwitt1

I am experiencing the same issue on a 16-inch MBP. Do you have any updates?

@matiasvillaverde

I am encountering a similar issue on a MacBook M2 with 32 GB of RAM. It appears the system may be swapping due to elevated memory pressure. I would appreciate any recommendations for reducing the memory footprint here.

@AndreaChiChengdu

Hi guys, any update on this question?
I hit the same issue on my M3 MBP.
