Fix for cohere plus #650

Merged · awni merged 2 commits into main from cohere_plus on Apr 5, 2024
Conversation

awni (Member) commented Apr 4, 2024

Use the QK-norm param to work with Cohere Plus (Command R+).
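For context, here is a minimal sketch of where QK-norm sits in an attention block. This is not the actual mlx-lm implementation; the class layout and the use_qk_norm flag are illustrative assumptions, and causal masking / KV caching are omitted:

```python
# Minimal sketch of QK-norm in attention (illustrative, not the mlx-lm code).
import mlx.core as mx
import mlx.nn as nn

class Attention(nn.Module):
    def __init__(self, dims: int, num_heads: int, use_qk_norm: bool):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dims // num_heads
        self.scale = head_dim**-0.5
        self.q_proj = nn.Linear(dims, dims, bias=False)
        self.k_proj = nn.Linear(dims, dims, bias=False)
        self.v_proj = nn.Linear(dims, dims, bias=False)
        self.o_proj = nn.Linear(dims, dims, bias=False)
        # The gist of the fix: only create the norms when the config asks
        # for them, so one model class can cover both Command R (no QK norm)
        # and Command R+ (QK norm enabled).
        self.use_qk_norm = use_qk_norm
        if use_qk_norm:
            self.q_norm = nn.LayerNorm(head_dim)
            self.k_norm = nn.LayerNorm(head_dim)

    def __call__(self, x: mx.array) -> mx.array:
        B, L, D = x.shape
        q = self.q_proj(x).reshape(B, L, self.num_heads, -1)
        k = self.k_proj(x).reshape(B, L, self.num_heads, -1)
        v = self.v_proj(x).reshape(B, L, self.num_heads, -1)
        if self.use_qk_norm:
            # Normalize queries and keys per head before computing scores
            q = self.q_norm(q)
            k = self.k_norm(k)
        q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))
        scores = (q * self.scale) @ k.transpose(0, 1, 3, 2)
        out = mx.softmax(scores, axis=-1) @ v
        return self.o_proj(out.transpose(0, 2, 1, 3).reshape(B, L, D))
```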

Machine setting:

sudo sysctl iogpu.wired_lwm_mb=100000

Command for generation:

python -m mlx_lm.generate --model mlx-community/c4ai-command-r-plus-4bit --prompt "Write a quicksort in c++" --temp 0.0 --max-tokens 256 --use-default-chat-template
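For anyone scripting this rather than using the CLI, a rough Python equivalent of the generation command above (treat this as a sketch: mlx_lm.generate's keyword arguments have changed across versions):

```python
# Rough Python equivalent of the CLI generation command above.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/c4ai-command-r-plus-4bit")

# Mirror --use-default-chat-template by applying the tokenizer's template
messages = [{"role": "user", "content": "Write a quicksort in c++"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The CLI's --temp 0.0 matched the library default around this time
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```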

Command for QLoRA:

python -m mlx_lm.lora --model mlx-community/c4ai-command-r-plus-4bit --data ../lora/data --train --iters 1000  --batch-size 1 --lora-layers 16

Blaizzy (Contributor) commented Apr 4, 2024

I was about to submit a PR; good thing I checked 😄.

Already uploaded the model to the hub.
https://huggingface.co/mlx-community/c4ai-command-r-plus-4bit

DenisSergeevitch commented:

@Blaizzy Thank you! How much RAM does it require to run the 4-bit quant?

awni (Member, Author) commented Apr 5, 2024

It needs about 65GB to generate with the 4-bit model. Generation is slow right now, though; I'm trying to debug the performance issue.
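For rough intuition on the 65GB figure (a back-of-envelope estimate, not from the thread): Command R+ has about 104B parameters, and 4-bit quantization with per-group scales costs roughly 4.5 bits per weight, so 104e9 × 4.5 / 8 ≈ 58GB for the weights alone; the KV cache and activations push that to around 65GB.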

Blaizzy (Contributor) commented Apr 5, 2024

> @Blaizzy Thank you! How much RAM does it require to run the 4-bit quant?

@DenisSergeevitch, as @awni said 👆🏽.

I can't run it myself; I use an M1 Air with 16GB :)

DenisSergeevitch commented:

Thank you, I'll wait for i_q1 then.

awni (Member, Author) commented Apr 5, 2024

By the way, to get this to run reasonably fast on an M2 Ultra you need to set the wired GPU memory lower limit appropriately. Something like:

sudo sysctl iogpu.wired_lwm_mb=100000

angeloskath (Member) left a comment

Thanks!

awni merged commit c386dd5 into main on Apr 5, 2024; 2 checks passed.
awni deleted the cohere_plus branch on Apr 5, 2024 at 21:11.
jeanromainroy commented:

I am running the 4-bit version of Command R+ and I consistently see GPU usage drop during generation, with performance becoming abysmal.

[Screenshot, 2024-04-08: GPU usage dropping during generation]

My machine is an M2 Ultra with 192GB, and:

ProductName: macOS
ProductVersion: 14.3
BuildVersion: 23D56

Blaizzy (Contributor) commented Apr 9, 2024

@awni 👆🏽

awni (Member, Author) commented Apr 9, 2024

@jeanromainroy did you set the memory limits? You could try making it larger:

sudo sysctl iogpu.wired_lwm_mb=150000

jeanromainroy commented Apr 9, 2024

Even after setting this,

sudo sysctl iogpu.wired_lwm_mb=150000

I still see the GPU usage dropping before the completion ends.

[Screenshot, 2024-04-09: GPU usage dropping before the completion ends]

awni (Member, Author) commented Apr 9, 2024

Do you mind opening an issue and including the command, the versions of MLX / MLX LM, the OS, etc.?
