Add support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS. IQ4_NL is not functional #3657

Closed
mann1x wants to merge 3 commits

Conversation

@mann1x (Contributor) commented Apr 15, 2024

This patch adds support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS.

IQ4_NL uses a different format; I still have to investigate what the differences are.
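
For reference, a rough sketch of the kind of file-type-to-name mapping a patch like this has to extend in ollama. This is illustrative only, not the actual diff: the `fileType` name, the constants, and the numeric values are my assumptions (they mirror llama.cpp's `LLAMA_FTYPE_MOSTLY_*` enum as I understand it), so verify them against llama.cpp before relying on them.

```go
// Hypothetical sketch: map GGUF "general.file_type" values for the new IQ
// quantizations to the names ollama would report. Values are assumptions;
// confirm against llama.cpp's enum before use.
package main

import "fmt"

type fileType uint32

const (
	fileTypeIQ1_S  fileType = 24 // assumed LLAMA_FTYPE_MOSTLY_IQ1_S
	fileTypeIQ4_NL fileType = 25 // assumed LLAMA_FTYPE_MOSTLY_IQ4_NL
	fileTypeIQ3_S  fileType = 26 // assumed LLAMA_FTYPE_MOSTLY_IQ3_S
	fileTypeIQ2_S  fileType = 28 // assumed LLAMA_FTYPE_MOSTLY_IQ2_S
	fileTypeIQ4_XS fileType = 30 // assumed LLAMA_FTYPE_MOSTLY_IQ4_XS
)

// String renders a file type as the familiar quantization name.
func (t fileType) String() string {
	switch t {
	case fileTypeIQ1_S:
		return "IQ1_S"
	case fileTypeIQ4_NL:
		return "IQ4_NL"
	case fileTypeIQ3_S:
		return "IQ3_S"
	case fileTypeIQ2_S:
		return "IQ2_S"
	case fileTypeIQ4_XS:
		return "IQ4_XS"
	default:
		return fmt.Sprintf("unknown (%d)", uint32(t))
	}
}

func main() {
	fmt.Println(fileType(30)) // prints "IQ4_XS" under the assumed numbering
}
```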

@jukofyork

Definitely need IQ4_XS:

https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

My only worry is the enum values will change and bugger things up in future.

@jmorganca, can you or another main dev have a look at this and confirm the order is likely to stay the same even if this PR isn't used?

@jukofyork

Works perfectly 👍

@sammcj (Contributor) left a comment

Works, nice!

WizardLM 2 8x22B IQ3_S on Macbook Pro M2 Max (96GB)

  • 56 GB RAM (with 16K context)
  • time to first token: 3.72s
  • speed: 11.06 tok/s

@mann1x (Contributor, Author) commented Apr 17, 2024

I have updated the patch to fix IQ4_NL.

@lowlyocean

This is great. Any specific reason IQ3_M isn't included?

@mann1x (Contributor, Author) commented Apr 19, 2024

> This is great. Any specific reason IQ3_M isn't included?

I think there's still something wrong with IQ3_M: you can quantize to it, but you can't run it with the main release.

If you want to keep an eye out for anything new and give a heads-up, watch these constants:
https://github.com/ggerganov/llama.cpp/blob/bca40e98149c7b673558ddd7a3ebeffef789349d/gguf-py/gguf/constants.py#L762

Check these constants against the latest release; if there's something new, we can add it once that release is included in ollama.
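
In case it helps, a rough sketch of one way to dump that enum for a given ref so two releases can be diffed. This is not part of ollama or llama.cpp, and the "scan until the class body ends" logic is a crude heuristic; the pinned commit hash is the one linked above, and you would swap in a release tag to compare.

```go
// Fetch gguf-py/gguf/constants.py at a given llama.cpp ref and print the
// body of the GGMLQuantizationType enum (heuristic: stop at the first blank
// or dedented line after the class header).
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	ref := "bca40e98149c7b673558ddd7a3ebeffef789349d" // or a release tag
	url := fmt.Sprintf(
		"https://raw.githubusercontent.com/ggerganov/llama.cpp/%s/gguf-py/gguf/constants.py", ref)

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	inEnum := false
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "class GGMLQuantizationType") {
			inEnum = true
			continue
		}
		if inEnum {
			// A blank or dedented line marks the end of the enum body.
			if strings.TrimSpace(line) == "" || !strings.HasPrefix(line, " ") {
				break
			}
			fmt.Println(strings.TrimSpace(line))
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```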

@zedmango

Any chance of getting IQ2M, IQ3XS, IQ3M, IQ4XS, IQ4 added? I really would like those.

@mann1x (Contributor, Author) commented Apr 19, 2024

@sammcj Do you have to approve it again?

@sammcj (Contributor) commented Apr 19, 2024

Nah, it's just waiting on someone with contributor-level access to merge it.

@BruceMacD self-assigned this May 3, 2024
@BruceMacD (Contributor)

Thanks for doing this @mann1x, this looks good. There's another ongoing PR (#3682) that moves some of this code around and is going in soon, so I'll get this merged once that's in to prevent conflicts.

If you'd like, you can use this branch I made to test the changes as a reference for how to rebase this branch once #3682 goes into main; otherwise I can just merge things through for you:
d40497b

@mann1x (Contributor, Author) commented May 4, 2024

@BruceMacD
I'm a bit overloaded lately; if you can do the merge I'd really appreciate it! Thanks.

@sammcj (Contributor) commented May 10, 2024

Any update on getting this merged?

Just went to create a model and was reminded these are missing.

ollama create meta-llama-3-70b-instruct-bartowski:iq2_m -f Modelfile-llama3
transferring model data
Error: invalid file magic
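
For context on that error, a minimal sketch of what a GGUF magic check looks like: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian version field. Whether that is exactly the check hit here is my assumption; the file path and the code below are illustrative only, not ollama's code path.

```go
// Illustrative only: read a GGUF header and report a bad magic, roughly the
// condition an "invalid file magic" error describes.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("model.gguf") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var magic [4]byte
	if _, err := io.ReadFull(f, magic[:]); err != nil {
		panic(err)
	}
	if string(magic[:]) != "GGUF" {
		fmt.Printf("invalid file magic: %q\n", magic[:])
		return
	}

	var version uint32
	if err := binary.Read(f, binary.LittleEndian, &version); err != nil {
		panic(err)
	}
	fmt.Println("GGUF version:", version)
}
```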

@BruceMacD (Contributor)

@sammcj I've rebased these changes onto the new structure in main in #4322, hoping to get it merged for v0.1.36. Thanks for bringing this to our attention originally.

Closing this PR now to carry the commit in #4322.

@BruceMacD closed this May 10, 2024
@sammcj (Contributor) commented May 11, 2024

Legend, thanks Bruce!

@sammcj (Contributor) commented May 25, 2024

Just wanted to say thanks again for this; it's really amazing being able to run 70B models on a single 24GB GPU at a decent speed without having them degrade* in quality to the point where smaller models make more sense.

For example, my single RTX 3090 server can now run Llama 3 70B with the now-supported iq2_xs quant and achieve 21 tok/s with Ollama 🎉

ollama run meta-llama-3-70b-instruct-maziyarpanahi:iq2_xs tell me a short joke --verbose
Here's one:

Why did the computer go to the doctor?

It had a virus!

Hope that made you laugh!

total duration:       1.685537801s
load duration:        552.816µs
prompt eval count:    14 token(s)
prompt eval duration: 455.07ms
prompt eval rate:     30.76 tokens/s
eval count:           25 token(s)
eval duration:        1.188925s
eval rate:            21.03 tokens/s   <---

ollama ps
NAME                                           	ID          	SIZE 	PROCESSOR	UNTIL
meta-llama-3-70b-instruct-maziyarpanahi:iq2_xs	a5fe03111c70	23 GB	100% GPU 	43 minutes from now

(screenshot: SCR-20240525-pjab-2)

*source
