Ollama fails to create models when using IQ quantized GGUFs - Error: invalid file magic #3622
Comments
@sammcj IQ3_XS is not supported. These are the quantizations currently supported in the main release:
const (
fileTypeF32 uint32 = iota
fileTypeF16
fileTypeQ4_0
fileTypeQ4_1
fileTypeQ4_1_F16
fileTypeQ8_0 uint32 = iota + 2
fileTypeQ5_0
fileTypeQ5_1
fileTypeQ2_K
fileTypeQ3_K_S
fileTypeQ3_K_M
fileTypeQ3_K_L
fileTypeQ4_K_S
fileTypeQ4_K_M
fileTypeQ5_K_S
fileTypeQ5_K_M
fileTypeQ6_K
fileTypeIQ2_XXS
fileTypeIQ2_XS
fileTypeQ2_K_S
fileTypeQ3_K_XS
fileTypeIQ3_XXS
)
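Aside: a minimal sketch, not ollama's code, of why the "uint32 = iota + 2" line is there. Assuming llama.cpp's file-type numbering, it skips the two slots that belonged to the removed Q4_2/Q4_3 types, and counting on from there the last supported entry (fileTypeIQ3_XXS) lands on 23.

package main

import "fmt"

const (
    fileTypeF32 uint32 = iota // 0
    fileTypeF16               // 1
    fileTypeQ4_0              // 2
    fileTypeQ4_1              // 3
    fileTypeQ4_1_F16          // 4
    // "iota + 2" skips slots 5 and 6, the removed Q4_2/Q4_3 file types,
    // which keeps the values aligned with llama.cpp's file-type IDs
    fileTypeQ8_0 uint32 = iota + 2 // 7
    fileTypeQ5_0                   // 8
)

func main() {
    fmt.Println(fileTypeF32, fileTypeQ4_1_F16, fileTypeQ8_0, fileTypeQ5_0) // 0 4 7 8
}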
Thanks @mann1x, that's interesting, any idea why that might be? IQ3_XS seems like a bit of a sweet spot, as I think it's usually pretty much as good as IQ4 but still much smaller, whereas IQ3_XXS is a noticeable drop.
They will be supported in the future, not sure when.
Ah, I haven't actually noticed they're that much slower than the K quants, maybe I should try running Q3_K_M instead of IQ3_XS on my Macbook 🤔
To be honest, anything below Q4 is poor quality; better to pick a smaller model.
They're not |
Do you have any data to support the claim that a smaller model with a higher quant will outperform a larger model with a smaller quant? As long as ollama only supports GGUF, I don't know how "other formats better suited for 2/3 bit" is relevant to this discussion.
+1 to requesting support for the rest of the IQ quants. I'm especially interested in IQ4_NL, personally. An IQ4_NL quant of Command-R with 2K context fits and works on a 24 GiB card. A Q4_K quant of the same goes OOM after about 200 tokens of context.
I don't know enough to tell for sure; do you have any reference? https://huggingface.co/Lewdiculous/Eris_7B-GGUF-IQ-Imatrix From what I understood, the IQ quants are just another format and you can just quantize a model with them, but it will be very inefficient and you lose the size-reduction advantage.
Not right now; there are still problems with the K-quants and more pressing items, so it's not much of a priority for llama.cpp or ollama.
I didn't test them myself, but I've seen benchmarks (not very recent) where the t/s went down from 20-25 to 15-20.
I mean to create the i-matrix
ollama uses llama.cpp as its backend, so anything about llama.cpp is relevant. I never claimed that "a smaller model with a higher quant will outperform a larger model with a smaller quant".
An IQ quant is a new quantization format for GGUF files. ggerganov/llama.cpp#4773
There's no point trying to disprove an opinion. All of us are personally interested.
Positive. I created the i-matrices and quantized the models myself. I've since read there might be slowdown while offloading, which I'm not doing. On my GPU, performance is the same.
I mean the whole process. It depends on the model, of course. A 32B takes a few minutes. A 72B takes a couple hours, but I don't think I can realistically run a model that big. Smaller models would probably be seconds.
Last I checked, llama.cpp only uses GGUF too, so my point stands. I see you've linked to a thread about a conversion script. That converts to GGUF. So we're back to GGUF again. Starting to smell like bad faith around here.
"To be honest anything below Q4 is poor quality, better to pick a smaller model." - You.
See above.
You tell me. That's my point.
So you say X, I ask for evidence of X, you claim not to have said X, then say X again, claim everyone says X and that it's obvious why they say X, again, without evidence. From what I've read, most people actually say Y, also without evidence. That's why I asked for evidence. Because I'd like to know. Actual benchmarks would be nice. Much better than empty claims. I'm done arguing with you, "for obvious reasons." @sammcj I was trying to defend your point. Maybe you missed that. Oh well
I'm done arguing too, there's really no obvious reason why you should attack me or defend @sammcj... But thanks for all the useful information and the tip about the chunk size, I'll try that!
Made a PR to support the latest IQ formats: #3657
IQ4_NL is now fixed. They work pretty nicely for me, but only on the GPU. With the latest llama.cpp I can create the imatrix.dat for Starling-LM-7B-beta in less than 2 minutes, and the quantization is barely slower than the normal one.
Made a quick benchmark, Ryzen 5950X and RTX 3090. Be careful with IQ3_XXS, it's a CPU killer :)
[Benchmark results: Q4_0 GPU; Q4_0 CPU [66°C]; IQ4_XS GPU; IQ4_XS CPU [70°C]; IQ3_XXS GPU; IQ3_XXS CPU [80°C]; IQ3_S GPU; IQ2_XXS GPU; IQ2_XS GPU; IQ2_S GPU; IQ1_S GPU; IQ4_NL GPU]
Size of the files:
We definitely need IQ4_XS: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9 But I'm a bit afraid of using this PR in case it buggers up all the imported models if/when the enum order changes...
The enum order doesn't matter; the type is checked per tensor:
func (t Tensor) typeSize() uint64 {
blockSize := t.blockSize()
switch t.Kind {
case 0: // FP32
return 4
case 1: // FP16
return 2
case 2: // Q4_0
return 2 + blockSize/2
case 3: // Q4_1
return 2 + 2 + blockSize/2
case 6: // Q5_0
return 2 + 4 + blockSize/2
case 7: // Q5_1
return 2 + 2 + 4 + blockSize/2
case 8: // Q8_0
return 2 + blockSize
case 9: // Q8_1
return 4 + 4 + blockSize
case 10: // Q2_K
return blockSize/16 + blockSize/4 + 2 + 2
case 11: // Q3_K
return blockSize/8 + blockSize/4 + 12 + 2
case 12: // Q4_K
return 2 + 2 + 12 + blockSize/2
case 13: // Q5_K
return 2 + 2 + 12 + blockSize/8 + blockSize/2
case 14: // Q6_K
return blockSize/2 + blockSize/4 + blockSize/16 + 2
case 15: // Q8_K
return 2 + blockSize + 2*blockSize/16
case 16: // IQ2_XXS
return 2 + 2*blockSize/8
case 17: // IQ2_XS
return 2 + 2*blockSize/8 + blockSize/32
case 18: // IQ3_XXS
return 2 + 3*blockSize/8
default:
return 0
}
}
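For the newer IQ types, corresponding per-block sizes would need to be added. As a rough sketch only (not taken from the PR), assuming ggml's type IDs (IQ4_NL = 20, IQ4_XS = 23) and llama.cpp's block layouts at the time:

// Sketch of per-block byte sizes for two of the newer IQ formats; the IDs
// and layouts are assumptions based on llama.cpp's ggml definitions.
func iqTypeSize(kind uint32, blockSize uint64) uint64 {
    switch kind {
    case 20: // IQ4_NL: fp16 scale + 4-bit quants, 32-element blocks
        return 2 + blockSize/2
    case 23: // IQ4_XS: fp16 scale + 16-bit scale bits + packed low scales + 4-bit quants, 256-element blocks
        return 2 + 2 + blockSize/64 + blockSize/2
    default:
        return 0
    }
}

With a 32-element block that works out to 18 bytes per IQ4_NL block (4.5 bits per weight), and 136 bytes per 256-element IQ4_XS block (about 4.25 bpw).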
So it's definitely not stored anywhere in Ollama's metadata files (that was my main worry)? |
Definitely not, the file is parsed every time it's loaded. |
Thanks! I'll give it a try later and report back. Hopefully it gets accepted soon. |
According to this table: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9 The 8x22B model (which has roughly 141B parameters, be it WizardLM or not) would have IQ3_XS at 58GB, which may be just the sweet spot for people with 64GB memory (Mac or PC).
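As a rough sanity check, assuming IQ3_XS sits around 3.3 bits per weight: 141e9 parameters × 3.3 / 8 ≈ 58 GB, so the figure in the table looks about right.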
If you get that going, would you mind posting performance numbers? |
Let it go, I don't mind :) It's just a misunderstanding. I'm not giving up of course! But I'd like to have some help, another pair of eyes. |
Bingo, exactly my use case. Obviously if it's a lot slower than, say, Q3_something it may not be worth it, but if there's not much in it, definitely a win.
No I haven't got it running yet. I would expect it to be pretty slow on PC using CPU, but Mac with greater memory bandwidth should be pretty usable. |
"If you get that going, would you mind posting performance numbers?" |
I have updated the PR to fix IQ4_NL support; I will add the benchmark to the table above.
Any chance of getting IQ2M, IQ3XS, IQ3M, IQ4XS, IQ4 added? I really would like those. |
Thank you |
What is the issue?
Creating an Ollama model from a standard IQ quantized GGUF fails with "Error: invalid file magic"
I've tried with pre-built Ollama packages and compiling Ollama from source.
For the output here I am using the latest Ollama built from main.
llama.cpp and lm-studio
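For context on the error text: a GGUF file is expected to begin with the 4-byte magic "GGUF" followed by a version number. A minimal sketch (not Ollama's actual check, and assuming the usual little-endian layout) for inspecting the header of a file that gets rejected:

package main

import (
    "encoding/binary"
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: ggufheader <file.gguf>")
        os.Exit(1)
    }
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var hdr struct {
        Magic   [4]byte // should read "GGUF"
        Version uint32  // GGUF container version
    }
    if err := binary.Read(f, binary.LittleEndian, &hdr); err != nil {
        panic(err)
    }
    fmt.Printf("magic=%q version=%d\n", string(hdr.Magic[:]), hdr.Version)
}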
Model
Seems to happen with all IQ3 based models I've found.
For example, here I've tried with zephyr-orpo-141b-A35b-v0.1 at IQ3_XS
Modelfile
What did you expect to see?
The model to be successfully imported, the same as any non-IQ quant GGUF.
Steps to reproduce
As per above
gguf-split --merge <first gguf file> <output file>
as it seems Ollama doesn't support multi-file models (see log below)
Are there any recent changes that introduced the issue?
I think it's always been a problem, at least whenever I've tried it
OS
macOS
Architecture
arm64
Platform
No response
Ollama version
main, v0.1.31
GPU
Apple
GPU info
96GB M2 Max
CPU
Apple
Other software
Merge multi-part GGUF using gguf-split
llama.cpp load logs (without Ollama)
ollama serve logs