Replies: 27 comments
-
Ah ok, maybe #1
-
It is rather slow; q8 is the fastest, I guess.
-
Currently, it only supports running on the CPU. The CPU performance on Colab is not very strong, which results in slower processing. I'm currently working on optimizing its CPU performance and adding support for GPU acceleration.
-
My old Skylake PC takes about 38 s per step for the 8-bit model (OpenBLAS doesn't seem to make a difference). My old laptop from 2016 needs 90 s per step with the 8-bit model.
-
Sample picture test on M1 16G: 5-bit, 512x768, 15 steps, euler a. 16-bit: memory < 3 GB, 23 s/step.
-
@czkoko are you using the SD 1.5 ggml base model? I think your result is just too good for a base model.
-
@juniofaathir The SD 1.5 base model can't generate such a portrait; I use epicrealism.
-
@czkoko you can use that model? I've been trying some Civitai models and converting them, but it didn't work, as in #8.
-
@juniofaathir It works fine for me. You can try the model I mentioned, or other trained models, and filter out merged models.
-
@czkoko
-
Linking my tests using CUDA acceleration (cuBLAS) here: #6 (comment)
-
@juniofaathir Most of the SD 1.x models from Civitai are working fine, except for a few that include control model weights. I'm currently researching how to adapt these models.
-
@leejet hey, this implementation seems to use a very low amount of RAM, lower and faster than using ONNX f16 models. Thank you for your efforts! It seems like the peak RAM usage stays at a minimum of 1.4 GB when generating 256×384 images with the current "q4_0" method! Are you choosing a specific "recipe", like the one explained here? https://huggingface.co/blog/stable-diffusion-xl-coreml Looking at the current composition of the model, using these mixed quantization methods seems better than creating distilled models; they can be tailored and optimized for individual models.
-
Here's something interesting: I almost got a full generation on a 2 GB 32-bit mobile phone before running out of RAM. If someone has a better 32-bit ARM device, please check whether the generation succeeds.
-
This is determined by the characteristics of the ggml library: quantization can only be applied to the weights of fully connected layers, while the weights of convolutional layers can only be stored as f16.
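To make that rule concrete, here is a minimal C++ sketch of how a converter could pick a per-tensor storage type under this constraint. It is not the project's actual converter code; the names, the enum, and the choice to keep 1-D tensors (biases, norm parameters) in f32 are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

// Illustrative stand-in for the library's tensor type ids.
enum class StorageType { F32, F16, Q8_0, Q4_0 };

struct TensorInfo {
    std::string name;     // e.g. a UNet weight name
    int         n_dims;   // number of dimensions
    int64_t     shape[4]; // sizes per dimension
};

// Pick the on-disk type for one tensor, given the user-requested quantization.
StorageType choose_storage_type(const TensorInfo& t, StorageType requested) {
    if (t.n_dims == 1) {
        // Biases and norm weights are tiny; keeping them unquantized in f32
        // costs almost nothing (an assumption of this sketch).
        return StorageType::F32;
    }
    if (t.n_dims == 4) {
        // 4-D tensors are convolution kernels; block quantization does not
        // apply to them, so they stay in f16, as described above.
        return StorageType::F16;
    }
    // 2-D weights (attention projections, linear/feed-forward layers) are the
    // ones that can actually be quantized to q8_0, q4_0, etc.
    return requested;
}
```

A side effect of this rule is that the convolutional part of the UNet keeps its f16 footprint no matter which quant type you pick, so file size and memory shrink less than the chosen bit width alone would suggest.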
-
Tested with q4_0 of the default v1.4 checkpoint.
-
@ClashSAN The project works well on Android, so maybe @leejet wants to update the supported platform list.
-
Glad to hear that. I'll update the documentation later.
-
By the way, I've made a small optimization to make inference faster. I've tested it and it provides a
-
@leejet do I need to run make again?
-
Yes, you need to run make again.
-
FYI, GGML is deprecated and replaced by GGUF, so people might want to slow down on creating ggml files in advance :>
-
I've created a new benchmark category in the discussion forum and posted some benchmark information. You can also share your benchmark information there if you'd like: https://github.com/leejet/stable-diffusion.cpp/discussions/categories/benchmark
-
I'm really digging this project. It's pretty interesting how the timing and memory usage don't really change based on the precision, unlike llama.cpp where speed scales linearly with precision (so q8 is twice as fast as f16). Whether it's f32, f16, q8, or q4, they all take about the same time and memory. Also want to say it's noticeably slower than the OpenVINO version of Stable Diffusion, so there's definitely room for improvement.
-
You can try mnn-diffusion: on an Android phone with a Snapdragon 8 Gen 3 it can reach 2 s/iter, and 1 s/iter on an Apple M3, with 512x512 images. Reference: https://zhuanlan.zhihu.com/p/721798565
-
Can you share how many seconds per step or it/s you get with your hardware (CPU/GPU/RAM)?