Replies: 27 comments
-
Ah ok, maybe #1
-
It is rather slow; q8 is the fastest, I guess.
-
Currently, it only supports running on the CPU. The CPU performance on Colab is not very strong, which results in slower processing. I'm currently working on optimizing its CPU performance and adding support for GPU acceleration.
-
My old Skylake PC takes about 38 s per step for the 8-bit model (OpenBLAS doesn't seem to make a difference). My old laptop from 2016 needs 90 s per step with the 8-bit model.
-
Sample picture test on M1 16G: 5-bit, 512x768, 15 steps, euler a. 16-bit: memory < 3 GB, 23 s/step.
-
@czkoko are you using the SD 1.5 ggml base model? I think your result is just too good for a base model.
-
@juniofaathir The SD 1.5 base model can't generate such a portrait; I use epicrealism.
-
@czkoko you can use that model? I've been trying some Civitai models and converting them, but it didn't work, as in #8.
-
@juniofaathir It works fine for me. You can try the model I mentioned, or other trained models, and filter out merged models.
-
@czkoko
-
Linking my tests using CUDA acceleration (cuBLAS) here: #6 (comment)
-
@juniofaathir Most of the SD 1.x models from Civitai are working fine, except for a few that include control model weights. I'm currently researching how to adapt these models.
-
@leejet hey, this implementation seems to use a very low amount of RAM, lower and faster than using ONNX f16 models. Thank you for your efforts! It seems like the peak RAM usage stays at a minimum of 1.4 GB when generating 256×384 images with the current "q4_0" method! Are you choosing a specific "recipe", like the one explained here? https://huggingface.co/blog/stable-diffusion-xl-coreml Looking at the current composition of the model, using these mixed quantization methods seems better than creating distilled models; they can be tailored and optimized for individual models.
-
Here's something interesting: I almost got a full generation on a 2 GB 32-bit mobile phone before running out of RAM. If someone has a better 32-bit ARM device, please check whether the generation succeeds.
-
This is determined by the characteristics of the ggml library: quantization can only be applied to the weights of fully connected layers, while the weights of convolutional layers can only be stored as f16.
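To make that rule concrete, here is a minimal C++ sketch of how a converter could pick a per-tensor storage type under this constraint. It is not the project's actual converter code; the names, the enum, and the choice to keep 1-D tensors (biases, norm parameters) in f32 are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

// Illustrative stand-in for the library's tensor type ids.
enum class StorageType { F32, F16, Q8_0, Q4_0 };

struct TensorInfo {
    std::string name;     // e.g. a UNet weight name
    int         n_dims;   // number of dimensions
    int64_t     shape[4]; // sizes per dimension
};

// Pick the on-disk type for one tensor, given the user-requested quantization.
StorageType choose_storage_type(const TensorInfo& t, StorageType requested) {
    if (t.n_dims == 1) {
        // Biases and norm weights are tiny; keeping them unquantized in f32
        // costs almost nothing (an assumption of this sketch).
        return StorageType::F32;
    }
    if (t.n_dims == 4) {
        // 4-D tensors are convolution kernels; block quantization does not
        // apply to them, so they stay in f16, as described above.
        return StorageType::F16;
    }
    // 2-D weights (attention projections, linear/feed-forward layers) are the
    // ones that can actually be quantized to q8_0, q4_0, etc.
    return requested;
}
```

A side effect of this rule is that the convolutional part of the UNet keeps its f16 footprint no matter which quant type you pick, so file size and memory shrink less than the chosen bit width alone would suggest.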
-
Tested with q4_0 of the default v1.4 checkpoint.
-
@ClashSAN The project works well on Android, so maybe @leejet wants to update the supported platform list.
-
Glad to hear that. I'll update the documentation later.
-
By the way, I've made a small optimization to make inference faster. I've tested it and it provides a
-
@leejet do I need to run make again?
-
Yes, you need to run make again.
-
FYI, GGML is deprecated and replaced by GGUF, so people might want to slow down on creating ggml files in advance :>
-
I've created a new benchmark category in the discussion forum and posted some benchmark information. You can also share your benchmark information there if you'd like: https://github.com/leejet/stable-diffusion.cpp/discussions/categories/benchmark
-
I'm really digging this project. It's pretty interesting how the timing and memory usage don't really change based on the precision, unlike llama.cpp where speed scales linearly with precision (so q8 is twice as fast as f16). Whether it's f32, f16, q8, or q4, they all take about the same time and memory. Also want to say it's noticeably slower than the OpenVINO version of Stable Diffusion, so there's definitely room for improvement.
-
You can try mnn-diffusion: on an Android phone with a Snapdragon 8 Gen 3 it can reach 2 s/iter, and 1 s/iter on an Apple M3, with 512x512 images. Reference: https://zhuanlan.zhihu.com/p/721798565
-
Can you share how many seconds per step or it/s you get with your hardware (CPU/GPU/RAM)?