First impressions info dump #1

@Green-Sky

Hey, finally stable diffusion for ggml 😄

Did a test run:

$ ./sd -t 8 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal"
[INFO]  stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[INFO]  stable-diffusion.cpp:2214 - ftype: q8_0
[INFO]  stable-diffusion.cpp:2259 - params ctx size =  1618.72 MB
[INFO]  stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.46s
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.16s
[INFO]  stable-diffusion.cpp:2830 - start sampling
[INFO]  stable-diffusion.cpp:2674 - step 1 sampling completed, taking 18.34s
[INFO]  stable-diffusion.cpp:2674 - step 2 sampling completed, taking 18.24s
[INFO]  stable-diffusion.cpp:2674 - step 3 sampling completed, taking 18.65s
[INFO]  stable-diffusion.cpp:2674 - step 4 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 5 sampling completed, taking 18.31s
[INFO]  stable-diffusion.cpp:2674 - step 6 sampling completed, taking 18.18s
[INFO]  stable-diffusion.cpp:2674 - step 7 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 8 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 9 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 10 sampling completed, taking 18.28s
[INFO]  stable-diffusion.cpp:2674 - step 11 sampling completed, taking 18.19s
[INFO]  stable-diffusion.cpp:2674 - step 12 sampling completed, taking 18.00s
[INFO]  stable-diffusion.cpp:2674 - step 13 sampling completed, taking 18.03s
[INFO]  stable-diffusion.cpp:2674 - step 14 sampling completed, taking 18.54s
[INFO]  stable-diffusion.cpp:2674 - step 15 sampling completed, taking 18.32s
[INFO]  stable-diffusion.cpp:2674 - step 16 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 17 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 18 sampling completed, taking 18.51s
[INFO]  stable-diffusion.cpp:2674 - step 19 sampling completed, taking 18.62s
[INFO]  stable-diffusion.cpp:2674 - step 20 sampling completed, taking 18.11s
[INFO]  stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO]  stable-diffusion.cpp:2835 - sampling completed, taking 366.14s
[INFO]  stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO]  stable-diffusion.cpp:2842 - decode_first_stage completed, taking 57.66s
[INFO]  stable-diffusion.cpp:2843 - txt2img completed in 423.96s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

[output image: output.png]
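As a quick sanity check (plain arithmetic, not part of the tool): the twenty per-step times in the log sum exactly to the reported sampling total, and sampling + conditioning + VAE decode account for the full txt2img time.

```python
# Sanity check on the log above: per-step times vs. the reported totals.
steps = [18.34, 18.24, 18.65, 18.41, 18.31, 18.18, 18.21, 18.29, 18.21, 18.28,
         18.19, 18.00, 18.03, 18.54, 18.32, 18.41, 18.29, 18.51, 18.62, 18.11]
print(f"{sum(steps):.2f}")                 # 366.14 -> "sampling completed"
print(f"{sum(steps) + 0.16 + 57.66:.2f}")  # 423.96 -> "txt2img completed"
```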

Pain point: the extra Python libs needed for conversion. I got a pip install error because I already have an incompatible version of something installed, but convert.py worked anyway. :)

Timings: I converted with the q8_0, q4_0, and f16 quantizations and ran with different thread counts on my 12-core (24-thread) CPU. Each number is the time of a single sampling step (a small script for extracting these from the log follows the table).

| threads | q8_0   | q4_0   | f16    |
|---------|--------|--------|--------|
| -t 1    | 75.31s | 75.20s | 82.92s |
| -t 2    | 42.44s |        |        |
| -t 4    | 28.65s | 29.23s | 30.00s |
| -t 6    | 21.68s |        |        |
| -t 8    | 18.34s | 18.89s | 19.05s |
| -t 10   | 16.38s | 16.78s | 17.61s |
| -t 12   | 16.26s | 16.98s | 18.11s |
| -t 14   | 17.93s |        |        |
| -t 16   | 16.80s |        |        |
| -t 18   | 16.70s |        |        |
| -t 20   | 16.20s |        |        |
| -t 22   | 16.96s |        |        |
| -t 24   | 18.93s |        |        |
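For reference, a minimal sketch of how I pull the per-step times out of an sd log to get these numbers (my own helper, not part of the repo):

```python
import re
import statistics
import sys

# Summarize "step N sampling completed, taking Xs" lines from an sd run,
# e.g.: ./sd -t 8 -m model.bin -p "..." 2>&1 | python steptime.py
STEP_RE = re.compile(r"step \d+ sampling completed, taking ([0-9.]+)s")

times = [float(m.group(1)) for m in STEP_RE.finditer(sys.stdin.read())]
if times:
    print(f"{len(times)} steps: mean {statistics.mean(times):.2f}s, "
          f"min {min(times):.2f}s, max {max(times):.2f}s")
```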

Additional questions:

  1. Do you have / plan to add support for token weighting? (e.g. (cinematic:1.3); a rough sketch of that syntax follows this list)
  2. Are you looking into supporting the cuda/opencl backends from ggml?
  3. Are you looking into k-quants (like llama.cpp) and some form of quality measurement for quantizations? (since k-quants use different quants for different parts of the model)
  4. It would be nice if the tool printed the "system line" (see https://github.com/ggerganov/llama.cpp/blob/f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e/llama.cpp#L4267).
  5. I did not see it mentioned: does it support SD 2.x, or do you plan to add support for that?
  6. My little benchmark suggests the bottleneck is not the model file but the dynamic data. What number type do you use for it? llama.cpp has shown little to no quality degradation when using f16 instead of f32 for the kv-cache.
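For question 1, this is the A1111-style syntax I mean; a minimal parser sketch to illustrate (my own code, not sd.cpp's):

```python
import re

# Split a prompt into (text, weight) pairs using the "(text:weight)"
# syntax, e.g. "(cinematic:1.3)". Plain chunks get the default weight 1.0.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt: str, default: float = 1.0):
    out, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        if m.start() > pos:  # unweighted text before this match
            out.append((prompt[pos:m.start()], default))
        out.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):    # trailing unweighted text
        out.append((prompt[pos:], default))
    return out

print(parse_weights("alps, (cinematic:1.3), intricate details"))
# [('alps, ', 1.0), ('cinematic', 1.3), (', intricate details', 1.0)]
```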

edit: added f16 timings
