First impressions info dump #1

@Green-Sky

Hey, finally stable diffusion for ggml 😄

Did a test run:

$ ./sd -t 8 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal"
[INFO]  stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[INFO]  stable-diffusion.cpp:2214 - ftype: q8_0
[INFO]  stable-diffusion.cpp:2259 - params ctx size =  1618.72 MB
[INFO]  stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.46s
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.16s
[INFO]  stable-diffusion.cpp:2830 - start sampling
[INFO]  stable-diffusion.cpp:2674 - step 1 sampling completed, taking 18.34s
[INFO]  stable-diffusion.cpp:2674 - step 2 sampling completed, taking 18.24s
[INFO]  stable-diffusion.cpp:2674 - step 3 sampling completed, taking 18.65s
[INFO]  stable-diffusion.cpp:2674 - step 4 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 5 sampling completed, taking 18.31s
[INFO]  stable-diffusion.cpp:2674 - step 6 sampling completed, taking 18.18s
[INFO]  stable-diffusion.cpp:2674 - step 7 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 8 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 9 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 10 sampling completed, taking 18.28s
[INFO]  stable-diffusion.cpp:2674 - step 11 sampling completed, taking 18.19s
[INFO]  stable-diffusion.cpp:2674 - step 12 sampling completed, taking 18.00s
[INFO]  stable-diffusion.cpp:2674 - step 13 sampling completed, taking 18.03s
[INFO]  stable-diffusion.cpp:2674 - step 14 sampling completed, taking 18.54s
[INFO]  stable-diffusion.cpp:2674 - step 15 sampling completed, taking 18.32s
[INFO]  stable-diffusion.cpp:2674 - step 16 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 17 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 18 sampling completed, taking 18.51s
[INFO]  stable-diffusion.cpp:2674 - step 19 sampling completed, taking 18.62s
[INFO]  stable-diffusion.cpp:2674 - step 20 sampling completed, taking 18.11s
[INFO]  stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO]  stable-diffusion.cpp:2835 - sampling completed, taking 366.14s
[INFO]  stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO]  stable-diffusion.cpp:2842 - decode_first_stage completed, taking 57.66s
[INFO]  stable-diffusion.cpp:2843 - txt2img completed in 423.96s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

[output image: output.png]
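As a quick sanity check (plain arithmetic, not part of the tool): the twenty per-step times in the log sum exactly to the reported sampling total, and sampling + conditioning + VAE decode account for the full txt2img time.

```python
# Sanity check on the log above: per-step times vs. the reported totals.
steps = [18.34, 18.24, 18.65, 18.41, 18.31, 18.18, 18.21, 18.29, 18.21, 18.28,
         18.19, 18.00, 18.03, 18.54, 18.32, 18.41, 18.29, 18.51, 18.62, 18.11]
print(f"{sum(steps):.2f}")                 # 366.14 -> "sampling completed"
print(f"{sum(steps) + 0.16 + 57.66:.2f}")  # 423.96 -> "txt2img completed"
```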

Pain point: the extra Python libs needed for conversion. I got a pip install error because I already have an incompatible version of something installed, but convert.py worked anyway. :)

Timings: I converted with the q8_0, q4_0, and f16 quantizations and ran with different thread counts on my 12-core (24-thread) CPU. Each number is the time of a single sampling step (a small script for extracting these from the log follows the table).

| threads | q8_0   | q4_0   | f16    |
|---------|--------|--------|--------|
| -t 1    | 75.31s | 75.20s | 82.92s |
| -t 2    | 42.44s |        |        |
| -t 4    | 28.65s | 29.23s | 30.00s |
| -t 6    | 21.68s |        |        |
| -t 8    | 18.34s | 18.89s | 19.05s |
| -t 10   | 16.38s | 16.78s | 17.61s |
| -t 12   | 16.26s | 16.98s | 18.11s |
| -t 14   | 17.93s |        |        |
| -t 16   | 16.80s |        |        |
| -t 18   | 16.70s |        |        |
| -t 20   | 16.20s |        |        |
| -t 22   | 16.96s |        |        |
| -t 24   | 18.93s |        |        |
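For reference, a minimal sketch of how I pull the per-step times out of an sd log to get these numbers (my own helper, not part of the repo):

```python
import re
import statistics
import sys

# Summarize "step N sampling completed, taking Xs" lines from an sd run,
# e.g.: ./sd -t 8 -m model.bin -p "..." 2>&1 | python steptime.py
STEP_RE = re.compile(r"step \d+ sampling completed, taking ([0-9.]+)s")

times = [float(m.group(1)) for m in STEP_RE.finditer(sys.stdin.read())]
if times:
    print(f"{len(times)} steps: mean {statistics.mean(times):.2f}s, "
          f"min {min(times):.2f}s, max {max(times):.2f}s")
```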

Additional questions:

  1. Do you have / plan to add support for token weighting? (e.g. (cinematic:1.3); a rough sketch of that syntax follows this list)
  2. Are you looking into supporting the cuda/opencl backends from ggml?
  3. Are you looking into k-quants (like llama.cpp) and some form of quality measurement for quantizations? (since k-quants use different quants for different parts of the model)
  4. It would be nice if the tool printed the "system line" (see https://github.com/ggerganov/llama.cpp/blob/f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e/llama.cpp#L4267).
  5. I did not see it mentioned: does it support SD 2.x, or do you plan to add support for that?
  6. My little benchmark suggests the bottleneck is not the model file but the dynamic data. What number type do you use for it? llama.cpp has shown little to no quality degradation when using f16 instead of f32 for the kv-cache.
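For question 1, this is the A1111-style syntax I mean; a minimal parser sketch to illustrate (my own code, not sd.cpp's):

```python
import re

# Split a prompt into (text, weight) pairs using the "(text:weight)"
# syntax, e.g. "(cinematic:1.3)". Plain chunks get the default weight 1.0.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt: str, default: float = 1.0):
    out, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        if m.start() > pos:  # unweighted text before this match
            out.append((prompt[pos:m.start()], default))
        out.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):    # trailing unweighted text
        out.append((prompt[pos:], default))
    return out

print(parse_weights("alps, (cinematic:1.3), intricate details"))
# [('alps, ', 1.0), ('cinematic', 1.3), (', intricate details', 1.0)]
```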

edit: added f16 timings
