Replies: 30 comments 114 replies
-
I see that you're using --use-cpu clip now; have you found it makes a real difference for you?
I've had a fair amount of trouble getting MultiDiffusion and the Tiled VAE working correctly on my RX 590. How are you using it?
-
Hi Miraihi,
TY.
-
"Can be complemented by the additional arguments --sub-quad-q-chunk-size, --sub-quad-kv-chunk-size, --sub-quad-chunk-threshold. Experiment and play around with the numbers." I want to try this once today's bug gets resolved; can you give me some numbers to try? I'm on an RX 5700 XT.
"set SAFETENSORS_FAST_GPU=1 (Place it at the separate line at webui-user.bat) - I'm pretty sure this argument helped me to push my image size boundaries to about 200 pixels upwards! And, well, didn't hurt for sure."
"Tagger is your alternative to the native interrogators" "Image editor plugins" "Models" "Samplers" "Upscaling"
TY for sharing, it helps to get a new perspective.
-
I've got some experience to share too.
-
With the latest support for torch 2.0 in webui-directml, I tried the new --opt-sdp-attention. It's just a small improvement, but it's worth updating your local repo to torch 2.0 and using --opt-sdp-attention (if you're currently using --opt-sub-quad-attention).
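For readers curious what --opt-sdp-attention actually computes: torch 2.0 exposes a fused kernel (`torch.nn.functional.scaled_dot_product_attention`) for the attention formula softmax(QKᵀ/√d)·V. Here is a pure-Python toy for a single query vector, purely illustrative and not the webui's implementation:

```python
import math

def sdp_attention(q, k, v):
    """Toy scaled dot product attention for one query vector.

    q: list[float] of length d; k and v: lists of such vectors.
    Returns the softmax(q . k_i / sqrt(d))-weighted sum of v.
    """
    d = len(q)
    # Scaled dot products between the query and every key.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in k]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * vec[j] for w, vec in zip(weights, v)) for j in range(len(v[0]))]

# With identical keys the softmax weights are uniform, so the result
# is simply the average of the value vectors.
print(sdp_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]]))
```

The fused kernel computes the same thing in one pass without materializing the full score matrix, which is where the memory savings come from.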
-
To contribute to the discussion, I ran the same prompt multiple times with different launch settings to see the differences. Setup:
Prompt parameters:
Methodology:
With:
Without:
-
Hi all. I just did a git pull today and found that memory consumption is greatly improved.
-
My experience with parameters: after I removed --opt-sub-quad-attention and changed it to --opt-split-attention, I don't get any black images any more (system: Intel Xeon, 16 GB RAM, AMD RX 580 with 8 GB VRAM). I still get a "not enough GPU memory" error every now and then, but it's very inconsistent. Sometimes it happens right at the first image, sometimes I can create 20 images without any error. Weird.
-
Question about GPU use on AMD. Platform: Windows 11. This seems to be a working parameter setup; however, I am unable to get the GPU to be utilized. It seems like the actual work behind the UI then runs on the CPU only. Has anybody had this issue?
-
Thanks, I tried the stuff I read in here to test how it would run. Unfortunately the GPU gets almost 0% utilization apart from the VRAM maxing out, and doesn't do more than ~1 it/s under any scenario, whereas on Linux with the same settings it gets around 5 it/s with 100% GPU utilization. I hope DirectML gets more updates, but it would be better to finally get ROCm on Windows.
-
Hello everyone. I strongly recommend updating to the latest version. With the latest update, the webui now supports Token Merging. It also reduces VRAM usage in hires fix: I used to have 19 GB of VRAM consumed at 512x768 with a 1.4x hires fix; it now uses 11 GB. But please be reminded that after applying Token Merging, old images cannot be regenerated exactly, even with the same seed.
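For background on why seeds stop reproducing: Token Merging (ToMe) speeds generation up by fusing near-duplicate tokens before attention runs, so the model literally sees a different (shorter) sequence. Below is a toy sketch of the core idea only; the real ToMe algorithm uses a smarter bipartite soft-matching scheme, and the function names here are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, threshold=0.9):
    """Greedily fold each token into an earlier kept token when the two
    are nearly parallel; otherwise keep it. Fewer tokens in means cheaper
    attention, since attention cost grows with the square of token count."""
    kept = []
    for tok in tokens:
        for i, k in enumerate(kept):
            if cosine(tok, k) >= threshold:
                # Merge by averaging the two embeddings.
                kept[i] = [(a + b) / 2 for a, b in zip(k, tok)]
                break
        else:
            kept.append(tok)
    return kept

tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(merge_tokens(tokens)))  # the two near-duplicate tokens merge into one
```

Because merged sequences differ from the originals, the denoising trajectory changes slightly, which is why the same seed no longer reproduces the old image.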
-
I tried enabling ToMe and it works fine in most scenarios.
-
Just sharing my experience: I use an RX 570 4GB and this is my command line:
-
Alo guys (I'm on a 5700 XT). Trying Tiled VAE; it doesn't work at the moment unless using this. Any solution for this? Sometimes I just want to do hires fix and be done with it, or proceed to the next step.
-
Seems that the 7900 XT works fine with just these settings. Later I may try Olive. Some negative embeddings cause runtime errors; LoRA/LyCORIS and LoCon models seem to work.
-
Recently set up stable-diffusion-webui-directml on my PC; just want to report my spec and speed. (I tried the suggested arguments from this post too, but either the arguments or some of my extensions aren't working: SD only shows "Waiting" when I click Generate and nothing more happens, and even the Installed Extensions list can't show its contents. Basically, only the webui is working and the backend is not. Even if I revert the args, I have to restart the whole PC several times until the backend comes back to life. Not sure how to enable a verbose log to see what's wrong. I may try a more thorough "one arg change at a time" test later to find out which one is the culprit...)
-
Been using this guide for a while now and just wanted to say thank you for consolidating all the good AMD stuff into one place! AMD 6600 XT 8GB. Working settings for me: Optimizer: Doggettx. Time speedup with NGMS only:
-
Hey, I have a bit of "silent news" today. After some recent fiddling with SD DirectML, I've found that Danbooru interrogation finally works correctly, so the WD Tagger extension is no longer necessary. Interrogate CLIP still doesn't work, though.
-
Interesting discovery on RDNA3 (7900 XTX)! I accidentally had two COMMANDLINE_ARGS in my batch file (forgot to comment one out). I was testing; after fixing my batch file by removing the duplicate command line args, I tested inpainting again and it seems to be working fine without it. Unsure if this was related to the new Radeon 23.7.1 driver or a recent update to the DirectML Stable Diffusion repo. My batch file settings:
Additionally, since the new driver update/command line args, I've noticed VRAM utilization seems MUCH better. It tends to use ~18 GB and not grow continuously. More testing needed here. Any other RDNA3 users have a chance to test with Radeon driver 23.7.1?
-
I am using a 6600 XT 8GB Nitro+ and I tried a lot of commands while trying to use Inpaint and HiresFix x2 (1024x1024). This is what I found: --no-half is a must for Inpaint, but for some reason I am only able to do 1024x1024 with HiresFix x2 without it; if I add that command, an "out of memory" error appears. My previous commands allowed HiresFix x2 (1024x1024) but not Inpaint. Currently I'm using these commands and it works well and fast, with HiresFix x1.5 (768x768) and Inpaint both working. Which cross attention optimization might fit my GPU?
-
For those still not using a live preview:
-
BIG news - the main branch multiplatform version of
-
I'm a newbie here, so I want to know which branch is the sweet spot for an old AMD GPU. My card is an RX 580 8GB. Sadly, ROCm won't support my card anymore; I know it's a bit old, as AMD thinks too. A future ROCm update is coming, but it's for RDNA2 and RDNA3 (ROCm 5.6.0 for Windows). Is it worth going back to v1.2.1 (is that right? I mean before these commits, e.g. this one) instead of using the latest v1.5.1? I found the old version can manage memory better and can generate higher resolutions. (I mean no offense to anything or anyone!! I'm just curious🥲) I also have the latest
-
I am having a really strange issue with my 6750 XT. I can't use Deforum with Animation Mode 3D; it just creates broken images after the first one. It works flawlessly with Animation Mode 2D. Anyone experienced this?
-
Hey, I'm very new here. I have the same GPU, an RX 580 8GB. My VRAM seems to be always full when generating, and my driver has failed twice, giving me a black screen and the fan speed reaching 100%. I'm already sick of reinstalling drivers; do you know how I can prevent this issue?
-
Does anyone have any idea why, since yesterday, when I use Generate Forever I need to wait a long time between images?
-
Hello. I did want to leave a comment saying that your topic is a few months old now, but it still works, with one caveat: padding the positive/negative prompts produced a white image in preview, which turned black upon completion, so I had to disable that. The only other thing I modified was that with your settings above I could not generate 1024x1024 images like I could previously. The best I could do routinely without going OOM was 576x1024 (768x768 for 1:1). However, when I set the cross attention optimizer to Doggettx I could do 768x1024 (~5 sec/it), so that might be something to look into if you're interested. I could not get any of the others to complete more than 3-4 iterations. One criticism I would make, if allowed, is that many times you speak of RAM usage where I'm assuming you meant VRAM. If you ever find the time to update this again, it might be nice to note which kind of memory is meant wherever both are implicated. I haven't read the other comments, but I will in a few hours (hockey game) to see if there are any other optimizations I could try on my setup. Thanks again for this discussion. It completely made me forget what I was trying to get more information on (I think it was what, exactly, --data-dir consisted of; it appears to be models and extensions only, not settings or anything else).
-
After about two months of being an SD DirectML power user and an active participant in the discussions here, I finally made up my mind to compile the knowledge I've gathered in all that time. Considering that the DirectML implementation is more of a translation layer than a low-level rewrite of the original code, some features of the original SD webui are bound not to function properly, and different AMD cards may need different approaches. Nevertheless, this post has been written from the perspective of an AMD RX 580 (8GB) owner. The article will be expanded as I amass more knowledge.
Preface
Before I even start, I should make it clear that this guide was written before the large 1.3.0 update, which considerably changed a lot under the hood. So I've added a Part 0 covering the new things in 1.3.0 to be aware of.
Part 0. Optimizations
By all means read and take into consideration everything written below, but the 1.3.0 update adds an important tab to the options: "Optimizations". Essentially, it's now unnecessary to add any of the `--opt...` arguments to `webui-user.bat`; instead you choose them from the dropdown list right there.

Cross attention optimization
- V1 - Original v1 - The least memory-hungry version of the standard split-attention. Safe option.
- Doggettx - Essentially the split-attention as we know it. Default.
- Sub-quadratic - Our go-to choice in the previous version, but unfortunately it DOESN'T WORK with token merging, and it's practically impossible to run post-1.3.0 without it.
- sdp - scaled dot product - Looks like the best option for me, but it's very memory-hungry, so it won't work for everyone. In my experience it's the optimization least prone to memory leaks, so you can keep SD running longer before you have to restart it.
- InvokeAI - For some reason incredibly slow for me. Not recommended.

Negative Guidance minimum sigma
Allows the sampler to skip the negative prompt when it theoretically doesn't matter. It brings a pretty significant speedup at almost no cost at values from 1 up to 3.

Token Merging
A brand-new feature that brings a major speedup to all of your generations. The values go from 0.1 to 0.9; token merging offers a major speedup (up to 50% faster) and decreases memory consumption. In exchange it may take some detail off your generated pictures, offer less variety, and may not work well with some LoRAs. For the most part it's worth it; 0.2-0.5 is a safe value range. The same goes for img2img.
I've never been able to use the high-res pass without running out of memory, so I can't really attest to the effectiveness there, but I'd guess you can use pretty high values, up to 0.8.

Check "Pad prompt/negative prompt to be same length"; it won't hurt at all.
"Persistent cond cache" - Seems to bring a minor speedup, but freezes SD on the next generation if you change the optimization method.
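To illustrate what Negative Guidance minimum sigma (NGMS) does mechanically, here is a toy sketch of a classifier-free-guidance step. The function and parameter names are hypothetical, invented for illustration; the real logic lives inside the webui's samplers:

```python
def cfg_denoise(cond_eps, uncond_eps, cfg_scale, sigma, ngms):
    """Toy classifier-free-guidance combination step.

    When the current noise level (sigma) falls below the NGMS threshold,
    the negative-prompt (unconditional) pass is skipped entirely, saving
    one full model evaluation on that sampling step.
    """
    if sigma < ngms:
        return cond_eps  # late, low-noise step: use only the positive pass
    # Normal CFG: push the prediction away from the unconditional result.
    return uncond_eps + cfg_scale * (cond_eps - uncond_eps)

# Early step (high sigma): full guidance is applied.
print(cfg_denoise(1.0, 0.5, 7.0, sigma=10.0, ngms=2.0))  # 0.5 + 7*(1.0-0.5) = 4.0
# Late step (low sigma): the negative prompt is skipped.
print(cfg_denoise(1.0, 0.5, 7.0, sigma=1.0, ngms=2.0))   # 1.0
```

Since the skip only fires on low-sigma (late, fine-detail) steps, values of 1 to 3 trade away very little image quality for the saved evaluations.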
Part 1. Arguments
Being aware of the arguments is probably the most important knowledge for a new user, considering the wide range of optimizations and compatibility fixes they offer. I'll try to be brief and concise.
The changes to the arguments are made in `webui-user.bat` (open it in any text editor). The format goes as follows: `set COMMANDLINE_ARGS=(string of arguments)`.
I'll mention the arguments in order of actual usefulness, from "must have" to "placebo".

Must have
- `--no-half`, `--no-half-vae` and `--precision full` (update: actually, `--precision autocast` also works, and even running without that argument brings seemingly no problems, but it looks like my graphics card can't take advantage of fp16 calculations anyway, so there's virtually no difference) - Actual must-have arguments for most AMD cards. Without them most of you are going to get black squares instead of pictures, and for some these even offer a sizable speedup. Inpainting also often doesn't work at all without `--no-half`. There is a less memory-hungry alternative to `--no-half` called `--upcast-sampling`; you can experiment with it. For me, though, it breaks inpainting.
- `--medvram` - Considering the overall memory inefficiency of DirectML, if you have a graphics card with 8 GB of VRAM or less, this argument is non-negotiable. It doesn't hurt performance nearly as much as `--lowvram` and makes SD actually usable for many AMD card owners.
- `--always-batch-cond-uncond` - Only works when `--medvram` is on. Add this argument if you use ControlNet or if some LoRAs make a mess of your generated pictures.
- `--disable-nan-check` - Used to get rid of the pesky error "A tensor with all NaNs was produced in Unet". Won't hurt to have.

Nice to have
- `--opt-sub-quad-attention` - Probably the most powerful and newest of the `--opt...` arguments. Sometimes the culprit behind black squares, but offers a pretty good speedup. Can be complemented by the additional arguments `--sub-quad-q-chunk-size`, `--sub-quad-kv-chunk-size` and `--sub-quad-chunk-threshold`; experiment and play around with the numbers.
- `--opt-split-attention` and `--opt-split-attention-v1` are viable alternatives to the above optimization argument, the latter being more restrictive but less memory-hungry. Use these if you have any problems with `--opt-sub-quad-attention`.
- `--autolaunch` - Opens the webUI right as the program starts up. I personally find it convenient.

Experimental/Placebo
These in general won't hurt, but they are pretty recent optimizations that haven't been tested enough just yet.
- `set SAFETENSORS_FAST_GPU=1` (place it on a separate line in `webui-user.bat`) - I'm pretty sure this setting helped me push my image size boundaries upwards by about 200 pixels! And, well, it didn't hurt for sure.
- `set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128` - Probably the most placebo of the settings I actually use, but I think I get fewer "out of memory" errors with it, and even when I do, I can often re-generate the picture successfully.

Part 2. Extensions
Considering that some features of the DirectML implementation don't function properly, some of them can be successfully replaced with an extension.
Tagger is your alternative to the native interrogators (both of them, Danbooru and CLIP, just don't work at all on many systems). It provides danbooru-like tags that can be immediately exported into the prompt.

ControlNet
There are some caveats to using ControlNet on DirectML. First of all, we usually run out of memory without the "lowvram" flag, but there are ways around that. Then, there are problems with the preprocessors, namely OpenPose and Depth. For Depth we can, fortunately, use the `leres` version. When it comes to OpenPose, use the OpenPose editor extension.

Image editor plugins
If you're not averse to holding a stylus and tablet yourself and want to tap into the power of AI to augment your work, look no further than the image editor plugins. These are actually so powerful that they can replace the webui completely. I personally know of two options.
Part 3. General usage tips
Models
When looking for models, try to choose the pruned fp16 versions at about 2 GB in size, as they are the most memory-efficient and offer about the same quality. The only caveat is that they are worse material for custom model mixes (the `Checkpoint Merger` tab). Example: Dreamshaper.
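The ~2 GB figure for pruned fp16 checkpoints follows directly from parameter count times bytes per weight. A back-of-the-envelope calculation (the parameter count is a round approximation; SD 1.x's UNet, VAE and text encoder together are on the order of a billion parameters):

```python
def checkpoint_size_gb(params, bytes_per_weight):
    """Rough checkpoint size from parameter count and numeric precision."""
    return params * bytes_per_weight / (1024 ** 3)

params = 1_000_000_000  # illustrative round figure for an SD 1.x checkpoint

# fp32 stores each weight in 4 bytes, fp16 in 2, so halving precision
# roughly halves the file size and the memory needed to hold the weights.
print(round(checkpoint_size_gb(params, 4), 2))  # fp32: ~3.73 GB
print(round(checkpoint_size_gb(params, 2), 2))  # fp16: ~1.86 GB
```

Pruning (dropping EMA weights and optimizer state) accounts for the rest of the difference between a 7 GB training checkpoint and a 2 GB inference one.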
Samplers
Since the release of SD, the samplers have come a long way, and currently we have a handful of almost uncontested winners. From time to time I test other samplers like LMS Karras and the new UniPC, but ultimately I always fall back to these four.
Upscaling
For now, stay away from using hires.fix; most of the time it leads to an "out of memory" error. Instead, use these extensions (adjust the `Decoder tile size` option to avoid grey tiles).
Regarding upscaler models, the most stable ones are ESRGAN_4x (paintings, concept art), R-ESRGAN 4x+ (photorealistic), and R-ESRGAN 4x+ Anime6B (well, anime and manga).
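Pulling the argument advice above into one place, an edited `webui-user.bat` for an 8 GB card might look like the sketch below. The flag selection is assembled from the recommendations in this guide and is a starting point, not a universal recommendation; adjust per your card:

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=

rem "Must have" flags plus sub-quadratic attention; drop --opt-sub-quad-attention
rem if you get black squares, and try --opt-split-attention instead.
set COMMANDLINE_ARGS=--medvram --no-half --no-half-vae --precision full --opt-sub-quad-attention --disable-nan-check --autolaunch

rem Experimental/placebo settings from Part 1, each on its own line.
set SAFETENSORS_FAST_GPU=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

call webui.bat
```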