Replies: 30 comments 114 replies
-
I see that you're using --use-cpu clip now; have you found it makes a real difference for you?
I've had a fair amount of trouble getting MultiDiffusion and the Tiled VAE working correctly on my RX 590. How are you using it?
-
Hi Miraihi,
TY.
-
"Can be complemented by the additional arguments --sub-quad-q-chunk-size, --sub-quad-kv-chunk-size, --sub-quad-chunk-threshold. Experiment and play around with the numbers." I want to try this once today's bug gets resolved; can you give me some numbers to try? I'm on an RX 5700 XT.
"set SAFETENSORS_FAST_GPU=1 (Place it at the separate line at webui-user.bat) - I'm pretty sure this argument helped me to push my image size boundaries to about 200 pixels upwards! And, well, didn't hurt for sure."
"Tagger is your alternative to the native interrogators" "Image editor plugins" "Models" "Samplers" "Upscaling"
TY for sharing, it helps to get a new perspective.
-
I've got some experience to share too.
-
With the latest support for torch 2.0 in webui-directml, I tried the new --opt-sdp-attention. It's just a small improvement, but it's worth updating your local repo to torch 2.0 and using --opt-sdp-attention (if you're currently using --opt-sub-quad-attention).
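For readers curious what --opt-sdp-attention actually computes: torch 2.0 exposes a fused kernel (`torch.nn.functional.scaled_dot_product_attention`) for the attention formula softmax(QKᵀ/√d)·V. Here is a pure-Python toy for a single query vector, purely illustrative and not the webui's implementation:

```python
import math

def sdp_attention(q, k, v):
    """Toy scaled dot product attention for one query vector.

    q: list[float] of length d; k and v: lists of such vectors.
    Returns the softmax(q . k_i / sqrt(d))-weighted sum of v.
    """
    d = len(q)
    # Scaled dot products between the query and every key.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in k]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * vec[j] for w, vec in zip(weights, v)) for j in range(len(v[0]))]

# With identical keys the softmax weights are uniform, so the result
# is simply the average of the value vectors.
print(sdp_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]]))
```

The fused kernel computes the same thing in one pass without materializing the full score matrix, which is where the memory savings come from.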
-
To contribute to the discussion, I ran the same prompt multiple times with different launch settings to see the differences. Setup:
Prompt parameters:
Methodology:
With:
Without:
-
Hi all. I just did a git pull today and found that memory consumption is greatly improved.
-
My experience with parameters: after I removed --opt-sub-quad-attention and changed it to --opt-split-attention, I don't get any black images any more (system: Intel Xeon, 16 GB RAM, AMD RX 580 with 8 GB VRAM). I still get a "not enough GPU memory" error every now and then, but it's very inconsistent. Sometimes it happens right at the first image, sometimes I can create 20 images without any error. Weird.
-
Question about GPU use on AMD. Platform: Windows 11. This seems to be a working parameter setup; however, I am unable to get the GPU to be utilized. It seems like the actual work behind the UI then runs on the CPU only. Has anybody had this issue?
-
Thanks, I tried the stuff I read in here to test how it would run. Unfortunately the GPU gets almost 0% utilization apart from the VRAM maxing out, and doesn't do more than ~1 it/s under any scenario, whereas on Linux with the same settings it gets around 5 it/s with 100% GPU utilization. I hope DirectML gets more updates, but it would be better to finally get ROCm on Windows.
-
Hello everyone. I strongly recommend updating to the latest version. With the latest update, the webui now supports Token Merging. It also reduces VRAM usage in hires fix: I used to have 19 GB of VRAM consumed at 512x768 with a 1.4x hires fix; it now uses 11 GB. But please be reminded that after applying Token Merging, old images cannot be regenerated exactly, even with the same seed.
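For background on why seeds stop reproducing: Token Merging (ToMe) speeds generation up by fusing near-duplicate tokens before attention runs, so the model literally sees a different (shorter) sequence. Below is a toy sketch of the core idea only; the real ToMe algorithm uses a smarter bipartite soft-matching scheme, and the function names here are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, threshold=0.9):
    """Greedily fold each token into an earlier kept token when the two
    are nearly parallel; otherwise keep it. Fewer tokens in means cheaper
    attention, since attention cost grows with the square of token count."""
    kept = []
    for tok in tokens:
        for i, k in enumerate(kept):
            if cosine(tok, k) >= threshold:
                # Merge by averaging the two embeddings.
                kept[i] = [(a + b) / 2 for a, b in zip(k, tok)]
                break
        else:
            kept.append(tok)
    return kept

tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(merge_tokens(tokens)))  # the two near-duplicate tokens merge into one
```

Because merged sequences differ from the originals, the denoising trajectory changes slightly, which is why the same seed no longer reproduces the old image.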
-
I tried enabling ToMe and it works fine in most scenarios.
-
Just sharing my experience: I use an RX 570 4GB and this is my command line:
-
Alo guys (I'm on a 5700 XT). Trying Tiled VAE; it doesn't work at the moment unless using this. Any solution for this? Sometimes I just want to do hires fix and be done with it, or proceed to the next step.
-
Seems that the 7900 XT works fine with just these settings. Later I may try Olive. Some negative embeddings cause runtime errors; LoRA/LyCORIS and LoCon models seem to work.
-
Recently set up stable-diffusion-webui-directml on my PC; just want to report my spec and speed. (I tried the suggested arguments from this post too, but either the arguments or some of my extensions aren't working: SD only shows "Waiting" when I click Generate and nothing more happens, and even the Installed Extensions list can't show its contents. Basically, only the webui is working and the backend is not. Even if I revert the args, I have to restart the whole PC several times until the backend comes back to life. Not sure how to enable a verbose log to see what's wrong. I may try a more thorough "one arg change at a time" test later to find out which one is the culprit...)
-
Been using this guide for a while now and just wanted to say thank you for consolidating all the good AMD stuff into one place! AMD 6600 XT 8GB. Working settings for me: Optimizer: Doggettx. Time speedup with NGMS only:
-
Hey, I have a bit of "silent news" today. After some recent fiddling with SD DirectML, I've found that Danbooru interrogation finally works correctly, so the WD Tagger extension is no longer necessary. Interrogate CLIP still doesn't work, though.
-
Interesting discovery on RDNA3 (7900 XTX)! I accidentally had two COMMANDLINE_ARGS in my batch file (forgot to comment one out). I was testing; after fixing my batch file by removing the duplicate command line args, I tested inpainting again and it seems to be working fine without it. Unsure if this was related to the new Radeon 23.7.1 driver or a recent update to the DirectML Stable Diffusion repo. My batch file settings:
Additionally, since the new driver update/command line args, I've noticed VRAM utilization seems MUCH better. It tends to use ~18 GB and not grow continuously. More testing needed here. Any other RDNA3 users have a chance to test with Radeon driver 23.7.1?
-
I am using a 6600 XT 8GB Nitro+ and I tried a lot of commands while trying to use Inpaint and HiresFix x2 (1024x1024). This is what I found: --no-half is a must for Inpaint, but for some reason I am only able to do 1024x1024 with HiresFix x2 without it; if I add that command, an "out of memory" error appears. My previous commands allowed HiresFix x2 (1024x1024) but not Inpaint. Currently I'm using these commands and it works well and fast, with HiresFix x1.5 (768x768) and Inpaint both working. Which cross attention optimization might fit my GPU?
-
For those still not using a live preview:
-
BIG news - the main branch multiplatform version of
-
I'm a newbie here, so I want to know which branch is the sweet spot for an old AMD GPU. My card is an RX 580 8GB. Sadly, ROCm won't support my card anymore; I know it's a bit old, as AMD thinks too. A future ROCm update is coming, but it's for RDNA2 and RDNA3 (ROCm 5.6.0 for Windows). Is it worth going back to v1.2.1 (is that right? I mean before these commits, e.g. this one) instead of using the latest v1.5.1? I found the old version can manage memory better and can generate higher resolutions. (I mean no offense to anything or anyone!! I'm just curious🥲) I also have the latest
-
I am having a really strange issue with my 6750 XT. I can't use Deforum with Animation Mode 3D; it just creates broken images after the first one. It works flawlessly with Animation Mode 2D. Anyone experienced this?
-
Hey, I'm very new here. I have the same GPU, an RX 580 8GB. My VRAM seems to be always full when generating, and my driver has failed twice, giving me a black screen and the fan speed reaching 100%. I'm already sick of reinstalling drivers; do you know how I can prevent this issue?
-
Does anyone have any idea why, since yesterday, when I use Generate Forever I need to wait a long time between images?
-
Hello. I did want to leave a comment saying that your topic is a few months old now, but it still works, with one caveat: padding the positive/negative prompts produced a white image in preview, which turned black upon completion, so I had to disable that. The only other thing I modified was that with your settings above I could not generate 1024x1024 images like I could previously. The best I could do routinely without going OOM was 576x1024 (768x768 for 1:1). However, when I set the cross attention optimizer to Doggettx I could do 768x1024 (~5 sec/it), so that might be something to look into if you're interested. I could not get any of the others to complete more than 3-4 iterations. One criticism I would make, if allowed, is that many times you speak of RAM usage where I'm assuming you meant VRAM. If you ever find the time to update this again, it might be nice to note which kind of memory is meant wherever both are implicated. I haven't read the other comments, but I will in a few hours (hockey game) to see if there are any other optimizations I could try on my setup. Thanks again for this discussion. It completely made me forget what I was trying to get more information on (I think it was what, exactly, --data-dir consisted of; it appears to be models and extensions only, not settings or anything else).
-
After about two months of being an SD DirectML power user and an active participant in the discussions here, I finally made up my mind to compile the knowledge I've gathered in all that time. Considering that the DirectML implementation is more of a translation layer than a low-level rewrite of the original code, some features of the original SD webui are bound not to function properly, and different AMD cards may need different approaches. Nevertheless, this post has been written from the perspective of an AMD RX 580 (8GB) owner. The article will be expanded as I amass more knowledge.
Preface
Before I even start, I should make it clear that this guide was written before the large 1.3.0 update, which considerably changed a lot under the hood. So I've added a Part 0 covering the new things in 1.3.0 to be aware of.
Part 0. Optimizations
By all means read and take into consideration everything written below, but the 1.3.0 update adds an important tab to the options: "Optimizations". Essentially, it's now unnecessary to add any of the `--opt...` arguments to `webui-user.bat`; instead you choose them from the dropdown list right there.

Cross attention optimization
- V1 - Original v1 - The least memory-hungry version of the standard split-attention. Safe option.
- Doggettx - Essentially the split-attention as we know it. Default.
- Sub-quadratic - Our go-to choice in the previous version, but unfortunately it DOESN'T WORK with token merging, and it's practically impossible to run post-1.3.0 without it.
- sdp - scaled dot product - Looks like the best option for me, but it's very memory-hungry, so it won't work for everyone. In my experience it's the optimization least prone to memory leaks, so you can keep SD running longer before you have to restart it.
- InvokeAI - For some reason incredibly slow for me. Not recommended.

Negative Guidance minimum sigma
Allows the sampler to skip the negative prompt when it theoretically doesn't matter. It brings a pretty significant speedup at almost no cost at values from 1 up to 3.

Token Merging
A brand-new feature that brings a major speedup to all of your generations. The values go from 0.1 to 0.9; token merging offers a major speedup (up to 50% faster) and decreases memory consumption. In exchange it may take some detail off your generated pictures, offer less variety, and may not work well with some LoRAs. For the most part it's worth it; 0.2-0.5 is a safe value range. The same goes for img2img.
I've never been able to use the high-res pass without running out of memory, so I can't really attest to the effectiveness there, but I'd guess you can use pretty high values, up to 0.8.

Check "Pad prompt/negative prompt to be same length"; it won't hurt at all.
"Persistent cond cache" - Seems to bring a minor speedup, but freezes SD on the next generation if you change the optimization method.
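To illustrate what Negative Guidance minimum sigma (NGMS) does mechanically, here is a toy sketch of a classifier-free-guidance step. The function and parameter names are hypothetical, invented for illustration; the real logic lives inside the webui's samplers:

```python
def cfg_denoise(cond_eps, uncond_eps, cfg_scale, sigma, ngms):
    """Toy classifier-free-guidance combination step.

    When the current noise level (sigma) falls below the NGMS threshold,
    the negative-prompt (unconditional) pass is skipped entirely, saving
    one full model evaluation on that sampling step.
    """
    if sigma < ngms:
        return cond_eps  # late, low-noise step: use only the positive pass
    # Normal CFG: push the prediction away from the unconditional result.
    return uncond_eps + cfg_scale * (cond_eps - uncond_eps)

# Early step (high sigma): full guidance is applied.
print(cfg_denoise(1.0, 0.5, 7.0, sigma=10.0, ngms=2.0))  # 0.5 + 7*(1.0-0.5) = 4.0
# Late step (low sigma): the negative prompt is skipped.
print(cfg_denoise(1.0, 0.5, 7.0, sigma=1.0, ngms=2.0))   # 1.0
```

Since the skip only fires on low-sigma (late, fine-detail) steps, values of 1 to 3 trade away very little image quality for the saved evaluations.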
Part 1. Arguments
Being aware of the arguments is probably the most important knowledge for a new user, considering the wide range of optimizations and compatibility fixes they offer. I'll try to be brief and concise.
The changes to the arguments are made in `webui-user.bat` (open it in any text editor). The format goes as follows: `set COMMANDLINE_ARGS=(string of arguments)`.
I'll mention the arguments in order of actual usefulness, from "must have" to "placebo".

Must have
- `--no-half`, `--no-half-vae` and `--precision full` (update: actually, `--precision autocast` also works, and even running without that argument brings seemingly no problems, but it looks like my graphics card can't take advantage of fp16 calculations anyway, so there's virtually no difference) - Actual must-have arguments for most AMD cards. Without them most of you are going to get black squares instead of pictures, and for some these even offer a sizable speedup. Inpainting also often doesn't work at all without `--no-half`. There is a less memory-hungry alternative to `--no-half` called `--upcast-sampling`; you can experiment with it. For me, though, it breaks inpainting.
- `--medvram` - Considering the overall memory inefficiency of DirectML, if you have a graphics card with 8 GB of VRAM or less, this argument is non-negotiable. It doesn't hurt performance nearly as much as `--lowvram` and makes SD actually usable for many AMD card owners.
- `--always-batch-cond-uncond` - Only works when `--medvram` is on. Add this argument if you use ControlNet or if some LoRAs make a mess of your generated pictures.
- `--disable-nan-check` - Used to get rid of the pesky error "A tensor with all NaNs was produced in Unet". Won't hurt to have.

Nice to have
- `--opt-sub-quad-attention` - Probably the most powerful and newest of the `--opt...` arguments. Sometimes the culprit behind black squares, but offers a pretty good speedup. Can be complemented by the additional arguments `--sub-quad-q-chunk-size`, `--sub-quad-kv-chunk-size` and `--sub-quad-chunk-threshold`; experiment and play around with the numbers.
- `--opt-split-attention` and `--opt-split-attention-v1` are viable alternatives to the above optimization argument, the latter being more restrictive but less memory-hungry. Use these if you have any problems with `--opt-sub-quad-attention`.
- `--autolaunch` - Opens the webUI right as the program starts up. I personally find it convenient.

Experimental/Placebo
These in general won't hurt, but they are pretty recent optimizations that haven't been tested enough just yet.
- `set SAFETENSORS_FAST_GPU=1` (place it on a separate line in `webui-user.bat`) - I'm pretty sure this setting helped me push my image size boundaries upwards by about 200 pixels! And, well, it didn't hurt for sure.
- `set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128` - Probably the most placebo of the settings I actually use, but I think I get fewer "out of memory" errors with it, and even when I do, I can often re-generate the picture successfully.

Part 2. Extensions
Considering that some features of the DirectML implementation don't function properly, some of them can be successfully replaced with an extension.
Tagger is your alternative to the native interrogators (both of them, Danbooru and CLIP, just don't work at all on many systems). It provides danbooru-like tags that can be immediately exported into the prompt.

ControlNet
There are some caveats to using ControlNet on DirectML. First of all, we usually run out of memory without the "lowvram" flag, but there are ways around that. Then, there are problems with the preprocessors, namely OpenPose and Depth. For Depth we can, fortunately, use the `leres` version. When it comes to OpenPose, use the OpenPose editor extension.

Image editor plugins
If you're not averse to holding a stylus and tablet yourself and want to tap into the power of AI to augment your work, look no further than the image editor plugins. These are actually so powerful that they can replace the webui completely. I personally know of two options.
Part 3. General usage tips
Models
When looking for models, try to choose the pruned fp16 versions at about 2 GB in size, as they are the most memory-efficient and offer about the same quality. The only caveat is that they are worse material for custom model mixes (the `Checkpoint Merger` tab). Example: Dreamshaper.
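The ~2 GB figure for pruned fp16 checkpoints follows directly from parameter count times bytes per weight. A back-of-the-envelope calculation (the parameter count is a round approximation; SD 1.x's UNet, VAE and text encoder together are on the order of a billion parameters):

```python
def checkpoint_size_gb(params, bytes_per_weight):
    """Rough checkpoint size from parameter count and numeric precision."""
    return params * bytes_per_weight / (1024 ** 3)

params = 1_000_000_000  # illustrative round figure for an SD 1.x checkpoint

# fp32 stores each weight in 4 bytes, fp16 in 2, so halving precision
# roughly halves the file size and the memory needed to hold the weights.
print(round(checkpoint_size_gb(params, 4), 2))  # fp32: ~3.73 GB
print(round(checkpoint_size_gb(params, 2), 2))  # fp16: ~1.86 GB
```

Pruning (dropping EMA weights and optimizer state) accounts for the rest of the difference between a 7 GB training checkpoint and a 2 GB inference one.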
Samplers
Since the release of SD, the samplers have come a long way, and currently we have a handful of almost uncontested winners. From time to time I test other samplers like LMS Karras and the new UniPC, but ultimately I always fall back to these four.
Upscaling
For now, stay away from using hires.fix; most of the time it leads to an "out of memory" error. Instead, use these extensions (adjust the `Decoder tile size` option to avoid grey tiles).
Regarding upscaler models, the most stable ones are ESRGAN_4x (paintings, concept art), R-ESRGAN 4x+ (photorealistic), and R-ESRGAN 4x+ Anime6B (well, anime and manga).
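Pulling the argument advice above into one place, an edited `webui-user.bat` for an 8 GB card might look like the sketch below. The flag selection is assembled from the recommendations in this guide and is a starting point, not a universal recommendation; adjust per your card:

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=

rem "Must have" flags plus sub-quadratic attention; drop --opt-sub-quad-attention
rem if you get black squares, and try --opt-split-attention instead.
set COMMANDLINE_ARGS=--medvram --no-half --no-half-vae --precision full --opt-sub-quad-attention --disable-nan-check --autolaunch

rem Experimental/placebo settings from Part 1, each on its own line.
set SAFETENSORS_FAST_GPU=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

call webui.bat
```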