Getting stuck at "Distributed - injecting images 100%" #28
Comments
Could you add your logs? Also, yes, you're using it just fine as long as the compute device for each instance is different.
Thanks for the reply. I ran both instances; this is the output of the log on the web-ui:
I've not encountered this issue before using the extension, and it goes away if I disable it when generating.
How long was your main instance running before you encountered this? Also, did you happen to use the interrupt/skip button on a request in the web-ui at any point before this happened?
I had just restarted both instances, so they were newly started. I didn't interrupt the generation. I tried with no prior generations before the distributed one, and with one generation on each instance beforehand; same result. D:
When you say restarted, do you mean you fully stopped sdwui and restarted both (and didn't use the restart button built into the web interface)?
Correct, stopped and started again.
So this is happening every time, on the first try, after rebooting sdwui?
Yes, every time I try generating an image with the Distributed checkbox ticked, it happens, regardless of whether it's the first image generated on the instance or not.
What seems most likely is that something in your config is off. Could you add this debug statement? Could you also post your config?
I tried generating one image.
Slave:
Config: {
"workers": [
{
"master": {
"avg_ipm": 3.468163269192995,
"master": true,
"address": "0.0.0.0",
"port": 7860,
"eta_percent_error": [],
"tls": false,
"state": 1,
"user": null,
"password": null,
"pixel_cap": -1
}
},
{
"slave": {
"avg_ipm": 118.3148549027504,
"master": false,
"address": "0.0.0.0",
"port": 7861,
"eta_percent_error": [],
"tls": false,
"state": 1,
"user": "None",
"password": "None",
"pixel_cap": -1
}
}
],
"benchmark_payload": {
"prompt": "A herd of cows grazing at the bottom of a sunny valley",
"negative_prompt": "",
"steps": 20,
"width": 512,
"height": 512,
"batch_size": 1
},
"job_timeout": 3,
"enabled": true,
"complement_production": true
}
What happened to your speed ratings btw? Before, it showed both of your instances going at about 3 ipm, but now the slave is at around 120 ipm?
It's strange: it sometimes shows something normal like 3, but usually the slave is really high, at about 120 ipm. I wish it could do 120 ipm, haha.
Can you let me know if this also happens consistently on 424a1c8?
Just tested, and on that commit I cannot generate anything with the extension enabled. :( It hangs like this.
Master:
Slave:
WebUI extension log:
It seems that it doesn't send the command to the slave instance?
Does this happen with no other extensions enabled (builtin ones should be fine)? Also, can you post a list of the extensions you've been using?
This is the only non-builtin extension I'm using, and it works fine if I disable it.
In that case:
1. Were you reloading the config yourself multiple times in a row? At least initially, it looks like your slave instance wasn't running yet, since you were getting a connection refused error.
2. Remove the extension from the slave worker; you only need it installed on the main instance. If you're using the same installation root for more than one instance, you'll probably need to use sdwui's command-line options to force the slave instance to use a separate config that has the extension disabled (sketched below). You can see that the slave instance is trying to connect back to itself as a worker (since the port is the same), which shouldn't happen.
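For concreteness, a rough sketch of that two-config launch, assuming stock sdwui (AUTOMATIC1111) command-line options (--device-id, --port, and --ui-settings-file all exist upstream). The settings file names here are placeholders; config-slave.json would be a copy of your settings with this extension listed in its disabled extensions setting (key name may vary by webui version):

# Run each instance in its own terminal, from the same installation root.
# Master: GPU 0, extension enabled
python launch.py --device-id 0 --port 7860 --ui-settings-file config-master.json
# Slave: GPU 1, separate settings file with this extension disabled
python launch.py --device-id 1 --port 7861 --ui-settings-file config-slave.json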
Thank you - I was able to get it working by adding the command-line options you suggested.
But I am now facing an issue where my slave instance seems to run out of VRAM when generating through the extension. Low sampling step counts work, but anything higher than ~10-15 seems to cause it to run out of VRAM. It's strange, because it works just fine if I generate through its own web-ui, at any number of sampling steps, even at higher resolutions. Do you think this could be an issue with how the extension spreads the workload?
The problem in this case is that your slave's ipm was ending up at around 120 for some reason (3 ipm, like before, sounds about right). The best thing to do would be to rebenchmark, or manually adjust that ipm in the config. Then the distribution logic should split your requests about evenly, and this should be far less of an issue.
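If the split is proportional to those ratings, the slave's share of each request would be about 118.3 / (118.3 + 3.5) ≈ 97%, so nearly the whole batch lands on one GPU. A minimal sketch of the manual fix, assuming the config shown above lives in a JSON file (the path below is a placeholder; point it at wherever the extension actually stores its config):

# Placeholder path; adjust to the extension's actual config file.
CONFIG=path/to/distributed-config.json
# Reset the slave's avg_ipm (second entry under "workers") to a realistic value,
# then restart sdwui so the new rating is picked up.
jq '.workers[1].slave.avg_ipm = 3.5' "$CONFIG" > "$CONFIG.tmp" && mv "$CONFIG.tmp" "$CONFIG"

This uses jq for the in-place edit; you could equally change the avg_ipm value by hand in a text editor.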
If there are still issues, let me know and I can reopen the issue.
After another benchmark and some restarting, I was able to get it to work. Thank you for your time and help!
Original issue:
Hi, I first want to thank you for this project. I'm running into an issue where, after generating the image, Stable Diffusion gets stuck at: Distributed - injecting images 100%.
I currently have 2 GPUs installed. One instance is running on device 0, the other on device 1, and I can confirm they are both being used through nvtop and nvidia-smi. Both instances are run from the same folder, on different ports; unsure if this is how it's supposed to be used. I have installed the extension and, judging by the log, it seems to work. It generates an image but gets stuck at the mentioned status. There are no errors I can see. When that status is displayed, the log reports that the 2nd instance is idle.
Am I doing something wrong? If so, can you expand a bit on what the proper usage of this extension looks like? Please let me know if you need more information. Thank you.
Output from the main instance: