Skip to content

DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.  #5443

@p-moon

Description

@p-moon

Describe the bug

import torch
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline, AutoencoderKL
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

device = "cuda"

pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.to(device)

Reproduction

pipe = StableDiffusionXLInpaintPipeline(...)
pipe.to("cuda")

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py:260, in _lazy_init()
    259 try:
--> 260     queued_call()
    261 except Exception as e:

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py:145, in _check_capability()
    144 for d in range(device_count()):
--> 145     capability = get_device_capability(d)
    146     major = capability[0]

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py:381, in get_device_capability(device)
    369 r"""Gets the cuda capability of a device.
    370 
    371 Args:
   (...)
    379     tuple(int, int): the major and minor cuda capability of the device
    380 """
--> 381 prop = get_device_properties(device)
    382 return prop.major, prop.minor

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py:399, in get_device_properties(device)
    398     raise AssertionError("Invalid device id")
--> 399 return _get_device_properties(device)

RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. 

The above exception was the direct cause of the following exception:

DeferredCudaCallError                     Traceback (most recent call last)
/home/brand/develop/stable-diffusion/stable-diffusion-train/notes/image_inpainting_xl.ipynb Cell 3 line 2
     11 # vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to(device)
     12 
     13 # pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
   (...)
     16 #     torch_dtype=torch.float32,
     17 # )
     19 pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
     20     "stabilityai/stable-diffusion-xl-base-1.0",
     21     torch_dtype=torch.float16,
     22     variant="fp16",
     23     use_safetensors=True,
     24 )
---> 25 pipe.to(device)

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:815, in DiffusionPipeline.to(self, *args, **kwargs)
    811     logger.warning(
    812         f"The module '{module.__class__.__name__}' has been loaded in 8bit and moving it to {torch_dtype} via `.to()` is not yet supported. Module is still on {module.device}."
    813     )
    814 else:
--> 815     module.to(device, dtype)
    817 if (
    818     module.dtype == torch.float16
    819     and str(device) in ["cpu"]
    820     and not silence_dtype_warnings
    821     and not is_offloaded
    822 ):
    823     logger.warning(
    824         "Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It"
    825         " is not recommended to move them to `cpu` as running them will fail. Please make"
   (...)
    828         " `torch_dtype=torch.float16` argument, or use another device for inference."
    829     )

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/transformers/modeling_utils.py:2181, in PreTrainedModel.to(self, *args, **kwargs)
   2176     raise ValueError(
   2177         "`.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the"
   2178         " model has already been set to the correct devices and casted to the correct `dtype`."
   2179     )
   2180 else:
-> 2181     return super().to(*args, **kwargs)

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:1145, in Module.to(self, *args, **kwargs)
   1141         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                     non_blocking, memory_format=convert_to_format)
   1143     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:820, in Module._apply(self, fn)
    816 # Tensors stored in modules are graph leaves, and we don't want to
    817 # track autograd history of `param_applied`, so we have to use
    818 # `with torch.no_grad():`
    819 with torch.no_grad():
--> 820     param_applied = fn(param)
    821 should_use_set_data = compute_should_use_set_data(param, param_applied)
    822 if should_use_set_data:

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/nn/modules/module.py:1143, in Module.to.<locals>.convert(t)
   1140 if convert_to_format is not None and t.dim() in (4, 5):
   1141     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

File ~/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py:264, in _lazy_init()
    261         except Exception as e:
    262             msg = (f"CUDA call failed lazily at initialization with error: {str(e)}\n\n"
    263                    f"CUDA call was originally invoked at:\n\n{orig_traceback}")
--> 264             raise DeferredCudaCallError(msg) from e
    265 finally:
    266     delattr(_tls, 'is_initializing')

DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. 

CUDA call was originally invoked at:

['  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/runpy.py", line 196, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/runpy.py", line 86, in _run_code\n    exec(code, run_globals)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>\n    app.launch_new_instance()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/traitlets/config/application.py", line 1046, in launch_instance\n    app.start()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 736, in start\n    self.io_loop.start()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start\n    self.asyncio_loop.run_forever()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/asyncio/base_events.py", line 603, in run_forever\n    self._run_once()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once\n    handle._run()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/asyncio/events.py", line 80, in _run\n    self._context.run(self._callback, *self._args)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue\n    await self.process_one()\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 505, in process_one\n    await dispatch(*args)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell\n    await result\n', '  File 
"/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 740, in execute_request\n    reply_content = await reply_content\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute\n    res = shell.run_cell(\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 546, in run_cell\n    return super().run_cell(*args, **kwargs)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3024, in run_cell\n    result = self._run_cell(\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3079, in _run_cell\n    result = runner(coro)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner\n    coro.send(None)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3284, in run_cell_async\n    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3466, in run_ast_nodes\n    if await self.run_code(code, result, async_=asy):\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code\n    exec(code_obj, self.user_global_ns, self.user_ns)\n', '  File "/tmp/ipykernel_644660/3011389616.py", line 1, in <module>\n    import torch\n', '  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked\n', '  File "<frozen 
importlib._bootstrap>", line 688, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 883, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/__init__.py", line 1146, in <module>\n    _C._initExtension(manager_path())\n', '  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked\n', '  File "<frozen importlib._bootstrap_external>", line 883, in exec_module\n', '  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py", line 197, in <module>\n    _lazy_call(_check_capability)\n', '  File "/home/brand/.miniconda3/envs/stable-diffusion-train/lib/python3.10/site-packages/torch/cuda/__init__.py", line 195, in _lazy_call\n    _queued_calls.append((callable, traceback.format_stack()))\n']

System Info

diffusers-0.21.4, ubuntu 22

Who can help?

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bug — Something isn't working; stale — Issues that haven't received updates

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions