
Kandinsky 3.0 "CUDA Out of memory" error #6028

@Atm4x

Description

Describe the bug

Kandinsky 3.0 fails with a CUDA "out of memory" error as soon as the pipeline starts running.

Other models such as SDXL work without problems on the same setup, including calls like pipe.to('cuda'), but Kandinsky 3 runs out of memory.

GPU: 1x T4 GPU (Google colab)

Reproduction

from diffusers import AutoPipelineForText2Image
import torch

# Load the fp16 variant and enable per-model CPU offload to reduce VRAM usage
pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "Any prompt"

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]  # <-- OOM is raised here

image.save('1.png')
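
For what it's worth, here is a lower-memory variant that may fit on a 16 GB T4 (a sketch, not a confirmed fix): enable_sequential_cpu_offload() offloads individual submodules instead of whole models, which is much slower but has a far smaller peak VRAM footprint.

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
# Sketch: sequential offload keeps only the currently executing submodule
# on the GPU, trading inference speed for peak memory
pipe.enable_sequential_cpu_offload()

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe("Any prompt", num_inference_steps=25, generator=generator).images[0]
image.save('1.png')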

Logs

Loading pipeline components...: 100% 5/5 [00:02<00:00, 1.75it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100% 5/5 [00:01<00:00, 3.13it/s]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-1-7c6f4c265399> in <cell line: 10>()
      8 
      9 generator = torch.Generator(device="cpu").manual_seed(0)
---> 10 image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

18 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
   1156                 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1157                             non_blocking, memory_format=convert_to_format)
-> 1158             return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
   1159 
   1160         return self._apply(convert)

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 24.81 MiB is free. Process 79636 has 14.72 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 1.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
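
As the message itself suggests, allocator fragmentation can sometimes be reduced via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch follows (the variable must be set before the first CUDA allocation); note that with ~14.6 GiB genuinely allocated by the model, this alone is unlikely to be enough here:

import os
# Must be set before torch initializes the CUDA allocator
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator picks up the setting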

System Info

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35 (Google Colab)
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Huggingface_hub version: 0.19.4
  • Transformers version: 4.35.2
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@yiyixuxu @patrickvonplaten
