[Bug] CUDA Illegal Memory Access (`CONCAT failed` in `ggml_cuda_compute_forward`) When Using LoRAs With CPU Parameter Offloading #1558

### Git commit

5b0267e (build master-633-5b0267e)

### Operating System & Version

Windows 11

### GGML backends

CUDA

### Command-line arguments used

sd-cli --diffusion-model "Z:\gguf_models\Image\Qwen-Image\qwen-image-2512-Q4_K_M.gguf" --vae "Z:\gguf_models\Image\Qwen-Image\qwen_image_vae.safetensors" --llm "Z:\gguf_models\Image\Qwen-Image\Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf" -v --diffusion-fa -W 720 -H 1024 --seed 42 --steps 8 --cfg-scale 1 --sampling-method euler --backend all=cuda0 --params-backend diffusion=cpu,te=cpu,vae=cpu --mmap --lora-model-dir "Z:\gguf_models\Image\Qwen-Image\LoRA" -p "a pack of pikachus in a lush forest<lora:Qwen-Image-2512-Lightning-8steps-V1.0-bf16:1>"

### Steps to reproduce

1. Load a quantized Qwen‑Image GGUF model (`Q4_K_M`, etc.).  
2. Load a Lightning LoRA via `--lora-model-dir` and a `<lora:...>` tag in the prompt.  
3. Enable CPU parameter offloading:  
   - `--offload-to-cpu`  
   **or**  
   - `--params-backend diffusion=cpu,te=cpu,vae=cpu`  
4. Start a generation request.

### What you expected to happen

An image should be generated without errors.

### What actually happened

When applying a **Lightning LoRA** to a **quantized Qwen‑Image GGUF base model**, the system crashes at the first sampling step **only when CPU parameter offloading is enabled** (`--offload-to-cpu` or `--params-backend diffusion=cpu`).

The crash consistently reports:

```
ggml_cuda_compute_forward: CONCAT failed
CUDA error: an illegal memory access was encountered
```

This occurs in both `sd-cli` and `sd-server`.

My guess is that the CUDA backend expects every tensor in the graph to be on one device but receives a mix of CPU and GPU tensors instead. The LoRA appears to be applied on the GPU while the diffusion model's parameters are kept on the CPU, so the model ends up with its data split across two devices.
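
To illustrate the suspected failure mode, here is a minimal sketch of the kind of pre-flight check that could turn the crash into a readable error. `Device`, `TensorInfo`, and `first_misplaced_input` are hypothetical names made up for this example. They are not ggml or stable-diffusion.cpp APIs, and I have not confirmed that a device mismatch is the actual root cause.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT ggml types.
enum class Device { CPU, CUDA0 };

struct TensorInfo {
    std::string name;
    Device      device;  // where the tensor's data currently lives
};

// Returns the name of the first input whose data is not on `expected`,
// or std::nullopt when every input is on the expected device.
std::optional<std::string> first_misplaced_input(const std::vector<TensorInfo>& inputs,
                                                 Device expected) {
    for (const TensorInfo& t : inputs) {
        if (t.device != expected) {
            return t.name;
        }
    }
    return std::nullopt;
}
```

Running a check like this over the inputs of each node before launching a CUDA kernel would name the offending tensor up front, rather than failing later with an illegal memory access inside the kernel.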

If I force `--lora-apply-mode immediately`, the system catches the mismatch earlier and reports a type‑combination error instead of crashing: `unsupported type combination (f32 to q6_K)`


### Logs / error messages / stack trace

### **CLI (`sd-cli`)**
```text
[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 3.25s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 3463 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 3.46s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 6.17s
[ERROR] ggml_extend.hpp:69   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69   -   current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69   -   err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error
```

### **Server (`sd-server`)**
```text
[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 1.55s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 1767 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 1.77s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 4.48s
[ERROR] ggml_extend.hpp:69   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69   -   current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69   -   err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error
```

### Additional context / environment details

**System Configuration**
- **GPU:** NVIDIA GeForce RTX 4070 Ti SUPER (16 GB VRAM)  
- **Build:** master‑633‑5b0267e (commit 5b0267e)  
- **Core Models:** Qwen-Image-2512 (Q4_K_M GGUF base + Qwen2.5-VL-7B Text Encoder) 
- **LoRA Model:** LightX2V-Qwen-Image-Lightning
