Auto-round commit: 5329bcd
Reproduction command:
```bash
AR_LOG_LEVEL=TRACE auto_round \
    --model /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4 \
    --device_map "0,1,2,3,4,5,6" \
    --scheme "W4A16" \
    --iters 32 \
    --low_gpu_mem_usage
```

Log:

```
2025-12-02 04:05:05 INFO __main__.py L508: start to quantize /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4
2025-12-02 04:05:05 WARNING model.py L279: trust_remote_code is enabled by default, please ensure its correctness.
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 60.57it/s]
2025-12-02 04:05:06 INFO base.py L381: using torch.bfloat16 for quantization tuning
2025-12-02 04:05:06 INFO base.py L641: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2025-12-02 04:05:06 INFO base.py L1677: start to cache block inputs
2025-12-02 04:05:15 INFO base.py L1693: caching done
Quantizing model.layers.0:   0%|          | 0/4 [00:00<?, ?it/s]
2025-12-02 04:05:16 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:05:16 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:05:16 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:05:16 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:05:16 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
quantized 8/8 layers in the block, loss iter 0: 0.000002 -> iter 30: 0.000000, 'peak_ram': 19.66GB, 'peak_vram': {'0': 74.71GB, '1': 2.67GB, '2': 1.72GB, '3': 4.33GB, '4': 5.02GB, '5': 5.02GB, '6': 5.36GB}
Quantizing model.layers.1:  25%|██▌       | 1/4 [00:45<02:16, 45.54s/it]
2025-12-02 04:06:01 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:06:01 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:06:01 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:06:01 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:06:01 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
quantized 8/8 layers in the block, loss iter 0: 0.000001 -> iter 31: 0.000000, 'peak_ram': 20.99GB, 'peak_vram': {'0': 74.94GB, '1': 2.67GB, '2': 1.72GB, '3': 4.33GB, '4': 5.02GB, '5': 5.02GB, '6': 5.36GB}
Quantizing model.layers.2:  50%|█████     | 2/4 [01:32<01:32, 46.40s/it]
2025-12-02 04:06:48 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:06:48 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:06:48 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:06:48 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:06:48 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
Traceback (most recent call last):
File "/home/yliu7/miniforge3/envs/ao/bin/auto_round", line 8, in <module>
sys.exit(run())
^^^^^
File "/home/user/auto-round/auto_round/__main__.py", line 869, in run
tune(args)
File "/home/user/auto-round/auto_round/__main__.py", line 647, in tune
model, folders = autoround.quantize_and_save(export_dir, format=args.format) # pylint: disable=E1101
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 1042, in quantize_and_save
model, _ = self.quantize()
^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 1717, in quantize
self._quantize_blocks(
File "/home/user/auto-round/auto_round/compressors/base.py", line 3017, in _quantize_blocks
q_input, input_ids = self._quantize_block(
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 2874, in _quantize_block
self._scale_loss_and_backward(scaler, loss)
File "/home/user/auto-round/auto_round/compressors/base.py", line 3252, in _scale_loss_and_backward
scale_loss.backward()
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/_tensor.py", line 625, in backward
torch.autograd.backward(
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: unspecified launch failure
Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Quantizing model.layers.2:  50%|█████     | 2/4 [01:40<01:40, 50.46s/it]
```
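
The error message in the traceback recommends a synchronous rerun so the failing kernel is reported at its actual call site rather than asynchronously at a later API call. A minimal sketch of such a rerun, reusing the reproduction command above with only the environment changed (`CUDA_LAUNCH_BLOCKING=1` is a standard CUDA/PyTorch debugging variable):

```bash
# Force synchronous kernel launches so the Python stack trace points at the
# kernel that actually raised cudaErrorLaunchFailure, not a later call.
# Expect tuning to run noticeably slower in this mode.
CUDA_LAUNCH_BLOCKING=1 AR_LOG_LEVEL=TRACE auto_round \
    --model /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4 \
    --device_map "0,1,2,3,4,5,6" \
    --scheme "W4A16" \
    --iters 32 \
    --low_gpu_mem_usage
```

Note that `TORCH_USE_CUDA_DSA`, also mentioned in the error message, is a build-time option: it only takes effect in a PyTorch build compiled with device-side assertions, so setting it as an environment variable on a stock wheel has no effect.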