Auto-round commit: 5329bcd
Reproduction command:
```bash
AR_LOG_LEVEL=TRACE auto_round \
    --model /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4 \
    --device_map "0,1,2,3,4,5,6" \
    --scheme "W4A16" \
    --iters 32 \
    --low_gpu_mem_usage
```

Log:

```
2025-12-02 04:05:05 INFO __main__.py L508: start to quantize /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4
2025-12-02 04:05:05 WARNING model.py L279: trust_remote_code is enabled by default, please ensure its correctness.
Loading checkpoint shards: 100%|██████████| 7/7 [00:00<00:00, 60.57it/s]
2025-12-02 04:05:06 INFO base.py L381: using torch.bfloat16 for quantization tuning
2025-12-02 04:05:06 INFO base.py L641: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2025-12-02 04:05:06 INFO base.py L1677: start to cache block inputs
2025-12-02 04:05:15 INFO base.py L1693: caching done
Quantizing model.layers.0:   0%|          | 0/4 [00:00<?, ?it/s]
2025-12-02 04:05:16 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:05:16 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:05:16 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:05:16 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:05:16 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
quantized 8/8 layers in the block, loss iter 0: 0.000002 -> iter 30: 0.000000, 'peak_ram': 19.66GB, 'peak_vram': {'0': 74.71GB, '1': 2.67GB, '2': 1.72GB, '3': 4.33GB, '4': 5.02GB, '5': 5.02GB, '6': 5.36GB}
Quantizing model.layers.1:  25%|██▌       | 1/4 [00:45<02:16, 45.54s/it]
2025-12-02 04:06:01 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:06:01 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:06:01 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:06:01 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:06:01 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
quantized 8/8 layers in the block, loss iter 0: 0.000001 -> iter 31: 0.000000, 'peak_ram': 20.99GB, 'peak_vram': {'0': 74.94GB, '1': 2.67GB, '2': 1.72GB, '3': 4.33GB, '4': 5.02GB, '5': 5.02GB, '6': 5.36GB}
Quantizing model.layers.2:  50%|█████     | 2/4 [01:32<01:32, 46.40s/it]
2025-12-02 04:06:48 DEBUG device.py L1061: Card 0 used memory details [Estimated]: 14.5078125 GB
2025-12-02 04:06:48 DEBUG device.py L1062: Block input output cache memory: 0 GB
2025-12-02 04:06:48 DEBUG device.py L1063: Quantized layer outputs memory: 6.75390625 GB
2025-12-02 04:06:48 DEBUG device.py L1064: Additional_memory from other ops: 7.75390625 GB
2025-12-02 04:06:48 DEBUG device.py L1090: Auto device map for block: {'self_attn.q_a_proj': 'cuda:0', 'self_attn.q_b_proj': 'cuda:2', 'self_attn.kv_a_proj_with_mqa': 'cuda:1', 'self_attn.kv_b_proj': 'cuda:1', 'self_attn.o_proj': 'cuda:3', 'mlp.gate_proj': 'cuda:4', 'mlp.up_proj': 'cuda:5', 'mlp.down_proj': 'cuda:6'}
Traceback (most recent call last):
File "/home/yliu7/miniforge3/envs/ao/bin/auto_round", line 8, in <module>
sys.exit(run())
^^^^^
File "/home/user/auto-round/auto_round/__main__.py", line 869, in run
tune(args)
File "/home/user/auto-round/auto_round/__main__.py", line 647, in tune
model, folders = autoround.quantize_and_save(export_dir, format=args.format) # pylint: disable=E1101
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 1042, in quantize_and_save
model, _ = self.quantize()
^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 1717, in quantize
self._quantize_blocks(
File "/home/user/auto-round/auto_round/compressors/base.py", line 3017, in _quantize_blocks
q_input, input_ids = self._quantize_block(
^^^^^^^^^^^^^^^^^^^^^
File "/home/user/auto-round/auto_round/compressors/base.py", line 2874, in _quantize_block
self._scale_loss_and_backward(scaler, loss)
File "/home/user/auto-round/auto_round/compressors/base.py", line 3252, in _scale_loss_and_backward
scale_loss.backward()
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/_tensor.py", line 625, in backward
torch.autograd.backward(
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/autograd/__init__.py", line 354, in backward
_engine_run_backward(
File "/home/yliu7/miniforge3/envs/ao/lib/python3.11/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: unspecified launch failure
Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Quantizing model.layers.2:  50%|█████     | 2/4 [01:40<01:40, 50.46s/it]
```
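
The error message in the traceback recommends a synchronous rerun so the failing kernel is reported at its actual call site rather than asynchronously at a later API call. A minimal sketch of such a rerun, reusing the reproduction command above with only the environment changed (`CUDA_LAUNCH_BLOCKING=1` is a standard CUDA/PyTorch debugging variable):

```bash
# Force synchronous kernel launches so the Python stack trace points at the
# kernel that actually raised cudaErrorLaunchFailure, not a later call.
# Expect tuning to run noticeably slower in this mode.
CUDA_LAUNCH_BLOCKING=1 AR_LOG_LEVEL=TRACE auto_round \
    --model /data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4 \
    --device_map "0,1,2,3,4,5,6" \
    --scheme "W4A16" \
    --iters 32 \
    --low_gpu_mem_usage
```

Note that `TORCH_USE_CUDA_DSA`, also mentioned in the error message, is a build-time option: it only takes effect in a PyTorch build compiled with device-side assertions, so setting it as an environment variable on a stock wheel has no effect.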