
multiple device issues #1004

@wenhuach21

Description

Quantizing Qwen3-8B across two GPUs crashes partway through block tuning with `torch.AcceleratorError: CUDA error: unspecified launch failure`:

python3 -m auto_round --model /models/Qwen3-8B  --scheme "W4A16"  --format "auto_round" --device_map 2,6

(autoround) wenhuach@mlp-dgx-01:~/auto-round$ python3 -m auto_round --model /models/Qwen3-8B --scheme "W4A16" --format "auto_round" --device_map 2,6
2025-11-06 22:31:03 INFO main.py L485: start to quantize /models/Qwen3-8B
2025-11-06 22:31:04 WARNING modeling_utils.py L4670: torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 80.30it/s]
2025-11-06 22:31:07 INFO base.py L596: 'enable_torch_compile' is set to False by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2025-11-06 22:31:07 INFO base.py L354: using torch.bfloat16 for quantization tuning
2025-11-06 22:31:07 INFO base.py L1583: start to cache block inputs
2025-11-06 22:31:17 INFO base.py L1598: caching done
Quantizing model.layers.0:   0%|
lib/python3.13/site-packages/torch/autograd/graph.py:841: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
2025-11-06 22:31:56 INFO base.py L2651: quantized 7/7 layers in the block, loss iter 0: 0.000127 -> iter 132: 0.000056
Quantizing model.layers.1:   3%|██████▏ | 1/36 [00:42<24:36, 42.18s/it]
2025-11-06 22:32:27 INFO base.py L2651: quantized 7/7 layers in the block, loss iter 0: 0.000299 -> iter 184: 0.000103
Quantizing model.layers.2:   6%|████████████▍ | 2/36 [01:12<19:51, 35.04s/it]
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/wenhuach/auto-round/auto_round/main.py", line 854, in
run()
~~~^^
File "/home/wenhuach/auto-round/auto_round/main.py", line 835, in run
tune(args)
~~~~^^^^^^
File "/home/wenhuach/auto-round/auto_round/main.py", line 621, in tune
model, folders = autoround.quantize_and_save(export_dir, format=args.format) # pylint: disable=E1101
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 989, in quantize_and_save
model, _ = self.quantize()
~~~~~~~~~~~~~^^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 1622, in quantize
self._quantize_blocks(
~~~~~~~~~~~~~~~~~~~~~^
self.model,
^^^^^^^^^^^
...<5 lines>...
pbar=pbar,
^^^^^^^^^^
)
^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 2789, in _quantize_blocks
q_input, input_ids = quantize_block(
~~~~~~~~~~~~~~^
m,
^^
...<3 lines>...
device=device,
^^^^^^^^^^^^^^
)
^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 2508, in _quantize_block
clear_memory(input_ids)
~~~~~~~~~~~~^^^^^^^^^^^
File "/home/wenhuach/miniforge3/envs/autoround/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
return fn(*args, **kwargs)
File "/home/wenhuach/auto-round/auto_round/utils/device.py", line 429, in clear_memory
_clear_memory_for_cpu_and_cuda(tensor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/home/wenhuach/auto-round/auto_round/utils/device.py", line 416, in _clear_memory_for_cpu_and_cuda
torch.cuda.empty_cache()
~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/wenhuach/miniforge3/envs/autoround/lib/python3.13/site-packages/torch/cuda/memory.py", line 224, in empty_cache
torch._C._cuda_emptyCache()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
torch.AcceleratorError: CUDA error: unspecified launch failure
Search for `cudaErrorLaunchFailure` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
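A debugging sketch, not part of the original report: the error message's own suggestion can be applied to the failing command. With `CUDA_LAUNCH_BLOCKING=1`, kernel launches run synchronously, so the traceback should point at the kernel that actually failed rather than a later call such as `torch.cuda.empty_cache()`. The model path, scheme, and device indices below are copied from the log above.

```shell
# Re-run the same command with synchronous kernel launches, as the error
# message suggests, to get an accurate failure site in the traceback.
CUDA_LAUNCH_BLOCKING=1 python3 -m auto_round \
    --model /models/Qwen3-8B \
    --scheme "W4A16" \
    --format "auto_round" \
    --device_map 2,6
```

A common isolation step (an assumption, not a confirmed fix) is to also retry with a single device, e.g. `--device_map 2`, to distinguish a genuine multi-device bug from a problem on one specific GPU.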
