
multiple device issues #1004

@wenhuach21

Description

Quantizing Qwen3-8B across two GPUs crashes partway through block tuning with `torch.AcceleratorError: CUDA error: unspecified launch failure`:

python3 -m auto_round --model /models/Qwen3-8B  --scheme "W4A16"  --format "auto_round" --device_map 2,6

(autoround) wenhuach@mlp-dgx-01:~/auto-round$ python3 -m auto_round --model /models/Qwen3-8B --scheme "W4A16" --format "auto_round" --device_map 2,6
2025-11-06 22:31:03 INFO main.py L485: start to quantize /models/Qwen3-8B
2025-11-06 22:31:04 WARNING modeling_utils.py L4670: torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 80.30it/s]
2025-11-06 22:31:07 INFO base.py L596: 'enable_torch_compile' is set to False by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2025-11-06 22:31:07 INFO base.py L354: using torch.bfloat16 for quantization tuning
2025-11-06 22:31:07 INFO base.py L1583: start to cache block inputs
2025-11-06 22:31:17 INFO base.py L1598: caching done
Quantizing model.layers.0:   0%|
lib/python3.13/site-packages/torch/autograd/graph.py:841: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
2025-11-06 22:31:56 INFO base.py L2651: quantized 7/7 layers in the block, loss iter 0: 0.000127 -> iter 132: 0.000056
Quantizing model.layers.1:   3%|██████▏ | 1/36 [00:42<24:36, 42.18s/it]
2025-11-06 22:32:27 INFO base.py L2651: quantized 7/7 layers in the block, loss iter 0: 0.000299 -> iter 184: 0.000103
Quantizing model.layers.2:   6%|████████████▍ | 2/36 [01:12<19:51, 35.04s/it]
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/wenhuach/auto-round/auto_round/main.py", line 854, in
run()
~~~^^
File "/home/wenhuach/auto-round/auto_round/main.py", line 835, in run
tune(args)
~~~~^^^^^^
File "/home/wenhuach/auto-round/auto_round/main.py", line 621, in tune
model, folders = autoround.quantize_and_save(export_dir, format=args.format) # pylint: disable=E1101
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 989, in quantize_and_save
model, _ = self.quantize()
~~~~~~~~~~~~~^^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 1622, in quantize
self._quantize_blocks(
~~~~~~~~~~~~~~~~~~~~~^
self.model,
^^^^^^^^^^^
...<5 lines>...
pbar=pbar,
^^^^^^^^^^
)
^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 2789, in _quantize_blocks
q_input, input_ids = quantize_block(
~~~~~~~~~~~~~~^
m,
^^
...<3 lines>...
device=device,
^^^^^^^^^^^^^^
)
^
File "/home/wenhuach/auto-round/auto_round/compressors/base.py", line 2508, in _quantize_block
clear_memory(input_ids)
~~~~~~~~~~~~^^^^^^^^^^^
File "/home/wenhuach/miniforge3/envs/autoround/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
return fn(*args, **kwargs)
File "/home/wenhuach/auto-round/auto_round/utils/device.py", line 429, in clear_memory
_clear_memory_for_cpu_and_cuda(tensor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/home/wenhuach/auto-round/auto_round/utils/device.py", line 416, in _clear_memory_for_cpu_and_cuda
torch.cuda.empty_cache()
~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/wenhuach/miniforge3/envs/autoround/lib/python3.13/site-packages/torch/cuda/memory.py", line 224, in empty_cache
torch._C._cuda_emptyCache()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
torch.AcceleratorError: CUDA error: unspecified launch failure
Search for `cudaErrorLaunchFailure` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
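A debugging sketch, not part of the original report: the error message's own suggestion can be applied to the failing command. With `CUDA_LAUNCH_BLOCKING=1`, kernel launches run synchronously, so the traceback should point at the kernel that actually failed rather than a later call such as `torch.cuda.empty_cache()`. The model path, scheme, and device indices below are copied from the log above.

```shell
# Re-run the same command with synchronous kernel launches, as the error
# message suggests, to get an accurate failure site in the traceback.
CUDA_LAUNCH_BLOCKING=1 python3 -m auto_round \
    --model /models/Qwen3-8B \
    --scheme "W4A16" \
    --format "auto_round" \
    --device_map 2,6
```

A common isolation step (an assumption, not a confirmed fix) is to also retry with a single device, e.g. `--device_map 2`, to distinguish a genuine multi-device bug from a problem on one specific GPU.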
