
[Bug]: Marlin backend provides unexpected behavior #1405


Problem Description

For W4A16, lambada_openai accuracy is 0, while piqa reports a normal value.
For AutoScheme mixed-precision, evaluation fails with AttributeError: 'Autotuner' object has no attribute '_cache_lock'.

Reproduction Steps

To reproduce:
For W4A16:

    def test_auto_scheme_export(self):
        model_name = get_model_path("facebook/opt-125m")
        ar = AutoRound(model=model_name, scheme="W4A16", iters=0, disable_opt_rtn=True)
        ar.quantize_and_save(self.save_dir)
        model_args = f"pretrained={self.save_dir}"
        task_name = "lambada_openai"
        # task_name = "piqa"
        result = simple_evaluate(model="hf", model_args=model_args, tasks=task_name, batch_size="auto")
        print(result["results"][task_name]["acc,none"])
        assert result["results"][task_name]["acc,none"] > 0.25
        shutil.rmtree(self.save_dir, ignore_errors=True)

For AutoScheme (requires fix 55a2797):

    def test_auto_scheme_export(self):
        model_name = get_model_path("facebook/opt-125m")
        scheme = AutoScheme(avg_bits=3, options=("W2A16", "W4A16", "W8A16", "BF16"))
        ar = AutoRound(model=model_name, scheme=scheme, iters=0, disable_opt_rtn=True)
        ar.quantize_and_save(self.save_dir)
        model_args = f"pretrained={self.save_dir}"
        result = simple_evaluate(model="hf", model_args=model_args, tasks="lambada_openai", batch_size="auto")
        print(result["results"]["lambada_openai"]["acc,none"])
        assert result["results"]["lambada_openai"]["acc,none"] > 0.25
        shutil.rmtree(self.save_dir, ignore_errors=True)
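As a side diagnostic (not part of the original report), it can help to confirm what was actually exported before running lm-eval. This is a minimal sketch assuming the saved model follows the usual Hugging Face layout, where quantize_and_save writes a config.json containing a quantization_config entry; save_dir below is a hypothetical stand-in for self.save_dir used in the tests above.

    import json
    import os

    # Hypothetical path standing in for self.save_dir from the tests above.
    save_dir = "./saved_quantized_model"

    # Inspect the quantization_config section (if present) to see which
    # packing format/backend and per-layer bit settings were written.
    with open(os.path.join(save_dir, "config.json")) as f:
        config = json.load(f)

    print(json.dumps(config.get("quantization_config", {}), indent=2))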

Environment Information

No response

Error Logs

Autotuner error log for AutoScheme

../auto_round_extension/triton/qlinear_tritonv2_zp.py:182: in forward
    out = quant_linear_fn.apply(                                                                                
/home/xinhe/.local/lib/python3.12/site-packages/torch/autograd/function.py:581: in apply                        
    return super().apply(*args, **kwargs)  # type: ignore[misc]                                                 
../auto_round_extension/triton/triton_utils_zp/dequant.py:172: in forward                                       
    output = quant_matmul_248(input, qweight, scales, qzeros, g_idx, bits, maxq)                                
../auto_round_extension/triton/triton_utils_zp/dequant.py:161: in quant_matmul_248                              
    W = dequant248(qweight, scales, qzeros, g_idx, bits, maxq=maxq, input_dtype=input_dtype)                    
../auto_round_extension/triton/triton_utils_zp/dequant.py:154: in dequant248                                    
    return dequant248_core(qweight, scales, qzeros, g_idx, bits, maxq=maxq, input_dtype=input_dtype)            
../auto_round_extension/triton/triton_utils_zp/dequant.py:132: in dequant248_core                               
    dequant_kernel_248[grid](                                                                                   
/home/xinhe/.local/lib/python3.12/site-packages/triton/runtime/jit.py:419: in <lambda>                          
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)                           
/home/xinhe/.local/lib/python3.12/site-packages/gptqmodel/utils/nogil_patcher.py:224: in patched_run            
    config, used_cached_result, bench_time = _get_config_for_key(self, key, args, kwargs)                       
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                                                                                                                
self = <triton.runtime.autotuner.Autotuner object at 0x7a1258890b30>                                            
key = (589824, 'torch.int32', 'torch.float16', 'torch.int32', 'torch.int32', 'torch.float16')                   
args = (tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,                         
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...., nan, nan, nan],                                                      
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',                                                 
       dtype=torch.float16), 589824)                                                                            
kwargs = {'bits': 2, 'grid': <function dequant248_core.<locals>.<lambda> at 0x7a11b03a5ee0>, 'maxq': 3, 'num_groups': 6, ...}
                                                                                                                
    def _get_config_for_key(self, key, args, kwargs):                                                           
>       with self._cache_lock:                                                                                  
E       AttributeError: 'Autotuner' object has no attribute '_cache_lock'                                       
                                                                                                                
/home/xinhe/.local/lib/python3.12/site-packages/gptqmodel/utils/nogil_patcher.py:149: AttributeError
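
A possible stopgap for the AttributeError (an assumption, not a verified fix): gptqmodel's nogil_patcher replaces Triton's Autotuner.run with a version that expects a _cache_lock attribute, which this Triton version's Autotuner apparently does not define. Attaching a lock to the class before triggering inference avoids the crash; whether the rest of the patched autotuning path then behaves correctly is untested.

    import threading
    from triton.runtime.autotuner import Autotuner

    # Assumption-based stopgap: give the patched run() the lock it expects.
    # A single class-level lock is used here; the patcher may intend a
    # per-instance lock, so treat this only as a debugging aid.
    if not hasattr(Autotuner, "_cache_lock"):
        Autotuner._cache_lock = threading.Lock()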

Additional Context

No response
