
Commit

Remove redundant parameters for WOQ saving config and fix GPTQ issue (#1410)

* remove redundant parameters for WOQ

Signed-off-by: changwangss <chang1.wang@intel.com>
changwangss committed Mar 25, 2024
1 parent 4b50461 commit ef0882f
Showing 7 changed files with 106 additions and 47 deletions.
@@ -18,7 +18,7 @@ pip install -r requirements.txt
```

# Run
We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
>**Note**:
> Model type "llama" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3.
@@ -61,13 +61,13 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python ru
# load_in_4bit
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
--model bigcode/starcoder \
--load_in_4bit True \
--load_in_4bit \
--benchmark \
--batch_size 1
# load_in_8bit
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
--model bigcode/starcoder \
--load_in_8bit True \
--load_in_8bit \
--benchmark \
--batch_size 1
```
@@ -124,7 +124,7 @@ python run_generation.py \
# load_in_4bit
python run_generation.py \
--model bigcode/starcoder \
--load_in_4bit True \
--load_in_4bit \
--accuracy \
--batch_size 20 \
--n_samples 20 \
@@ -135,7 +135,7 @@ python run_generation.py \
# load_in_8bit
python run_generation.py \
--model bigcode/starcoder \
--load_in_8bit True \
--load_in_8bit \
--accuracy \
--batch_size 20 \
--n_samples 20 \
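The `--load_in_4bit True` → `--load_in_4bit` change above matches how plain boolean switches behave in argparse: the presence of the flag enables it, so a trailing `True` value is redundant. A minimal sketch of that pattern, with flag names assumed from the commands above rather than taken from the actual run_generation.py:

```python
import argparse

# Sketch only: store_true flags default to False and are enabled by presence,
# which is why the explicit "True" argument could be dropped from the README.
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="bigcode/starcoder")
parser.add_argument("--load_in_4bit", action="store_true")
parser.add_argument("--load_in_8bit", action="store_true")
parser.add_argument("--benchmark", action="store_true")
parser.add_argument("--batch_size", type=int, default=1)

args = parser.parse_args(["--load_in_4bit", "--benchmark"])
print(args.load_in_4bit, args.load_in_8bit)  # True False
```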
@@ -2,9 +2,10 @@
We provide the inference benchmarking script `run_generation.py` for large language models, The following are the models we validated, more models are working in progress.

# Quantization for CPU device

>**Note**:
> 1. default search algorithm is beam search with num_beams = 4.
> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 with use_neural_speed=False.
> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 for SmoothQuant.
## Prerequisite​
### Create Environment​
Pytorch and Intel-extension-for-pytorch version 2.1 are required, python version requests equal or higher than 3.9 due to [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation, the dependent packages are listed in requirements, we recommend create environment as the following steps.
@@ -21,7 +22,7 @@ pip install -r requirements.txt

## Run
We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.
We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.

### 1. Performance
``` bash
@@ -108,13 +109,13 @@ python run_generation.py \
# load_in_4bit
python run_generation.py \
--model EleutherAI/gpt-j-6b \
--load_in_4bit True \
--load_in_4bit \
--accuracy \
--tasks "lambada_openai"
# load_in_8bit
python run_generation.py \
--model EleutherAI/gpt-j-6b \
--load_in_8bit True \
--load_in_8bit \
--accuracy \
--tasks "lambada_openai"
# restore the model optimized with smoothquant
@@ -128,7 +129,7 @@ python run_generation.py \

```

# # Weight Only Quantization for GPU device
# Weight Only Quantization for GPU device
>**Note**:
> 1. default search algorithm is beam search with num_beams = 1.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) Support for the optimized inference of model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy. Ensure accurate inference for other model types as well.
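The note above states that the default decoding for this benchmark is beam search with `num_beams = 4`. A minimal, standalone generation sketch with that setting using the standard `transformers` API; the checkpoint and prompt are placeholders, not values from the script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: beam search with four beams, matching the documented default.
model_name = "EleutherAI/gpt-j-6b"  # example checkpoint from the README
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```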
@@ -423,6 +423,7 @@
args.model,
trust_remote_code=args.trust_remote_code,
_commit_hash=args._commit_hash,
use_neural_speed=args.use_neural_speed,
)

# save model
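The hunk above forwards the script's `use_neural_speed` argument into `from_pretrained` so that loading and saving agree on the backend. A hedged sketch of what the corresponding user-side call might look like; the checkpoint and flag values are chosen only for illustration:

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Illustrative only: load through the extension's AutoModel class, controlling
# the weight-only 4-bit path and the Neural Speed backend explicitly.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b",   # example checkpoint, as in the README commands
    load_in_4bit=True,       # weight-only 4-bit load (see the CLI flag above)
    use_neural_speed=False,  # stay on the PyTorch/IPEX path
    trust_remote_code=False,
)
```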
@@ -23,7 +23,9 @@
from peft.peft_model import PEFT_TYPE_TO_MODEL_MAPPING, PeftType
from peft.tuners.lora import LoraLayer, LoraModel
from peft.utils.other import transpose
from intel_extension_for_transformers.transformers.llm.quantization.autograd import matmul_kbit
from intel_extension_for_transformers.transformers.llm.quantization.autograd import (
matmul_kbit,
)
import intel_extension_for_transformers.qbits as qbits # pylint: disable=E0611, E0401


@@ -180,15 +182,13 @@ def set_weights_bias(
group_dict[group_idx] = 0
else:
group_dict[group_idx] = group_dict[group_idx] + 1
target_idx = group_idx * \
group_size + group_dict[group_idx]
target_idx = group_idx * group_size + group_dict[group_idx]
int_weight2[target_idx] = int_weight[i]
int_weight = int_weight2
else:
g_idx = torch.empty(0, dtype=torch.int32)
else:
g_idx = torch.empty(0, dtype=torch.int32)

if q_config.bits == 4:
int_weight = (int_weight - 8) * 16
gptq_scales = gptq_scales / 16
@@ -251,38 +251,65 @@ def quant_weight_w_scale(self, weight, scale, zp, group_size=-1):
leng = weight.shape[1] // group_size
tail_flag = False if weight.shape[1] % group_size == 0 else True
for i in range(leng):
int_weight_tmp = weight[:, i * group_size: (i + 1) * group_size].div_(
int_weight_tmp = weight[:, i * group_size : (i + 1) * group_size].div_(
scale[:, i].unsqueeze(1)
)
if zp is not None:
int_weight_tmp.add_(zp[:, i].unsqueeze(1))
int_weight[:, i * group_size: (i + 1) * group_size].copy_(
int_weight[:, i * group_size : (i + 1) * group_size].copy_(
int_weight_tmp.round_()
)
if tail_flag:
int_weight_tmp = weight[:, leng * group_size:].div_(
int_weight_tmp = weight[:, leng * group_size :].div_(
scale[:, -1].unsqueeze(1)
)
if zp is not None:
int_weight_tmp.add_(zp[:, -1].unsqueeze(1))
int_weight[:, leng * group_size:].copy_(int_weight_tmp.round_())
int_weight[:, leng * group_size :].copy_(int_weight_tmp.round_())
return int_weight

def recover_qparms(self):
def recover_idx(ret_idx, k, blocksize):
g_idx = torch.zeros(k, dtype=int)
value_range = (k + blocksize - 1) // blocksize
for i in range(value_range):
for j in range(blocksize):
g_idx[ret_idx[i * blocksize + j]] = i
return g_idx

def recover_int_weight(g_idx, int_weight):
group_dict = {}
ret_idx = torch.zeros(g_idx.shape, dtype=torch.int32)
for i in range(len(g_idx)):
group_idx = g_idx[i].item()
if group_idx not in group_dict:
target_idx = group_idx * group_size
group_dict[group_idx] = 0
else:
group_dict[group_idx] = group_dict[group_idx] + 1
target_idx = group_idx * group_size + group_dict[group_idx]
ret_idx[i] = target_idx

int_weight2 = int_weight.clone().zero_()
for i in range(len(ret_idx)):
int_weight2[i] = int_weight[ret_idx[i]]
int_weight = int_weight2
return int_weight

group_size = qbits.acquire_packed_weight_info(self.weight, 1)[0]
in_features = qbits.acquire_packed_weight_info(self.weight, 2)[0]
out_features = qbits.acquire_packed_weight_info(self.weight, 3)[0]
desc_act = qbits.acquire_packed_weight_info(self.weight, 4)[0] != 0
if desc_act:
g_idx = qbits.acquire_packed_weight_info(self.weight, 5)
g_idx = recover_idx(g_idx, in_features, group_size)
else:
g_idx = None
weight_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 6)
weight_dtype = "".join(
chr(ascii_code) for ascii_code in weight_dtype_ascii.tolist()
)
bits = 4 if weight_dtype in [
"nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
bits = 4 if weight_dtype in ["nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
compute_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 7)
compute_dtype = "".join(
chr(ascii_code) for ascii_code in compute_dtype_ascii.tolist()
@@ -319,6 +346,10 @@ def recover_qparms(self):
group_size=group_size,
)

if g_idx is not None:
int_weight = recover_int_weight(g_idx, int_weight.t())
int_weight = int_weight.t()

scales_dtype = torch.float32 if scales_dtype in ["fp32"] else None
return (
group_size,
@@ -361,15 +392,13 @@ def __init__(
scheme=kwargs.get("scheme", "sym"),
device=kwargs.get("device", None),
)
LoraLayer.__init__(self, in_features=in_features,
out_features=out_features)
LoraLayer.__init__(self, in_features=in_features, out_features=out_features)

# Freezing the pre-trained weight matrix
self.weight.requires_grad = False

init_lora_weights = kwargs.pop("init_lora_weights", True)
self.update_layer(adapter_name, r, lora_alpha,
lora_dropout, init_lora_weights)
self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
qbits_customop_available = True
try:
qbits.dropout_fwd
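The `recover_int_weight` helper added above inverts the activation-order (`g_idx`) row shuffle that `set_weights_bias` performs when packing a GPTQ weight. A standalone round-trip illustration in plain PyTorch; the toy shapes, values, and function names are for clarity and are not taken from the library:

```python
import torch

group_size = 2
g_idx = torch.tensor([1, 0, 1, 0], dtype=torch.int32)  # group id of each input channel
int_weight = torch.arange(4).unsqueeze(1)               # 4 input channels, 1 output channel

def pack_by_group(g_idx, w, group_size):
    # Scatter rows so that channels belonging to the same group become contiguous.
    counters, out = {}, w.clone().zero_()
    for i, g in enumerate(g_idx.tolist()):
        counters[g] = counters.get(g, -1) + 1
        out[g * group_size + counters[g]] = w[i]
    return out

def recover_by_group(g_idx, w, group_size):
    # Rebuild the same target indices and gather rows back to activation order.
    counters = {}
    ret_idx = torch.zeros(len(g_idx), dtype=torch.int64)
    for i, g in enumerate(g_idx.tolist()):
        counters[g] = counters.get(g, -1) + 1
        ret_idx[i] = g * group_size + counters[g]
    return w[ret_idx]

packed = pack_by_group(g_idx, int_weight, group_size)
recovered = recover_by_group(g_idx, packed, group_size)
assert torch.equal(recovered, int_weight)  # the shuffle is exactly undone
```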
@@ -26,7 +26,10 @@
from neural_compressor.adaptor.torch_utils.model_wrapper import WeightOnlyLinear
from neural_compressor.utils.utility import LazyImport
from neural_compressor.config import PostTrainingQuantConfig
from intel_extension_for_transformers.tools.utils import is_ipex_available, is_autoround_available
from intel_extension_for_transformers.tools.utils import (
is_ipex_available,
is_autoround_available,
)
from transformers import AutoTokenizer

if is_ipex_available():
@@ -71,7 +74,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
except:
# zeros and scales have different iteam numbers.
# remove 1 (due to 0 + 1 in line 68)
zeros = zeros[zeros !=1]
zeros = zeros[zeros != 1]
zeros = zeros.reshape(scales.shape)

# due to INC asym return torch.uint8 but backend request int8,
@@ -92,7 +95,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
# due to INC asym return torch.uint8 but backend request int8,
# change it to int8 with offset 128
if not sym:
weight = (weight.to(torch.int32) - 128). to(torch.int8)
weight = (weight.to(torch.int32) - 128).to(torch.int8)
return weight, scales, zeros


@@ -258,8 +261,13 @@ def _replace_linear(
# Force requires grad to False to avoid unexpected errors
model._modules[name].requires_grad_(False)
if device == "cpu" or device == torch.device("cpu") or device == "auto":
if quantization_config.weight_dtype in \
["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
if quantization_config.weight_dtype in [
"fp8_e5m2",
"fp8_e4m3",
"nf4",
"fp4",
"int4_fullrange",
]:
model._modules[name].set_fp_weights_bias(
module.weight.data,
None if module.bias is None else module.bias.data,
@@ -324,13 +332,17 @@ def convert_to_quantized_model(model, config, device="cpu"):
calib_dataloader = config.calib_dataloader
calib_func = config.calib_func
calib_iters = config.calib_iters
calib_dataset = config.dataset
model_device = next(model.parameters()).device

if calib_dataloader is None and config.quant_method.value not in ["rtn"]:
if (
calib_dataloader is None
and config.quant_method.value not in ["rtn"]
and calib_dataset is not None
):
from datasets import load_dataset
from torch.utils.data import DataLoader

calib_dataset = config.calib_dataset
if isinstance(calib_dataset, (str, bytes, os.PathLike)):
calib_dataset = load_dataset(calib_dataset, split="train")
calib_dataset = calib_dataset.shuffle(seed=42)
@@ -442,7 +454,7 @@ def default_calib_func(model):
True if "fullrange" in config.weight_dtype else False
),
"enable_mse_search": config.mse_range,
}
},
}
algorithm = "RTN"
elif config.quant_method.value == "awq":
@@ -470,7 +482,7 @@ def default_calib_func(model):
"use_max_length": True if config.max_input_length else False,
"pad_max_length": config.max_input_length,
"static_groups": config.static_groups,
}
},
}
algorithm = "GPTQ"
elif config.quant_method.value == "autoround":
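The guarded branch above only builds a calibration dataloader when a calibration dataset is actually configured and the algorithm needs one. A self-contained sketch of that kind of construction with the standard `datasets`/`transformers` APIs; the dataset name, tokenizer, sequence length, and batch size below are illustrative assumptions, not the library's defaults:

```python
import os

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

calib_dataset = "NeelNanda/pile-10k"  # example dataset identifier
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")

# Accept either a dataset name/path or an already-loaded datasets.Dataset.
if isinstance(calib_dataset, (str, bytes, os.PathLike)):
    calib_dataset = load_dataset(calib_dataset, split="train")
calib_dataset = calib_dataset.shuffle(seed=42)

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = calib_dataset.map(tokenize, batched=True)
tokenized.set_format(type="torch", columns=["input_ids"])

def collate(batch):
    # Pad each batch to its longest sequence.
    return torch.nn.utils.rnn.pad_sequence(
        [sample["input_ids"] for sample in batch], batch_first=True
    )

calib_dataloader = DataLoader(tokenized, batch_size=1, collate_fn=collate)
```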
@@ -127,6 +127,8 @@ def recover_export_model(model, current_key_name=None):
model._modules[name].pack(
int_weight, scales, zeros, module.bias, g_idx=g_idx
)
if g_idx is not None:
model._modules[name].g_idx = g_idx

if len(list(module.children())) > 0: # pylint: disable=E1101
_ = recover_export_model(module, current_key_name)
@@ -179,8 +181,13 @@ def convert_model_to_public(model):
module.qweight.data = module.qweight.t_().contiguous()
module.scales.data = module.scales.t_().contiguous()
module.weight_transposed = False
elif model.quantization_config.weight_dtype not in \
["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
elif model.quantization_config.weight_dtype not in [
"fp8_e5m2",
"fp8_e4m3",
"nf4",
"fp4",
"int4_fullrange",
]:
model = recover_export_model(model)


@@ -368,8 +375,10 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
exit(0)

if config.model_type in cls.model_type_list and not use_xpu:
if isinstance(quantization_config,
GPTQConfig) and config.model_type not in cls.model_type_list_for_gptq:
if (
isinstance(quantization_config, GPTQConfig)
and config.model_type not in cls.model_type_list_for_gptq
):
use_neural_speed = False
else:
use_neural_speed = True
@@ -609,7 +618,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
model = convert_to_quantized_model(
model, quantization_config, device=device_map
)
quantization_config.tokenizer = None
quantization_config.remove_redundant_parameters()
model.config.quantization_config = quantization_config

# add quantization_config and save_low_bit to pretrained model dynamically
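The final hunk replaces clearing just `quantization_config.tokenizer` with a single `remove_redundant_parameters()` call before the config is attached to the model and saved. Purely as a hypothetical sketch of the idea (the attribute names are guesses based on the calibration fields visible elsewhere in this diff; the real method may differ):

```python
class WOQConfigSketch:
    """Hypothetical stand-in for a weight-only-quantization config object."""

    def __init__(self, tokenizer=None, calib_func=None, calib_dataloader=None, dataset=None):
        self.tokenizer = tokenizer
        self.calib_func = calib_func
        self.calib_dataloader = calib_dataloader
        self.dataset = dataset

    def remove_redundant_parameters(self):
        # Calibration-time objects are neither JSON-serializable nor needed at
        # load time, so drop them before the config is written to disk.
        for attr in ("tokenizer", "calib_func", "calib_dataloader", "dataset"):
            setattr(self, attr, None)

cfg = WOQConfigSketch(tokenizer="<tokenizer>", calib_func=lambda model: None)
cfg.remove_redundant_parameters()
assert cfg.tokenizer is None and cfg.calib_func is None
```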
