support autoTP with weight only quantization in DS inference path #4750

Open · wants to merge 4 commits into master
Conversation

ftian1 (Contributor) commented Nov 29, 2023

This PR makes weight-only quantization (WOQ) work with AutoTP.

Sample usage is shown below:

    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM
    from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

    # model_id, world_size and device are provided by the user / launcher environment.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

    # AutoTP: shard the model across ranks without kernel injection.
    ds_model = deepspeed.init_inference(model,
                                        mp_size=world_size,
                                        dtype=torch.float16,
                                        replace_with_kernel_inject=False)

    model = ds_model.module

    # Apply weight-only quantization after AutoTP sharding.
    ds_config = {
        "weight_quantization": {
            "post_init_quant": {
                '*': {                      # apply to all matched modules
                    'num_bits': 4,          # 4-bit weight-only quantization
                    'group_size': 32,
                    'group_dim': 1,
                    'symmetric': False
                },
            }
        }
    }
    model = _init_group_wise_weight_quantization(model, ds_config)

In this way, users can enable WOQ on multiple cards.
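
For completeness, here is a minimal, illustrative sketch of running generation on the quantized module afterwards; the tokenizer, prompt, and generation arguments are assumptions for illustration and are not part of this PR:

    # Illustrative only: run generation with the AutoTP + WOQ model built above.
    # The prompt and generation settings are placeholders.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))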

delock (Contributor) commented Nov 30, 2023

@ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

ftian1 (Contributor, Author) commented Dec 1, 2023

> @ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124,
where it dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. Such an implementation could be enhanced to support other hardware's OpBuilder.
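
To illustrate the dispatch condition described above, a rough paraphrase (not the exact utils.py code) would look like this:

    import torch

    # Paraphrase of the dispatch rule: only take the CUDA QuantizerBuilder kernel
    # path when the tensor lives on CUDA and group_size is divisible by 8;
    # otherwise fall back to the pure-PyTorch quantization path.
    def use_cuda_quantizer_kernel(tensor: torch.Tensor, group_size: int) -> bool:
        return tensor.is_cuda and group_size % 8 == 0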

delock (Contributor) commented Dec 2, 2023

> @ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

> Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124, where it dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. Such an implementation could be enhanced to support other hardware's OpBuilder.

It would be better to detect the existence of custom kernels by checking attributes of the loaded ops and to call the custom kernel accordingly, so that any accelerator implementing these kernels would be plugged in.
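
A minimal sketch of what that attribute-based detection could look like; the `quantize` attribute name and the fallback behavior are hypothetical, not an existing DeepSpeed API:

    # Hypothetical sketch: probe the loaded op module for the needed kernel
    # instead of branching on the device string, so any accelerator that
    # implements it is picked up automatically.
    def get_quantizer_module():
        try:
            from deepspeed.ops.op_builder import QuantizerBuilder
            ops = QuantizerBuilder().load()
        except Exception:
            return None  # builder/kernels not available on this accelerator
        # `quantize` is a placeholder attribute name used for illustration.
        return ops if hasattr(ops, "quantize") else None

    # Callers would fall back to the pure-PyTorch path when this returns None.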

Review comment on the following test snippet:

    ds_output = pipe(query, **inf_kwargs)

    # print(local_rank, "baseline", bs_output)
    print(local_rank, "deepspeed", ds_output)
baodii (Contributor) commented:
Hi @ftian1, I have run this test, but the result I got is `deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]`. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.

ftian1 (Contributor, Author) replied:

@baodii May I know which device you are running on, CUDA or CPU?

delock (Contributor) commented Dec 19, 2023

@ftian1 Is the usage of WOQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WOQ in DeepSpeed looks like with kernel injection?
