support autoTP with weight only quantization in DS inference path #4750

Open · wants to merge 4 commits into master
Conversation

ftian1 (Contributor) commented Nov 29, 2023

This PR makes weight-only quantization (WOQ) work with AutoTP.

Sample usage is shown below:

    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM
    from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

    # model_id, world_size and device are provided by the user / launcher environment.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

    # AutoTP: shard the model across ranks without kernel injection.
    ds_model = deepspeed.init_inference(model,
                                        mp_size=world_size,
                                        dtype=torch.float16,
                                        replace_with_kernel_inject=False)

    model = ds_model.module

    # Apply weight-only quantization after AutoTP sharding.
    ds_config = {
        "weight_quantization": {
            "post_init_quant": {
                '*': {                      # apply to all matched modules
                    'num_bits': 4,          # 4-bit weight-only quantization
                    'group_size': 32,
                    'group_dim': 1,
                    'symmetric': False
                },
            }
        }
    }
    model = _init_group_wise_weight_quantization(model, ds_config)

In this way, users can enable WOQ on multiple cards.
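
For completeness, here is a minimal, illustrative sketch of running generation on the quantized module afterwards; the tokenizer, prompt, and generation arguments are assumptions for illustration and are not part of this PR:

    # Illustrative only: run generation with the AutoTP + WOQ model built above.
    # The prompt and generation settings are placeholders.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))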

delock (Contributor) commented Nov 30, 2023

@ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

ftian1 (Contributor, Author) commented Dec 1, 2023

> @ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124,
where it dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. Such an implementation could be enhanced to support other hardware's OpBuilder.
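
To illustrate the dispatch condition described above, a rough paraphrase (not the exact utils.py code) would look like this:

    import torch

    # Paraphrase of the dispatch rule: only take the CUDA QuantizerBuilder kernel
    # path when the tensor lives on CUDA and group_size is divisible by 8;
    # otherwise fall back to the pure-PyTorch quantization path.
    def use_cuda_quantizer_kernel(tensor: torch.Tensor, group_size: int) -> bool:
        return tensor.is_cuda and group_size % 8 == 0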

delock (Contributor) commented Dec 2, 2023

> @ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

> Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124, where it dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. Such an implementation could be enhanced to support other hardware's OpBuilder.

It would be better to detect the existence of custom kernels by checking attributes of the loaded ops and to call the custom kernel accordingly, so that any accelerator implementing these kernels would be plugged in.
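
A minimal sketch of what that attribute-based detection could look like; the `quantize` attribute name and the fallback behavior are hypothetical, not an existing DeepSpeed API:

    # Hypothetical sketch: probe the loaded op module for the needed kernel
    # instead of branching on the device string, so any accelerator that
    # implements it is picked up automatically.
    def get_quantizer_module():
        try:
            from deepspeed.ops.op_builder import QuantizerBuilder
            ops = QuantizerBuilder().load()
        except Exception:
            return None  # builder/kernels not available on this accelerator
        # `quantize` is a placeholder attribute name used for illustration.
        return ops if hasattr(ops, "quantize") else None

    # Callers would fall back to the pure-PyTorch path when this returns None.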

Review comment on the following test snippet:

    ds_output = pipe(query, **inf_kwargs)

    # print(local_rank, "baseline", bs_output)
    print(local_rank, "deepspeed", ds_output)
baodii (Contributor) commented:
Hi @ftian1, I have run this test, but the result I got is `deepspeed [{'generated_text': 'DeepSpeed is the greatest,,,,,,,,,,,,,,,'}]`. This result is not right. Can you figure out what's wrong with this test? BTW, I can pass all tests in test_intX_quantization.py.

ftian1 (Contributor, Author) replied:

@baodii May I know which device you are running on, CUDA or CPU?

delock (Contributor) commented Dec 19, 2023

@ftian1 Is the usage of WOQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WOQ in DeepSpeed looks like with kernel injection?
