
ZeroQuant quantization kernels and LKD #2207

Open
sdpmas opened this issue Aug 10, 2022 · 12 comments

@sdpmas

sdpmas commented Aug 10, 2022

Hi,

I was trying out the compression library for ZeroQuant quantization (on a GPT-J model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference. I have a few questions:

  • Do you have a guide to running inference on compressed models (especially ZeroQuant)? InferenceEngine only seems to support Mixture-of-Quantization, not ZeroQuant. I also tried int8 quantization without the compression module, as shown in the code snippet below, but ended up with a CUDA "illegal memory access" error.
  • Have you released the fused GeLU+Quantize and GeMM+Dequantize kernels proposed in the ZeroQuant paper yet?
  • Is there a tentative release date for layer-by-layer knowledge distillation (LKD)?
  • What's the motivation for multiplying the quantized input by the scale here? Wouldn't that dequantize the inputs?
import torch
import deepspeed
from deepspeed import module_inject

# gptj_transformer is the Hugging Face GPT-J block class, e.g.
# from transformers.models.gptj.modeling_gptj import GPTJBlock as gptj_transformer
injection_policy = {gptj_transformer:
                    module_inject.replace_policy.HFGPTJLayerPolicy}

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.int8,
    quantization_setting=2,
    replace_with_kernel_inject=True,
    injection_policy=injection_policy,
)
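
For context, here is roughly how I compressed the model in the first place. This is a minimal sketch using DeepSpeed's compression module; ds_config.json is a placeholder for a config holding the compression_training settings from the DeepSpeed compression tutorial:

from deepspeed.compression.compress import init_compression, redundancy_clean
from transformers import AutoModelForCausalLM

# ds_config.json is assumed to contain a "compression_training" section
# enabling weight/activation quantization, per the DeepSpeed tutorial.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Replace quantizable layers with wrappers that simulate quantization.
model = init_compression(model, "ds_config.json")

# ... run evaluation / light fine-tuning to verify accuracy ...

# Fold the compression wrappers back into plain modules when done.
model = redundancy_clean(model, "ds_config.json")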

Any help would be appreciated.

@gsujankumar

gsujankumar commented Aug 11, 2022

It looks like the inference kernels for ZeroQuant have not been released.

@sdpmas
Author

sdpmas commented Aug 11, 2022

@gsujankumar have you, by any chance, been able to quantize GPT-family models like GPT-2 or GPT-J?

@yaozhewei
Contributor

Hi,

The ZeroQuant inference engine has not been released yet. The code example in DeepSpeedExamples is only meant to help verify ZeroQuant's accuracy.

The kernel/engine release is on our calendar, and we are actively working to make it compatible with various models. Please stay tuned.

For LKD, we will also release it soon.

For the last question: the code for training and accuracy testing is different from the final inference engine. Here, everything is simulated, so we can do quantization-aware training and similar experiments.
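
As a rough illustration (a minimal sketch, not the actual engine code): simulated quantization rounds to the integer grid and immediately multiplies back by the scale, so the tensor stays in floating point but carries the INT8 rounding error. That is why multiplying by the scale is expected here:

import torch

def simulated_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor quantization, immediately dequantized: the
    # result is a float tensor that only takes values on the INT8 grid.
    qmax = 2 ** (num_bits - 1) - 1        # 127 for 8 bits
    scale = x.abs().max() / qmax          # step size between grid points
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                      # multiplying by scale dequantizes

x = torch.randn(4, 4)
print((x - simulated_quantize(x)).abs().max())  # small rounding error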

@sdpmas
Author

sdpmas commented Aug 11, 2022

Thanks for replying, @yaozhewei. Could you provide an estimate of when ZeroQuant inference will be released? Any rough estimate would help!

@xk503775229

I have the same question: is there any guide to running inference on compressed models (especially ZeroQuant)?
Any help would be appreciated.

@xk503775229

Hi, when will ZeroQuant inference be released?

@david-macleod

@yaozhewei any news on this?

@yaozhewei
Contributor

@david-macleod The LKD example has just been released (not merged yet): microsoft/DeepSpeedExamples#214

For the kernels, please stay tuned.
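
At a high level, LKD distills one layer at a time: a quantized copy of layer k is trained to match the original layer k's outputs on activations taken from the original model, with no labels and no end-to-end backpropagation. A minimal sketch under those assumptions (fake_quant and the training loop below are illustrative placeholders, not the released example code, which keeps the weights quantized throughout training):

import copy
import torch
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Simulated symmetric quantization: round to the integer grid,
    # then multiply back by the scale (quantize + dequantize).
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def distill_layer(teacher: torch.nn.Module, inputs, steps: int = 100, lr: float = 1e-4):
    # Train a quantized copy of one layer to mimic its fp32 original on
    # that layer's own input activations. For brevity this sketch only
    # quantizes the weights once at initialization.
    student = copy.deepcopy(teacher)
    with torch.no_grad():
        for p in student.parameters():
            p.copy_(fake_quant(p))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        for x in inputs:  # activations captured from the original model
            with torch.no_grad():
                target = teacher(x)
            loss = F.mse_loss(student(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student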

@david-macleod

Thanks @yaozhewei! Do you know whether there is a rough timeline for this, e.g. 1 month, 6 months, 1 year? It would be very useful to know, as we'd like to decide whether to wait or explore other options. Thanks again!

@yaozhewei yaozhewei self-assigned this Nov 4, 2022
@HarleysZhang

I have the same problem: after applying ZeroQuant with the DeepSpeedExamples repository's code, I didn't see any throughput/latency gain from the quantization during inference, only a reduction in model size.
Have the inference kernels for ZeroQuant been released yet?

@aakejiang

aakejiang commented Jun 1, 2023

@yaozhewei any update on this? Is the engine of ZeroQuant inference released?

@Moran232

Moran232 commented Jun 22, 2023

@yaozhewei the newest DeepSpeed (>= 0.9.0) can't run any model in INT8; many open issues remain unresolved. Can you tell us which version of DeepSpeed can run INT8 models? I just want to reproduce the results from your ZeroQuant paper.
