Add 4-bit quantized inference to run BLOOM-176B on 2 A100 GPUs #2526

Open
RezaYazdaniAminabadi wants to merge 14 commits into master

Conversation

RezaYazdaniAminabadi (Contributor) commented Nov 18, 2022

This PR adds support for 4-bit quantization in DeepSpeed-Inference, making it possible to run large-scale models such as BLOOM-176B on 2x/4x fewer GPUs than the INT8/FP16 inference pipelines.
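For a picture of what a 4-bit weight path stores, here is a minimal sketch of group-wise symmetric INT4 weight quantization. It is illustrative only (the group size and helper names are arbitrary) and is not the fused DeepSpeed-Inference kernel, but it shows why weight memory is halved relative to INT8: two 4-bit values are packed per byte.

```python
# Illustrative group-wise symmetric INT4 quantize/dequantize (not DeepSpeed code).
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 64):
    # Assumes w.numel() is divisible by group_size and group_size is even.
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # INT4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    nibbles = (q & 0xF).to(torch.uint8)                   # low 4 bits of each value
    packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)   # two values per byte
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor, group_size: int = 64):
    lo = (packed & 0xF).to(torch.int16)
    hi = (packed >> 4).to(torch.int16)
    q = torch.stack((lo, hi), dim=2).reshape(-1, group_size)
    q = torch.where(q >= 8, q - 16, q).to(scale.dtype)    # recover signed 4-bit values
    return q * scale
```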

As a first accuracy evaluation on 2 A100 GPUs, we see good-quality text generated from the prompt below:

in=DeepSpeed is a machine learning framework 
out=DeepSpeed is a machine learning framework in C. The aim of the project is to create an accurate and efficient algorithm, 
which is easy to expand and improve. It is very useful in the classification of an input. The key to the success of the project is the
use of the features of the C programming language. The project has a lot of code, but the most important part of their is in the 
C language,

How to run inference

You can find the running scripts here. To run the model in 4-bit, you can start from an FP16 or INT8 checkpoint; the DeepSpeed-Inference pipeline generates the 4-bit checkpoint on the fly and runs inference on as few as 2 A100-80G GPUs. Here is the command used to generate the above text:

deepspeed --num_gpus 2 bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int4 --batch_size 1 --benchmark
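For context, below is a rough sketch of the `deepspeed.init_inference` call a script like `bloom-ds-inference.py` makes. A small public checkpoint is used as a stand-in so the snippet is self-contained; the command above instead starts from the pre-sharded INT8 BLOOM checkpoint, and how `--dtype int4` selects the new 4-bit kernels is specific to this PR and not shown here.

```python
# Launch with: deepspeed --num_gpus 2 this_script.py
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # stand-in for the 176B checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Kernel injection swaps the transformer blocks for DeepSpeed-Inference kernels;
# mp_size matches --num_gpus in the launch command.
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,  # the quantized paths (INT8, and INT4 in this PR) pick quantized kernels
    replace_with_kernel_inject=True,
)

device = f"cuda:{int(os.getenv('LOCAL_RANK', '0'))}"
inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to(device)
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=64)[0]))
```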

Here are the performance stats for running on 2 A100 GPUs:

*** Performance stats:
Throughput per token including tokenize: 283.44 msecs
Start to ready to generate: 120.887 secs
Tokenize and generate 438 (bs=1) tokens: 27.788 secs
Start to finish: 148.675 sec

Compared to the 4-GPU INT8 pipeline, batch-1 latency increases from 160.5 ms to 283.4 ms per token while the number of GPUs is halved, which reduces the inference cost by about 13%.
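As a back-of-envelope check of that figure, assume cost scales with GPU-count times per-token latency:

```python
# Cost per generated token, assuming cost ~ (number of GPUs) x (latency per token).
int8_cost = 4 * 160.5   # 4 GPUs x 160.5 ms  = 642 GPU-ms per token
int4_cost = 2 * 283.4   # 2 GPUs x 283.4 ms ~= 567 GPU-ms per token
print(f"cost reduction: {1 - int4_cost / int8_cost:.1%}")  # ~12%, in the ballpark of the quoted figure
```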

We will add more performance results to check the throughput improvement.

More to come

  1. Even though this is currently only supported for BLOOM, we are working on generalizing the feature to other models in order to reduce inference cost and democratize inference of such huge models.
  2. We will add more tests to measure the accuracy metrics of the INT4 inference pipeline.
  3. We will add a faster inference solution using INT4 quantized inference through DeepSpeed-MII.

cc: @jeffra @yaozhewei @cmikeh2

@@ -128,7 +128,7 @@ def forward(
            input = input[0]
        input_type = input.dtype

-        if (self.config.fp16 or self.config.q_int8) \
+        if (self.config.fp16 or self.config.qunatize) \
Suggested change
-        if (self.config.fp16 or self.config.qunatize) \
+        if (self.config.fp16 or self.config.quantize) \

LifeIsStrange commented Mar 31, 2023

Unrelated, but you might be interested in borrowing ideas from SmoothQuant, which seems to enable more performant quantization, especially faster inference.

SmoothQuant migrates part of the quantization difficulties from activation to weights, which smooths out the systematic outliers in activation, making both weights and activations easy to quantize.

While the concept was applied to INT8, I don't see why it couldn't be applied to INT4.
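For reference, the core of SmoothQuant is a per-input-channel rescaling that migrates range from activations into the following layer's weights. A minimal sketch (not DeepSpeed code; `alpha` is the migration-strength hyperparameter from the paper):

```python
# Minimal sketch of SmoothQuant-style scale migration.
import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_absmax: per-input-channel max |activation| from calibration, shape [in_features]
    # weight:     weight of the following linear layer, shape [out_features, in_features]
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

# Dividing activations and multiplying weights by these scales leaves x @ weight.T
# unchanged, but flattens activation outliers so both tensors quantize more easily:
#   x_smooth = x / scales
#   w_smooth = weight * scales   # broadcast over the in_features dimension
```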

groenenboomj commented

@RezaYazdaniAminabadi Is there an updated version of this PR? I'm having some issues running it out of the box on CUDA using the following related branch of transformers:
https://github.com/RezaYazdaniAminabadi/transformers-bloom-inference/tree/int4-bloom
