Add 4-bit quantized inference to run BLOOM-176B on 2 A100 GPUs #2526

Open
RezaYazdaniAminabadi wants to merge 14 commits into master

Conversation

RezaYazdaniAminabadi (Contributor) commented Nov 18, 2022

This PR adds support for 4-bit quantization in DeepSpeed-Inference, making it possible to run large-scale models such as BLOOM-176B on 2x/4x fewer GPUs than the INT8/FP16 inference pipelines.
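For a picture of what a 4-bit weight path stores, here is a minimal sketch of group-wise symmetric INT4 weight quantization. It is illustrative only (the group size and helper names are arbitrary) and is not the fused DeepSpeed-Inference kernel, but it shows why weight memory is halved relative to INT8: two 4-bit values are packed per byte.

```python
# Illustrative group-wise symmetric INT4 quantize/dequantize (not DeepSpeed code).
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 64):
    # Assumes w.numel() is divisible by group_size and group_size is even.
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # INT4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    nibbles = (q & 0xF).to(torch.uint8)                   # low 4 bits of each value
    packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)   # two values per byte
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor, group_size: int = 64):
    lo = (packed & 0xF).to(torch.int16)
    hi = (packed >> 4).to(torch.int16)
    q = torch.stack((lo, hi), dim=2).reshape(-1, group_size)
    q = torch.where(q >= 8, q - 16, q).to(scale.dtype)    # recover signed 4-bit values
    return q * scale
```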

As a first accuracy evaluation on 2 A100 GPUs, we see good-quality text generated from the prompt below:

in=DeepSpeed is a machine learning framework 
out=DeepSpeed is a machine learning framework in C. The aim of the project is to create an accurate and efficient algorithm, 
which is easy to expand and improve. It is very useful in the classification of an input. The key to the success of the project is the
use of the features of the C programming language. The project has a lot of code, but the most important part of their is in the 
C language,

How to run inference

You can find the running scripts here. To run the model in 4-bit, you can start from an FP16 or INT8 checkpoint; the DeepSpeed-Inference pipeline generates the 4-bit checkpoint on the fly and runs inference on as few as 2 A100-80G GPUs. Here is the command used to generate the above text:

deepspeed --num_gpus 2 bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int4 --batch_size 1 --benchmark
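For context, below is a rough sketch of the `deepspeed.init_inference` call a script like `bloom-ds-inference.py` makes. A small public checkpoint is used as a stand-in so the snippet is self-contained; the command above instead starts from the pre-sharded INT8 BLOOM checkpoint, and how `--dtype int4` selects the new 4-bit kernels is specific to this PR and not shown here.

```python
# Launch with: deepspeed --num_gpus 2 this_script.py
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # stand-in for the 176B checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

# Kernel injection swaps the transformer blocks for DeepSpeed-Inference kernels;
# mp_size matches --num_gpus in the launch command.
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,  # the quantized paths (INT8, and INT4 in this PR) pick quantized kernels
    replace_with_kernel_inject=True,
)

device = f"cuda:{int(os.getenv('LOCAL_RANK', '0'))}"
inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to(device)
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=64)[0]))
```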

Here are the performance stats for running on 2 A100 GPUs:

*** Performance stats:
Throughput per token including tokenize: 283.44 msecs
Start to ready to generate: 120.887 secs
Tokenize and generate 438 (bs=1) tokens: 27.788 secs
Start to finish: 148.675 sec

Compared to the 4-GPU INT8 pipeline, batch-1 latency increases from 160.5 ms to 283.4 ms per token while the number of GPUs is halved, which reduces the inference cost by about 13%.
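As a back-of-envelope check of that figure, assume cost scales with GPU-count times per-token latency:

```python
# Cost per generated token, assuming cost ~ (number of GPUs) x (latency per token).
int8_cost = 4 * 160.5   # 4 GPUs x 160.5 ms  = 642 GPU-ms per token
int4_cost = 2 * 283.4   # 2 GPUs x 283.4 ms ~= 567 GPU-ms per token
print(f"cost reduction: {1 - int4_cost / int8_cost:.1%}")  # ~12%, in the ballpark of the quoted figure
```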

We will add more performance results to check the throughput improvement.

More to come

  1. Even though this is currently only supported for BLOOM, we are working on generalizing the feature to other models in order to reduce inference cost and democratize inference of such huge models.
  2. We will add more tests to measure the accuracy metrics of the INT4 inference pipeline.
  3. We will add a faster inference solution using INT4 quantized inference through DeepSpeed-MII.

cc: @jeffra @yaozhewei @cmikeh2

@@ -128,7 +128,7 @@ def forward(
            input = input[0]
        input_type = input.dtype

-        if (self.config.fp16 or self.config.q_int8) \
+        if (self.config.fp16 or self.config.qunatize) \
Suggested change
-        if (self.config.fp16 or self.config.qunatize) \
+        if (self.config.fp16 or self.config.quantize) \

LifeIsStrange commented Mar 31, 2023

Unrelated, but you might be interested in borrowing ideas from SmoothQuant, which seems to enable more performant quantization, especially faster inference.

SmoothQuant migrates part of the quantization difficulties from activation to weights, which smooths out the systematic outliers in activation, making both weights and activations easy to quantize.

While the concept was applied to INT8, I don't see why it couldn't be applied to INT4.
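For reference, the core of SmoothQuant is a per-input-channel rescaling that migrates range from activations into the following layer's weights. A minimal sketch (not DeepSpeed code; `alpha` is the migration-strength hyperparameter from the paper):

```python
# Minimal sketch of SmoothQuant-style scale migration.
import torch

def smoothquant_scales(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_absmax: per-input-channel max |activation| from calibration, shape [in_features]
    # weight:     weight of the following linear layer, shape [out_features, in_features]
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

# Dividing activations and multiplying weights by these scales leaves x @ weight.T
# unchanged, but flattens activation outliers so both tensors quantize more easily:
#   x_smooth = x / scales
#   w_smooth = weight * scales   # broadcast over the in_features dimension
```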

groenenboomj commented

@RezaYazdaniAminabadi Is there an updated version of this PR? I'm having some issues running it out of the box on CUDA using the following related branch of transformers:
https://github.com/RezaYazdaniAminabadi/transformers-bloom-inference/tree/int4-bloom
