
Add Inference support for running the BigScience-BLOOM Architecture #2083

Merged · 27 commits · Jul 18, 2022

Conversation

@RezaYazdaniAminabadi (Contributor) commented Jul 10, 2022

This PR adds a few changes to the Inference-API to run the BLOOM-type models!

For testing with the BLOOM-176B model, we use a combination of techniques: ZeRO-Inference, DS-Inference, and model parallelism. ZeRO-Inference is used to initialize the large model from the HuggingFace model zoo, using AutoModelForCausalLM.from_config from the HF library. Second, we use the DS-Inference engine to load the checkpoint into the model, parallelize it across the GPUs, and inject the kernels. There is also the option to run the model without the kernels, using only model parallelism. Here is the script used for running the test on 16 NVIDIA A100-40GB GPUs with one input query:

# Imports assumed by this snippet; HfDeepSpeedConfig lives in
# transformers.deepspeed in the transformers version used here.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

# args.name (the HF model name) and world_size (number of ranks) are
# assumed to be set earlier in the full script.
tokenizer = AutoTokenizer.from_pretrained(args.name)
config = AutoConfig.from_pretrained(args.name)
model_hidden_size = config.hidden_size
train_batch_size = 1 * world_size
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 0
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# Keep a live reference so transformers picks up the ZeRO-3 config
# when the model is instantiated.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_config(config).eval()
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

# Inject the DS-Inference kernels and load the sharded checkpoint.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.half,
                                 checkpoint='bloom-176B.json',
                                 # injection_policy={BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')}
                                 replace_with_kernel_inject=True)
model = model.module

tokens = tokenizer("DeepSpeed is", return_tensors="pt")
for t in tokens:
    if torch.is_tensor(tokens[t]):
        tokens[t] = tokens[t].to(torch.cuda.current_device())
gen_tokens = model.generate(**tokens,
                            min_length=50,
                            max_length=50,
                            do_sample=False)
print(tokenizer.batch_decode(gen_tokens)[0])
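
The snippet above assumes args.name and world_size are defined earlier in the script. A minimal sketch of one way to provide them when running under the DeepSpeed (or torchrun) launcher; the --name argument and its default are assumptions, not part of this PR:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, default="bigscience/bloom")  # HF model name (assumed default)
args = parser.parse_args()

# The distributed launchers export WORLD_SIZE for each rank.
world_size = int(os.getenv("WORLD_SIZE", "1"))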

The checkpoint config passed to init_inference looks like this:

{
	"type": "BLOOM-176B",
	"checkpoints": list_of_checkpoint_files,
	"version": 1.0
}
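
A minimal sketch of generating such a file; the shard location and filename pattern are assumptions, and in practice the list enumerates the actual sharded BLOOM checkpoint files:

import glob
import json

# Hypothetical location of the sharded BLOOM checkpoint files.
checkpoint_files = sorted(glob.glob("/data/bloom-176b/*.pt"))

with open("bloom-176B.json", "w") as f:
    json.dump({
        "type": "BLOOM-176B",
        "checkpoints": checkpoint_files,
        "version": 1.0
    }, f)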

For running the model with kernel injection, you need to set replace_with_kernel_inject to True; otherwise, you need to set injection_policy to add parallelism to the model, as sketched below. This policy specifies the two linear layers that are partitioned row-wise: self_attention.dense and mlp.dense_4h_to_h.
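
A minimal sketch of the no-kernel path, based on the injection_policy commented out in the script above (imports as before; BloomBlock is the transformer block class from the HF BLOOM implementation):

from transformers.models.bloom.modeling_bloom import BloomBlock

# Tensor parallelism only: partition the two row-parallel linear layers
# instead of replacing modules with DS-Inference kernels.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.half,
                                 checkpoint='bloom-176B.json',
                                 injection_policy={BloomBlock: ('self_attention.dense',
                                                                'mlp.dense_4h_to_h')})
model = model.module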

Regarding accuracy, the text generated in both cases, with and without kernels, is of good quality:

With kernel:
DeepSpeed is a new method for the detection of protein-protein interactions. It is based on the use of a single, high-affinity, small molecule that binds to a protein of interest and a fluorescently labeled, high-affinity,

Without kernel:
DeepSpeed is a new, high-performance, high-throughput, and high-resolution mass spectrometer that is capable of acquiring data at a rate of up to 1.5 million mass spectra per second. The instrument is designed to be used

Regarding performance, the latency per token is about 300 milliseconds on 16 GPUs without kernels and about 120 ms with kernels, so injecting the kernels gives roughly a 2.5x improvement over the PyTorch-native implementation. However, the attention part of the BLOOM architecture does not yet run entirely with DS-Inference kernels, due to some accuracy issues with the model-parallelism support. Once that part is fixed, inference throughput can be improved further.
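
A rough sketch of how a per-token latency figure can be measured around the generate call above; the timing logic is an assumption, not part of this PR, and CUDA synchronization is needed for accurate GPU timing:

import time

torch.cuda.synchronize()
start = time.time()
gen_tokens = model.generate(**tokens, min_length=50, max_length=50, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Average latency per generated token.
num_new_tokens = gen_tokens.shape[1] - tokens["input_ids"].shape[1]
print(f"latency per token: {elapsed / num_new_tokens * 1000:.1f} ms")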

cc: @stas00, @jeffra, @tjruwase, @samyam
