
Add Inference support for running the BigScience-BLOOM Architecture #2083

Merged · 27 commits · Jul 18, 2022

Conversation

@RezaYazdaniAminabadi (Contributor) commented Jul 10, 2022

This PR adds a few changes to the Inference-API to run the BLOOM-type models!

For testing with the BLOOM-176B model, we use a combination of techniques: ZeRO-Inference, DS-Inference, and model parallelism. ZeRO-Inference is used to initialize the large model from the HuggingFace model zoo, using AutoModelForCausalLM.from_config from the HF library. Second, we use the DS-Inference engine to load the checkpoint into the model, parallelize it across the GPUs, and inject the kernels. There is also the option to run the model without the kernels, using only model parallelism. Here is the script used for running the test on 16 NVIDIA A100-40GB GPUs with one input query:

# Imports assumed by this snippet; HfDeepSpeedConfig lives in
# transformers.deepspeed in the transformers version used here.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

# args.name (the HF model name) and world_size (number of ranks) are
# assumed to be set earlier in the full script.
tokenizer = AutoTokenizer.from_pretrained(args.name)
config = AutoConfig.from_pretrained(args.name)
model_hidden_size = config.hidden_size
train_batch_size = 1 * world_size
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 0
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# Keep a live reference so transformers picks up the ZeRO-3 config
# when the model is instantiated.
dschf = HfDeepSpeedConfig(ds_config)
model = AutoModelForCausalLM.from_config(config).eval()
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

# Inject the DS-Inference kernels and load the sharded checkpoint.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.half,
                                 checkpoint='bloom-176B.json',
                                 # injection_policy={BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')}
                                 replace_with_kernel_inject=True)
model = model.module

tokens = tokenizer("DeepSpeed is", return_tensors="pt")
for t in tokens:
    if torch.is_tensor(tokens[t]):
        tokens[t] = tokens[t].to(torch.cuda.current_device())
gen_tokens = model.generate(**tokens,
                            min_length=50,
                            max_length=50,
                            do_sample=False)
print(tokenizer.batch_decode(gen_tokens)[0])
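
The snippet above assumes args.name and world_size are defined earlier in the script. A minimal sketch of one way to provide them when running under the DeepSpeed (or torchrun) launcher; the --name argument and its default are assumptions, not part of this PR:

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, default="bigscience/bloom")  # HF model name (assumed default)
args = parser.parse_args()

# The distributed launchers export WORLD_SIZE for each rank.
world_size = int(os.getenv("WORLD_SIZE", "1"))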

The checkpoint config passed to init_inference looks like this:

{
	"type": "BLOOM-176B",
	"checkpoints": list_of_checkpoint_files,
	"version": 1.0
}
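
A minimal sketch of generating such a file; the shard location and filename pattern are assumptions, and in practice the list enumerates the actual sharded BLOOM checkpoint files:

import glob
import json

# Hypothetical location of the sharded BLOOM checkpoint files.
checkpoint_files = sorted(glob.glob("/data/bloom-176b/*.pt"))

with open("bloom-176B.json", "w") as f:
    json.dump({
        "type": "BLOOM-176B",
        "checkpoints": checkpoint_files,
        "version": 1.0
    }, f)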

For running the model with kernel injection, you need to set replace_with_kernel_inject to True; otherwise, you need to set injection_policy to add parallelism to the model, as sketched below. This policy specifies the two linear layers that are partitioned row-wise: self_attention.dense and mlp.dense_4h_to_h.
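
A minimal sketch of the no-kernel path, based on the injection_policy commented out in the script above (imports as before; BloomBlock is the transformer block class from the HF BLOOM implementation):

from transformers.models.bloom.modeling_bloom import BloomBlock

# Tensor parallelism only: partition the two row-parallel linear layers
# instead of replacing modules with DS-Inference kernels.
model = deepspeed.init_inference(model,
                                 mp_size=1,
                                 dtype=torch.half,
                                 checkpoint='bloom-176B.json',
                                 injection_policy={BloomBlock: ('self_attention.dense',
                                                                'mlp.dense_4h_to_h')})
model = model.module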

Regarding accuracy, the text generated in both cases, with and without kernels, is of good quality:

With kernel:
DeepSpeed is a new method for the detection of protein-protein interactions. It is based on the use of a single, high-affinity, small molecule that binds to a protein of interest and a fluorescently labeled, high-affinity,

Without kernel:
DeepSpeed is a new, high-performance, high-throughput, and high-resolution mass spectrometer that is capable of acquiring data at a rate of up to 1.5 million mass spectra per second. The instrument is designed to be used

Regarding performance, the latency per token is about 300 milliseconds on 16 GPUs without kernels and about 120 ms with kernels, so injecting the kernels gives roughly a 2.5x improvement over the PyTorch-native implementation. However, the attention part of the BLOOM architecture does not yet run entirely with DS-Inference kernels, due to some accuracy issues with the model-parallelism support. Once that part is fixed, inference throughput can be improved further.
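
A rough sketch of how a per-token latency figure can be measured around the generate call above; the timing logic is an assumption, not part of this PR, and CUDA synchronization is needed for accurate GPU timing:

import time

torch.cuda.synchronize()
start = time.time()
gen_tokens = model.generate(**tokens, min_length=50, max_length=50, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Average latency per generated token.
num_new_tokens = gen_tokens.shape[1] - tokens["input_ids"].shape[1]
print(f"latency per token: {elapsed / num_new_tokens * 1000:.1f} ms")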

cc: @stas00, @jeffra, @tjruwase, @samyam
