Add Inference support for running the BigScience-BLOOM Architecture #2083
Merged
Conversation
RezaYazdaniAminabadi requested review from jeffra, samyam, tjruwase, ShadenSmith, conglongli, awan-10, cli99, eltonzheng, minjiaz, duli2012, mrwyattii, yaozhewei, arashb, xiaoxiawu-microsoft, and samadejacobs as code owners on July 10, 2022 00:20
tjruwase reviewed on Jul 10, 2022
tjruwase reviewed on Jul 10, 2022
…pSpeed into ds-inference/bloom-support
…pSpeed into ds-inference/bloom-support
jeffra approved these changes on Jul 18, 2022
This PR adds a few changes to the Inference API to run BLOOM-type models!
For testing with the BLOOM-176B model, we use a combination of techniques: ZeRO-Inference, DS-Inference, and model parallelism. ZeRO-Inference is used to initialize the large model through the HuggingFace model zoo, using `AutoModelForCausalLM.from_config` from the HF library. Second, we use the DS-Inference engine to load the checkpoint into the model, parallelize it across the GPUs, and inject the kernels. There is also the option to run the model without the kernels, using only model parallelism. Here is the script used for running the test on 16 NVIDIA A100-40GB GPUs with one input query:
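A minimal sketch of such a launch script, run with something like `deepspeed --num_gpus 16 bloom-ds-inference.py`; the model name, ZeRO settings, `checkpoints.json` path, and generation arguments below are illustrative assumptions rather than the exact values used in these tests:

```python
# bloom-ds-inference.py -- illustrative sketch of the ZeRO-Inference + DS-Inference flow
import os
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # assumed HF model id for BLOOM-176B
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ZeRO-Inference initialization: build the model skeleton from its config under
# zero.Init so the 176B parameters are partitioned instead of being fully
# materialized on every rank (assumed minimal ZeRO-3 config).
ds_zero_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
with deepspeed.zero.Init(config_dict_or_path=ds_zero_config):
    model = AutoModelForCausalLM.from_config(config)
model = model.eval()

# DS-Inference: load the real checkpoint shards listed in checkpoints.json,
# partition the model across the GPUs, and inject the fused inference kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint="checkpoints.json",  # assumed path; see the config sketch below
    replace_with_kernel_inject=True,
)
model = model.module  # unwrap the injected HF model to call generate()

inputs = tokenizer("DeepSpeed is a new", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```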
The checkpoint config passed to `init_inference` looks like this:
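As a sketch, a checkpoint description of roughly this shape (the shard paths and the `"type"` tag below are assumptions) can be written to `checkpoints.json` and passed through the `checkpoint` argument:

```python
import glob
import json

# Illustrative checkpoints.json pointing DS-Inference at the BLOOM checkpoint
# shards; the shard directory and exact fields may differ from the PR tests.
checkpoint_files = sorted(glob.glob("/data/bloom-176b/*.pt"))  # placeholder path
checkpoints_json = {
    "type": "BLOOM",
    "checkpoints": checkpoint_files,
    "version": 1.0,
}
with open("checkpoints.json", "w") as fp:
    json.dump(checkpoints_json, fp)
```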
For running the model with kernel injection, you need to pass `replace_with_kernel_inject=True`; otherwise, you need to set the `injection_policy` to add parallelism to the model. This policy names the two linear layers that are partitioned row-wise: `self_attention.dense` and `mlp.dense_4h_to_h`.
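For the model-parallel-only path, a sketch of the corresponding `init_inference` call, assuming the same model object as in the script above (using `BloomBlock` as the policy key is also an assumption):

```python
# Tensor parallelism without fused kernels: declare which BLOOM linear layers
# are partitioned row-wise instead of injecting replacement kernels.
from transformers.models.bloom.modeling_bloom import BloomBlock

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint="checkpoints.json",
    replace_with_kernel_inject=False,
    injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")},
)
```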
Regarding accuracy, the text generated in both cases, with and without kernels, is of good quality:
With kernel:
DeepSpeed is a new method for the detection of protein-protein interactions. It is based on the use of a single, high-affinity, small molecule that binds to a protein of interest and a fluorescently labeled, high-affinity,
Without kernel:
DeepSpeed is a new, high-performance, high-throughput, and high-resolution mass spectrometer that is capable of acquiring data at a rate of up to 1.5 million mass spectra per second. The instrument is designed to be used
Regarding performance, the latency per token is about 300 milliseconds on 16 GPUs without kernels and 120 ms with kernels, so by injecting the kernels we see about a 2.5x improvement over the PyTorch-native implementation. However, the attention part of the BLOOM architecture is not yet running entirely with DS-Inference kernels, due to some accuracy issues with the model-parallelism support. Once this part is fixed, the inference throughput can be further improved.
cc: @stas00, @jeffra, @tjruwase, @samyam