Add support for bigscience/bloomz #25

Merged
merged 15 commits on Nov 7, 2022
4 changes: 2 additions & 2 deletions bloom-inference-scripts/bloom-accelerate-inference.py
@@ -120,7 +120,7 @@ def generate():
return zip(inputs, outputs, total_new_tokens)


print_rank0(f"*** Running generate")
print_rank0("*** Running generate")
t_generate_start = time.time()
generated = generate()
t_generate_span = time.time() - t_generate_start
@@ -135,7 +135,7 @@ def generate():
torch.cuda.empty_cache()
gc.collect()

print_rank0(f"*** Running benchmark")
print_rank0("*** Running benchmark")
# warm up
for i in range(1):
    _ = generate()
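These hunks only drop unnecessary f-string prefixes from literal log messages. For context, the rank-0 gated print helper these scripts call is typically of the following shape (an illustrative sketch, not the repository's exact definition):

```python
import torch.distributed as dist

def print_rank0(*args, **kwargs):
    # Print only on the global rank-0 process so that multi-GPU runs
    # do not repeat every message once per GPU.
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(*args, **kwargs)
```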
6 changes: 3 additions & 3 deletions bloom-inference-scripts/bloom-ds-inference.py
@@ -256,10 +256,10 @@ def generate():

# warmup is a must if measuring speed as it's when all the optimizations are performed
# e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
print_rank0(f"*** Running generate warmup")
print_rank0("*** Running generate warmup")
_ = generate()

print_rank0(f"*** Running generate")
print_rank0("*** Running generate")
t_generate_start = time.time()
generated = generate()
t_generate_span = time.time() - t_generate_start
@@ -275,7 +275,7 @@ def generate():

# benchmark it!
if args.benchmark:
print_rank0(f"*** Running benchmark")
print_rank0("*** Running benchmark")

# warm up
for i in range(1):
4 changes: 2 additions & 2 deletions bloom-inference-scripts/bloom-ds-zero-inference.py
@@ -178,7 +178,7 @@ def generate():

# XXX: this is currently doing world_size streams on world_size gpus, so we can feed it different inputs on each! and hence the time can be divided by world_size

print_rank0(f"*** Running generate")
print_rank0("*** Running generate")
t_generate_start = time.time()
pairs = generate()
t_generate_span = time.time() - t_generate_start
@@ -194,7 +194,7 @@ def generate():
gc.collect()
deepspeed.runtime.utils.see_memory_usage("end-of-generate", force=True)

print_rank0(f"*** Running benchmark")
print_rank0("*** Running benchmark")

# warm up
for i in range(1):
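The `XXX` comment in the hunk above points out that the ZeRO script runs world_size independent generation streams, one per GPU, each on different inputs, so the measured time can be divided by world_size. A small sketch of that throughput arithmetic (all values are illustrative, not measurements):

```python
# With one generation stream per GPU, aggregate throughput scales with world_size,
# so the effective per-token time is the single-stream time divided by world_size.
world_size = 8          # number of GPUs, one stream each (assumed for illustration)
new_tokens = 100        # tokens generated per stream
t_generate_span = 4.0   # seconds for one generate() call (example value)

per_stream_tokens_per_sec = new_tokens / t_generate_span
aggregate_tokens_per_sec = per_stream_tokens_per_sec * world_size
effective_sec_per_token = t_generate_span / (new_tokens * world_size)
print(aggregate_tokens_per_sec, effective_sec_per_token)
```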
63 changes: 0 additions & 63 deletions bloom-inference-server/cli.py

This file was deleted.

50 changes: 0 additions & 50 deletions bloom-inference-server/utils/constants.py

This file was deleted.

134 changes: 0 additions & 134 deletions bloom-inference-server/utils/requests.py

This file was deleted.

13 changes: 13 additions & 0 deletions inference_server/Makefile
@@ -0,0 +1,13 @@
gen-proto:
	pip install grpcio-tools==1.50.0

	mkdir -p model_handler/grpc_utils/pb

	python -m grpc_tools.protoc -Imodel_handler/grpc_utils/proto --python_out=model_handler/grpc_utils/pb --grpc_python_out=model_handler/grpc_utils/pb model_handler/grpc_utils/proto/generation.proto

	find model_handler/grpc_utils/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;

	touch model_handler/grpc_utils/__init__.py
	touch model_handler/grpc_utils/pb/__init__.py

	rm -rf model_handler/grpc_utils/pb/*.py-e
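By protoc's default naming, `make gen-proto` produces `generation_pb2.py` and `generation_pb2_grpc.py` under `model_handler/grpc_utils/pb`, and the `sed` step rewrites the stubs' internal imports to be package-relative. A minimal import sketch (the messages and services those modules define depend on generation.proto, which is not shown here):

```python
# The generated modules follow protoc's "<proto name>_pb2" convention.
# Thanks to the sed rewrite and the touched __init__.py files, they can be
# imported as package-relative modules from anywhere in the project.
from model_handler.grpc_utils.pb import generation_pb2, generation_pb2_grpc
```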
30 changes: 13 additions & 17 deletions bloom-inference-server/README.md → inference_server/README.md
@@ -5,25 +5,21 @@ We support HuggingFace accelerate and DeepSpeed Inference for generation.
Install required packages:

```shell
-pip install flask flask_api gunicorn pydantic accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3
+pip install flask flask_api gunicorn pydantic accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3 deepspeed-mii==0.0.2
```
-To install [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII):

+alternatively, you can install DeepSpeed from source:
```shell
-git clone https://github.com/microsoft/DeepSpeed-MII
-cd DeepSpeed-MII
-pip install .
+git clone https://github.com/microsoft/DeepSpeed
+cd DeepSpeed
+CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
```

All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and 4 A100 80GB GPUs for BLOOM 176B (int8). These scripts might not work for other models or a different number of GPUs.

DS inference is deployed using the DeepSpeed MII library, which requires checkpoints resharded for 8-way tensor parallelism.

-Note: sometimes GPU memory is not freed when DS inference deployment is shutdown. You can free this memory by running:
-```python
-import mii
-mii.terminate("ds_inference_grpc_server")
-```
-or alternatively, just doing a `killall python` in terminal.
+Note: Sometimes GPU memory is not freed when DS inference deployment crashes. You can free this memory by running `killall python` in terminal.

To use BLOOM quantized, set dtype to int8. For DeepSpeed-Inference, also change model_name to microsoft/bloom-deepspeed-inference-int8; for HF accelerate, model_name stays unchanged.
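For illustration, an int8 DS-inference invocation would then mirror the fp16/bf16 CLI examples below; the exact flag values here (including `--num_gpus 4` for the 4-GPU int8 setup) are assumptions, not commands taken from the repository:

```shell
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}' --num_gpus 4
```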

@@ -39,12 +35,12 @@ Example: generate_kwargs =

1. using HF accelerate
```shell
-python cli.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}' --num_gpus 8
```

2. using DS inference
```shell
-python cli.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}' --num_gpus 8
```

#### BLOOM server deployment
@@ -55,21 +51,21 @@ python cli.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16

1. using HF accelerate
```shell
-python benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
+python -m inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5 --num_gpus 8
```

2. using DS inference
```shell
-deepspeed --num_gpus 8 benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5 --num_gpus 8
```
alternatively, to load the model faster:
```shell
-deepspeed --num_gpus 8 benchmark.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5 --num_gpus 8
```

3. using DS ZeRO
```shell
-deepspeed --num_gpus 8 benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
+deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5 --num_gpus 8
```

## Support