Support gpt-j-6b model for Habana (#1170)

* Support gpt-j-6b for Habana graph mode
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* add deepspeed script
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* add other graph mode models for habana
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* mv models and habana code for both inference and finetuning use
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* update models path
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* support stream output for SPR and Habana
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* add load_model and predict_stream for customer
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* fix issue
  Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
* use main to test load_model/predict/predict_stream, add StoppingCriteria (#1194)
  Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

---------

Signed-off-by: Lv, Liang1 <liang1.lv@intel.com>
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
intel · Jul 17, 2023 · 9ef6ad8
1 parent a84eabe
commit 9ef6ad8

Showing 6 changed files with 786 additions and 187 deletions.
diff --git a/workflows/chatbot/habana/gaudi_spawn.py b/workflows/chatbot/habana/gaudi_spawn.py
@@ -0,0 +1,110 @@
+# coding=utf-8
+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+A simple launcher script for distributed training on HPUs.
+
+Single node:
+::
+    >>> python gaudi_spawn.py --world_size=NUM_CARDS_YOU_HAVE --use_mpi
+               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
+               arguments of your training script)
+
+Multi node:
+::
+    >>> python gaudi_spawn.py --hostfile=PATH_TO_HOSTFILE --use_deepspeed
+               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
+               arguments of your training script)
+"""
+
+
+import sys
+from argparse import REMAINDER, ArgumentParser
+
+from optimum.habana.distributed import DistributedRunner
+from optimum.utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+def parse_args():
+    """
+    Helper function parsing the command line options.
+    @retval argparse.Namespace holding the parsed options
+    """
+    parser = ArgumentParser(
+        description=(
+            "Habana Gaudi distributed training launch helper utility that will spawn up multiple distributed"
+            " processes."
+        )
+    )
+
+    # Optional arguments for the launch helper
+    parser.add_argument("--world_size", type=int, default=1, help="Number of HPUs to use (1 or 8)")
+    parser.add_argument("--hostfile", type=str, default=None, help="Path to the file where hosts are specified.")
+    parser.add_argument("--use_mpi", action="store_true", help="Use MPI for distributed training")
+    parser.add_argument("--use_deepspeed", action="store_true", help="Use DeepSpeed for distributed training")
+
+    # positional
+    parser.add_argument(
+        "training_script",
+        type=str,
+        help=(
+            "The full path to the single HPU training "
+            "program/script to be launched in parallel, "
+            "followed by all the arguments for the "
+            "training script."
+        ),
+    )
+
+    # rest from the training program
+    parser.add_argument("training_script_args", nargs=REMAINDER)
+
+    return parser.parse_args()
+
+
+def main():
+    args = parse_args()
+
+    if args.use_deepspeed:
+        from transformers.deepspeed import is_deepspeed_available
+
+        if not is_deepspeed_available():
+            raise ImportError(
+                "--use_deepspeed requires deepspeed: `pip install"
+                " git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0`."
+            )
+
+    # Patch sys.argv
+    sys.argv = [args.training_script] + args.training_script_args
+    # Handle the case where arguments contain whitespaces
+    argv = ['"{}"'.format(arg) if " " in arg and arg[0] != '"' and arg[-1] != '"' else arg for arg in sys.argv]
+    command_list = [" ".join(argv)]
+
+    distributed_runner = DistributedRunner(
+        command_list=command_list,
+        world_size=args.world_size,
+        hostfile=args.hostfile,
+        use_mpi=args.use_mpi,
+        use_deepspeed=args.use_deepspeed,
+    )
+
+    ret_code = distributed_runner.run()
+    sys.exit(ret_code)
+
+
+if __name__ == "__main__":
+    main()
+
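The argument handling in `main()` above can be tried without Habana hardware. The following is a minimal sketch, assuming only the standard library: it keeps the same `REMAINDER` positional and the same whitespace-quoting expression as `gaudi_spawn.py`, but the helper name `build_command` is hypothetical and the `DistributedRunner` call is left out, so nothing is actually launched.

```python
from argparse import REMAINDER, ArgumentParser


def build_command(cli_args):
    """Hypothetical helper: parse launcher options and rebuild the script command."""
    parser = ArgumentParser()
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--use_deepspeed", action="store_true")
    parser.add_argument("training_script", type=str)
    # REMAINDER collects everything after the script path, flags included,
    # so the launcher does not try to interpret the script's own options.
    parser.add_argument("training_script_args", nargs=REMAINDER)
    args = parser.parse_args(cli_args)

    argv = [args.training_script] + args.training_script_args
    # Same expression as in gaudi_spawn.py: wrap arguments that contain a space
    # in double quotes unless they already start or end with a quote.
    quoted = ['"{}"'.format(a) if " " in a and a[0] != '"' and a[-1] != '"' else a for a in argv]
    return args, " ".join(quoted)


args, command = build_command(
    ["--use_deepspeed", "--world_size", "8", "generate.py",
     "--habana", "--instructions", "The tree is rotten."]
)
print(command)
```

The quoting matters because the command is passed on as a single string: without it, a multi-word `--instructions` value would be split into separate arguments when the spawned processes re-parse the command line.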
diff --git a/workflows/chatbot/inference/README.md b/workflows/chatbot/inference/README.md
@@ -186,15 +186,31 @@ You can use the [generate.py](./generate.py) script for performing direct infere
 python generate.py --base_model_path "./mpt-7b-chat" \
              --habana \
              --tokenizer_name "EleutherAI/gpt-neox-20b" \
+             --use_hpu_graphs \
+             --use_kv_cache \
              --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
 ```
 
-And you can use `deepspeed` to speedup the inference.
+You can also use `deepspeed` to speed up inference. Currently, tensor parallelism (TP) is not supported for mpt.
 
 ```bash
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 generate.py \
+python ../habana/gaudi_spawn.py --use_deepspeed --world_size 8 generate.py \
         --base_model_path "./mpt-7b-chat" \
         --habana \
         --tokenizer_name "EleutherAI/gpt-neox-20b" \
+        --use_hpu_graphs \
+        --use_kv_cache \
         --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
 ```
+
+Habana supports HPU graph mode to speed up inference. It is available for bloom, gpt2, opt, gptj, and gpt_neox; the mpt and llama models do not support it yet. Pass `--use_hpu_graphs` to enable it.
+
+```bash
+python generate.py --base_model_path "EleutherAI/gpt-j-6b" \
+             --habana \
+             --use_kv_cache \
+             --use_hpu_graphs \
+             --tokenizer_name "EleutherAI/gpt-j-6b" \
+             --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
+```
+
diff --git a/workflows/chatbot/inference/__init__.py b/workflows/chatbot/inference/__init__.py
diff --git a/workflows/chatbot/inference/docker/Dockerfile b/workflows/chatbot/inference/docker/Dockerfile
@@ -93,7 +93,7 @@ RUN git clone https://github.com/huggingface/optimum-habana.git && \
     apt-get install git-lfs && \
     git-lfs install
 
-RUN pip install optimum[habana] && \
+RUN pip install git+https://github.com/huggingface/optimum-habana.git && \
     pip install peft && \
     pip install einops && \
     pip install datasets && \

diff --git a/workflows/chatbot/inference/docker/README.md b/workflows/chatbot/inference/docker/README.md
@@ -53,5 +53,7 @@ python generate.py \
         --base_model_path "./mpt-7b-chat" \
         --tokenizer_name "EleutherAI/gpt-neox-20b" \
         --habana \
+        --use_hpu_graphs \
+        --use_kv_cache \
         --instructions "Transform the following sentence into one that shows contrast. The tree is rotten."
 ```