[LLM Runtime] add neural speed example (#1232)
* add neural speed example
* Update runtime_example.py
* add requirement and readme
* add more about autoround
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* modify readme

Signed-off-by: intellinjun <jun.lin@intel.com>
Signed-off-by: intellinjun <105184542+intellinjun@users.noreply.github.com>
Signed-off-by: Wenxin Zhang <wenxin.zhang@intel.com>
Co-authored-by: Wenxin Zhang <wenxin.zhang@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 834e883 · commit 6a97d15
Showing 4 changed files with 131 additions and 2 deletions.
@@ -0,0 +1,72 @@
# Step-by-Step

To get better performance out of popular large language models (LLMs), we recommend using [Neural Speed](https://github.com/intel/neural-speed.git), an innovative library designed to provide the most efficient inference of LLMs. Here, we provide the scripts `runtime_example.py` for inference and `runtime_acc.py` for accuracy evaluation.

# Prerequisite

We recommend installing [Neural Speed](https://github.com/intel/neural-speed.git) from source code to fully leverage the latest features.

> Note: Building neural-speed from source requires GCC higher than 10. If you can't upgrade the system GCC, here is a solution using conda:
> ```bash
> compiler_version=13.1
> conda install --update-deps -c conda-forge gxx==${compiler_version} gcc==${compiler_version} gxx_linux-64==${compiler_version} libstdcxx-ng sysroot_linux-64 -y
> ```

Among other third-party dependencies, PyTorch and Intel Extension for PyTorch (>= 2.1) are required; please make sure the installed versions of the two match each other.

To run accuracy evaluation, Python >= 3.9 and <= 3.11 is required due to a [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation.
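
> Note: A quick sanity check of the interpreter and the PyTorch stack can catch version problems early. This is an optional sketch, not part of the example scripts:
> ```python
> # Optional environment sanity check based on the constraints above.
> import sys
> import torch
> import intel_extension_for_pytorch as ipex
>
> assert (3, 9) <= sys.version_info[:2] <= (3, 11), "accuracy evaluation needs Python >=3.9, <=3.11"
>
> # torch and intel-extension-for-pytorch should share the same major.minor version,
> # e.g. torch 2.1.0+cpu pairs with intel_extension_for_pytorch 2.1.0+cpu.
> torch_mm = torch.__version__.split("+")[0].rsplit(".", 1)[0]
> ipex_mm = ipex.__version__.split("+")[0].rsplit(".", 1)[0]
> assert torch_mm == ipex_mm, f"torch {torch.__version__} and ipex {ipex.__version__} are mismatched"
> ```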

Other third-party dependencies are listed in `requirements.txt`; please follow the steps below to install them:

```bash
# build neural-speed from source code
git clone https://github.com/intel/neural-speed.git
cd neural-speed
pip install -r requirements.txt
python setup.py install
# come back to current working directory
cd ..
pip install -r requirements.txt
```
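
> Note: To verify the installation, both packages should be importable from Python; a minimal check (the module name `neural_speed` is assumed to be what the neural-speed package installs):
> ```python
> # Optional post-install import check.
> import neural_speed
> import intel_extension_for_transformers
> print("neural_speed installed at:", neural_speed.__file__)
> ```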

# Run

> Note: Please prepare the LLMs and save them locally before running inference; one way is sketched below.
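>
> The model id mirrors the script's default (a gated model on Hugging Face) and `./Llama2` is the local path used by the commands below; both are illustrative:
> ```python
> # Download a model once and save it locally; model id and paths are illustrative.
> from transformers import AutoModelForCausalLM, AutoTokenizer
>
> model_id = "meta-llama/Llama-2-7b-hf"  # gated; requires Hugging Face access approval
> local_dir = "./Llama2"                 # the local path passed to the scripts below
>
> AutoTokenizer.from_pretrained(model_id).save_pretrained(local_dir)
> AutoModelForCausalLM.from_pretrained(model_id).save_pretrained(local_dir)
> ```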

## 1. Performance

```bash
# int4 with group-size=128
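# e.g. on a 56-core socket bound to NUMA node 0 (illustrative values):
#   OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python runtime_example.py ...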
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python runtime_example.py \
    --model_path ./Llama2 \
    --prompt "Once upon a time, there existed a little girl," \
    --max_new_tokens 32 \
    --group_size 128
```

## 2. Accuracy

```bash
# evaluate the local model on lambada_openai
python runtime_acc.py \
    --model_name ./Llama2 \
    --tasks "lambada_openai"
```

> Note: If you are evaluating models generated by [autoround](../pytorch/text-generation/quantization/), you need to disable the two arguments `model_format` and `use_gptq`, as shown below.
> ```python
> # parser.add_argument('--model_format', type=str, default="runtime")
> # parser.add_argument('--use_gptq', action='store_true')
> results = evaluate(
>     model="hf-causal",
>     model_args=f'pretrained="{args.model_name}", dtype=float32',
>     tasks=[f"{args.tasks}"]
> )
> ```
@@ -0,0 +1,10 @@

intel_extension_for_transformers
neural-speed
git+https://github.com/EleutherAI/lm-evaluation-harness.git@cc9778fbe4fa1a709be2abed9deb6180fd40e7e2
sentencepiece
gguf
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.1.0+cpu
transformers
intel_extension_for_pytorch==2.1.0+cpu
@@ -0,0 +1,48 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path
from typing import List, Optional

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig


def main(args_in: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Run LLM inference with Neural Speed")
    parser.add_argument("--model_path", type=Path, default="meta-llama/Llama-2-7b-hf",
                        help="model path, local or from Hugging Face")
    parser.add_argument("--prompt", type=str, default="Once upon a time, there existed a little girl,",
                        help="prompt to start generation with")
    parser.add_argument("--weight_dtype", type=str, default="int4",
                        help="output weight type, default: int4; int4, int8, nf4 and others are supported")
    parser.add_argument("--compute_dtype", type=str, default="int8", help="compute type")
    parser.add_argument("--group_size", type=int, default=128, help="group size")
    parser.add_argument("--n_ctx", type=int, default=512, help="context size")
    parser.add_argument("--max_new_tokens", type=int, default=300, help="maximum number of new tokens")
    args = parser.parse_args(args_in)

    model_name = args.model_path
    # weight-only quantization recipe: 4-bit weights with int8 compute by default
    woq_config = WeightOnlyQuantConfig(load_in_4bit=True,
                                       weight_dtype=args.weight_dtype,
                                       compute_dtype=args.compute_dtype,
                                       group_size=args.group_size)
    prompt = args.prompt
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # stream generated tokens to stdout as they are produced
    streamer = TextStreamer(tokenizer)
    inputs = tokenizer(prompt, return_tensors="pt").input_ids

    # quantize on load and run inference through the Neural Speed runtime
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

    outputs = model.generate(inputs, streamer=streamer, ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)


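# Note: main() also accepts an explicit argument list via args_in, e.g.
# main(["--model_path", "./Llama2", "--max_new_tokens", "32"]) with an illustrative local path.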
if __name__ == "__main__":
    main()