From 5c4e55d8942f351a994457f7bb17cfe706e2e6a9 Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:04:50 +0800
Subject: [PATCH 1/3] Refine LLM runtime readme

Signed-off-by: hshen14
---
 .../llm/runtime/graph/README.md | 31 ++++++++++++-------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index ce0a5a0def2..fe861ce4dff 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -36,17 +36,26 @@ We support the following models:
 
 ## How to use
 
-### 1. Build LLM Runtime
-Linux
+### 1. Install LLM Runtime
+Install from binary
 ```shell
+pip install intel-extension-for-transformers
+```
+
+Build from Source
+```shell
+# Linux
 git submodule update --init --recursive
 mkdir build
 cd build
 cmake .. -G Ninja
 ninja
 ```
-Windows: install VisualStudio 2022(a validated veresion), search 'Developer PowerShell for VS 2022' and open it, then run the following cmds.
+
 ```powershell
+# Windows
+# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
+
 mkdir build
 cd build
 cmake ..
@@ -58,7 +67,7 @@ cmake --build . -j
 You can use the python api to simplely run HF model.
 ```python
 from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
-model_name = "EleutherAI/gpt-j-6b" # support model id of HF or local PATH to model
+model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
 woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
 model = AutoModel.from_pretrained(model_name, quantization_config=woq_config)
 prompt = "Once upon a time, a little girl"
 output = model.generate(prompt, max_new_tokens=30)
 ```
@@ -71,8 +80,8 @@ You can use the following script to run, including convertion, quantization and
 ```
 python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
 ```
 
-LLM one-click running script args explanations:
-| arg | explanation |
+Argument description of run.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | model | directory containing model file or model id |
 | --weight_dtype | data type of quantized weight (default: int4) |
@@ -92,7 +101,7 @@ LLM one-click running script args explanations:
 | --keep | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
 
 
-## Advanced use
+## Advanced Usage
 ### 1. Convert and Quantize LLM model
 LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
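The Python snippet added above exposes the same two dtype knobs that run.py takes on the command line. A minimal sketch of the equivalent Python API call, assuming a model from the supported-models table; the model id, prompt, and token budget here are illustrative, not part of the patch:

```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

# int4 weights with int8 compute, matching `run.py --weight_dtype int4` and the
# documented int8 compute default. The model id is illustrative: any entry from
# the supported-models table, or a local path, stands in here.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-chat-hf", quantization_config=woq_config)

# Prompt reused from the run.py example above.
output = model.generate("She opened the door and see", max_new_tokens=32)
```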
@@ -117,8 +126,8 @@ python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_fil
 python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8
 ```
 
-quantization args explanations:
-| arg | explanation |
+Argument description of quantize.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------- |
 | --model_file | path to the fp32 model |
 | --out_file | path to the quantized model |
@@ -148,8 +157,8 @@ OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name
 OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 56 --color -p "She opened the door and see" --repeat_penalty 1.2
 ```
 
-LLM running script args explanations:
-| arg | explanation |
+Argument description of inference.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | --model_name | model name |
 | -m / --model | path to the executed model |

From 03fab02508e1d6cd8e1546149aa635760006a679 Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:07:13 +0800
Subject: [PATCH 2/3] Fix typo

Signed-off-by: hshen14
---
 intel_extension_for_transformers/llm/runtime/graph/README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index fe861ce4dff..4586561d9d0 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -42,7 +42,7 @@ Install from binary
 pip install intel-extension-for-transformers
 ```
 
-Build from Source
+Build from source
 ```shell
 # Linux
 git submodule update --init --recursive
@@ -55,7 +55,6 @@ ninja
 ```
 
 ```powershell
 # Windows
 # Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
-
 mkdir build
 cd build
 cmake ..
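Patch 1 documents quantize.py and inference.py as separate steps. As a usage sketch, the two documented commands can also be chained from Python with only the standard library; every flag and file name below is copied from the examples above, so treat ne-f32.bin and ne-q4_j.bin as placeholders for your own paths:

```python
import os
import subprocess

# Step 1: quantize the fp32 GGML-format model to int4, exactly as documented.
subprocess.run(
    [
        "python", "scripts/quantize.py",
        "--model_name", "llama2",
        "--model_file", "ne-f32.bin",
        "--out_file", "ne-q4_j.bin",
        "--weight_dtype", "int4",
        "--group_size", "32",
        "--compute_dtype", "int8",
    ],
    check=True,
)

# Step 2: run inference on the quantized model, mirroring the documented
# OMP_NUM_THREADS setting (numactl core binding stays a shell-level concern).
env = dict(os.environ, OMP_NUM_THREADS="56")
subprocess.run(
    [
        "python", "scripts/inference.py",
        "--model_name", "llama",
        "-m", "ne-q4_j.bin",
        "-c", "512", "-b", "1024", "-n", "256", "-t", "56",
        "-p", "She opened the door and see",
    ],
    check=True,
    env=env,
)
```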
From d90da0dc0844fb0858d32e837b08639b60c89aeb Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:22:49 +0800
Subject: [PATCH 3/3] Update the readme

Signed-off-by: hshen14
---
 .../llm/runtime/graph/README.md | 25 ++++++++++---------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index 4586561d9d0..d086a904168 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -12,8 +12,8 @@ LLM Runtime is designed to provide the efficient inference of large language mod
 
 ## Supported Models
 
-We support the following models:
-### Text generation models
+LLM Runtime supports the following models:
+### Text Generation
 | model name | INT8 | INT4|
 |---|:---:|:---:|
 |[LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)| ✅ | ✅ |
@@ -27,14 +27,14 @@
 |[OPT-125m](https://huggingface.co/facebook/opt-125m), [OPT-350m](https://huggingface.co/facebook/opt-350m), [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b), [OPT-13B](https://huggingface.co/facebook/opt-13b)| ✅ | ✅ |
 |[ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b), [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)| ✅ | ✅ |
 
-### Code generation models
+### Code Generation
 | model name | INT8 | INT4|
 |---|:---:|:---:|
 |[Code-LLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf), [Code-LLaMA-13B](https://huggingface.co/codellama/CodeLlama-13b-hf)| ✅ | ✅ |
 |[StarCoder-1B](https://huggingface.co/bigcode/starcoderbase-1b), [StarCoder-3B](https://huggingface.co/bigcode/starcoderbase-3b), [StarCoder-15.5B](https://huggingface.co/bigcode/starcoder)| ✅ | ✅ |
 
-## How to use
+## How to Use
 
 ### 1. Install LLM Runtime
 Install from binary
@@ -63,7 +63,7 @@ cmake --build . -j
 
 ### 2. Run LLM with Python API
 
-You can use the python api to simplely run HF model.
+You can simply run a Hugging Face model using the Python API. Here is the sample code:
 ```python
 from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
 model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
@@ -73,14 +73,14 @@ prompt = "Once upon a time, a little girl"
 output = model.generate(prompt, max_new_tokens=30)
 ```
 
-### 3. Run LLM with Script
-You can use the following script to run, including convertion, quantization and inference.
+### 3. Run LLM with Python Script
+You can run an LLM with the one-click Python script, which covers conversion, quantization, and inference.
 ```
 python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
 ```
 
 Argument description of run.py:
-| Argument | Description |
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | model | directory containing model file or model id |
 | --weight_dtype | data type of quantized weight (default: int4) |
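The Code Generation table renamed above lists models served by the same Python API as the text-generation sample. A sketch with one StarCoder entry from that table; the prompt and token budget are illustrative:

```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

# Same load-and-quantize flow as the text-generation sample, pointed at a code
# model from the supported table; prompt and generation length are illustrative.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained("bigcode/starcoderbase-1b", quantization_config=woq_config)
output = model.generate("def fibonacci(n):", max_new_tokens=64)
```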
@@ -101,9 +101,10 @@
 
 
 ## Advanced Usage
+Besides the one-click script, LLM Runtime also provides separate scripts for each step: 1) convert and quantize, and 2) inference.
 
-### 1. Convert and Quantize LLM model
-LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
+### 1. Convert and Quantize LLM
+LLM Runtime uses a model format compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the steps below:
 
 ```bash
@@ -140,9 +141,9 @@
 | --use_ggml | enable ggml for quantization and inference |
 
 
-### 2. Inference model with C++ script API
+### 2. Inference LLM
 
-We supply LLM running script to run supported models with c++ api conveniently.
+We provide an LLM inference script to run the quantized model. Please reach out to [us](mailto:itrex.maintainers@intel.com) if you want to use the C++ API directly.
 ```bash
 # recommed to use numactl to bind cores in Intel cpus for better performance
 # if you use different core numbers, please also change -t arg value
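The inference example above pins threads with OMP_NUM_THREADS and numactl. When using the Python API instead of inference.py, the same OpenMP hint can be set from Python; a sketch, assuming the variable must be set before the runtime creates its thread pool (numactl core binding remains a shell-level concern):

```python
import os

# Must be set before the runtime spins up its thread pool; 56 matches the
# documented example -- tune it to your machine's core count.
os.environ["OMP_NUM_THREADS"] = "56"

from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model = AutoModel.from_pretrained(
    "Intel/neural-chat-7b-v1-1",  # model id from the Python API sample above
    quantization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4"),
)
output = model.generate("She opened the door and see", max_new_tokens=32)
```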