From 5c4e55d8942f351a994457f7bb17cfe706e2e6a9 Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:04:50 +0800
Subject: [PATCH 1/3] Refine LLM runtime readme

Signed-off-by: hshen14
---
 .../llm/runtime/graph/README.md | 31 ++++++++++++-------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index ce0a5a0def2..fe861ce4dff 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -36,17 +36,26 @@ We support the following models:
 
 ## How to use
 
-### 1. Build LLM Runtime
-Linux
+### 1. Install LLM Runtime
+Install from binary
 ```shell
+pip install intel-extension-for-transformers
+```
+
+Build from Source
+```shell
+# Linux
 git submodule update --init --recursive
 mkdir build
 cd build
 cmake .. -G Ninja
 ninja
 ```
-Windows: install VisualStudio 2022(a validated veresion), search 'Developer PowerShell for VS 2022' and open it, then run the following cmds.
+
 ```powershell
+# Windows
+# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
+
 mkdir build
 cd build
 cmake ..
@@ -58,7 +67,7 @@ cmake --build . -j
 You can use the python api to simplely run HF model.
 ```python
 from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
-model_name = "EleutherAI/gpt-j-6b" # support model id of HF or local PATH to model
+model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
 woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
 model = AutoModel.from_pretrained(model_name, quantization_config=woq_config)
 prompt = "Once upon a time, a little girl"
 output = model.generate(prompt, max_new_tokens=30)
 ```
@@ -71,8 +80,8 @@ You can use the following script to run, including convertion, quantization and
 ```
 python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
 ```
 
-LLM one-click running script args explanations:
-| arg | explanation |
+Argument description of run.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | model | directory containing model file or model id |
 | --weight_dtype | data type of quantized weight (default: int4) |
@@ -92,7 +101,7 @@ LLM one-click running script args explanations:
 | --keep | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
 
 
-## Advanced use
+## Advanced Usage
 ### 1. Convert and Quantize LLM model
 LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
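The Python snippet added above exposes the same two dtype knobs that run.py takes on the command line. A minimal sketch of the equivalent Python API call, assuming a model from the supported-models table; the model id, prompt, and token budget here are illustrative, not part of the patch:

```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

# int4 weights with int8 compute, matching `run.py --weight_dtype int4` and the
# documented int8 compute default. The model id is illustrative: any entry from
# the supported-models table, or a local path, stands in here.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-chat-hf", quantization_config=woq_config)

# Prompt reused from the run.py example above.
output = model.generate("She opened the door and see", max_new_tokens=32)
```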
@@ -117,8 +126,8 @@ python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_fil
 python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8
 ```
 
-quantization args explanations:
-| arg | explanation |
+Argument description of quantize.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------- |
 | --model_file | path to the fp32 model |
 | --out_file | path to the quantized model |
@@ -148,8 +157,8 @@ OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name
 OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 56 --color -p "She opened the door and see" --repeat_penalty 1.2
 ```
 
-LLM running script args explanations:
-| arg | explanation |
+Argument description of inference.py:
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | --model_name | model name |
 | -m / --model | path to the executed model |

From 03fab02508e1d6cd8e1546149aa635760006a679 Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:07:13 +0800
Subject: [PATCH 2/3] Fix typo

Signed-off-by: hshen14
---
 intel_extension_for_transformers/llm/runtime/graph/README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index fe861ce4dff..4586561d9d0 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -42,7 +42,7 @@ Install from binary
 pip install intel-extension-for-transformers
 ```
 
-Build from Source
+Build from source
 ```shell
 # Linux
 git submodule update --init --recursive
@@ -55,7 +55,6 @@ ninja
 ```
 
 ```powershell
 # Windows
 # Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
-
 mkdir build
 cd build
 cmake ..
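Patch 1 documents quantize.py and inference.py as separate steps. As a usage sketch, the two documented commands can also be chained from Python with only the standard library; every flag and file name below is copied from the examples above, so treat ne-f32.bin and ne-q4_j.bin as placeholders for your own paths:

```python
import os
import subprocess

# Step 1: quantize the fp32 GGML-format model to int4, exactly as documented.
subprocess.run(
    [
        "python", "scripts/quantize.py",
        "--model_name", "llama2",
        "--model_file", "ne-f32.bin",
        "--out_file", "ne-q4_j.bin",
        "--weight_dtype", "int4",
        "--group_size", "32",
        "--compute_dtype", "int8",
    ],
    check=True,
)

# Step 2: run inference on the quantized model, mirroring the documented
# OMP_NUM_THREADS setting (numactl core binding stays a shell-level concern).
env = dict(os.environ, OMP_NUM_THREADS="56")
subprocess.run(
    [
        "python", "scripts/inference.py",
        "--model_name", "llama",
        "-m", "ne-q4_j.bin",
        "-c", "512", "-b", "1024", "-n", "256", "-t", "56",
        "-p", "She opened the door and see",
    ],
    check=True,
    env=env,
)
```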
From d90da0dc0844fb0858d32e837b08639b60c89aeb Mon Sep 17 00:00:00 2001
From: hshen14
Date: Tue, 26 Sep 2023 20:22:49 +0800
Subject: [PATCH 3/3] Update the readme

Signed-off-by: hshen14
---
 .../llm/runtime/graph/README.md | 25 ++++++++++---------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/intel_extension_for_transformers/llm/runtime/graph/README.md b/intel_extension_for_transformers/llm/runtime/graph/README.md
index 4586561d9d0..d086a904168 100644
--- a/intel_extension_for_transformers/llm/runtime/graph/README.md
+++ b/intel_extension_for_transformers/llm/runtime/graph/README.md
@@ -12,8 +12,8 @@ LLM Runtime is designed to provide the efficient inference of large language mod
 
 ## Supported Models
 
-We support the following models:
-### Text generation models
+LLM Runtime supports the following models:
+### Text Generation
 | model name | INT8 | INT4|
 |---|:---:|:---:|
 |[LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)| ✅ | ✅ |
@@ -27,14 +27,14 @@
 |[OPT-125m](https://huggingface.co/facebook/opt-125m), [OPT-350m](https://huggingface.co/facebook/opt-350m), [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b), [OPT-13B](https://huggingface.co/facebook/opt-13b)| ✅ | ✅ |
 |[ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b), [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)| ✅ | ✅ |
 
-### Code generation models
+### Code Generation
 | model name | INT8 | INT4|
 |---|:---:|:---:|
 |[Code-LLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf), [Code-LLaMA-13B](https://huggingface.co/codellama/CodeLlama-13b-hf)| ✅ | ✅ |
 |[StarCoder-1B](https://huggingface.co/bigcode/starcoderbase-1b), [StarCoder-3B](https://huggingface.co/bigcode/starcoderbase-3b), [StarCoder-15.5B](https://huggingface.co/bigcode/starcoder)| ✅ | ✅ |
 
-## How to use
+## How to Use
 
 ### 1. Install LLM Runtime
 Install from binary
@@ -63,7 +63,7 @@ cmake --build . -j
 
 ### 2. Run LLM with Python API
 
-You can use the python api to simplely run HF model.
+You can simply run a Hugging Face model using the Python API. Here is the sample code:
 ```python
 from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
 model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
@@ -73,14 +73,14 @@ prompt = "Once upon a time, a little girl"
 output = model.generate(prompt, max_new_tokens=30)
 ```
 
-### 3. Run LLM with Script
-You can use the following script to run, including convertion, quantization and inference.
+### 3. Run LLM with Python Script
+You can run an LLM with the one-click Python script, which covers conversion, quantization, and inference.
 ```
 python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
 ```
 
 Argument description of run.py:
-| Argument | Description |
+| Argument | Description |
 | -------------- | ----------------------------------------------------------------------- |
 | model | directory containing model file or model id |
 | --weight_dtype | data type of quantized weight (default: int4) |
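The Code Generation table renamed above lists models served by the same Python API as the text-generation sample. A sketch with one StarCoder entry from that table; the prompt and token budget are illustrative:

```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

# Same load-and-quantize flow as the text-generation sample, pointed at a code
# model from the supported table; prompt and generation length are illustrative.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained("bigcode/starcoderbase-1b", quantization_config=woq_config)
output = model.generate("def fibonacci(n):", max_new_tokens=64)
```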
@@ -101,9 +101,10 @@
 
 
 ## Advanced Usage
+Besides the one-click script, LLM Runtime also provides separate scripts for each step: 1) convert and quantize, and 2) inference.
 
-### 1. Convert and Quantize LLM model
-LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
+### 1. Convert and Quantize LLM
+LLM Runtime uses a model format compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the steps below:
 
 ```bash
@@ -140,9 +141,9 @@
 | --use_ggml | enable ggml for quantization and inference |
 
 
-### 2. Inference model with C++ script API
+### 2. Inference LLM
 
-We supply LLM running script to run supported models with c++ api conveniently.
+We provide an LLM inference script to run the quantized model. Please reach out to [us](mailto:itrex.maintainers@intel.com) if you want to use the C++ API directly.
 ```bash
 # recommed to use numactl to bind cores in Intel cpus for better performance
 # if you use different core numbers, please also change -t arg value
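The inference example above pins threads with OMP_NUM_THREADS and numactl. When using the Python API instead of inference.py, the same OpenMP hint can be set from Python; a sketch, assuming the variable must be set before the runtime creates its thread pool (numactl core binding remains a shell-level concern):

```python
import os

# Must be set before the runtime spins up its thread pool; 56 matches the
# documented example -- tune it to your machine's core count.
os.environ["OMP_NUM_THREADS"] = "56"

from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

model = AutoModel.from_pretrained(
    "Intel/neural-chat-7b-v1-1",  # model id from the Python API sample above
    quantization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4"),
)
output = model.generate("She opened the door and see", max_new_tokens=32)
```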