# Yuan 2.0 inference service deployment based on vLLM

## 1. Configure vLLM environment
Environment requirements: torch 2.1.2, CUDA 12.1

Configuring the vLLM environment takes two steps: pull the Yuan-2.0 project, then install the vLLM runtime environment.

Note: The pip release of vLLM does not currently support Yuan 2.0, so vLLM must be compiled and installed from the source in the Yuan-2.0 project.

### Step 1. Pull Yuan-2.0 project

```bash
# Pull project
git clone https://github.com/IEIT-Yuan/Yuan-2.0.git
```

### Step 2. Install vLLM runtime environment

```bash
# Enter vLLM project
cd Yuan-2.0/3rdparty/vllm

# Install dependencies
pip install -r requirements.txt

# Pin setuptools
# vLLM requires a specific setuptools version; see https://github.com/vllm-project/vllm/issues/4961
vim pyproject.toml  # Change the setuptools requirement to setuptools==69.5.1
pip install setuptools==69.5.1

# Install vllm
pip install -e .
```
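After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper names (`installed_version`, `matches`) are illustrative, and the expected values are simply the torch and setuptools versions stated above.

```python
from importlib import metadata

# Versions taken from the requirements stated in this section
EXPECTED = {"torch": "2.1.2", "setuptools": "69.5.1"}


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(version, expected_prefix):
    """True when a version is installed and starts with the expected prefix.

    A prefix match is used so that local build tags such as '2.1.2+cu121' pass.
    """
    return version is not None and version.startswith(expected_prefix)


for name in ("torch", "setuptools", "vllm"):
    version = installed_version(name)
    expected = EXPECTED.get(name)
    if version is None:
        status = "not installed"
    elif expected is None or matches(version, expected):
        status = "ok"
    else:
        status = f"expected {expected}"
    print(f"{name}: {version} ({status})")
```

No version is checked for vllm itself because it is built from the project source.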

## 2. Yuan2.0-2B model inference and deployment based on vLLM

The following example shows how to use the vLLM inference framework to run inference with the Yuan2.0-2B model and deploy it as a service.

### Step 1. Model download

Use the `snapshot_download` function from modelscope to download the model. The first argument is the model name, and the `cache_dir` argument is the download path.

On the AutoDL platform, you can first initialize file storage in the region where your machine is located. The file storage path is `/root/autodl-fs`.

Files in this storage persist after the machine is shut down, so the model does not have to be downloaded again.

![autodl-fs](images/autodl-fs.png)

Then run the following code to download the model. The model is 4.5 GB and takes about 5 minutes to download.

```python
from modelscope import snapshot_download

model_dir = snapshot_download('YuanLLM/Yuan2-2B-Mars-hf', cache_dir='/root/autodl-fs')
```
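Since the file storage persists across restarts, a notebook can check for an existing copy before calling the downloader. The sketch below assumes the model lands in `<cache_dir>/<model name>`, which matches the path used later in this guide. The helpers `cached_model_dir` and `get_model` are hypothetical names, not part of modelscope.

```python
import os


def cached_model_dir(cache_dir, model_id):
    """Return the local model directory if it exists and is non-empty, else None."""
    path = os.path.join(cache_dir, *model_id.split("/"))
    if os.path.isdir(path) and os.listdir(path):
        return path
    return None


def get_model(cache_dir="/root/autodl-fs", model_id="YuanLLM/Yuan2-2B-Mars-hf"):
    """Reuse a cached copy when present; otherwise download it with modelscope."""
    local = cached_model_dir(cache_dir, model_id)
    if local is not None:
        return local
    # Imported lazily so the cache check works even without modelscope installed
    from modelscope import snapshot_download
    return snapshot_download(model_id, cache_dir=cache_dir)
```

A non-empty directory does not prove that the download completed, so treat this as a convenience check rather than an integrity check.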

### Step 2. Run Yuan2.0-2B inference with vLLM

To run Yuan2.0-2B inference with vLLM, first load the model and then generate text.

#### 1. Load the model

```python
from vllm import LLM, SamplingParams
import time

# Sampling configuration
sampling_params = SamplingParams(max_tokens=300, temperature=1, top_p=0, top_k=1, min_p=0.0, length_penalty=1.0, repetition_penalty=1.0, stop="<eod>")

# Load the model
llm = LLM(model="/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf", trust_remote_code=True)
```

```text
INFO 06-22 17:21:04 llm_engine.py:73] Initializing an LLM engine with config: model='/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', tokenizer='/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

INFO 06-22 17:21:11 llm_engine.py:231] # GPU blocks: 1627, # CPU blocks: 780
INFO 06-22 17:21:13 model_runner.py:412] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-22 17:21:13 model_runner.py:416] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 06-22 17:21:20 model_runner.py:458] Graph capturing finished in 7 secs.
```
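The `# GPU blocks` and `# CPU blocks` figures in the log can be turned into a rough token capacity for the KV cache. The sketch below assumes 16 tokens per block, which is the usual vLLM default; the vLLM build shipped with Yuan-2.0 may use a different value, so read the result as an estimate only.

```python
BLOCK_SIZE = 16  # assumption: default number of tokens per KV-cache block


def kv_cache_tokens(num_blocks, block_size=BLOCK_SIZE):
    """Approximate number of tokens that fit in the given number of cache blocks."""
    return num_blocks * block_size


gpu_tokens = kv_cache_tokens(1627)  # GPU blocks reported in the log
cpu_tokens = kv_cache_tokens(780)   # CPU (swap) blocks reported in the log
max_seq_len = 8192                  # max_seq_len reported in the log

print("GPU cache tokens:", gpu_tokens)
print("CPU cache tokens:", cpu_tokens)
print("Full-length sequences that fit on the GPU:", gpu_tokens // max_seq_len)
```

Under that assumption the GPU cache holds about 26,000 tokens, enough for three sequences at the full 8192-token length, or many more short ones.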


#### 2. Inference
Inference accepts either a single prompt or a list of prompts.
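Every prompt in this guide ends with the `<sep>` token, and generation stops at `<eod>`. A small helper can append the separator consistently so it is not forgotten. This is only a sketch: `build_prompt` and `build_prompts` are hypothetical names, and the assumption that `<sep>` should always close the user input is inferred from the examples here.

```python
SEP = "<sep>"  # separator token that ends each prompt in the examples


def build_prompt(question):
    """Strip surrounding whitespace and append <sep> unless it is already there."""
    question = question.strip()
    return question if question.endswith(SEP) else question + SEP


def build_prompts(questions):
    """Apply build_prompt to every question, preserving order."""
    return [build_prompt(q) for q in questions]


prompts = build_prompts(["给我一个python打印helloword的代码", "给我一个c++打印helloword的代码<sep>"])
print(prompts)
```

The resulting list can be passed to `llm.generate` in the same way as the hand-written prompt lists.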

##### Option 1. Single-prompt inference

```python
# Prompt: "Give me Python code that prints hello world"
prompts = ["给我一个python打印helloword的代码<sep>"]

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print("Prompt:", prompt)
    print("Generated text:", generated_text)
    print()

print("inference_time:", (end_time - start_time))
```

````text
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.13it/s]

Prompt: 给我一个python打印helloword的代码<sep>
Generated text:  以下是一个简单的Python代码，用于打印字符串"hello world"：
```python
print("hello world")
```

inference_time: 0.47487425804138184
````

##### Option 2. Multi-prompt inference

```python
# Prompts: "Give me Python code that prints hello world", "Give me C++ code that prints hello world"
prompts = ["给我一个python打印helloword的代码<sep>", "给我一个c++打印helloword的代码<sep>"]

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print("Prompt:", prompt)
    print("Generated text:", generated_text)
    print()

print("inference_time:", (end_time - start_time))
```

````text
Processed prompts: 100%|██████████| 2/2 [00:00<00:00,  3.27it/s]

Prompt: 给我一个python打印helloword的代码<sep>
Generated text:  以下是一个简单的Python代码，用于打印字符串"hello world"：
```python
print("hello world")
```

Prompt: 给我一个c++打印helloword的代码<sep>
Generated text:  ```cpp
#include <iostream>

int main() {
    std::cout << "hello world";
    return 
}
```

inference_time: 0.6165323257446289
````

### Step 3. Deploy Yuan2.0-2B based on vllm.entrypoints.api_server
Deploying Yuan2.0-2B with api_server involves two steps: starting the inference service and calling it.

#### 1. Start the service

```bash 
# Run the following command in a terminal, not in Jupyter via `!python`
python -m vllm.entrypoints.api_server --model=/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf --trust-remote-code
```
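Loading the model takes some time, so requests sent immediately after launching the server may be refused. A minimal sketch of a readiness check, using only the standard library, is shown below. It assumes the server listens on `localhost:8000`, as in the calls that follow, and the helper name `wait_for_port` is illustrative.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass.

    Returns True once a connection is accepted, False if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A call such as `wait_for_port(timeout=120)` can be placed before the first request. A successful connection only shows that something is listening on the port, not that a request will succeed.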

#### 2. Call the service
There are two ways to call the service: directly from the command line, or from a script, which is convenient for batch calls.

##### Option 1. Call the service from the command line

```bash
curl http://localhost:8000/generate \
    -d '{"prompt": "给我一个python打印helloword的代码<sep>", "use_beam_search": false, "n": 1, "temperature": 1, "top_p": 0, "top_k": 1, "max_tokens": 256, "stop": "<eod>"}'
```

```json
{"text":["给我一个python打印helloword的代码<sep> 以下是一个简单的Python代码，用于打印字符串\"hello world\"：\n```python\nprint(\"hello world\")\n```"]}
```

##### Option 2. Call the service from a script

```python
import requests
import json

# Prompt: "Give me Python code that prints hello world"
prompt = "给我一个python打印helloword的代码<sep>"
raw_json_data = {
    "prompt": prompt,
    "logprobs": 1,
    "max_tokens": 256,
    "temperature": 1,
    "use_beam_search": False,
    "top_p": 0,
    "top_k": 1,
    "stop": "<eod>",
}
json_data = json.dumps(raw_json_data)
headers = {"Content-Type": "application/json"}

# Send the request to the running api_server and decode the JSON response
response = requests.post("http://localhost:8000/generate", data=json_data, headers=headers)
output = json.loads(response.text)
print(output)
```

```text
{'text': ['给我一个python打印helloword的代码<sep> 以下是一个简单的Python代码，用于打印字符串"hello world"：\n```python\nprint("hello world")\n```']}
```
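As the output shows, the `/generate` endpoint returns the prompt followed by the completion. For batch calls it is convenient to loop over prompts and strip the echoed prompt from each result. The sketch below uses `urllib` from the standard library instead of `requests` to stay dependency-free. The helper names are hypothetical, and the payload reuses the sampling settings from the script above.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/generate"  # endpoint used in the examples above


def build_payload(prompt, max_tokens=256):
    """Request body with the same sampling settings as the script above."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 1,
        "use_beam_search": False,
        "top_p": 0,
        "top_k": 1,
        "stop": "<eod>",
    }


def strip_prompt(prompt, text):
    """Remove the echoed prompt from the start of the returned text, if present."""
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


def generate_batch(prompts, url=GENERATE_URL):
    """Send one request per prompt and return completions without the echoed prompt."""
    completions = []
    for prompt in prompts:
        body = json.dumps(build_payload(prompt)).encode("utf-8")
        request = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            result = json.loads(response.read().decode("utf-8"))
        completions.append(strip_prompt(prompt, result["text"][0]))
    return completions
```

With the server running, `generate_batch([...])` returns one completion string per prompt. Requests are sent one after another here, which keeps the sketch simple at the cost of throughput.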


### Step 4. Deploy Yuan2.0-2B based on vllm.entrypoints.openai.api_server
Deploying Yuan2.0-2B with the OpenAI-compatible api_server is similar to Step 3. Start and call the service as follows:

#### 1. Start the service

```bash 
# Run the following command in a terminal, not in Jupyter via `!python`
python -m vllm.entrypoints.openai.api_server --model=/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf --trust-remote-code
```

#### 2. Call the service
There are two ways to call the service: directly from the command line, or from a script, which is convenient for batch calls.

##### Option 1. Call the service from the command line

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf", "prompt": "给我一个python打印helloword的代码<sep>", "max_tokens": 300, "temperature": 1, "top_p": 0, "top_k": 1, "stop": "<eod>"}'
```

```json
{"id":"cmpl-5f7c39b38f4048928f0c2c6482174a72","object":"text_completion","created":17034486,"model":"/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf","choices":[{"index":0,"text":" 以下是一个简单的Python代码，用于打印字符串\"hello world\"：\n```python\nprint(\"hello world\")\n```","logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"total_tokens":39,"completion_tokens":28}}
```

##### Option 2. Call the service from a script

```python
import requests
import json

# Prompt: "Give me Python code that prints hello world"
prompt = "给我一个python打印helloword的代码<sep>"
raw_json_data = {
    "model": "/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf",
    "prompt": prompt,
    "max_tokens": 256,
    "temperature": 1,
    "use_beam_search": False,
    "top_p": 0,
    "top_k": 1,
    "stop": "<eod>",
}
json_data = json.dumps(raw_json_data, ensure_ascii=True)
headers = {"Content-Type": "application/json"}

# Send the request to the OpenAI-compatible server and decode the JSON response
response = requests.post("http://localhost:8000/v1/completions", data=json_data, headers=headers)
output = json.loads(response.text)
print(output)
```

```text
{'id': 'cmpl-783bba4e302b4fe48e7f45a35c9e4c05', 'object': 'text_completion', 'created': 17034591, 'model': '/root/autodl-fs/YuanLLM/Yuan2-2B-Mars-hf', 'choices': [{'index': 0, 'text': ' 以下是一个简单的Python代码，用于打印字符串"hello world"：\n```python\nprint("hello world")\n```', 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'total_tokens': 39, 'completion_tokens': 28}}
```
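Unlike `/generate`, the completions response does not echo the prompt, and it wraps the text in a `choices` list alongside token usage. The sketch below pulls out the commonly needed fields. The helper names are hypothetical, and the sample dictionary is a shortened copy of the output above.

```python
def first_completion(response):
    """Return (text, finish_reason) for the first choice of a completions response."""
    choice = response["choices"][0]
    return choice["text"].strip(), choice["finish_reason"]


def token_usage(response):
    """Return (prompt_tokens, completion_tokens, total_tokens) from the usage field."""
    usage = response["usage"]
    return usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"]


# Shortened sample taken from the response shown above
sample = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": " 以下是一个简单的Python代码", "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 11, "total_tokens": 39, "completion_tokens": 28},
}

text, reason = first_completion(sample)
print(text)
print(reason, token_usage(sample))
```

A `finish_reason` of `stop` indicates that generation ended at the `<eod>` stop string rather than at the `max_tokens` limit.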
