# Local Model Inference with Nexa SDK

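Nexa SDK is an on-device inference framework that supports ONNX and GGML models. It covers text generation, image generation, vision-language models (VLM), audio-language models, automatic speech recognition (ASR), and text-to-speech (TTS). Supported devices include CPU, GPU (CUDA, Metal, ROCm), and iOS. Its main use cases are: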

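- Run model inference locally with ONNX and GGML models. Models can be downloaded from the Nexa On-Device AI Hub, or directly from ModelScope or HuggingFace.
- Convert models from ModelScope or HuggingFace to the quantized GGUF format.
- Deploy a local server so that models can be called through an API.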

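This tutorial starts with environment setup, then walks through model inference, model conversion, and local server deployment with Nexa SDK. All commands in this tutorial are best run in a terminal.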

### Environment Setup

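For installation details, see the [Nexa SDK documentation](https://github.com/NexaAI/nexa-sdk.git). Depending on the device, you can either install a prebuilt package or build from source. This tutorial builds from source.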


In [None]:
!pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
!pip install modelscope
!git clone https://github.com/NexaAI/nexa-sdk.git
%cd nexa-sdk
!git submodule update --init --recursive
!pip install -e ".[convert]"
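
Before moving on, it can help to confirm that the editable install actually put the `nexa` command on the `PATH`. The snippet below is a minimal sketch of such a check; the helper name `find_cli` is hypothetical and not part of Nexa SDK, and it only looks up the executable without running it.

```python
import shutil


def find_cli(name: str = "nexa"):
    """Return the full path of an executable on PATH, or None if it is missing.

    `find_cli` is an illustrative helper, not a Nexa SDK function.
    """
    return shutil.which(name)


path = find_cli()
if path is None:
    print("`nexa` was not found on PATH; re-check the installation step above.")
else:
    print(f"`nexa` found at {path}")
```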

### Running Model Inference

Run a model with the `nexa` command line, using ModelScope as the download source. Because `nexa run` is interactive, it is best run in a terminal.

```shell
nexa run -ms Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
```

The command lists the GGUF model files available in the `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` repo. Select one of them, for example `qwen2.5-coder-7b-instruct-fp16.gguf`.
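
The repository listing (reproduced in the server output later in this tutorial) mixes single-file models with sharded ones named like `...-00001-of-00004.gguf`. As a rough aid for choosing a menu entry, the sketch below filters a list of file names down to the single-file variants. The shard naming pattern is inferred from that listing, and `single_file_models` is a hypothetical helper, not a Nexa SDK function.

```python
import re

# Shard files end in "-NNNNN-of-NNNNN.gguf" (pattern inferred from the repo listing).
SHARD_PATTERN = re.compile(r"-\d{5}-of-\d{5}\.gguf$")


def single_file_models(filenames):
    """Keep only .gguf files that are not one shard of a split model."""
    return [
        name for name in filenames
        if name.endswith(".gguf") and not SHARD_PATTERN.search(name)
    ]


listing = [
    "qwen2.5-coder-7b-instruct-fp16-00001-of-00004.gguf",
    "qwen2.5-coder-7b-instruct-fp16.gguf",
    "qwen2.5-coder-7b-instruct-q4_0-00001-of-00002.gguf",
    "qwen2.5-coder-7b-instruct-q4_0.gguf",
]
for name in single_file_models(listing):
    print(name)
```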


### Model Conversion

Use the nexa model conversion tool to convert a model to the quantized GGUF format. The converted model can then be run with the `nexa run` command.
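
To get a feel for what a quantization type buys you, the sketch below estimates the output size from the size of the f16 export. It relies on the block layouts of the classic GGML formats (32 weights per block plus per-block scale data, for example 18 bytes per block for `q4_0`, or 4.5 bits per weight). This is a back-of-the-envelope estimate only: it ignores metadata and any tensors that a converter keeps at higher precision, so real files may come out somewhat larger. The helper name `estimate_size_gb` is hypothetical.

```python
# Approximate bits per weight, derived from bytes per 32-weight block.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 values + fp16 scale = 34 bytes
    "q5_1": 6.0,  # 24 bytes per block
    "q5_0": 5.5,  # 22 bytes per block
    "q4_1": 5.0,  # 20 bytes per block
    "q4_0": 4.5,  # 16 bytes of 4-bit values + fp16 scale = 18 bytes
}


def estimate_size_gb(f16_size_gb: float, quant_type: str) -> float:
    """Scale the f16 file size by the ratio of bits per weight."""
    return f16_size_gb * BITS_PER_WEIGHT[quant_type] / 16.0


# 15.2 GB is the size of the f16 export reported in the conversion log of this tutorial.
for quant in ("q8_0", "q5_0", "q4_0"):
    print(f"{quant}: about {estimate_size_gb(15.2, quant):.1f} GB")
```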


In [2]:
%%bash
(echo "1"; echo "1"; echo "N"; echo "N") | nexa convert -ms Qwen/Qwen2.5-7B-Instruct


Select model type:
1. NLP (text generation)
2. COMPUTER_VISION (image generation)

Select model type (enter number): 
Available quantization types:
1. q4_0
2. q4_1
3. q5_0
4. q5_1
5. q8_0
6. q2_k
7. q3_k_s
8. q3_k_m
9. q3_k_l
10. q4_k_s
11. q4_k_m
12. q5_k_s
13. q5_k_m
14. q6_k
15. iq2_xxs
16. iq2_xs
17. q2_k_s
18. iq3_xs
19. iq3_xxs
20. iq1_s
21. iq4_nl
22. iq3_s
23. iq3_m
24. iq2_s
25. iq2_m
26. iq4_xs
27. iq1_m
28. f16
29. f32
30. bf16
31. q4_0_4_4
32. q4_0_4_8
33. q4_0_8_8
34. tq1_0
35. tq2_0



2024-11-20 19:46:59.615480: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-20 19:47:11.936118: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Select quantization type (enter number): Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct


Downloading [config.json]: 100%|██████████| 663/663 [00:00<00:00, 1.40kB/s]
Downloading [configuration.json]: 100%|██████████| 2.00/2.00 [00:00<00:00, 3.43B/s]
Downloading [generation_config.json]: 100%|██████████| 243/243 [00:00<00:00, 433B/s]
Downloading [LICENSE]: 100%|██████████| 11.1k/11.1k [00:00<00:00, 13.0kB/s]
Downloading [merges.txt]: 100%|██████████| 1.59M/1.59M [00:00<00:00, 2.17MB/s]
Downloading [model-00001-of-00004.safetensors]: 100%|██████████| 3.67G/3.67G [00:13<00:00, 295MB/s] 
Downloading [model-00002-of-00004.safetensors]: 100%|██████████| 3.60G/3.60G [00:15<00:00, 253MB/s] 
Downloading [model-00003-of-00004.safetensors]: 100%|██████████| 3.60G/3.60G [00:21<00:00, 179MB/s] 
Downloading [model-00004-of-00004.safetensors]: 100%|██████████| 3.31G/3.31G [00:12<00:00, 282MB/s] 
Downloading [model.safetensors.index.json]: 100%|██████████| 27.1k/27.1k [00:01<00:00, 26.0kB/s]
Downloading [README.md]: 100%|██████████| 5.85k/5.85k [00:00<00:00, 13.1kB/s]
Downloading [tokenize

Successfully downloaded repository 'Qwen/Qwen2.5-7B-Instruct' to /root/.cache/nexa/hub/modelscope/Qwen/Qwen2.5-7B-Instruct


Writing: 100%|██████████| 15.2G/15.2G [01:56<00:00, 131Mbyte/s] 
2024-11-20 19:51:15,076 - INFO - Model successfully exported to /root/.cache/nexa/tmp_models/Qwen2.5-7B-Instruct-f16.gguf
2024-11-20 19:51:15,089 - INFO - Starting quantization of /root/.cache/nexa/tmp_models/Qwen2.5-7B-Instruct-f16.gguf
2024-11-20 19:51:15,089 - INFO - Output file: /mnt/workspace/nexa-sdk/Qwen2.5-7B-Instruct-q4_0.gguf
llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /root/.cache/nexa/tmp_models/Qwen2.5-7B-Instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 7B Instruct
llama_model_loader: - kv


Conversion completed successfully. Output file: /mnt/workspace/nexa-sdk/Qwen2.5-7B-Instruct-q4_0.gguf

Would you like to store this model in nexa list so you can run it with `nexa run <model_name>` anywhere and anytime? (y/N): 
Would you like to run the converted model? (y/N): Exiting without running the model.

Converted model stored at /mnt/workspace/nexa-sdk/Qwen2.5-7B-Instruct-q4_0.gguf

You can run the converted model with command: nexa run /mnt/workspace/nexa-sdk/Qwen2.5-7B-Instruct-q4_0.gguf -lp -mt NLP


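After selecting the model type (`NLP (text generation)`), choose one of the quantization types, for example `q4_0`. The tool then quantizes the model. Next, run the converted local model. As before, `nexa run` is best run in a terminal.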

```shell
nexa run /mnt/workspace/nexa-sdk/Qwen2.5-7B-Instruct-q4_0.gguf -lp -mt NLP
```
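
The conversion cell above answers the interactive prompts by piping four lines into `nexa convert`: the model type (`1`, NLP), the quantization type (`1`, `q4_0`), and `N` twice to decline storing and running the converted model. The sketch below builds the same input from Python. The prompt order is taken from the output shown in this tutorial and may differ in other versions of the tool; `build_convert_answers` and `run_convert` are hypothetical helpers, and `run_convert` is defined but not called here because it would download and convert the full model.

```python
import subprocess


def build_convert_answers(model_type: int, quant_type: int,
                          store: bool = False, run: bool = False) -> str:
    """Build newline-terminated answers for the four prompts, in the order shown above."""
    answers = [
        str(model_type),        # "Select model type (enter number)"
        str(quant_type),        # "Select quantization type (enter number)"
        "y" if store else "N",  # store the model in the nexa list?
        "y" if run else "N",    # run the converted model?
    ]
    return "\n".join(answers) + "\n"


def run_convert(repo: str, answers: str) -> int:
    """Invoke `nexa convert -ms <repo>` with the prepared answers on stdin."""
    completed = subprocess.run(["nexa", "convert", "-ms", repo],
                               input=answers, text=True)
    return completed.returncode


# Same input as: (echo "1"; echo "1"; echo "N"; echo "N") | nexa convert ...
print(repr(build_convert_answers(1, 1)))
```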

### Local Server Deployment and API Calls

Use `nexa server` to deploy a model as a local server. The model can then be called through an API.

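Run the following command and select one of the available GGUF model files to download. The cell below pipes `10` to the prompt, which selects `qwen2.5-coder-7b-instruct-q4_0.gguf`.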

In [6]:
%%bash
echo "10" | nexa server -ms Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --port 8085

INFO:     Started server process [1811]
INFO:     Waiting for application startup.
2024-11-21 10:20:21.137103: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-21 10:20:21.175952: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


No model type specified. Running with default model type: NLP
You can specify a different model type using the -mt flag
Available gguf models in the repository:
1. qwen2.5-coder-7b-instruct-fp16-00001-of-00004.gguf
2. qwen2.5-coder-7b-instruct-fp16-00002-of-00004.gguf
3. qwen2.5-coder-7b-instruct-fp16-00003-of-00004.gguf
4. qwen2.5-coder-7b-instruct-fp16-00004-of-00004.gguf
5. qwen2.5-coder-7b-instruct-fp16.gguf
6. qwen2.5-coder-7b-instruct-q2_k.gguf
7. qwen2.5-coder-7b-instruct-q3_k_m.gguf
8. qwen2.5-coder-7b-instruct-q4_0-00001-of-00002.gguf
9. qwen2.5-coder-7b-instruct-q4_0-00002-of-00002.gguf
10. qwen2.5-coder-7b-instruct-q4_0.gguf
11. qwen2.5-coder-7b-instruct-q4_k_m-00001-of-00002.gguf
12. qwen2.5-coder-7b-instruct-q4_k_m-00002-of-00002.gguf
13. qwen2.5-coder-7b-instruct-q4_k_m.gguf
14. qwen2.5-coder-7b-instruct-q5_0-00001-of-00002.gguf
15. qwen2.5-coder-7b-instruct-q5_0-00002-of-00002.gguf
16. qwen2.5-coder-7b-instruct-q5_0.gguf
17. qwen2.5-coder-7b-instruct-q5_k_m-00001-of-0000

INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8085 (Press CTRL+C to quit)
INFO:     Shutting down


Error while terminating subprocess (pid=1808): 


INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1811]


Please enter the number of the model you want to download and use: You have selected: qwen2.5-coder-7b-instruct-q4_0.gguf
Successfully pulled model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:qwen2.5-coder-7b-instruct-q4_0.gguf to /root/.cache/nexa/hub/modelscope/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/qwen2.5-coder-7b-instruct-q4_0.gguf
model_type: NLP
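
The API request in the next cell only works while the server is still running and has finished starting up. A minimal sketch of a readiness check is shown below: it polls the TCP port until something accepts connections. The helper names `is_port_open` and `wait_for_server` are hypothetical, and the check only confirms that the port is open, not that the model has loaded.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_server(host: str = "localhost", port: int = 8085,
                    retries: int = 30, delay: float = 1.0) -> bool:
    """Poll the port up to `retries` times, sleeping `delay` seconds between attempts."""
    for _ in range(retries):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False


if wait_for_server(retries=3):
    print("Server is accepting connections on port 8085.")
else:
    print("Server is not reachable; start `nexa server` in a separate terminal.")
```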


In [8]:
import requests

# URL of the chat completions endpoint on the local server
url = "http://localhost:8085/v1/chat/completions"

# Request body
request_body = {
  "messages": [
    {
      "role": "user",
      "content": "Tell me a story"
    }
  ],
  "max_tokens": 128,
  "temperature": 0.1,
  "stream": False,
  "stop_words": []
}

# Send the POST request; `json=` serializes the body and sets the Content-Type header
response = requests.post(url, json=request_body)

# Check the response status code
if response.status_code == 200:
    # Parse the response content
    response_data = response.json()
    print("Response:", response_data)
else:
    print(f"Error: {response.status_code} - {response.text}")

Response: {'id': '0e82fa3e-58be-4a0a-b2f2-fe5c81fc4240', 'object': 'chat.completion', 'created': 1732155748.7462492, 'choices': [{'message': {'role': 'assistant', 'content': 'Once upon a time, in a small village nestled among rolling hills, there lived a young girl named Lily. Lily was known for her kind heart and her love for nature. She spent most of her days exploring the nearby forest, collecting flowers, and helping her grandmother with her chores.\n\nOne sunny morning, as Lily was wandering through the forest, she stumbled upon a hidden clearing. In the center of the clearing stood a beautiful, ancient tree. The tree was unlike any other she had ever seen; its bark was smooth and its leaves shimmered in the sunlight.\n\nAs Lily approached the tree, she noticed a small, glowing object lying on'}, 'logprobs': None}]}
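
The raw response is a nested dictionary, and the generated text sits under `choices[0].message.content`. The sketch below pulls out just that text, based on the response shape printed above. `extract_content` is a hypothetical helper, and the sample dictionary is a shortened copy of that output.

```python
def extract_content(response_data: dict):
    """Return the assistant's text from a chat completion response, or None if absent."""
    choices = response_data.get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")


# Shortened copy of the response shown above
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "message": {"role": "assistant", "content": "Once upon a time, ..."},
            "logprobs": None,
        }
    ],
}
print(extract_content(sample))
```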
