# vLLM Inference Service

vLLM officially supports starting an OpenAI-compatible API server with a single bash command:

```bash
vllm serve [YOUR MODEL or MODEL PATH]
```

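Launch parameters differ slightly from model to model. For a reasoning model such as `deepseek-r1`, this notebook adds `--enable-reasoning` and `--reasoning-parser deepseek_r1`. Which of these flags is required depends on the vLLM version (newer releases may need only `--reasoning-parser`), so check the documentation for the release you have installed. For how to serve a particular model with vLLM, refer to the vLLM official documentation or the model's own documentation.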

References:

- [vLLM official documentation](https://docs.vllm.ai/en/latest/index.html)
- [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create)

In [1]:
# !pip install openai

In [2]:
from openai import OpenAI

## 1. DeepSeek-R1

**1) Start the server**

I have prepared a vLLM launch script for deepseek-r1 under `/server`. Change into that folder and run it:

```bash
cd server
bash ds_vllm_bash_server.sh
```

Or run it directly from the command line:

```bash
conda activate vllm_env && \
    CUDA_VISIBLE_DEVICES=0 vllm serve "./model/DeepSeek-R1-Distill-Qwen-1.5B" \
        --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
        --enable-reasoning \
        --reasoning-parser deepseek_r1 \
        --host 0.0.0.0 \
        --port 8000 \
        --gpu-memory-utilization 0.95 \
        --max-seq-len-to-capture 8192 \
        --tensor-parallel-size 1 \
        --api-key token-abc123456
```
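
The flags above determine what a client has to send: requests go to port 8000, and `--api-key` means every request must carry an `Authorization: Bearer` header. As a rough sketch using only the standard library, a request for the OpenAI chat completions route could be built as follows. The helper name `build_chat_request` is hypothetical; the host, key, and model name are copied from the command above.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       api_key="token-abc123456",
                       model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("What is a large language model?")
print(req.full_url)
print(json.loads(req.data)["messages"])
```

Sending it takes one more call, `urllib.request.urlopen(req)`, which only succeeds while the server is running.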

**2) Run the client**

In [3]:
# -*- coding: utf-8 -*-
# USAGE: python3 ds_vllm_bash_client.py
# https://docs.vllm.ai/en/latest/features/reasoning_outputs.html
# pip install vllm --upgrade


openai_api_key = "token-abc123456"
openai_api_base = "http://localhost:8000/v1"


def chat_completion(prompt, model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"):
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model=model,
        messages=[
            # DeepSeek's usage notes for R1 recommend no system prompt,
            # so all instructions go in the user turn
            {"role": "user", "content": prompt},
        ],
        temperature=0.6,
        top_p=0.95,
        max_tokens=512,
        extra_body={
            "repetition_penalty": 1.05,
        },
    )

    return chat_response

In [4]:
# response = chat_completion(prompt="What is a large language model?")
# content = response.choices[0].message.content
# print(content)
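
Because the server was started with `--reasoning-parser deepseek_r1`, the reasoning-outputs page linked in the client script describes the chain of thought as being returned in a separate `reasoning_content` field of the message, with `content` holding only the final answer. The sketch below reads both defensively. `split_message` is a hypothetical helper, and the field name is an assumption that may differ between vLLM versions, hence the `getattr` with a default.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (reasoning, answer) from a chat completion message object."""
    # Assumption: the reasoning parser exposes `reasoning_content`;
    # fall back to an empty string when the attribute is absent or None.
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning.strip(), answer.strip()


# Stand-in for response.choices[0].message, so the sketch runs without a server
fake_message = SimpleNamespace(
    reasoning_content="The user asks for a definition...\n",
    content="A large language model is a neural network trained on text.",
)
reasoning, answer = split_message(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a live server, pass `chat_completion(prompt=...).choices[0].message` instead of the stand-in.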

## 2. Qwen

**1) Start the server**

Use the bash script from the repository:

```bash
cd server
bash qwen_vllm_bash_server.sh
```

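Or run it directly from the command line. Note that `--served-model-name` is only the alias that clients must use; here it reads `Qwen/Qwen2.5-7B-Instruct` even though the weights are loaded from `./model/Qwen2.5-1.5B-Instruct/`: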

```bash
conda activate vllm_env && \
    vllm serve "./model/Qwen2.5-1.5B-Instruct/" \
        --served-model-name Qwen/Qwen2.5-7B-Instruct \
        --host 0.0.0.0 \
        --port 8000 \
        --gpu-memory-utilization 0.98 \
        --tensor-parallel-size 1 \
        --api-key token-kcgyrk
```
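
Since the client must pass exactly the name given to `--served-model-name`, a quick check against the `/v1/models` response can catch a mismatch before the first chat request. The sketch below works on an already-parsed response body in the OpenAI list format. The helper names are hypothetical, and the payload is illustrative rather than captured from a running server.

```python
def served_model_ids(models_payload):
    """Extract model ids from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_model_name(models_payload, requested):
    """Raise ValueError if `requested` is not one of the served names."""
    ids = served_model_ids(models_payload)
    if requested not in ids:
        raise ValueError(f"{requested!r} is not served; available: {ids}")
    return requested


# Illustrative payload in the OpenAI list format
payload = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}],
}
print(check_model_name(payload, "Qwen/Qwen2.5-7B-Instruct"))
```

With the `openai` library, `client.models.list()` returns the same information from a live server.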

**2) Run the client**

In [5]:
# -*- coding: utf-8 -*-
# USAGE: python3 qwen_vllm_bash_client.py


openai_api_key = "token-kcgyrk"
openai_api_base = "http://localhost:8000/v1"


def chat_completion(prompt, model="Qwen/Qwen2.5-7B-Instruct"):
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.8,
        top_p=0.9,
        max_tokens=512,
        extra_body={
            "repetition_penalty": 1.05,
        },
    )

    return chat_response

In [6]:
# response = chat_completion(prompt="What is a large language model?")
# content = response.choices[0].message.content
# print(content)

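## 3. An OpenAI-compatible API service

We do not have to rely on the vLLM command line. Instead, we can follow the OpenAI API documentation and build our own API that conforms to the OpenAI interface specification. This gives more room for customization, such as implementing the pre-processing and post-processing steps our own way.

Taking the `deepseek-r1` model service as an example, I used FastAPI to implement an API server that can be called through the `openai` library. It implements only two core routes:

- `/v1/models`: list the models the service currently supports
- `/v1/chat/completions`: take the user's question and call the model to get an answer

The code below is the same as the `/server/ds_vllm_server.py` file in this repository.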

In [7]:
# -*- coding: utf-8 -*-
# DESC: vLLM openai server
# REFS:
#   - https://platform.openai.com/docs/api-reference/chat/create
#   - https://github.com/openai/openai-quickstart-python
#   - https://fastapi.tiangolo.com/advanced/events/
# USAGE: 
#   conda activate vllm_env
#   python3 ds_vllm_server.py

import os
import re
import vllm
import time
import uuid
import uvicorn

from typing import List, Optional
from pydantic import BaseModel
from fastapi import FastAPI, HTTPException, Depends
from fastapi.security import HTTPBearer
from fastapi.responses import StreamingResponse
from contextlib import asynccontextmanager


# API key and model configuration
API_KEY = "token-abc123456"
MODEL_NAME = "DeepSeek-R1-Distill-Qwen-1.5B"
MODEL_PATH = "../model/DeepSeek-R1-Distill-Qwen-1.5B"


# Select which GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


llm_model = {}


def load_model():
    llm = vllm.LLM(
        model=MODEL_PATH,
        gpu_memory_utilization=0.95,
        max_model_len=4096,
        tensor_parallel_size=1,
        enable_prefix_caching=True,
        max_num_batched_tokens=51200
    )

    return llm


@asynccontextmanager
async def lifespan(app: FastAPI):
    llm = load_model()
    llm_model[MODEL_NAME] = llm
    yield
    llm_model.clear()


app = FastAPI(lifespan=lifespan)
security = HTTPBearer()


class ChatMessage(BaseModel):
    role: str  # "user", "assistant", "system"
    content: str


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 8192
    temperature: Optional[float] = 0.6
    top_p: Optional[float] = 0.95
    n: Optional[int] = 1
    stream: Optional[bool] = False
    stop: Optional[List[str]] = None
    presence_penalty: Optional[float] = 0.0
    frequency_penalty: Optional[float] = 0.0


def verify_token(credentials=Depends(security)):
    # HTTPBearer yields an object whose .credentials attribute holds the bearer token
    if credentials.credentials != API_KEY:
        raise HTTPException(401, "Invalid API Key")


def split_text(text):
    """Split model output into the reasoning part and the final answer."""
    pattern = re.compile(r'<think>(.*?)</think>(.*)', re.DOTALL)
    match = pattern.search(text)  # look for the reasoning block

    if match:  # a reasoning block was found
        think_content = match.group(1).strip()
        answer_content = match.group(2).strip()
    else:
        think_content = ""
        answer_content = text.strip()

    return think_content, answer_content


def model_infr(message: str,
               model,
               temperature=0.6,
               max_tokens=8192,
               top_p=0.95,
               stop_token_ids=None):
    # stop_token_ids=None avoids a mutable default argument and lets
    # generation end on the model's own EOS token

    # Define the sampling parameters
    sampling_params = vllm.SamplingParams(temperature=temperature,
                                          top_p=top_p,
                                          max_tokens=max_tokens,
                                          stop_token_ids=stop_token_ids)

    # Run generation; note that generate() receives the raw user text as the prompt
    output = model.generate(message, sampling_params)
    response = output[0].outputs[0].text
    # Prepend the opening tag so that split_text can match the reasoning block
    response = f'<think>\n{response}'
    think_content, answer_content = split_text(response)

    return think_content, answer_content


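def format_prompt(messages) -> str:
    """Keep only the most recent user message."""
    # Walk backwards to find the last user message
    for message in reversed(messages):
        if message.role == "user":
            return message.content  # return the message text directly
    return ""  # no user message: return an empty string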


@app.get("/v1/models", dependencies=[Depends(verify_token)])
async def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": MODEL_NAME,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "user",
                "permissions": []
            }
        ]
    }


@app.post("/v1/chat/completions", dependencies=[Depends(verify_token)])
async def create_chat_completion(request: ChatCompletionRequest):
    # Validate the request
    model_name = request.model
    models = [MODEL_NAME]
    if model_name not in models:
        raise HTTPException(400, f"{model_name} not in {models}")

    # This server returns exactly one non-streamed completion per request
    if request.stream:
        raise HTTPException(400, "Streaming is not supported")
    if request.n is not None and request.n > 1:
        raise HTTPException(400, "Only n=1 is supported")
    if request.temperature is not None and request.temperature < 0:
        raise HTTPException(400, "Temperature must be ≥ 0")
    if request.top_p is not None and (request.top_p < 0 or request.top_p > 1):
        raise HTTPException(400, "Top_p must be between 0 and 1")

    # Run model inference
    prompt = format_prompt(request.messages)
    think_content, answer_content = model_infr(
        message=prompt,
        model=llm_model[model_name],
        temperature=request.temperature,
        max_tokens=request.max_tokens,
        top_p=request.top_p
    )

    # Assemble the output text
    lst = [
        "<think>",
        think_content,
        "</think>",
        answer_content
    ]
    content = "\n".join(lst)

    return {
        "id": f"chatcmpl-{str(uuid.uuid4())}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": content.strip()
            },
            "finish_reason": "stop"
        }]
    }


# uvicorn.run(app, host="0.0.0.0", port=9494, log_level="debug")
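
One case `split_text` does not cover: if generation stops at `max_tokens` before the model emits `</think>`, the regex finds no match and the unfinished reasoning, opening tag included, is returned as the answer. A possible variant, sketched here under the hypothetical name `split_text_lenient` (not part of the repository), treats an unterminated `<think>` block as reasoning with an empty answer.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def split_text_lenient(text):
    """Split model output into (think_content, answer_content)."""
    match = THINK_RE.search(text)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    if "<think>" in text:
        # Opening tag without a closing tag: output was cut off mid-reasoning
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()


print(split_text_lenient("<think>\nstep 1\n</think>\nfinal answer"))
print(split_text_lenient("<think>\nstep 1, step 2"))
print(split_text_lenient("plain answer"))
```

Whether an empty answer or an error is the better response to truncated reasoning depends on the caller; the point of the sketch is only that the two cases can be told apart.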


## 4. Force-clearing the GPU memory cache

If GPU memory runs short, this is worth a try:

In [8]:
import utils

utils.torch_gc()
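
The body of `utils.torch_gc` is not shown in this notebook. As an assumption about what such a helper commonly does, the sketch below runs Python garbage collection and then asks PyTorch to release its cached CUDA memory. The name `torch_gc_sketch` is hypothetical, and the function degrades to a no-op when PyTorch or CUDA is unavailable.

```python
import gc


def torch_gc_sketch():
    """Release cached GPU memory if PyTorch with CUDA is available.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # drop unreachable Python objects first so their tensors can be freed
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    torch.cuda.ipc_collect()  # clean up memory shared via CUDA IPC
    return True


print(torch_gc_sketch())
```

This only returns memory that PyTorch has cached but no longer uses. Memory held by a live model, for example the engine stored in `llm_model`, is released only after those references are dropped.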