# Deploying Qwen2.5-1.5B-Instruct with vLLM

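The `Instruct` suffix in the model name marks an instruction-tuned version. It has been optimized to understand and follow user instructions, which makes it a good fit for dialogue generation and task-oriented scenarios.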

Download the model files before deploying:

```
# Install huggingface_hub
pip install -U huggingface_hub

# Create the model directory if it does not exist yet, then enter it
mkdir -p model && cd model
huggingface-cli download --resume-download Qwen/Qwen2.5-1.5B-Instruct --local-dir ./Qwen2.5-1.5B-Instruct
```

> Hugging Face link: [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
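
The same download can also be scripted from Python. The block below is a minimal sketch, assuming `huggingface_hub` is installed as in the step above; `target_dir` and `download` are hypothetical helper names introduced here for illustration, and the `./model` root simply mirrors the directory used by the CLI command.

```python
from pathlib import Path

REPO_ID = "Qwen/Qwen2.5-1.5B-Instruct"


def target_dir(repo_id: str, root: str = "./model") -> Path:
    """Return the local directory for a repo, e.g. model/Qwen2.5-1.5B-Instruct."""
    return Path(root) / repo_id.split("/")[-1]


def download(repo_id: str = REPO_ID, root: str = "./model") -> Path:
    """Download every file of the repo into the target directory."""
    # Imported lazily so the path helper above works without the library.
    from huggingface_hub import snapshot_download

    local_dir = target_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir

# Usage (needs network access): download()
```

The returned path matches the `MODEL_PATH` used later in the notebook when the script is run from the project root.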

In [11]:
import os
import vllm

import utils

from transformers import AutoTokenizer

# Set this to your model path
MODEL_PATH = './model/Qwen2.5-1.5B-Instruct'

In [2]:
# Select the GPU with the most free memory
gpu_id, free_memory = utils.pick_gpu()
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

print(f"Selected GPU {gpu_id} with {free_memory} free memory")

Selected GPU 0 with 7056.00 MB free memory
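
`utils.pick_gpu()` comes from a local helper module that is not shown in this notebook. As a rough idea of how such a helper could work, here is a sketch that asks `nvidia-smi` for the free memory of each GPU and picks the largest. `parse_free_memory` and `pick_gpu_sketch` are hypothetical names, and the return type (an integer number of MiB) is an assumption that may differ from the real helper.

```python
import subprocess


def parse_free_memory(csv_text):
    """Parse 'index, free_mib' rows into a list of (index, free_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        rows.append((int(index), int(free_mib)))
    return rows


def pick_gpu_sketch():
    """Return (gpu_index, free_mib) for the GPU with the most free memory."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.free",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return max(parse_free_memory(result.stdout), key=lambda row: row[1])
```

Setting `CUDA_VISIBLE_DEVICES` has to happen before CUDA is initialized in the process, which is why the notebook does it before creating the `vllm.LLM` instance.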


In [3]:
# Initialize the model
llm = vllm.LLM(
    model=MODEL_PATH,
    gpu_memory_utilization=0.95,
    max_model_len=2048
)

INFO 03-20 21:37:29 config.py:542] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 03-20 21:37:29 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='./model/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='./model/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./model/Qwen2.5-1.5B-Instruct, num_sc

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-20 21:41:21 model_runner.py:1115] Loading model weights took 2.8875 GB
INFO 03-20 21:41:22 worker.py:267] Memory profiling takes 0.78 seconds
INFO 03-20 21:41:22 worker.py:267] the current vLLM instance can use total_gpu_memory (8.00GiB) x gpu_memory_utilization (0.95) = 7.60GiB
INFO 03-20 21:41:22 worker.py:267] model weights take 2.89GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 3.28GiB.
INFO 03-20 21:41:22 executor_base.py:110] # CUDA blocks: 7686, # CPU blocks: 9362
INFO 03-20 21:41:22 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 60.05x
INFO 03-20 21:41:23 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utiliz

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:10<00:00,  3.43it/s]

INFO 03-20 21:41:33 model_runner.py:1562] Graph capturing finished in 10 secs, took 0.18 GiB
INFO 03-20 21:41:33 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 11.47 seconds
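
The memory figures in the log above can be checked by hand. The sketch below redoes the arithmetic; every number is copied from the log except the block size of 16 tokens, which is an assumption (it is a common vLLM default on CUDA, but it does not appear in the visible part of this log).

```python
# Values copied from the vLLM log output.
total_gpu_memory_gib = 8.00
gpu_memory_utilization = 0.95
weights_gib = 2.89
non_torch_gib = 0.03
activation_peak_gib = 1.39
num_cuda_blocks = 7686
max_model_len = 2048

# Assumption: each KV cache block holds 16 tokens.
BLOCK_SIZE = 16

# Memory vLLM is allowed to use, and what is left for the KV cache.
budget_gib = total_gpu_memory_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# Tokens the KV cache can hold, and how many full-length requests fit at once.
kv_cache_tokens = num_cuda_blocks * BLOCK_SIZE
max_concurrency = kv_cache_tokens / max_model_len

print(f"budget:          {budget_gib:.2f} GiB")
print(f"KV cache:        {kv_cache_gib:.2f} GiB")
print(f"KV cache tokens: {kv_cache_tokens}")
print(f"max concurrency: {max_concurrency:.2f}x")
```

The KV cache figure comes out near 3.29 GiB rather than the logged 3.28 GiB because the logged inputs are already rounded. The concurrency figure agrees with the logged `60.05x`, which is consistent with the assumed block size. Lowering `max_model_len` or raising `gpu_memory_utilization` would raise the concurrency figure.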





In [15]:
# Initialize the tokenizer from the same local checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Define the sampling parameters
sampling_params = vllm.SamplingParams(
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.05,
    max_tokens=512
)

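# The prompt is in Chinese. In English: "You will play a young idol who is passionate
# at heart but cool on the surface. Reply to a fan's good-night post in a way that
# hints at deep affection."
prompt = '你将扮演一个内心火热但是表面冷淡的小偶像，请用暗含深切热爱的态度，回复粉丝的晚安动态。'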
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Run inference
outputs = llm.generate([text], sampling_params)

len(outputs)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.97s/it, est. speed input: 28.43 toks/s, output: 52.79 toks/s]


1
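
To give a feel for what `temperature` and `top_p` do in the sampling parameters above, here is a toy sketch in plain Python over four made-up logits. It is not vLLM's implementation: the real sampler is vectorized and also applies `repetition_penalty`, which this sketch leaves out. The function names and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.8)
kept = top_p_filter(probs, top_p=0.9)
print(kept)
```

With these toy values only the two most likely tokens survive the `top_p=0.9` cut, and the next token would be drawn from that reduced set.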

In [16]:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt:\n{prompt}")
    print(f"Generated text:\n{generated_text}")
    print("\n")

Prompt:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
你将扮演一个内心火热但是表面冷淡的小偶像，请用暗含深切热爱的态度，回复粉丝的晚安动态。<|im_end|>
<|im_start|>assistant

Generated text:
【暗含深情的晚安】亲爱的粉丝们，夜深了，月光如水，我静静地坐在窗前，看着那轮明月，心中涌动着对你们深深的爱意。虽然在这个喧嚣的世界中，我们常常忙碌于生活，但请记得，你们是我心中的那片宁静的港湾。愿这美好的夜晚带给你无尽的温暖和安心，明天依旧璀璨，让我们在每一个晨曦中重逢。晚安，我的宝贝们！
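
The prompt printed above shows what `apply_chat_template` produced for this model: each message is wrapped as `<|im_start|>{role}` ... `<|im_end|>`, and an open assistant turn is appended at the end. The sketch below rebuilds that layout by hand, inferred only from the printed prompt. `render_chatml` is a hypothetical helper, not part of `transformers`, and the real template covers more cases (for example tool calls) than this sketch does.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the layout seen in the printed prompt."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
text = render_chatml(messages)
print(text)
```

In practice it is safer to keep using `tokenizer.apply_chat_template`, since the template ships with the checkpoint and stays in sync with how the model was trained.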




References:

- [qwen.readthedocs.io](https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html)
- [datawhalechina/self-llm](https://github.com/datawhalechina/self-llm/blob/master/models/Qwen2.5)