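## The vLLM Inference Framework

vLLM is an efficient inference framework designed to accelerate and optimize inference for large language models. It relies on the following key techniques: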

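1. **PagedAttention**: PagedAttention stores the KV cache in fixed-size blocks of physical GPU memory and maps each sequence's logical blocks onto them, much like paging in an operating system's virtual memory. This reduces memory fragmentation and lets sequences that share a prefix reuse the same physical blocks instead of holding duplicate copies.
2. **Asynchronous execution**: vLLM supports asynchronous execution, meaning it can continue with other computation while waiting for some results, which improves overall inference throughput.
3. **Continuous batching**: In batched inference, short sequences that finish early usually have to wait for the longest sequence in the batch, and the padded slots waste both memory and generation time. Continuous batching fills a finished sequence's slot with a new sequence right away, which shortens total generation time considerably.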
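The block mapping behind PagedAttention can be illustrated with a toy allocator. This is only a sketch of the idea, not vLLM's implementation: `ToyBlockAllocator`, `build_block_table`, `fork_block_table`, and the block size of 4 tokens are all hypothetical names and values chosen for illustration.

```python
BLOCK_SIZE = 4  # tokens per KV block; illustrative value, not vLLM's setting


class ToyBlockAllocator:
    """Hands out fixed-size physical blocks and reference-counts them."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.refcount = {}                   # physical block id -> number of users

    def allocate(self):
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A second sequence reuses the same physical block instead of copying it.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def build_block_table(allocator, num_tokens):
    """Map a sequence's logical blocks (list index) to physical block ids."""
    blocks_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
    return [allocator.allocate() for _ in range(blocks_needed)]


def fork_block_table(allocator, parent_table):
    """Create a block table for a new sequence that shares the parent's prefix."""
    return [allocator.share(block) for block in parent_table]


allocator = ToyBlockAllocator(num_blocks=8)
seq_a = build_block_table(allocator, num_tokens=10)  # 10 tokens need 3 blocks
seq_b = fork_block_table(allocator, seq_a)           # same prefix, no new blocks
blocks_in_use = len(allocator.refcount)

print("seq_a block table:", seq_a)
print("seq_b block table:", seq_b)
print("physical blocks in use:", blocks_in_use)
```

Two sequences with the same 10-token prefix occupy three physical blocks in total rather than six, which is the kind of duplication the block table is meant to avoid.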

vLLM supports most text-only LLMs, some multimodal LLMs, and some GPTQ- and AWQ-quantized models.
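To make the continuous batching item in the list above more concrete, the following toy simulation counts decoding steps under the two scheduling strategies. It is a simplified sketch: each sequence is reduced to the number of tokens it still has to generate, one step produces one token per running sequence, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest sequence has finished."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A slot freed by a finished sequence is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each sequence currently in the batch
    steps = 0
    while waiting or running:
        # Fill every free slot with a waiting sequence.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [2, 8, 3, 8, 2, 3]  # tokens to generate per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)

print("static batching steps:", static_steps)          # 19
print("continuous batching steps:", continuous_steps)  # 13
```

With these request lengths, static batching spends 19 steps because short requests sit idle next to long ones, while continuous batching needs 13, which is the minimum for 26 tokens on two slots.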

In [1]:
# Installing vLLM only requires the command below
!pip install vllm

Looking in indexes: https://mirrors.aliyun.com/pypi/simple


In [3]:
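# Set this environment variable to download models from the ModelScope hub, which speeds up downloads.
# %env sets the variable for the notebook kernel itself (a `!export` would only affect a throwaway subshell),
# and vLLM expects the value True rather than 1.
%env VLLM_USE_MODELSCOPE=True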

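Note that vLLM pre-allocates a large amount of GPU memory to hold the KV cache; the more memory it can use, the more sequences it can batch and the higher the throughput. To limit how much memory it takes, use the following parameter:

- gpu_memory_utilization: a float between 0 and 1, 0.9 by default, giving the fraction of GPU memory that vLLM may occupy

In addition, the following parameters are often used:

- tensor_parallel_size: the tensor-parallel degree; if you have multiple GPUs, use it to shard the model across them
- pipeline_parallel_size: the pipeline-parallel degree
- max_num_seqs: the maximum number of sequences processed in parallel

For more distributed-inference parameters, see the official vLLM documentation: https://docs.vllm.ai/en/latest/serving/distributed_serving.html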
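As a rough sketch of how these parameters might be set together, the helper below validates a few values and collects them into keyword arguments for the `LLM` constructor. `build_engine_kwargs` and its default values are hypothetical and exist only for this illustration; the keyword names are the parameters listed above. The engine itself is not created here, so the snippet runs without a GPU.

```python
def build_engine_kwargs(model, gpu_memory_utilization=0.9,
                        tensor_parallel_size=1, max_num_seqs=64):
    """Validate a few common settings and return them as LLM(...) keyword arguments."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    return {
        "model": model,
        "gpu_memory_utilization": gpu_memory_utilization,
        "tensor_parallel_size": tensor_parallel_size,
        "max_num_seqs": max_num_seqs,
    }


# Leave roughly half of the GPU free for other processes (illustrative values).
engine_kwargs = build_engine_kwargs(
    "Qwen/Qwen2.5-1.5B-Instruct",
    gpu_memory_utilization=0.5,
    max_num_seqs=32,
)
print(engine_kwargs)

# On a machine with vLLM and a GPU, the dictionary can be passed to the engine:
#   from vllm import LLM
#   llm = LLM(**engine_kwargs)
```

Lowering `gpu_memory_utilization` shrinks the KV cache, so expect fewer concurrent sequences in exchange for the freed memory.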

In [1]:
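# This example comes from the official vLLM documentation
from vllm import LLM, SamplingParams

# Prompts to complete in a single batch
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling settings: temperature 0.8 with nucleus (top-p) sampling at 0.95
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model and start the inference engine
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
# Generate completions for all prompts in one call
outputs = llm.generate(prompts, sampling_params)

# Print each prompt next to its first generated completion
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")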



Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct


2024-12-23 21:41:02,605 - modelscope - INFO - Target directory already exists, skipping creation.




INFO 12-23 21:41:07 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, 



INFO 12-23 21:41:19 model_runner.py:1056] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.11s/it]



INFO 12-23 21:41:21 model_runner.py:1067] Loading model weights took 2.8875 GB
INFO 12-23 21:41:22 gpu_executor.py:122] # GPU blocks: 154829, # CPU blocks: 9362
INFO 12-23 21:41:22 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 75.60x
INFO 12-23 21:41:24 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-23 21:41:24 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-23 21:41:38 model_runner.py:1523] Graph capturing finished in 14 secs.


Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 29.34it/s, est. speed input: 161.46 toks/s, output: 469.66 toks/s]

Prompt: 'Hello, my name is', Generated text: ' Kofi. I am a very passionate and driven individual who has a strong desire'
Prompt: 'The president of the United States is', Generated text: ' the head of state and the head of government, and he or she is responsible'
Prompt: 'The capital of France is', Generated text: ' a city called Paris, and it is the capital of the Republic of France.'
Prompt: 'The future of AI is', Generated text: ' bright and growing in power, but more importantly, it’s growing in ethical considerations'
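The KV cache figures in the log above can be cross-checked with a little arithmetic. The sketch below assumes a block size of 16 tokens, which is my understanding of vLLM's default but is not printed in this log; the block count and sequence length are copied from the `gpu_executor.py` and `llm_engine.py` lines.

```python
BLOCK_SIZE_TOKENS = 16    # assumption: default block size, not shown in the log
num_gpu_blocks = 154829   # from "# GPU blocks: 154829"
max_seq_len = 32768       # from "max_seq_len=32768"

# Total number of tokens whose keys and values fit in the pre-allocated cache
kv_cache_tokens = num_gpu_blocks * BLOCK_SIZE_TOKENS

# How many full-length requests the cache could hold at the same time
max_concurrency = kv_cache_tokens / max_seq_len

print(f"KV cache capacity: {kv_cache_tokens} tokens")
print(f"Maximum concurrency: {max_concurrency:.2f}x")  # 75.60x
```

The result matches the `Maximum concurrency for 32768 tokens per request: 75.60x` line, which suggests the reported concurrency is simply cache capacity divided by the maximum request length.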





vLLM can also be deployed directly as a server, so that clients can access it through the OpenAI-compatible API:

In [None]:
# Run this in a terminal (without the leading "!"); in the notebook it blocks, and the client code below would never get to run
!vllm serve Qwen/Qwen2.5-1.5B-Instruct
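Because the server takes a while to load the model, client code may want to wait until it is ready. The helper below is a minimal sketch using only the standard library; `wait_for_server` is a hypothetical name, and it assumes the server answers `GET /health` with HTTP 200 once it is up, which to my understanding vLLM's OpenAI-compatible server does.

```python
import http.client
import time
import urllib.request


def wait_for_server(base_url, timeout_s=120.0, poll_interval_s=2.0):
    """Poll base_url + '/health' until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(base_url + "/health",
                                        timeout=poll_interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or a non-2xx answer: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Short timeout so the example returns quickly when no server is running.
    ready = wait_for_server("http://localhost:8000", timeout_s=1.0, poll_interval_s=0.5)
    print("server ready:", ready)
```

In practice a longer timeout is sensible, since model loading and CUDA graph capture took roughly half a minute in the run shown earlier.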

Next we call this server. Three examples follow:
1. Listing the available models
2. Calling it with curl
3. Calling it with the openai package

In [8]:
# List the models served by vLLM
!curl http://localhost:8000/v1/models

{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1734960925,"owned_by":"vllm","root":"Qwen/Qwen2.5-1.5B-Instruct","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-196cd2277ec042d0b5879a06c38a21c5","object":"model_permission","created":1734960925,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

In [9]:
# Client code using curl
!curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ \
    "model": "Qwen/Qwen2.5-1.5B-Instruct", \
    "messages": [ \
    {"role": "system", "content": "You are a helpful assistant."}, \
    {"role": "user", "content": "Who won the world series in 2020?"} \
    ] \
    }'

{"id":"chat-b4888f4f2c9c4ca0aab48a4bd53859ec","object":"chat.completion","created":1734961008,"model":"Qwen/Qwen2.5-1.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The New York Yankees won the World Series in 2020.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":47,"completion_tokens":16},"prompt_logprobs":null}
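The raw JSON above is easier to work with once parsed. The sketch below uses the response body exactly as printed by the curl call and pulls out the fields a client usually needs; the variable names are only illustrative.

```python
import json

# Response body copied verbatim from the curl call above
raw = r'''{"id":"chat-b4888f4f2c9c4ca0aab48a4bd53859ec","object":"chat.completion","created":1734961008,"model":"Qwen/Qwen2.5-1.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The New York Yankees won the World Series in 2020.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":47,"completion_tokens":16},"prompt_logprobs":null}'''

response = json.loads(raw)

first_choice = response["choices"][0]
reply = first_choice["message"]["content"]     # the assistant's text
finish_reason = first_choice["finish_reason"]  # "stop" means the model ended on its own
usage = response["usage"]                      # token accounting for the request

print("reply:", reply)
print("finish_reason:", finish_reason)
print("prompt + completion tokens:",
      usage["prompt_tokens"] + usage["completion_tokens"])
```

The `usage` block is handy for monitoring: prompt and completion tokens add up to `total_tokens`, here 31 + 16 = 47.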

In [12]:
# Calling the server from Python
!pip install openai -U
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

Looking in indexes: https://mirrors.aliyun.com/pypi/simple
Collecting openai
  Downloading https://mirrors.aliyun.com/pypi/packages/8e/5a/d22cd07f1a99b9e8b3c92ee0c1959188db4318828a3d88c9daac120bdd69/openai-1.58.1-py3-none-any.whl (454 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.57.0
    Uninstalling openai-1.57.0:
      Successfully uninstalled openai-1.57.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lmdeploy 0.6.2 requires peft<=0.11.1, but you have peft 0.12.0 which is incompatible.
Successfully installed openai-1.58.1