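## Slow Model Inference? Accelerating LLM Inference with vLLM and Quantization

In production deployments of large language models, inference efficiency is often the key bottleneck limiting system performance.

This lesson compares the technical characteristics of mainstream inference frameworks and walks through a hands-on case that combines vLLM with quantization, giving you an end-to-end approach to performance optimization.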

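## 1. Analyzing the Inference Efficiency Problem

### 1.1 Performance Bottlenecks in LLM Inference

The performance bottlenecks of LLM inference fall into three areas:

- Memory: model weights and the KV cache compete for limited GPU memory
- Compute: the matrix multiplications of every forward pass
- I/O: moving weights and cached data between memory and the compute units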
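
To make the memory bottleneck concrete, the sketch below estimates how much GPU memory the KV cache needs. The layer count, head count, and head dimension are assumptions that match a LLaMA-13B-style configuration with an FP16 cache; substitute the values of your own model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, dtype_bytes=2):
    """Bytes needed to cache keys and values for batch_size sequences of seq_len tokens."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len * batch_size


# Assumed LLaMA-13B-style configuration (illustrative values, FP16 cache).
per_token = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128, seq_len=1)
per_batch = kv_cache_bytes(40, 40, 128, seq_len=2048, batch_size=32)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for 32 sequences of 2048 tokens: {per_batch / 1024**3:.1f} GiB")
```

Under these assumptions the cache for a modest batch already exceeds the size of the FP16 weights, which is why KV cache management dominates serving efficiency.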

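### 1.2 Limitations of Traditional Approaches

HuggingFace Transformers:
- Designed for research and prototyping, with limited production optimization
- Static batching; the batch size cannot be adjusted dynamically
- Python interpreter overhead lowers execution efficiency

Native PyTorch inference:
- No dedicated inference optimizations; coarse memory management
- No support for advanced techniques such as KV cache reuse
- Limited concurrency, making high-throughput workloads hard to serve

### 1.3 Performance Requirements by Business Scenario

Different business scenarios place very different demands on inference performance:

| Scenario                 | Latency requirement | Throughput requirement | Concurrency   | Main challenge    |
|--------------------------|---------------------|------------------------|---------------|-------------------|
| Online chat              | <200 ms             | Medium                 | 100-1000      | Low latency       |
| Batch processing         | Seconds             | High                   | Large batches | High throughput   |
| Real-time recommendation | <50 ms              | High                   | 10000+        | Ultra-low latency |
| Content generation       | 1-5 s               | Low                    | 10-100        | Long text         |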

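## 2. Comparing Inference Frameworks

### 2.1 Architecture Comparison

**vLLM architecture**

vLLM is built around PagedAttention. Its core innovations are:

**PagedAttention**:
- Splits the KV cache into fixed-size blocks (16 tokens by default)
- Maintains a block table that maps logical addresses to physical addresses
- Supports non-contiguous memory allocation, which significantly improves memory utilization

**Continuous batching**:
- Adjusts the batch dynamically, so new requests can join at any time
- Processes sequences of different lengths in parallel
- Schedules and prioritizes at the level of individual requests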
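
The benefit of continuous batching over static batching can be illustrated with a toy simulation. This is a minimal sketch, not vLLM's scheduler: it assumes every request needs a known number of decoding steps, that each step advances all running requests by one token, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as a running request finishes."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots before every decoding step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Output lengths (in tokens) of eight requests, served with four slots.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With mixed output lengths, the short requests no longer wait for the longest request in their batch, so the same workload finishes in fewer decoding steps.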
  
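**Distributed inference support**:
- Tensor parallelism: model parameters are sharded across multiple GPUs
- Pipeline parallelism: computation is split into stages
- Data parallelism: multiple instances serve requests in parallel

**HuggingFace Transformers architecture**

**Design philosophy**:
- Aimed at research and rapid prototyping
- A unified model interface and a rich collection of pretrained models
- Native Python implementation that is easy to read and modify

**Technical characteristics**:
- Static batching with a fixed batch size
- Memory is pre-allocated, which leads to considerable fragmentation
- Supports multiple hardware backends (CPU/GPU/TPU)

**Ollama architecture**

**Design goals**:
- Simplify local deployment with one-step installation
- Run across platforms and lower the hardware barrier
- Make model management convenient, including hot-swapping models

**Implementation**:
- Built on the llama.cpp engine, written in C++
- Uses the GGUF model format, which supports efficient quantization
- Hybrid CPU/GPU inference that adapts to low-end hardware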
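### 2.2 Performance Comparison

Test results for LLaMA-13B on an A100-80G GPU:

| Framework   | Throughput (tokens/s) | Average latency (ms) | GPU memory (GB) | Concurrency |
|-------------|-----------------------|----------------------|-----------------|-------------|
| vLLM        | 4150                  | 95                   | 19.4            | 100+        |
| HuggingFace | 170                   | 2350                 | 45.2            | 10-20       |
| Ollama      | 50                    | 800                  | 12.8            | 1-5         |

Key observations:

1. Throughput: vLLM delivers about 24x the throughput of HuggingFace (4150 vs. 170 tokens/s) and 83x that of Ollama (4150 vs. 50 tokens/s).
2. Latency: vLLM cuts average latency by more than 95% relative to HuggingFace (95 ms vs. 2350 ms), which meets real-time interaction requirements.
3. Memory efficiency: vLLM reaches 96% GPU memory utilization, and its footprint in this test (19.4 GB) is about 2.3x smaller than HuggingFace's (45.2 GB).
4. Concurrency: vLLM handles 100+ concurrent requests, which suits high-load production environments.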
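
The ratios quoted for these results follow directly from the table. The snippet below is a small sanity check that recomputes them; the dictionary simply restates the table values, and the helper names are hypothetical.

```python
# Values copied from the comparison table above.
results = {
    "vLLM": {"throughput": 4150, "latency_ms": 95, "memory_gb": 19.4},
    "HuggingFace": {"throughput": 170, "latency_ms": 2350, "memory_gb": 45.2},
    "Ollama": {"throughput": 50, "latency_ms": 800, "memory_gb": 12.8},
}


def throughput_gain(baseline):
    """How many times more tokens per second vLLM produces than the baseline."""
    return results["vLLM"]["throughput"] / results[baseline]["throughput"]


def latency_reduction(baseline):
    """Fraction by which vLLM lowers the average latency of the baseline."""
    return 1 - results["vLLM"]["latency_ms"] / results[baseline]["latency_ms"]


def memory_ratio(baseline):
    """Baseline memory footprint divided by vLLM's footprint."""
    return results[baseline]["memory_gb"] / results["vLLM"]["memory_gb"]


for name in ("HuggingFace", "Ollama"):
    print(f"vs {name}: {throughput_gain(name):.1f}x throughput, "
          f"{latency_reduction(name):.1%} lower latency")
print(f"HuggingFace uses {memory_ratio('HuggingFace'):.1f}x the memory of vLLM")
```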

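### 2.3 Use Cases and Selection Guidance

**When to use vLLM**
- High-concurrency production services: online chat, API services
- Large-scale batch processing: content generation, data analysis
- Resource-constrained environments where hardware utilization must be maximized

Strengths: best performance, high memory efficiency, strong concurrency.
Weaknesses: relatively complex deployment and a steeper learning curve.

**When to use HuggingFace Transformers**
- Research and prototyping: model experiments, algorithm validation
- Small-scale inference: personal projects, proofs of concept
- Fine-tuning and training: inference that is part of a training workflow

Strengths: rich ecosystem, easy to use, thorough documentation.
Weaknesses: insufficient production performance, low memory efficiency.

**When to use Ollama**
- Individuals and small teams: local AI assistants, learning tools
- Edge deployment: Raspberry Pi, embedded devices
- Quick trials and testing: trying out models, validating features

Strengths: simple deployment, low hardware barrier, cross-platform support.
Weaknesses: limited performance, weak concurrency.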

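## 3. vLLM Core Techniques

### 3.1 How PagedAttention Works

PagedAttention is vLLM's core innovation. It borrows the idea of virtual memory management from operating systems: a conventional serving system reserves one contiguous KV cache region per request, sized for the maximum sequence length, which wastes memory through fragmentation and over-reservation.

**How PagedAttention solves this**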

```python
import torch


# Simplified illustration of PagedAttention block management (not vLLM's actual kernels)
class PagedAttention:
    def __init__(self, num_heads, head_dim, block_size=16, num_blocks=1024):
        self.block_size = block_size
        # Physical block pool; keys and values sit side by side in the last dimension
        self.physical_blocks = torch.zeros(
            num_blocks, num_heads, block_size, head_dim * 2
        )
        self.free_blocks = list(range(num_blocks))
        # Block table and cached token count per sequence
        self.block_tables = {}
        self.seq_lens = {}

    def allocate_sequence(self, seq_id):
        """Create an empty block table for a new sequence"""
        self.block_tables[seq_id] = []
        self.seq_lens[seq_id] = 0

    def append_tokens(self, seq_id, new_kv, num_tokens):
        """Append KV data; new_kv has shape [num_tokens, num_heads, head_dim * 2]"""
        block_table = self.block_tables[seq_id]

        for i in range(num_tokens):
            # Slot of the next token inside its block
            offset = self.seq_lens[seq_id] % self.block_size
            # Offset 0 means the last block is full (or no block exists yet)
            if offset == 0:
                block_table.append(self.free_blocks.pop())

            # Store KV data into the last block of the sequence
            self.physical_blocks[block_table[-1], :, offset, :] = new_kv[i]
            self.seq_lens[seq_id] += 1

    def attention_compute(self, seq_id, query):
        """Attention for one decoding step; query has shape [num_heads, head_dim]"""
        block_table = self.block_tables[seq_id]

        # Gather the (possibly non-contiguous) blocks along the token dimension
        full_kv = torch.cat([self.physical_blocks[b] for b in block_table], dim=1)
        # Drop the unused slots at the end of the last block
        full_kv = full_kv[:, : self.seq_lens[seq_id]]
        key, value = full_kv.chunk(2, dim=-1)

        # Scaled dot-product attention over the gathered keys and values
        scores = torch.einsum("hd,htd->ht", query, key) / key.shape[-1] ** 0.5
        return torch.einsum("ht,htd->hd", scores.softmax(dim=-1), value)
```

**Copy-on-Write mechanism**

```python
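class CopyOnWriteManager:
    def __init__(self, cache):
        # Reuse the block pool, free list, and block tables of a PagedAttention cache
        self.physical_blocks = cache.physical_blocks
        self.free_blocks = cache.free_blocks
        self.block_tables = cache.block_tables
        self.block_ref_counts = {}  # Reference count per block (a missing entry means 1)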
        
    def fork_sequence(self, parent_seq_id, child_seq_id):
        """Fork a child sequence from a parent sequence, sharing the KV cache"""
        parent_blocks = self.block_tables[parent_seq_id]
        child_blocks = []
        
        for block_id in parent_blocks:
            # Increase reference count
            self.block_ref_counts[block_id] = \
                self.block_ref_counts.get(block_id, 1) + 1
            child_blocks.append(block_id)
        
        self.block_tables[child_seq_id] = child_blocks
    
    def copy_on_write(self, seq_id, block_idx):
        """Copy-on-write: only copy a block when it is about to be modified"""
        block_table = self.block_tables[seq_id]
        old_block_id = block_table[block_idx]
        
        # Check whether a copy is required
        if self.block_ref_counts.get(old_block_id, 1) > 1:
            # Allocate a new block and copy data
            new_block_id = self.free_blocks.pop()
            self.physical_blocks[new_block_id] = \
                self.physical_blocks[old_block_id].clone()
            
            # Update reference counts
            self.block_ref_counts[old_block_id] -= 1
            self.block_ref_counts[new_block_id] = 1
            
            # Update the block table
            block_table[block_idx] = new_block_id
            return new_block_id
        
        return old_block_id
```



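### 3.2 Memory Management Optimizations

**Dynamic memory allocation**

vLLM's memory management strategy includes:

1. On-demand allocation: a new block is allocated only when it is needed
2. Immediate reclamation: blocks are released as soon as a sequence finishes
3. Fragmentation control: fixed-size blocks avoid external fragmentation, so waste is limited to the unfilled part of each sequence's last block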
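
The allocation and reclamation policy above can be sketched with a simple free list. This is an illustrative toy, not vLLM's block manager; the class and method names are hypothetical.

```python
class BlockAllocator:
    """Toy free-list allocator for fixed-size KV cache blocks (illustrative only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id):
        """Return all blocks of a finished sequence to the free list right away."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):  # 20 tokens need ceil(20 / 16) = 2 blocks
    allocator.append_token("req-1")
blocks_in_use = len(allocator.block_tables["req-1"])
allocator.free_sequence("req-1")
print(f"blocks used: {blocks_in_use}, free after release: {len(allocator.free_blocks)}")
```

Because blocks are interchangeable, a released block can be handed to any other sequence immediately, and at most `block_size - 1` token slots per sequence are left unused.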



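## 4. Quantization Techniques in Depth

### 4.1 Comparing Quantization Algorithms

**AWQ (Activation-aware Weight Quantization)**

AWQ identifies important weights by analyzing the activation distribution and is one of the most advanced weight quantization techniques available:

Core ideas:
1. Activation awareness: salient channels are chosen from the activation distribution rather than the weight distribution
2. Salient-channel protection: the 0.1%-1% of weight channels that matter most are protected; instead of keeping them in FP16, AWQ scales them up before quantization so that they lose less precision
3. Per-channel scaling: each channel gets its own scaling factor, and the inverse factor is folded into the activations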
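
Why scaling a salient channel helps can be seen in a tiny numeric example. This is a minimal sketch with made-up weights and activations, using plain 4-bit round-to-nearest quantization with a single scale per row; the function names are hypothetical.

```python
def quantize_row(weights, w_bit=4):
    """Symmetric round-to-nearest quantization with one step size for the whole row."""
    qmax = 2 ** (w_bit - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [max(-qmax - 1, min(qmax, round(w / step))) * step for w in weights]


def output_error(weights, activations, scales, w_bit=4):
    """Absolute error of the dot product after quantizing the scaled weights."""
    scaled = [w * s for w, s in zip(weights, scales)]
    # Dividing by the scale afterwards is equivalent to scaling the activation down.
    dequant = [q / s for q, s in zip(quantize_row(scaled, w_bit), scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(w * x for w, x in zip(dequant, activations))
    return abs(exact - approx)


# Made-up example: the last input channel carries a much larger activation.
weights = [0.9, 0.12, -0.33, 0.05]
activations = [1.0, 1.0, 1.0, 50.0]

err_plain = output_error(weights, activations, scales=[1, 1, 1, 1])
err_scaled = output_error(weights, activations, scales=[1, 1, 1, 2])
print(f"error without scaling: {err_plain:.3f}")
print(f"error with the salient channel scaled by 2: {err_scaled:.3f}")
```

Scaling the small weight of the salient channel moves it away from zero on the quantization grid, so its rounding error shrinks while the step size of the row stays the same.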

```python
import torch


class AWQQuantizer:
    """Simplified AWQ-style sketch; real AWQ searches the per-channel scales."""

    def __init__(self, w_bit=4, group_size=128):
        self.w_bit = w_bit
        self.group_size = group_size  # kept for reference; this sketch quantizes per output row

    def quantize_model(self, model, calibration_data):
        """Main AWQ quantization workflow"""
        # 1) Collect activation statistics
        activation_stats = self._collect_activation_stats(model, calibration_data)

        # 2) Compute per-layer importance scores
        importance_scores = self._calculate_importance_scores(activation_stats)

        # 3) Select weight channels to protect
        protected_channels = self._select_protected_channels(importance_scores)

        # 4) Apply quantization
        return self._apply_quantization(model, protected_channels)

    def _collect_activation_stats(self, model, calibration_data):
        """Collect per-input-channel activation magnitudes of every Linear layer"""
        stats = {}

        def hook_fn(name):
            def hook(module, inputs, output):
                # Flatten batch and sequence dims: [..., in_features] -> [N, in_features]
                x = inputs[0].detach().abs().reshape(-1, inputs[0].shape[-1])
                stats.setdefault(name, []).append({
                    "mean": x.mean(dim=0),
                    "max": x.max(dim=0)[0],
                })
            return hook

        # Register forward hooks on the layers that will be quantized
        hooks = [
            module.register_forward_hook(hook_fn(name))
            for name, module in model.named_modules()
            if isinstance(module, torch.nn.Linear)
        ]

        # Run calibration data
        model.eval()
        with torch.no_grad():
            for batch in calibration_data:
                model(batch)

        # Remove hooks
        for hook in hooks:
            hook.remove()

        return stats

    def _calculate_importance_scores(self, activation_stats):
        """Compute importance scores for weight channels"""
        importance_scores = {}

        for layer_name, stats_list in activation_stats.items():
            # Aggregate stats across multiple batches
            mean_acts = torch.stack([s["mean"] for s in stats_list]).mean(0)
            max_acts = torch.stack([s["max"] for s in stats_list]).max(0)[0]

            # Importance score (heuristic mix of mean and max magnitude)
            importance_scores[layer_name] = mean_acts * 0.7 + max_acts * 0.3

        return importance_scores

    def _select_protected_channels(self, importance_scores, protect_ratio=0.01):
        """Select weight channels to protect"""
        protected_channels = {}

        for layer_name, scores in importance_scores.items():
            # Protect the top 1% channels
            num_protect = max(1, int(len(scores) * protect_ratio))
            _, top_indices = torch.topk(scores, num_protect)
            protected_channels[layer_name] = top_indices

        return protected_channels

    def _apply_quantization(self, model, protected_channels, scale=2.0):
        """Fake-quantize Linear weights after scaling up the protected input channels"""
        qmax = 2 ** (self.w_bit - 1) - 1
        for name, module in model.named_modules():
            if name not in protected_channels:
                continue
            W = module.weight.data.clone()  # [out_features, in_features]
            s = torch.ones(W.shape[1], device=W.device, dtype=W.dtype)
            s[protected_channels[name]] = scale  # fixed scale stands in for AWQ's search
            W = W * s
            step = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
            W = (W / step).round().clamp(-qmax - 1, qmax) * step
            module.weight.data = W / s  # fold the scale back into the weights
        return model
```


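**GPTQ (post-training quantization for GPT-style models)**

GPTQ is a post-training quantization method that uses approximate second-order (Hessian) information rather than gradients. It quantizes weights column by column and adjusts the remaining weights to compensate for each quantization error: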

```python
import torch


class GPTQQuantizer:
    def __init__(self, w_bit=4, group_size=128, damp_percent=0.01):
        self.w_bit = w_bit
        self.group_size = group_size
        self.damp_percent = damp_percent
        
    def quantize_layer(self, layer, input_data):
        """Quantize a single linear layer"""
        W = layer.weight.data.clone()
        H = self._compute_hessian(layer, input_data)
        
        # Add a damping term for numerical stability
        damp = self.damp_percent * torch.mean(torch.diag(H))
        diag = torch.arange(H.shape[0], device=H.device)
        H[diag, diag] += damp
        
        # Cholesky-based factorization to obtain an upper-triangular factor
        H = torch.linalg.cholesky(H)
        H = torch.cholesky_inverse(H)
        H = torch.linalg.cholesky(H, upper=True)
        Hinv = H
        
        # Quantize column-by-column
        for i in range(W.shape[1]):
            # Quantize the current column
            w_col = W[:, i]
            q_col = self._quantize_column(w_col)
            
            # Quantization error, normalized by the diagonal of the inverse-Hessian factor
            error = (w_col - q_col) / Hinv[i, i]
            
            # Update subsequent weights to compensate for the error
            W[:, i:] -= (error.unsqueeze(1) * Hinv[i, i:].unsqueeze(0))
            
            # Write back the quantized column
            W[:, i] = q_col
        
        layer.weight.data = W
        return layer
    
    def _compute_hessian(self, layer, input_data):
        """Compute the Hessian matrix (approximation)"""
        # Use input data to accumulate second-order information
        H = torch.zeros(
            (layer.in_features, layer.in_features),
            device=layer.weight.device
        )
        
        for batch in input_data:
            # Outer product of inputs
            inp = batch.view(-1, layer.in_features)
            H += inp.t() @ inp
        
        return H / len(input_data)
    
    def _quantize_column(self, w_col):
        """Quantize a single weight column"""
        # Compute the quantization range
        w_min = w_col.min()
        w_max = w_col.max()
        
        # Symmetric quantization
        scale = max(abs(w_min), abs(w_max)) / (2 ** (self.w_bit - 1) - 1)
        
        # Quantize and dequantize
        q_col = torch.round(w_col / scale).clamp(
            -(2 ** (self.w_bit - 1)), 2 ** (self.w_bit - 1) - 1
        )
        q_col = q_col * scale
        
        return q_col
```

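**FP8 quantization**

FP8 represents values as 8-bit floating-point numbers. Compared with INT8 it offers a wider dynamic range, which helps preserve numerical accuracy: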

```python
import torch


class FP8Quantizer:
    def __init__(self, format="E4M3"):  # E4M3 or E5M2
        self.format = format
        if format == "E4M3":
            self.exp_bits = 4
            self.mantissa_bits = 3
        else:  # E5M2
            self.exp_bits = 5
            self.mantissa_bits = 2
            
    def quantize_tensor(self, tensor):
        """Quantize an FP32 tensor to FP8"""
        # Compute the quantization range
        if self.format == "E4M3":
            max_val = 448.0  # Maximum value for the E4M3 format
        else:
            max_val = 57344.0  # Maximum value for the E5M2 format
        
        # Scale into the FP8 range
        scale = max_val / tensor.abs().max()
        scaled_tensor = tensor * scale
        
        # Simulate the FP8 quantization process
        quantized = self._simulate_fp8_quantization(scaled_tensor)
        
        return quantized / scale, scale
    
    def _simulate_fp8_quantization(self, tensor):
        """Simulate FP8 quantization"""
        # Extract the sign bit
        sign = torch.sign(tensor)
        abs_tensor = torch.abs(tensor)
        
        # Convert into an FP8-like representation
        if self.format == "E4M3":
            # E4M3: 1 sign bit + 4 exponent bits + 3 mantissa bits
            quantized = self._quantize_e4m3(abs_tensor)
        else:
            # E5M2: 1 sign bit + 5 exponent bits + 2 mantissa bits
            quantized = self._quantize_e5m2(abs_tensor)
        
        return sign * quantized
    
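    def _quantize_e4m3(self, tensor):
        """Quantize using the E4M3 format (normal exponents -6..8, 3 mantissa bits)"""
        return self._quantize_minifloat(tensor, min_exp=-6, max_exp=8, mantissa_bits=3)

    def _quantize_e5m2(self, tensor):
        """Quantize using the E5M2 format (normal exponents -14..15, 2 mantissa bits)"""
        return self._quantize_minifloat(tensor, min_exp=-14, max_exp=15, mantissa_bits=2)

    def _quantize_minifloat(self, tensor, min_exp, max_exp, mantissa_bits):
        """Round non-negative values to the nearest value of a simulated minifloat"""
        # Exponent of each value, clamped to the normal range of the format
        exponent = torch.floor(torch.log2(tensor + 1e-8)).clamp(min_exp, max_exp)
        mantissa_scale = 2 ** exponent

        # Fractional part of the significand (negative below the normal range)
        mantissa = tensor / mantissa_scale - 1.0

        # Quantize the mantissa to `mantissa_bits` bits of precision
        levels = 2 ** mantissa_bits
        mantissa_quantized = torch.round(mantissa * levels) / levels

        # Reconstruct the quantized value
        return (1.0 + mantissa_quantized) * mantissa_scale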
```
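
The maximum values 448 and 57344 used in the quantizer follow from the bit layouts of the two formats. The sketch below derives them, assuming the common convention in which E5M2 reserves the all-ones exponent for infinity and NaN, while E4M3 reserves only the single bit pattern with all-ones exponent and all-ones mantissa for NaN. The helper name `fp8_max` is hypothetical.

```python
def fp8_max(exp_bits, mantissa_bits, reserves_top_exponent):
    """Largest finite value of an 8-bit float with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if reserves_top_exponent:
        # E5M2 style: the all-ones exponent encodes inf/NaN, so stop one below it.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** mantissa_bits - 1
    else:
        # E4M3 style: only exponent=all-ones with mantissa=all-ones is NaN.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** mantissa_bits - 2
    significand = 1 + max_mantissa / 2 ** mantissa_bits
    return significand * 2.0 ** (max_exp_field - bias)


e4m3_max = fp8_max(exp_bits=4, mantissa_bits=3, reserves_top_exponent=False)
e5m2_max = fp8_max(exp_bits=5, mantissa_bits=2, reserves_top_exponent=True)
print(f"E4M3 max: {e4m3_max}, E5M2 max: {e5m2_max}")
```

E4M3 trades range for precision (one more mantissa bit), which is why it is usually preferred for weights and activations, while E5M2 covers a much wider range.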

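### 4.2 Quantization Performance Comparison

Quantization results for the Qwen3-8B model:

| Method    | Model size | Memory usage | Inference speed | MMLU score | Accuracy loss |
|-----------|------------|--------------|-----------------|------------|---------------|
| FP32      | 32 GB      | 32 GB        | Baseline        | 74.7       | 0%            |
| FP16      | 16 GB      | 16 GB        | 1.8x            | 74.5       | 0.3%          |
| AWQ-4bit  | 4.5 GB     | 6 GB         | 3.2x            | 72.1       | 3.5%          |
| GPTQ-4bit | 4.5 GB     | 6 GB         | 2.8x            | 71.3       | 4.6%          |
| FP8       | 8 GB       | 8 GB         | 2.1x            | 74.2       | 0.7%          |

Recommendations:
- Maximum performance: AWQ-4bit gives the largest speedup at an acceptable accuracy loss
- Balance of performance and accuracy: FP8 has a small accuracy loss and a moderate speedup
- Compatibility first: GPTQ-4bit has the broadest ecosystem support and the most stable deployment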
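
The model sizes in the table can be approximated from the parameter count and the bits stored per weight. The sketch below is a rough estimate under stated assumptions: 8 billion parameters, 1 GB = 10^9 bytes, and for the 4-bit case one FP16 scale per group of 128 weights. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (10^9 bytes)."""
    bits = bits_per_weight
    if group_size:
        # Group-wise quantization stores one scale per group of weights.
        bits += scale_bits / group_size
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 8e9  # assumed parameter count for an 8B model

sizes = {
    "FP32": weight_memory_gb(NUM_PARAMS, 32),
    "FP16": weight_memory_gb(NUM_PARAMS, 16),
    "FP8": weight_memory_gb(NUM_PARAMS, 8),
    "4-bit (group size 128)": weight_memory_gb(NUM_PARAMS, 4, group_size=128),
}
for name, size in sizes.items():
    print(f"{name}: {size:.2f} GB")
```

The 4-bit estimate comes out slightly below the 4.5 GB in the table; a plausible reason is that quantized checkpoints usually keep some tensors, such as embeddings, in higher precision.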


  
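## 5. Deploying Qwen3-8B in Practice

### 5.1 Environment Setup and Dependencies

System requirements

- Operating system: Ubuntu 20.04+ / CentOS 8+
- Python: 3.9-3.11
- CUDA: 12.1+
- GPU: RTX 3060 12GB or better (RTX 4090/A100 recommended); a 12 GB card only fits a 4-bit quantized model, since the FP16 weights alone take about 16 GB
- System memory: 32 GB or more

Environment setup script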

```bash
#!/bin/bash
# qwen3_vllm_setup.sh - Setup script for Qwen3-8B + vLLM deployment environment

set -e

echo "=== Setting up Qwen3-8B vLLM Deployment Environment ==="

# 1. Check system prerequisites
echo "Checking system prerequisites..."
if ! command -v nvidia-smi &> /dev/null; then
    echo "ERROR: NVIDIA GPU driver not detected"
    exit 1
fi

if ! command -v nvcc &> /dev/null; then
    echo "ERROR: CUDA toolkit (nvcc) not detected"
    exit 1
fi

# Check CUDA version
CUDA_VERSION=$(nvcc --version | grep "release" | awk '{print $6}' | cut -c2-)
echo "Detected CUDA version: $CUDA_VERSION"

# 2. Create Python environment
echo "Creating Python virtual environment..."
if ! command -v conda &> /dev/null; then
    echo "Installing Miniconda..."
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
    source "$HOME/miniconda3/bin/activate"
    conda init bash
fi

# Load conda's shell functions so that `conda activate` works inside a script
source "$(conda info --base)/etc/profile.d/conda.sh"

conda create -n qwen3-vllm python=3.11 -y
conda activate qwen3-vllm

# 3. Install PyTorch
echo "Installing PyTorch..."
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install vLLM (it pins a compatible PyTorch build and may replace the one above)
echo "Installing vLLM..."
pip install vllm

# 5. Install additional dependencies
echo "Installing additional dependencies..."
pip install transformers accelerate bitsandbytes
pip install fastapi uvicorn
pip install numpy pandas matplotlib seaborn
pip install psutil GPUtil

# 6. Verify installation
echo "Verifying installation..."
python -c "
import torch
import vllm
import transformers

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Number of GPUs: {torch.cuda.device_count()}')
print(f'vLLM version: {vllm.__version__}')
print(f'Transformers version: {transformers.__version__}')

if torch.cuda.is_available():
    print(f'GPU model: {torch.cuda.get_device_name(0)}')
    print(f'GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')
"
echo "Setup complete!"
echo "Activate the environment with: conda activate qwen3-vllm"
```


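**Deploying and running the model**

Start an OpenAI-compatible server:

```bash
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B --dtype auto --max-model-len 16384 --api-key token-123
```

To have the server return Qwen3's reasoning output separately, start it with a reasoning parser instead:

```bash
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```

If the model has not been downloaded yet, the download starts automatically.

If you run into network timeouts or download failures, fetch the model from the ModelScope community instead.

By default, vLLM downloads models from Hugging Face. To switch to ModelScope, set the environment variable:

```bash
export VLLM_USE_MODELSCOPE=True
```

You can then call vLLM the same way you would call the OpenAI API: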
```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-123",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "你是谁？"}
    ],
)

print(completion.choices[0].message)
```
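
Qwen3's thinking mode is controlled through the chat template rather than through a server flag. The sketch below builds the request arguments for one call with thinking switched off. It assumes that the server forwards `chat_template_kwargs` from the request body to the chat template and that the template understands `enable_thinking`; check both against the vLLM and Qwen3 versions you run. The helper `build_chat_request` is hypothetical.

```python
def build_chat_request(prompt, enable_thinking=True, model="Qwen/Qwen3-8B"):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # extra_body is passed through by the OpenAI client to the server as-is.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": enable_thinking}
        },
    }


request = build_chat_request("Who are you?", enable_thinking=False)
print(request)

# With a running server and the `client` object created earlier:
# completion = client.chat.completions.create(**request)
# print(completion.choices[0].message.content)
```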

### 5.2 Baseline Deployment and Performance Benchmark

**Basic deployment code**

```python
# qwen3_basic_deployment.py - Basic deployment for Qwen3-8B
import time
import os
import json
from typing import List, Dict, Any

import torch
import psutil
import GPUtil
from vllm import LLM, SamplingParams


class Qwen3BasicDeployment:
    def __init__(self, model_path: str, gpu_memory_utilization: float = 0.8):
        self.model_path = model_path
        self.gpu_memory_utilization = gpu_memory_utilization
        self.llm = None
        self.performance_metrics: Dict[str, Any] = {}

    def initialize_model(self):
        """Initialize the Qwen3-8B model"""
        print("Initializing Qwen3-8B model...")

        start_time = time.time()

        # Note: thinking mode is a chat-template option, not an LLM() argument,
        # so it is not configured here.
        self.llm = LLM(
            model=self.model_path,
            tensor_parallel_size=1,  # Single-GPU deployment
            gpu_memory_utilization=self.gpu_memory_utilization,
            max_model_len=32768,  # Qwen3 supports 32K context
            trust_remote_code=True,
            enforce_eager=False,  # Use CUDA Graph optimization
            swap_space=4,  # 4GB swap space
        )

        init_time = time.time() - start_time
        print(f"Model initialization complete. Time taken: {init_time:.2f}s")

        # Record initialization metrics
        self.performance_metrics["init_time"] = init_time
        self.performance_metrics["model_size_gb"] = self._get_model_size_gb()
        self.performance_metrics["gpu_memory"] = self._get_gpu_memory_usage()

    def _get_model_size_gb(self) -> float:
        """Get model size (GB)"""
        total_size = 0
        for root, _, files in os.walk(self.model_path):
            for file in files:
                file_path = os.path.join(root, file)
                total_size += os.path.getsize(file_path)
        return total_size / (1024**3)

    def _get_gpu_memory_usage(self) -> Dict[str, float]:
        """Get GPU memory usage"""
        if torch.cuda.is_available() and GPUtil.getGPUs():
            gpu = GPUtil.getGPUs()[0]
            return {
                "used_gb": gpu.memoryUsed / 1024,
                "total_gb": gpu.memoryTotal / 1024,
                "utilization": gpu.memoryUsed / gpu.memoryTotal if gpu.memoryTotal else 0.0,
            }
        return {}

    def run_performance_benchmark(self, test_cases: List[str], num_runs: int = 3):
        """Run performance benchmark"""
        if self.llm is None:
            raise RuntimeError("Model is not initialized. Call initialize_model() first.")

        print(f"Running performance benchmark ({num_runs} runs)...")

        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=512,
            stop=["<|im_end|>"],
        )

        all_results = []

        tokenizer = self.llm.get_tokenizer()

        for run in range(num_runs):
            print(f"Run {run + 1}...")

            # Warm-up
            if run == 0:
                print("Warming up the model...")
                _ = self.llm.generate(test_cases[:2], sampling_params)
                if torch.cuda.is_available():
                    torch.cuda.synchronize()

            # Record start state
            start_memory = self._get_gpu_memory_usage()
            start_time = time.time()

            # Run inference
            outputs = self.llm.generate(test_cases, sampling_params)

            # Record end state
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            end_time = time.time()
            end_memory = self._get_gpu_memory_usage()

            # Compute metrics
            total_time = end_time - start_time
            total_input_tokens = sum(len(tokenizer.encode(prompt)) for prompt in test_cases)
            total_output_tokens = sum(len(out.outputs[0].token_ids) for out in outputs)

            run_results = {
                "run": run + 1,
                "total_time": total_time,
                "avg_latency": total_time / len(test_cases),
                "throughput_tokens_per_sec": total_output_tokens / total_time if total_time > 0 else 0.0,
                "throughput_requests_per_sec": len(test_cases) / total_time if total_time > 0 else 0.0,
                "total_input_tokens": total_input_tokens,
                "total_output_tokens": total_output_tokens,
                "memory_usage": end_memory,
                "memory_increase_gb": (
                    end_memory.get("used_gb", 0.0) - start_memory.get("used_gb", 0.0)
                ),
            }

            all_results.append(run_results)
            print(f"  Latency:    {run_results['avg_latency']:.3f}s")
            print(f"  Throughput: {run_results['throughput_tokens_per_sec']:.1f} tokens/s")

        # Aggregate metrics
        avg_results = self._calculate_average_metrics(all_results)
        self.performance_metrics["benchmark"] = avg_results
        self.performance_metrics["benchmark_runs"] = all_results

        return avg_results

    def _calculate_average_metrics(self, results: List[Dict[str, Any]]) -> Dict[str, float]:
        """Compute average performance metrics"""
        metrics = [
            "total_time",
            "avg_latency",
            "throughput_tokens_per_sec",
            "throughput_requests_per_sec",
        ]

        avg_results: Dict[str, float] = {}
        for metric in metrics:
            values = [r[metric] for r in results]
            mean = sum(values) / len(values) if values else 0.0
            var = sum((x - mean) ** 2 for x in values) / len(values) if values else 0.0
            avg_results[f"avg_{metric}"] = mean
            avg_results[f"std_{metric}"] = var ** 0.5

        return avg_results

    def generate_text(self, prompt: str, max_tokens: int = 512) -> str:
        """Generate text"""
        if self.llm is None:
            raise RuntimeError("Model is not initialized. Call initialize_model() first.")

        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens,
            stop=["<|im_end|>"],
        )

        outputs = self.llm.generate([prompt], sampling_params)
        return outputs[0].outputs[0].text

    def save_metrics(self, filepath: str):
        """Save performance metrics to a file"""
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(self.performance_metrics, f, indent=2, ensure_ascii=False)
        print(f"Performance metrics saved to: {filepath}")


# Test cases
TEST_CASES = [
    "Please explain the principle of attention mechanisms in deep learning in detail.",
    "How would you design a distributed system architecture for high concurrency?",
    "Analyze current trends and challenges in AI technology.",
    "Write a Python function to implement the quicksort algorithm.",
    "Explain core concepts and application scenarios of blockchain technology.",
    "How can you optimize database query performance? Provide specific methods.",
    "Analyze the differences and connections between cloud computing and edge computing.",
    "Introduce the overfitting problem in machine learning and how to address it.",
]


def main():
    """Main entry point"""
    model_path = "./models/qwen3-8b-original"

    # Check model path
    if not os.path.exists(model_path):
        print(f"ERROR: Model path does not exist: {model_path}")
        print("Please run download_qwen3.py to download the model first.")
        return

    # Create deployment instance
    deployment = Qwen3BasicDeployment(model_path)

    # Initialize model
    deployment.initialize_model()

    # Run benchmark
    benchmark_results = deployment.run_performance_benchmark(TEST_CASES)

    # Print results
    print("\n=== Performance Benchmark Results ===")
    print(f"Average latency:    {benchmark_results['avg_avg_latency']:.3f}s")
    print(f"Average throughput: {benchmark_results['avg_throughput_tokens_per_sec']:.1f} tokens/s")

    gpu_mem = deployment.performance_metrics.get("gpu_memory", {})
    if gpu_mem:
        print(f"GPU memory used:    {gpu_mem.get('used_gb', 0.0):.1f} GB")
        print(f"Memory utilization: {gpu_mem.get('utilization', 0.0):.1%}")

    # Save metrics
    deployment.save_metrics("qwen3_original_metrics.json")

    # Interactive loop
    print("\n=== Interactive Test ===")
    while True:
        user_input = input("\nEnter a question (type 'quit' to exit): ")
        if user_input.lower() == "quit":
            break

        start_time = time.time()
        response = deployment.generate_text(user_input)
        end_time = time.time()

        print(f"\nResponse: {response}")
        print(f"Response time: {end_time - start_time:.2f}s")


if __name__ == "__main__":
    main()
```

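### 5.3 Quantized Deployment and Performance Comparison

**Quantized deployment code**

With `quantization="fp8"`, vLLM converts the weights to FP8 while loading the model. FP8 is nearly lossless, fast, and saves a noticeable amount of GPU memory.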

```python
# Add the following when initializing the model:
self.llm = LLM(
    model=self.model_path,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=32768,
    trust_remote_code=True,
    enforce_eager=False,
    swap_space=4,

    quantization="fp8",   # Enable on-the-fly FP8 weight quantization
    dtype=torch.float16,  # Compute dtype for non-quantized ops; "auto" also works
)
```

**Using AWQ 4-bit quantization**

To save even more GPU memory, use an AWQ-quantized checkpoint, for example Qwen3-8B-AWQ downloaded from Hugging Face.

Then point `model_path` at it and add `quantization="awq"`:

```python
self.llm = LLM(
    model="./models/qwen3-8b-awq",   # Path to the AWQ checkpoint
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,     # Can be raised moderately
    max_model_len=16384,             # Can be increased if GPU memory allows
    trust_remote_code=True,
    enforce_eager=False,
    swap_space=2,

    # Enable AWQ quantization
    quantization="awq",
    dtype="auto",
)
```

**If you want to try GPTQ (the setup is similar)**
```python
self.llm = LLM(
    model="./models/qwen3-8b-gptq",      # Must be a checkpoint already quantized with GPTQ
    quantization="gptq",
    dtype="auto",
    ...
)
```

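### Common Problems and Recommendations

| Problem | Cause | Recommended fix |
|---------|-------|-----------------|
| `ValueError: FP8 is not supported on this device` | The GPU lacks native FP8 compute, which requires compute capability 8.9 or higher (Ada Lovelace or Hopper) | - Use a GPU such as L4, RTX 4090, or H100<br>- On Ampere GPUs (A10, A100), recent vLLM versions can run FP8 as weight-only quantization, which saves memory but does not speed up compute<br>- Otherwise switch to AWQ/GPTQ quantization |
| `CUDA out of memory` | Not enough GPU memory for the KV cache, especially with long contexts | - Enable `FP8` or `AWQ` quantization<br>- Lower `max_model_len` (for example to `8192` or `16384`)<br>- Raise `gpu_memory_utilization` to `0.95` |
| Quantization has no effect / no speedup | The `quantization` parameter is misconfigured or does not match the model | - For FP8, make sure `quantization="fp8"` is set<br>- AWQ/GPTQ require **a checkpoint that has already been quantized**; they cannot be enabled on the original model |
| Startup error: `Unknown quantization method` | The vLLM version is too old to support the method | Upgrade vLLM to the latest version:<br>`pip install -U vllm` |
| Slow inference | CUDA Graph is disabled or hardware utilization is low | - Set `enforce_eager=False` (enables CUDA Graph)<br>- Use a higher-throughput quantization format (FP8 can be up to about 2x faster) |
| Crashes in multi-turn conversations | The context grows too long and GPU memory overflows | - Limit the input length<br>- Manage history with a sliding window or summarization |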
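
Whether a GPU can run FP8 depends on its CUDA compute capability. The helper below is a minimal sketch of that check, assuming the thresholds that vLLM documents for FP8: native compute from capability 8.9 (Ada Lovelace, Hopper) and weight-only execution from 8.0 (Ampere). The function name `fp8_support` is hypothetical.

```python
def fp8_support(major, minor):
    """Classify FP8 support from a CUDA compute capability such as (8, 9)."""
    capability = (major, minor)
    if capability >= (8, 9):
        return "native"       # Ada Lovelace / Hopper: FP8 tensor cores
    if capability >= (8, 0):
        return "weight-only"  # Ampere: FP8 weights, 16-bit compute
    return "unsupported"


# Capabilities of a few common GPUs.
for name, capability in {
    "T4": (7, 5),
    "A100": (8, 0),
    "A10": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}.items():
    print(f"{name}: {fp8_support(*capability)}")

# On a machine with PyTorch and a GPU, query the capability directly:
# import torch
# print(fp8_support(*torch.cuda.get_device_capability(0)))
```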

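### Tips
- **FP8**: best on GPUs that support it; nearly lossless, so try it first.
- **AWQ/GPTQ**: the most memory-efficient options, suited to running large models on GPUs with 16 GB or less.
- **Do not mix quantization methods**: the `quantization` parameter accepts only one method.
- Check the logs to confirm that quantization was applied: search for keywords such as `Using FP8` or `Quantization: awq` (the exact wording varies across vLLM versions).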




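## Summary

Performance gains:
- About 24x higher throughput than HuggingFace Transformers
- More than 95% lower latency, which meets real-time interaction requirements
- GPU memory utilization raised to 96%, which supports higher concurrency

Cost savings:
- AWQ quantization can cut hardware cost by more than 60%
- Lower operating costs, with a return on investment above 50%
- Enterprise-grade services can run on consumer GPUs

Selection guidance:
- Small teams: RTX 4090 + AWQ quantization for the best cost efficiency
- Mid-sized companies: A100 + GPTQ quantization to balance performance and cost
- Large-scale deployments: A100 clusters + FP8 quantization for maximum performance (native FP8 compute requires Hopper or Ada Lovelace GPUs; on A100, FP8 runs as weight-only quantization)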