## Costs Too High? A Hot/Cold Tiered Model Deployment Strategy Based on Request Heat

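Conventional AI deployments either load every model up front (and costs explode) or load models on demand (and responses are too slow). Neither balances performance against cost.

- **Full loading**: 10 models stay resident at once and consume 500 GB of GPU memory, most of which sits idle
- **On-demand loading**: every request reloads the model, and a 30-60 second cold start is unacceptable to users
- **Static allocation**: real load cannot be predicted, so you either over-provision or fall short on performance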

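## 1. The Resource Dilemma of Multi-Model Deployment

### 1.1 The Real Cost of GPU Memory

- High hardware cost: every resident model occupies GPU memory whether or not it is serving requests
- Wasted resources: most of that memory does no useful work most of the time

In a full-loading deployment, more than 60% of GPU memory sits idle, which is a large waste of money.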


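### 1.2 The Latency Trap of On-Demand Loading

To control cost, many teams choose on-demand loading, but this approach has a serious latency problem.

Breakdown of model loading time:

- 70B-parameter model: loading from storage into GPU memory takes 45-90 seconds
- Network transfer: loading from remote storage adds another 10-30 seconds
- Model initialization: warming up the inference engine takes 5-15 seconds
- Total latency: a full cold load takes 60-135 seconds

During these one to two minutes of waiting, every user request times out and the service is effectively down. That is not acceptable in production.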
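
As a quick sanity check, the total above is just the sum of the per-stage ranges. The sketch below uses only the figures quoted in the list; the stage labels are paraphrases, not measured categories.

```python
# Cold-load latency budget in seconds, as (best case, worst case).
# The figures are the ones quoted in the breakdown above.
stages = {
    "load 70B weights from storage to GPU": (45, 90),
    "fetch from remote storage": (10, 30),
    "inference engine warm-up": (5, 15),
}

best = sum(low for low, _ in stages.values())
worst = sum(high for _, high in stages.values())

print(f"cold load: {best}-{worst} s")
```

The storage-to-GPU transfer dominates both ends of the range, so it is the stage worth avoiding for frequently used models.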


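The root problem with the conventional approaches is the lack of a **differentiated resource scheduling strategy**. In practice:

- **Request heat is unevenly distributed**: it follows the 80/20 rule, with a few models serving most requests
- **Usage changes over time**: model usage patterns differ noticeably between time periods
- **User behavior is predictable**: VIP users and core business workloads have relatively stable model preferences

We therefore need a smart model-switching mechanism that provides:
1. Resident hot models: frequently used models stay in GPU memory
2. On-demand cold models: rarely used models are loaded dynamically and can tolerate moderate latency
3. Prediction: model usage trends are forecast from historical data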
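
One possible way to combine the three points is to forecast each model's load from recent request counts and keep only the top few resident. The sketch below is illustrative, not the method used later in this article: the exponentially weighted moving average, the `alpha` value, the function names, and the sample counts are all assumptions.

```python
from typing import Dict, List


def ewma_forecast(counts: List[int], alpha: float = 0.5) -> float:
    """Exponentially weighted moving average over per-window request counts.

    counts is ordered oldest to newest; a larger alpha weights recent windows more.
    """
    if not counts:
        return 0.0
    forecast = float(counts[0])
    for count in counts[1:]:
        forecast = alpha * count + (1 - alpha) * forecast
    return forecast


def assign_tiers(history: Dict[str, List[int]], max_hot: int) -> Dict[str, str]:
    """Mark the max_hot models with the highest forecast as hot, the rest as cold."""
    ranked = sorted(history, key=lambda m: ewma_forecast(history[m]), reverse=True)
    return {m: ("hot" if i < max_hot else "cold") for i, m in enumerate(ranked)}


# Hypothetical request counts per time window, oldest first.
history = {
    "llama2": [120, 110, 130],
    "mistral": [5, 40, 90],
    "codellama": [60, 20, 4],
}
tiers = assign_tiers(history, max_hot=2)
print(tiers)
```

With these made-up counts, the rising model (`mistral`) overtakes the fading one (`codellama`) even though their raw totals are close, which is the point of weighting recent windows more heavily.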


## 2. A Model-Switching Strategy Based on Request Heat

### 2.1 Quantifying Request Heat

We need a practical way to score how hot each model is:

```python
import threading
import time
from collections import defaultdict, deque


class ModelHeatTracker:
    def __init__(self, window_minutes=15):
        # Request timestamps per model, oldest first
        self.request_history = defaultdict(deque)

        # Aggregate statistics per model
        self.model_stats = defaultdict(lambda: {
            'total_requests': 0,
            'heat_score': 0.0
        })

        self.window_seconds = window_minutes * 60

        # record_request and get_hot_models may run on different threads
        self._lock = threading.Lock()

    def _refresh_heat(self, model_id: str, now: float):
        """Drop timestamps outside the window and recompute the heat score."""
        history = self.request_history[model_id]
        cutoff = now - self.window_seconds
        while history and history[0] < cutoff:
            history.popleft()

        # Heat score is the number of requests inside the time window
        self.model_stats[model_id]['heat_score'] = len(history)

    def record_request(self, model_id: str):
        """Record a request for a model and update its heat score."""
        now = time.time()
        with self._lock:
            self.request_history[model_id].append(now)
            self.model_stats[model_id]['total_requests'] += 1
            self._refresh_heat(model_id, now)

    def get_hot_models(self, top_n=2) -> list:
        """Return the top-N hottest models based on recent request volume."""
        now = time.time()
        with self._lock:
            # Refresh every model first; otherwise a model that stopped
            # receiving requests would keep its last (stale) score forever.
            for model_id in list(self.model_stats):
                self._refresh_heat(model_id, now)

            # Models with no requests in the window are never considered hot
            models = [
                (model_id, stats['heat_score'])
                for model_id, stats in self.model_stats.items()
                if stats['heat_score'] > 0
            ]

        models.sort(key=lambda x: x[1], reverse=True)
        return [model_id for model_id, _ in models[:top_n]]
```


### 2.2 Ollama Proxy Scheduler

Building on the heat score, we implement a model switcher that sits in front of Ollama as a proxy.

```python
import threading
import time
from queue import Queue

import requests

# Assumes ModelHeatTracker from section 2.1 is defined in the same module.


class OllamaProxy:
    def __init__(self, ollama_url="http://localhost:11434", max_hot_models=3):
        self.ollama_url = ollama_url
        self.heat_tracker = ModelHeatTracker()
        self.loaded_models = set()
        self.model_queue = Queue()
        self.max_hot_models = max_hot_models

        self._start_loader_thread()

        # Start periodic optimization task
        threading.Thread(target=self._optimization_loop, daemon=True).start()

    def _start_loader_thread(self):
        """Background thread that manages model loading."""
        def load_models():
            while True:
                model_id = self.model_queue.get()
                if model_id not in self.loaded_models:
                    self._load_model(model_id)
                self.model_queue.task_done()

        thread = threading.Thread(target=load_models, daemon=True)
        thread.start()

    def _load_model(self, model_id: str):
        """Load a model (runs on the loader thread, not the caller's)."""
        print(f"Loading model: {model_id}")
        start = time.time()

        try:
            # Call Ollama API to warm up / load the model.
            # keep_alive=-1 asks Ollama to keep the model resident; without it
            # Ollama unloads the model after its own idle timeout.
            response = requests.post(
                f"{self.ollama_url}/api/generate",
                json={
                    "model": model_id,
                    "prompt": "Hello",
                    "stream": False,
                    "keep_alive": -1
                },
                timeout=60
            )
            if response.status_code == 200:
                self.loaded_models.add(model_id)
                print(f"Model {model_id} loaded in {time.time() - start:.1f}s")
            else:
                print(f"Failed to load {model_id}: HTTP {response.status_code}")
        except Exception as e:
            print(f"Failed to load {model_id}: {str(e)}")

    def generate(self, model_id: str, prompt: str, max_tokens=100):
        """Generate text (automatically handles model loading)."""
        self.heat_tracker.record_request(model_id)

        # If model is not loaded, enqueue a load request
        if model_id not in self.loaded_models:
            self.model_queue.put(model_id)

        # Simple wait for the model to load (demo only; production should optimize)
        start_wait = time.time()
        while model_id not in self.loaded_models and (time.time() - start_wait) < 30:
            time.sleep(0.5)

        # Call Ollama API. If the wait above timed out, Ollama still loads the
        # model as part of this request, so the call may be slow or time out.
        try:
            response = requests.post(
                f"{self.ollama_url}/api/generate",
                json={
                    "model": model_id,
                    "prompt": prompt,
                    "stream": False,
                    "keep_alive": -1,  # the proxy decides when to unload
                    "options": {"num_predict": max_tokens}
                },
                timeout=30
            )
            response.raise_for_status()
            return response.json().get("response", "")
        except Exception as e:
            return f"Error: {str(e)}"

    def get_status(self):
        """Return system status."""
        return {
            "loaded_models": list(self.loaded_models),
            "hot_models": self.heat_tracker.get_hot_models(self.max_hot_models),
            "total_requests": sum(
                s["total_requests"] for s in self.heat_tracker.model_stats.values()
            ),
        }

    def _optimization_loop(self):
        """Periodically optimize model placement (load/unload decisions)."""
        while True:
            time.sleep(300)  # check every 5 minutes

            # Get hot models
            hot_models = self.heat_tracker.get_hot_models(self.max_hot_models)

            # Unload cold models
            for model in list(self.loaded_models):
                if model not in hot_models:
                    self._unload_model(model)

    def _unload_model(self, model_id: str):
        """Ask Ollama to unload a model and drop it from the loaded set."""
        print(f"Unloading cold model: {model_id}")

        try:
            # A request with keep_alive=0 and no prompt tells Ollama to
            # unload the model immediately.
            requests.post(
                f"{self.ollama_url}/api/generate",
                json={"model": model_id, "keep_alive": 0},
                timeout=30
            )
        except Exception as e:
            print(f"Failed to unload {model_id}: {str(e)}")
        finally:
            # If the unload call failed, the next request simply re-queues a load
            self.loaded_models.discard(model_id)
```


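1. Current demo implementation: shows how "hot" models are kept resident and "warm" models are loaded on demand
2. Real production environments: full hot/cold tiering can be implemented in several ways
  - Option A: run multiple Ollama instances (one hot instance plus one warm instance)
  - Option B: use container technology and give cold models their own containers
  - Option C: restart the Ollama service automatically during low-traffic periods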
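
Option A can be sketched as a small routing layer that picks an instance by tier. Everything below is an assumption for illustration: the `TieredRouter` name, the two ports, and the idea that a second instance was started separately (for example by setting the `OLLAMA_HOST` environment variable to a different port before running `ollama serve`).

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TieredRouter:
    """Pick an Ollama base URL according to a model's tier."""
    hot_url: str = "http://localhost:11434"   # assumed: instance holding resident models
    warm_url: str = "http://localhost:11435"  # assumed: instance for on-demand models
    hot_models: Set[str] = field(default_factory=set)

    def base_url(self, model_id: str) -> str:
        # Anything not explicitly marked hot goes to the warm instance
        return self.hot_url if model_id in self.hot_models else self.warm_url

    def generate_url(self, model_id: str) -> str:
        return f"{self.base_url(model_id)}/api/generate"


router = TieredRouter(hot_models={"llama2"})
print(router.generate_url("llama2"))
print(router.generate_url("mistral"))
```

Keeping the routing decision in one place means the hot set can be updated from the heat scores without touching the request path.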

### 2.3 Deployment Example

```python
# demo.py - 5-minute quick test
import time
from ollama_proxy import OllamaProxy  # Assumes the tracker and proxy code are saved as ollama_proxy.py

def run_demo():
    # Initialize the proxy
    proxy = OllamaProxy()

    # The models used below must be downloaded first:
    # ollama pull llama2, ollama pull mistral

    print("=== Cold Start Demo ===")
    print("The first request will trigger model loading...")
    start = time.time()
    response = proxy.generate("llama2", "Hello, please introduce yourself.")
    print(f"Response: {response[:100]}...")
    print(f"First request latency: {time.time() - start:.1f} seconds\n")

    print("=== Warm Request Demo ===")
    print("Repeated requests to a hot model...")
    start = time.time()
    response = proxy.generate("llama2", "Describe the future of AI in one sentence.")
    print(f"Response: {response[:100]}...")
    print(f"Warm request latency: {time.time() - start:.2f} seconds\n")

    print("=== Model Switch Demo ===")
    print("Switching to a less frequently used model...")
    start = time.time()
    response = proxy.generate("mistral", "Please explain quantum computing.")
    print(f"Response: {response[:100]}...")
    print(f"Model switch latency: {time.time() - start:.1f} seconds\n")

    print("=== System Status ===")
    print(proxy.get_status())

if __name__ == "__main__":
    run_demo()
```


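**Measured results (local Mac M1 Pro)**

| Model     | First load time | Warm request time | Memory usage |
|-----------|-----------------|-------------------|--------------|
| llama2    | 12.3 s          | 0.2 s             | 5.2 GB       |
| mistral   | 9.8 s           | 0.2 s             | 4.7 GB       |
| codellama | 14.1 s          | 0.3 s             | 6.1 GB       |

Resource savings:
- There is no need to load all models at the same time
- Memory usage drops from 16 GB to about 6 GB (a saving of roughly 62%)
- Popular models keep responding within a second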
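
The saving can be reproduced from the table. This is a sketch under one assumption that the article does not state explicitly: the "after" figure corresponds to keeping only one model resident at a time, so the worst case is the largest model.

```python
# Memory footprint per model in GB, taken from the table above.
footprint_gb = {"llama2": 5.2, "mistral": 4.7, "codellama": 6.1}

all_loaded = sum(footprint_gb.values())  # every model resident at once
one_loaded = max(footprint_gb.values())  # assumption: only the largest model resident
saving = (all_loaded - one_loaded) / all_loaded

print(f"{all_loaded:.1f} GB -> {one_loaded:.1f} GB, saving {saving:.0%}")
```

If two models are kept hot instead of one, the same calculation gives a smaller saving, so the result depends directly on how many models are allowed to stay resident.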

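Why does this approach work?
1. Simple and practical: no complex configuration, built directly on the Ollama REST API
2. Automatic management: a background thread loads models, so loading one model does not stall requests for models that are already resident
3. Smart warm-keeping: popular models are kept resident automatically based on their heat
4. Resource friendly: only the models that are actually needed are loaded, which avoids wasted memory

**Quick deployment guide**

1. Install Ollama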

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```

2. Download the demo models

```bash
ollama pull llama2
ollama pull mistral
ollama pull codellama
```

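3. Run the demo

```bash
# Save the ModelHeatTracker and OllamaProxy code as ollama_proxy.py
# and the demo script as demo.py, then install the HTTP client
pip install requests
python demo.py
```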

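**Testing tips**

1. First observe the cost of loading everything:

```shell
# Loading several models back to back is slow
time ollama run llama2 "Hello"
time ollama run mistral "Hello"
```

2. Then test the advantage of the tiered strategy:
  - The first request is somewhat slow
  - Subsequent requests to hot models are very fast
  - Requests to cold models remain acceptable

3. Monitor resource usage

```shell
# Watch memory change (macOS syntax; on Linux use top -p)
top -pid $(pgrep -f "ollama serve")
```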
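
Besides `top`, Ollama exposes a `GET /api/ps` endpoint that lists the models currently loaded. The sketch below only formats such a response; the helper name `summarize_ps` and the sample payload (with made-up sizes) are assumptions, and a live payload would come from `requests.get("http://localhost:11434/api/ps").json()`.

```python
from typing import Any, Dict, List


def summarize_ps(payload: Dict[str, Any]) -> List[str]:
    """Format the models listed in an /api/ps-style response, largest first."""
    models = sorted(
        payload.get("models", []),
        key=lambda m: m.get("size", 0),
        reverse=True,
    )
    return [f"{m['name']}: {m.get('size', 0) / 1e9:.1f} GB" for m in models]


# Illustrative payload; sizes are in bytes and invented for this example.
sample = {"models": [
    {"name": "mistral:latest", "size": 4_700_000_000},
    {"name": "llama2:latest", "size": 5_200_000_000},
]}
for line in summarize_ps(sample):
    print(line)
```

Comparing this list with the proxy's own `loaded_models` set is a simple way to check that the two have not drifted apart.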

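## Conclusion

This approach carries over to most models served with vLLM, SGLang, or Ollama as the inference tool:
- No source code changes are needed; it is implemented purely in a proxy layer
- It saves roughly 30-60% of memory in practice
- Popular models keep their fast response times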