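# Local LLM Deployment and Verification

This notebook covers the following steps:
1. **Environment check**: confirm that the GPU and CUDA environment meet the requirements.
2. **Model download**: download the quantized Qwen2.5-7B-Instruct-GPTQ-Int4 model from ModelScope.
3. **Model loading**: load the tokenizer and the model into VRAM.
4. **Inference test**: define a chat function and test it.
5. **Multimodal model**: download and test Qwen3-VL-8B-Instruct-4bit-GPTQ (a vision-language model).

**Reference article**: [解决 AutoGPTQ 推理慢的问题](https://jishuzhan.net/article/1868560413910634498) (Fixing slow AutoGPTQ inference)

**Key troubleshooting record**:
- **Problem**: inference latency in the initial deployment was extremely high (>40 s).
- **Root cause**: AutoGPTQ 0.7.1 requires PyTorch 2.2.1+, but the environment had PyTorch 2.1.2, so the CUDA extension failed to load and execution fell back to the CPU.
- **Fix**: downgrade to AutoGPTQ 0.6.0, which matches PyTorch 2.1.2.
- **Result**: roughly 12x faster (inference time dropped from 43.7 s to 3.5 s).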
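
The version mismatch recorded above can be caught before any model is loaded. The following is a minimal sketch, not AutoGPTQ's own compatibility logic: the `MIN_TORCH` table only encodes the two pairings noted in this notebook, and the helper names `parse_version`, `is_supported_pair`, and `installed_version` are hypothetical.

```python
import re
from importlib import metadata

# Minimum PyTorch version per AutoGPTQ release, taken from the notes above.
# Assumption: only the two releases mentioned in this notebook are covered.
MIN_TORCH = {
    (0, 7, 1): (2, 2, 1),
    (0, 6, 0): (2, 1, 2),
}

def parse_version(text):
    """Turn '2.1.2+cu121' into (2, 1, 2); local build tags are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))

def is_supported_pair(autogptq_version, torch_version):
    """Return True/False for releases listed in MIN_TORCH, None for unknown ones."""
    required = MIN_TORCH.get(parse_version(autogptq_version))
    if required is None:
        return None
    return parse_version(torch_version) >= required

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# The pairing that caused the slow inference described above.
print(is_supported_pair("0.7.1", "2.1.2+cu121"))  # False
```

In a live environment, `installed_version("auto-gptq")` and `installed_version("torch")` would supply the two version strings. The table is a lower bound only; it says nothing about whether a newer PyTorch works with an older AutoGPTQ build.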

## 1. Conda Environment Setup and Environment Check

In [1]:
! conda env create -f environment.yml

^C


In [2]:
import torch
import sys

print(f"Python Version: {sys.version}")
print(f"PyTorch Version: {torch.__version__}")

if torch.cuda.is_available():
    print(f"CUDA is available!")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    # Check VRAM
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Total VRAM: {total_memory:.2f} GB")
else:
    print("WARNING: CUDA is NOT available. Inference will be extremely slow on CPU.")

Python Version: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:23:22) [MSC v.1944 64 bit (AMD64)]
PyTorch Version: 2.1.2+cu121
CUDA is available!
GPU Device: NVIDIA GeForce RTX 4060 Laptop GPU
CUDA Version: 12.1
Total VRAM: 8.00 GB


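## 2. Model Download

This step uses the quantized **Qwen2.5-7B-Instruct-GPTQ-Int4** model:
- Model size: about 4.3 GB (Int4 quantization)
- VRAM usage: about 5-6 GB
- Inference speed: about 3-4 s per request (after the fix)
- Target hardware: RTX 4060 8GB

**Performance optimization record**:
- Problem: inference latency in the initial deployment was >40 s
- Cause: the AutoGPTQ CUDA extension was not loaded (version mismatch)
- Fix: AutoGPTQ 0.7.1 → 0.6.0, matching PyTorch 2.1.2
- Result: 12x faster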

In [4]:
from modelscope import snapshot_download
import os

# Make sure the download directory exists
os.makedirs('./models', exist_ok=True)

# Download the Qwen2.5-7B GPTQ Int4 model
model_id = 'Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4'

print(f"Downloading model {model_id}...")
try:
    model_dir = snapshot_download(
        model_id, 
        cache_dir='./models',
        revision='master'
    )
    print(f"Model downloaded successfully. Path: {model_dir}")
except Exception as e:
    print(f"Model download failed: {e}")

Downloading model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4...
Downloading Model from https://www.modelscope.cn to directory: ./models\Qwen\Qwen2.5-7B-Instruct-GPTQ-Int4


2025-12-14 20:35:50,628 - modelscope - INFO - Creating symbolic link [./models\Qwen\Qwen2.5-7B-Instruct-GPTQ-Int4].


Model downloaded successfully. Path: ./models\Qwen\Qwen2___5-7B-Instruct-GPTQ-Int4
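
After a download reports success, it can be worth confirming that the weight files are actually present and roughly the expected size. The following is a minimal sketch using only the standard library; `summarize_model_dir` and `format_gib` are hypothetical helpers, and the sketch assumes the weight shards use the `.safetensors` extension.

```python
from pathlib import Path

def summarize_model_dir(model_dir, pattern="*.safetensors"):
    """Return (file_count, total_bytes) for weight files directly inside model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Model directory not found: {root}")
    files = sorted(root.glob(pattern))
    total_bytes = sum(f.stat().st_size for f in files)
    return len(files), total_bytes

def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"
```

In this notebook it would be called as `summarize_model_dir(model_dir)`. A count of zero suggests an incomplete download or a different weight format.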


## 3. Model Loading
Load the downloaded model.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from modelscope import snapshot_download
import torch

model_id = 'Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4'

# Get the model path
model_dir = snapshot_download(model_id, cache_dir='./models')

print(f"Loading tokenizer and model ({model_id})...")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# Load the GPTQ-quantized model (handled automatically with AutoGPTQ 0.6.0 installed)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True
)
print("Model loaded.")

Downloading Model from https://www.modelscope.cn to directory: ./models\Qwen\Qwen2.5-7B-Instruct-GPTQ-Int4


2025-12-14 20:35:54,300 - modelscope - INFO - Creating symbolic link [./models\Qwen\Qwen2.5-7B-Instruct-GPTQ-Int4].


Loading tokenizer and model (Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4)...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded.


## 4. Define the Inference Function
Define a `chat` function that wraps prompt construction and generation.

In [None]:
def chat(query):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        do_sample=True
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

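## 5. Chat Test

In [7]:
test_query = "你好，请介绍一下你自己。"  # "Hello, please introduce yourself."
print(f"\nUser: {test_query}")
# chat() already returns a string, so print the result directly; no extra print wrapper is needed
response = chat(test_query)
print(f"Assistant: {response}")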


User: 你好，请介绍一下你自己。


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Assistant: 你好！我是一个来自阿里云的语言模型，我叫通义千问。我可以帮助用户生成各种类型的文本，例如文章、故事、诗歌、故事等，并能够回答各种问题。尽管我非常强大，但我也不是完美无缺的，我还需要不断学习和进步。如果你有任何问题或需要帮助，都可以随时向我提问。
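
The "attention mask is not set" warning in the output above appears because `chat` passes only `input_ids` to `generate`. The following is a minimal sketch of a variant that forwards the whole tokenizer output instead. `chat_with_mask` is a hypothetical name, and the tokenizer and model are taken as parameters here purely so the function can be exercised without loading the real model. It assumes the tokenizer output behaves like a mapping that includes `attention_mask`.

```python
def chat_with_mask(query, tokenizer, model, max_new_tokens=512):
    """Same flow as chat(), but generate() also receives the attention mask."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Unpacking the tokenizer output passes input_ids and attention_mask together.
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        do_sample=True,
    )
    # Drop the prompt tokens so only the newly generated tokens are decoded.
    trimmed = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs["input_ids"], generated_ids)
    ]
    return tokenizer.batch_decode(trimmed, skip_special_tokens=True)[0]
```

With the objects from the loading cell, the call would be `chat_with_mask(test_query, tokenizer, model)`.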


In [None]:
# Test with a classic question
test_query_2 = "鲁迅和周树人是什么关系？"  # "What is the relationship between Lu Xun and Zhou Shuren?"
print(f"\nUser: {test_query_2}")
response_2 = chat(test_query_2)
print(f"Assistant: {response_2}")


User: 鲁迅和周树人是什么关系？
Assistant: 鲁迅是周树人的笔名。鲁迅是中国近现代著名的文学家、思想家、革命家，原名周树人，浙江绍兴人。1918年5月，首次用“鲁迅”作为自己的笔名，发表中国现代文学史上第一篇白话文小说《狂人日记》。
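
The latency figures quoted in this notebook (43.7 s before the fix, 3.5 s after) are not produced by any cell shown here. The following is a minimal sketch of how such a measurement could be taken with the standard library; `timed_call` and `speedup` are hypothetical helpers.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def speedup(before_seconds, after_seconds):
    """Ratio of the old latency to the new latency."""
    return before_seconds / after_seconds

# Latencies quoted in this notebook.
print(f"{speedup(43.7, 3.5):.1f}x")  # prints 12.5x
```

For the chat function defined earlier, the call would be `response, seconds = timed_call(chat, test_query)`. Because sampling is enabled, the number of generated tokens varies between runs, so several runs give a fairer picture than one.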


## 6. Download the Qwen3-VL-8B Multimodal Model (4-bit Quantized)

Try downloading the quantized Qwen3-VL-8B-Instruct-4bit-GPTQ model (supports image understanding; VRAM usage about 6-8 GB).

In [6]:
from modelscope import snapshot_download
import os

# Use a relative path so models are stored in the notebook's directory
cache_dir = './models'
os.makedirs(cache_dir, exist_ok=True)

# Download the 4-bit GPTQ build of Qwen3-VL-8B (intended for 8 GB of VRAM)
vl_model_id = 'DavidWen2025/Qwen3-VL-8B-Instruct-4bit-GPTQ'

print(f"Downloading multimodal model {vl_model_id}...")
print(f"Download path: {os.path.abspath(cache_dir)}")

try:
    vl_model_dir = snapshot_download(
        vl_model_id,
        cache_dir=cache_dir,
        revision='master'
    )
    print(f"Model downloaded successfully. Path: {vl_model_dir}")
except Exception as e:
    print(f"Model download failed: {e}")

Downloading multimodal model DavidWen2025/Qwen3-VL-8B-Instruct-4bit-GPTQ...
Download path: d:\llm_deploy\LLM\LLM_DEPLOY\models
Downloading Model from https://www.modelscope.cn to directory: ./models\DavidWen2025\Qwen3-VL-8B-Instruct-4bit-GPTQ


2025-12-14 21:59:37,018 - modelscope - INFO - Got 2 files, start to download ...


Processing 2 items:   0%|          | 0.00/2.00 [00:00<?, ?it/s]

Downloading [model-00001-of-00002.safetensors]:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

Downloading [model-00002-of-00002.safetensors]:   0%|          | 0.00/2.76G [00:00<?, ?B/s]

2025-12-14 22:33:48,197 - modelscope - INFO - Download model 'DavidWen2025/Qwen3-VL-8B-Instruct-4bit-GPTQ' successfully.


Model downloaded successfully. Path: ./models\DavidWen2025\Qwen3-VL-8B-Instruct-4bit-GPTQ


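### 6.1 Load and Test the 4-bit Quantized Qwen3-VL-8B Model

Load the 4-bit quantized vision-language model and test its text and image understanding. In the recorded run below, loading fails with `AttributeError: type object 'AutoConfig' has no attribute 'from_dict'`, so `vl_model` and `vl_processor` are never defined.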

In [15]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info
from modelscope import snapshot_download
import torch
import json
import os

# Get the model path
vl_model_id = 'DavidWen2025/Qwen3-VL-8B-Instruct-4bit-GPTQ'
vl_model_dir = snapshot_download(vl_model_id, cache_dir='./models')

print("Loading the Qwen3-VL-8B 4-bit GPTQ model...")
print(f"Model path: {vl_model_dir}")

# Patch compatibility issues in the config file (Qwen3-VL -> Qwen2-VL)
config_path = os.path.join(vl_model_dir, 'config.json')
with open(config_path, 'r', encoding='utf-8') as f:
    config_dict = json.load(f)

modified = False

# 1. Fix the top-level model_type
if config_dict.get('model_type') == 'qwen3_vl':
    print("Fixing model_type: qwen3_vl -> qwen2_vl")
    config_dict['model_type'] = 'qwen2_vl'
    modified = True

# 2. Fix architectures
if 'Qwen3VLForConditionalGeneration' in config_dict.get('architectures', []):
    print("Fixing architectures: Qwen3VL -> Qwen2VL")
    config_dict['architectures'] = ['Qwen2VLForConditionalGeneration']
    modified = True

# 3. Fix text_config
if 'text_config' in config_dict and isinstance(config_dict['text_config'], dict):
    if config_dict['text_config'].get('model_type') == 'qwen3_vl_text':
        print("Fixing text_config model_type: qwen3_vl_text -> qwen2")
        config_dict['text_config']['model_type'] = 'qwen2'
        modified = True

    if '_name_or_path' not in config_dict['text_config']:
        config_dict['text_config']['_name_or_path'] = ''
        modified = True

# 4. Fix vision_config
if 'vision_config' in config_dict and isinstance(config_dict['vision_config'], dict):
    if config_dict['vision_config'].get('model_type') == 'qwen3_vl':
        print("Fixing vision_config model_type: qwen3_vl -> qwen2_vl")
        config_dict['vision_config']['model_type'] = 'qwen2_vl'
        modified = True

# Save the patched config
if modified:
    print("Saving patched config.json...")
    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config_dict, f, indent=2, ensure_ascii=False)
else:
    print("Config already patched.")

# Load the model and processor (4-bit quantized)
try:
    # Explicitly load the Config object
    config = AutoConfig.from_pretrained(vl_model_dir, trust_remote_code=True)

    # Manually convert text_config and vision_config into Config objects
    # Works around AttributeError: 'dict' object has no attribute 'to_dict'
    if hasattr(config, 'text_config') and isinstance(config.text_config, dict):
        print("Converting text_config dict to Config object...")
        config.text_config = AutoConfig.from_dict(config.text_config)

    if hasattr(config, 'vision_config') and isinstance(config.vision_config, dict):
        print("Converting vision_config dict to Config object...")
        config.vision_config = AutoConfig.from_dict(config.vision_config)

    vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
        vl_model_dir,
        config=config, # pass the patched config object
        device_map={"": 0} if torch.cuda.is_available() else {"": "cpu"},
        trust_remote_code=True
    )

    vl_processor = AutoProcessor.from_pretrained(vl_model_dir, trust_remote_code=True)

    print("✅ Qwen3-VL-8B 4-bit model loaded.")
except Exception as e:
    print(f"❌ Model loading failed: {e}")
    import traceback
    traceback.print_exc()


Downloading Model from https://www.modelscope.cn to directory: ./models\DavidWen2025\Qwen3-VL-8B-Instruct-4bit-GPTQ
Loading the Qwen3-VL-8B 4-bit GPTQ model...
Model path: ./models\DavidWen2025\Qwen3-VL-8B-Instruct-4bit-GPTQ
Config already patched.
Converting text_config dict to Config object...
❌ Model loading failed: type object 'AutoConfig' has no attribute 'from_dict'


Traceback (most recent call last):
  File "C:\Users\Iron\AppData\Local\Temp\ipykernel_12780\3922773060.py", line 69, in <module>
    config.text_config = AutoConfig.from_dict(config.text_config)
AttributeError: type object 'AutoConfig' has no attribute 'from_dict'
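
The loading cell rewrites `config.json` in place, and the recorded output already reads "Config already patched.", so the originally downloaded file has been overwritten. The following is a minimal sketch of keeping a pristine copy before any in-place patch; `backup_once` and `restore` are hypothetical helpers, and the `.orig` suffix is an arbitrary choice.

```python
import shutil
from pathlib import Path

def backup_once(config_path, suffix=".orig"):
    """Copy config_path to config_path + suffix unless that backup already exists."""
    source = Path(config_path)
    backup = source.with_name(source.name + suffix)
    if not backup.exists():
        shutil.copy2(source, backup)
    return backup

def restore(config_path, suffix=".orig"):
    """Overwrite config_path with its backup, undoing any in-place patch."""
    source = Path(config_path)
    backup = source.with_name(source.name + suffix)
    shutil.copy2(backup, source)
    return source
```

Calling `backup_once(config_path)` before the `json.dump` step would make the patch reversible with `restore(config_path)`. The backup is written only on the first call, so re-running the cell does not replace the original with an already patched file.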


### 6.2 Text-only Chat Test

First test the vision model's text understanding (without an image).

In [16]:
def vl_chat_text_only(query):
    """Text-only chat (no image)."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query}
            ]
        }
    ]
    
    # Apply the chat template
    text = vl_processor.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Prepare the inputs
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = vl_processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    )
    inputs = inputs.to(vl_model.device)
    
    # Generate the response
    generated_ids = vl_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.8,
        do_sample=True
    )
    
    generated_ids_trimmed = [
        out_ids[len(in_ids):] 
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    response = vl_processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    
    return response

# Test text-only chat
test_query = "请用一句话介绍一下你自己的能力"  # "Describe your capabilities in one sentence."
print(f"User: {test_query}")
response = vl_chat_text_only(test_query)
print(f"Assistant: {response}")

User: 请用一句话介绍一下你自己的能力


NameError: name 'vl_processor' is not defined
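
The `NameError` above is a knock-on effect: the loading cell failed, so `vl_processor` was never created. The following is a minimal sketch of a guard that fails fast with a clearer message; `require_names` is a hypothetical helper, not part of any library.

```python
def require_names(namespace, *names):
    """Raise a descriptive error if any of the given names is missing from namespace."""
    missing = [name for name in names if name not in namespace]
    if missing:
        raise RuntimeError(
            "Missing objects from an earlier cell: " + ", ".join(missing)
            + ". Re-run the loading cell and check that it finished without errors."
        )
```

Placed at the top of a dependent cell, the call would be `require_names(globals(), "vl_model", "vl_processor")`.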

### 6.3 Image Understanding Test (Using an Online Image)

Test the model's visual understanding with an online image URL.

In [None]:
def vl_chat_with_image(query, image_url):
    """Chat about an image."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": query}
            ]
        }
    ]
    
    # Apply the chat template
    text = vl_processor.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Prepare the inputs (including the image)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = vl_processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    )
    inputs = inputs.to(vl_model.device)
    
    # Generate the response
    generated_ids = vl_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.8,
        do_sample=True
    )
    
    generated_ids_trimmed = [
        out_ids[len(in_ids):] 
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    response = vl_processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    
    return response

# Test image understanding
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
query = "请详细描述这张图片的内容"  # "Please describe this image in detail."

print(f"Image URL: {image_url}")
print(f"User: {query}")
print("\nProcessing...")

response = vl_chat_with_image(query, image_url)
print(f"\nAssistant: {response}")

### 6.4 Multi-turn Chat Test

Test multi-turn chat with an image.

In [None]:
# Multi-turn conversation example
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "图片中有什么动物？"}  # "What animal is in the picture?"
        ]
    }
]

# Turn 1
text = vl_processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(conversation)
inputs = vl_processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(vl_model.device)

generated_ids = vl_model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response_1 = vl_processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print("Turn 1:")
print("User: 图片中有什么动物？")
print(f"Assistant: {response_1}")

# Append the first answer to the conversation history
conversation.append({"role": "assistant", "content": [{"type": "text", "text": response_1}]})

# Turn 2 (follow-up question)
conversation.append({
    "role": "user",
    "content": [{"type": "text", "text": "它在做什么？"}]  # "What is it doing?"
})

text = vl_processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(conversation)
inputs = vl_processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(vl_model.device)

generated_ids = vl_model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response_2 = vl_processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print("\nTurn 2:")
print("User: 它在做什么？")
print(f"Assistant: {response_2}")
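
The two turns above build their history entries by hand. The following is a minimal sketch of that bookkeeping factored into helpers; `text_message` and `extend_history` are hypothetical names, and the message layout simply mirrors the dictionaries used in the cell above.

```python
def text_message(role, text):
    """Build a message whose content is a single text part."""
    return {"role": role, "content": [{"type": "text", "text": text}]}

def extend_history(conversation, assistant_reply, next_user_text):
    """Return a new history with the assistant reply and the next user turn appended."""
    return conversation + [
        text_message("assistant", assistant_reply),
        text_message("user", next_user_text),
    ]
```

Because `extend_history` returns a new list instead of mutating its argument, the history from an earlier turn stays available for inspection or retries.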

### 6.5 Local Image Test and Summary

If you have a local image, you can use it like this:

```python
import os

# Use a local image path
local_image_path = "./test_image.jpg"
query = "描述这张图片"  # "Describe this image."
response = vl_chat_with_image(query, f"file://{os.path.abspath(local_image_path)}")
print(response)
```

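### Notes

1. **VRAM requirement**: the 4-bit GPTQ build of Qwen3-VL-8B needs about 6-8 GB of VRAM and is intended to **fit an RTX 4060 8GB**. The recorded run in this notebook does not confirm this, because the model failed to load.
2. **Quantization effect**: 4-bit quantization greatly reduces VRAM usage while largely preserving model quality.
3. **Image formats**: common formats such as JPEG and PNG are supported.
4. **Video support**: Qwen3-VL also supports video input (using `{"type": "video", "video": "path"}`).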

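## Summary

This notebook demonstrates:
- ✅ Deployment and testing of the Qwen2.5-7B text model (Int4 quantization, 3.5 s inference)
- ✅ Download of the Qwen3-VL-8B 4-bit GPTQ multimodal model
- ❌ Loading the Qwen3-VL-8B model: failed in the recorded run (`AutoConfig` has no attribute `from_dict`)
- ❌ Text-only chat with Qwen3-VL: failed with `NameError`, because the model never loaded
- Image understanding and multi-turn chat with Qwen3-VL: code is provided, but no output was recorded

**Model comparison**:
| Model | Size | VRAM usage | Target hardware | Capabilities |
|------|------|---------|---------|------|
| Qwen2.5-7B-GPTQ-Int4 | 4.3 GB | 5-6 GB | RTX 4060 8GB ✅ | Text only |
| Qwen3-VL-8B-4bit-GPTQ | about 6.8 GB (4.00 G + 2.76 G shards in the download log) | 6-8 GB | RTX 4060 8GB (not verified) | Text + image |

**Next steps**:
- Run `02_rag_implementation.ipynb` to build the RAG knowledge base
- Start `rag_server.py` to deploy the RAG API service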