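# Working with Hugging Face Models in Python: A Hands-On Tutorial

## Introduction: Why Learn Hugging Face?

In the AI era, Hugging Face has become the "GitHub" of AI models. It offers:
- Tens of thousands of pretrained models
- A simple, easy-to-use API
- A powerful model library (Transformers)
- An active community

In this tutorial we will learn how to work with models hosted on Hugging Face from Python, so that you can quickly put a wide range of AI models to use.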

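## 1. What Is Hugging Face?

### 1.1 Overview

Hugging Face is an AI community and platform. It provides:

1. **Model Hub**
   - More than 100,000 pretrained models
   - Coverage of NLP, computer vision, audio, and other domains
   - Support for multiple frameworks (PyTorch, TensorFlow, and others)

2. **Transformers library**
   - A unified API
   - Support for many tasks (text generation, classification, translation, and more)
   - A simplified workflow for using models

3. **Dataset Hub**
   - Thousands of datasets
   - A unified data-loading interface

4. **Spaces**
   - Hosting for AI applications
   - Model demos

### 1.2 Why Choose Hugging Face?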

```python
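# Traditional approach: loading and using a model by hand
import torch
model = torch.load('model.pt')
# You need to know the model architecture, the preprocessing steps, and so on...

# Hugging Face approach: simple and direct
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
# That's all it takes!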
```

## 2. Setting Up the Environment

### 2.1 Installing the Required Libraries

```bash
# Create a virtual environment
python -m venv hf_env
source hf_env/bin/activate  # Windows: hf_env\Scripts\activate

# Install the core libraries
pip install transformers
pip install torch  # or tensorflow
pip install datasets
pip install huggingface-hub

# Install optional extras
pip install accelerate  # device placement and faster training
pip install sentencepiece  # required by some models
pip install protobuf  # required by some models
```

### 2.2 Configuring a Hugging Face Account (Optional)

If you want to use private models or upload models, you need to configure an access token:

```python
# Option 1: use the command line
# huggingface-cli login

# Option 2: set it in code
from huggingface_hub import login
login(token="your_token_here")

# Option 3: use an environment variable
# export HF_TOKEN="your_token_here"  # older releases read HUGGING_FACE_HUB_TOKEN
```
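
Hardcoding a token in source code makes it easy to leak, so the environment-variable option is usually the safer one. The sketch below shows one way to read the token at runtime. The helper name `resolve_hf_token` is hypothetical, and the variable names checked (`HF_TOKEN`, then the older `HUGGING_FACE_HUB_TOKEN`) are an assumption about which names your installed `huggingface_hub` version honors.

```python
import os
from typing import Optional, Sequence


def resolve_hf_token(
    env_names: Sequence[str] = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"),
) -> Optional[str]:
    """Return the first non-empty token found in the given environment variables.

    Hypothetical helper for illustration; it is not part of huggingface_hub.
    """
    for name in env_names:
        value = os.environ.get(name, "").strip()
        if value:
            return value
    return None


token = resolve_hf_token()
if token is None:
    print("No token found; only public models will be accessible.")
else:
    # You could now pass it on, for example login(token=token).
    print("Token found (not printing it).")
```

Keeping the lookup in one function means the rest of the code never needs to know where the token came from.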

## 3. Quick Start: The Pipeline API

### 3.1 What Is a Pipeline?

A pipeline is the high-level API in Transformers that lets you use a model quickly without dealing with the underlying details.

```python
from transformers import pipeline

# Creating a pipeline is like creating a function
# that handles one specific task
sentiment_analyzer = pipeline("sentiment-analysis")

# Using it is as simple as calling a function
result = sentiment_analyzer("I am very happy today!")
print(result)
# Example output: [{'label': 'POSITIVE', 'score': 0.9998}]
```
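
Conceptually, a pipeline chains three stages: preprocessing the input, running the model, and postprocessing the raw output. The toy sketch below mimics that structure in plain Python so the flow is visible without downloading anything. The word lists and function names are made up for illustration; a real pipeline uses a trained tokenizer and a neural network.

```python
from typing import Dict, List

# Toy stand-ins for a trained model (illustrative only).
POSITIVE_WORDS = {"happy", "love", "great"}
NEGATIVE_WORDS = {"sad", "hate", "terrible"}


def preprocess(text: str) -> List[str]:
    """Stage 1: turn raw text into tokens (lowercase, strip punctuation)."""
    return [word.strip(".,!?") for word in text.lower().split()]


def forward(tokens: List[str]) -> Dict[str, int]:
    """Stage 2: compute raw scores (here, simple word counts)."""
    return {
        "POSITIVE": sum(token in POSITIVE_WORDS for token in tokens),
        "NEGATIVE": sum(token in NEGATIVE_WORDS for token in tokens),
    }


def postprocess(scores: Dict[str, int]) -> List[Dict[str, object]]:
    """Stage 3: convert raw scores into a label and a normalized score."""
    total = sum(scores.values()) or 1  # avoid dividing by zero
    label = max(scores, key=scores.get)
    return [{"label": label, "score": scores[label] / total}]


def toy_pipeline(text: str) -> List[Dict[str, object]]:
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("I am very happy today!"))
```

The return value has the same shape as the sentiment pipeline's result: a list with one dictionary holding `label` and `score`.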

### 3.2 Common Pipeline Examples

```python
from transformers import pipeline

# 1. Sentiment analysis
print("=== Sentiment analysis ===")
sentiment_pipeline = pipeline("sentiment-analysis")
texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay, not great but not bad either."
]
for text in texts:
    result = sentiment_pipeline(text)
    print(f"Text: {text}")
    print(f"Result: {result[0]['label']}, confidence: {result[0]['score']:.4f}\n")

# 2. Text generation
print("=== Text generation ===")
generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time"
result = generator(prompt, max_length=50, num_return_sequences=2)
for i, text in enumerate(result):
    print(f"Generation {i+1}: {text['generated_text']}\n")

# 3. Question answering
print("=== Question answering ===")
qa_pipeline = pipeline("question-answering")
context = """
Hugging Face is a company that develops tools for building applications using machine learning.
It is most notable for its Transformers library built for NLP applications.
The company was founded in 2016 by French entrepreneurs.
"""
question = "When was Hugging Face founded?"
answer = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {answer['answer']}, confidence: {answer['score']:.4f}\n")

# 4. Named entity recognition
print("=== Named entity recognition ===")
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entity: {entity['word']}, type: {entity['entity_group']}, score: {entity['score']:.4f}")

# 5. Summarization
print("\n=== Summarization ===")
summarizer = pipeline("summarization")
article = """
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, 
and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. 
During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest 
man-made structure in the world, a title it held for 41 years until the Chrysler Building in 
New York City was finished in 1930.
"""
summary = summarizer(article, max_length=50, min_length=25, do_sample=False)
print(f"Original length: {len(article.split())} words")
print(f"Summary: {summary[0]['summary_text']}")

# 6. Translation
print("\n=== Translation ===")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
text = "Hello, how are you today?"
translation = translator(text)
print(f"English: {text}")
print(f"Chinese: {translation[0]['translation_text']}")

# 7. Zero-shot classification
print("\n=== Zero-shot classification ===")
classifier = pipeline("zero-shot-classification")
text = "This is a tutorial about using Hugging Face with Python"
candidate_labels = ["education", "politics", "technology", "sports"]
result = classifier(text, candidate_labels)
print(f"Text: {text}")
print("Classification results:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.4f}")
```

### 3.3 Specifying a Particular Model

```python
# Use a specific model instead of the task default
from transformers import pipeline

# 1. A Chinese sentiment-analysis model
chinese_sentiment = pipeline(
    "sentiment-analysis", 
    model="uer/roberta-base-finetuned-dianping-chinese"
)
result = chinese_sentiment("这个产品真的很棒！")
print(f"Chinese sentiment analysis: {result}")

# 2. A specific text-generation model
gpt2_medium = pipeline(
    "text-generation",
    model="gpt2-medium"
)

# 3. A local model
# The directory must already contain a saved model and tokenizer
# (for example, written by save_pretrained)
local_model_path = "./my_local_model"
local_pipeline = pipeline("text-classification", model=local_model_path)
```

## 4. Going Deeper: AutoModel and AutoTokenizer

### 4.1 Understanding the Tokenizer and the Model

When you need more control, you can use AutoModel and AutoTokenizer directly:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

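# 1. Load the tokenizer and the model
# Note: bert-base-uncased has no fine-tuned classification head, so the head is
# randomly initialized and the probabilities below are not meaningful until the
# model is fine-tuned. Swap in a fine-tuned checkpoint for real predictions.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Prepare the input
text = "I love using Hugging Face!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# 3. Run inference
with torch.no_grad():
    outputs = model(**inputs)
    
# 4. Process the output: softmax turns logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Predictions: {predictions}")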
```
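
The last step above turns raw logits into probabilities with softmax and then reads off the most likely class. The arithmetic is small enough to sketch in plain Python. The logits and the `id2label` mapping below are made-up values for illustration; with a real model you would take the mapping from `model.config.id2label`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: Sequence[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # hypothetical mapping
label, prob = top_label([-1.2, 2.3], id2label)  # made-up logits
print(label, round(prob, 4))
```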

### 4.2 Batch Processing

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Process several texts in one batch
texts = [
    "I love this movie!",
    "This is terrible.",
    "Not bad, quite enjoyable.",
    "Absolutely fantastic!"
]

# Tokenize all texts; padding=True pads them to a common length
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Batched inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Interpret the results (for this model, index 0 is NEGATIVE and index 1 is POSITIVE)
for i, text in enumerate(texts):
    neg_score = predictions[i][0].item()
    pos_score = predictions[i][1].item()
    sentiment = "POSITIVE" if pos_score > neg_score else "NEGATIVE"
    confidence = max(pos_score, neg_score)
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment}, confidence: {confidence:.4f}\n")
```
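
Batching works because `padding=True` pads every sequence to the length of the longest one and returns an attention mask that tells the model which positions are real tokens. The plain-Python sketch below illustrates the idea. The token ids are invented, and `pad_batch` is a hypothetical helper, not the tokenizer's actual implementation.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token-id lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different lengths.
batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```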

### 4.3 Using Different Model Types

```python
from transformers import (
    AutoModelForSequenceClassification,  # classification
    AutoModelForTokenClassification,     # token classification (e.g. NER)
    AutoModelForQuestionAnswering,       # question answering
    AutoModelForCausalLM,               # text generation (GPT-style)
    AutoModelForSeq2SeqLM,              # sequence-to-sequence (T5-style)
    AutoModelForMaskedLM,               # masked language modeling (BERT-style)
    AutoTokenizer
)

# 1. Text generation (GPT-2)
print("=== Text generation example ===")
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text
outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    temperature=0.8,
    do_sample=True,
    top_p=0.9
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {input_text}")
print(f"Generated: {generated_text}\n")

# 2. Masked language modeling (BERT)
print("=== Masked-token prediction example ===")
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
text = "The capital of France is [MASK]."
predictions = unmasker(text)

print(f"Input: {text}")
print("Predictions:")
for pred in predictions[:3]:
    print(f"  {pred['token_str']}: {pred['score']:.4f}")
```
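
The `temperature` and `top_p` arguments passed to `generate` control how the next token is sampled. Temperature rescales the logits before softmax, and top-p (nucleus) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`. The sketch below illustrates that idea in plain Python under simplifying assumptions. The logits are invented, and this is not the library's actual sampling code.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], top_p: float) -> List[int]:
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 0.8,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token index after temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    probs = softmax([x / temperature for x in logits])  # lower temperature sharpens
    kept = top_p_filter(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -3.0]  # made-up scores for a 4-token vocabulary
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

With a very small `top_p` only the single most likely token survives the filter, so sampling becomes equivalent to greedy decoding.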

## 5. Working with Chinese Models

### 5.1 Chinese Text Classification

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Use a Chinese BERT model
# Note: bert-base-chinese is a base checkpoint, so the 2-label classification
# head is randomly initialized; the scores are only meaningful after fine-tuning
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Sample Chinese texts
texts = [
    "这部电影真的太精彩了！",
    "服务态度很差，不推荐。",
    "还可以，一般般吧。"
]

for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    print(f"Text: {text}")
    print(f"Predicted scores: {predictions}\n")
```

### 5.2 Chinese Question Answering

```python
from transformers import pipeline

# Use a Chinese question-answering model
qa_pipeline = pipeline(
    "question-answering",
    model="uer/roberta-base-chinese-extractive-qa"
)

context = """
杭州是浙江省的省会城市，位于中国东南沿海。
杭州以其美丽的西湖风景而闻名，被誉为"人间天堂"。
这座城市有着悠久的历史，可以追溯到2200多年前。
杭州也是中国重要的经济中心，阿里巴巴总部就位于此地。
"""

questions = [
    "杭州是哪个省的省会？",
    "杭州有什么著名景点？",
    "哪家知名公司的总部在杭州？"
]

for question in questions:
    answer = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}, confidence: {answer['score']:.4f}\n")
```

### 5.3 Chinese Text Generation

```python
import torch  # needed for the torch.cuda.is_available() check below
from transformers import pipeline

# Use a Chinese GPT-2 model
generator = pipeline(
    "text-generation",
    model="uer/gpt2-chinese-cluecorpussmall",
    device=0 if torch.cuda.is_available() else -1
)

prompts = [
    "今天天气",
    "人工智能的发展",
    "我最喜欢的"
]

for prompt in prompts:
    result = generator(
        prompt,
        max_length=50,
        num_return_sequences=1,
        do_sample=True,  # temperature only takes effect when sampling
        temperature=0.7
    )
    print(f"Prompt: {prompt}")
    print(f"Generated: {result[0]['generated_text']}\n")
```

## 6. Image Models

### 6.1 Image Classification

```python
from transformers import pipeline
from PIL import Image
import requests

# Create an image-classification pipeline
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Option 1: load the image from a URL
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
image = Image.open(requests.get(url, stream=True).raw)

# Classify
results = classifier(image)
print("Image classification results:")
for result in results[:3]:
    print(f"  {result['label']}: {result['score']:.4f}")

# Option 2: load the image from a local file
# image = Image.open("path/to/your/image.jpg")
# results = classifier(image)
```

### 6.2 Object Detection

```python
from transformers import pipeline
from PIL import Image, ImageDraw
import requests

# Create an object-detection pipeline
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Load the image (convert to RGB so it can be saved as JPEG later)
url = "https://images.unsplash.com/photo-1518991669955-9c7e78ec80ca"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Detect objects
results = detector(image)

# Visualize the results
draw = ImageDraw.Draw(image)
for result in results:
    box = result['box']
    label = result['label']
    score = result['score']
    
    if score > 0.9:  # only show high-confidence detections
        # Draw the bounding box
        draw.rectangle(
            [box['xmin'], box['ymin'], box['xmax'], box['ymax']],
            outline="red",
            width=3
        )
        # Add the label
        draw.text(
            (box['xmin'], box['ymin']),
            f"{label}: {score:.2f}",
            fill="red"
        )

# Save the result
image.save("detection_result.jpg")
print("Detection result saved to detection_result.jpg")
```
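
The detection loop above filters results by score inline. If you need the same filtering in several places, it may be cleaner to pull it into a small helper that works on the list of dictionaries the pipeline returns. The sketch below assumes each detection has the `score`, `label`, and `box` keys used above. The helper names and the sample detections are made up for illustration.

```python
from typing import Dict, List, Optional, Set


def filter_detections(
    detections: List[Dict],
    min_score: float = 0.9,
    labels: Optional[Set[str]] = None,
) -> List[Dict]:
    """Keep detections above a score threshold, optionally restricted to some labels."""
    kept = [
        d for d in detections
        if d["score"] >= min_score and (labels is None or d["label"] in labels)
    ]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


def box_area(box: Dict[str, int]) -> int:
    """Area of a bounding box in pixels; degenerate boxes count as zero."""
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height


# Made-up detections in the same shape as the pipeline output.
sample = [
    {"label": "cat", "score": 0.98, "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"label": "dog", "score": 0.95, "box": {"xmin": 0, "ymin": 0, "xmax": 50, "ymax": 40}},
    {"label": "cat", "score": 0.42, "box": {"xmin": 5, "ymin": 5, "xmax": 15, "ymax": 15}},
]

for det in filter_detections(sample, min_score=0.9):
    print(det["label"], det["score"], box_area(det["box"]))
```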

### 6.3 Image Captioning

```python
from transformers import pipeline
from PIL import Image

# Create an image-to-text pipeline
image_to_text = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base"
)

# Load the image
image = Image.open("your_image.jpg")

# Generate a caption
result = image_to_text(image)
print(f"Caption: {result[0]['generated_text']}")
```

## 7. Audio Models

### 7.1 Speech Recognition

```python
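from transformers import pipeline

# Create a speech-recognition pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small"
)

# Path to the audio file
# Note: when given a file path, the pipeline decodes the audio with ffmpeg,
# so ffmpeg must be installed on the system
audio_file = "path/to/audio.wav"

# Transcribe
result = transcriber(audio_file)
print(f"Transcription: {result['text']}")

# Handle long audio by splitting it into overlapping chunks
result = transcriber(
    audio_file,
    chunk_length_s=30,  # process the audio in 30-second chunks
    stride_length_s=5   # 5 seconds of overlap on each side of a chunk
)
print(f"Long-audio transcription: {result['text']}")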
```
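
To see what chunking with a stride means, it can help to compute the windows by hand. The sketch below assumes each chunk keeps `stride_s` seconds of overlap on both sides, so consecutive windows start `chunk_s - 2 * stride_s` seconds apart. This is a simplified illustration of the idea, not the pipeline's exact algorithm, and `chunk_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_windows(
    total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping windows covering total_s seconds."""
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    step = chunk_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

The overlap gives the model context on both sides of each chunk boundary, which is why words cut in half at a boundary can still be recovered when the chunk transcripts are merged.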

### 7.2 Audio Classification

```python
from transformers import pipeline

# Create an audio-classification pipeline (this model recognizes emotions)
audio_classifier = pipeline(
    "audio-classification",
    model="superb/hubert-base-superb-er"
)

# Classify the audio
audio_file = "path/to/audio.wav"
results = audio_classifier(audio_file)

print("Audio classification results:")
for result in results[:3]:
    print(f"  {result['label']}: {result['score']:.4f}")
```

## 8. Multimodal Models

### 8.1 Visual Question Answering (VQA)

```python
from transformers import pipeline
from PIL import Image

# Create a VQA pipeline
vqa = pipeline("visual-question-answering")

# Load the image
image = Image.open("path/to/image.jpg")

# Ask questions
questions = [
    "What color is the sky?",
    "How many people are in the image?",
    "What is the main object in the picture?"
]

for question in questions:
    result = vqa(image=image, question=question)
    print(f"Question: {question}")
    print(f"Answer: {result[0]['answer']}, confidence: {result[0]['score']:.4f}\n")
```

### 8.2 CLIP (Image-Text Matching)

```python
from transformers import pipeline

# Create a zero-shot image-classification pipeline (using CLIP)
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32"
)

# Load the image
from PIL import Image
image = Image.open("path/to/image.jpg")

# Define the candidate labels
candidate_labels = ["cat", "dog", "bird", "car", "tree", "building"]

# Classify (the image is passed positionally because the keyword name
# has differed between transformers releases)
results = classifier(
    image,
    candidate_labels=candidate_labels
)

print("Image classification results:")
for result in results:
    print(f"  {result['label']}: {result['score']:.4f}")
```

## 9. Fine-Tuning Basics

### 9.1 Preparing a Dataset

```python
from datasets import Dataset, DatasetDict
import pandas as pd

# Create sample data
data = {
    'text': [
        "I love this product!",
        "This is terrible.",
        "Great quality, highly recommend.",
        "Waste of money.",
        "Not bad, decent value."
    ],
    'label': [1, 0, 1, 0, 1]  # 1: positive, 0: negative
}

# Build a Dataset
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)

# Split into training and test sets
train_test = dataset.train_test_split(test_size=0.2)
dataset_dict = DatasetDict({
    'train': train_test['train'],
    'test': train_test['test']
})

print(f"Training set size: {len(dataset_dict['train'])}")
print(f"Test set size: {len(dataset_dict['test'])}")
```

### 9.2 A Simple Fine-Tuning Run

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import numpy as np

# Load the model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Preprocessing function
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

# Apply preprocessing (dataset_dict comes from section 9.1)
tokenized_datasets = dataset_dict.map(preprocess_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Evaluation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

# Train the model
# trainer.train()

# Save the model
# trainer.save_model("./my-finetuned-model")
```
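
The `compute_metrics` function above reports accuracy only. On imbalanced data, precision, recall, and F1 are often more informative. The sketch below computes them in plain Python for a binary task. `binary_metrics` is a hypothetical helper; if you wanted to use it inside `compute_metrics`, one option would be to call it on `predictions.tolist()` and `labels.tolist()` after the `argmax` step.

```python
from typing import Dict, Sequence


def binary_metrics(
    predictions: Sequence[int], labels: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up predictions and labels for illustration.
print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```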

## 10. Performance Optimization

### 10.1 GPU Acceleration

```python
import torch
from transformers import pipeline

# Check whether a GPU is available
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# Run the pipeline on the GPU (or the CPU as a fallback)
classifier = pipeline(
    "sentiment-analysis",
    device=device
)

# Process inputs in batches for better throughput
texts = ["Text 1", "Text 2", "Text 3"] * 100
results = classifier(texts, batch_size=32)
```
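
Conceptually, `batch_size=32` means the inputs are split into groups of at most 32 items, and each group goes through the model in a single forward pass. The grouping itself can be sketched in a few lines. `batched` here is an illustrative helper, not the pipeline's internal code.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = ["Text 1", "Text 2", "Text 3"] * 100
batches = list(batched(texts, 32))
print(f"{len(texts)} texts -> {len(batches)} batches, last batch has {len(batches[-1])} items")
```

Larger batches usually improve GPU utilization but need more memory, so the best value depends on the model and the hardware.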

### 10.2 Model Quantization

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Dynamic quantization: store the weights of Linear layers as 8-bit integers
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Compare model sizes
import os

def get_model_size(model):
    """Return the size in MB of the model's saved state dict."""
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p") / 1e6
    os.remove("temp.p")
    return size

print(f"Original model size: {get_model_size(model):.2f} MB")
print(f"Quantized model size: {get_model_size(quantized_model):.2f} MB")
```
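
To build intuition for why quantization shrinks a model, the sketch below maps a handful of float weights to 8-bit integers with a scale and a zero point, then maps them back. It is a simplified illustration of affine quantization in plain Python, with made-up weights. PyTorch's dynamic quantization uses its own observers and kernels, so treat this as the underlying idea rather than what `quantize_dynamic` does internally.

```python
from typing import List, Sequence, Tuple


def quantize_int8(values: Sequence[float]) -> Tuple[List[int], float, int]:
    """Map floats to int8 using an affine scheme: q = round(x / scale) + zero_point."""
    lo = min(min(values), 0.0)  # make sure 0.0 is inside the range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # 256 levels; avoid a zero scale
    zero_point = round(-128 - lo / scale)
    quantized = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized: Sequence[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from int8 values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]  # made-up weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"int8 values: {q}")
print(f"scale={scale:.5f}, zero_point={zero_point}, max error={max_error:.5f}")
```

Each weight now needs one byte instead of four, at the cost of a rounding error on the order of `scale`.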

### 10.3 Speeding Up Inference with ONNX

```python
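from transformers import AutoTokenizer, AutoModel
import torch

# Export to ONNX format
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False makes the model return plain tuples, which trace cleanly
model = AutoModel.from_pretrained(model_name, return_dict=False)
model.eval()  # disable dropout before tracing

# Prepare a dummy input
dummy_input = tokenizer(
    "Hello, world!",
    return_tensors="pt"
)

# Export to ONNX (inputs are passed in the same order as input_names)
torch.onnx.export(
    model,
    (dummy_input["input_ids"], dummy_input["attention_mask"]),
    "model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'sequence'},
        'attention_mask': {0: 'batch_size', 1: 'sequence'},
        # the hidden states have shape (batch, sequence, hidden)
        'output': {0: 'batch_size', 1: 'sequence'}
    }
)

print("Model exported to ONNX format")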
```

## 11. Hands-On Project: Building a Question-Answering System

### 11.1 Project Structure

```python
# qa_system.py
from transformers import pipeline
from typing import List, Dict
import json

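class QASystem:
    def __init__(self, model_name="uer/roberta-base-chinese-extractive-qa"):
        """Initialize the QA system.

        The sample documents below are in Chinese, so the default is a Chinese
        extractive-QA model; pass e.g. "deepset/roberta-base-squad2" for English.
        """
        self.qa_pipeline = pipeline("question-answering", model=model_name)
        self.knowledge_base = {}
        
    def add_document(self, doc_id: str, title: str, content: str):
        """Add a document to the knowledge base."""
        self.knowledge_base[doc_id] = {
            "title": title,
            "content": content
        }
        
    def load_knowledge_base(self, file_path: str):
        """Load the knowledge base from a JSON file (doc_id -> {title, content})."""
        with open(file_path, 'r', encoding='utf-8') as f:
            self.knowledge_base = json.load(f)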
            
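    def search_documents(self, query: str, top_k: int = 3) -> List[str]:
        """Simple keyword search based on character overlap.

        Chinese text has no spaces between words, so splitting on whitespace
        would treat a whole sentence as one token. Comparing sets of
        characters is crude but works for both Chinese and English.
        """
        query_chars = {ch for ch in query.lower() if ch.isalnum()}
        scores = {}
        
        for doc_id, doc in self.knowledge_base.items():
            content_chars = {ch for ch in doc['content'].lower() if ch.isalnum()}
            score = len(query_chars & content_chars)
            if score > 0:  # skip documents with no overlap at all
                scores[doc_id] = score
            
        # Return the highest-scoring documents
        sorted_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [doc_id for doc_id, _ in sorted_docs[:top_k]]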
    
    def answer_question(self, question: str, context: str = None) -> Dict:
        """Answer a question, using the knowledge base when no context is given."""
        if context is None:
            # Search the knowledge base for relevant documents
            relevant_docs = self.search_documents(question)
            if not relevant_docs:
                return {
                    "answer": "Sorry, I could not find relevant information in the knowledge base.",
                    "confidence": 0.0,
                    "source": None
                }
            
            # Merge the relevant documents into one context
            contexts = []
            for doc_id in relevant_docs:
                doc = self.knowledge_base[doc_id]
                contexts.append(f"{doc['title']}: {doc['content']}")
            context = " ".join(contexts)
            source = relevant_docs[0]
        else:
            source = "provided_context"
        
        # Run the question-answering model
        try:
            result = self.qa_pipeline(
                question=question,
                context=context,
                max_answer_len=100
            )
            
            return {
                "answer": result['answer'],
                "confidence": result['score'],
                "source": source,
                "context_used": context[:200] + "..." if len(context) > 200 else context
            }
        except Exception as e:
            return {
                "answer": f"Error while processing the question: {str(e)}",
                "confidence": 0.0,
                "source": None
            }
    
    def interactive_qa(self):
        """Interactive question-answering loop."""
        print("QA system started! Type 'quit' to exit.")
        print("Type 'add' to add a new document to the knowledge base.")
        print("-" * 50)
        
        while True:
            user_input = input("\nEnter your question: ").strip()
            
            if user_input.lower() == 'quit':
                print("Thanks for using the system. Goodbye!")
                break
                
            elif user_input.lower() == 'add':
                title = input("Document title: ")
                content = input("Document content: ")
                doc_id = f"doc_{len(self.knowledge_base) + 1}"
                self.add_document(doc_id, title, content)
                print(f"Document added, ID: {doc_id}")
                
            else:
                result = self.answer_question(user_input)
                print(f"\nAnswer: {result['answer']}")
                print(f"Confidence: {result['confidence']:.2%}")
                if result['source']:
                    print(f"Source: {result['source']}")

# Usage example
if __name__ == "__main__":
    # Create the QA system
    qa = QASystem()
    
    # Add some sample documents (Chinese text, used as test data)
    qa.add_document(
        "doc_1",
        "Python简介",
        "Python是一种高级编程语言，由Guido van Rossum在1991年创建。它以简洁易读的语法著称，广泛应用于数据科学、人工智能、Web开发等领域。"
    )
    
    qa.add_document(
        "doc_2",
        "Hugging Face介绍",
        "Hugging Face是一家专注于自然语言处理的公司，提供了Transformers库，使得使用预训练模型变得非常简单。它的模型中心托管了数万个模型。"
    )
    
    qa.add_document(
        "doc_3",
        "机器学习基础",
        "机器学习是人工智能的一个分支，让计算机能够从数据中学习。主要分为监督学习、无监督学习和强化学习三大类。"
    )
    
    # Test questions (in Chinese, matching the documents)
    questions = [
        "Python是什么时候创建的？",
        "Hugging Face提供了什么库？",
        "机器学习有哪些类型？",
        "谁创建了Python？"
    ]
    
    print("=== Testing the QA system ===\n")
    for q in questions:
        print(f"Question: {q}")
        result = qa.answer_question(q)
        print(f"Answer: {result['answer']}")
        print(f"Confidence: {result['confidence']:.2%}")
        print("-" * 50)
    
    # Start interactive mode
    # qa.interactive_qa()
```
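
`load_knowledge_base` reads a JSON file but the expected layout is not shown. Judging from `add_document`, the file should map each document ID to an object with `title` and `content` keys. The sketch below writes and validates such a file under that assumption. `save_knowledge_base` and `validate_knowledge_base` are hypothetical helpers, and the file name is arbitrary.

```python
import json
import os
import tempfile
from typing import Dict


def validate_knowledge_base(kb: Dict[str, Dict[str, str]]) -> None:
    """Raise ValueError if any entry lacks the 'title' or 'content' key."""
    for doc_id, doc in kb.items():
        missing = {"title", "content"} - set(doc)
        if missing:
            raise ValueError(f"{doc_id} is missing keys: {sorted(missing)}")


def save_knowledge_base(kb: Dict[str, Dict[str, str]], file_path: str) -> None:
    """Write the knowledge base as UTF-8 JSON, keeping non-ASCII text readable."""
    validate_knowledge_base(kb)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(kb, f, ensure_ascii=False, indent=2)


knowledge_base = {
    "doc_1": {"title": "Python简介", "content": "Python是一种高级编程语言。"},
    "doc_2": {"title": "机器学习基础", "content": "机器学习是人工智能的一个分支。"},
}

kb_path = os.path.join(tempfile.mkdtemp(), "knowledge_base.json")
save_knowledge_base(knowledge_base, kb_path)

with open(kb_path, "r", encoding="utf-8") as f:
    reloaded = json.load(f)
print(f"Saved and reloaded {len(reloaded)} documents")
```

A file written this way could then be passed to `qa.load_knowledge_base(kb_path)`.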

### 11.2 Enhanced Version: Adding a Vector Index

```python
# enhanced_qa_system.py
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Tuple
import faiss

class EnhancedQASystem:
    def __init__(self):
        # Sentence encoder; a multilingual model because the sample documents are Chinese
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
        
        # QA model (Chinese extractive QA, matching the sample documents)
        from transformers import pipeline
        self.qa_pipeline = pipeline(
            "question-answering",
            model="uer/roberta-base-chinese-extractive-qa"
        )
        
        # Vector index, sized from the encoder's output dimension
        self.dimension = self.encoder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(self.dimension)
        
        # Document store
        self.documents = []
        
    def add_document(self, text: str, metadata: dict = None):
        """Add a document and index its embedding."""
        # Encode the document
        embedding = self.encoder.encode([text])
        
        # Add it to the index
        self.index.add(embedding)
        
        # Store the document
        self.documents.append({
            'text': text,
            'metadata': metadata or {},
            'embedding': embedding[0]
        })
        
    def search_similar(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        """Search for documents similar to the query."""
        # Encode the query
        query_embedding = self.encoder.encode([query])
        
        # Search the index
        distances, indices = self.index.search(query_embedding, k)
        
        # Collect the results; FAISS pads with -1 when fewer than k documents exist
        results = []
        for idx, dist in zip(indices[0], distances[0]):
            if 0 <= idx < len(self.documents):
                results.append((int(idx), float(dist)))
        
        return results
    
    def answer_question_with_context(self, question: str) -> dict:
        """Answer a question using the most similar documents as context."""
        # Search for relevant documents
        similar_docs = self.search_similar(question, k=3)
        
        if not similar_docs:
            return {
                'answer': 'No relevant information found',
                'confidence': 0.0,
                'sources': []
            }
        
        # Build the context
        contexts = []
        sources = []
        for idx, distance in similar_docs:
            doc = self.documents[idx]
            contexts.append(doc['text'])
            sources.append({
                'text': doc['text'][:100] + '...',
                'similarity': 1 / (1 + distance)  # heuristic: map a distance to a score in (0, 1]
            })
        
        combined_context = ' '.join(contexts)
        
        # Run the question-answering model
        result = self.qa_pipeline(
            question=question,
            context=combined_context
        )
        
        return {
            'answer': result['answer'],
            'confidence': result['score'],
            'sources': sources
        }

# Usage example
if __name__ == "__main__":
    qa = EnhancedQASystem()
    
    # Add documents (Chinese text, used as test data)
    documents = [
        "深度学习是机器学习的一个子集，使用多层神经网络来学习数据的表示。",
        "卷积神经网络（CNN）特别适合处理图像数据，在计算机视觉任务中表现出色。",
        "循环神经网络（RNN）擅长处理序列数据，如文本和时间序列。",
        "Transformer架构革命性地改变了NLP领域，BERT和GPT都基于这种架构。"
    ]
    
    for doc in documents:
        qa.add_document(doc)
    
    # Test a question
    question = "什么神经网络适合处理图像？"
    result = qa.answer_question_with_context(question)
    
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']:.2%}")
    print("\nRelated documents:")
    for i, source in enumerate(result['sources']):
        print(f"{i+1}. {source['text']} (similarity: {source['similarity']:.2%})")
```
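
The `1 / (1 + distance)` conversion used above is a heuristic. If the embeddings are L2-normalized, there is an exact relationship between squared Euclidean distance and cosine similarity: cos = 1 - d²/2. The plain-Python sketch below checks that identity on made-up vectors. It assumes the index reports squared L2 distances, which is my understanding of `IndexFlatL2`; verify this against the FAISS documentation before relying on it.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance: float) -> float:
    """Valid only when both vectors have unit length."""
    return 1.0 - distance / 2.0


# Made-up embeddings for illustration.
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 1.0, 0.5])

d = squared_l2(a, b)
print(f"squared L2 distance: {d:.4f}")
print(f"cosine (direct):     {cosine_similarity(a, b):.4f}")
print(f"cosine (from L2):    {cosine_from_squared_l2(d):.4f}")
```

Normalizing the embeddings before adding them to the index would let you report a true cosine similarity instead of a heuristic score.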

## 12. Best Practices and Caveats

### 12.1 Model Selection Guide

```python
# model_selection_guide.py

def recommend_model(task: str, language: str = "en", size_constraint: str = "medium"):
    """Recommend a model for the given task, language, and size."""
    
    recommendations = {
        "sentiment-analysis": {
            "en": {
                "small": "distilbert-base-uncased-finetuned-sst-2-english",
                "medium": "cardiffnlp/twitter-roberta-base-sentiment-latest",
                "large": "siebert/sentiment-roberta-large-english"
            },
            "zh": {
                "small": "uer/roberta-base-finetuned-dianping-chinese",
                # The next two are base checkpoints: fine-tune them on
                # sentiment data before use
                "medium": "bert-base-chinese",
                "large": "hfl/chinese-roberta-wwm-ext-large"
            }
        },
        "text-generation": {
            "en": {
                "small": "gpt2",
                "medium": "gpt2-medium",
                "large": "gpt2-large"
            },
            "zh": {
                "small": "uer/gpt2-chinese-cluecorpussmall",
                "medium": "IDEA-CCNL/Wenzhong-GPT2-110M",
                "large": "IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
            }
        },
        "question-answering": {
            "en": {
                "small": "distilbert-base-cased-distilled-squad",
                "medium": "deepset/roberta-base-squad2",
                "large": "deepset/roberta-large-squad2"
            },
            "zh": {
                "small": "uer/roberta-base-chinese-extractive-qa",
                "medium": "luhua/chinese_pretrain_mrc_roberta_wwm_ext_base",
                "large": "luhua/chinese_pretrain_mrc_macbert_large"
            }
        }
    }
    
    if task in recommendations and language in recommendations[task]:
        return recommendations[task][language].get(size_constraint, "No suitable model found")
    else:
        return "Unsupported task or language"

# Usage example
print("Sentiment analysis (Chinese, small):", recommend_model("sentiment-analysis", "zh", "small"))
print("Text generation (English, medium):", recommend_model("text-generation", "en", "medium"))
print("Question answering (Chinese, large):", recommend_model("question-answering", "zh", "large"))
```

### 12.2 Error Handling and Logging

```python
import logging
from transformers import pipeline
import torch

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class SafeModelWrapper:
    def __init__(self, task: str, model: str = None):
        self.task = task
        self.model_name = model
        self.pipeline = None
        self._initialize()
    
    def _initialize(self):
        """Load the model, logging any failure."""
        try:
            logger.info(f"Loading model: {self.model_name or 'default'} for {self.task}")
            self.pipeline = pipeline(self.task, model=self.model_name)
            logger.info("Model loaded successfully")
        except Exception as e:
            logger.error(f"Failed to load model: {str(e)}")
            raise
    
    def predict(self, *args, **kwargs):
        """Run a prediction with error handling."""
        if self.pipeline is None:
            logger.error("Model is not initialized")
            return None
        
        try:
            result = self.pipeline(*args, **kwargs)
            logger.info("Prediction succeeded")
            return result
        except torch.cuda.OutOfMemoryError:
            logger.error("GPU out of memory")
            # Free cached memory, move the model to the CPU, and retry.
            # After the move no CUDA OOM can occur, so the retry cannot loop.
            torch.cuda.empty_cache()
            self.pipeline.model.to("cpu")
            self.pipeline.device = torch.device("cpu")
            logger.info("Switched to CPU; retrying")
            return self.predict(*args, **kwargs)
        except Exception as e:
            logger.error(f"Prediction failed: {str(e)}")
            return None

# Usage example
safe_classifier = SafeModelWrapper("sentiment-analysis")
result = safe_classifier.predict("This is a great product!")
if result:
    print(f"Prediction: {result}")
```

### 12.3 Resource Management

```python
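import gc
import torch
from contextlib import contextmanager
from transformers import pipeline  # needed for the usage example below

@contextmanager
def model_context(pipeline_func, *args, **kwargs):
    """Context manager that releases model resources on exit."""
    model = None
    try:
        model = pipeline_func(*args, **kwargs)
        yield model
    finally:
        # Release resources
        if model is not None:
            del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

# Usage example
with model_context(pipeline, "sentiment-analysis") as classifier:
    result = classifier("I love this!")
    print(result)
# The context manager drops its own reference on exit; also `del classifier`
# (or let it go out of scope) so the memory can actually be reclaimed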
```

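## 13. Summary

### 13.1 What We Learned

1. **Hugging Face basics**
   - Using the Model Hub
   - Core concepts of the Transformers library
   - The convenience of the Pipeline API

2. **Implementing a range of tasks**
   - Text classification, generation, and question answering
   - Image classification and object detection
   - Speech recognition and audio classification
   - Multimodal tasks

3. **Advanced techniques**
   - Fine-tuning
   - Performance optimization
   - Error handling
   - Resource management

4. **Practical applications**
   - Building a question-answering system
   - Integrating a vector index
   - Production considerations

### 13.2 Suggested Next Steps

1. **Go deeper in one area**
   - Pick a direction that interests you and study it in depth
   - Try more specialized models
   - Join community discussions

2. **Model training and fine-tuning**
   - Learn how to train your own models
   - Practice dataset preparation
   - Get familiar with distributed training

3. **Deployment and optimization**
   - Learn model deployment techniques
   - Practice model compression and acceleration
   - Explore deployment on edge devices

4. **Contribute to open source**
   - Contribute code to the Transformers library
   - Share the models you train
   - Write tutorials to help others

### 13.3 Useful Resources

1. **Official resources**
   - Hugging Face website: https://huggingface.co/
   - Transformers documentation: https://huggingface.co/docs/transformers/
   - Model Hub: https://huggingface.co/models

2. **Learning resources**
   - Hugging Face course: https://huggingface.co/course/
   - Papers: https://papers.huggingface.co/
   - Community forum: https://discuss.huggingface.co/

3. **Related tools**
   - Gradio: quickly build model demos
   - Streamlit: build data applications
   - FastAPI: build API services

### 13.4 Closing Thoughts

Hugging Face makes AI models easier to use than ever before. Having worked through this tutorial, you now have the basic skills to use a wide variety of them. Remember:

- **Start simple**: use the Pipeline API first to get something working quickly
- **Go deeper step by step**: move to the lower-level APIs when you need more control
- **Focus on practice**: build projects and learn by doing
- **Keep learning**: the AI field moves fast, so keep an eye on new techniques

You now have a powerful AI toolbox. Go build something amazing!

**Happy Modeling! 🤗**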