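# Polynomial Kernel SVR

## Theoretical Background

When the relationship between features and target is nonlinear, a linear SVR fits it poorly. Polynomial kernel SVR uses the kernel trick to map the data implicitly into a higher-dimensional feature space and performs linear regression there, which yields a nonlinear fit in the original input space.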

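### The Polynomial Kernel Function

The polynomial kernel is defined as:

$$K(x, x') = (\gamma \cdot x^\top x' + r)^d$$

where:
- $d$: polynomial degree (`degree`)
- $\gamma$: scaling parameter (`gamma`)
- $r$: independent term (`coef0`), which controls the relative weight of higher-order versus lower-order terms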
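
The definition above can be checked numerically. The following is a minimal sketch, assuming NumPy and scikit-learn are available; `poly_kernel` is a hypothetical helper written for this illustration, and the random matrices `A` and `B` are made-up inputs. It compares a hand-written Gram matrix against `sklearn.metrics.pairwise.polynomial_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel


def poly_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0):
    """Gram matrix K[i, j] = (gamma * <X[i], Y[j]> + coef0) ** degree.

    Hypothetical helper for illustration only.
    """
    return (gamma * X @ Y.T + coef0) ** degree


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))  # 4 made-up samples with 2 features
B = rng.normal(size=(3, 2))  # 3 made-up samples with 2 features

K_manual = poly_kernel(A, B, degree=2, gamma=0.5, coef0=1.0)
K_sklearn = polynomial_kernel(A, B, degree=2, gamma=0.5, coef0=1.0)

# Both matrices have shape (4, 3) and should agree up to rounding
print(K_manual.shape, np.allclose(K_manual, K_sklearn))
```

Setting `coef0=0` removes every term of degree lower than `degree` from the expansion, which is why $r$ matters when the target contains lower-order terms.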

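### Implicit Feature Mapping

Take the second-degree polynomial kernel ($d=2$, $\gamma=1$, $r=0$) as an example. For $x = (x_1, x_2)$:

$$\phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$

The key to the kernel trick is that $\phi(x)$ never has to be computed explicitly: the kernel function returns the inner product $K(x, x') = \langle\phi(x), \phi(x')\rangle$ directly.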
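
The identity can be verified on a concrete pair of points. This is a small sketch under the same settings ($d=2$, $\gamma=1$, $r=0$); the function name `phi` and the two sample vectors are chosen here purely for illustration.

```python
import numpy as np


def phi(x):
    """Explicit feature map of the kernel (x . x')^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z) ** 2        # kernel trick: stays in the 2-D input space
explicit_value = phi(x) @ phi(z)   # inner product in the 3-D feature space

# The two values agree up to floating-point rounding
print(kernel_value, explicit_value, np.isclose(kernel_value, explicit_value))
```

For higher degrees and more input features the explicit map grows quickly in dimension, while the kernel evaluation still costs a single inner product and a power.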

## 1. Environment Setup and Data Preparation

In [None]:
# =============================================================================
# Import the required libraries
# =============================================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set the random seed for reproducibility
np.random.seed(42)

# matplotlib configuration
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100

In [None]:
# =============================================================================
# Generate data with a polynomial relationship
# =============================================================================

def generate_polynomial_data(n_samples=150, noise_level=0.5, polynomial_degree=2):
    """
    Generate regression data with a polynomial relationship.
    
    True function: y = 0.5*x^2 - 2*x + 1 + noise
    
    Parameters:
    -----------
    n_samples : int
        Number of samples.
    noise_level : float
        Standard deviation of the Gaussian noise.
    polynomial_degree : int
        Currently unused: the generating function is fixed at degree 2.
    
    Returns:
    --------
    X : ndarray, shape (n_samples, 1)
        Feature matrix.
    y : ndarray, shape (n_samples,)
        Noisy target values.
    y_true : ndarray, shape (n_samples,)
        Noise-free target values.
    """
    X = np.sort(np.random.uniform(-3, 3, n_samples))
    
    # Quadratic ground truth: y = 0.5*x^2 - 2*x + 1
    y_true = 0.5 * X**2 - 2 * X + 1
    y = y_true + np.random.normal(0, noise_level, n_samples)
    
    return X.reshape(-1, 1), y, y_true

# Generate the data
X, y, y_true = generate_polynomial_data(n_samples=150, noise_level=0.5)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Feature range: X ∈ [{X.min():.2f}, {X.max():.2f}]")
print(f"Target range: y ∈ [{y.min():.2f}, {y.max():.2f}]")

## 2. Data Visualization and Analysis

In [None]:
# =============================================================================
# Visualize the raw data and the true polynomial function
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: scatter plot and true curve
ax = axes[0]
ax.scatter(X_train, y_train, c='steelblue', alpha=0.6, s=40, 
           edgecolors='white', label='Training data')
ax.scatter(X_test, y_test, c='coral', alpha=0.6, s=40, 
           edgecolors='white', label='Test data')

# Plot the true function
X_line = np.linspace(-3.5, 3.5, 200).reshape(-1, 1)
y_true_line = 0.5 * X_line**2 - 2 * X_line + 1
ax.plot(X_line, y_true_line, 'g--', linewidth=2, label='True function: $0.5x^2 - 2x + 1$')

ax.set_xlabel('X', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Polynomial regression data', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right panel: residual distribution
ax = axes[1]
# For the raw data, the residuals against the true function are exactly the noise
X_flat = X.flatten()
residuals = y - (0.5 * X_flat**2 - 2 * X_flat + 1)
ax.hist(residuals, bins=25, color='steelblue', alpha=0.7, edgecolor='white')
ax.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax.set_xlabel('Residual (noise)', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title(f'Noise distribution (std={residuals.std():.3f})', fontsize=14)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

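## 3. Training the Polynomial Kernel SVR

### Core Parameters

| Parameter | Role | Typical values |
|-----------|------|----------------|
| `kernel` | Kernel type | 'poly' |
| `degree` | Polynomial degree | 2, 3, 4, 5 |
| `C` | Regularization parameter (larger values mean weaker regularization) | 0.1 ~ 1000 |
| `epsilon` | Width of the ε-insensitive tube (errors within ±ε are not penalized) | 0.01 ~ 1.0 |
| `gamma` | Kernel coefficient | 'scale', 'auto', or a number |
| `coef0` | Independent term | 0 ~ 10 |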
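
The string options for `gamma` resolve to concrete numbers at fit time. According to the scikit-learn documentation, `'scale'` uses `1 / (n_features * X.var())` and `'auto'` uses `1 / n_features`. The sketch below illustrates this on synthetic data; the names `X_demo`, `y_demo`, and `gamma_scale` are introduced here for the example only. It fits one model with `gamma='scale'` and one with the explicitly computed value, then compares their predictions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(60, 1))               # made-up, unscaled feature
y_demo = 0.5 * X_demo[:, 0] ** 2 - 2 * X_demo[:, 0] + 1  # noise-free quadratic target

# The value that gamma='scale' is documented to resolve to for this matrix
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

svr_scale = SVR(kernel='poly', degree=2, coef0=1, gamma='scale').fit(X_demo, y_demo)
svr_explicit = SVR(kernel='poly', degree=2, coef0=1, gamma=gamma_scale).fit(X_demo, y_demo)

# The two models should produce matching predictions
same = np.allclose(svr_scale.predict(X_demo), svr_explicit.predict(X_demo))
print(f"gamma_scale = {gamma_scale:.4f}, predictions match: {same}")
```

After `StandardScaler`, a single feature has variance close to 1, so `'scale'` and `'auto'` give nearly the same value in the pipelines used in this notebook.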

In [None]:
# =============================================================================
# Train a second-degree polynomial kernel SVR
# =============================================================================

# Build the pipeline
poly_svr = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(
        kernel='poly',
        degree=2,           # second-degree polynomial
        C=100,              # regularization parameter
        epsilon=0.1,        # ε-tube width
        gamma='scale',      # 1 / (n_features * X.var())
        coef0=1             # independent term
    ))
])

# Train the model
poly_svr.fit(X_train, y_train)

# Predict
y_pred_train = poly_svr.predict(X_train)
y_pred_test = poly_svr.predict(X_test)

# Evaluate
print("=" * 50)
print("Second-degree polynomial kernel SVR: performance")
print("=" * 50)
print("\nTraining set:")
print(f"  R² Score: {r2_score(y_train, y_pred_train):.4f}")
print(f"  MSE: {mean_squared_error(y_train, y_pred_train):.4f}")
print(f"  MAE: {mean_absolute_error(y_train, y_pred_train):.4f}")

print("\nTest set:")
print(f"  R² Score: {r2_score(y_test, y_pred_test):.4f}")
print(f"  MSE: {mean_squared_error(y_test, y_pred_test):.4f}")
print(f"  MAE: {mean_absolute_error(y_test, y_pred_test):.4f}")

# Support vector statistics
n_sv = poly_svr.named_steps['svr'].n_support_
print(f"\nNumber of support vectors: {n_sv[0]}")
print(f"Share of training samples that are support vectors: {n_sv[0] / len(X_train) * 100:.1f}%")

In [None]:
# =============================================================================
# Visualize the fit
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: fitted curve
ax = axes[0]
ax.scatter(X_train, y_train, c='steelblue', alpha=0.5, s=30, label='Training data')
ax.scatter(X_test, y_test, c='coral', alpha=0.5, s=30, label='Test data')

# Prediction curve
X_plot = np.linspace(X.min() - 0.5, X.max() + 0.5, 300).reshape(-1, 1)
y_plot = poly_svr.predict(X_plot)

ax.plot(X_plot, y_plot, 'r-', linewidth=2, label='Poly SVR prediction')
ax.plot(X_line, y_true_line, 'g--', linewidth=2, alpha=0.7, label='True function')

# Mark the support vectors
svr_model = poly_svr.named_steps['svr']
scaler = poly_svr.named_steps['scaler']
sv_indices = svr_model.support_
ax.scatter(X_train[sv_indices], y_train[sv_indices], 
           s=100, facecolors='none', edgecolors='purple', 
           linewidths=2, label='Support vectors')

ax.set_xlabel('X', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Second-degree polynomial kernel SVR fit', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right panel: predicted vs. actual values
ax = axes[1]
ax.scatter(y_train, y_pred_train, c='steelblue', alpha=0.6, label='Training set')
ax.scatter(y_test, y_pred_test, c='coral', alpha=0.6, label='Test set')

# Ideal line; the limits cover the targets and both sets of predictions
all_pred = np.concatenate([y_pred_train, y_pred_test])
lims = [min(y.min(), all_pred.min()) - 0.5, max(y.max(), all_pred.max()) + 0.5]
ax.plot(lims, lims, 'k--', alpha=0.7, label='Ideal line (y=x)')

ax.set_xlabel('Actual value', fontsize=12)
ax.set_ylabel('Predicted value', fontsize=12)
ax.set_title(f'Predicted vs. actual (test R² = {r2_score(y_test, y_pred_test):.3f})', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Comparing Polynomial Degrees

In [None]:
# =============================================================================
# Compare the effect of different polynomial degrees
# =============================================================================

degrees = [1, 2, 3, 4, 5]
results = []

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, degree in enumerate(degrees):
    # Train the model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('svr', SVR(kernel='poly', degree=degree, C=100, epsilon=0.1, coef0=1))
    ])
    model.fit(X_train, y_train)
    
    # Predict and evaluate on the test set
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    n_sv = model.named_steps['svr'].n_support_[0]
    
    results.append({
        'degree': degree,
        'r2': r2,
        'mse': mse,
        'n_sv': n_sv
    })
    
    # Plot the fit for this degree
    ax = axes[idx]
    ax.scatter(X_train, y_train, c='steelblue', alpha=0.4, s=20)
    y_plot = model.predict(X_plot)
    ax.plot(X_plot, y_plot, 'r-', linewidth=2)
    ax.plot(X_line, y_true_line, 'g--', linewidth=1.5, alpha=0.7)
    ax.set_title(f'Degree = {degree}\nR² = {r2:.3f}, SV = {n_sv}', fontsize=11)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.grid(True, alpha=0.3)

# Last subplot: performance comparison
ax = axes[-1]
degrees_list = [r['degree'] for r in results]
r2_list = [r['r2'] for r in results]
ax.bar(degrees_list, r2_list, color='steelblue', alpha=0.7, edgecolor='white')
ax.set_xlabel('Polynomial degree')
ax.set_ylabel('R² Score')
ax.set_title('Test-set R² by degree', fontsize=11)
ax.set_xticks(degrees_list)
ax.grid(True, alpha=0.3, axis='y')

plt.suptitle('Polynomial kernel SVR: comparison across degrees', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

# Print the results table
print("\n" + "="*60)
print("Performance by polynomial degree")
print("="*60)
print(f"{'Degree':<8} {'R²':<12} {'MSE':<12} {'Support vectors':<12}")
print("-"*60)
for r in results:
    print(f"{r['degree']:<8} {r['r2']:<12.4f} {r['mse']:<12.4f} {r['n_sv']:<12}")

## 5. Hyperparameter Tuning

In [None]:
# =============================================================================
# Tune with grid search
# =============================================================================

# Define the parameter grid
param_grid = {
    'svr__degree': [2, 3],
    'svr__C': [10, 100],
    'svr__epsilon': [0.05, 0.1],
    'svr__coef0': [0, 1]
}

# Create the base model
base_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='poly', gamma='scale'))
])

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    base_model,
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest cross-validated MSE: {-grid_search.best_score_:.4f}")

# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

print("\nTest set performance:")
print(f"  R² Score: {r2_score(y_test, y_pred_best):.4f}")
print(f"  MSE: {mean_squared_error(y_test, y_pred_best):.4f}")
print(f"  MAE: {mean_absolute_error(y_test, y_pred_best):.4f}")

## 6. Learning Curve Analysis

In [None]:
# =============================================================================
# Plot the learning curve
# =============================================================================

train_sizes, train_scores, test_scores = learning_curve(
    best_model, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Negate the scores to obtain positive MSE values
train_scores_mean = -train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = -test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)

fig, ax = plt.subplots(figsize=(10, 6))

# Shade one standard deviation around each mean curve
ax.fill_between(train_sizes, 
                train_scores_mean - train_scores_std,
                train_scores_mean + train_scores_std, 
                alpha=0.2, color='steelblue')
ax.fill_between(train_sizes, 
                test_scores_mean - test_scores_std,
                test_scores_mean + test_scores_std, 
                alpha=0.2, color='coral')

ax.plot(train_sizes, train_scores_mean, 'o-', color='steelblue', 
        linewidth=2, label='Training MSE')
ax.plot(train_sizes, test_scores_mean, 'o-', color='coral', 
        linewidth=2, label='Validation MSE')

ax.set_xlabel('Number of training samples', fontsize=12)
ax.set_ylabel('MSE', fontsize=12)
ax.set_title('Learning curve: polynomial kernel SVR', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

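print("\nReading the learning curve:")
print("- Training and validation errors both high and close together: the model underfits")
print("- Training error low but validation error high: the model overfits")
print("- Both errors low and close together: the model fits well")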

## 7. Unit Tests

In [None]:
# =============================================================================
# Unit tests
# =============================================================================

def run_tests():
    """Run all unit tests."""
    test_results = []
    
    # Test 1: data generation function
    try:
        X_t, y_t, y_true_t = generate_polynomial_data(n_samples=50)
        assert X_t.shape == (50, 1), "Wrong feature shape"
        assert y_t.shape == (50,), "Wrong target shape"
        assert y_true_t.shape == (50,), "Wrong shape for noise-free values"
        test_results.append(("Data generation function", True, ""))
    except Exception as e:
        test_results.append(("Data generation function", False, str(e)))
    
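    # Test 2: the polynomial kernel SVR can be trained
    try:
        # coef0=1 keeps the linear term in the kernel expansion; with the
        # default coef0=0 a degree-2 kernel cannot represent the -2*x term
        test_model = Pipeline([
            ('scaler', StandardScaler()),
            ('svr', SVR(kernel='poly', degree=2, C=10, coef0=1))
        ])
        test_model.fit(X_train, y_train)
        assert hasattr(test_model.named_steps['svr'], 'support_'), "Model was not fitted correctly"
        test_results.append(("Model training", True, ""))
    except Exception as e:
        test_results.append(("Model training", False, str(e)))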
    
    # Test 3: prediction output
    try:
        predictions = test_model.predict(X_test)
        assert predictions.shape == y_test.shape, "Prediction shape mismatch"
        assert not np.any(np.isnan(predictions)), "Predictions contain NaN"
        test_results.append(("Prediction output", True, ""))
    except Exception as e:
        test_results.append(("Prediction output", False, str(e)))
    
    # Test 4: kernels of different degrees
    try:
        for degree in [1, 2, 3, 4]:
            m = SVR(kernel='poly', degree=degree)
            m.fit(StandardScaler().fit_transform(X_train), y_train)
        test_results.append(("Kernels of different degrees", True, ""))
    except Exception as e:
        test_results.append(("Kernels of different degrees", False, str(e)))
    
    # Test 5: the number of support vectors is plausible
    try:
        n_sv = test_model.named_steps['svr'].n_support_[0]
        assert 0 < n_sv <= len(X_train), "Implausible number of support vectors"
        test_results.append(("Support vector count", True, f"SV={n_sv}"))
    except Exception as e:
        test_results.append(("Support vector count", False, str(e)))
    
    # Test 6: R² is in a reasonable range
    try:
        r2 = r2_score(y_test, predictions)
        assert r2 > 0, f"R² is too low: {r2}"
        test_results.append(("R² sanity check", True, f"R²={r2:.4f}"))
    except Exception as e:
        test_results.append(("R² sanity check", False, str(e)))
    
    # Print the results
    print("="*60)
    print("Unit test results")
    print("="*60)
    
    passed = 0
    for name, success, msg in test_results:
        status = "✓ PASS" if success else "✗ FAIL"
        passed += int(success)
        print(f"{status} | {name}")
        if msg:
            print(f"       {msg}")
    
    print("="*60)
    print(f"Total: {passed}/{len(test_results)} tests passed")
    print("="*60)
    
    return passed == len(test_results)

# Run the tests
all_passed = run_tests()

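## 8. Summary

### Key Points of Polynomial Kernel SVR

1. **When to use it**
   - The data follows a clearly polynomial relationship
   - It tends to work better when the feature dimension is low
   - You want a nonlinear model whose complexity is governed by one easy-to-reason-about setting (`degree`)

2. **Parameter selection guide**
   - `degree`: start low and increase gradually; high degrees overfit easily
   - `C`: use a smaller C when the data is noisy
   - `coef0`: shifts the balance between higher-order and lower-order terms

3. **Comparison with the RBF kernel**
   - Polynomial kernel: better suited to data with a known polynomial structure
   - RBF kernel: more general-purpose, suited to nonlinear relationships of unknown form
   - The polynomial kernel can be cheaper to compute, particularly at low degree

4. **Common pitfalls**
   - High-degree polynomials can cause numerical instability
   - Feature scaling matters a great deal for the polynomial kernel
   - An excessively high `degree` leads to severe overfitting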
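
The points about numerical stability and feature scaling can be made concrete by looking at the size of the kernel values themselves. The following is an illustrative sketch, not part of the experiments above: `X_raw` is a made-up feature on a scale of 0 to 1000, and `gamma=1.0`, `coef0=1.0` are chosen only for the demonstration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1000, size=(50, 1))     # hypothetical unscaled feature
X_std = StandardScaler().fit_transform(X_raw)  # zero mean, unit variance

for d in (2, 5):
    # Gram matrices of the same data before and after standardization
    K_raw = polynomial_kernel(X_raw, degree=d, gamma=1.0, coef0=1.0)
    K_std = polynomial_kernel(X_std, degree=d, gamma=1.0, coef0=1.0)
    print(f"degree={d}: max |K| raw = {np.abs(K_raw).max():.3e}, "
          f"standardized = {np.abs(K_std).max():.3e}")
```

On the raw feature the largest kernel entries grow by many orders of magnitude as the degree increases, while the standardized feature keeps them moderate, which is one reason every model in this notebook places `StandardScaler` in front of `SVR`.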
