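# Linear Support Vector Regression (Linear SVR)

## Theoretical Background

Support vector regression (SVR) applies the large-margin idea of SVMs to regression. Unlike classification, the goal is to find a function $f(x) = w^\top x + b$ that is as flat as possible (small $\|w\|$) while keeping the deviation between predictions and true values within a threshold $\epsilon$ for as many training points as possible.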

### ε-Insensitive Loss Function

SVR uses the ε-insensitive loss function:

$$
L_\epsilon(y, f(x)) = \begin{cases}
0, & |y - f(x)| \le \epsilon \\
|y - f(x)| - \epsilon, & \text{otherwise}
\end{cases}
$$

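This means:
- When the prediction lies within $\pm\epsilon$ of the true value, the loss is 0
- Only the part of the error that exceeds this range is penalized, and the penalty grows linearly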
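
As a rough illustration of the definition above, the loss can be written in a few lines of NumPy. This is a minimal sketch, not scikit-learn's internal implementation; the helper name `epsilon_insensitive_loss` and the sample values are made up for this example.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Per-sample ε-insensitive loss: max(0, |y - f(x)| - ε).

    Hypothetical helper for illustration only.
    """
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Toy values chosen by hand: absolute errors are 0.2, 0.5, 1.0 and 2.0
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.5, 4.0, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)  # zero for the two points inside the tube, 0.5 and 1.5 for the others
```

Errors of 0.2 and 0.5 fall inside the tube and cost nothing; errors of 1.0 and 2.0 are charged only for the part beyond ε.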

### Optimization Objective

$$
\min_{w,b,\xi,\xi^*} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)
$$

Subject to, for $i = 1, \dots, n$:
- $y_i - (w^\top x_i + b) \le \epsilon + \xi_i$
- $(w^\top x_i + b) - y_i \le \epsilon + \xi_i^*$
- $\xi_i, \xi_i^* \ge 0$
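
To make the role of the slack variables concrete, the sketch below evaluates the objective for a fixed, hand-picked $(w, b)$. For fixed $w$ and $b$, the smallest slacks that satisfy the constraints are $\xi_i = \max(0,\, y_i - f(x_i) - \epsilon)$ and $\xi_i^* = \max(0,\, f(x_i) - y_i - \epsilon)$. The function name `svr_primal_objective` and the toy data are assumptions for illustration; the code only evaluates the objective and does not solve the optimization problem.

```python
import numpy as np

def svr_primal_objective(w, b, X, y, C=1.0, epsilon=0.5):
    """Evaluate 0.5*||w||^2 + C*sum(xi + xi*) using the smallest feasible slacks.

    Hypothetical helper for illustration only.
    """
    pred = X @ w + b
    xi = np.maximum(0.0, y - pred - epsilon)        # target lies above the tube
    xi_star = np.maximum(0.0, pred - y - epsilon)   # target lies below the tube
    objective = 0.5 * np.dot(w, w) + C * np.sum(xi + xi_star)
    return objective, xi, xi_star

# Toy data: the line f(x) = x fits the first two points exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 4.0, 2.0])
w_toy = np.array([1.0])
b_toy = 0.0

objective, xi, xi_star = svr_primal_objective(w_toy, b_toy, X_toy, y_toy, C=1.0, epsilon=0.5)
print("xi      =", xi)
print("xi_star =", xi_star)
print("objective =", objective)
```

At most one of $\xi_i$ and $\xi_i^*$ is nonzero for any sample, since a point cannot be above and below the tube at the same time.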

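### Key Parameters

- **C**: regularization parameter that trades off model complexity against fit to the training data. A larger C penalizes errors outside the tube more heavily (weaker regularization); a smaller C favors a smaller $\|w\|$ (stronger regularization)
- **ε (epsilon)**: half-width of the insensitive region, i.e. the radius of the "tube" around the prediction, measured in units of the target $y$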
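
The effect of both parameters can be checked numerically. The following is a small sketch on separately generated toy data (the names `X_demo`, `y_demo`, and `fit_linear_svr`, as well as the chosen parameter values, are assumptions for this example): it reports the coefficient magnitude for several values of C and the fraction of training points that fall outside the tube for several values of ε.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Toy linear data (hypothetical): y = 3x + Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 3.0 * X_demo.ravel() + rng.normal(0, 0.5, size=200)

def fit_linear_svr(C, epsilon):
    """Fit a standardize-then-LinearSVR pipeline on the toy data."""
    model = make_pipeline(
        StandardScaler(),
        LinearSVR(C=C, epsilon=epsilon, max_iter=10000, random_state=0),
    )
    return model.fit(X_demo, y_demo)

# Effect of C: stronger regularization (small C) shrinks the coefficient
coef_norms = {}
for C in [0.001, 0.1, 10.0]:
    svr = fit_linear_svr(C, epsilon=0.5).named_steps["linearsvr"]
    coef_norms[C] = float(np.linalg.norm(svr.coef_))
    print(f"C = {C:<6} ||w|| = {coef_norms[C]:.3f}")

# Effect of epsilon: a wider tube leaves fewer points outside it
outside_fraction = {}
for eps in [0.0, 0.5, 2.0]:
    model = fit_linear_svr(C=1.0, epsilon=eps)
    residual = np.abs(y_demo - model.predict(X_demo))
    outside_fraction[eps] = float(np.mean(residual > eps))
    print(f"ε = {eps:<4} fraction outside tube = {outside_fraction[eps]:.2f}")
```

Only the points outside the tube contribute to the loss, so the second loop also indicates how many samples actually influence the fit at each ε.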

## 1. Environment Setup and Data Preparation

In [None]:
# =============================================================================
# Import the required libraries
# =============================================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

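# Fix the random seed so that results are reproducible
np.random.seed(42)

# Configure matplotlib (the font list includes CJK-capable fonts for non-ASCII labels)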
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100

In [None]:
# =============================================================================
# Generate noisy nonlinear regression data
# =============================================================================

def generate_regression_data(n_samples=100, noise_level=0.3):
    """
    Generate a synthetic dataset for regression.
    
    True function: y = 2*sin(x) + 0.5*x
    
    Parameters:
    -----------
    n_samples : int
        Number of samples
    noise_level : float
        Standard deviation of the Gaussian noise
    
    Returns:
    --------
    X : ndarray, shape (n_samples, 1)
        Feature matrix
    y : ndarray, shape (n_samples,)
        Target values
    """
    X = np.sort(np.random.uniform(-3, 3, n_samples))
    y_true = 2 * np.sin(X) + 0.5 * X
    y = y_true + np.random.normal(0, noise_level, n_samples)
    return X.reshape(-1, 1), y

# Generate the data
X, y = generate_regression_data(n_samples=100, noise_level=0.3)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

## 2. Data Visualization

In [None]:
# =============================================================================
# Visualize the raw data distribution
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: scatter plot
axes[0].scatter(X_train, y_train, c='steelblue', alpha=0.7, 
                edgecolors='white', s=50, label='Training data')
axes[0].scatter(X_test, y_test, c='coral', alpha=0.7, 
                edgecolors='white', s=50, label='Test data')

# Plot the true (noise-free) function
X_line = np.linspace(-3, 3, 200).reshape(-1, 1)
y_true_line = 2 * np.sin(X_line) + 0.5 * X_line
axes[0].plot(X_line, y_true_line, 'g--', linewidth=2, label='True function')

axes[0].set_xlabel('X', fontsize=12)
axes[0].set_ylabel('y', fontsize=12)
axes[0].set_title('Regression data distribution', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right panel: histogram of the target values
axes[1].hist(y_train, bins=20, color='steelblue', alpha=0.7, 
             edgecolor='white', label='Training set')
axes[1].hist(y_test, bins=15, color='coral', alpha=0.7, 
             edgecolor='white', label='Test set')
axes[1].set_xlabel('y', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Target value distribution', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Model Training and ε Parameter Analysis

In [None]:
# =============================================================================
# Explore how different epsilon values affect the model
# =============================================================================

epsilon_values = [0.0, 0.5, 1.0, 1.5, 2.0]
models = {}

for eps in epsilon_values:
    # Build the Pipeline: standardization + LinearSVR
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('svr', LinearSVR(epsilon=eps, C=1.0, max_iter=10000, random_state=42))
    ])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    models[eps] = pipeline
    
    # Evaluate on both splits
    y_pred_train = pipeline.predict(X_train)
    y_pred_test = pipeline.predict(X_test)
    
    print(f"ε = {eps:.1f}:")
    print(f"  Training R² = {r2_score(y_train, y_pred_train):.4f}")
    print(f"  Test R² = {r2_score(y_test, y_pred_test):.4f}")
    print(f"  Test MSE = {mean_squared_error(y_test, y_pred_test):.4f}")
    print()

In [None]:
# =============================================================================
# Visualize the fit for different epsilon values
# =============================================================================

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

X_plot = np.linspace(X.min() - 0.5, X.max() + 0.5, 300).reshape(-1, 1)

for idx, (eps, model) in enumerate(models.items()):
    ax = axes[idx]
    
    # Predict on the plotting grid
    y_plot = model.predict(X_plot)
    
    # Plot the training points
    ax.scatter(X_train, y_train, c='steelblue', alpha=0.6, 
               edgecolors='white', s=40, label='Training data')
    
    # Plot the prediction line and the ε-tube (only X is scaled, so ε is in units of y)
    ax.plot(X_plot, y_plot, 'r-', linewidth=2, label='SVR prediction')
    ax.fill_between(X_plot.ravel(), y_plot - eps, y_plot + eps, 
                    alpha=0.2, color='red', label=f'ε-tube (±{eps})')
    
    # Test-set performance
    r2 = r2_score(y_test, model.predict(X_test))
    
    ax.set_xlabel('X', fontsize=11)
    ax.set_ylabel('y', fontsize=11)
    ax.set_title(f'ε = {eps:.1f} (R² = {r2:.3f})', fontsize=12)
    ax.legend(loc='upper left', fontsize=9)
    ax.grid(True, alpha=0.3)

# Hide the unused sixth subplot
axes[-1].axis('off')

plt.suptitle('LinearSVR: fit comparison for different ε values', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 4. C Parameter (Regularization Strength) Analysis

In [None]:
# =============================================================================
# Explore how different C values affect the model
# =============================================================================

C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
epsilon_fixed = 0.5

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for idx, C in enumerate(C_values):
    # Build and train the model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('svr', LinearSVR(epsilon=epsilon_fixed, C=C, max_iter=10000, random_state=42))
    ])
    pipeline.fit(X_train, y_train)
    
    # Predict (X_plot is the plotting grid defined in the previous cell)
    y_plot = pipeline.predict(X_plot)
    r2 = r2_score(y_test, pipeline.predict(X_test))
    
    # Visualize
    ax = axes[idx]
    ax.scatter(X_train, y_train, c='steelblue', alpha=0.6, s=30)
    ax.plot(X_plot, y_plot, 'r-', linewidth=2)
    ax.fill_between(X_plot.ravel(), y_plot - epsilon_fixed, y_plot + epsilon_fixed, 
                    alpha=0.2, color='red')
    ax.set_title(f'C = {C}\nR² = {r2:.3f}', fontsize=11)
    ax.set_xlabel('X')
    ax.grid(True, alpha=0.3)

axes[0].set_ylabel('y')
plt.suptitle(f'LinearSVR: fit for different C values (ε = {epsilon_fixed})', fontsize=13, y=1.05)
plt.tight_layout()
plt.show()

print("\nSummary of parameter effects:")
print("- Small C: strong regularization, simpler model, may underfit")
print("- Large C: weak regularization, more complex model, may overfit")
print("- For linear SVR the model can only change its slope and intercept, so the effect of C is relatively limited")

## 5. Hyperparameter Tuning

In [None]:
# =============================================================================
# Tune hyperparameters with grid search
# =============================================================================

# Define the parameter grid
param_grid = {
    'svr__C': [0.1, 1.0, 10.0],
    'svr__epsilon': [0.1, 0.5, 1.0]
}

# Create the Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', LinearSVR(max_iter=10000, random_state=42))
])

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    pipeline, 
    param_grid, 
    cv=5,
    scoring='neg_mean_squared_error',
    return_train_score=True,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
# The scorer returns negative MSE, so flip the sign for reporting
print(f"Best cross-validated MSE: {-grid_search.best_score_:.4f}")

# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test)

print(f"\nTest set performance:")
print(f"  R² Score: {r2_score(y_test, y_pred_test):.4f}")
print(f"  MSE: {mean_squared_error(y_test, y_pred_test):.4f}")
print(f"  MAE: {mean_absolute_error(y_test, y_pred_test):.4f}")
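
Because the search is run with `return_train_score=True`, it also stores training scores, which can be compared with the validation scores for every parameter combination. The following is a minimal, self-contained sketch on separately generated toy data; the names `X_demo`, `y_demo`, and `demo_search` are hypothetical and not part of the notebook's own pipeline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Toy data (hypothetical) drawn from the same kind of function as the notebook
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(80, 1))
y_demo = 2 * np.sin(X_demo.ravel()) + 0.5 * X_demo.ravel() + rng.normal(0, 0.3, size=80)

demo_search = GridSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("svr", LinearSVR(max_iter=10000, random_state=42)),
    ]),
    {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.1, 0.5, 1.0]},
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one entry per parameter combination; scores are negative MSE
results = demo_search.cv_results_
rows = sorted(
    zip(
        results["rank_test_score"],
        results["params"],
        -results["mean_train_score"],
        -results["mean_test_score"],
    ),
    key=lambda row: row[0],
)

for rank, params, train_mse, val_mse in rows:
    print(f"rank {rank}: {params}  train MSE = {train_mse:.4f}  validation MSE = {val_mse:.4f}")
```

A validation MSE far above the training MSE for a given combination would suggest overfitting; similar values suggest the linear model is limited by bias rather than variance.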

## 6. Model Evaluation and Diagnostics

In [None]:
# =============================================================================
# Residual analysis and model diagnostics
# =============================================================================

y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Predicted vs. true values
ax = axes[0, 0]
ax.scatter(y_train, y_pred_train, c='steelblue', alpha=0.6, label='Training set')
ax.scatter(y_test, y_pred_test, c='coral', alpha=0.6, label='Test set')
# Axis limits cover true values and predictions from both splits
all_values = np.concatenate([y, y_pred_train, y_pred_test])
lims = [all_values.min() - 0.5, all_values.max() + 0.5]
ax.plot(lims, lims, 'k--', alpha=0.7, label='Ideal line (y=x)')
ax.set_xlabel('True value', fontsize=11)
ax.set_ylabel('Predicted value', fontsize=11)
ax.set_title('Predicted vs. true values', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Residual distribution
ax = axes[0, 1]
ax.hist(residuals_train, bins=20, alpha=0.6, color='steelblue', 
        label=f'Training set (std={residuals_train.std():.3f})')
ax.hist(residuals_test, bins=15, alpha=0.6, color='coral', 
        label=f'Test set (std={residuals_test.std():.3f})')
ax.axvline(x=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Residual', fontsize=11)
ax.set_ylabel('Count', fontsize=11)
ax.set_title('Residual distribution', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Residuals vs. predicted values
ax = axes[1, 0]
ax.scatter(y_pred_train, residuals_train, c='steelblue', alpha=0.6, label='Training set')
ax.scatter(y_pred_test, residuals_test, c='coral', alpha=0.6, label='Test set')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Predicted value', fontsize=11)
ax.set_ylabel('Residual', fontsize=11)
ax.set_title('Residuals vs. predicted values', fontsize=12)
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Final fit
ax = axes[1, 1]
ax.scatter(X_train, y_train, c='steelblue', alpha=0.6, s=40, label='Training data')
ax.scatter(X_test, y_test, c='coral', alpha=0.6, s=40, label='Test data')
y_plot_best = best_model.predict(X_plot)
best_eps = grid_search.best_params_['svr__epsilon']
ax.plot(X_plot, y_plot_best, 'g-', linewidth=2, label='SVR prediction')
ax.fill_between(X_plot.ravel(), y_plot_best - best_eps, y_plot_best + best_eps, 
                alpha=0.2, color='green', label=f'ε-tube (±{best_eps})')
ax.set_xlabel('X', fontsize=11)
ax.set_ylabel('y', fontsize=11)
ax.set_title('Best model fit', fontsize=12)
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Unit Tests

In [None]:
# =============================================================================
# Unit tests: check that the model and data handling behave as expected
# =============================================================================

def run_tests():
    """Run all unit tests."""
    test_results = []
    
    # Test 1: data generation function
    try:
        X_test_gen, y_test_gen = generate_regression_data(n_samples=50)
        assert X_test_gen.shape == (50, 1), "Wrong feature shape"
        assert y_test_gen.shape == (50,), "Wrong target shape"
        test_results.append(("Data generation function", True, ""))
    except Exception as e:
        test_results.append(("Data generation function", False, str(e)))
    
    # Test 2: the model can be trained
    try:
        test_pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('svr', LinearSVR(epsilon=0.5, max_iter=1000, random_state=42))
        ])
        test_pipeline.fit(X_train, y_train)
        assert hasattr(test_pipeline.named_steps['svr'], 'coef_'), "Model was not fitted"
        test_results.append(("Model training", True, ""))
    except Exception as e:
        test_results.append(("Model training", False, str(e)))
    
    # Test 3: prediction output shape
    try:
        predictions = test_pipeline.predict(X_test)
        assert predictions.shape == y_test.shape, "Prediction shape does not match target shape"
        test_results.append(("Prediction output shape", True, ""))
    except Exception as e:
        test_results.append(("Prediction output shape", False, str(e)))
    
    # Test 4: R² is in a valid range
    # R² is at most 1 but has no lower bound, so only the upper bound is checked
    try:
        r2 = r2_score(y_test, predictions)
        assert np.isfinite(r2) and r2 <= 1, f"Abnormal R² value: {r2}"
        test_results.append(("R² range", True, f"R² = {r2:.4f}"))
    except Exception as e:
        test_results.append(("R² range", False, str(e)))
    
    # Test 5: model parameters are accessible
    try:
        coef = test_pipeline.named_steps['svr'].coef_
        intercept = test_pipeline.named_steps['svr'].intercept_
        assert coef is not None and intercept is not None
        test_results.append(("Model parameter access", True, f"Coefficient shape: {coef.shape}"))
    except Exception as e:
        test_results.append(("Model parameter access", False, str(e)))
    
    # Test 6: Pipeline structure
    try:
        assert len(test_pipeline.steps) == 2
        assert 'scaler' in test_pipeline.named_steps
        assert 'svr' in test_pipeline.named_steps
        test_results.append(("Pipeline structure", True, ""))
    except Exception as e:
        test_results.append(("Pipeline structure", False, str(e)))
    
    # Print the test results
    print("="*60)
    print("Unit test results")
    print("="*60)
    
    passed = 0
    for name, success, msg in test_results:
        status = "✓ PASS" if success else "✗ FAIL"
        passed += int(success)
        print(f"{status} | {name}")
        if msg:
            print(f"       {msg}")
    
    print("="*60)
    print(f"Total: {passed}/{len(test_results)} tests passed")
    print("="*60)
    
    return passed == len(test_results)

# Run the tests
all_passed = run_tests()

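## 8. Summary

### Characteristics of LinearSVR

1. **How the algorithm works**
   - Based on the ε-insensitive loss function
   - Builds a "tube" around the prediction; points inside the tube contribute no loss
   - Trains by minimizing the deviations outside the tube together with the norm of the weights

2. **Key parameters**
   - `epsilon`: half-width of the insensitive region; a larger ε puts more points inside the tube and therefore yields fewer support vectors
   - `C`: regularization parameter that controls model complexity (smaller C means stronger regularization)
   - `loss`: type of loss function (`'epsilon_insensitive'` or `'squared_epsilon_insensitive'`)

3. **Usage recommendations**
   - Standardize the features before fitting (SVMs are sensitive to feature scale)
   - `LinearSVR` is built on liblinear and scales to larger sample counts than kernel `SVR`; for very large or streaming data, consider `SGDRegressor` with `loss='epsilon_insensitive'`
   - For nonlinear relationships, consider `SVR` with a kernel function

4. **Comparison with other regression methods**
   - Compared with ordinary least squares, SVR with the default `epsilon_insensitive` loss is more robust to outliers, because errors beyond ε are penalized linearly rather than quadratically
   - Compared with Ridge/Lasso, the SVR solution is sparse in the samples: only points on or outside the tube act as support vectors (note that `LinearSVR` does not expose them as an attribute)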
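
The recommendation to use a kernel for nonlinear relationships can be checked on data of the kind used in this notebook. The sketch below is a rough comparison under stated assumptions: the data are regenerated with a separate random generator, and the hyperparameters (`C=10.0`, `epsilon=0.1`, RBF kernel with default `gamma`) are illustrative choices, not tuned values. The names `X_demo`, `y_demo`, `linear_model`, and `rbf_model` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

# Toy data (hypothetical) from the same true function: y = 2*sin(x) + 0.5*x + noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 2 * np.sin(X_demo.ravel()) + 0.5 * X_demo.ravel() + rng.normal(0, 0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Linear model: can only fit a straight line through the data
linear_model = make_pipeline(
    StandardScaler(),
    LinearSVR(C=1.0, epsilon=0.1, max_iter=10000, random_state=42),
)

# Kernel model: the RBF kernel lets the fit follow the sine-shaped curve
rbf_model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.1),
)

linear_model.fit(X_tr, y_tr)
rbf_model.fit(X_tr, y_tr)

r2_linear = r2_score(y_te, linear_model.predict(X_te))
r2_rbf = r2_score(y_te, rbf_model.predict(X_te))

print(f"LinearSVR test R²:  {r2_linear:.3f}")
print(f"SVR (RBF) test R²:  {r2_rbf:.3f}")
```

On this kind of data the straight-line fit captures the overall trend but not the curvature, so the kernel model is expected to score higher; the trade-off is that kernel `SVR` training cost grows much faster with the number of samples.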
