# Day 26 Evaluating Regression Models: Entropy Weight Method + TOPSIS

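When we evaluated machine learning models earlier, several metrics often pulled in different directions, which made it hard to decide from experience alone which model was better.

In this section we keep the multi-objective mindset and bring in a more systematic evaluation methodology: the entropy weight method combined with TOPSIS, used to score the models.

Dataset: the California housing dataset.

Comparing models by eye runs into three problems:

- **Conflicting metrics**: MSE and MAE should be as low as possible, R² as high as possible, and training time is a cost dimension of its own. No model is guaranteed to be best on all of them at once.
- **Subjectivity risk**: if weights are set by intuition alone, the conclusion changes from person to person.
- **No unified standard**: the usual approach watches a single metric (such as R²) and cannot account for overall performance.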

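**Entropy Weight Method**: assigns weights automatically from how dispersed each metric's values are. The more a metric differs across models, the better it discriminates between them, and the higher its weight.

**TOPSIS (Technique for Order Preference by Similarity to Ideal Solution)**: given the weights, measures each model's distance to the "ideal solution" and the "negative ideal solution" to produce the final ranking. The whole workflow splits into two stages: objective weighting, then comprehensive evaluation.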

## 1. Data Preparation


In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.datasets import fetch_california_housing

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
sns.set(style='whitegrid', font='SimHei')

housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()
df.columns = [
    'MedInc (中位收入)', 'HouseAge (房龄)', 'AveRooms (平均房间数)',
    'AveBedrms (平均卧室数)', 'Population (人口)', 'AveOccup (平均居住人数)',
    'Latitude (纬度)', 'Longitude (经度)', 'MedHouseVal (房价中位数)']
print(f"Samples: {df.shape[0]}, features: {df.shape[1]-1}")
df.head()

Samples: 20640, features: 8


| | MedInc (中位收入) | HouseAge (房龄) | AveRooms (平均房间数) | AveBedrms (平均卧室数) | Population (人口) | AveOccup (平均居住人数) | Latitude (纬度) | Longitude (经度) | MedHouseVal (房价中位数) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [2]:
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}, test set: {X_test.shape}")

Training set: (16512, 8), test set: (4128, 8)


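## 2. Building the Regression Models to Compare

We pick four common baseline models (linear regression, decision tree, random forest, gradient boosting) to mimic a realistic modeling scenario.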

In [3]:
regressors = {
    'Linear Regression (线性回归)': LinearRegression(),
    'Decision Tree (决策树)': DecisionTreeRegressor(random_state=42),
    'Random Forest (随机森林)': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting (梯度提升)': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

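## 3. Training the Models and Collecting Metrics

We record each model's test-set **MSE / RMSE / MAE / R²** together with its training time, and assemble them into the raw decision matrix `results_df`.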

In [4]:
records = []
for name, model in regressors.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    duration = time.perf_counter() - start

    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)

    records.append({
        'Model': name,
        'Mean Squared Error (MSE)': mse,
        'Root Mean Squared Error (RMSE)': rmse,
        'Mean Absolute Error (MAE)': mae,
        'R2 Score': r2,
        'Training Time (s)': duration
    })

results_df = pd.DataFrame(records)
results_df

| | Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | R2 Score | Training Time (s) |
|---|---|---|---|---|---|---|
| 0 | Linear Regression (线性回归) | 0.555892 | 0.745581 | 0.5332 | 0.575788 | 0.019426 |
| 1 | Decision Tree (决策树) | 0.495235 | 0.703729 | 0.454679 | 0.622076 | 0.139013 |
| 2 | Random Forest (随机森林) | 0.255368 | 0.50534 | 0.327543 | 0.805123 | 8.383994 |
| 3 | Gradient Boosting (梯度提升) | 0.293997 | 0.542215 | 0.371643 | 0.775645 | 2.616082 |


We now have `results_df`, the raw decision matrix. What follows is the usual three-step routine: **normalization → entropy weights → TOPSIS ranking**.

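## 4. Metric Directions and Preprocessing

The code has no way of knowing that lower error is better and higher R² is better, so each metric's direction must be declared by hand. Training time is treated as a cost metric as well.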

In [5]:
benefit_cols = ['R2 Score']
cost_cols = ['Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)',
             'Mean Absolute Error (MAE)', 'Training Time (s)']

data_eval = results_df.set_index('Model')[benefit_cols + cost_cols].astype(float)

print('Step 1 done: metric directions defined.')
print(f'Benefit metrics (+): {benefit_cols}')
print(f'Cost metrics (-): {cost_cols}')

Step 1 done: metric directions defined.
Benefit metrics (+): ['R2 Score']
Cost metrics (-): ['Mean Squared Error (MSE)', 'Root Mean Squared Error (RMSE)', 'Mean Absolute Error (MAE)', 'Training Time (s)']


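## 5. Normalization

To remove the effect of differing units, benefit metrics are scaled with $(x-\min)/(\max-\min)$ and cost metrics with $(\max-x)/(\max-\min)$, so that after scaling 1 is always the best value and 0 the worst. A small constant $\epsilon$ is then added so that $\ln(0)$ cannot occur in the entropy step.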
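To make the two formulas concrete, here is a minimal sketch on a three-row toy table. The helper name `minmax_by_direction` and the toy values are illustrative assumptions, not part of the notebook; the logic mirrors the cell that follows, including the fallback to 1.0 for a constant column.

```python
import numpy as np
import pandas as pd

def minmax_by_direction(df, benefit_cols, cost_cols, eps=1e-6):
    """Min-max scale each column so that 1 always means 'best' (hypothetical helper)."""
    out = pd.DataFrame(index=df.index, dtype=float)
    for col in list(benefit_cols) + list(cost_cols):
        lo, hi = df[col].min(), df[col].max()
        if hi == lo:
            out[col] = 1.0                          # constant column: nothing to distinguish
        elif col in benefit_cols:
            out[col] = (df[col] - lo) / (hi - lo)   # larger raw value -> closer to 1
        else:
            out[col] = (hi - df[col]) / (hi - lo)   # smaller raw value -> closer to 1
    return out + eps                                # shift away from 0 so log() stays finite

# Toy data (made up): model 'c' is best on both metrics
toy = pd.DataFrame({'r2': [0.5, 0.7, 0.9], 'mse': [0.6, 0.4, 0.2]}, index=['a', 'b', 'c'])
scaled = minmax_by_direction(toy, benefit_cols=['r2'], cost_cols=['mse'])
print(scaled)
```

After scaling, both columns run from roughly 0 (worst) to roughly 1 (best), even though the raw `mse` column was "lower is better".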

In [6]:
epsilon = 1e-6

for col in benefit_cols:
    min_val = data_eval[col].min()
    max_val = data_eval[col].max()
    data_eval[col] = 1.0 if max_val == min_val else (data_eval[col] - min_val) / (max_val - min_val)

for col in cost_cols:
    min_val = data_eval[col].min()
    max_val = data_eval[col].max()
    data_eval[col] = 1.0 if max_val == min_val else (max_val - data_eval[col]) / (max_val - min_val)

data_eval = data_eval + epsilon
print('Step 2 done: metrics normalized.')
data_eval

Step 2 done: metrics normalized.


| Model | R2 Score | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | Training Time (s) |
|---|---|---|---|---|---|
| Linear Regression (线性回归) | 1e-06 | 1e-06 | 1e-06 | 1e-06 | 1.000001 |
| Decision Tree (决策树) | 0.201837 | 0.201837 | 0.174209 | 0.381805 | 0.985704 |
| Random Forest (随机森林) | 1.000001 | 1.000001 | 1.000001 | 1.000001 | 1e-06 |
| Gradient Boosting (梯度提升) | 0.871462 | 0.871462 | 0.846509 | 0.785567 | 0.689566 |


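## 6. Computing Weights with the Entropy Weight Method

The more a metric varies across models, the better it separates them, so the more weight it should carry. Information entropy gives a quantitative basis for this: a metric whose normalized values are spread unevenly has low entropy and therefore a high weight.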
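Written out, the computation in the next cell looks as follows. The symbols are notation introduced here for readability, not names from the code: $z_{ij}$ is the normalized value (with $\epsilon$ already added) of model $i$ on metric $j$, with $n$ models and $m$ metrics.

$$
p_{ij} = \frac{z_{ij}}{\sum_{k=1}^{n} z_{kj}}, \qquad
E_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}, \qquad
d_j = 1 - E_j, \qquad
w_j = \frac{d_j}{\sum_{l=1}^{m} d_l}
$$

The factor $1/\ln n$ keeps $E_j$ in $[0, 1]$. It equals 1 when every model has the same value on metric $j$ (no discriminating power, so $d_j = 0$) and decreases as the values spread out. In the code, `P`, `E`, `d`, and `weights` correspond to $p_{ij}$, $E_j$, $d_j$, and $w_j$.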

In [7]:
n, m = data_eval.shape
P = data_eval.div(data_eval.sum(axis=0), axis=1)
k = 1 / np.log(n)
E = -k * (P * np.log(P)).sum(axis=0)
d = 1 - E
weights = d / d.sum()

weights_df = pd.DataFrame(weights, columns=['Weight']).sort_values('Weight', ascending=False)
weights_df

| | Weight |
|---|---|
| Root Mean Squared Error (RMSE) | 0.230469 |
| R2 Score | 0.221074 |
| Mean Squared Error (MSE) | 0.221074 |
| Mean Absolute Error (MAE) | 0.177287 |
| Training Time (s) | 0.150096 |
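One detail in the weights is worth a quick check: R2 Score and MSE receive exactly the same weight (0.221074). This is expected rather than a coincidence. On a fixed test set, $R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$, so R² is a decreasing linear function of MSE, and min-max scaling a benefit column and its linearly related cost column yields the same numbers. The sketch below verifies this using the rounded values from the results table; the variable names are illustrative.

```python
import numpy as np

# Rounded test-set metrics copied from the results table
# (order: linear regression, decision tree, random forest, gradient boosting)
r2 = np.array([0.575788, 0.622076, 0.805123, 0.775645])
mse = np.array([0.555892, 0.495235, 0.255368, 0.293997])

r2_scaled = (r2 - r2.min()) / (r2.max() - r2.min())        # benefit-type scaling
mse_scaled = (mse.max() - mse) / (mse.max() - mse.min())   # cost-type scaling

print(np.round(r2_scaled, 4))
print(np.round(mse_scaled, 4))

# Largest difference between the two scaled columns (only rounding noise remains)
max_gap = np.abs(r2_scaled - mse_scaled).max()
print(max_gap)
```

Since RMSE is also a monotone function of MSE, three of the five columns carry nearly the same information, so accuracy is effectively counted several times in the weighting. Whether to drop redundant metrics before weighting is a judgment call that depends on the use case.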


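## 7. TOPSIS Comprehensive Evaluation

- **Weight**: multiply each metric column by its weight.
- **Find the benchmarks**: because every column has already been normalized so that larger is better, each column's maximum forms the ideal solution and each column's minimum the negative ideal solution.
- **Measure distances**: compute each model's Euclidean distance to the ideal and to the negative ideal solution.
- **Score**: compute the relative closeness $C_i$; the closer to 1, the better the model.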
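The four steps above can be written compactly. As before, the symbols are notation introduced here, not names from the code: $z_{ij}$ is the normalized value of model $i$ on metric $j$, $w_j$ is the entropy weight of metric $j$, and there are $n$ models and $m$ metrics.

$$
v_{ij} = w_j \, z_{ij}, \qquad
v_j^{+} = \max_{1 \le i \le n} v_{ij}, \qquad
v_j^{-} = \min_{1 \le i \le n} v_{ij}
$$

$$
D_i^{+} = \sqrt{\sum_{j=1}^{m} \left(v_{ij} - v_j^{+}\right)^2}, \qquad
D_i^{-} = \sqrt{\sum_{j=1}^{m} \left(v_{ij} - v_j^{-}\right)^2}, \qquad
C_i = \frac{D_i^{-}}{D_i^{+} + D_i^{-}}
$$

$C_i$ lies in $[0, 1]$: it is 1 for a model that coincides with the ideal solution ($D_i^{+} = 0$) and 0 for one that coincides with the negative ideal solution ($D_i^{-} = 0$).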

In [8]:
V = data_eval * weights
V_plus = V.max()
V_minus = V.min()

D_plus = np.sqrt(((V - V_plus) ** 2).sum(axis=1))
D_minus = np.sqrt(((V - V_minus) ** 2).sum(axis=1))

scores = D_minus / (D_plus + D_minus)

final_results = results_df.copy()
final_results['TOPSIS Score'] = final_results['Model'].map(scores)
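# method='min' keeps ranks integral when scores tie (the default 'average' yields x.5, which astype(int) would truncate)
final_results['Rank'] = final_results['TOPSIS Score'].rank(ascending=False, method='min').astype(int)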
final_results = final_results.sort_values('Rank')

columns_to_show = ['Model', 'R2 Score', 'Mean Squared Error (MSE)',
                    'Mean Absolute Error (MAE)', 'Training Time (s)',
                    'TOPSIS Score', 'Rank']
final_results[columns_to_show]

| | Model | R2 Score | Mean Squared Error (MSE) | Mean Absolute Error (MAE) | Training Time (s) | TOPSIS Score | Rank |
|---|---|---|---|---|---|---|---|
| 3 | Gradient Boosting (梯度提升) | 0.775645 | 0.293997 | 0.371643 | 2.616082 | 0.824156 | 1 |
| 2 | Random Forest (随机森林) | 0.805123 | 0.255368 | 0.327543 | 8.383994 | 0.739894 | 2 |
| 1 | Decision Tree (决策树) | 0.622076 | 0.495235 | 0.454679 | 0.139013 | 0.350084 | 3 |
| 0 | Linear Regression (线性回归) | 0.575788 | 0.555892 | 0.5332 | 0.019426 | 0.260106 | 4 |


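Once error, goodness of fit, and training cost are weighed together, the two ensemble tree models (gradient boosting and random forest) take the highest TOPSIS scores in this run. Gradient boosting ranks first: random forest has the best accuracy metrics, but its training time is the longest of the four, which pulls its score down. The same workflow carries over to other regression tasks: swap in the new data and metrics and reuse the rest.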
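For reuse, the three steps can be folded into one function. The following is a sketch under assumptions rather than a definitive implementation: the name `entropy_topsis`, its signature, and the shortened column names are all hypothetical, and it assumes at least one metric actually varies across models (otherwise the weights are undefined).

```python
import numpy as np
import pandas as pd

def entropy_topsis(metrics, benefit_cols, cost_cols, eps=1e-6):
    """Return (weights, scores) for a models-by-metrics table (hypothetical helper)."""
    benefit, cost = list(benefit_cols), list(cost_cols)
    X = metrics[benefit + cost].astype(float)

    # 1) Min-max normalization so that larger is always better
    lo, hi = X.min(), X.max()
    span = (hi - lo).replace(0, np.nan)   # constant columns become NaN for now
    Z = (X - lo) / span
    if cost:
        Z[cost] = 1.0 - Z[cost]           # same as (max - x) / (max - min)
    Z = Z.fillna(1.0) + eps               # constant columns -> 1.0; eps keeps log() finite

    # 2) Entropy weights
    P = Z / Z.sum(axis=0)
    E = -(P * np.log(P)).sum(axis=0) / np.log(len(Z))
    d = 1.0 - E
    w = d / d.sum()

    # 3) TOPSIS relative closeness
    V = Z * w
    d_plus = np.sqrt(((V - V.max()) ** 2).sum(axis=1))
    d_minus = np.sqrt(((V - V.min()) ** 2).sum(axis=1))
    return w, d_minus / (d_plus + d_minus)

# Rounded metrics copied from the results table, with shortened labels
results = pd.DataFrame(
    {
        'R2': [0.575788, 0.622076, 0.805123, 0.775645],
        'MSE': [0.555892, 0.495235, 0.255368, 0.293997],
        'RMSE': [0.745581, 0.703729, 0.50534, 0.542215],
        'MAE': [0.5332, 0.454679, 0.327543, 0.371643],
        'Time': [0.019426, 0.139013, 8.383994, 2.616082],
    },
    index=['Linear Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting'],
)
weights, scores = entropy_topsis(results, ['R2'], ['MSE', 'RMSE', 'MAE', 'Time'])
print(scores.sort_values(ascending=False))
```

Fed the rounded numbers from the tables above, this should reproduce the same ranking up to rounding. For a different task, only the input table and the two column lists need to change.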