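1. Import the commonly used libraries for data processing, visualization, feature engineering, model selection, and evaluation. warnings.filterwarnings('ignore') suppresses warning messages.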

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from sklearn.preprocessing import LabelEncoder, RobustScaler, PowerTransformer
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import Lasso, ElasticNet, Ridge
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import warnings
warnings.filterwarnings('ignore')

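2. Read the training and test sets from the given paths into train and test. Save the test set's Id column as test_id, then drop Id from both sets, since it is a row identifier rather than a predictive feature.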

In [2]:
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [3]:
test_id = test['Id']
train.drop("Id", axis=1, inplace=True)
test.drop("Id", axis=1, inplace=True)

3. Apply a log transform to the training set's SalePrice column (np.log1p, i.e. log(1 + x)) to reduce its right skew. Then store the transformed values in y_train as the target variable.

In [4]:
train["SalePrice"] = np.log1p(train["SalePrice"])
y_train = train["SalePrice"].values
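As a minimal sketch of why this transform is reversible, the snippet below applies np.log1p to a handful of made-up prices and recovers them with np.expm1. The prices are hypothetical values chosen for illustration, not rows from the competition data.

```python
import numpy as np

# Hypothetical right-skewed prices (not taken from the competition data).
prices = np.array([50_000.0, 120_000.0, 180_000.0, 250_000.0, 755_000.0])

log_prices = np.log1p(prices)    # log(1 + x), the transform applied to SalePrice
restored = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

print(np.round(log_prices, 3))
print(np.allclose(restored, prices))
```

The same inverse, np.expm1, is what turns model predictions back into prices at the end of the notebook.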

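4. Record the number of training rows as ntrain, then concatenate the training set (without the SalePrice column) and the test set into all_data and reset the index.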

In [5]:
ntrain = train.shape[0]
all_data = pd.concat([train.drop("SalePrice", axis=1), test]).reset_index(drop=True)
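The concatenate-then-split pattern can be illustrated on two tiny frames. This is only a sketch: the frames, their values, and the names toy_train, toy_test, and n_toy_train are made up for the example.

```python
import pandas as pd

# Hypothetical miniature train/test frames; values are illustrative only.
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                          "SalePrice": [12.2, 12.1, 12.3]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267]})

n_toy_train = toy_train.shape[0]
combined = pd.concat(
    [toy_train.drop("SalePrice", axis=1), toy_test]
).reset_index(drop=True)

# After shared preprocessing, positional slicing recovers the two parts.
train_part = combined.iloc[:n_toy_train]
test_part = combined.iloc[n_toy_train:]

print(combined.index.tolist())
print(len(train_part), len(test_part))
```

Resetting the index matters here because both input frames start at index 0, so without it the combined frame would contain duplicate index labels.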

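5. Find the columns of all_data that contain missing values and store them in missing_cols. Define a dictionary fill_strategy that maps columns to fill values, then fill each column's missing values with its dictionary entry, falling back to the column's mode when it has no entry.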

In [6]:
missing_cols = all_data.columns[all_data.isnull().any()].tolist()
fill_strategy = {
    'PoolQC': 'None', 'MiscFeature': 'None', 'Alley': 'None',
    'Fence': 'None', 'FireplaceQu': 'None', 'LotFrontage': all_data['LotFrontage'].median(),
    'GarageType': 'None', 'GarageFinish': 'None', 'GarageQual': 'None',
    'GarageCond': 'None', 'BsmtQual': 'None', 'BsmtCond': 'None',
    'BsmtExposure': 'None', 'BsmtFinType1': 'None', 'BsmtFinType2': 'None',
    'MasVnrType': 'None', 'MasVnrArea': 0, 'MSZoning': 'RL',
    'Functional': 'Typ', 'Electrical': 'SBrkr', 'KitchenQual': 'TA',
    'Exterior1st': 'VinylSd', 'Exterior2nd': 'VinylSd', 'SaleType': 'WD',
    'Utilities': 'AllPub', 'MSSubClass': 'None'
}

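for col in missing_cols:
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which may not modify all_data in recent pandas versions.
    all_data[col] = all_data[col].fillna(fill_strategy.get(col, all_data[col].mode()[0]))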
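To make the dictionary-with-fallback logic concrete, here is a small sketch on a toy frame. The frame contents and the names toy and toy_strategy are assumptions for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "MasVnrArea": [np.nan, 196.0, 0.0],
    "KitchenQual": ["TA", np.nan, "TA"],
})
toy_strategy = {"PoolQC": "None", "MasVnrArea": 0}

for col in toy.columns[toy.isnull().any()]:
    fallback = toy[col].mode()[0]  # most frequent non-missing value
    toy[col] = toy[col].fillna(toy_strategy.get(col, fallback))

print(toy)
```

KitchenQual has no entry in toy_strategy, so its gap is filled with the column mode, while the other two columns take their explicit values.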

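6. Convert the MSSubClass, YrSold, and MoSold columns to strings so that they are treated as categorical.
Then create several new features:
total square footage TotalSF, total bathroom count TotalBath, total porch area TotalPorch, house age Age, and years since remodeling RemodAge.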

In [7]:
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

# Combined features
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
all_data['TotalBath'] = (all_data['FullBath'] + 0.5*all_data['HalfBath'] + 
                         all_data['BsmtFullBath'] + 0.5*all_data['BsmtHalfBath'])
all_data['TotalPorch'] = (all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + 
                          all_data['3SsnPorch'] + all_data['ScreenPorch'])
all_data['Age'] = all_data['YrSold'].astype(int) - all_data['YearBuilt']
all_data['RemodAge'] = all_data['YrSold'].astype(int) - all_data['YearRemodAdd']

7. Select the categorical (object-dtype) columns and encode each one as integers with LabelEncoder.

In [8]:
cat_cols = all_data.select_dtypes(include=['object']).columns
le = LabelEncoder()
for col in cat_cols:
    all_data[col] = le.fit_transform(all_data[col].astype(str))
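The following sketch shows what LabelEncoder does to a single column. The list of quality grades is hypothetical; the point is that the integer codes follow the sorted order of the labels, not any quality ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical quality grades for illustration.
quality = ["TA", "Gd", "Ex", "TA", "Fa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

print(list(encoder.classes_))  # labels in sorted order
print(codes.tolist())          # position of each label in classes_
```

Because the codes are assigned alphabetically, linear models may read an ordering into them that the categories do not have; tree-based models are usually less affected by this.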


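8. Select the numeric features, compute the skewness of each, and keep those whose absolute skewness exceeds 0.75. Apply PowerTransformer (Yeo-Johnson transform) to these skewed features to bring their distributions closer to normal. Because the categorical columns were label-encoded in the previous step, they count as numeric here as well.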

In [9]:
numeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewed_feats = skewed_feats[abs(skewed_feats) > 0.75]

pt = PowerTransformer(method='yeo-johnson', standardize=True)
all_data[skewed_feats.index] = pt.fit_transform(all_data[skewed_feats.index])
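As a rough illustration of the effect, the sketch below draws a synthetic right-skewed sample and compares its skewness before and after a Yeo-Johnson transform. The lognormal data and the names rng, x, and pt_demo are assumptions made for this example.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (lognormal), not real housing data.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt_demo = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt_demo.fit_transform(x)

skew_before = skew(x[:, 0])
skew_after = skew(x_t[:, 0])
print(skew_before, skew_after)
```

With standardize=True the transformed column also has zero mean and unit variance, which is why the notebook's later scaling step operates on already-standardized values for these features.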


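9. Create a RobustScaler, which scales each feature using its median and interquartile range and is therefore less sensitive to outliers. Scale all_data and convert the result back to a DataFrame with the original column names.
Then split all_data back into train and test using the row count ntrain recorded earlier.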

In [10]:
scaler = RobustScaler()
all_data = pd.DataFrame(scaler.fit_transform(all_data), columns=all_data.columns)

train = all_data[:ntrain]
test = all_data[ntrain:]
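A small sketch of how RobustScaler treats an outlier, using its default settings (center on the median, scale by the 25th to 75th percentile range). The five values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical column where 100 is an outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Median is 3 and the interquartile range is 4 - 2 = 2,
# so each value becomes (value - 3) / 2.
print(scaled.ravel())
```

The outlier stays large after scaling, but it does not distort the scale of the other values, which is the property the step above relies on.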

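10.
(1) Use make_pipeline to build a pipeline that first applies RobustScaler and then fits a Lasso regression with alpha=0.00045, random seed 42, and a maximum of 10000 iterations.

(2) Create a Ridge regression model with alpha=12.0 and random seed 42.

(3) Create an XGBRegressor (XGBoost regression model), setting the learning rate, number of estimators, maximum depth, minimum child weight, and other parameters, with GPU acceleration enabled on GPU 0. Note that tree_method='gpu_hist', predictor, and gpu_id are deprecated in XGBoost 2.0 and later, which use device='cuda' with tree_method='hist' instead.

(4) Create an LGBMRegressor (LightGBM regression model), setting the objective, number of leaves, learning rate, number of estimators, and other parameters, with GPU acceleration enabled on OpenCL platform 0, device 0.

(5) Create a StackingRegressor (stacked regression model) that uses the lasso, ridge, xgb_model, and lgb_model defined above as base estimators and a GradientBoostingRegressor with the parameters shown as the final estimator.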

In [11]:
lasso = make_pipeline(
    RobustScaler(),
    Lasso(alpha=0.00045, random_state=42, max_iter=10000)
)

ridge = Ridge(alpha=12.0, random_state=42)

xgb_model = XGBRegressor(
    learning_rate=0.012, n_estimators=4500,
    max_depth=4, min_child_weight=1.5,
    gamma=0.005, subsample=0.8,
    colsample_bytree=0.25,
    reg_alpha=0.95, reg_lambda=0.75,
    random_state=42,
    tree_method='gpu_hist',  # enable GPU acceleration
    predictor='gpu_predictor',
    gpu_id=0  # use GPU 0
)


lgb_model = LGBMRegressor(
    objective='regression', num_leaves=6,
    learning_rate=0.008, n_estimators=4800,
    max_bin=200, bagging_fraction=0.75,
    bagging_freq=5, feature_fraction=0.25,
    feature_fraction_seed=9, bagging_seed=9,
    min_data_in_leaf=6, min_sum_hessian_in_leaf=11,
    device='gpu',  # enable GPU acceleration
    gpu_platform_id=0,  # OpenCL platform 0
    gpu_device_id=0  # GPU device 0 on that platform
)


In [12]:
stacked_model = StackingRegressor(
    estimators=[
        ('lasso', lasso),
        ('ridge', ridge),
        ('xgb', xgb_model),
        ('lgb', lgb_model)
    ],
    final_estimator=GradientBoostingRegressor(
        n_estimators=1200, learning_rate=0.008,
        max_depth=3, max_features='sqrt',
        min_samples_leaf=15, random_state=42
    )
)
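To show the stacking mechanics without a GPU, here is a scaled-down sketch on synthetic data that uses only scikit-learn estimators. The data, the hyperparameters, and the names X_demo, y_demo, and demo_stack are assumptions for illustration and are not the notebook's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem standing in for the housing features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=5.0, random_state=42
)

demo_stack = StackingRegressor(
    estimators=[("lasso", Lasso(alpha=0.1)), ("ridge", Ridge(alpha=1.0))],
    final_estimator=GradientBoostingRegressor(n_estimators=50, random_state=42),
    cv=5,  # out-of-fold base predictions are used to train the final estimator
)
demo_stack.fit(X_demo, y_demo)

# transform() returns one column of predictions per base estimator;
# these columns are the inputs of the final estimator.
base_preds = demo_stack.transform(X_demo[:3])
final_preds = demo_stack.predict(X_demo[:3])
print(base_preds.shape, final_preds.shape)
```

The final estimator never sees the original features in this setup (passthrough is left at its default of False), only the base models' predictions.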


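11. Create a KFold object with 5 splits, shuffling enabled, and random seed 42. Define a function rmse_cv that computes a model's root mean squared error (RMSE): it runs 5-fold cross-validation with cross_val_score using negative mean squared error as the score, negates the scores, takes the square root, and returns the mean RMSE across folds.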

In [13]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(
        model, train.values, y_train, 
        scoring="neg_mean_squared_error", cv=kf)
    )
    return rmse.mean()

print("Stacked Model RMSE: {:.4f}".format(rmse_cv(stacked_model)))

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 3381
[LightGBM] [Info] Number of data points in the train set: 1168, number of used features: 82
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: Tesla P100-PCIE-16GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.05 MB) transferred to GPU in 0.000828 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score 12.030658
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.04 MB) transferred to GPU in 0.000573 secs. 1 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.04 MB) transferred to GPU in 0.000563 secs. 1 sparse feature groups
[LightGBM] [Info] Size of histogram bi
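The sign handling in the cross-validation function above can be checked on a quick synthetic example. This is a sketch only: the data, the Ridge model, and the names X_cv, y_cv, and kf_demo are assumptions chosen so that it runs in a few seconds on a CPU.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the preprocessed training set.
X_cv, y_cv = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MSE is reported negated.
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X_cv, y_cv,
    scoring="neg_mean_squared_error", cv=kf_demo,
)
fold_rmse = np.sqrt(-neg_mse)  # flip the sign before taking the root

print(fold_rmse.shape, fold_rmse.mean())
```

Since the notebook's target is log1p(SalePrice), the RMSE it prints is measured on the log scale rather than in currency units.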

In [14]:
stacked_model.fit(train.values, y_train)
final_pred = np.expm1(stacked_model.predict(test.values))




[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 3524
[LightGBM] [Info] Number of data points in the train set: 1460, number of used features: 83
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: Tesla P100-PCIE-16GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.06 MB) transferred to GPU in 0.000828 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score 12.024057
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.05 MB) transferred to GPU in 0.000802 secs. 1 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 44 dense feature groups (0.04 MB) transferred to GPU in 0.000723 secs. 1 sparse feature groups
[LightGBM] [Info] Size of histogram bi

In [15]:

submission = pd.DataFrame({'Id': test_id, 'SalePrice': final_pred})
submission.to_csv("submission.csv", index=False)