### 采用 scikit-learn 中的 RandomForestRegressor 对加利福尼亚房价数据集进行预测。

#### 建立模型

In [1]:
from sklearn.datasets import fetch_california_housing
import time
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve
from sklearn.model_selection import GridSearchCV
import numpy as np
# X_california, y_california = fetch_california_housing(return_X_y=True)

# 导入数据集
data = fetch_california_housing()
X_california = data.data
y_california = data.target
feature_names = data.feature_names
target_name = "House Price"

# 输出数据集信息
print("数据集大小:", X_california.shape)
print("特征数量:", len(feature_names))
print("特征名称:", feature_names)
print("标签名称:", target_name)
print("标签分布情况:\n", data.target)
# 后续采用的是交叉验证，所以不需要划分训练集和测试集

数据集大小: (20640, 8)
特征数量: 8
特征名称: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
标签名称: House Price
标签分布情况:
 [4.526 3.585 3.521 ... 0.923 0.847 0.894]


#### 建立模型

In [2]:
# 分别使用 DecisionTreeRegressor 和 RandomForestRegressor 建立分类模型（参数默认）
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# 建立DecisionTreeRegressor模型
decision_tree_reg = DecisionTreeRegressor()
# 以根均方误差 RMSE 为评估指标： ‘neg_root_mean_squared_error’
dt_scores = cross_validate(decision_tree_reg, X_california, y_california,
                            cv=10, scoring='neg_root_mean_squared_error', return_train_score=True,n_jobs=-1)

dt_train_rmse = np.sqrt(-dt_scores['train_score'])
dt_test_rmse = np.sqrt(-dt_scores['test_score'])
print("DecisionTreeRegressor模型的训练集RMSE：", dt_train_rmse)
print("DecisionTreeRegressor模型的测试集RMSE：", dt_test_rmse)
print("DecisionTreeRegressor模型的拟合分别耗时：", dt_scores['fit_time'])

# 建立RandomForestRegressor模型
random_forest_reg = RandomForestRegressor()
# 以根均方误差 RMSE 为评估指标
rf_scores = cross_validate(random_forest_reg, X_california, y_california,
                            cv=10, scoring='neg_root_mean_squared_error', return_train_score=True,n_jobs=-1)

rf_train_rmse = np.sqrt(-rf_scores['train_score'])
rf_test_rmse = np.sqrt(-rf_scores['test_score'])
print("RandomForestRegressor模型的训练集RMSE：", rf_train_rmse)
print("RandomForestRegressor模型的测试集RMSE：", rf_test_rmse)
print("RandomForestRegressor模型的拟合分别耗时：", rf_scores['fit_time'])

DecisionTreeRegressor模型的训练集RMSE： [1.76398858e-08 1.76263308e-08 1.66068000e-08 1.77224745e-08
 1.69919074e-08 1.74371966e-08 1.72230275e-08 1.74512202e-08
 1.57309248e-08 1.78358195e-08]
DecisionTreeRegressor模型的测试集RMSE： [1.04391926 0.93927818 0.94785848 0.84940843 0.94859303 0.95484077
 0.87041794 1.01007558 1.01367974 0.84890938]
DecisionTreeRegressor模型的拟合分别耗时： [0.31525254 0.31858468 0.28863835 0.31858468 0.31582451 0.30868077
 0.29864407 0.28758407 0.29006052 0.30968213]
RandomForestRegressor模型的训练集RMSE： [0.42984211 0.43041083 0.42567804 0.43453237 0.42697992 0.42774629
 0.43506784 0.42649927 0.41941884 0.43070611]
RandomForestRegressor模型的测试集RMSE： [0.92580873 0.76815601 0.80547556 0.69286533 0.79060611 0.76752061
 0.722231   0.86270623 0.89239559 0.69895461]
RandomForestRegressor模型的拟合分别耗时： [18.15581346 18.51949286 18.41110563 18.29483771 18.39650083 18.18302822
 18.22264886 18.40616846 18.07918    18.22904539]
