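Note: house price prediction is another classic regression problem.

For house price prediction, the most critical and also the hardest part is finding good features. Here the relevant features have already been listed, so let's go through them one by one: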

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

Let's first look at the structure of the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
def loaddata(file, train=True):
    if train:
        X = pd.read_csv(file)
        X_train = X[X.columns[1:-1]]
        y_train = X[X.columns[-1]]
        return X_train, y_train
    else:
        X = pd.read_csv(file)
        X_test = X[X.columns[1:]]
        return X_test

In [3]:
X_train, y_train = loaddata('./数据集/train.csv')
y_train = np.log1p(y_train)
X_test = loaddata('./数据集/test.csv', train=False)

In [4]:
X_train

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal
5,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,10,2009,WD,Normal
6,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,8,2007,WD,Normal
7,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Shed,350,11,2009,WD,Normal
8,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2008,WD,Abnorml
9,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,,,,0,1,2008,WD,Normal


In [5]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 79 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-

In [6]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 79 columns):
MSSubClass       1459 non-null int64
MSZoning         1455 non-null object
LotFrontage      1232 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
Alley            107 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1457 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1458 non-null object
Exterior2nd      1458 non

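Three kinds of data need preprocessing here.

The first is missing data.

* Missing data can be handled in several ways. If more than 1/5 of a column is missing, we drop the column (this is the threshold used in `drop_nan` below); if less is missing, we can fill in the values by interpolation, by the mean, or by model-based estimation.

The second is categorical data, which needs to be encoded into standard discrete values.

The third is numeric data whose values differ greatly in scale, which we also need to normalize.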
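Before choosing a threshold it can help to look at the share of missing values per column. The following is a minimal sketch on a made-up toy frame; the helper name `missing_ratio` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_ratio(df):
    # Fraction of missing values per column, largest first.
    return df.isnull().mean().sort_values(ascending=False)

# Toy data standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    'Alley': [np.nan, np.nan, np.nan, 'Grvl', np.nan, np.nan],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
})

ratios = missing_ratio(toy)
# Columns above the 1/5 threshold are candidates for dropping.
to_drop = ratios[ratios > 1/5].index.tolist()
# Columns with some, but not too many, missing values are candidates for imputation.
to_impute = ratios[(ratios > 0) & (ratios <= 1/5)].index.tolist()
print(ratios)
```

In this toy frame `Alley` would be dropped and `LotFrontage` would be imputed, which mirrors the policy described above.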

In [7]:
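def drop_nan(X, threshold=1/5):
    # Drop every column whose share of missing values exceeds `threshold`.
    # A new frame is returned instead of dropping in place while iterating.
    to_drop = [i for i in X.columns if X[i].isnull().mean() > threshold]
    return X.drop(columns=to_drop)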

In [8]:
X_train = drop_nan(X_train)
X_train_copy = X_train.copy()
X_test = drop_nan(X_test)
X_test_copy = X_test.copy()

In [9]:
def less_than_more(X):
    missing_cols = []
    for i in X.columns:
        if len(X[X[i].notnull()]) != len(X):
            missing_cols.append(i)
    return missing_cols

In [10]:
missing_cols_train = less_than_more(X_train)
missing_cols_test = less_than_more(X_test)

In [11]:
len(X_train.columns)

74

In [12]:
len(X_test.columns)

74

In [13]:
miss_columns = []
nonmiss_columns = []
for i in X_train.columns:
    if i in X_test.columns:
        nonmiss_columns.append(i)
    else:
        miss_columns.append(i)

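For the remaining columns, where fewer values are missing, we fill the gaps by predictive estimation, which means building a model. Before that, we still have to finish the categorical encoding and the numeric normalization steps.

First we convert the categorical data to numeric form. Columns with missing values need a placeholder first: categorical columns are filled with the string 'NaN' and numeric columns with 0.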

In [14]:
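def fillnan(X):
    for i in X.columns:
        if X[i].dtype == 'O':
            # Placeholder category for missing categorical values.
            # Assign back instead of calling fillna(inplace=True) on a column
            # selection, which newer pandas versions do not propagate reliably.
            X[i] = X[i].fillna('NaN')
        else:
            # Missing numeric values become 0.
            X[i] = X[i].fillna(0)
    return X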

In [15]:
X_train = fillnan(X_train)
X_test = fillnan(X_test)

In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 74 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1460 no

In [17]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 74 columns):
MSSubClass       1459 non-null int64
MSZoning         1459 non-null object
LotFrontage      1459 non-null float64
LotArea          1459 non-null int64
Street           1459 non-null object
LotShape         1459 non-null object
LandContour      1459 non-null object
Utilities        1459 non-null object
LotConfig        1459 non-null object
LandSlope        1459 non-null object
Neighborhood     1459 non-null object
Condition1       1459 non-null object
Condition2       1459 non-null object
BldgType         1459 non-null object
HouseStyle       1459 non-null object
OverallQual      1459 non-null int64
OverallCond      1459 non-null int64
YearBuilt        1459 non-null int64
YearRemodAdd     1459 non-null int64
RoofStyle        1459 non-null object
RoofMatl         1459 non-null object
Exterior1st      1459 non-null object
Exterior2nd      1459 non-null object
MasVnrType       1459 no

In [18]:
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def label_tranform(X):
    for i in X.columns:
        lencoder = LabelEncoder()
        if X[i].dtype == 'O':
            X[i] = lencoder.fit_transform(X[i])
    return X

In [19]:
X_train = label_tranform(X_train)
X_test = label_tranform(X_test)
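One caveat with encoding the two frames separately: a `LabelEncoder` fitted on the training set and another fitted on the test set may assign different integers to the same category. A possible alternative, sketched below under the assumption that both frames share the same columns, is to fit each encoder on the union of train and test values. The helper name `label_transform_shared` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_transform_shared(train, test):
    # Hypothetical helper: fit one encoder per non-numeric column on the union
    # of train and test values so both frames share the same integer codes.
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if not pd.api.types.is_numeric_dtype(train[col]):
            enc = LabelEncoder()
            enc.fit(pd.concat([train[col], test[col]]))
            train[col] = enc.transform(train[col])
            test[col] = enc.transform(test[col])
    return train, test

# Toy frames with made-up values; 'FV' appears only in the test frame.
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL'], 'LotArea': [8450, 6120, 9600]})
toy_test = pd.DataFrame({'MSZoning': ['RM', 'FV'], 'LotArea': [7000, 5000]})

enc_train, enc_test = label_transform_shared(toy_train, toy_test)
print(enc_train['MSZoning'].tolist(), enc_test['MSZoning'].tolist())
```

With a shared encoder, 'RM' receives the same code in both frames, so a model trained on one frame sees consistent values in the other.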

A few numeric features also need to be treated as categories; we will see the effect later.

In [20]:
cols = ['MSSubClass', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

def numeric_transform(X, cols=cols):
    for i in cols:
        le = LabelEncoder()
        X[i] = le.fit_transform(X[i])
    return X

In [21]:
X_train = numeric_transform(X_train)
X_test = numeric_transform(X_test)

Now let's look at the rough distribution of the data.

In [22]:
X_train.describe()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,4.166438,3.028767,57.623288,10516.828082,0.99589,1.942466,2.777397,0.000685,3.019178,0.062329,...,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,1.815753,7.513014,3.770548
std,4.161951,0.632017,34.664304,9981.264932,0.063996,1.409156,0.707666,0.026171,1.622634,0.276232,...,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,1.5521,1.100854
min,0.0,0.0,0.0,1300.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,3.0,42.0,7553.5,1.0,0.0,3.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,8.0,4.0
50%,4.0,3.0,63.0,9478.5,1.0,3.0,3.0,0.0,4.0,0.0,...,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2.0,8.0,4.0
75%,6.0,3.0,79.0,11601.5,1.0,3.0,3.0,0.0,4.0,0.0,...,68.0,0.0,0.0,0.0,0.0,0.0,8.0,3.0,8.0,4.0
max,14.0,4.0,313.0,215245.0,1.0,3.0,3.0,1.0,4.0,2.0,...,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,4.0,8.0,5.0


In [23]:
X_train

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,5,3,65.0,8450,1,3,3,0,4,0,...,61,0,0,0,0,0,2,2,8,4
1,0,3,80.0,9600,1,3,3,0,2,0,...,0,0,0,0,0,0,5,1,8,4
2,5,3,68.0,11250,1,0,3,0,4,0,...,42,0,0,0,0,0,9,2,8,4
3,6,3,60.0,9550,1,0,3,0,0,0,...,35,272,0,0,0,0,2,0,8,0
4,5,3,84.0,14260,1,0,3,0,2,0,...,84,0,0,0,0,0,12,2,8,4
5,4,3,85.0,14115,1,0,3,0,4,0,...,30,0,320,0,0,700,10,3,8,4
6,0,3,75.0,10084,1,3,3,0,4,0,...,57,0,0,0,0,0,8,1,8,4
7,5,3,0.0,10382,1,0,3,0,0,0,...,204,228,0,0,0,350,11,3,8,4
8,4,4,51.0,6120,1,3,3,0,4,0,...,0,205,0,0,0,0,4,2,8,0
9,14,3,50.0,7420,1,3,3,0,0,0,...,4,0,0,0,0,0,1,2,8,4


In [24]:
cols = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
        'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea',
        'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']

def numeric_normalization(X, cols=cols):
    # Despite the name, this applies log(1 + x) rather than min-max scaling
    # (MinMaxScaler was tried and left disabled). The log compresses the long
    # right tail of the area and value columns and is defined at x = 0.
    for i in cols:
        X[i] = np.log1p(X[i])
    return X

In [25]:
X_train = numeric_normalization(X_train)
X_test = numeric_normalization(X_test)
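As a small illustration of what the `np.log1p` transform does, the sketch below applies it to a few made-up values spanning a wide range and then inverts it with `np.expm1`. The same inverse would be needed to turn predictions of the log-transformed `SalePrice` back into dollars.

```python
import numpy as np

# Made-up values with a wide range, including zero.
areas = np.array([0.0, 1300.0, 9478.5, 215245.0])

logged = np.log1p(areas)      # log(1 + x): maps 0 to 0 and compresses large values
restored = np.expm1(logged)   # exp(x) - 1: the inverse of log1p

print(logged)
print(restored)
```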

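Next we predict the missing values and fill them in. The imputation code below uses random forests for this (a classifier for categorical columns, a regressor for numeric ones).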
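The idea of model-based imputation can be sketched on synthetic data: train a model on the rows where the column is known, then predict it for the rows where it is missing. The toy relationship between `LotArea` and `LotFrontage` below is invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: LotFrontage loosely depends on LotArea (an invented relationship).
toy = pd.DataFrame({'LotArea': rng.uniform(5000, 15000, 200)})
toy['LotFrontage'] = toy['LotArea'] / 150 + rng.normal(0, 2, 200)
toy.loc[toy.index[:20], 'LotFrontage'] = np.nan   # hide 20 values

missing = toy['LotFrontage'].isnull()

# Fit on the rows with a known value, predict for the rows without one.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy.loc[~missing, ['LotArea']], toy.loc[~missing, 'LotFrontage'])
toy.loc[missing, 'LotFrontage'] = model.predict(toy.loc[missing, ['LotArea']])

print(toy['LotFrontage'].isnull().sum())
```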

In [26]:
missing_cols_train

['LotFrontage',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

In [27]:
missing_cols_test

['MSZoning',
 'LotFrontage',
 'Utilities',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinSF1',
 'BsmtFinType2',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'KitchenQual',
 'Functional',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageCars',
 'GarageArea',
 'GarageQual',
 'GarageCond',
 'SaleType']

In [28]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [29]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

In [30]:
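def gscv(x, y, categorical=True):
    if categorical:
        rf = RandomForestClassifier()
    else:
        rf = RandomForestRegressor()
    # The `iid` argument was removed from GridSearchCV in scikit-learn 0.24, so
    # it is not passed here. The original grid also searched random_state over
    # range(501), multiplying the number of fits by 501; a single fixed seed
    # keeps the 'random_state' key in best_params_ and keeps the search tractable.
    search = GridSearchCV(rf, {'n_estimators': [50, 80, 100, 200],
                               'max_depth': [3, 10, 30, 50],
                               'min_samples_split': [2, 5, 10],
                               'min_samples_leaf': [1, 2, 5],
                               'random_state': [0]}, cv=10, n_jobs=-1)
    search.fit(x, y)
    return search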

In [31]:
def padding(X, X_copy, categorical=True, cols=cols):
    # X is the filled and encoded frame; X_copy still holds the original NaNs
    # and dtypes, so it tells us which rows to impute and which model to use.
    # `categorical` and `cols` are kept for interface compatibility. `cols` is
    # deliberately not mutated: the original removed entries from this shared
    # list, which silently changed the defaults of other functions.
    for i in X_copy.columns:
        missing = X_copy[i].isnull()
        if not missing.any():
            continue
        # Use every other column as a feature. All columns are already numeric
        # after label encoding, so pd.get_dummies would leave them unchanged.
        features = [c for c in X.columns if c != i]
        x_train = X.loc[~missing, features]
        y_known = X.loc[~missing, i]
        x_test = X.loc[missing, features]
        # Classifier for originally categorical columns, regressor otherwise.
        search = gscv(x_train, y_known, categorical=(X_copy[i].dtype == 'O'))
        # GridSearchCV refits the best model on all of x_train by default.
        X.loc[missing, i] = search.best_estimator_.predict(x_test)
    return X

In [32]:
X_train.describe()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,4.166438,3.028767,3.460779,9.110966,0.99589,1.942466,2.777397,0.000685,3.019178,0.062329,...,2.308541,0.698019,0.085679,0.410671,0.030431,0.233456,6.321918,1.815753,7.513014,3.770548
std,4.161951,0.632017,1.638062,0.517369,0.063996,1.409156,0.707666,0.026171,1.622634,0.276232,...,2.152387,1.727317,0.666876,1.403194,0.438685,1.22603,2.703626,1.328095,1.5521,1.100854
min,0.0,0.0,0.0,7.170888,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,3.0,3.7612,8.929898,1.0,0.0,3.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,8.0,4.0
50%,4.0,3.0,4.158883,9.156887,1.0,3.0,3.0,0.0,4.0,0.0,...,3.258097,0.0,0.0,0.0,0.0,0.0,6.0,2.0,8.0,4.0
75%,6.0,3.0,4.382027,9.358976,1.0,3.0,3.0,0.0,4.0,0.0,...,4.234107,0.0,0.0,0.0,0.0,0.0,8.0,3.0,8.0,4.0
max,14.0,4.0,5.749393,12.279537,1.0,3.0,3.0,1.0,4.0,2.0,...,6.306275,6.315358,6.232448,6.175867,6.605298,9.64866,12.0,4.0,8.0,5.0


In [None]:
X_train = padding(X_train, X_train_copy)
X_test = padding(X_test, X_test_copy)

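The data is now complete, so let's try making predictions. We again use random forest regression and add a gradient boosted tree method, GBDT, for a combined assessment. For model evaluation we use cross-validation.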
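Cross-validated RMSE can be computed directly with `cross_val_score`. The sketch below uses synthetic regression data from `make_regression` as a stand-in for the housing features; the model settings are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and target.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn maximizes scores, so MSE is reported with a negative sign.
neg_mse = cross_val_score(model, X_toy, y_toy, cv=5, scoring='neg_mean_squared_error')
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold, rmse_per_fold.mean())
```

Reporting the spread across folds as well as the mean gives a rough sense of how stable the estimate is.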

In [None]:
def onehotencoder(X, cols=cols):
    to_be_dummies = []
    for i in X.columns:
        if i not in cols:
            to_be_dummies.append(i)
    X_dummies = pd.get_dummies(X[to_be_dummies])
    X_no_dummies = X[cols]
    X = pd.concat([X_dummies, X_no_dummies], axis=1)
    return X

In [None]:
X_train = onehotencoder(X_train)
X_test = onehotencoder(X_test)
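Two details are worth keeping in mind with `pd.get_dummies`. By default it only encodes object, string, or category columns, so integer label codes pass through unchanged unless `columns=` is given. And when train and test are encoded separately, the resulting columns can differ. The sketch below, on made-up toy frames, shows one way to align the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Toy frames with made-up values; 'FV' appears only in test, 'RL' only in train.
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL'], 'LotArea': [8450, 6120, 9600]})
toy_test = pd.DataFrame({'MSZoning': ['RM', 'FV'], 'LotArea': [7000, 5000]})

train_oh = pd.get_dummies(toy_train, columns=['MSZoning'])
test_oh = pd.get_dummies(toy_test, columns=['MSZoning'])

# Give the test frame exactly the training columns: categories unseen in
# training are dropped, categories absent from test become all-zero columns.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(train_oh.columns))
print(test_oh)
```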

In [None]:
gscv_rf = gscv(X_train, y_train, categorical=False)

In [None]:
rf = RandomForestRegressor(n_estimators=gscv_rf.best_params_['n_estimators'],
                           max_depth=gscv_rf.best_params_['max_depth'],
                           min_samples_split=gscv_rf.best_params_['min_samples_split'],
                           min_samples_leaf=gscv_rf.best_params_['min_samples_leaf'],
                           random_state=gscv_rf.best_params_['random_state'])
rf.fit(X_train, y_train)

In [None]:
rf.predict(X_test)

In [None]:
gbdt = GradientBoostingRegressor()
# Named gscv_gbdt so the gscv() helper defined earlier is not overwritten.
# `iid` is omitted (removed in scikit-learn 0.24) and random_state is fixed to
# one seed instead of range(501), which multiplied the grid size by 501.
gscv_gbdt = GridSearchCV(gbdt, {'n_estimators': [50, 80, 100, 200],
                                'max_depth': [3, 5, 10, 30, 50],
                                'learning_rate': [0.05, 0.1, 0.2],
                                'min_samples_split': [2, 5, 10],
                                'min_samples_leaf': [1, 2, 5],
                                'subsample': [0.6, 0.8, 1.0],
                                'random_state': [0]}, cv=10, n_jobs=-1)
gscv_gbdt.fit(X_train, y_train)
# loss='exponential' is only valid for GradientBoostingClassifier; the
# regressor rejects it, so the default loss used during the search is kept.
gbdt = GradientBoostingRegressor(n_estimators=gscv_gbdt.best_params_['n_estimators'],
                                 max_depth=gscv_gbdt.best_params_['max_depth'],
                                 learning_rate=gscv_gbdt.best_params_['learning_rate'],
                                 min_samples_split=gscv_gbdt.best_params_['min_samples_split'],
                                 min_samples_leaf=gscv_gbdt.best_params_['min_samples_leaf'],
                                 subsample=gscv_gbdt.best_params_['subsample'],
                                 random_state=gscv_gbdt.best_params_['random_state'])
gbdt.fit(X_train, y_train)

In [None]:
gbdt.predict(X_test)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_squared_log_error

x_train, x_val, ytrain, y_val = train_test_split(X_train, y_train, test_size=0.3)

In [None]:
rf.fit(x_train, ytrain)
# Score the random forest with its own predictions (the original called gbdt.predict here).
y_pred = rf.predict(x_val)

In [None]:
np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
gbdt.fit(x_train, ytrain)
y_pred = gbdt.predict(x_val)

In [None]:
np.sqrt(mean_squared_error(y_val, y_pred))
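Because `y_train` was transformed with `np.log1p`, the RMSE computed above is measured on the log scale. It corresponds to the root mean squared logarithmic error (RMSLE) of the raw prices, so it should not be read as an error in dollars. The sketch below checks this equivalence on made-up prices and predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up sale prices and predictions in dollars.
prices = np.array([200000.0, 150000.0, 300000.0, 120000.0])
preds = np.array([210000.0, 140000.0, 280000.0, 130000.0])

# RMSE on log1p-transformed values ...
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(prices), np.log1p(preds)))
# ... equals RMSLE on the raw values.
rmsle = np.sqrt(mean_squared_log_error(prices, preds))

print(rmse_on_logs, rmsle)
```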

Judged by root mean squared error, this result is clearly large. Let's see how to improve it, starting with the feature importances.

In [None]:
rf_features = pd.DataFrame({'rf_importance': rf.feature_importances_}, index=X_train.columns).sort_values('rf_importance', ascending=False)

In [None]:
gbdt_features = pd.DataFrame({'gbdt_importance': gbdt.feature_importances_}, index=X_train.columns).sort_values('gbdt_importance', ascending=False)

In [None]:
pd.concat([rf_features, gbdt_features], axis=1).sort_values('rf_importance', ascending=False)

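From the table above we take the most important features (the ones listed in the next cell). Precisely because they matter most, we need to make sure their data contains no outliers or similar anomalies.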

In [None]:
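cols = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', 'GarageCars', 'GarageArea', '1stFlrSF', 'BsmtFinSF1']

def outlier(X, y=None, cols=cols):
    # Drop rows lying more than 3 standard deviations from the column mean.
    # The statistics are recomputed after each column's rows are removed.
    for i in cols:
        dropindex = X.index[np.abs(X[i] - X[i].mean()) > 3 * np.std(X[i])]
        X = X.drop(dropindex)
        # y is optional, so only drop from it when it was actually passed.
        if y is not None:
            y = y.drop(dropindex)
    return X, y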

In [None]:
X_train, y_train = outlier(X_train, y_train)
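The 3-sigma rule used by `outlier` can be illustrated on synthetic data with one injected extreme value. The column name and the numbers below are made up for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic living-area values plus one injected extreme point at row 0.
toy = pd.DataFrame({'GrLivArea': rng.normal(1500, 300, 500)})
toy.loc[0, 'GrLivArea'] = 10000.0

# Distance from the mean in units of the (population) standard deviation.
z = (toy['GrLivArea'] - toy['GrLivArea'].mean()) / toy['GrLivArea'].std(ddof=0)
kept = toy[np.abs(z) <= 3]

print(len(toy), len(kept))
```

Note that the mean and standard deviation are themselves affected by extreme values, so this rule is a rough filter rather than a robust outlier detector.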

In [None]:
x_train, x_val, ytrain, y_val = train_test_split(X_train, y_train, test_size=0.3)

rf = RandomForestRegressor()
# `iid` is omitted (removed in scikit-learn 0.24) and random_state is fixed to
# one seed instead of range(501), which multiplied the grid size by 501.
gscv_rf = GridSearchCV(rf, {'n_estimators': [50, 80, 100, 200],
                            'max_depth': [3, 10, 30, 50],
                            'min_samples_split': [2, 5, 10],
                            'min_samples_leaf': [1, 2, 5],
                            'random_state': [0]}, cv=10, n_jobs=-1)
gscv_rf.fit(x_train, ytrain)
rf = RandomForestRegressor(n_estimators=gscv_rf.best_params_['n_estimators'],
                           max_depth=gscv_rf.best_params_['max_depth'],
                           min_samples_split=gscv_rf.best_params_['min_samples_split'],
                           min_samples_leaf=gscv_rf.best_params_['min_samples_leaf'],
                           random_state=gscv_rf.best_params_['random_state'])
rf.fit(x_train, ytrain)
y_pred = rf.predict(x_val)
# RMSE on the log1p scale of SalePrice.
rf_score = np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
gbdt = GradientBoostingRegressor()
# `iid` is omitted (removed in scikit-learn 0.24) and random_state is fixed to
# one seed instead of range(501), which multiplied the grid size by 501.
gscv_gbdt = GridSearchCV(gbdt, {'n_estimators': [50, 80, 100, 200],
                                'max_depth': [3, 5, 10, 30, 50],
                                'learning_rate': [0.05, 0.1, 0.2],
                                'min_samples_split': [2, 5, 10],
                                'min_samples_leaf': [1, 2, 5],
                                'subsample': [0.6, 0.8, 1.0],
                                'random_state': [0]}, cv=10, n_jobs=-1)
gscv_gbdt.fit(x_train, ytrain)
# loss='exponential' is only valid for GradientBoostingClassifier; the
# regressor rejects it, so the default loss used during the search is kept.
gbdt = GradientBoostingRegressor(n_estimators=gscv_gbdt.best_params_['n_estimators'],
                                 max_depth=gscv_gbdt.best_params_['max_depth'],
                                 learning_rate=gscv_gbdt.best_params_['learning_rate'],
                                 min_samples_split=gscv_gbdt.best_params_['min_samples_split'],
                                 min_samples_leaf=gscv_gbdt.best_params_['min_samples_leaf'],
                                 subsample=gscv_gbdt.best_params_['subsample'],
                                 random_state=gscv_gbdt.best_params_['random_state'])
gbdt.fit(x_train, ytrain)
y_pred = gbdt.predict(x_val)
gbdt_score = np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
import xgboost as xgb

In [None]:
xgbr = xgb.XGBRegressor()
# colsample_bytree previously listed 0.1, which looks like a typo for 1.0 given
# the [0.6, 0.8, 1.0] pattern used for subsample. random_state is fixed to one
# seed instead of range(501), which multiplied the grid size by 501.
gscv_xgb = GridSearchCV(xgbr, {'n_estimators': [50, 80, 100, 200],
                               'max_depth': [3, 5, 10, 30, 50],
                               'learning_rate': [0.05, 0.1, 0.2],
                               'min_child_weight': [3, 5, 7, 9],
                               'subsample': [0.6, 0.8, 1.0],
                               'colsample_bytree': [0.6, 0.8, 1.0],
                               'reg_lambda': [0.01, 0.05, 0.1, 0.5, 1.0],
                               'reg_alpha': [0, 0.1, 0.5, 1.0],
                               'random_state': [0]}, cv=10, n_jobs=-1)
gscv_xgb.fit(x_train, ytrain)
xgbr = xgb.XGBRegressor(n_estimators=gscv_xgb.best_params_['n_estimators'],
                        max_depth=gscv_xgb.best_params_['max_depth'],
                        learning_rate=gscv_xgb.best_params_['learning_rate'],
                        min_child_weight=gscv_xgb.best_params_['min_child_weight'],
                        subsample=gscv_xgb.best_params_['subsample'],
                        colsample_bytree=gscv_xgb.best_params_['colsample_bytree'],
                        reg_lambda=gscv_xgb.best_params_['reg_lambda'],
                        reg_alpha=gscv_xgb.best_params_['reg_alpha'],
                        random_state=gscv_xgb.best_params_['random_state'])
xgbr.fit(x_train, ytrain)
y_pred = xgbr.predict(x_val)
xgb_score = np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
(rf_score + gbdt_score + xgb_score) / 3

In [None]:
rf_score

In [None]:
gbdt_score

In [None]:
xgb_score
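Averaging the three scores summarizes the individual models, but it is not the score of a combined model. If the goal is an ensemble, one option is to average the predictions and score that blend. The sketch below uses hypothetical log-scale predictions (made-up offsets around made-up targets) in place of the real model outputs, and converts the blend back to prices with `np.expm1`.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error between two arrays.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical validation targets on the log1p scale.
y_val_log = np.log1p(np.array([200000.0, 150000.0, 300000.0]))

# Hypothetical predictions from three models (made-up errors).
pred_rf = y_val_log + np.array([0.10, -0.05, 0.08])
pred_gbdt = y_val_log + np.array([-0.08, 0.06, -0.04])
pred_xgb = y_val_log + np.array([0.02, -0.03, 0.01])

# Blend by averaging predictions, then score the blend.
blend = (pred_rf + pred_gbdt + pred_xgb) / 3
blend_rmse = rmse(y_val_log, blend)

# For comparison: the plain average of the three individual scores.
mean_of_rmses = np.mean([rmse(y_val_log, p) for p in (pred_rf, pred_gbdt, pred_xgb)])

# Convert the blended log-scale predictions back to prices.
blend_prices = np.expm1(blend)

print(blend_rmse, mean_of_rmses)
```

The RMSE of an averaged prediction is never larger than the average of the individual RMSEs, which is why blending models with partly uncorrelated errors often helps.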