Now let's solve a regression problem: predicting real-estate prices.

1. Use the dataset https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (train.csv)
2. The dataset is small, so use 10-fold cross-validation to evaluate model quality
3. Build a random forest and display the feature importances
4. Train a stacking ensemble of at least 3 models, including at least 1 linear and 1 non-linear model
5. Validate the second-level model on a separate hold-out set, as in class
6. Show that ensembling really improves quality (compare stacking with the other models on the hold-out set)

Deliverable: a Jupyter notebook with code, comments, and plots

### 1. Prepare the data

In [46]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from jupyterthemes import jtplot
from sklearn.metrics import auc, roc_curve, roc_auc_score
%matplotlib inline
jtplot.style()

In [215]:
df = pd.read_csv('train.csv')

In [343]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non

In [216]:
# missing values per column
df_nan = df.isnull().sum()
# note: "> 1" leaves out columns with exactly one missing value; use "> 0" to list them all
df_nan[df_nan > 1]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

In [217]:
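# LotFrontage and MasVnrArea: replace NaN with 0;
# everything else: replace NaN with 'No', since NaN there means the feature is absent (no garage, no pool, etc.)
# Note: the last fill also puts 'No' into the numeric GarageYrBlt column, turning it into an object column

df['LotFrontage'] = df['LotFrontage'].fillna(0)
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

df.fillna('No', inplace=True)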

In [218]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,No,Reg,Lvl,AllPub,...,0,No,No,No,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,No,Reg,Lvl,AllPub,...,0,No,No,No,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,No,IR1,Lvl,AllPub,...,0,No,No,No,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,No,IR1,Lvl,AllPub,...,0,No,No,No,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,No,IR1,Lvl,AllPub,...,0,No,No,No,0,12,2008,WD,Normal,250000


In [219]:
# look at the categorical features: there are a lot of them, each with many distinct values
сategorical_columns = [c for c in df.columns if df[c].dtype.name == 'object']
сategorical_columns

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SaleCondition']

In [220]:
df_dummies = pd.get_dummies(df[сategorical_columns], columns=сategorical_columns)

In [221]:
# that is a lot of features; are they all needed? One option would be to create dummies only for values that occur at least N times (e.g. 30)
df_dummies.shape

(1460, 365)
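
The idea from the comment above (one-hot encode only the values that occur at least N times) could be sketched roughly as follows. The helper name `collapse_rare_categories`, the `'Other'` label, and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def collapse_rare_categories(frame, columns, min_count=30, other_label='Other'):
    """Replace values seen fewer than min_count times with other_label (hypothetical helper)."""
    result = frame.copy()
    for col in columns:
        counts = result[col].value_counts()
        rare_values = counts[counts < min_count].index
        # keep frequent values, overwrite rare ones with the shared label
        result[col] = result[col].where(~result[col].isin(rare_values), other_label)
    return result

# toy data: 'A' is frequent, 'B' and 'C' each occur once
toy = pd.DataFrame({'zone': ['A'] * 5 + ['B', 'C']})
toy_collapsed = collapse_rare_categories(toy, ['zone'], min_count=2)
toy_dummies = pd.get_dummies(toy_collapsed, columns=['zone'])
print(toy_dummies.columns.tolist())
```

With a threshold like this, the number of dummy columns is bounded by how many values are frequent rather than by how many values exist at all.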

In [222]:
# the numeric features differ widely in scale, so they should be standardized for the linear model and the SVR; tree-based models such as the random forest are insensitive to feature scale
numeric_columns = [c for c in df.columns if df[c].dtype.name != 'object']
df_numeric = df[numeric_columns]

In [223]:
df_numeric.columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [112]:
# concatenate everything and drop Id
data = pd.concat([df_numeric, df_dummies], axis=1)
data = data.drop('Id',axis=1)

In [113]:
data.shape

(1460, 401)

In [114]:
X = data.drop('SalePrice', axis = 1)
y = data['SalePrice']

### 2. The dataset is small, so use 10-fold cross-validation to evaluate model quality

In [60]:
# silence the noisy deprecation warnings
import warnings
warnings.filterwarnings("ignore")

In [64]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

In [65]:
clf = RandomForestRegressor()

In [71]:
# not bad: the random forest explains about 84.5% of the variance (mean R^2 over the 10 folds)
scores = cross_val_score(clf, X, y, cv=10, scoring='r2')
scores.mean()

0.844863824541191
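
A single mean hides how much the score varies from fold to fold. The sketch below reports the per-fold values and their standard deviation, using an explicitly shuffled `KFold`. The synthetic data from `make_regression` is only a stand-in (an assumption); in the notebook, `X` and `y` would be passed instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the notebook's X and y (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# shuffle before splitting so the folds do not depend on the row order
cv = KFold(n_splits=10, shuffle=True, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
demo_scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring='r2')

print('R^2 per fold:', demo_scores.round(3))
print('mean = %.3f, std = %.3f' % (demo_scores.mean(), demo_scores.std()))
```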

### 3. Build a random forest and display the feature importances

In [100]:
from sklearn.model_selection import GridSearchCV

In [101]:
model = RandomForestRegressor()

In [102]:
grid = {'max_depth': np.arange(4, 15),
        'min_samples_leaf': [10, 25, 50, 75, 100],
        'n_estimators': np.arange(6, 25),
        }

gridsearch = GridSearchCV(clf, grid, scoring='r2', cv=3, n_jobs=-1)

In [103]:
gridsearch.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=8,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=24, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]), 'min_samples_leaf': [10, 25, 50, 75, 100], 'n_estimators': array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
       23, 24])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='r2', verbose=0)

In [95]:
sorted(zip(gridsearch.cv_results_['mean_test_score'], gridsearch.cv_results_['params']), key = lambda x: -x[0])[:5]

[(0.8385370227036624,
  {'max_depth': 8, 'min_samples_leaf': 10, 'n_estimators': 24}),
 (0.8380052027237102,
  {'max_depth': 11, 'min_samples_leaf': 10, 'n_estimators': 24}),
 (0.836419116407217,
  {'max_depth': 14, 'min_samples_leaf': 10, 'n_estimators': 20}),
 (0.8362410912673922,
  {'max_depth': 12, 'min_samples_leaf': 10, 'n_estimators': 21}),
 (0.8361217537555121,
  {'max_depth': 11, 'min_samples_leaf': 10, 'n_estimators': 17})]

In [96]:
# forest with the best parameters found by the grid search (all other parameters keep their defaults);
# it is assigned to clf because the following cells fit and inspect clf
clf = RandomForestRegressor(max_depth=8, min_samples_leaf=10, n_estimators=24)

In [97]:
clf.fit(X,y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=8,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=24, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [99]:
# the price depends mostly on the overall quality of the house (OverallQual) and on the above-ground living area (GrLivArea)

feat_imp = pd.DataFrame(X.columns, columns=['feature'])
feat_imp['importance'] = clf.feature_importances_

feat_imp.sort_values(by='importance', ascending=False).head()

Unnamed: 0,feature,importance
3,OverallQual,0.658047
15,GrLivArea,0.124574
11,TotalBsmtSF,0.057731
8,BsmtFinSF1,0.027752
24,GarageCars,0.021233
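
Since the assignment asks for plots, the importances can also be shown as a bar chart. This is a minimal sketch on synthetic data; the feature names `f0`..`f7` and the `make_regression` data are assumptions, and in the notebook the fitted forest and `X.columns` would take their place.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in data with hypothetical feature names
X_arr, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(8)])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top = importances.sort_values(ascending=False).head(5)

# barh draws the first row at the bottom, so ascending order puts the largest bar on top
ax = top.sort_values().plot.barh()
ax.set_xlabel('importance')
ax.set_title('Top-5 feature importances')
plt.tight_layout()
```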


### 4. Train a stacking ensemble of at least 3 models, including at least 1 linear and 1 non-linear model

In [104]:
from sklearn.model_selection import train_test_split

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [123]:
numeric_columns = ['MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','YearRemodAdd',
 'MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
 'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces',
 'GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea',
 'MiscVal','MoSold','YrSold']

In [127]:
# the linear model needs standardized features
from sklearn.preprocessing import StandardScaler

In [125]:
scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

StandardScaler(copy=True, with_mean=True, with_std=True)

In [128]:
X_train[numeric_columns] = scaler.transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])
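
Fitting the scaler on the training part only, as above, avoids leaking hold-out statistics. One way to keep the same guarantee inside every cross-validation fold is to wrap the scaling in a pipeline. The sketch below is an illustration under assumptions: the toy columns `area` and `is_corner` and the generated target are made up, and `ColumnTransformer` is used here in place of the manual column assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
toy_X = pd.DataFrame({
    'area': rng.uniform(500, 3000, size=100),   # numeric feature on a large scale
    'is_corner': rng.randint(0, 2, size=100),   # dummy column, left unscaled
})
toy_y = 100 * toy_X['area'] + 5000 * toy_X['is_corner'] + rng.normal(0, 1000, size=100)

scale_numeric = ColumnTransformer(
    [('scale', StandardScaler(), ['area'])],
    remainder='passthrough',  # dummy columns pass through unchanged
)
pipeline = make_pipeline(scale_numeric, LinearRegression())

# the scaler is re-fit on the training part of every fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pipe_scores = cross_val_score(pipeline, toy_X, toy_y, cv=cv, scoring='r2')
print(pipe_scores.mean())
```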

In [138]:
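def get_meta_features(clf, X_train, y_train, X_test, stack_cv):
    # out-of-fold predictions for the training rows
    meta_train = np.zeros(X_train.shape[0], dtype=float)
    # test predictions, accumulated over the folds and averaged at the end
    meta_test = np.zeros(X_test.shape[0], dtype=float)

    for train_ind, test_ind in stack_cv.split(X_train, y_train):
        clf.fit(X_train.iloc[train_ind], y_train.iloc[train_ind])
        meta_train[test_ind] = clf.predict(X_train.iloc[test_ind])
        meta_test += clf.predict(X_test)

    # average over the folds so that meta_test is on the same scale as meta_train
    return meta_train, meta_test / stack_cv.get_n_splits()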

In [144]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm

In [167]:
clf_lin_reg = LinearRegression()
clf_rand_for_reg = RandomForestRegressor()
clf_svr = svm.SVR(kernel='poly', degree=3)

In [168]:
from sklearn.model_selection import StratifiedKFold

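# SalePrice is a continuous target, so plain shuffled KFold is used instead of StratifiedKFold
from sklearn.model_selection import KFold
stack_cv = KFold(n_splits=10, shuffle=True, random_state=555)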

meta_train = []
meta_test = []
col_names = []

# model 1
print('Linear_Regression_features...')
meta_tr, meta_te = get_meta_features(clf_lin_reg, X_train, y_train, X_test, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('Linear_Regression_pred')

# model 2
print('Random_Forest_Regression features...')
meta_tr, meta_te = get_meta_features(clf_rand_for_reg, X_train, y_train, X_test, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('Random_Forest_Regression_pred')

# model 3
print('SVR...')
meta_tr, meta_te = get_meta_features(clf_svr, X_train, y_train, X_test, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('SVR_pred')

Linear_Regression_features...
Random_Forest_Regression features...
SVR...


In [169]:
X_meta_train = pd.DataFrame(np.stack(meta_train, axis=1), columns=col_names)
X_meta_test = pd.DataFrame(np.stack(meta_test, axis=1), columns=col_names)

In [170]:
from sklearn.linear_model import LinearRegression  # the meta-model is a regressor, so LogisticRegression is not needed

In [171]:
clf_lr_meta = LinearRegression()
clf_lr_meta.fit(X_meta_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [172]:
y_pred_meta_test = clf_lr_meta.predict(X_meta_test)

In [174]:
clf_lr_meta.coef_.flatten()

array([ 0.48864924,  0.66180246, -0.0085878 ])
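
To show whether stacking actually helps, each model's hold-out score has to be computed side by side. The sketch below does that with `r2_score` on synthetic data. It is an illustration under assumptions: the data comes from `make_regression`, and scikit-learn's built-in `StackingRegressor` is swapped in for the hand-written `get_meta_features` loop, so the numbers will not match the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# synthetic stand-in for the notebook's data, split into train and hold-out
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

base_models = [
    ('linear', LinearRegression()),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
    ('svr', SVR(kernel='poly', degree=3)),
]
# the stack clones the base models and trains the meta-model on 10-fold out-of-fold predictions
stack = StackingRegressor(estimators=base_models, final_estimator=LinearRegression(), cv=10)

holdout_r2 = {}
for name, est in base_models + [('stacking', stack)]:
    est.fit(X_tr, y_tr)
    holdout_r2[name] = r2_score(y_ho, est.predict(X_ho))

for name, score in holdout_r2.items():
    print('%-10s R^2 on hold-out: %.3f' % (name, score))
```

The same loop works with any regression metric; only the hold-out rows are used for scoring, so the comparison between the stack and the single models stays fair.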

### Now let's check the model on the held-out data (test.csv)

In [301]:
df_sub = pd.read_csv('test.csv')

In [302]:
# in the submission data some numeric columns have dtype object
num_col_train = ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold']

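num_col = set(num_col_train)

# columns of test.csv that pandas did not read as object
sub_num_col = set(c for c in df_sub.columns if df_sub[c].dtype.name != 'object')
col_obj_to_num = list(num_col - sub_num_col)
# the affected columns, listed explicitly
col_obj_to_num = ['BsmtUnfSF','GarageCars','BsmtFullBath','BsmtFinSF2','TotalBsmtSF','GarageArea','BsmtFinSF1','BsmtHalfBath']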

In [354]:
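# assign the result back: fillna(..., inplace=True) on a column selection would only fill a temporary copy
df_sub[col_obj_to_num] = df_sub[col_obj_to_num].fillna(0)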
df_sub[col_obj_to_num] = df_sub[col_obj_to_num].apply(pd.to_numeric)

In [356]:
df_sub[~df_sub.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition


In [357]:
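# Caution: fillna(..., inplace=True) on a column selection fills a temporary copy,
# so df_sub keeps its NaNs and no *_No dummy columns are created for test.csv below.
# Assigning the result back (df_sub[cols] = df_sub[cols].fillna('No')) would apply the fill.
df_sub[сategorical_columns].fillna('No', inplace=True)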

In [370]:
data.fillna(0,inplace = True)

In [372]:
fill_zero_cols = ['LotFrontage', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath']
df_sub[fill_zero_cols] = df_sub[fill_zero_cols].fillna(0)

In [373]:
df_dummies = pd.get_dummies(df_sub[сategorical_columns], columns=сategorical_columns)
df_sub_numeric = df_sub[num_col_train]

In [374]:
data = pd.concat([df_sub_numeric, df_dummies], axis=1)

In [387]:
np.where(np.isnan(data))

(array([], dtype=int64), array([], dtype=int64))

In [388]:
data.iloc[660]

MSSubClass                 20.0
LotFrontage                99.0
LotArea                  5940.0
OverallQual                 4.0
OverallCond                 7.0
YearBuilt                1946.0
YearRemodAdd             1950.0
MasVnrArea                  0.0
BsmtFinSF1                  0.0
BsmtFinSF2                  0.0
BsmtUnfSF                   0.0
TotalBsmtSF                 0.0
1stFlrSF                  896.0
2ndFlrSF                    0.0
LowQualFinSF                0.0
GrLivArea                 896.0
BsmtFullBath                0.0
BsmtHalfBath                0.0
FullBath                    1.0
HalfBath                    0.0
BedroomAbvGr                2.0
KitchenAbvGr                1.0
TotRmsAbvGrd                4.0
Fireplaces                  0.0
GarageCars                  1.0
GarageArea                280.0
WoodDeckSF                  0.0
OpenPorchSF                 0.0
EnclosedPorch               0.0
3SsnPorch                   0.0
                          ...  
GarageCo

In [386]:
data.fillna(0, inplace=True)

In [389]:
X_submission = data

In [390]:
X_submission[num_col_train] = scaler.transform(X_submission[num_col_train])

The submission data does not have the same number of columns as the training data.

In [403]:
X_submission.shape

(1459, 366)

In [404]:
X_test.shape

(438, 400)

In [409]:
adj_col = list(set(X_test.columns) - set(X_submission.columns))
X_submission[adj_col]= 0

KeyError: "['Condition2_RRNn' 'Electrical_No' 'RoofMatl_ClyTile' 'GarageYrBlt_1933.0'\n 'Electrical_Mix' 'Exterior1st_ImStucc' 'Exterior1st_Stone'\n 'GarageYrBlt_1914.0' 'GarageCond_No' 'RoofMatl_Roll' 'GarageType_No'\n 'Heating_OthW' 'Heating_Floor' 'GarageYrBlt_1906.0' 'BsmtFinType1_No'\n 'GarageQual_No' 'GarageQual_Ex' 'Utilities_NoSeWa' 'RoofMatl_Membran'\n 'PoolQC_No' 'GarageYrBlt_1908.0' 'FireplaceQu_No' 'Fence_No'\n 'Condition2_RRAn' 'BsmtQual_No' 'GarageYrBlt_No' 'GarageYrBlt_1929.0'\n 'RoofMatl_Metal' 'HouseStyle_2.5Fin' 'GarageYrBlt_1931.0'\n 'Condition2_RRAe' 'BsmtCond_No' 'MasVnrType_No' 'PoolQC_Fa'\n 'GarageFinish_No' 'Exterior2nd_Other' 'MiscFeature_TenC' 'Alley_No'\n 'BsmtFinType2_No' 'MiscFeature_No'] not in index"

In [414]:
df_adj_col = pd.DataFrame(columns=adj_col)
df_adj_col.shape



(0, 40)
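
The mismatch comes from calling `get_dummies` separately on two files whose categorical values differ. One common way to line the columns up is `DataFrame.reindex` against the training columns, sketched below on toy data. The frames and column names are assumptions for illustration; in the notebook this would correspond to something like `X_submission.reindex(columns=X_train.columns, fill_value=0)`.

```python
import pandas as pd

train_raw = pd.DataFrame({'area': [100, 150, 120], 'roof': ['Tile', 'Metal', 'Tile']})
test_raw = pd.DataFrame({'area': [130, 90], 'roof': ['Tile', 'Wood']})  # 'Wood' never occurs in train

train_enc = pd.get_dummies(train_raw, columns=['roof'])
test_enc = pd.get_dummies(test_raw, columns=['roof'])

# keep exactly the training columns, in the training order:
# missing dummies (roof_Metal) are added as 0, unseen ones (roof_Wood) are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Unlike adding only the missing columns, this also removes columns the models never saw and fixes the column order, both of which matter for estimators that take plain arrays.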

In [393]:
# SalePrice is continuous, so shuffled KFold is used rather than StratifiedKFold
from sklearn.model_selection import KFold
stack_cv = KFold(n_splits=10, shuffle=True, random_state=555)

meta_train = []
meta_test = []
col_names = []

# model 1
print('Linear_Regression_features...')
meta_tr, meta_te = get_meta_features(clf_lin_reg, X_train, y_train, X_submission, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('Linear_Regression_pred')

# model 2
print('Random_Forest_Regression features...')
meta_tr, meta_te = get_meta_features(clf_rand_for_reg, X_train, y_train, X_submission, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('Random_Forest_Regression_pred')

# model 3
print('SVR...')
meta_tr, meta_te = get_meta_features(clf_svr, X_train, y_train, X_submission, stack_cv)

meta_train.append(meta_tr)
meta_test.append(meta_te)
col_names.append('SVR_pred')

Linear_Regression_features...


ValueError: shapes (1459,366) and (400,) not aligned: 366 (dim 1) != 400 (dim 0)

['Condition2_RRNn',
 'Electrical_No',
 'RoofMatl_ClyTile',
 'GarageYrBlt_1933.0',
 'Electrical_Mix',
 'Exterior1st_ImStucc',
 'Exterior1st_Stone',
 'GarageYrBlt_1914.0',
 'GarageCond_No',
 'RoofMatl_Roll',
 'GarageType_No',
 'Heating_OthW',
 'Heating_Floor',
 'GarageYrBlt_1906.0',
 'BsmtFinType1_No',
 'GarageQual_No',
 'GarageQual_Ex',
 'Utilities_NoSeWa',
 'RoofMatl_Membran',
 'PoolQC_No',
 'GarageYrBlt_1908.0',
 'FireplaceQu_No',
 'Fence_No',
 'Condition2_RRAn',
 'BsmtQual_No',
 'GarageYrBlt_No',
 'GarageYrBlt_1929.0',
 'RoofMatl_Metal',
 'HouseStyle_2.5Fin',
 'GarageYrBlt_1931.0',
 'Condition2_RRAe',
 'BsmtCond_No',
 'MasVnrType_No',
 'PoolQC_Fa',
 'GarageFinish_No',
 'Exterior2nd_Other',
 'MiscFeature_TenC',
 'Alley_No',
 'BsmtFinType2_No',
 'MiscFeature_No']