## Homework (Cross-Validation) UPD

For this homework, take the Boston house-prices dataset (`sklearn.datasets.load_boston`) and do the same for a regression task: try different algorithms, tune their parameters, and report the final quality.

In [25]:
import pandas as pd
import numpy as np

from sklearn.cross_validation import KFold

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV, Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor

from sklearn import metrics
from sklearn.metrics import mean_squared_error, explained_variance_score, mean_absolute_error
from sklearn.svm import SVR

import matplotlib.pyplot as plt
%matplotlib inline


from jupyterthemes import jtplot
jtplot.style(figsize=(12.0, 8.0))


import warnings
warnings.simplefilter('ignore')

### Implementation

## Part I (Preprocessing)

### Load the data

In [2]:
boston = load_boston()

In [3]:
X = boston.data

In [4]:
y = boston.target

In [5]:
# build a DataFrame
#df = pd.DataFrame(X, columns = boston.feature_names)

In [6]:
# normalize the data
#data = X
#data_n_all = (data - data.mean()) / (data.std())
#df_scaler = pd.DataFrame(data_n_all, columns = boston.feature_names)

### Standardize the values

In [7]:
# scale the DataFrame
#scaler = StandardScaler()
#scaler.fit(df)

#X_df_scaled = scaler.transform(df)
#X_df_scaled = pd.DataFrame(X_df_scaled, columns = boston.feature_names)

In [8]:
#X_df_scaled.head()

In [9]:
# scale the NumPy array
scaler = StandardScaler()
scaler.fit(X)

# transform() already returns a NumPy array, so no DataFrame round-trip is needed
X_scaled = scaler.transform(X)

In [10]:
len(y) == len(X_scaled) == len(X)

True
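
The scaling step can be illustrated without scikit-learn. The sketch below uses a small made-up array (not the Boston data) and standardizes each column separately, which is what `StandardScaler` does by default. It then contrasts this with the commented-out variant above, which uses one global mean and standard deviation for the whole matrix. The helper names `zscore_per_column` and `zscore_global` are hypothetical.

```python
import numpy as np

def zscore_per_column(data):
    """Standardize every column to zero mean and unit variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def zscore_global(data):
    """Standardize with a single mean/std computed over all entries."""
    return (data - data.mean()) / data.std()

# toy data: two features on very different scales (made up for illustration)
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

per_column = zscore_per_column(toy)
global_scaled = zscore_global(toy)

print(per_column.mean(axis=0), per_column.std(axis=0))  # column means and stds after per-column scaling
print(global_scaled.mean(axis=0))                       # column means after global scaling
```

With global statistics the features keep their different scales relative to each other, so per-column scaling is usually what distance-based models such as kNN or SVR need.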

## Part II (Cross-Validation)


We try different algorithms, tune their parameters, and analyze the final quality.

Part I: run cross-validation for a range of algorithms, using RMSE as the quality metric.

Algorithms:
- LinearRegression
- Ridge
- ElasticNet
- Lasso
- DecisionTreeRegressor
- RandomForestRegressor
- SVM (SVR)
- kNN (KNeighborsRegressor)
- BaggingRegressor
- GradientBoostingRegressor

Part II: tune the parameters with GridSearchCV, plus a little manual tuning.

##### Create the training and test sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state=5779)

In [12]:
#X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.3, random_state=5779)
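
One caveat about the order of steps: the scaler above is fit on the full dataset before the split, so the mean and variance of the test rows influence the training features. For standardization the effect is usually small, but a `Pipeline` avoids it by re-fitting the scaler inside each training fold. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in (recent scikit-learn releases no longer provide `load_boston`), and the names `X_demo`, `y_demo` and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the housing data (assumption, not the original dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# the scaler is re-fit on the training part of every fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_mse = cross_val_score(pipe, X_demo, y_demo, scoring="neg_mean_squared_error", cv=cv)
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```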

##### Create the models

In [13]:
model_lr = LinearRegression()
model_ridge = Ridge()

model_EN = ElasticNet()
model_l = Lasso()

model_rf = RandomForestRegressor()
model_tree = DecisionTreeRegressor()
model_SVM = SVR()
model_kNN = KNeighborsRegressor()

model_bagging = BaggingRegressor()
model_GBR = GradientBoostingRegressor()

### Part I

In [14]:
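# KFold from sklearn.model_selection takes n_splits; the older
# sklearn.cross_validation.KFold(n, n_folds=...) signature is deprecated
from sklearn.model_selection import KFold
k_fold = KFold(n_splits=20, shuffle=True, random_state=0)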
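
What a shuffled k-fold splitter produces can be sketched in plain NumPy. The function below is a simplified illustration, not scikit-learn's implementation, and `kfold_indices` is a hypothetical name. The 354 rows are an assumption: 70% of a 506-row dataset, matching `test_size = 0.3` above.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    # array_split gives the first n_samples % n_folds folds one extra element
    folds = np.array_split(shuffled, n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# 354 rows is an assumption: 70% of a 506-row dataset
splits = list(kfold_indices(n_samples=354, n_folds=20))
test_sizes = [len(test) for _, test in splits]
print(test_sizes)
```

With 20 folds each validation fold holds only 17 or 18 rows under this assumption, so individual fold scores are noisy and it makes sense to look at their mean, as the code below does.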

In [85]:
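# a helper that returns the per-fold RMSE
def rmse_cv(model, t):
    """
    Take a model (it is cloned and re-fit on every fold) and a
    cross-validation splitter `t`, run cross-validation on the
    training set and return the RMSE of each fold.
    """
    # the scorer returns negated MSE, so flip the sign before taking the root
    rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=t))
    return rmse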
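
The minus sign in `rmse_cv` is there because scikit-learn scorers follow a "greater is better" convention, so the mean squared error is returned negated. The sketch below uses made-up fold scores (not results from this notebook) to show the conversion, and also that the mean of per-fold RMSE values is not the same number as the root of the mean MSE. The helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# made-up scores in the format cross_val_score returns for "neg_mean_squared_error"
neg_mse_per_fold = np.array([-4.0, -16.0, -36.0])

rmse_per_fold = np.sqrt(-neg_mse_per_fold)         # [2., 4., 6.]
mean_of_rmse = rmse_per_fold.mean()                # 4.0, the quantity rmse_cv(...).mean() reports
rmse_of_mean = np.sqrt(-neg_mse_per_fold.mean())   # root of the mean MSE, never smaller than mean_of_rmse

print(rmse([3.0, 5.0], [1.0, 9.0]), mean_of_rmse, rmse_of_mean)
```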

In [16]:
print('RMSE on train...')
print('LinearRegression features...')
print(rmse_cv(model_lr, k_fold).mean())
print('..............................')
print('RF features...')
print(rmse_cv(model_rf, k_fold).mean())
print('..............................')
print('Ridge features...')
print(rmse_cv(model_ridge, k_fold).mean())
print('..............................')
print('Tree features...')
print(rmse_cv(model_tree, k_fold).mean())
print('..............................')
print('ElasticNet features...')
print(rmse_cv(model_EN, k_fold).mean())
print('..............................')
print('Lasso features...')
print(rmse_cv(model_l, k_fold).mean())
print('..............................')

print('SVM features...')
print(rmse_cv(model_SVM, k_fold).mean())
print('..............................')
print('kNN features...')
print(rmse_cv(model_kNN, k_fold).mean())
print('..............................')
print('BaggingRegressor features...')
print(rmse_cv(model_bagging, k_fold).mean())
print('..............................')
print('GradientBoostingRegressor features...')
print(rmse_cv(model_GBR, k_fold).mean())
print('..............................')

RMSE on train...
LinearRegression features...
4.548732181464086
..............................
RF features...
3.4285373827701378
..............................
Ridge features...
4.5442558258884524
..............................
Tree features...
4.891252474124793
..............................
ElasticNet features...
5.022816107507158
..............................
Lasso features...
4.993995410464253
..............................
SVM features...
5.113180136985321
..............................
kNN features...
4.2248009828795245
..............................
BaggingRegressor features...
3.37084335030953
..............................
GradientBoostingRegressor features...
3.164223159297822
..............................
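
A comparison like the one above can also be produced by looping over a dictionary of models, which makes it easy to report the spread across folds next to the mean. The sketch below is one possible way to do this, not the notebook's original code. It runs on synthetic stand-in data (an assumption), and the names `models` and `results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in data (assumption, not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=13, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "kNN": KNeighborsRegressor(),
    "RF": RandomForestRegressor(n_estimators=20, random_state=0),
}

results = {}
for name, model in models.items():
    neg_mse = cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=cv)
    fold_rmse = np.sqrt(-neg_mse)
    results[name] = (fold_rmse.mean(), fold_rmse.std())

# best (lowest mean RMSE) first
for name, (mean_rmse, std_rmse) in sorted(results.items(), key=lambda item: item[1][0]):
    print(f"{name:<18} RMSE = {mean_rmse:.3f} +/- {std_rmse:.3f}")
```

Fixing `random_state` for the randomized models keeps the ranking reproducible between runs.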


### Part II
We tune the parameters with GridSearchCV. To avoid repeating the same steps, I picked only some of the algorithms from Part I for this part of the homework.

Algorithms:
- KNeighborsRegressor
- Ridge
- RandomForestRegressor
- GradientBoostingRegressor
- DecisionTreeRegressor
- BaggingRegressor

In [17]:
from sklearn.model_selection import GridSearchCV

#### - Tune the parameters of KNeighborsRegressor

In [18]:
k_range = list(range(1, 31))

In [19]:
param_grid = dict(n_neighbors=k_range)

In [20]:
grid_kNN = GridSearchCV(model_kNN, param_grid, cv=20)

In [21]:
grid_kNN.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [34]:
#for k in grid.cv_results_:
#    print(k, ":", grid.cv_results_[k][0])

In [23]:
test_scores = grid_kNN.cv_results_['mean_test_score']
#print(test_scores)

In [28]:
print(grid_kNN.best_score_)
print(grid_kNN.best_params_)
print(grid_kNN.best_estimator_)

0.7108661276097427
{'n_neighbors': 5}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')
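
Because the grid search is created with `scoring=None`, it falls back to the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R², not RMSE. The `best_score_` above is therefore a mean R² over the folds and cannot be compared directly with the RMSE values from Part I. A minimal sketch of the metric, with a hypothetical helper `r2` and made-up targets:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # made-up targets

perfect = r2(y_true, y_true)                                 # exact predictions
baseline = r2(y_true, np.full_like(y_true, y_true.mean()))   # always predicting the mean
worse = r2(y_true, y_true[::-1])                             # worse than predicting the mean

print(perfect, baseline, worse)
```

A value of 1 means a perfect fit, 0 matches a constant prediction of the mean, and negative values are possible for models that do worse than that.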


#### - Tune the parameters of Ridge

In [29]:
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 45, 60, 75, 100, 125, 250, 500, 750]

In [30]:
param_grid = dict(alpha=alphas)

In [31]:
grid_r = GridSearchCV(model_ridge, param_grid, cv=20)

In [32]:
grid_r.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 45, 60, 75, 100, 125, 250, 500, 750]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [64]:
#for k in grid.cv_results_:
#    print(k, ":", grid.cv_results_[k][0])

In [33]:
test_scores = grid_r.cv_results_['mean_test_score']
#print(test_scores)

In [34]:
print(grid_r.best_score_)
print(grid_r.best_params_)
print(grid_r.best_estimator_)

0.6352350458121686
{'alpha': 45}
Ridge(alpha=45, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
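
To see what `alpha` controls, the sketch below solves ridge regression in closed form on made-up data and prints the norm of the coefficient vector for a few values from the grid above. This is an illustration of the penalty's effect under stated assumptions, not necessarily how the `Ridge` solver computes its result. `ridge_coefficients` is a hypothetical helper.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))                 # made-up features
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])      # made-up coefficients
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=100)

alphas = [0.05, 1, 45, 750]                        # a few values from the grid above
norms = [np.linalg.norm(ridge_coefficients(X_demo, y_demo, a)) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:<6} ||w|| = {n:.3f}")
```

Larger `alpha` shrinks the coefficients toward zero, trading some bias for lower variance, which is why the grid spans several orders of magnitude.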


#### - Tune the parameters of RandomForestRegressor

In [None]:
#model_rf = RandomForestRegressor()

In [35]:
estimators = list(range(1, 100, 10))

In [36]:
param_grid = dict(n_estimators=estimators)

In [37]:
grid_rf = GridSearchCV(model_rf, param_grid, cv=20, n_jobs=-1)

In [38]:
grid_rf.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [39]:
test_scores = grid_rf.cv_results_['mean_test_score']

In [40]:
print(grid_rf.best_score_)
print(grid_rf.best_params_)
print(grid_rf.best_estimator_)

0.822351112493569
{'n_estimators': 41}
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=41, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


#### - Tune the parameters of GradientBoostingRegressor

In [41]:
lr = [0.05, 0.01, 0.1, 0.3, 0.5, 0.7, 1, 2, 3, 5, 7, 10, 15, 20, 25, 30, 50, 75]

In [42]:
param_grid = dict(learning_rate=lr)

In [43]:
grid_GBR = GridSearchCV(model_GBR, param_grid, cv=20)

In [44]:
grid_GBR.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.05, 0.01, 0.1, 0.3, 0.5, 0.7, 1, 2, 3, 5, 7, 10, 15, 20, 25, 30, 50, 75]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [45]:
test_scores = grid_GBR.cv_results_['mean_test_score']

In [46]:
print(grid_GBR.best_score_)
print(grid_GBR.best_params_)
print(grid_GBR.best_estimator_)

0.8317971448543446
{'learning_rate': 0.1}
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)
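
The grid above varies only `learning_rate`. Since this factor scales the contribution of every tree, it interacts with `n_estimators`, and learning rates well above 1 typically make boosting overshoot rather than help. The sketch below tunes both parameters jointly and scores by negated MSE so that the result can be read as an RMSE. It is an illustration on synthetic stand-in data (an assumption), with a deliberately small grid to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# synthetic stand-in data (assumption, not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=13, noise=10.0, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [50, 100],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_demo, y_demo)

# square root of the best mean cross-validated MSE
best_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, best_rmse)
```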


#### - Tune the parameters of DecisionTreeRegressor

In [78]:
dept = list(range(1, 51))

In [79]:
param_grid = dict(max_depth=dept)

In [81]:
grid_tree = GridSearchCV(model_tree, param_grid, cv=20)

In [82]:
grid_tree.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [83]:
test_scores = grid_tree.cv_results_['mean_test_score']

In [84]:
print(grid_tree.best_score_)
print(grid_tree.best_params_)
print(grid_tree.best_estimator_)

0.73586736123632
{'max_depth': 8}
DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')


#### - Tune the parameters of BaggingRegressor

In [53]:
BaggingRegressor()

BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False)

In [55]:
fe = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
est = list(range(1, 10))

In [56]:
param_grid = dict(max_features=fe, n_estimators=est)

In [57]:
grid_BR = GridSearchCV(model_bagging, param_grid, cv=20, n_jobs=-1)

In [58]:
grid_BR.fit(X_train, y_train)

GridSearchCV(cv=20, error_score='raise',
       estimator=BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1], 'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [59]:
test_scores = grid_BR.cv_results_['mean_test_score']

In [60]:
print(grid_BR.best_score_)
print(grid_BR.best_params_)
print(grid_BR.best_estimator_)

0.8163522324696072
{'max_features': 0.8, 'n_estimators': 8}
BaggingRegressor(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=0.8, max_samples=1.0,
         n_estimators=8, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False)


In [61]:
estimators = {
    'KNeighborsRegressor': grid_kNN,
    'Ridge': grid_r,
    'RandomForestRegressor': grid_rf,
    'GradientBoostingRegressor': grid_GBR,
    'DecisionTreeRegressor': grid_tree,
    'BaggingRegressor': grid_BR,
}

In [68]:
for k in estimators:
    v = estimators[k]
    print('..........')
    # for regressors, best_score_ and score() are R^2 values, not accuracy
    print(k, "CV R^2:", v.best_score_, "Test R^2:", v.best_estimator_.score(X_test, y_test))

..........
KNeighborsRegressor CV R^2: 0.7108661276097427 Test R^2: 0.7288837516477775
..........
Ridge CV R^2: 0.6352350458121686 Test R^2: 0.705562545058098
..........
RandomForestRegressor CV R^2: 0.822351112493569 Test R^2: 0.8883211217586678
..........
GradientBoostingRegressor CV R^2: 0.8317971448543446 Test R^2: 0.8823216530351842
..........
DecisionTreeRegressor CV R^2: 0.7009118233618215 Test R^2: 0.7883364212433589
..........
BaggingRegressor CV R^2: 0.8163522324696072 Test R^2: 0.8486271519934029
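
To put the tuned models on the same scale as the Part I results, the held-out error can also be reported as RMSE, in the units of the target. The sketch below shows one way to do this on synthetic stand-in data (an assumption, not the Boston dataset). The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in data (assumption, not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

grid = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)

# with refit=True (the default) best_estimator_ is already re-fit on all of X_tr
y_pred = grid.best_estimator_.predict(X_te)

test_r2 = r2_score(y_te, y_pred)                        # same value as best_estimator_.score(X_te, y_te)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred))   # same units as the target

print(grid.best_params_, test_r2, test_rmse)
```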
