
# Exercise 3 GridSearchCV

The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.

Preliminary:

- Import the California Housing data set and split it into a train set and a test set (10%). Set up a linear regression pipeline for the data set. *The goal is to focus on the grid search, which is why the code for the linear regression pipeline is given.*

```python
#imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

#data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
#split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)
#pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
```

1. Run `GridSearchCV` on all CPUs with 5 folds, MSE as score, Random Forest as model with:

- max_depth between 1 and 20 (at least 3 values)
- n_estimators between 1 and 100 (at least 3 values)

This may take a few minutes to run.
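To see why the search can be slow, it helps to count the model fits it triggers. The snippet below is a minimal sketch; the grid values are only an example that satisfies the constraints above, not required values.

```python
from sklearn.model_selection import ParameterGrid

# Example grid (assumed values): three values per hyperparameter,
# inside the ranges requested by the exercise.
param_grid = {"max_depth": [4, 7, 10], "n_estimators": [10, 50, 75]}
n_folds = 5

# ParameterGrid enumerates every combination of the listed values.
n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 combinations
n_fits = n_candidates * n_folds                # one fit per combination and fold

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` fits one more model on the whole train set using the best combination.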

*Hint*: The name of the metric to pass to the `scoring` parameter is `neg_mean_squared_error`. The smaller the MSE, the better the model; conversely, the greater the R2, the better the model. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets. Mathematically, minimizing a function is equivalent to maximizing its opposite, which is why the MSE is negated. More details:

- https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error
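The sign convention from the hint can be checked directly. This is a small sketch on synthetic data (the `make_regression` data set and the `_demo` names are assumptions for illustration, not part of the exercise): the `neg_mean_squared_error` scorer returns exactly the opposite of the MSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Small synthetic regression problem, used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

# Plain MSE: lower is better.
mse = mean_squared_error(y_demo, model.predict(X_demo))

# Scorer used by GridSearchCV: higher is better, so the MSE is negated.
neg_mse = get_scorer("neg_mean_squared_error")(model, X_demo, y_demo)

print(mse, neg_mse)
```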

2. Extract the best fitted estimator, print its params, print its score on the validation set and print `cv_results_`.
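The attributes needed for this step can be sketched on a small synthetic problem. The data, the grid values and the variable names below are assumptions chosen to keep the run fast; the attribute names (`best_estimator_`, `best_params_`, `best_score_`, `cv_results_`) are those of a fitted `GridSearchCV`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic data set and a tiny grid, for illustration only.
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=5.0, random_state=0)
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"max_depth": [2, 4], "n_estimators": [5, 10]},
                      cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_model = search.best_estimator_   # refitted on all the data passed to fit
print(search.best_params_)            # hyperparameters of the best candidate
print(search.best_score_)             # mean validation score of that candidate

# cv_results_ is a dict of arrays with one entry per candidate.
results = search.cv_results_
rows = zip(results["rank_test_score"], results["params"],
           results["mean_test_score"], results["std_test_score"])
for rank, params, mean, std in sorted(rows, key=lambda row: row[0]):
    print(rank, params, round(mean, 2), round(std, 2))
```

Printing a few columns of `cv_results_` ranked by score is usually easier to read than printing the whole dictionary.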

3. Compute the score on the test set.
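As a sketch of this step, again on assumed synthetic data: `score` on a fitted `GridSearchCV` applies the metric given in `scoring` to the refitted best estimator, so here it returns the negated test MSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and a tiny grid, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=0)

search = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=0),
                      {"max_depth": [2, 4]},
                      cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)

# Negated MSE of the best estimator on the held-out test set.
test_score = search.score(X_te, y_te)

# Same quantity computed by hand from the predictions.
test_mse = mean_squared_error(y_te, search.predict(X_te))

print(test_score, test_mse)
```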

**WARNING: If the score used in classification is the AUC, there is one rare case where computing the AUC may raise an error or a warning: when a fold contains only one class. In that case the AUC cannot be computed, by definition.**
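The case described in the warning can be reproduced with a hand-made label vector that contains a single class. This is only a sketch: whether `roc_auc_score` raises an error or emits a warning depends on the installed scikit-learn version, so the code handles both outcomes.

```python
import warnings

import numpy as np
from sklearn.metrics import roc_auc_score

# Labels of a hypothetical fold that contains only the positive class.
y_true_one_class = np.array([1, 1, 1, 1])
y_score = np.array([0.2, 0.6, 0.7, 0.9])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    try:
        auc = roc_auc_score(y_true_one_class, y_score)
        outcome = f"returned {auc} with {len(caught)} warning(s)"
    except ValueError as err:
        auc = None
        outcome = f"raised ValueError: {err}"

print(outcome)
```

Stratified splitting (for example `StratifiedKFold`) makes this situation less likely, because each fold then keeps roughly the class proportions of the full data set.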


```python
#imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
#split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)
#pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# 1.
parameters = {'n_estimators':[10, 50, 75],
            'max_depth':[4, 7, 10]}

rf = RandomForestRegressor()
gridsearch = GridSearchCV(rf,
                        parameters,
                        cv = 5,
                        n_jobs=-1,
                        scoring='neg_mean_squared_error')

gridsearch.fit(X_train, y_train)

# 2.
print(gridsearch.best_score_)
print(gridsearch.best_params_)
print(gridsearch.cv_results_)

# The best score is about -0.29, which means that the MSE is ~0.29. This score is the
# average of `neg_mean_squared_error` over all the validation sets. The exact value
# changes from run to run because the random forest is not seeded. Taken alone, the
# value is hard to interpret because the MSE depends on the scale of the target; it is
# mainly useful to compare models.

# 3.
print('-----------')
print(gridsearch.score(X_test, y_test))
# `score` returns the negated MSE; the MSE is ~0.27. The score I got on the test set is
# close to the score I got on the validation sets. This suggests the model is not overfitted.
```

-0.29155187671448063
{'max_depth': 10, 'n_estimators': 75}
{'mean_fit_time': array([0.49330783, 2.17938128, 3.37942586, 0.73762355, 3.60652013,
       6.02801886, 1.01369634, 6.42465711, 7.83597479]), 'std_fit_time': array([0.03276028, 0.11588314, 0.13233421, 0.08189721, 0.16334649,
       0.36806859, 0.06414597, 0.2397275 , 0.89071465]), 'mean_score_time': array([0.00608697, 0.01854568, 0.02624226, 0.00551362, 0.02538805,
       0.04687972, 0.00957913, 0.04579592, 0.04288216]), 'std_score_time': array([0.00275626, 0.00130466, 0.00322864, 0.00062858, 0.0032865 ,
       0.00594217, 0.00294153, 0.01430291, 0.00872202]), 'param_max_depth': masked_array(data=[4, 4, 4, 7, 7, 7, 10, 10, 10],
             mask=[False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'param_n_estimators': masked_array(data=[10, 50, 75, 10, 50, 75, 10, 50, 75],
             mask=[False, False, False, False, False, False, False, False,
