# Developing an ML Model End-to-End: Exercises

Now that you've had some exposure to using Python for training, tuning, evaluating, and selecting an ML model, it's time to try some exercises on your own. Below, you'll find five exercises that ask you to continue your work with the California Housing Data Set used in the Developing an ML Model End-to-End lab. Each exercise begins with some starter code to help guide you toward a solution. Remember, there isn't just one correct solution: your code may produce similar results even if it's written differently from what is supplied in the solution notebook.

If you haven't already downloaded the California Housing Data Set, the code block below is from the course lab. You can execute it again to download the data set to Anaconda Notebooks or your local computer.

In [None]:
# Import packages for loading data
import os
import tarfile
import urllib.request  # Python 3 standard library; no need for the `six` compatibility package
import pandas as pd

# Create a function for pulling U.S. Census data for California housing
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/jsukup/handson-ml2/master/'
HOUSING_PATH = os.path.join('datasets', 'housing')
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing/housing.tgz'

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data() # Pull data set from GitHub

# Create a function for loading data into a Python object
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)

### Q1:
Try training a **Support Vector Machine** regressor (`sklearn.svm.SVR`) with various hyperparameters, such as `kernel='linear'` (with three values for the `C` hyperparameter) and `kernel='rbf'` (with three values each for the `C` and `gamma` hyperparameters). Don't worry about what these hyperparameters mean for now.

How does the best `SVR` predictor perform? Print the best model's RMSE and parameters.  
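
As a hint for reporting the RMSE, here is a minimal sketch of how a grid search score can be converted into an RMSE estimate. It assumes the search is scored with `'neg_mean_squared_error'` and uses small synthetic data from `make_regression` as a stand-in for the prepared housing data. The names `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are illustrative only and are not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: the exercise itself uses the prepared housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A deliberately tiny grid so the sketch runs quickly
demo_grid = [{'kernel': ['linear'], 'C': [1.0, 10.0]}]

# scikit-learn maximizes scores, so the MSE is reported as a negative number
demo_search = GridSearchCV(SVR(), demo_grid, cv=3, scoring='neg_mean_squared_error')
demo_search.fit(X_demo, y_demo)

# Negate the mean cross-validated score and take the square root to get an RMSE estimate
demo_rmse = np.sqrt(-demo_search.best_score_)
print('RMSE:', demo_rmse)
print('Best parameters:', demo_search.best_params_)
```

The same two attributes, `best_score_` and `best_params_`, are available after fitting the search on the housing data with the `param_grid` from the starter code.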

In [None]:
# Train a Support Vector Machine using GridSearchCV from the scikit-learn library to find the best parameters
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Create a list of hyperparameter values set as key:value pairs
param_grid = [
        {'kernel': ['linear'], 'C': [10., 3000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 30., 1000.0],
         'gamma': [0.01, 0.3, 3.0]},
]

# Train Support Vector Machine model


### YOUR CODE HERE ###

### Q2:
Try replacing `GridSearchCV` with `RandomizedSearchCV`, setting `n_iter=10` and `cv=5`. Print the best model's RMSE and parameters.
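
The sketch below shows the general shape of a randomized search, under a few assumptions: it runs on synthetic data rather than the housing data, it narrows the range for `C` so that it finishes quickly, and it scores with `'neg_mean_squared_error'`. All `demo_` names are illustrative and do not come from the lab.

```python
import numpy as np
from scipy.stats import expon, reciprocal
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: the exercise itself uses the prepared housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Distributions to sample from; the C range is narrowed here to keep the demo fast
demo_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(1, 100),      # log-uniform between 1 and 100
    'gamma': expon(scale=1.0),    # ignored when kernel is 'linear'
}

# n_iter=10 draws ten random parameter combinations; cv=5 evaluates each with five folds
demo_rnd_search = RandomizedSearchCV(
    SVR(),
    param_distributions=demo_distribs,
    n_iter=10,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
)
demo_rnd_search.fit(X_demo, y_demo)

# Convert the best (negative MSE) score into an RMSE estimate
demo_rnd_rmse = np.sqrt(-demo_rnd_search.best_score_)
print('RMSE:', demo_rnd_rmse)
print('Best parameters:', demo_rnd_search.best_params_)
```

Unlike a grid search, the number of candidates is fixed by `n_iter` rather than by the size of the grid, so the cost stays at 10 × 5 = 50 fits no matter how wide the distributions are.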

In [None]:
# Try random search with Support Vector Machine model
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# see https://docs.scipy.org/doc/scipy/reference/stats.html
# for `expon()` and `reciprocal()` documentation and more probability 
# distribution functions.

# Create a range of hyperparameter values set as key:value pairs
# Note: gamma is ignored when kernel is 'linear'
param_distribs = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
}
### YOUR CODE HERE ###

### Q3:
Try creating a transformer class called `TopFeatureSelector` with three methods (`__init__`, `fit`, and `transform`) that selects the top `k` features based on their importances, then chain it after the existing `full_pipeline` created earlier. The `BaseEstimator` and `TransformerMixin` classes used earlier are appropriate base classes for the new transformer. When done, the new Pipeline will look something like this:

```
pipeline = Pipeline([
                     ('preparation', full_pipeline),
                     ('feature_selection', TopFeatureSelector(feature_importances, k))
])
```

**NOTE**: This feature selector assumes that you have already computed the feature importances somehow (for example, using a `RandomForestRegressor`). Although it is tempting to compute them directly in the `TopFeatureSelector`'s `fit()` method, doing so would likely slow down hyperparameter optimization, since the feature importances would have to be recomputed for every hyperparameter combination (unless you implement some sort of cache).
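
One possible shape for such a transformer is sketched below. It is not necessarily the version in the solution notebook. It assumes the importances are passed in as a one-dimensional array and that the input supports NumPy-style column indexing. The importances and the small array used to try it out are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the k columns with the largest precomputed importances."""

    def __init__(self, feature_importances, k):
        # Store constructor arguments unchanged so get_params/set_params work
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        # Indices of the k largest importances, sorted to preserve column order
        order = np.argsort(np.asarray(self.feature_importances))
        self.feature_indices_ = np.sort(order[-self.k:])
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]

# Made-up importances for four features; columns 1 and 3 score highest
demo_importances = np.array([0.10, 0.50, 0.05, 0.35])
X_small = np.arange(12).reshape(3, 4)

selector = TopFeatureSelector(demo_importances, k=2)
X_top = selector.fit_transform(X_small)
print(X_top)
```

Because `k` is stored as a plain constructor argument, a search object can later tune it through the Pipeline like any other hyperparameter.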

In [None]:
# Import estimators, pipeline, and Random Forest
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Use RandomForestRegressor to compute feature importance scores
### YOUR CODE HERE ###

### Q4:
Try creating a single Pipeline that does the full data preprocessing plus the final prediction. This can be accomplished by taking the Pipeline created in Q3 and adding one final step for the model. For that final step, use an `SVR` configured with the best parameters found by the random search in Q2 (i.e., `rnd_search_SVR.best_params_`).

Try the Pipeline on the first 5 observations from the training data by printing out the model's prediction and the actual label.
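
The following sketch illustrates the two ideas involved: unpacking a parameter dictionary into `SVR` with `**`, and comparing predictions with labels for the first five rows. It assumes synthetic data and a single `StandardScaler` step in place of the lab's preprocessing. The dictionary `demo_best_params` is a hypothetical stand-in for `rnd_search_SVR.best_params_`.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: the exercise itself uses the housing training set)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Hypothetical stand-in for rnd_search_SVR.best_params_
demo_best_params = {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.1}

# Preprocessing and model in one object; ** unpacks the dict into keyword arguments
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_reg', SVR(**demo_best_params)),
])
demo_pipeline.fit(X_demo, y_demo)

# Compare predictions with the actual labels for the first five observations
some_data = X_demo[:5]
some_labels = y_demo[:5]
demo_predictions = demo_pipeline.predict(some_data)
print('Predictions:', demo_predictions)
print('Labels:     ', some_labels)
```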

In [None]:
# Import pipeline and Support Vector Machine packages
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
import pandas as pd

# Create final pipeline for preprocessing and prediction
final_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svm_reg', SVR(**rnd_search_SVR.best_params_))
])
### YOUR CODE HERE ###

### Q5:
Automatically explore some preprocessing options using `GridSearchCV` on the `final_pipeline` just created.

To access hyperparameters nested within a Pipeline, join the step name and the parameter name with a double underscore (i.e. "\__"). For example, to tune the value of `k` in the `TopFeatureSelector` step, use the key `'feature_selection__k'` in the parameter grid, paired with a list of candidate values for `k`.

Double underscores allow access to nested tasks within the Pipeline. Another example: to access the `SimpleImputer` nested all the way down in the first Pipeline created for numeric features alone (i.e. `num_pipeline`), use the names given to each Pipeline object in sequence:

```
'preparation__num__imputer__strategy'
```
where:
```
preparation__ = [name for `full_pipeline` in `final_pipeline` object]
num__ = [name for `num_pipeline` in `full_pipeline` object]
imputer__ = [name for `SimpleImputer` in `num_pipeline` object]
strategy = [hyperparameter for `SimpleImputer`]
```
Finish by printing the parameters of the best model found.
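
If you are unsure of a nested parameter's full name, `get_params()` lists every name the Pipeline accepts. The sketch below builds a small nested pipeline whose step names (`'preparation'`, `'num'`, `'imputer'`) mirror the ones described above. The pipeline itself and the column indices are assumptions for illustration, not the lab's actual `final_pipeline`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Innermost pipeline: numeric preprocessing
demo_num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

# Middle layer: apply the numeric pipeline to columns 0 and 1 (illustrative indices)
demo_full_pipeline = ColumnTransformer([
    ('num', demo_num_pipeline, [0, 1]),
])

# Outer pipeline: preprocessing followed by the model
demo_final_pipeline = Pipeline([
    ('preparation', demo_full_pipeline),
    ('svm_reg', SVR()),
])

# Every tunable name is a key of get_params(); filter for the imputer's strategy
strategy_keys = [name for name in demo_final_pipeline.get_params() if name.endswith('strategy')]
print(strategy_keys)

# The same name works with set_params, which is what a grid search calls internally
demo_final_pipeline.set_params(preparation__num__imputer__strategy='mean')
print(demo_final_pipeline.get_params()['preparation__num__imputer__strategy'])
```

Printing `sorted(final_pipeline.get_params().keys())` on your own pipeline is a quick way to check that the keys in `param_grid` are spelled exactly as scikit-learn expects.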


In [None]:
# Create grid search parameters for pipeline
param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

# Five-fold cross-validation: (3 strategies * 16 values of k) * 5 folds = 240 rounds of training
### YOUR CODE HERE ###