# Hyperparameter Tuning

## Imports

In [1]:
import pickle
import pandas as pd
from sklearn import svm

from cross_validation import custom_cross_validation, hyperparameter_search

## Load DataFrame

In [2]:
dirname = '../data/processed/'
df = pd.read_csv(dirname + 'housing_data_2_trimmed.csv')

# Drop non-numeric features, except 'postal_code'
df = df.drop(columns=['city', 'state', 'sold_date'])
df.shape

(5643, 51)

Now that we know which models perform better, it's time to cross-validate them and tune their hyperparameters.
- Do a Google search for sensible hyperparameter ranges for each type of model.

sklearn's `GridSearchCV` and `RandomizedSearchCV` are great methods for checking off both of these tasks.
- BUT we have a problem - if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the full training data, we can't cross-validate with these built-in tools as they are.
- The rows in each validation fold were part of the original calculation of the mean for that city - that means we're leaking information!
- While sklearn's built-in functions are extremely useful, sometimes it is necessary to do things ourselves.

You need to create two functions to replicate what `GridSearchCV` does under the hood:

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, compute the city means on the training folds only, just like you did in Notebook 1 (you may have to re-join the city column to do this), and then join these values to the validation fold.
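
The steps above can be sketched roughly as follows. This is a minimal illustration, not the `custom_cross_validation()` imported at the top of this notebook: the function name `city_mean_folds`, the target column `sold_price`, the new `city_mean` column, and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def city_mean_folds(data, n_splits=5, city_col="city", target_col="sold_price"):
    """Return (train_folds, val_folds), each fold carrying a 'city_mean' column.

    Hypothetical helper: column names are assumptions, not the notebook's schema.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    train_folds, val_folds = [], []
    for train_idx, val_idx in kf.split(data):
        train = data.iloc[train_idx].copy()
        val = data.iloc[val_idx].copy()
        # City means are computed from the training rows of this split only.
        city_means = train.groupby(city_col)[target_col].mean()
        fallback = train[target_col].mean()
        train["city_mean"] = train[city_col].map(city_means)
        # Cities missing from this training split fall back to the overall training mean.
        val["city_mean"] = val[city_col].map(city_means).fillna(fallback)
        train_folds.append(train)
        val_folds.append(val)
    return train_folds, val_folds


# Toy data standing in for the housing DataFrame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "city": rng.choice(["A", "B", "C"], size=30),
    "sqft": rng.integers(500, 3000, size=30),
    "sold_price": rng.integers(100_000, 900_000, size=30),
})
train_folds, val_folds = city_mean_folds(toy, n_splits=5)
```

Because the validation rows never contribute to `city_means`, the encoded feature carries no information about the validation targets.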

**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values
- For each set of hyperparameter values, fit your chosen model on each set of training folds, score it on the matching validation fold, and take the average of your chosen scoring metric across folds. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values
- Your function should output the hyperparameter values corresponding to the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.
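
One possible shape for such a search loop is sketched below. It is an illustration under stated assumptions rather than the `hyperparameter_search()` used in this notebook: the name `grid_search_folds`, the fold layout (a list of `(X_train, y_train, X_val, y_val)` tuples), the choice of R² as the metric, and the synthetic data are all made up for the example.

```python
from itertools import product

import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.svm import SVR


def grid_search_folds(folds, param_grid):
    """Return the parameter combination with the highest mean R^2 across folds.

    folds: list of (X_train, y_train, X_val, y_val) tuples (assumed layout).
    """
    names = list(param_grid)
    best = {"score": -np.inf}
    # product() yields every combination of the listed values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for X_tr, y_tr, X_val, y_val in folds:
            model = SVR(kernel="rbf", **params).fit(X_tr, y_tr)
            scores.append(r2_score(y_val, model.predict(X_val)))
        mean_score = float(np.mean(scores))
        if mean_score > best["score"]:
            best = {**params, "score": mean_score}
    return best


# Synthetic regression data split into three folds.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)
splitter = KFold(n_splits=3, shuffle=True, random_state=0)
folds = [(X[tr], y[tr], X[va], y[va]) for tr, va in splitter.split(X)]

best = grid_search_folds(folds, {"C": [0.1, 10.0], "gamma": [0.01, 0.1]})
print(best)
```

The returned dictionary mirrors the output format shown in this notebook (parameter values plus a `score` key), which makes it easy to pass the winning values straight into a new estimator.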

In [3]:
train_validate_folds = custom_cross_validation(df, 5)

In [4]:
best_model = hyperparameter_search(
    train_validate_folds[0], 
    train_validate_folds[1],
    param_grid={
        'C': [250_000, 500_000],
        'gamma': [10, 50]
    }
)
best_model

{'C': 500000, 'gamma': 10, 'score': 0.8058787487874819}

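Next, we build an SVR with the best hyperparameters that we found and pickle it to an external file. Note that the estimator is pickled unfitted here, so it must be fit on the training data before it can make predictions.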

In [5]:
model = svm.SVR(
    kernel='rbf',
    C=best_model['C'],
    gamma=best_model['gamma'],
    epsilon=1.0
)

dirname = '../models/'
basename = 'best_svm.pkl'
with open(dirname + basename, 'wb') as f:
    pickle.dump(model, f)

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
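- Beware that a pipeline can't take plain functions as steps: every intermediate step must be an object with `fit` and `transform` methods, and the final step needs at least `fit`.
- Custom classes can be used to get around this, but sklearn now also provides `FunctionTransformer`, a wrapper for user-defined functions.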
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 
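
As a rough sketch of how a user-defined function can sit inside a pipeline, the example below wraps one in `FunctionTransformer`. The function `drop_text_columns`, the step names, the column names, and the toy data are assumptions for illustration; the real processing steps for the housing data would replace them.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVR


def drop_text_columns(frame):
    """Hypothetical processing step: keep only the numeric columns."""
    return frame.select_dtypes(include="number")


pipeline = Pipeline([
    # FunctionTransformer gives the plain function fit/transform methods.
    ("drop_text", FunctionTransformer(drop_text_columns)),
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

# Toy data standing in for the processed housing DataFrame.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "city": rng.choice(["A", "B"], size=40),
    "sqft": rng.normal(1500, 300, size=40),
    "beds": rng.integers(1, 6, size=40),
})
y = X["sqft"] * 200 + rng.normal(scale=1000, size=40)

pipeline.fit(X, y)
preds = pipeline.predict(X.head(5))
```

Calling `pipeline.predict()` on new rows runs the same column filter and the already fitted scaler before the model, so nothing is re-estimated from the new data.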

In [6]:
# Build pipeline here

Pipelines come from sklearn. When a fitted pipeline is pickled, everything its steps have learned is stored with it. For example, if we were deploying a model and had fit a scaler on the training data, we would want that same, already fitted scaler to transform the new data. All of this is preserved when the pipeline is pickled.
- Save your final pipeline in your `models/` folder
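
A small round trip can confirm that the fitted state survives pickling. This is only a sketch on synthetic data: the step names and variables are assumptions, and it serialises to bytes in memory with `pickle.dumps` instead of writing a file into `models/`.

```python
import pickle

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic training data standing in for the processed housing features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(40, 2))
y_train = X_train[:, 0] + rng.normal(size=40)

fitted = Pipeline([
    ("scale", StandardScaler()),
    ("svr", SVR()),
]).fit(X_train, y_train)

# Serialise and restore; with a file this would be pickle.dump / pickle.load.
payload = pickle.dumps(fitted)
restored = pickle.loads(payload)

# The restored scaler still holds the training means, and predictions match.
X_new = rng.normal(loc=50.0, scale=10.0, size=(3, 2))
same_means = np.allclose(
    restored.named_steps["scale"].mean_, fitted.named_steps["scale"].mean_
)
same_preds = np.allclose(restored.predict(X_new), fitted.predict(X_new))
```

Pickles are generally only safe to load with a compatible sklearn version, so it helps to record the version used alongside the saved file.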

In [7]:
# save your pipeline here