# Housing Prices in King County, WA: Support Vector Regression
Goal
- Use Support Vector Regression to build a model that predicts the sale price of a home from attributes of the house

## Obtain Data

In [2]:
# global imports

# sklearn features
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

In [3]:
# import dataframes 
%store -r dfs

# assign dataframes to variables
X_train = dfs[0]
X_val = dfs[1]
X_test = dfs[2]
y_train = dfs[3]
y_val = dfs[4]
y_test = dfs[5] 

# check importing data frames worked
df = [X_train, X_val, X_test, y_train, y_val, y_test]
for d in df:
    print(d.shape)

(3181, 18)
(682, 18)
(682, 18)
(3181, 1)
(682, 1)
(682, 1)


## Train Support Vector Regression Model
Model Evaluation
- Use the model to predict price from the predictors of a held-out dataset (validation during tuning, test for the final model)
- Compute metrics to compare the predictions with the actual prices in that dataset
    - Minimize mean absolute error and mean squared error
        - Mean absolute error: the average absolute difference between the observed price and the predicted price
        - Mean squared error: the average squared difference between the observed price and the predicted price
            - Penalizes large errors more heavily than mean absolute error does
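
The two metrics above can be checked by hand. The sketch below uses four hypothetical prices (not taken from the King County data) to show that the manual formulas agree with the sklearn functions, and that a single large error accounts for a bigger share of the mean squared error than of the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# hypothetical prices, chosen only to illustrate the two metrics
observed = np.array([300000.0, 450000.0, 500000.0, 1200000.0])
predicted = np.array([310000.0, 440000.0, 520000.0, 900000.0])

errors = observed - predicted

# mean absolute error: average of |observed - predicted|
mae_manual = np.mean(np.abs(errors))
# mean squared error: average of (observed - predicted)^2
mse_manual = np.mean(errors ** 2)

# the manual results agree with the sklearn functions
assert np.isclose(mae_manual, mean_absolute_error(observed, predicted))
assert np.isclose(mse_manual, mean_squared_error(observed, predicted))

# share of each total contributed by the single largest error
share_of_mae = np.abs(errors).max() / np.abs(errors).sum()
share_of_mse = (errors ** 2).max() / (errors ** 2).sum()

print("Mean Absolute Error: %s" % mae_manual)
print("Mean Squared Error: %s" % mse_manual)
print("Largest error share of MAE: %.3f" % share_of_mae)
print("Largest error share of MSE: %.3f" % share_of_mse)
```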

Steps to Train the Support Vector Regression Model
1. Train a base model with default parameters
2. Evaluate base model using validation dataset
3. Train a model with preliminary best parameters chosen through random search
    - Test a wide range of parameter values
        - Choose the set of parameters that minimizes mean squared error
    - Random search is appropriate for preliminary estimates
        - Has a faster runtime than grid search
            - This is because it samples a fixed number of parameter combinations instead of trying all of them
4. Evaluate random search model using validation dataset
5. Train a model with best parameters chosen through grid search
    - Test a narrow range of parameter values
        - Choose the set of parameters that minimizes mean squared error
    - Grid search is appropriate for final estimates
        - Tries all combinations of parameters
6. Evaluate grid search model using validation dataset
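
To make the runtime argument in the steps above concrete, the sketch below counts how many model fits each search performs. It assumes the two parameter grids and the settings (`n_iter = 100`, `cv = 3`) used in the search cells of this notebook, and uses `ParameterGrid` only to count combinations. The variable names for the counts are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# wide grid used for random search (copied from the random search cell)
random_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [1, 3, 5, 7],
    'gamma': ['auto', 'scale'],
    'C': [0.1, 1, 10, 100],
    'epsilon': [0.01, 0.1, 1, 2]}

# narrow grid used for grid search (copied from the grid search cell)
params_grid = {
    'kernel': ['linear'],
    'gamma': ['auto'],
    'epsilon': [0.0075, 0.01, 0.0125],
    'degree': [4, 5, 6],
    'C': [50, 75, 100, 125, 150]}

n_iter = 100  # combinations sampled by random search
cv = 3        # cross-validation folds

# number of distinct parameter combinations in each grid
n_wide = len(ParameterGrid(random_grid))
n_narrow = len(ParameterGrid(params_grid))

# number of model fits during the search (excluding the final refit)
random_search_fits = n_iter * cv
grid_search_fits = n_narrow * cv
exhaustive_wide_fits = n_wide * cv

print("Combinations in wide grid: %s" % n_wide)
print("Combinations in narrow grid: %s" % n_narrow)
print("Fits for random search: %s" % random_search_fits)
print("Fits for grid search on narrow grid: %s" % grid_search_fits)
print("Fits if the wide grid were searched exhaustively: %s" % exhaustive_wide_fits)
```

Random search covers the wide grid with 300 fits, where an exhaustive search of the same grid would need 1536.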

In [4]:
# unfitted support vector regressor, used as the estimator in both searches
svr_model = SVR()

# function that evaluates a fitted model on a held-out dataset
def evaluate(model, test_pred, test_resp):
    """Print the MAE and MSE of the model's predictions for test_pred against test_resp."""
    # predictions
    predict = model.predict(test_pred)
    # metrics on the supplied data (validation or test)
    test_mae = mean_absolute_error(test_resp, predict)
    test_mse = mean_squared_error(test_resp, predict)
    # print results
    print("Mean Absolute Error: %s" % test_mae)
    print("Mean Squared Error: %s" % test_mse)

In [5]:
# fit the base model (gamma set explicitly to 'auto') and evaluate it on the validation dataset
base_model = SVR(gamma = 'auto')
base_model.fit(X_train, y_train.values.ravel())
evaluate(base_model, X_val, y_val.values.ravel())

Mean Absolute Error: 210590.9825763046
Mean Squared Error: 101625618270.59404


In [6]:
# random search grid
random_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [1, 3, 5, 7],
    'gamma': ['auto', 'scale'],
    'C': [0.1, 1, 10, 100],
    'epsilon': [0.01, 0.1, 1, 2]}

# fit random search
svr_random = RandomizedSearchCV(estimator = svr_model, param_distributions = random_grid,
                                n_iter = 100, cv = 3, scoring = 'neg_mean_squared_error')
svr_random.fit(X_train, y_train.values.ravel())

# output best parameters from random search
svr_random.best_params_

{'kernel': 'linear', 'gamma': 'auto', 'epsilon': 0.1, 'degree': 5, 'C': 100}
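
The best parameters above pair a linear kernel with values for `degree` and `gamma`. In scikit-learn's `SVR`, `degree` is only used by the `poly` kernel and `gamma` only by the `rbf`, `poly` and `sigmoid` kernels, so for a linear kernel these two values do not affect the fitted model. The sketch below illustrates this on small synthetic data (an assumption made for the example, not the housing data).

```python
import numpy as np
from sklearn.svm import SVR

# synthetic regression data, used only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

# two linear-kernel models that differ only in degree and gamma
model_a = SVR(kernel = 'linear', degree = 1, gamma = 'auto', C = 10)
model_b = SVR(kernel = 'linear', degree = 5, gamma = 'scale', C = 10)
model_a.fit(X, y)
model_b.fit(X, y)

# degree and gamma are ignored by the linear kernel, so the predictions match
same_predictions = np.allclose(model_a.predict(X), model_b.predict(X))
print("Predictions match: %s" % same_predictions)
```

One consequence is that, once the kernel is fixed to `linear`, only `C` and `epsilon` change the cross-validated score.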

In [7]:
# evaluate best random search model
best_random = svr_random.best_estimator_
evaluate(best_random, X_val, y_val.values.ravel())

Mean Absolute Error: 193774.88883209048
Mean Squared Error: 91739650596.68297


In [8]:
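# grid search parameters, narrowed after the random search
# note: random search selected epsilon = 0.1, but this grid covers values near 0.01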
params_grid = {
    'kernel': ['linear'],
    'gamma': ['auto'],
    'epsilon': [0.0075, 0.01, 0.0125],
    'degree': [4, 5, 6],
    'C': [50, 75, 100, 125, 150]
}

# fit grid search
svr_grid = GridSearchCV(estimator = svr_model, param_grid = params_grid, cv = 3,
                        scoring = 'neg_mean_squared_error')
svr_grid.fit(X_train, y_train.values.ravel())

# output best parameters from grid search
svr_grid.best_params_

{'C': 150, 'degree': 4, 'epsilon': 0.0075, 'gamma': 'auto', 'kernel': 'linear'}

In [9]:
# evaluate best grid search model
best_grid = svr_grid.best_estimator_
evaluate(best_grid, X_val, y_val.values.ravel())

Mean Absolute Error: 187840.9805491457
Mean Squared Error: 88153486487.66376


## Final Support Vector Regressor Model
- Train a model with best parameters found from grid search
- Evaluate the model using the test dataset
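    - Compare the test metrics with the validation metrics of the grid search model (training metrics are not computed in this notebook)
        - The mean absolute error is similar: about 8% higher on the test dataset than on the validation dataset
        - The mean squared error is about 62% higher on the test dataset, which points to some larger errors among the test predictions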

In [11]:
# check model performance on the test dataset
final_model = SVR(C = 150, degree = 4, epsilon = 0.0075, gamma = 'auto', kernel = 'linear')
final_model.fit(X_train, y_train.values.ravel())
evaluate(final_model, X_test, y_test.values.ravel())

Mean Absolute Error: 203642.65827867747
Mean Squared Error: 142434180920.8759
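
Mean squared error is expressed in squared dollars, which makes it hard to read next to the mean absolute error. As a rough aid, the sketch below takes the square root of the values printed above (root mean squared error, a metric not used elsewhere in this notebook) and compares the test metrics with the validation metrics of the grid search model. The numbers are copied from the cell outputs and the variable names are illustrative.

```python
import math

# metrics copied from the outputs of the grid search and final model cells
val_mae, val_mse = 187840.9805491457, 88153486487.66376
test_mae, test_mse = 203642.65827867747, 142434180920.8759

# root mean squared error is in the same units as price
val_rmse = math.sqrt(val_mse)
test_rmse = math.sqrt(test_mse)

# ratio of test metric to validation metric
mae_ratio = test_mae / val_mae
mse_ratio = test_mse / val_mse

print("Validation RMSE: %.0f" % val_rmse)
print("Test RMSE: %.0f" % test_rmse)
print("Test / validation MAE: %.2f" % mae_ratio)
print("Test / validation MSE: %.2f" % mse_ratio)
```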
