# Housing Prices in King County, WA: Random Forest
Goal
- Use Random Forests to create a model that predicts the sale price of homes given various attributes about the house

## Obtain Data

In [1]:
# global imports

# sklearn features
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

In [2]:
# import dataframes 
%store -r dfs

# assign dataframes to variables
X_train = dfs[0]
X_val = dfs[1]
X_test = dfs[2]
y_train = dfs[3]
y_val = dfs[4]
y_test = dfs[5] 

# check importing data frames worked
df = [X_train, X_val, X_test, y_train, y_val, y_test]
for d in df:
    print(d.shape)

(3181, 18)
(682, 18)
(682, 18)
(3181, 1)
(682, 1)
(682, 1)


## Train Random Forest Model
Model Evaluation
- Use model to make predictions for price given test predictors
- Compute metrics to compare predictions with actual price for test dataset
    - Minimize mean absolute error and mean squared error
        - Mean absolute error: the average absolute difference between the observed price and predicted price
        - Mean squared error: the average squared difference between observed price and predicted price
            - Gives a higher weight than mean absolute error for large errors
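
To illustrate why mean squared error gives more weight to large errors, here is a minimal sketch in plain Python. The prices are made-up numbers for illustration only, not values from the King County data, and the helpers `mae` and `mse` are hypothetical stand-ins for sklearn's `mean_absolute_error` and `mean_squared_error`.

```python
# Illustrative only: made-up prices, not values from the dataset
observed = [300000, 400000, 500000, 600000]

# Two sets of predictions with the same total absolute error:
# one spreads the error evenly, the other puts it all on a single house
pred_even = [310000, 410000, 510000, 610000]
pred_one_big = [300000, 400000, 500000, 640000]

def mae(y_true, y_pred):
    # average absolute difference
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # average squared difference
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mae_even, mse_even = mae(observed, pred_even), mse(observed, pred_even)
mae_big, mse_big = mae(observed, pred_one_big), mse(observed, pred_one_big)

print(mae_even, mae_big)  # both 10000.0
print(mse_even, mse_big)  # 100000000.0 vs 400000000.0
```

Both sets of predictions have the same mean absolute error, but the one with a single large miss has four times the mean squared error.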

Steps to Train the Random Forest Model
1. Train a base model with 10 trees and default values for all other parameters
2. Evaluate base model using validation dataset
3. Train a model with preliminary best parameters chosen through random search
    - Test a wide range of parameter values
        - Choose the set of parameters that minimizes mean squared error
    - Random search is appropriate for preliminary estimates
        - Has a faster runtime than grid search
            - This is because it samples a fixed number of parameter combinations instead of trying all of them
4. Evaluate random search model using validation dataset
5. Train a model with best parameters chosen through grid search
    - Test a narrow range of parameter values
        - Choose the set of parameters that minimizes mean squared error
    - Grid search is appropriate for final estimates
        - Tries all combinations of parameters
6. Evaluate grid search model using validation dataset
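
As a rough sketch of why random search is cheaper for a wide range and grid search is affordable for a narrow one, the snippet below counts parameter combinations and cross-validation fits. The per-parameter counts are taken from the two search grids used in this notebook; the helper `n_combinations` and the variable names are hypothetical, and the final refit on the full training set is not counted.

```python
from math import prod

# Number of candidate values per parameter, counted from the
# random search grid and the grid search grid in this notebook
random_grid_sizes = {
    'bootstrap': 2, 'max_depth': 6, 'max_features': 2,
    'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 5}
params_grid_sizes = {
    'bootstrap': 1, 'max_depth': 3, 'max_features': 1,
    'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 3}

def n_combinations(sizes):
    # number of distinct parameter combinations in a grid
    return prod(sizes.values())

cv = 3        # folds used in both searches
n_iter = 100  # combinations sampled by the random search

random_total = n_combinations(random_grid_sizes)
grid_total = n_combinations(params_grid_sizes)

print(random_total, n_iter * cv)    # 1080 possible combinations, 300 fits
print(grid_total, grid_total * cv)  # 54 combinations, 162 fits
```

An exhaustive search over the wide grid would need 1080 x 3 = 3240 fits, so sampling 100 combinations cuts the work by roughly a factor of ten.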

In [11]:
# general random forest regressor
rf_model = RandomForestRegressor()

# function that evaluates a fitted model on a held-out dataset
def evaluate(model, test_pred, test_resp):
    # predictions for the held-out predictors
    predict = model.predict(test_pred)
    # metrics comparing observed and predicted prices
    test_mae = mean_absolute_error(test_resp, predict)
    test_mse = mean_squared_error(test_resp, predict)
    # print results
    print("Mean Absolute Error: %s" % test_mae)
    print("Mean Squared Error: %s" % test_mse)

In [12]:
# evaluate the base model
base_model = RandomForestRegressor(n_estimators = 10, random_state = 123)
base_model.fit(X_train, y_train.values.ravel())
evaluate(base_model, X_val, y_val.values.ravel())

Mean Absolute Error: 124628.17442109779
Mean Squared Error: 43352277985.34949


In [5]:
# random search grid
random_grid = {
    'bootstrap': [True, False],
    'max_depth': [None, 10, 25, 50, 75, 100],
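    # 1.0 = all features, which is what 'auto' meant for regressors
    # ('auto' was removed in scikit-learn 1.3)
    'max_features': [1.0, 'sqrt'],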
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 500, 1000, 1500, 2000]}

# fit random search
rf_random = RandomizedSearchCV(estimator = rf_model, param_distributions = random_grid,
                               n_iter = 100, cv = 3, scoring = 'neg_mean_squared_error')
rf_random.fit(X_train, y_train.values.ravel())

# output best parameters from random search
rf_random.best_params_

{'n_estimators': 2000,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 75,
 'bootstrap': False}

In [13]:
# evaluate best random search model
best_random = rf_random.best_estimator_
evaluate(best_random, X_val, y_val.values.ravel())

Mean Absolute Error: 115083.0792970653
Mean Squared Error: 37131636642.76534


In [7]:
# grid search parameters
params_grid = {
    'bootstrap': [False],
    'max_depth': [65, 75, 85],
    'max_features': ['sqrt'],
    'min_samples_leaf': [2,3],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [1800, 2000, 2200]}

# fit grid search
rf_grid = GridSearchCV(estimator = rf_model, param_grid = params_grid, cv = 3,
                       scoring = 'neg_mean_squared_error')
rf_grid.fit(X_train, y_train.values.ravel())

# output best parameters from grid search
rf_grid.best_params_

{'bootstrap': False,
 'max_depth': 75,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 1800}

In [14]:
# evaluate best grid search model
best_grid = rf_grid.best_estimator_
evaluate(best_grid, X_val, y_val.values.ravel())

Mean Absolute Error: 115254.41338666176
Mean Squared Error: 37129802086.589455
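
To make the three validation results easier to compare, the sketch below converts each mean squared error to a root mean squared error (in dollars) and expresses each mean absolute error as a percent change from the base model. The metric values are copied from the outputs above; the `results` and `summary` dictionaries are hypothetical names introduced for illustration.

```python
from math import sqrt

# Validation metrics copied from the outputs above
results = {
    'base': {'mae': 124628.17442109779, 'mse': 43352277985.34949},
    'random search': {'mae': 115083.0792970653, 'mse': 37131636642.76534},
    'grid search': {'mae': 115254.41338666176, 'mse': 37129802086.589455},
}

base_mae = results['base']['mae']
summary = {}
for name, m in results.items():
    rmse = sqrt(m['mse'])  # root mean squared error, in dollars
    mae_change = 100 * (m['mae'] - base_mae) / base_mae  # percent change vs base
    summary[name] = (rmse, mae_change)
    print("%s: RMSE = %.0f, MAE change vs base = %.1f%%" % (name, rmse, mae_change))
```

On these numbers the random search and grid search models are nearly indistinguishable on the validation set: the grid search model has a marginally lower mean squared error and a marginally higher mean absolute error.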


## Final Random Forest Model
- Train a model with the best parameters found by grid search
- Evaluate the model using the test dataset

In [15]:
# check model performance on the test dataset
final_model = RandomForestRegressor(n_estimators = 1800, min_samples_split = 5,
                                    min_samples_leaf = 2, max_features = 'sqrt',
                                    bootstrap = False, max_depth = 75)
final_model.fit(X_train, y_train.values.ravel())
evaluate(final_model, X_test, y_test.values.ravel())

Mean Absolute Error: 118306.92566531352
Mean Squared Error: 46120275428.31522


## Examine Feature Importance
### Important Features
- 'sqftAbove'
    - Data visualizations depict a positive, linear relationship between 'price' and 'sqftAbove'
        - Typically, increases in above ground living space correlate with increases in price
        - See "Scatterplot of Sale Price vs Above Ground Living Space" in the *WashingtonHouseSales-DataVisualization* notebook
- 'bathroom'
    - Data visualizations depicting the relationship between 'price' and 'bathroom' show that sale price varies greatly depending on the number of bathrooms in a home
        - Houses with more bathrooms tend to be sold for higher prices
        - See "Boxplot of Sale Price Grouped by Number of Bathrooms" in the *WashingtonHouseSales-DataVisualization* notebook 

### Unimportant Features
- 'location' (Vashon Island, North, East Rural, South Rural)
    - Data visualizations depicting the relationship between 'location' and 'price' show that houses are sold for similar prices in Vashon Island, North, East Rural, and South Rural
        - See "Boxplot of Sale Price Grouped by Location" in the *WashingtonHouseSales-DataVisualization* notebook
- 'waterfront'
    - Data visualizations depicting the relationship between 'waterfront' and 'price' show the two categories for 'waterfront' have a similar price range
        - See "Boxplot of Sale Price Grouped by Waterfront" in the *WashingtonHouseSales-DataVisualization* notebook

In [10]:
# feature names and their importances from the final model
names = X_train.columns
importance = final_model.feature_importances_
# pair each feature with its importance
importance_list = [[col, imp] for col, imp in zip(names, importance)]

# order the list by feature importance (ascending)
sorted(importance_list, key = lambda x: x[1])

[['location_Vashon Island', 0.0003395784243543726],
 ['location_North', 0.0027051279121225636],
 ['location_East Rural', 0.003158919063874182],
 ['location_South Rural', 0.006449035848101613],
 ['waterfront', 0.008554047608073826],
 ['condition', 0.015863262312040827],
 ['location_Seattle', 0.021407641114138457],
 ['floors', 0.02560871550843141],
 ['bedroom', 0.03557651719609657],
 ['yrWorked', 0.04087033108935593],
 ['location_East Urban', 0.046498513627482044],
 ['yrBuilt', 0.05513950621847217],
 ['sqftLot', 0.0635883641299262],
 ['location_South Urban', 0.07778539152397673],
 ['sqftBelow', 0.0857958352041598],
 ['view', 0.09076148910324426],
 ['bathroom', 0.14813550556307842],
 ['sqftAbove', 0.271762218553071]]
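
Because 'location' appears as seven separate dummy columns, its importance is split across them. As a rough sketch, one way to view the variable as a whole is to sum the importances of its dummy columns. The values below are copied from the output above; the grouping rule (any column starting with 'location_' belongs to 'location') is an assumption made for illustration, and summed impurity-based importances should be read as a heuristic rather than a definitive ranking.

```python
from collections import defaultdict

# Feature importances copied from the output above
importance_list = [
    ['location_Vashon Island', 0.0003395784243543726],
    ['location_North', 0.0027051279121225636],
    ['location_East Rural', 0.003158919063874182],
    ['location_South Rural', 0.006449035848101613],
    ['waterfront', 0.008554047608073826],
    ['condition', 0.015863262312040827],
    ['location_Seattle', 0.021407641114138457],
    ['floors', 0.02560871550843141],
    ['bedroom', 0.03557651719609657],
    ['yrWorked', 0.04087033108935593],
    ['location_East Urban', 0.046498513627482044],
    ['yrBuilt', 0.05513950621847217],
    ['sqftLot', 0.0635883641299262],
    ['location_South Urban', 0.07778539152397673],
    ['sqftBelow', 0.0857958352041598],
    ['view', 0.09076148910324426],
    ['bathroom', 0.14813550556307842],
    ['sqftAbove', 0.271762218553071]]

grouped = defaultdict(float)
for col, imp in importance_list:
    # assumption: every dummy column for 'location' starts with 'location_'
    key = 'location' if col.startswith('location_') else col
    grouped[key] += imp

# order by summed importance (ascending)
for name, imp in sorted(grouped.items(), key = lambda x: x[1]):
    print("%s: %.4f" % (name, imp))
```

With these numbers the summed 'location' importance is about 0.158, slightly above 'bathroom', even though four of the individual location columns rank at the bottom of the list.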