Let’s start by importing the libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

%matplotlib inline

Next, we’ll import the algorithms that we will fit to our dataset.

In [2]:
# Import regularized linear regression algorithms
from sklearn.linear_model import ElasticNet, Ridge, Lasso

# Import tree ensemble algorithms
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

Finally, let’s import the dataset and check its shape.

In [3]:
df = pd.read_csv('../dataset/analytical_base_table.csv')

df.shape

(1863, 42)

Split Dataset
-------------

Data should be thought of as a limited resource. It can be split into a
training set and a test set, but the same observations can’t be used for
both.

**Objectives:**

-   Split the dataset into training and test sets using scikit-learn

In [4]:
# function for splitting data
from sklearn.model_selection import train_test_split

Before completing the data split, we’ll need to separate our target
variable and our input features.

In [5]:
# separate object for our target variable
y = df.tx_price

# separate object for input features
X = df.drop('tx_price', axis=1)

Next, we’ll split the dataset with 20% of our observations set aside for
testing.

In [6]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# verify length of each set
len(X_train), len(X_test), len(y_train), len(y_test)

(1490, 373, 1490, 373)
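
As a minimal sketch of what the split does, the snippet below applies the same call to a small synthetic frame and checks that no row lands in both sets. The toy data and the names toy, sqft and price are made up for illustration and are not taken from the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature and one target
toy = pd.DataFrame({'sqft': np.arange(10) * 100, 'price': np.arange(10) * 1000})

toy_X = toy.drop('price', axis=1)
toy_y = toy['price']

# Hold out 20% of the rows; random_state makes the split reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234)

# The two index sets are disjoint, so no observation is used for both
overlap = set(toy_X_train.index) & set(toy_X_test.index)
print(len(toy_X_train), len(toy_X_test), len(overlap))
```

Rerunning the cell with the same random_state should give the same split, which keeps later comparisons between models fair.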

Model Pipeline
==============

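Many algorithms require the data to be preprocessed before training. For
example, regularized linear models such as Lasso, Ridge, and Elastic-Net
are sensitive to the scale of the input features.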

**Objectives:**

-   Standardize data set
-   Set up pipelines

Since our models will be tuned with cross-validation, we’ll set up
pipelines so that the preprocessing is refit on the training portion of
every fold.

In [7]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Next, we’ll create a pipelines dictionary with all of our algorithms.

In [8]:
pipelines = {
    'lasso' : make_pipeline(StandardScaler(), Lasso(random_state=123)),
    'ridge' : make_pipeline(StandardScaler(), Ridge(random_state=123)),
    'enet' : make_pipeline(StandardScaler(), ElasticNet(random_state=123)),
    'rf' : make_pipeline(StandardScaler(), RandomForestRegressor(random_state=123)),
    'gb' : make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=123))
}
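
To illustrate why the scaler sits inside the pipeline, here is a minimal sketch on synthetic data (demo_X and demo_y are hypothetical and generated only for this example). When the whole pipeline is passed to a cross-validation helper, the scaler is refit on the training portion of each fold, so statistics from the held-out portion never leak into preprocessing.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, for illustration only
rng = np.random.RandomState(0)
demo_X = rng.normal(size=(100, 3))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_pipeline = make_pipeline(StandardScaler(), Ridge(random_state=123))

# In each of the 5 folds, the scaler is fit on the training portion only,
# then applied to the held-out portion before scoring
demo_scores = cross_val_score(demo_pipeline, demo_X, demo_y, cv=5)
print(demo_scores.mean())
```

Scaling the full dataset once, before cross-validation, would instead let every fold see the mean and variance of the data it is later scored on.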

Declare Hyperparameters to Tune
-------------------------------

Unlike model parameters, which are learned from the training data,
hyperparameters are set manually before training and control how the
algorithm learns.

**Objectives:**

-   Create a hyperparameter grid for each algorithm

In [9]:
# lasso hyperparameters
lasso_hyperparameters = {
    'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}   

# ridge hyperparameters
ridge_hyperparameters = {
    'ridge__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}

# elastic net hyperparameters
enet_hyperparameters = {
    'elasticnet__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
    'elasticnet__l1_ratio' : [0.1, 0.3, 0.5, 0.7, 0.9]
}

# random forest hyperparameters
rf_hyperparameters = {
    'randomforestregressor__n_estimators' : [100, 200],
    # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
    'randomforestregressor__max_features' : [1.0, 'sqrt', 0.33]
}

# gradient boost hyperparameters
gb_hyperparameters = {
    'gradientboostingregressor__n_estimators': [100, 200],
    'gradientboostingregressor__learning_rate' : [0.05, 0.1, 0.2],
    'gradientboostingregressor__max_depth' : [1, 3, 5]
}
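
The keys in the grids above follow the pattern step name, two underscores, parameter name. The short sketch below shows where those names come from: make_pipeline names each step after its lowercased class name, and get_params lists every setting that can be tuned. The variable names demo_enet and demo_keys are placeholders for this example.

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_enet = make_pipeline(StandardScaler(), ElasticNet(random_state=123))

# Step names are generated from the lowercased class names
print(list(demo_enet.named_steps))

# Every tunable setting is exposed as '<step name>__<parameter>'
demo_keys = [k for k in demo_enet.get_params() if k.startswith('elasticnet__')]
print(sorted(demo_keys))
```

A misspelled key in a grid typically surfaces as an invalid-parameter error when the search is fitted, so printing the available keys first is a cheap sanity check.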


Next, we’ll create a dictionary for all of the hyperparameters.

In [10]:
hyperparameters = {
    'rf' : rf_hyperparameters, 
    'gb' : gb_hyperparameters,
    'lasso' : lasso_hyperparameters,
    'ridge' : ridge_hyperparameters,
    'enet' : enet_hyperparameters
}

Cross Validation
----------------

In [11]:
from sklearn.model_selection import GridSearchCV

In [12]:
fitted_models = {}

for name, pipeline in pipelines.items():
    # create cross-validation object 
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)

    # fit model on X_train, y_train
    model.fit(X_train, y_train)

    # store model in dictionary
    fitted_models[name] = model

    # print message after model has been fitted
    print(name, 'has been fitted.')

lasso has been fitted.
ridge has been fitted.
enet has been fitted.
rf has been fitted.
gb has been fitted.
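
Once a search has been fitted, it exposes the winning grid point and the score of every candidate. The following is a minimal sketch on synthetic data, not on the dataset above; demo_X, demo_y and demo_search are hypothetical names, and the smaller grid and cv=5 are chosen only to keep the example quick.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, for illustration only
rng = np.random.RandomState(0)
demo_X = rng.normal(size=(120, 4))
demo_y = 3.0 * demo_X[:, 0] - 2.0 * demo_X[:, 1] + rng.normal(scale=0.5, size=120)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(random_state=123)),
    {'lasso__alpha': [0.01, 0.1, 1, 10]},
    cv=5)
demo_search.fit(demo_X, demo_y)

# Winning grid point and its mean cross-validated R^2
print(demo_search.best_params_)
print(demo_search.best_score_)

# One mean score per candidate in the grid
print(demo_search.cv_results_['mean_test_score'])
```

The same attributes are available on each entry of fitted_models, for example fitted_models['rf'].best_params_.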


Finally, we’ll verify that our models have been fitted correctly.

In [13]:
from sklearn.exceptions import NotFittedError

for name, model in fitted_models.items():
    try:
        pred = model.predict(X_test)
        print(name, 'has been fitted.')
    except NotFittedError as e:
        print(repr(e))

lasso has been fitted.
ridge has been fitted.
enet has been fitted.
rf has been fitted.
gb has been fitted.


Evaluate Models
===============

A first way to evaluate our models is to look at their cross-validated
score on the training set.

In [14]:
# display the mean cross-validated R^2 score of the best candidate for each model
for name, model in fitted_models.items():
    print(name, model.best_score_)

lasso 0.3074267600282174
ridge 0.31577627312984397
enet 0.34264066452633046
rf 0.48222955687042296
gb 0.4877585575860505


We’ll also want to calculate the R^2 score on the test set, so let’s
import the r2_score function.

In [15]:
from sklearn.metrics import r2_score

As another way to validate our models, we can assess their performance
using the mean absolute error (MAE).

In [16]:
from sklearn.metrics import mean_absolute_error
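
For reference, both metrics can be reproduced by hand with NumPy. The sketch below uses the standard definitions of MAE and R^2 and compares them with the scikit-learn functions; the four prices in demo_true and demo_pred are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical values, for illustration only
demo_true = np.array([200000.0, 350000.0, 500000.0, 425000.0])
demo_pred = np.array([210000.0, 330000.0, 480000.0, 450000.0])

# MAE: average absolute difference between prediction and truth
manual_mae = np.mean(np.abs(demo_true - demo_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((demo_true - demo_pred) ** 2)
ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_mae, mean_absolute_error(demo_true, demo_pred))
print(manual_r2, r2_score(demo_true, demo_pred))
```

MAE is expressed in the units of the target, which makes it easier to interpret than R^2 when the target is a price.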

Let’s test our models against our test data.

In [17]:
for name, model in fitted_models.items():
    pred = model.predict(X_test)
    print(name)
    print('--------')
    print('R^2:', r2_score(y_test, pred))
    print('MAE:', mean_absolute_error(y_test, pred))

lasso
--------
R^2: 0.40264242247282833
MAE: 85485.10224013317
ridge
--------
R^2: 0.40331359441549197
MAE: 85410.0700244062
enet
--------
R^2: 0.40259330411156247
MAE: 86506.60728298842
rf
--------
R^2: 0.5681374308500101
MAE: 68260.51735924934
gb
--------
R^2: 0.5410410083457009
MAE: 70777.06141879127


Let’s assess our models by answering the following questions:

-   Which model had the highest R^2 on the test set? **rf, Random Forest
    Regressor**

-   Which model had the lowest mean absolute error on the test set?
    **rf, Random Forest Regressor**
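
The two answers above can also be read off programmatically. In this small sketch the metrics are copied from the test output above and rounded to three decimals; the dictionary names test_r2 and test_mae are placeholders introduced for the example.

```python
# Test-set metrics copied from the output above, rounded to three decimals
test_r2 = {'lasso': 0.403, 'ridge': 0.403, 'enet': 0.403, 'rf': 0.568, 'gb': 0.541}
test_mae = {'lasso': 85485.102, 'ridge': 85410.070, 'enet': 86506.607,
            'rf': 68260.517, 'gb': 70777.061}

# Higher is better for R^2, lower is better for MAE
best_by_r2 = max(test_r2, key=test_r2.get)
best_by_mae = min(test_mae, key=test_mae.get)
print(best_by_r2, best_by_mae)
```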