# Regression Deep Learning Case Study: Boston housing price

<p style="text-align: justify">In this project tutorial we will discover how to develop and evaluate neural network models
using Keras for a regression problem. This project covers the following aspects:</p>



<ul>
<li>How to load a CSV dataset and make it available to Keras.</li>
<li>How to create a neural network model with Keras for a regression problem.</li>
<li>How to use scikit-learn with Keras to evaluate models using cross validation.</li>
<li>How to perform data preparation in order to improve skill with Keras models.</li>
<li>How to tune the network topology of models with Keras.</li>
</ul>

<p style="text-align: justify">For this project we will investigate the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data">Boston House Price</a> dataset. Each record in the database
describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/housing/">UCI Machine Learning Repository</a>):</p>

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centers
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per &#36;10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: $1000(Bk - 0.63)^2$, where $Bk$ is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in &#36;1000s

We can see that the input attributes have a mixture of units, which may require normalization of the features.

<p style="text-align: justify">A reasonable Mean Squared Error (MSE) for models on this dataset is around 20, in squared thousands of dollars (roughly &#36;4,500 once you take the square root). This is a good target to aim for with our neural network model.</p>
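
The conversion from MSE to a figure in dollars is simple arithmetic. The short sketch below works it through with the standard library; the variable names are illustrative only.

```python
import math

# MEDV is expressed in thousands of dollars, so MSE is in squared thousands.
mse = 20.0
rmse_thousands = math.sqrt(mse)        # back in thousands of dollars
rmse_dollars = rmse_thousands * 1000   # plain dollars

print(f"RMSE: {rmse_thousands:.2f} thousand dollars (about ${rmse_dollars:,.0f})")
```

The rounded figure quoted in the text is this value rounded to the nearest five hundred dollars.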

## Develop a Baseline Neural Network Model

In this section we will create a baseline neural network model for the regression problem. Let's start off by importing all of the functions and objects we will need for this tutorial.

In [1]:
# Load libraries

# numpy
import numpy as np

# pandas
import pandas as pd

# keras library
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

# scikit functions
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Using TensorFlow backend.


<p style="text-align: justify">We can now load our dataset from a file in the local directory. The dataset is in fact not in CSV format on the UCI Machine Learning Repository; the attributes are instead separated by whitespace. We can load this easily using the Pandas library. We can then split the input (<em>X</em>) and output (<em>Y</em>) attributes so that they are easier to model with Keras and scikit-learn.</p>

In [2]:
# load dataset
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = pd.read_csv("housing.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]
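
To see the loading and splitting steps without the data file, here is a minimal sketch that parses two made-up rows from an in-memory string. The values and the names `sample`, `X_demo` and `y_demo` are assumptions for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same whitespace-separated layout (13 inputs + MEDV).
sample = """\
0.1 18.0 2.3 0 0.54 6.5 65.0 4.1 1 296.0 15.3 396.9 5.0 24.0
0.2  0.0 7.1 0 0.47 6.4 78.9 5.0 2 242.0 17.8 396.9 9.1 21.6
"""

# A regular-expression separator treats any run of whitespace as one delimiter.
frame = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
values = frame.values

X_demo = values[:, 0:13]   # columns 0..12 are the input attributes
y_demo = values[:, 13]     # column 13 is the target (MEDV)

print(X_demo.shape, y_demo.shape)
```

The slice `0:13` excludes index 13, so the target column never leaks into the inputs.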

<p style="text-align: justify">We can create Keras models and evaluate them with scikit-learn by using handy wrapper objects provided by the Keras library. This is desirable because scikit-learn excels at evaluating models and will allow us to use powerful data preparation and model evaluation schemes with very few lines of code. The Keras wrapper class requires a function as an argument. This function, which we must define, is responsible for creating the neural network model to be evaluated.</p>

<p style="text-align: justify">Below, we define the function to create the baseline model to be evaluated. It is a simple model that has a <b>single fully connected hidden layer</b>, with the same number of neurons as input attributes (13). The network uses good practices such as the rectifier activation function for the hidden layer. No activation function is used for the output layer because this is a regression problem, and we are interested in predicting numerical values directly, without a transform.</p>

<p style="text-align: justify">The efficient <b>ADAM</b> optimization algorithm is used and a <b>mean squared error</b> loss function is optimized. This will be the same metric that we will use to evaluate the performance of the model. It is a desirable metric because by taking the square root of an error value it gives us a result that we can directly understand, in the context of the problem with the units in thousands of dollars.</p>

In [3]:
# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, activation='relu', kernel_initializer='normal'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
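
To make the architecture concrete, the following sketch reproduces the forward pass of a 13 -> 13 -> 1 network in plain NumPy: a rectifier on the hidden layer, no activation on the output, and mean squared error as the loss. It is not how Keras implements the model. The weights, inputs and targets are random placeholders, and the names `forward` and `mean_squared_error` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random weights for a 13 -> 13 -> 1 network (placeholders, not trained).
W1 = rng.normal(scale=0.05, size=(13, 13))
b1 = np.zeros(13)
W2 = rng.normal(scale=0.05, size=(13, 1))
b2 = np.zeros(1)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # rectifier (ReLU) on the hidden layer
    return (hidden @ W2 + b2).ravel()       # linear output, no activation

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_batch = rng.normal(size=(5, 13))   # five fake, already-scaled samples
y_batch = rng.normal(size=5)         # five fake targets
predictions = forward(x_batch)
print(predictions.shape, mean_squared_error(y_batch, predictions))
```

Because the output layer is linear, predictions are unbounded, which is what a regression target in thousands of dollars requires.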

<p style="text-align: justify">The Keras wrapper object for use in scikit-learn as a regression estimator is called "KerasRegressor". We create an instance and pass it both the name of the function that creates the neural network model and some parameters to pass along to the <em>fit()</em> function of the model later, such as the number of epochs and the batch size. Both of these are set to sensible defaults. We also initialize the random number generator with a constant random seed, a process we will repeat for each model evaluated in this tutorial. This is to ensure we compare models consistently and that the results are reproducible.</p>

In [4]:
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# wrap the baseline model as a scikit-learn estimator (raw, unstandardized data)
estimator = KerasRegressor(build_fn=baseline_model, epochs=50, batch_size=5, verbose=0)

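<p style="text-align: justify">The final step is to evaluate this baseline model. We will use 10-fold cross validation to
evaluate the model. Note that the wrapper's score is the negated MSE, following scikit-learn's convention that higher scores are better, which is why the values printed below are negative.</p>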

In [5]:
kfold = KFold(n_splits=10)  # folds are not shuffled, so random_state is not needed (recent scikit-learn rejects it here)
results = cross_val_score(estimator, X, Y, cv=kfold)
print("Baseline: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Baseline: -37.47 (21.61) MSE
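
As a rough picture of what 10-fold cross validation does with the 506 records of this dataset, the sketch below builds contiguous, unshuffled folds by hand and shows how negated per-fold errors are summarized. The per-fold MSE values are hypothetical, and no model is trained here.

```python
import numpy as np

n_samples, n_splits = 506, 10
indices = np.arange(n_samples)

# Contiguous, unshuffled folds; the earlier folds absorb the remainder.
folds = np.array_split(indices, n_splits)

train_sizes = []
for test_idx in folds:
    # Everything outside the held-out fold is used for training.
    train_idx = np.setdiff1d(indices, test_idx)
    train_sizes.append(len(train_idx))

fold_sizes = [len(f) for f in folds]
print(fold_sizes)
print(train_sizes)

# Hypothetical per-fold MSE values, negated so that higher means better.
fold_mse = np.array([12.0, 30.0, 45.0])
scores = -fold_mse
print("%.2f (%.2f)" % (scores.mean(), scores.std()))
```

Each record is held out exactly once, so the mean score reflects performance on data the model did not see during the corresponding fit.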


## Lift Performance By Standardizing The Dataset

<p style="text-align: justify">An important concern with the Boston house price dataset is that the input attributes all vary
in their scales, because they measure different quantities. It is almost always good practice to prepare your data before modeling it using a neural network model. Continuing on from the above baseline model, we can re-evaluate the same model using a standardized version of the input dataset.</p>

<p style="text-align: justify">We can use scikit-learn's Pipeline framework to perform the standardization during the model evaluation process, within each fold of the cross validation. This ensures that there is no data leakage from the test fold into the training data in each round of cross validation. The code below creates a scikit-learn Pipeline that first standardizes the dataset, then creates and evaluates the baseline neural network model.</p>

In [6]:
# evaluate baseline model with standardized dataset
np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Standardized: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Standardized: -29.61 (27.29) MSE


<p style="text-align: justify">Running the example shows improved performance over the baseline model without standardized data, reducing the mean MSE from 37.47 to 29.61, a drop of roughly 8 squared thousands of dollars.</p>
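
The leakage-free standardization that the Pipeline performs in each fold can be sketched by hand: the mean and standard deviation come from the training rows only and are then reused on the held-out rows. The data here is randomly generated, with three columns standing in for the 13 attributes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Fake features on very different scales.
X_fake = rng.normal(loc=[0.0, 300.0, 10.0], scale=[1.0, 150.0, 5.0], size=(100, 3))

train, test = X_fake[:90], X_fake[90:]

# Statistics come from the training fold only ...
mean = train.mean(axis=0)
std = train.std(axis=0)

# ... and are then reused, unchanged, on the held-out fold.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The held-out rows will generally not have exactly zero mean and unit variance after scaling, which is expected: they are transformed with statistics they did not contribute to.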

<p style="text-align: justify">A further extension of this section would be to apply a similar rescaling to the output variable, such as <b>normalizing</b> it to the range of 0 to 1, and to use a Sigmoid or similar <b>activation function</b> on the <b>output layer</b> to narrow output predictions to the same range.</p>
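
A minimal sketch of that idea, using made-up target values: the targets are min-max scaled to the range 0 to 1, a sigmoid bounds the network output to the same range, and predictions are mapped back to thousands of dollars. In a real evaluation the minimum and maximum would be taken from the training fold only.

```python
import numpy as np

# Made-up target values in thousands of dollars.
y = np.array([5.0, 21.6, 24.0, 50.0])

y_min, y_max = y.min(), y.max()
y_scaled = (y - y_min) / (y_max - y_min)   # now in [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pretend these are raw (pre-activation) outputs of the last layer.
raw_outputs = np.array([-2.0, 0.0, 3.0])
bounded = sigmoid(raw_outputs)             # squashed into (0, 1)

# Map bounded predictions back to thousands of dollars.
predictions = bounded * (y_max - y_min) + y_min
print(y_scaled, predictions)
```

One consequence of this design is that the model can never predict outside the range of target values used for scaling.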

## Tune The Neural Network Topology

<p style="text-align: justify">There are many concerns that can be optimized for a neural network model. Perhaps the point of biggest leverage is the structure of the network itself, including the number of layers and the number of neurons in each layer. In this section we will evaluate two additional network topologies, in an effort to further improve the performance of the model. We will look at both a deeper and a wider network topology.</p>

### Evaluate a Deeper Network Topology

<p style="text-align: justify">One way to improve the performance of a neural network is to add more layers. This might allow the model to extract and recombine higher order features embedded in the data. In this section, we will evaluate the effect of adding one more hidden layer to the model. This is as easy as defining a new function that will create this deeper model, copied from our baseline model above. We can then insert a new line after the first hidden layer, in this case with about half the number of neurons. Our network topology now looks like:</p>

    13 inputs -> [13 -> 6] -> 1 output

<p style="text-align: justify">We can evaluate this network topology in the same way as above, whilst also using the standardization of the dataset that above was shown to improve performance.</p>

In [7]:
# define the model
def larger_model():
    # create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, activation='relu', kernel_initializer='normal'))
    model.add(Dense(6, activation='relu', kernel_initializer='normal'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=larger_model, epochs=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Larger: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Larger: -23.33 (27.01) MSE


<p style="text-align: justify">By adding one hidden layer to the network, the mean MSE improved from 29.61 to 23.33, while the standard deviation across folds barely changed (27.29 versus 27.01) and remains larger than the improvement itself. Deeper networks, which are also more computationally expensive, are therefore not guaranteed to be better in terms of performance and accuracy.</p>

### Evaluate a Wider Network Topology

<p style="text-align: justify">Another approach to increasing the representational capacity of the model is to create a wider network. In this section we evaluate the effect of keeping a shallow network architecture and increasing the number of neurons in the one hidden layer by about half. Again, all we need to do is define a new function that creates our neural network model. Here, we have increased the number of neurons in the hidden layer compared to the baseline model from 13 to 20. The topology for
our wider network can be summarized as follows:</p>

    13 inputs -> [20] -> 1 output

In [8]:
# define wider model
def wider_model():
    # create model
    model = Sequential()
    model.add(Dense(20, input_dim=13, activation='relu', kernel_initializer='normal'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=wider_model, epochs=50, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Wider: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Wider: -24.64 (25.65) MSE


<p style="text-align: justify">With the current set of parameters, the deeper network performs better than the wider network. It would have been hard to guess that a deeper network would outperform a wider network on this problem. The results demonstrate the importance of empirical testing when it comes to developing neural network models.</p>
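
One way to put the three topologies on a common footing is to count their trainable parameters. The helper below, whose name `dense_parameter_count` is chosen here for illustration, assumes plain fully connected layers with one bias per unit, as in the models above.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total

topologies = {
    "baseline": [13, 13, 1],
    "deeper": [13, 13, 6, 1],
    "wider": [13, 20, 1],
}

for name, sizes in topologies.items():
    print(name, dense_parameter_count(sizes))
```

The wider network has slightly more parameters than the deeper one, so the difference in scores cannot be attributed to model size alone.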

## Extensions

<p style="text-align: justify">This section lists some ideas for extending the tutorial that you may wish to explore to increase the performance of the neural network:</p>

<ul>

<li><b>Data Transforms</b>. In the current notebook, only standardization has been considered. Normalization of the features (between 0 and 1) could also be tried to see whether it improves the results of the different neural networks.</li>
<li><b>Hyperparameter grid search</b>. Explore how, for instance, the number of epochs, the number of <i>k</i>-folds, or the batch size affects the results.</li>
<li><b>Deeper and/or wider networks</b>. See whether there is any real improvement in accuracy that justifies the extra computational cost.</li>
 