# Objective: Model to predict Boston housing prices #

Dataset - https://archive.ics.uci.edu/ml/datasets/Housing

This is a regression problem. The features are as follows:

1. CRIM: per capita crime rate by town 
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft. 
3. INDUS: proportion of non-retail business acres per town 
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
5. NOX: nitric oxides concentration (parts per 10 million) 
6. RM: average number of rooms per dwelling 
7. AGE: proportion of owner-occupied units built prior to 1940 
8. DIS: weighted distances to five Boston employment centres 
9. RAD: index of accessibility to radial highways 
10. TAX: full-value property-tax rate per $10,000 
11. PTRATIO: pupil-teacher ratio by town 
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
13. LSTAT: % lower status of the population 

The target label is:

14. MEDV: Median value of owner-occupied homes in $1000's
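
As a small illustration of how the list above maps onto the data file, the sketch below names the 14 columns in the order given and separates features from the target. It assumes the file's columns follow exactly this order (the CSV is read without a header row later on); the names `COLUMNS`, `FEATURES` and `TARGET` are introduced here for illustration only.

```python
# Column order assumed to match the numbered list above (features 1-13, then the target).
COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV",
]

# The first 13 columns are inputs; the last one is the regression target.
FEATURES = COLUMNS[:13]
TARGET = COLUMNS[13]

print(len(FEATURES), "features; target =", TARGET)
```

The same positions are used when the array is sliced as `dataset[:,0:13]` and `dataset[:,13]`.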

## Part 1: NN with extra hidden layers ##

We take the following approach:

* **Standardize inside the cross-validation loop: the scaler is fit on the training portion of each fold only, so no information from the held-out fold leaks into training.**
* The NN structure is 13 inputs -> [13 -> 6] -> 1 output.
* The extra hidden layer gives the model a chance to extract and recombine higher-order features embedded in the data.
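
To make the first point concrete, here is a minimal sketch of per-fold standardization in plain Python, so that it runs without any dependencies. It is not the notebook's code: `fit_scaler`, `transform`, `contiguous_folds` and the toy data are made up for illustration, and the folds are assumed to be contiguous and unshuffled. The key step is that the mean and standard deviation come from the training rows only and are then reused on the held-out rows.

```python
import statistics

def fit_scaler(rows):
    """Return per-column means and standard deviations from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def contiguous_folds(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for unshuffled, contiguous folds."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

# Toy data, invented for this example: 6 samples, 2 features.
X_toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
         [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]

fold_stats = []
for train_idx, test_idx in contiguous_folds(len(X_toy), n_splits=3):
    train_rows = [X_toy[i] for i in train_idx]
    test_rows = [X_toy[i] for i in test_idx]
    means, stds = fit_scaler(train_rows)               # fit on training rows only
    train_scaled = transform(train_rows, means, stds)
    test_scaled = transform(test_rows, means, stds)    # reuse training statistics
    fold_stats.append((means, train_scaled, test_scaled))
```

Wrapping `StandardScaler` and the model in a `Pipeline` and handing that to `cross_val_score` achieves the same effect: the scaler is refit on each training split.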

In [1]:
# Regression Example With Boston Dataset: Standardized and Larger
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load dataset
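# sep=r"\s+" splits on runs of whitespace (same behaviour as delim_whitespace=True,
# which newer pandas releases deprecate)
dataframe = pandas.read_csv("housing.csv", sep=r"\s+", header=None)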
dataset = dataframe.values

# Split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# Define the model
def larger_model():
    # create model
    model = Sequential()
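    # First hidden layer: 13 neurons, one per input feature.
    # `kernel_initializer` is the Keras 2 name for the Keras 1 `init` argument.
    model.add(Dense(13, 
                    input_dim=13, 
                    kernel_initializer='normal', 
                    activation='relu'))
    # Extra hidden layer with 6 neurons
    model.add(Dense(6, 
                    kernel_initializer='normal', 
                    activation='relu'))
    # Output layer: a single linear unit for regression
    model.add(Dense(1, 
                    kernel_initializer='normal'))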
    
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# Fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# Evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
# `epochs` is the Keras 2 name for the Keras 1 `nb_epoch` argument
estimators.append(('mlp', 
                   KerasRegressor(build_fn=larger_model, 
                                  epochs=50, 
                                  batch_size=5, 
                                  verbose=0)))

# Set up pipeline
pipeline = Pipeline(estimators)

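# 10-fold split without shuffling; random_state only applies when shuffle=True,
# and recent scikit-learn releases reject it otherwise, so it is omitted here
kfold = KFold(n_splits=10)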

# Cross validation scores
results = cross_val_score(pipeline, 
                          X, 
                          Y, 
                          cv=kfold)
print("Larger: %.2f (%.2f) MSE" % (results.mean(), results.std()))


Using TensorFlow backend.


Larger: 103.28 (236.30) MSE


## Part 2: NN with a wider network topology ##

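* We keep a shallow network architecture (1 hidden layer) but widen that layer to 20 neurons, up from the 13 in the first hidden layer of Part 1.
* This run trains for 100 epochs, versus 50 in Part 1, so the two scores are not a like-for-like comparison of topology alone.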
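
To put the two topologies on a common footing, the sketch below counts trainable parameters (weights plus biases) for a stack of fully connected layers. The helper `dense_param_count` is introduced here for illustration and is not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deeper = dense_param_count([13, 13, 6, 1])   # Part 1: 13 -> [13 -> 6] -> 1
wider = dense_param_count([13, 20, 1])       # Part 2: 13 -> [20] -> 1

print(deeper, wider)  # prints: 273 301
```

The two networks are of similar size, so any difference in score is unlikely to come from parameter count alone.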

In [2]:
'''
NN structure: 13 inputs -> [20] -> 1 output
'''
def wider_model():
    # create model
    model = Sequential()
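    # Single hidden layer with 20 neurons.
    # `kernel_initializer` is the Keras 2 name for the Keras 1 `init` argument.
    model.add(Dense(20, 
                    input_dim=13, 
                    kernel_initializer='normal', 
                    activation='relu'))
    # Output layer: a single linear unit for regression
    model.add(Dense(1, 
                    kernel_initializer='normal'))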
    
    # Compile model
    model.compile(loss='mean_squared_error', 
                  optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
# `epochs` is the Keras 2 name for the Keras 1 `nb_epoch` argument
estimators.append(('mlp', KerasRegressor(build_fn=wider_model, 
                                         epochs=100, 
                                         batch_size=5, 
                                         verbose=0)))

# pipeline the estimators
pipeline = Pipeline(estimators)

# k-fold split
# random_state only applies when shuffle=True, so it is omitted for this unshuffled split
kfold = KFold(n_splits=10)
results = cross_val_score(pipeline, 
                          X, 
                          Y, 
                          cv=kfold)
print("Wider: %.2f (%.2f) MSE" % (results.mean(), results.std()))

Wider: 22.77 (24.89) MSE
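
Because MEDV is measured in thousands of dollars, the MSE values above are in squared thousands of dollars. A rough way to read them is to take the square root, as sketched below. This uses only the two mean MSE figures printed in this notebook; note that the square root of a mean MSE is not the same as the mean of per-fold RMSEs, so treat the result as an approximate scale rather than an exact error.

```python
import math

# Mean cross-validated MSE values as printed above (units: ($1000)^2).
reported_mse = {"Larger": 103.28, "Wider": 22.77}

# Square root gives an error in $1000s; multiply by 1000 for dollars.
rmse_dollars = {name: math.sqrt(mse) * 1000 for name, mse in reported_mse.items()}

for name, rmse in rmse_dollars.items():
    print("%s: RMSE is roughly $%.0f" % (name, rmse))
```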
