# House price prediction

In this notebook, we take a look at the Boston Housing dataset, which is built into Keras. Each input consists of several pieces of information about a neighborhood, and the goal is to predict the property price. Since the target is a continuous value, this is a regression task.

In [1]:
from tensorflow.keras.datasets import boston_housing

First of all, we'll load our data. The dataset is quite small, containing only 404 training samples and 102 test samples. Each input is a 13-dimensional vector, and the output is a single number in thousands of dollars.

In [2]:
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

print('Train shape: x-{} y-{} Test shape: x-{}'.format(x_train.shape, y_train.shape, x_test.shape))

Train shape: x-(404, 13) y-(404,) Test shape: x-(102, 13)


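Before feeding our data into the network, it is good practice to normalize it. Here, we use feature-wise normalization: for each input feature, we subtract the feature's mean and divide by its standard deviation. Note that the test data is normalized with the mean and standard deviation computed on the training data, so no information from the test set leaks into preprocessing.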

In [3]:
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std

x_test -= mean
x_test /= std
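As a quick sanity check of the normalization described above, the following sketch repeats the same steps on randomly generated stand-in data. The names `demo_train` and `demo_test` and the random values are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np

# Hypothetical stand-in data with the same shapes as the real dataset.
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=3.0, size=(404, 13))
demo_test = rng.normal(loc=5.0, scale=3.0, size=(102, 13))

# The statistics come from the training split only.
demo_mean = demo_train.mean(axis=0)
demo_std = demo_train.std(axis=0)

demo_train_norm = (demo_train - demo_mean) / demo_std
demo_test_norm = (demo_test - demo_mean) / demo_std

# Each training feature is now centered with unit standard deviation.
print(np.allclose(demo_train_norm.mean(axis=0), 0.0))
print(np.allclose(demo_train_norm.std(axis=0), 1.0))

# The test features are only approximately standardized,
# because they were scaled with the training statistics.
print(np.abs(demo_test_norm.mean(axis=0)).max())
```

The test split will generally not have a mean of exactly zero after this step, and that is expected.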

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

We can now proceed with building our model. We'll use three fully connected (dense) layers. The first two have 64 units each and a ReLU activation function. The last layer has only one unit (since we are predicting a single number) and no activation function, as we don't want to constrain the range of our final prediction. Since this is a regression task, we use mean squared error (MSE) as the loss function and mean absolute error (MAE) as our metric.

In [5]:
def build_model():
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(x_train.shape[1],)))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
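To get a feel for the size of this network, here is a small sketch that counts its trainable parameters by hand. The helper `dense_params` is a hypothetical name introduced for this example. It relies only on the fact that a dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# 13 input features, two hidden layers of 64 units, one output unit.
layer_sizes = [13, 64, 64, 1]
per_layer = [dense_params(a, b)
             for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [896, 4160, 65]
print(sum(per_layer))  # 5121
```

Calling `build_model().summary()` on the real model should report the same totals. With only 404 training samples, a network of roughly 5,000 parameters can overfit easily, which is one reason to validate carefully.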

In [6]:
import numpy as np

Since we have a small amount of data, we'll use K-fold cross-validation to reduce the dependency of our validation score on how the data happens to be split. We'll run 500 epochs so that we can tell when our network begins to overfit.

In [7]:
k = 4
num_val_samples = len(x_train) // k
num_epochs = 500
all_mae_histories = []

for i in range(k):
    print('Processing fold #{}'.format(i))
    
    val_data = x_train[i*num_val_samples : (i+1)*num_val_samples]
    val_targets = y_train[i*num_val_samples : (i+1)*num_val_samples]
    
    partial_x_train = np.concatenate([x_train[: i*num_val_samples], x_train[(i+1)*num_val_samples:]],
                                     axis=0)
    partial_y_train = np.concatenate([y_train[: i*num_val_samples], y_train[(i+1)*num_val_samples:]],
                                     axis=0)
    model = build_model()
    history = model.fit(partial_x_train, partial_y_train, 
              validation_data=(val_data, val_targets),
              epochs=num_epochs, batch_size=1, verbose=0)
    
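    # The metric key is 'val_mae' in TensorFlow 2.x and
    # 'val_mean_absolute_error' in older releases.
    mae_key = 'val_mae' if 'val_mae' in history.history else 'val_mean_absolute_error'
    mae_history = history.history[mae_key]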
    all_mae_histories.append(mae_history)

average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

Processing fold #0
Processing fold #1
Processing fold #2
Processing fold #3
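The slicing used in the loop above can be isolated into a small helper to make the fold boundaries easier to see. This is only a sketch: `kfold_indices` is a hypothetical name, and it works on index arrays rather than on the data itself.

```python
import numpy as np

def kfold_indices(num_samples, k):
    """Yield (train_idx, val_idx) index arrays for k contiguous folds."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        # Fold i is the validation slice; everything else is training data.
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = np.concatenate([indices[:i * fold_size],
                                    indices[(i + 1) * fold_size:]])
        yield train_idx, val_idx

for fold, (train_idx, val_idx) in enumerate(kfold_indices(404, 4)):
    print(fold, len(train_idx), len(val_idx))
```

With 404 samples and 4 folds, every fold validates on 101 samples and trains on the remaining 303. If the sample count were not divisible by `k`, the leftover samples at the end would always land in the training part, which is also how the loop above behaves.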


As we can see, our network keeps improving from the beginning until about epoch 80. After that, it begins to overfit the training data, which reduces its ability to make correct predictions on new, unseen validation data.

In [8]:
import matplotlib.pyplot as plt

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

<Figure size 640x480 with 1 Axes>
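Instead of reading the turning point off the plot, the epoch with the lowest smoothed validation MAE can also be located programmatically. The sketch below assumes a hypothetical helper `best_epoch` and a made-up curve `demo_curve`. On the real data, you would pass `smooth_mae_history` instead.

```python
import numpy as np

def best_epoch(smoothed_history, skipped=10):
    """Return the 1-based epoch with the lowest smoothed validation MAE."""
    # argmin is 0-based and relative to the truncated curve,
    # so the number of skipped epochs is added back.
    return int(np.argmin(smoothed_history)) + skipped + 1

# Hypothetical curve: it falls until its 70th point, then slowly rises again.
demo_curve = [abs(i - 69) * 0.01 + 2.5 for i in range(490)]
print(best_epoch(demo_curve))  # 80
```

Note that the plot above numbers its x-axis from 1 after dropping the first 10 points, so a position read from the plot is shifted by 10 relative to the value returned here.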

Finally, based on what we have learned, we'll train our final model. We use 75 epochs with the batch size increased to 16. As the output below shows, the test MAE is about 2.88, which means that our predictions are off by about \$2,880 on average. The exact value varies slightly from run to run.

In [9]:
model = build_model()
model.fit(x_train, y_train, epochs=75, batch_size=16, verbose=0)

test_mse_score, test_mae_score = model.evaluate(x_test, y_test)
print('MSE: {}, MAE: {}'.format(test_mse_score, test_mae_score))

MSE: 20.577824012905943, MAE: 2.8793123937120626
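To make the two reported numbers concrete, here is a minimal sketch that computes MSE and MAE by hand with NumPy. The targets and predictions are made-up values chosen for easy arithmetic, not outputs of the model above.

```python
import numpy as np

# Hypothetical targets and predictions, in thousands of dollars.
y_true = np.array([20.0, 15.0, 30.0, 25.0])
y_pred = np.array([22.0, 14.0, 27.0, 25.0])

errors = y_pred - y_true        # [2, -1, -3, 0]
mse = np.mean(errors ** 2)      # (4 + 1 + 9 + 0) / 4 = 3.5
mae = np.mean(np.abs(errors))   # (2 + 1 + 3 + 0) / 4 = 1.5

print('MSE: {}, MAE: {}'.format(mse, mae))
# The targets are in thousands of dollars, so multiply the MAE by 1000.
print('Average error in dollars: {:.0f}'.format(mae * 1000))
```

MSE penalizes large errors more heavily because of the squaring, which is why it is used as the training loss, while MAE stays in the same unit as the target and is easier to interpret.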
