# Predicting House Prices

## Regression

In [1]:
import keras
from keras_tqdm import TQDMNotebookCallback

import numpy as np
from notebook_importer import *
from helper import *

from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

from keras import models
from keras import layers

Using TensorFlow backend.


importing notebook from helper.ipynb


## Features

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black residents by town.
13. Percentage of the population of lower status.

In [2]:
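# Standardize each feature: subtract the training mean, divide by the training std
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

# Scale the test data with the training statistics, never with its own
test_data -= mean
test_data /= std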
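
The standardization above can be checked on a small array. The following is a minimal sketch, assuming only NumPy; `toy_train`, `toy_test`, `toy_mean` and `toy_std` are made-up names and values for illustration, not the housing data. It shows that the training columns end up with mean 0 and standard deviation 1, while the test columns, scaled with the training statistics, generally do not.

```python
import numpy as np

# Hypothetical toy data: 4 training samples and 2 test samples, 2 features each
toy_train = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0],
                      [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0],
                     [5.0, 500.0]])

# Statistics are computed from the training data only
toy_mean = toy_train.mean(axis=0)
toy_std = toy_train.std(axis=0)

scaled_train = (toy_train - toy_mean) / toy_std
scaled_test = (toy_test - toy_mean) / toy_std

print(scaled_train.mean(axis=0))  # approximately 0 for each feature
print(scaled_train.std(axis=0))   # approximately 1 for each feature
print(scaled_test)                # scaled with the training statistics
```

After scaling, both features live on the same scale even though their raw ranges differed by a factor of 100.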

In [3]:
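def build_model():
    # The model is built in a function because each fold needs a fresh instance
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    # Single linear output unit (no activation) to predict a scalar value
    model.add(layers.Dense(1))
    # MSE is the loss being minimized; MAE is tracked as a metric
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

    return model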

In [4]:
k = 4
num_val_samples = len(train_data) // k
all_scores = []
for i in range(k):
    print('processing fold #', i)

    # prepare the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)

    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # build the Keras model (already compiled)
    model = build_model()

    # train the model (Keras logging silenced with verbose=0; progress comes from the TQDM callback)
    model.fit(partial_train_data, partial_train_targets,
              epochs=100, batch_size=1, verbose=0,
              callbacks=[TQDMNotebookCallback()])

    # evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3

In [5]:
print(all_scores)
print(np.mean(all_scores))

[1.8265851440996226, 2.3391485544714596, 2.6466831653425009, 2.3030123568997523]
2.2788573052
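
The fold slicing used in the loop above can be isolated into a small helper. This is a sketch under stated assumptions: `k_fold_slices` is a hypothetical name, not part of Keras or of this notebook, and it follows the same convention as the loop, so any samples left over when the dataset size is not divisible by `k` are never used for validation.

```python
import numpy as np


def k_fold_slices(num_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds.

    Hypothetical helper that mirrors the slicing in the loop above.
    """
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val_idx = indices[start:stop]
        # Everything outside the validation slice is used for training
        train_idx = np.concatenate([indices[:start], indices[stop:]])
        yield train_idx, val_idx


# Toy run: 10 samples and 4 folds, so each validation fold has 10 // 4 = 2 samples
folds = list(k_fold_slices(10, 4))
for train_idx, val_idx in folds:
    print(val_idx, train_idx)
```

In the toy run, samples 8 and 9 appear only in training sets. The Keras split of this dataset has 404 training samples, which divides evenly into four folds of 101, so nothing is left over here.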


In [6]:
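# Build a fresh model so the final training run does not start from the last fold's weights
model = build_model()
model.fit(train_data, train_targets, epochs=300, batch_size=1, verbose=0,
          callbacks=[TQDMNotebookCallback()])
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)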

test_mae_score



2.6447017146091834

## Conclusions

* Regression uses different loss functions than classification; Mean Squared Error (MSE) is a common choice.

* Evaluation metrics differ as well: the concept of "accuracy" does not apply to regression. A common regression metric is Mean Absolute Error (MAE).

* When input features take values in different ranges, each feature should be scaled independently as a preprocessing step.

* When little data is available, K-fold validation is a reliable way to evaluate a model.

* When little training data is available, it is preferable to use a small network with few hidden layers (typically only one) to limit overfitting.
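
To make the loss and metric named above concrete, here is a minimal NumPy sketch of MSE and MAE. The arrays `y_true` and `y_pred` hold made-up values for illustration, not outputs of the model.

```python
import numpy as np

# Illustrative targets and predictions (made up for this example)
y_true = np.array([20.0, 25.0, 30.0, 15.0])
y_pred = np.array([22.0, 24.0, 27.0, 15.0])

errors = y_pred - y_true        # [2, -1, -3, 0]
mse = np.mean(errors ** 2)      # (4 + 1 + 9 + 0) / 4 = 3.5
mae = np.mean(np.abs(errors))   # (2 + 1 + 3 + 0) / 4 = 1.5

print(mse, mae)
```

MAE is expressed in the same units as the targets, which makes it easier to read than MSE. The targets in this dataset are median home values in thousands of dollars, so the test MAE of about 2.64 reported above corresponds to an average error of roughly $2,640.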
