# A Neural Network for House Price Estimation

Neural networks have excelled at tasks such as image classification and audio signal processing. However, they can be applied to **tabular datasets** just like other machine learning models. In this example we
- train a simple neural network for regression on a tabular dataset with rich attributes
- learn how to plug `keras` and `scikit-learn` together with an adapter, enabling us to integrate `keras` models into a `scikit-learn` workflow

## Preamble

In [1]:
import matplotlib.pyplot as plt

In [2]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style()

In [3]:
import pandas
import seaborn
from tensorflow import keras
import sklearn

## The Dataset

In [4]:
data = data_science_learning_paths.datasets.read_house_prices(
    encode_ordinal=True,
    encode_categorial=True,
)

In [5]:
data.head()

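| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtCond | BsmtFinSF1 | BsmtFinSF2 | BsmtFullBath | BsmtHalfBath | BsmtQual | BsmtUnfSF | ... | CentralAir_N | CentralAir_Y | Functional_Maj1 | Functional_Maj2 | Functional_Min1 | Functional_Min2 | Functional_Mod | Functional_Sev | Functional_Typ | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854 | 3 | 3 | 706 | 0 | 1 | 0 | 2 | 150 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 208500 |
| 1 | 1262 | 0 | 3 | 3 | 978 | 0 | 0 | 1 | 2 | 284 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 181500 |
| 2 | 920 | 866 | 3 | 3 | 486 | 0 | 1 | 0 | 2 | 434 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 223500 |
| 3 | 961 | 756 | 3 | 2 | 216 | 0 | 1 | 0 | 3 | 540 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 140000 |
| 4 | 1145 | 1053 | 4 | 3 | 655 | 0 | 1 | 0 | 2 | 490 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 250000 |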


In [6]:
target_col = "SalePrice"

In [7]:
features, target = data[data.columns.difference([target_col])], data[target_col]

In [8]:
n_features = features.shape[1]
n_features

74

Neural networks are sensitive to the scale of their input variables and usually train better on scaled inputs. Here we standardize all inputs to zero mean and unit variance before passing them to the model.

In [9]:
from sklearn.preprocessing import StandardScaler

In [10]:
features = pandas.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,  # keep the rows aligned with `target`
)
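
The cell above fits the scaler on the full dataset, before the train/test split. A common alternative is to estimate the scaling statistics on the training rows only and reuse them for the test rows, so that no information from the test set enters preprocessing. The following is a minimal NumPy sketch of that idea on made-up numbers; `standardize_fit` and `standardize_apply` are hypothetical helper names, not part of `scikit-learn`.

```python
import numpy as np


def standardize_fit(x):
    """Return per-column mean and standard deviation (population std, ddof=0)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Constant columns have std 0; use 1 instead to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize_apply(x, mean, std):
    """Shift to zero mean and scale to unit variance using given statistics."""
    return (x - mean) / std


# Made-up example: two features on very different scales.
x_train = np.array([[850.0, 3.0], [1250.0, 2.0], [900.0, 4.0], [1000.0, 3.0]])
x_test = np.array([[1100.0, 3.0]])

mean, std = standardize_fit(x_train)
x_train_scaled = standardize_apply(x_train, mean, std)
x_test_scaled = standardize_apply(x_test, mean, std)  # reuse training statistics

print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_train_scaled.std(axis=0))   # approximately 1 for each column
```

With `StandardScaler`, the same separation is obtained by calling `fit` on the training data and `transform` on the test data.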

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

In [13]:
X_train.shape

(1168, 74)
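
The training set size shown above can be checked by hand. For a fractional `test_size`, `scikit-learn` rounds the number of test rows up and assigns the remaining rows to the training set. The total of 1460 rows used below is an assumption, since the notebook does not print it, but it is consistent with the 1168 training rows.

```python
import math

n_samples = 1460   # assumed total number of rows (not shown in the notebook)
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test rows, rounded up
n_train = n_samples - n_test                # remaining rows go to training

print(n_train, n_test)  # 1168 292
```

Because the split is random, the rows that end up in each set change between runs unless a `random_state` is passed to `train_test_split`.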

## Target Function Definition

`keras` does not provide RMSE as a built-in loss. However, we can easily supply a custom loss function, as long as we implement it with math operations from `keras.backend`.

In [14]:
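def root_mean_squared_error(y_true, y_pred):
    # The targets may arrive as integers (SalePrice is int64), so cast them
    # to the dtype of the predictions before subtracting.
    y_true = keras.backend.cast(y_true, y_pred.dtype)
    return keras.backend.sqrt(
        keras.backend.mean(
            keras.backend.square(y_pred - y_true)
        )
    )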
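
To make the loss concrete, here is a small NumPy sketch of the same quantity: the square root of the mean of the squared differences between predictions and targets. The name `rmse_numpy` and the numbers are made up for illustration; the inputs are converted to floating point first, since the targets in this dataset are integers.

```python
import numpy as np


def rmse_numpy(y_true, y_pred):
    """Root mean squared error, computed in floating point."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Integer targets and float predictions, as in the house price data.
y_true = np.array([10, 20])        # errors will be +1 and +7
y_pred = np.array([11.0, 27.0])

print(rmse_numpy(y_true, y_pred))  # sqrt((1 + 49) / 2) = 5.0
```

Note that Keras evaluates the loss batch by batch and reports an average over batches, so the per-epoch value can differ slightly from the RMSE computed over the whole dataset at once.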


## The Network

The network we use for this purpose is quite simple: four fully connected hidden layers with 64, 32, 16 and 8 ReLU units, followed by an output layer consisting of a single linear neuron.

Since we are going to use it with the `keras`/`sklearn` API, we need to write a function that builds and compiles the network.

In [15]:
def build_regressor_network():
    net = keras.models.Sequential(
        [
            # Only the first layer needs to know the input dimension.
            keras.layers.Dense(
                units=64,
                input_dim=n_features,
                activation="relu",
            ),
            keras.layers.Dense(units=32, activation="relu"),
            keras.layers.Dense(units=16, activation="relu"),
            keras.layers.Dense(units=8, activation="relu"),
            # A single linear output neuron yields the price estimate.
            keras.layers.Dense(units=1, activation="linear"),
        ]
    )
    net.compile(
        optimizer="adam",
        loss=root_mean_squared_error,
    )
    return net

In [16]:
net = build_regressor_network()
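
As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has one weight per input/output pair plus one bias per unit. The sketch below assumes the 74 input features and the layer widths from the build function above.

```python
# Layer widths: 74 input features followed by the Dense layer sizes.
layer_sizes = [74, 64, 32, 16, 8, 1]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out  # weight matrix plus one bias per unit
    print(f"{n_in:>3} -> {n_out:<3}: {n_params} parameters")
    total += n_params

print("total:", total)  # total: 7553
```

Calling `net.summary()` should report the same total.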

This function is now passed to a wrapper class that implements the `sklearn` estimator interface. We can then pass the data in the usual `sklearn` shape to train the regressor...

In [17]:
model = keras.wrappers.scikit_learn.KerasRegressor(
    build_fn=build_regressor_network,
    epochs=25, 
    batch_size=10,
)

history = model.fit(
    x=X_train,
    y=y_train,
    validation_split=0.2
)

Note that `SalePrice` is stored as `int64`, whereas the network outputs `float32`. If the loss function subtracts the two without casting `y_true` to the dtype of `y_pred`, training aborts in the first epoch with the following error (traceback shortened):

    TypeError: Input 'y' of 'Sub' Op has type int64 that does not match type float32 of argument 'x'.


In [None]:
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])

... and ask for predictions via the `predict` method. TensorFlow models usually return predictions as a nested array of shape `(n_samples, 1)`, so we flatten the result to get a 1D vector.

In [None]:
y_pred = model.predict(X_test).flatten()
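
The adapter idea used above can be illustrated without Keras. The following is a conceptual sketch only, not the actual implementation of `KerasRegressor`: `ToyModel` and `ToyRegressorAdapter` are hypothetical classes. The adapter takes a build function, creates a fresh model on every `fit`, and returns flat predictions from `predict`.

```python
import numpy as np


class ToyModel:
    """Stand-in for a compiled network: predicts the mean of the targets."""

    def fit(self, x, y, epochs=1):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        # Mimic the nested (n_samples, 1) output shape of a network.
        return np.full((len(x), 1), self.mean_)


class ToyRegressorAdapter:
    """Hypothetical adapter exposing a scikit-learn style fit/predict."""

    def __init__(self, build_fn, **fit_params):
        self.build_fn = build_fn
        self.fit_params = fit_params

    def fit(self, X, y):
        # Build a fresh model on every fit so that repeated fits are independent.
        self.model_ = self.build_fn()
        self.model_.fit(np.asarray(X), np.asarray(y), **self.fit_params)
        return self

    def predict(self, X):
        # Flatten the (n_samples, 1) output into a 1D vector.
        return self.model_.predict(np.asarray(X)).flatten()


adapter = ToyRegressorAdapter(build_fn=ToyModel, epochs=3)
adapter.fit([[1.0], [2.0], [3.0]], [100, 200, 300])
y_hat = adapter.predict([[4.0], [5.0]])
print(y_hat.shape)  # (2,)
print(y_hat)        # [200. 200.]
```

A complete `scikit-learn` estimator also needs `get_params` and `set_params` so that utilities such as `cross_val_score` can clone it; the sketch leaves these out for brevity.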

In [None]:
seaborn.distplot(data[target_col])
seaborn.distplot(y_pred)
plt.title("distribution of target variable and predictions")

In [None]:
seaborn.distplot(y_pred - y_test)
plt.title("distribution of error")
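
Alongside the plot, a few summary numbers help when reading the error distribution. The sketch below uses made-up prices rather than the notebook's actual predictions; the variable names are chosen for illustration.

```python
import numpy as np

# Made-up prices for illustration only.
y_test_example = np.array([200000.0, 150000.0, 310000.0, 120000.0])
y_pred_example = np.array([210000.0, 140000.0, 300000.0, 150000.0])

residuals = y_pred_example - y_test_example

bias = residuals.mean()                  # > 0 means over-prediction on average
mae = np.abs(residuals).mean()           # typical absolute error
rmse = np.sqrt((residuals ** 2).mean())  # penalizes large errors more strongly

print(f"bias: {bias:.0f}, MAE: {mae:.0f}, RMSE: {rmse:.0f}")
# bias: 5000, MAE: 15000, RMSE: 17321
```

An RMSE noticeably larger than the MAE, as here, indicates that a few large errors dominate.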

## `sklearn`-style Evaluation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

In [None]:
from data_science_learning_paths.mlp import root_mean_squared_error as numpy_rmse

In [None]:
cv_results_net = cross_val_score(
    model,
    features,
    target,
    scoring=make_scorer(numpy_rmse),
    cv=5
)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
cv_results_rf = cross_val_score(
    RandomForestRegressor(),
    features,
    target,
    scoring=make_scorer(numpy_rmse),
    cv=5,
)

In [None]:
pandas.DataFrame(
    {
        "network": cv_results_net,
        "random forest": cv_results_rf
    }
).plot(kind="box")
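
To show what `cross_val_score` with `cv=5` does for a regressor, here is a minimal NumPy sketch of 5-fold cross-validation: the rows are cut into five contiguous, unshuffled folds, and each fold serves once as the test set. The helpers `k_fold_rmse` and `linear_fit_predict`, the least-squares stand-in model, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def k_fold_rmse(X, y, fit_predict, k=5):
    """Return one RMSE per fold, using contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[test_idx] = False
        y_pred = fit_predict(X[train_mask], y[train_mask], X[test_idx])
        scores.append(np.sqrt(np.mean((y_pred - y[test_idx]) ** 2)))
    return np.array(scores)


def linear_fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept, as a simple stand-in model."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.5, size=100)

scores = k_fold_rmse(X, y, linear_fit_predict, k=5)
print(scores.shape)  # (5,)
print(scores.mean())
```

One detail about the scorer: `make_scorer` treats larger values as better by default. That is harmless when the scores are only reported, as in the box plot, but for model selection an error metric such as RMSE should be wrapped with `greater_is_better=False`.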

## Exercise: Improve Accuracy

## References

- [Keras Tutorial: Deep Learning in Python](https://www.datacamp.com/community/tutorials/deep-learning-python)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_