# House-Price Prediction with TensorFlow Keras Neural Networks

This example demonstrates how to build a neural network with TensorFlow and Keras that predicts house prices from the following features:

1. Year of sale of the house
2. The age of the house at the time of sale
3. Distance from city center
4. Number of stores in the locality
5. The latitude
6. The longitude

It also explains why normalizing the data is important and how to choose an activation function.

Import the necessary packages/modules.

In [193]:
import pandas as pd
import matplotlib.pyplot as plt 
import tensorflow as tf
from tensorflow.keras import backend as K
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

Check that the Apple M1 GPU is available (just a sanity check). The expected output is "Num GPUs Available:  1".

In [194]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


Import the input data from the CSV file.

In [195]:
column_names = ['serial','date','age','distance','stores','latitude','longitude','price']
df = pd.read_csv('data.csv', names = column_names)
df.head()

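|   | serial | date | age | distance | stores | latitude | longitude | price |
|---|--------|------|-----|----------|--------|----------|-----------|-------|
| 0 | 0 | 2009 | 21 | 9 | 6 | 84 | 121 | 14264 |
| 1 | 1 | 2007 | 4 | 2 | 3 | 86 | 121 | 12032 |
| 2 | 2 | 2016 | 18 | 3 | 7 | 90 | 120 | 13560 |
| 3 | 3 | 2002 | 13 | 2 | 2 | 80 | 128 | 12029 |
| 4 | 4 | 2014 | 25 | 5 | 8 | 81 | 122 | 14157 |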


Remove the "serial" column, since it is not a valid feature (it is just an incremental number in the data), and normalize the data by subtracting each column's minimum and dividing by its standard deviation. Because no value is smaller than its column minimum, the normalized data set contains no negative values.

In [196]:
df = df.iloc[:, 1:]
# one can also choose to normalize with the mean value instead of min, 
# however this leads to worse results (higher mean squared error)
# df_norm = (df - df.mean())/df.std()
df_norm = (df - df.min())/df.std()
df_norm.head()

|   | date | age | distance | stores | latitude | longitude | price |
|---|------|-----|----------|--------|----------|-----------|-------|
| 0 | 1.649083 | 1.853562 | 2.812644 | 1.909072 | 1.265026 | 0.315657 | 2.939923 |
| 1 | 1.28262 | 0.353059 | 0.625032 | 0.954536 | 1.897539 | 0.315657 | 0.753349 |
| 2 | 2.931704 | 1.588767 | 0.937548 | 2.22725 | 3.162565 | 0.0 | 2.250251 |
| 3 | 0.366463 | 1.147443 | 0.625032 | 0.636357 | 0.0 | 2.525259 | 0.75041 |
| 4 | 2.565241 | 2.206621 | 1.56258 | 2.545429 | 0.316257 | 0.631315 | 2.835101 |
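
The difference between the two normalization variants can be seen on a handful of values. The following is a minimal standard-library sketch, not the notebook's pandas code: the helper names `normalize_min_std` and `normalize_mean_std` are illustrative, and the statistics are computed over only the five "age" values shown by `df.head()`, so the numbers differ from the table above. `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `std()`.

```python
from statistics import mean, stdev

def normalize_min_std(values):
    """Shift by the minimum, then scale by the sample standard deviation."""
    lo, sd = min(values), stdev(values)
    return [(v - lo) / sd for v in values]

def normalize_mean_std(values):
    """Shift by the mean, then scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

# The five "age" values from df.head(); a tiny sample used only for illustration.
ages = [21, 4, 18, 13, 25]

min_scaled = normalize_min_std(ages)
z_scaled = normalize_mean_std(ages)

# The smallest value maps to exactly 0, so nothing is negative.
print(min(min_scaled), all(v >= 0 for v in min_scaled))
# Values below the mean become negative.
print(any(v < 0 for v in z_scaled))
```

Both variants only shift and scale the data; they differ in where zero ends up, which is what matters for the activation function chosen later.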


Create the X dataset by selecting all columns except the price column (which in our case is not a feature, but the value to be predicted in the end).

In [197]:
X = df_norm.loc[:, df_norm.columns != 'price']

In [198]:
X.head()

|   | date | age | distance | stores | latitude | longitude |
|---|------|-----|----------|--------|----------|-----------|
| 0 | 1.649083 | 1.853562 | 2.812644 | 1.909072 | 1.265026 | 0.315657 |
| 1 | 1.28262 | 0.353059 | 0.625032 | 0.954536 | 1.897539 | 0.315657 |
| 2 | 2.931704 | 1.588767 | 0.937548 | 2.22725 | 3.162565 | 0.0 |
| 3 | 0.366463 | 1.147443 | 0.625032 | 0.636357 | 0.0 | 2.525259 |
| 4 | 2.565241 | 2.206621 | 1.56258 | 2.545429 | 0.316257 | 0.631315 |


Create the Y dataset by only selecting the price column.

In [199]:
Y = df_norm.loc[:, df_norm.columns == 'price']

In [200]:
Y

|   | price |
|---|-------|
| 0 | 2.939923 |
| 1 | 0.753349 |
| 2 | 2.250251 |
| 3 | 0.750410 |
| 4 | 2.835101 |
| ... | ... |
| 4995 | 2.229679 |
| 4996 | 3.422890 |
| 4997 | 2.781220 |
| 4998 | 2.987926 |


Create the training, validation and test datasets. This code splits the input data into groups of 80%, 10% and 10%: the first call holds out 20% of the rows, and the second call divides that held-out part in half.

In [201]:
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.2, random_state = 42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5, random_state = 42)

Let's show the shapes of the resulting datasets.

In [202]:
print(X_train.shape, X_val.shape, X_test.shape, Y_train.shape, Y_val.shape, Y_test.shape)

(4000, 6) (500, 6) (500, 6) (4000, 1) (500, 1) (500, 1)
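
The two-step split can be mimicked with a hand-rolled helper to see where the 4000/500/500 row counts come from. This is only a sketch under simple assumptions: `split_80_10_10` is a hypothetical helper, it is not how scikit-learn's `train_test_split` is implemented, and it will not reproduce the same row assignment.

```python
import random

def split_80_10_10(rows, seed=42):
    """Shuffle, hold out 20% of the rows, then halve the held-out part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # Step 1: 80% training, 20% held out (mirrors test_size=0.2).
    n_train = len(shuffled) - round(len(shuffled) * 0.2)
    train, holdout = shuffled[:n_train], shuffled[n_train:]

    # Step 2: split the held-out rows in half (mirrors test_size=0.5).
    n_val = len(holdout) - round(len(holdout) * 0.5)
    val, test = holdout[:n_val], holdout[n_val:]
    return train, val, test

rows = list(range(5000))  # stand-in for the 5000 rows of the data set
train, val, test = split_80_10_10(rows)
print(len(train), len(val), len(test))  # 4000 500 500
```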


We build a model with 4 dense layers, each using the relu activation function. Relu never outputs a negative number, so it suits the output layer here only because we normalized in a way that avoids negative values. Had we normalized with the mean instead of the min, and thus kept negative values in the normalized dataset, we would choose the linear activation function for the output layer.

In [203]:
model = Sequential()
model.add(Dense(units=10, activation='relu', input_shape=(6,)))
model.add(Dense(units=20, activation='relu'))
model.add(Dense(units=5, activation='relu'))
model.add(Dense(units=1, activation='relu'))
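
Why the output activation has to match the range of the target can be illustrated without TensorFlow. In this minimal sketch, `relu` and `linear` are plain-Python stand-ins for the Keras activations of the same name, and the input values are made up for illustration.

```python
def relu(x):
    """Clamp negative inputs to zero and pass everything else through."""
    return max(0.0, x)

def linear(x):
    """Identity activation: the output equals the input."""
    return x

# Hypothetical pre-activation values of an output neuron.
pre_activations = [-1.5, -0.2, 0.0, 0.7, 2.9]

print([relu(z) for z in pre_activations])    # [0.0, 0.0, 0.0, 0.7, 2.9]
print([linear(z) for z in pre_activations])  # [-1.5, -0.2, 0.0, 0.7, 2.9]
```

A relu output neuron can therefore never predict a negative target, whereas a linear one can.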

Print the model summary.

In [205]:
model.summary()

Model: "sequential_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_100 (Dense)           (None, 10)                70        
                                                                 
 dense_101 (Dense)           (None, 20)                220       
                                                                 
 dense_102 (Dense)           (None, 5)                 105       
                                                                 
 dense_103 (Dense)           (None, 1)                 6         
                                                                 
Total params: 401
Trainable params: 401
Non-trainable params: 0
_________________________________________________________________
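
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input/unit pair plus one bias per unit. The helper `dense_params` below is an illustrative name, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_inputs = 6                  # number of features
layer_units = [10, 20, 5, 1]  # units of the four Dense layers

counts = []
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units          # this layer's output feeds the next layer

print(counts, sum(counts))    # [70, 220, 105, 6] 401
```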


Compile the model with mean squared error as both the loss function and the reported metric.

In [207]:
model.compile(loss='mean_squared_error', optimizer='RMSprop', metrics=['mean_squared_error'])
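
For reference, mean squared error is the average of the squared differences between targets and predictions. The sketch below computes it in plain Python; the function is a simplified stand-in for the Keras loss, the targets are the first three normalized prices shown earlier, and the predictions are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.939923, 0.753349, 2.250251]  # first three normalized prices
y_pred = [2.5, 1.0, 2.0]                 # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)
print(round(mse, 4))
```

Because the same quantity is used as loss and as metric, the `loss` and `mean_squared_error` values in the training log are identical.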

Fit the model and experiment with the number of epochs:

In [208]:
model.fit(X_train, Y_train, epochs=50, validation_data=(X_val, Y_val))

Epoch 1/50
  1/125 [..............................] - ETA: 47s - loss: 6.9357 - mean_squared_error: 6.9357
Epoch 2/50
 11/125 [=>............................] - ETA: 0s - loss: 1.0314 - mean_squared_error: 1.0314
...
Epoch 50/50


<keras.callbacks.History at 0x2f992b190>

Evaluate the model performance:

In [209]:
model.evaluate(X_test, Y_test)



[0.15392450988292694, 0.15392450988292694]
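
The two numbers are the loss and the metric, which are both mean squared error, and they are expressed in normalized units. To read them as prices, the normalization has to be inverted. The sketch below is an approximation under a stated assumption: the notebook does not print `df.min()` or `df.std()`, so `price_std` and `price_min` are estimated here from two rows of the tables above, and `denormalize` is an illustrative helper.

```python
import math

# Two (raw price, normalized price) pairs copied from the tables above.
raw_a, norm_a = 14264, 2.939923
raw_b, norm_b = 12032, 0.753349

# Estimates recovered from those two rows (assumption: close to, but not
# exactly, the values of df.std() and df.min() for the price column).
price_std = (raw_a - raw_b) / (norm_a - norm_b)
price_min = raw_a - norm_a * price_std

def denormalize(value, col_min, col_std):
    """Invert the (x - min) / std normalization."""
    return value * col_std + col_min

# Sanity check on the third row, whose raw price is 13560.
recovered = denormalize(2.250251, price_min, price_std)

# Root mean squared error converted back to price units.
test_mse = 0.15392450988292694  # value reported by model.evaluate
rmse_in_price_units = math.sqrt(test_mse) * price_std

print(round(recovered), round(rmse_in_price_units))
```

With these estimates the typical prediction error comes out at roughly 400 price units, against prices in the range of about 12000 to 14000 in the rows shown.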

## Conclusion

### Important Ideas

   * Always apply the normalization to all of the data, features and target alike!
      - subtracting min() and dividing by std() avoids negative values in the normalized data
      - subtracting mean() and dividing by std() keeps negative values in the normalized data
   * Experiment with the number of epochs
   * Experiment with different activation functions (in the output and hidden layers!)
      - regression with only non-negative target values: use "relu"
      - regression with negative and positive target values: use "linear"
      - classification: use "sigmoid"