<a href="https://colab.research.google.com/github/morarsebastianroloway/HousePricingPredictor/blob/master/House_Pricing_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description

In this simple example, we train a neural network to predict housing prices. The training data has 14 variables: 13 predictors and one target. It comes from the Boston Housing Price Prediction dataset hosted on Kaggle.

Data description
The Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

    crim
    - per capita crime rate by town.

    zn
    - proportion of residential land zoned for lots over 25,000 sq.ft.

    indus
    - proportion of non-retail business acres per town.

    chas
     - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

    nox
    - nitrogen oxides concentration (parts per 10 million).

    rm
    - average number of rooms per dwelling.

    age
    - proportion of owner-occupied units built prior to 1940.

    dis
    - weighted mean of distances to five Boston employment centres.

    rad
    - index of accessibility to radial highways.

    tax
    - full-value property-tax rate per $10,000.
    
    ptratio
    - pupil-teacher ratio by town.

    black
    - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

    lstat
    - lower status of the population (percent).

    medv
    - median value of owner-occupied homes in $1000s.

Source
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley. 

# Import what we need and download the training data
We will need to download train.csv and store it somewhere accessible. Let’s start by importing what we need and reading in our data.

In [0]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [62]:
# Read the training data, using the ID column as the index
train_df = pd.read_csv('https://firebasestorage.googleapis.com/v0/b/bible-project-2365c.appspot.com/o/train.csv?alt=media&token=9c5d17c2-0589-43ea-b992-e7c2ad02d714', index_col='ID')
train_df.head()

# The above code will print the first five rows of the imported data. We will
# see the column names printed along with the data in a tabular format. Our 
# target variable is called medv, so we store it.

ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9


In [0]:
predictors = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad',
              'tax', 'ptratio', 'black', 'lstat']
target = 'medv'

# Normalize data for neural networks to perform optimally

In [0]:
from sklearn.preprocessing import MinMaxScaler

In [65]:
# If you take a look at the data, you will see that the different columns have 
# different ranges. This is not good for gradient descent. We need to have the 
# columns range between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
# Scale both the training inputs and outputs
scaled_train = scaler.fit_transform(train_df)



In [66]:
# Print out the adjustment that the scaler applied to the medv column, which is
# the last column of the data (index 13)
print("Note: median values were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[13], scaler.min_[13]))

Note: median values were scaled by multiplying by 0.0222222222 and adding -0.111111
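The two numbers printed above follow from the min-max formula. The sketch below is a plain-Python illustration, not scikit-learn's implementation. `min_max_params` is a hypothetical helper, and the medv minimum (5.0) and maximum (50.0) are inferred from the printed scale and offset rather than read from the data.

```python
def min_max_params(data_min, data_max, feature_range=(0.0, 1.0)):
    """Return (scale, offset) such that scaled = original * scale + offset."""
    low, high = feature_range
    scale = (high - low) / (data_max - data_min)
    offset = low - data_min * scale
    return scale, offset


# Assumed medv range in $1000s, inferred from the printed values above.
scale, offset = min_max_params(5.0, 50.0)
print("scale = {:.10f}, offset = {:.6f}".format(scale, offset))
# prints: scale = 0.0222222222, offset = -0.111111
```

With these parameters the smallest medv value maps to 0 and the largest maps to 1.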


In [0]:
multiplied_by = scaler.scale_[13]
added = scaler.min_[13]

In [68]:
print(type(scaled_train))

<class 'numpy.ndarray'>


In [0]:
# Scaling produces a Numpy Array. We need to create a DataFrame out of that. 
scaled_train_df = pd.DataFrame(scaled_train, columns=train_df.columns.values)

# Let's build our model


In [0]:
# We are now ready to start building our Neural Network. We will make use of
# a Sequential model.
model = tf.keras.Sequential()

In [0]:
# We can now add layers to our model. We will be creating fully connected
# layers using model.add(). Each call adds one layer. Because we do not specify
# an input shape, Keras infers it the first time the model sees data. We need
# to tell each layer how many units (neurons) it has, which is also the size
# of its output. We also need to specify the activation of each hidden layer.
# In this case, we use the relu activation function.
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dense(1))

# Notice that the final layer outputs one value. That is because we are
# predicting a continuous variable. For the same reason, we do not specify an
# activation
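To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes the model is fed the 13 predictor columns. `dense_params` is a hypothetical helper based on the usual fully connected layer layout (one weight per input-unit pair plus one bias per unit).

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 13 inputs (assumed), then the four Dense layers defined above.
layer_sizes = [13, 50, 100, 50, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # prints: [700, 5100, 5050, 51]
print(total)      # prints: 10901
```

Once the model has been trained (and therefore built), `model.summary()` should report the same per-layer counts.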

In [0]:
# Next, we need to compile our model. We do this by specifying our loss 
# function and our optimizer
model.compile(loss='mean_squared_error', optimizer='adam')
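As a reminder of what the `mean_squared_error` loss measures, here is a minimal plain-Python sketch. The function name and the sample values are made up for illustration. Keras computes the same quantity on batches of tensors.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up scaled targets and predictions.
y_true = [0.5, 0.25, 1.0]
y_pred = [0.5, 0.75, 0.5]
loss = mean_squared_error(y_true, y_pred)
print(loss)  # (0 + 0.25 + 0.25) / 3, about 0.1667
```

Because the errors are squared, a few large misses raise the loss more than many small ones.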

In [0]:
# We are now ready to train our model. Before we do that, we need to get our
# training dataset ready. We will leave out the first ten rows of our data so
# we can use them for validation. We will separate our predictors into X, and
# our target into Y.
X = scaled_train_df.drop(target, axis=1).values
Y = scaled_train_df[[target]].values
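The hold-out split used here is plain slicing: the first ten rows are kept aside and everything after them is used for training. The sketch below shows the idea on a made-up list of 15 row indices. The variable names are hypothetical and do not appear in the notebook.

```python
rows = list(range(15))  # stand-in for 15 rows of data (made up)
n_holdout = 10

holdout_rows = rows[:n_holdout]    # first ten rows, kept for evaluation
training_rows = rows[n_holdout:]   # remaining rows, used for fitting

# The two slices never share a row.
overlap = set(holdout_rows) & set(training_rows)
print(len(holdout_rows), len(training_rows), len(overlap))  # prints: 10 5 0
```

The same `[:10]` and `[10:]` slices work on the NumPy arrays `X` and `Y`.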

In [74]:
# We will train our model by passing in our training dataset. We also need to
# specify the number of times we would like to go over our training data. Each
# full pass over the data is called an epoch.
model.fit(
    X[10:],
    Y[10:],
    epochs=50,
    shuffle=True,
    verbose=2
)

Epoch 1/50
 - 1s - loss: 0.1309
Epoch 2/50
 - 0s - loss: 0.0504
Epoch 3/50
 - 0s - loss: 0.0298
Epoch 4/50
 - 0s - loss: 0.0233
Epoch 5/50
 - 0s - loss: 0.0205
Epoch 6/50
 - 0s - loss: 0.0170
Epoch 7/50
 - 0s - loss: 0.0141
Epoch 8/50
 - 0s - loss: 0.0121
Epoch 9/50
 - 0s - loss: 0.0108
Epoch 10/50
 - 0s - loss: 0.0100
Epoch 11/50
 - 0s - loss: 0.0143
Epoch 12/50
 - 0s - loss: 0.0107
Epoch 13/50
 - 0s - loss: 0.0095
Epoch 14/50
 - 0s - loss: 0.0082
Epoch 15/50
 - 0s - loss: 0.0072
Epoch 16/50
 - 0s - loss: 0.0072
Epoch 17/50
 - 0s - loss: 0.0069
Epoch 18/50
 - 0s - loss: 0.0065
Epoch 19/50
 - 0s - loss: 0.0057
Epoch 20/50
 - 0s - loss: 0.0059
Epoch 21/50
 - 0s - loss: 0.0056
Epoch 22/50
 - 0s - loss: 0.0058
Epoch 23/50
 - 0s - loss: 0.0059
Epoch 24/50
 - 0s - loss: 0.0050
Epoch 25/50
 - 0s - loss: 0.0051
Epoch 26/50
 - 0s - loss: 0.0050
Epoch 27/50
 - 0s - loss: 0.0044
Epoch 28/50
 - 0s - loss: 0.0053
Epoch 29/50
 - 0s - loss: 0.0046
Epoch 30/50
 - 0s - loss: 0.0042
Epoch 31/50
 - 0s -

<tensorflow.python.keras.callbacks.History at 0x7f422026fd68>

In [75]:
test_error_rate = model.evaluate(X[:10], Y[:10], verbose=0)
print("The mean squared error (MSE) for the test data set is: {}".format(test_error_rate))

The mean squared error (MSE) for the test data set is: 0.0032902092207223177
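The MSE above is in scaled units, which makes it hard to interpret. A rough conversion back to dollars is sketched below. It assumes the scale factor printed earlier (0.0222222222) is exactly 1/45. The offset drops out because an error is a difference between two scaled values.

```python
import math

scaled_mse = 0.0032902092207223177  # value printed above
scale = 1.0 / 45.0                  # assumed exact form of scaler.scale_[13]

# Root mean squared error in scaled units, then in $1000s.
rmse_scaled = math.sqrt(scaled_mse)
rmse_thousands = rmse_scaled / scale

print("RMSE in $1000s: {:.2f}".format(rmse_thousands))
```

On these ten held-out rows the typical error works out to roughly 2.6 thousand dollars. Ten rows is a small sample, so the figure is only a rough indication.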


# Make a prediction

In [0]:
# At this point we are ready to make a prediction.
prediction = model.predict(X[:1])

In [77]:
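y_0 = prediction[0][0]
print('Prediction with scaling - {}'.format(y_0))
# Undo the min-max scaling: subtract the offset, then divide by the scale factor
y_0 -= added
y_0 /= multiplied_by
# medv is expressed in $1000s, so multiply by 1000 to get dollars
print("Housing Price Prediction  - ${}".format(y_0*1000))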

Prediction with scaling - 0.47554340958595276
Housing Price Prediction  - $26399.453431367874
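The manual un-scaling can be wrapped in a pair of small functions and checked with a round trip. This is a sketch under assumptions: `unscale` and `rescale` are hypothetical helpers, and the scale and offset are taken to be exactly 1/45 and -5/45, matching the values printed earlier. scikit-learn's `MinMaxScaler.inverse_transform` performs the same inversion, but it expects all 14 columns the scaler was fitted on, which is why a single prediction is converted by hand here.

```python
scale = 1.0 / 45.0    # assumed exact form of scaler.scale_[13]
offset = -5.0 / 45.0  # assumed exact form of scaler.min_[13]


def unscale(scaled_value, scale, offset):
    """Invert scaled = original * scale + offset."""
    return (scaled_value - offset) / scale


def rescale(original_value, scale, offset):
    """Apply the forward min-max transform."""
    return original_value * scale + offset


scaled_prediction = 0.47554340958595276  # value printed above
price_thousands = unscale(scaled_prediction, scale, offset)
print("${:,.2f}".format(price_thousands * 1000))  # prints: $26,399.45

# Round trip: scaling the recovered price gives back the model output.
round_trip = rescale(price_thousands, scale, offset)
```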
