# Linear regression: Housing Price Prediction

This notebook uses the classic [Boston Housing](http://lib.stat.cmu.edu/datasets/boston) dataset to build a linear regression model that predicts housing prices in the Boston suburbs.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

import tensorflow as tf

from tensorflow import keras  # shorter alias for tf.keras
from tensorflow.keras import layers  # layer classes, e.g. hidden and output layers
from tensorflow.keras.datasets import boston_housing  # dataset loader
import time

## The Boston Housing Dataset

The Boston dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The dataset is small, with only 506 cases. A full description of all variables can be found [here](http://lib.stat.cmu.edu/datasets/boston).

### Get the data
First, download the dataset. The Boston dataset is built into Keras. Note that the built-in version is already split into training and test sets, but its features are raw values on very different scales; Keras does not normalize them.

In [None]:
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# The dataset is already split into training and test data.
# load_data() takes optional arguments (path, test_split, seed);
# path is where the downloaded file is cached.
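
If the feature columns turn out to be on very different scales, a common remedy is to standardize each column using statistics computed on the training set only. The sketch below is dependency-free and runs on a tiny made-up table rather than the real data; `column_stats` and `standardize_columns` are hypothetical helpers, not part of Keras.

```python
import statistics

def column_stats(rows):
    """Return per-column means and standard deviations for a list of rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # pstdev is the population standard deviation; fall back to 1.0 if a column is constant
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize_columns(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Made-up toy data: two features on very different scales
toy_train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
toy_test = [[2.0, 400.0]]

means, stds = column_stats(toy_train)
toy_train_scaled = standardize_columns(toy_train, means, stds)
# Reuse the training statistics for the test set so no test information leaks into training
toy_test_scaled = standardize_columns(toy_test, means, stds)
```

After scaling, every training column has mean 0 and standard deviation 1, which usually lets a single learning rate work for all weights.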

In [None]:
#check training and test size
print('Training data : {}'.format(train_data.shape))
print('Training target : {}'.format(train_targets.shape))
print('Test data : {}'.format(test_data.shape))
print('Test target : {}'.format(test_targets.shape))

# The shapes tell us the form of the regression model: y = g(β0 + β1x1 + ... + β13x13),
# i.e. 13 inputs and 1 output; training finds the β values.
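
The linear model in the comment above can be written out by hand. This is a minimal sketch with g taken as the identity function and only 3 features instead of 13; `predict` and the coefficient values are made up for illustration.

```python
def predict(features, weights, bias):
    """Linear model: y = bias + sum over j of weights[j] * features[j]."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Made-up coefficients for a 3-feature example (the real model has 13 features)
beta0 = 1.0
betas = [0.5, -2.0, 3.0]
example = [4.0, 1.0, 2.0]

y_hat = predict(example, betas, beta0)  # 1.0 + 2.0 - 2.0 + 6.0
```

A `Dense(1)` layer with no activation computes the same expression, with the weights and bias learned from the data.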

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with a single `Dense` output layer that returns one continuous value. We first build two models directly, one trained with Adam and one with SGD, so that the two optimizers can be compared.

In [None]:
# Method 1: build the model directly
# Adam
model = keras.Sequential([layers.Dense(1, input_shape=[train_data.shape[1]])])
# Dense means fully connected (a convolutional layer, for example, is not dense).
# The layer has 1 unit because there is only one output.
optimizer = tf.keras.optimizers.Adam(0.001)
# Use the Adam optimizer; 0.001 is the learning rate, which is also the default value.
# https://keras.io/api/optimizers/adam/
model.compile(loss='mse',
              optimizer=optimizer,
              metrics=['mse'])
# The MSE loss treats this as a regression problem; metrics are the values reported when evaluating the model.
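
To give a feel for what the optimizer does with that learning rate, here is a scalar sketch of the published Adam update rule applied to a toy quadratic. It is not Keras's implementation; `adam_step` is a hypothetical helper, and the default values for `beta1`, `beta2`, and `eps` are assumed to match those documented for Keras.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
first_theta = None
for t in range(1, 5001):
    grad = 2 * (theta - 3)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
    if t == 1:
        first_theta = theta  # the first step has size close to lr, whatever the gradient scale
```

Because each step is divided by the running gradient magnitude, the step size depends much less on the scale of the gradient than it does in plain SGD.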

In [None]:
#SGD
model_1 = keras.Sequential([layers.Dense(1, input_shape=[train_data.shape[1]])])

optimizer = tf.keras.optimizers.SGD(0.0000004)

model_1.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mse'])
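
The very small SGD learning rate is easier to understand with a toy problem. The sketch below runs full-batch gradient descent on a one-parameter model with a made-up, unscaled feature; `run_gradient_descent` and the data are hypothetical and only meant to show how large feature values restrict the usable step size.

```python
def run_gradient_descent(xs, ys, lr, steps=50):
    """Full-batch gradient descent on y = w * x with MSE loss; returns the final loss."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw mean((w*x - y)^2) = 2 * mean(x * (w*x - y))
        grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

# Made-up unscaled feature with values in the hundreds; the true slope is 0.05
xs = [100.0, 200.0, 300.0, 400.0]
ys = [0.05 * x for x in xs]

# For this one-parameter problem the iteration is stable only if lr < 1 / mean(x^2) = 1 / 75000
loss_small_lr = run_gradient_descent(xs, ys, lr=5e-6)  # below the threshold: converges
loss_large_lr = run_gradient_descent(xs, ys, lr=1e-4)  # above the threshold: diverges
```

With features scaled to unit variance, the same threshold would be close to 1, so far larger learning rates would be usable.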

### Build the model

We can also wrap the model-building steps in a function, `build_model`, in case we want to create another model later on.

In [None]:
# Method 2: wrap the steps in a function for future use
# Adam
def build_model():
    model = keras.Sequential([layers.Dense(1, input_shape=[train_data.shape[1]])])
    optimizer = tf.keras.optimizers.Adam(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mse'])
    return model

### Inspect the model

Use the `.summary()` method to print a simple description of the model.

In [None]:
model.summary()
# One output layer
# Number of parameters: 13 weights + 1 bias = 14
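
The parameter count reported by the summary can be checked by hand. This is a small sketch; `dense_param_count` is a hypothetical helper, and the 64-unit layer is only an illustration, not part of this notebook's model.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Trainable parameters of a fully connected layer:
    one weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + (n_units if use_bias else 0)

single_output = dense_param_count(13, 1)    # 13 weights + 1 bias
hidden_example = dense_param_count(13, 64)  # a hypothetical 64-unit hidden layer
```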

In [None]:
model_1.summary()

### Train the model

Train the model for 3001 epochs, and record the training loss and MSE in the `history` object.

In [None]:
# Display training progress by printing the training loss every 100 epochs
# Adam
class PrintLoss(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        # start the timer at the first epoch
        if epoch == 0:
            self.time = time.monotonic()

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 100 == 0:
            # elapsed training time since the first epoch
            t = time.monotonic() - self.time
            # if a validation set is included, its loss is recorded in logs['val_loss']
            print('| Epoch {:4} | training time {:6.2f}s | train loss {:6.2f} |'
                  .format(epoch, t, logs['loss']))

EPOCHS = 3001  # k
# Epochs are numbered from 0, so with 3000 epochs the last printed epoch is 2900; use 3001 to print epoch 3000.
# e.g. with sample size n, k epochs process k*n samples in total
# Try different values and see when training converges.

history = model.fit(train_data, train_targets,
                    epochs=EPOCHS, verbose=0,
                    callbacks=[PrintLoss()])
# With batch_size=500 the training loss does not converge; batch_size=404 corresponds to the full training set.
# SGD is very sensitive to the choice of step size.
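
The relation between epochs, batch size, and the number of gradient updates can be made concrete with a little arithmetic. The sketch assumes 404 training samples and Keras's default batch size of 32 (the `fit` call above does not set one); `count_updates` and `logged_epochs` are hypothetical helpers.

```python
import math

def count_updates(n_samples, epochs, batch_size=32):
    """Gradient updates performed by mini-batch training: one per batch per epoch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

def logged_epochs(epochs, every=100):
    """Epoch indices at which a callback using `epoch % every == 0` prints."""
    return [e for e in range(epochs) if e % every == 0]

# Assumes 404 training samples and a batch size of 32
updates = count_updates(404, 3001)
last_logged_3000 = logged_epochs(3000)[-1]  # epochs 0..2999
last_logged_3001 = logged_epochs(3001)[-1]  # epochs 0..3000
```

The last two values show why the epoch count is 3001 rather than 3000: only then does the callback also print a line for epoch 3000.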

In [None]:
# SGD
# Same callback as above, repeated so that this cell runs on its own
class PrintLoss(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 0:
            self.time = time.monotonic()

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 100 == 0:
            t = time.monotonic() - self.time
            print('| Epoch {:4} | training time {:6.2f}s | train loss {:6.2f} |'
                  .format(epoch, t, logs['loss']))

EPOCHS = 3001  # k

history_1 = model_1.fit(train_data, train_targets,
                        epochs=EPOCHS, verbose=0,
                        callbacks=[PrintLoss()])

# SGD is very sensitive to the choice of step size.

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
# Adam: convert the training history into a DataFrame
hist = pd.DataFrame(history.history)
# add the epoch number as a new column
hist['epoch'] = history.epoch
display(hist.tail())

# SGD
hist_1 = pd.DataFrame(history_1.history)
# add the epoch number as a new column
hist_1['epoch'] = history_1.epoch
display(hist_1.tail())

In [None]:
def plot_history(history):
    # convert the training history into a DataFrame with an epoch column
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Squared Error [price^2]')
    plt.plot(hist['epoch'], hist['mse'],
             label='Train Error')
    plt.legend()

    plt.show()

# Adam
plot_history(history)

In [None]:
# SGD
plot_history(history_1)

### Make predictions

Calculate the mean squared error on the test set.

In [None]:
_, mse = model.evaluate(test_data, test_targets, verbose=0)

print("Adam: Testing set Mean Squared Error: {:5.2f}".format(mse))

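# evaluate() returns the loss followed by the compiled metrics (here, MSE).
# "_" is a conventional name for a temporary or unimportant variable
# (in an interactive session it also holds the most recent result).
# verbose=0 suppresses all log output.
# verbose=1 shows a progress bar, for example:
# 4/4 [==============================] - 0s 972us/step - loss: 24.4064 - mse: 24.4064

_, mse_1 = model_1.evaluate(test_data, test_targets, verbose=0)

print("SGD: Testing set Mean Squared Error: {:5.2f}".format(mse_1))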
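
The metric itself is simple to reproduce by hand. This is a minimal sketch on made-up numbers, not on the notebook's predictions; `mean_squared_error` here is a hypothetical helper written for illustration.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions, in the same units as the price
toy_targets = [20.0, 25.0, 30.0]
toy_predictions = [22.0, 24.0, 27.0]

toy_mse = mean_squared_error(toy_targets, toy_predictions)  # (4 + 1 + 9) / 3
toy_rmse = math.sqrt(toy_mse)  # back in price units, which is easier to interpret
```

Taking the square root turns the MSE into a typical error in the same units as the target, so the two optimizers can be compared on a more intuitive scale.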


Finally, predict housing prices using the data in the test set:

In [None]:
test_predictions = model.predict(test_data).flatten()

plt.scatter(test_targets, test_predictions)
plt.xlabel('True Values [price]')
plt.ylabel('Predictions [price]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100]) #diagonal line

In [None]:
test_predictions_1 = model_1.predict(test_data).flatten()

plt.scatter(test_targets, test_predictions_1)
plt.xlabel('True Values [price]')
plt.ylabel('Predictions [price]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100]) #diagonal line