# Boston House Price Prediction

https://www.kaggle.com/c/boston-housing

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per \$10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - percentage lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

## Load the dataset

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np

In [None]:
boston_df = pd.read_csv('https://raw.githubusercontent.com/manaranjanp/GenAI_LLM/main/DLIntro/boston.csv')

In [None]:
boston_df.head(5)

In [None]:
boston_df.info()

### Set X and Y Variables

In [None]:
boston_df.columns

In [None]:
X = np.array(boston_df[['crim', 'zn', 'indus', 'chas', 
                        'nox', 'rm', 'age', 'dis', 'rad',
                        'tax', 'ptratio', 'black', 'lstat']])

In [None]:
Y = np.array(boston_df.medv)

In [None]:
X.shape

In [None]:
Y.shape

## Split dataset into train and test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_X, test_X, train_y, test_y = train_test_split( X, Y, test_size = 0.2)

In [None]:
train_X.shape

In [None]:
test_X.shape

### Normalize data

All features need to be brought onto a common scale. Here we use standardization: subtract the mean of each feature and divide by its standard deviation.

Both the training and the test data must be standardized with the mean and standard deviation of the training dataset, because the NN parameters are estimated from the training dataset only.

In [None]:
## Calculate mean and std from the training dataset
mean = train_X.mean(axis=0)
std = train_X.std(axis=0)

## Standardizing the training dataset
train_X -= mean
train_X /= std

## Standardizing the test dataset
test_X -= mean
test_X /= std

## Build NN Model

Explain:

1. NN Architecture
2. Layers and Neurons
3. Activation Functions 
4. Loss Function
5. Backpropagation 
6. Gradient Descent and variations of Gradient Descent
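
As a rough illustration of the first four topics, the following NumPy sketch runs a forward pass through a network with the same shape as the models in this notebook (13 inputs, two hidden layers of 64 ReLU units, one linear output) and computes an MSE loss. The weights are random and the data is synthetic; the names `forward`, `relu` and `mse` are illustrative helpers, not Keras APIs.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """ReLU activation: max(z, 0) applied element-wise."""
    return np.maximum(z, 0.0)

# Layer sizes: 13 input features -> 64 -> 64 -> 1 output
layer_sizes = [13, 64, 64, 1]

# Randomly initialised weights and zero biases (illustrative only)
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Forward pass: ReLU on the hidden layers, linear output layer."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        a = relu(z) if i < len(weights) - 1 else z
    return a

def mse(y_true, y_pred):
    """Mean squared error; ravel() flattens the (n, 1) predictions."""
    return float(np.mean((y_true - y_pred.ravel()) ** 2))

# Each Dense layer has (inputs * units) weights plus one bias per unit
n_params = sum(W.size + b.size for W, b in zip(weights, biases))

x_batch = rng.normal(size=(5, 13))   # 5 synthetic examples
y_batch = rng.normal(size=5)         # 5 synthetic targets
y_hat = forward(x_batch)

print(y_hat.shape, n_params, mse(y_batch, y_hat))
```

The parameter count (13*64 + 64, then 64*64 + 64, then 64 + 1) should agree with the total reported by `model.summary()` for the models built below.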

### Model 1:

In [None]:
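import tensorflow as tf
from tensorflow import keras
# Import everything from tensorflow.keras so that all objects come from
# the same Keras installation
from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import SGD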

In [None]:
tf.__version__

In [None]:
model = models.Sequential()

model.add(Dense(64, input_shape=(train_X.shape[1],)))

model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(1))

In [None]:
model.summary()

In [None]:
model.compile(optimizer="sgd", 
              loss='mse', 
              metrics=['mse'])

**EPOCH** - one complete pass of the ENTIRE training dataset forward and backward through the neural network.

**BATCH SIZE** - the number of training examples in a single batch. The weights are updated once after each batch has been processed.

Validation metrics are usually measured at the end of each epoch to track the progress of learning and to detect underfitting or overfitting.

In [None]:
EPOCHS = 30
## BATCH_SIZE is not set here, so model.fit() uses the Keras default of 32

Explain how data would be taken in batches and run multiple epochs.
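
The batching logic can be sketched in plain NumPy. This is only an approximation of what `model.fit()` does internally: the helper `iterate_minibatches` is hypothetical, and the sample count of 404 is an assumed training-set size used for illustration. The data is shuffled every epoch and split into batches, and each batch would trigger one weight update.

```python
import math
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset one batch at a time."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

n_samples = 404    # assumed training-set size, for illustration only
batch_size = 32    # Keras default when batch_size is not passed to fit()
epochs = 3
rng = np.random.default_rng(0)

updates = 0
for epoch in range(epochs):
    seen = 0
    for batch_idx in iterate_minibatches(n_samples, batch_size, rng):
        # A real training loop would run forward pass, backpropagation
        # and one weight update on train_X[batch_idx] here.
        seen += len(batch_idx)
        updates += 1
    # Every example is visited exactly once per epoch
    assert seen == n_samples

steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch, updates)
```

The last batch of each epoch is smaller than the others whenever the dataset size is not a multiple of the batch size.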

In [None]:
%%time
history = model.fit(
    train_X, 
    train_y,  # prepared data
    epochs=EPOCHS,
    validation_data=(test_X, test_y),
    verbose=1
)

In [None]:
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [None]:
history.history.keys()

In [None]:
def plot_mse(hist):
    plt.plot(hist['mse'])
    plt.plot(hist['val_mse'])
    plt.title('MSE')
    plt.ylabel('mse')
    plt.xlabel('epoch')
    plt.legend(['train', 
                'test'], 
               loc='upper left')
    plt.show()
    
def plot_loss(hist):
    plt.plot(hist['loss'])
    plt.plot(hist['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 
                'test'], 
               loc='upper left')
    plt.show()    

In [None]:
plot_mse(history.history)
plot_loss(history.history)

The loss is diverging, which suggests that the learning rate is too high.

- Explain learning rate
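
The effect of the learning rate can be seen on a toy problem. The sketch below applies plain gradient descent to the one-dimensional loss `w**2`, which is an illustrative stand-in and not the network's actual loss surface; the function name and the two learning rates are chosen only for the demonstration.

```python
def gradient_descent(learning_rate, w0=1.0, steps=20):
    """Minimise loss(w) = w**2 and return the loss after every step."""
    w = w0
    losses = [w ** 2]
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # gradient descent update
        losses.append(w ** 2)
    return losses

high = gradient_descent(learning_rate=1.1)   # step overshoots the minimum
low = gradient_descent(learning_rate=0.1)    # step shrinks w towards 0

print("high learning rate, final loss:", high[-1])
print("low learning rate, final loss:", low[-1])
```

With the high learning rate each update overshoots the minimum by more than the previous distance, so the loss grows at every step. With the low rate the loss shrinks steadily.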

### Model 2: With Lower Learning Rate

In [None]:
tf.keras.backend.clear_session()

model = models.Sequential()

model.add(Dense(64, input_shape=(train_X.shape[1],)))

model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(1))

In [None]:
sgd = SGD(learning_rate=0.001)
model.compile(optimizer=sgd, loss='mse', metrics=['mse'])

In [None]:
%%time

EPOCHS = 100

history = model.fit(
    train_X, 
    train_y,  # prepared data
    epochs=EPOCHS,
    validation_data=(test_X, test_y),
)

In [None]:
plot_mse(history.history)
plot_loss(history.history)

## Participant Exercise: 1

1. Change the activation functions to a) sigmoid and b) tanh and build the model
2. Add more neurons to the model
3. Add more layers to the model

Print the model summary and validation loss from the last epoch.

### Model Prediction and Measure Accuracy

In [None]:
pred_y = model.predict(test_X)

In [None]:
## RMSE on the test set (mean_squared_error is imported from sklearn.metrics above)
np.sqrt(mean_squared_error(test_y, pred_y))
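
If the RMSE is computed by hand instead, the array shapes need care: a Keras model ending in `Dense(1)` returns predictions of shape `(n, 1)`, while `test_y` has shape `(n,)`. The sketch below uses small made-up arrays and a hypothetical helper `rmse` to show how flattening both arrays avoids accidental broadcasting.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error after flattening both arrays to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([1.0, 2.0, 3.0])          # shape (3,), like test_y
y_pred = np.array([[1.0], [2.0], [5.0]])    # shape (3, 1), like model.predict output

print(rmse(y_true, y_pred))

# Without flattening, NumPy broadcasts (3,) against (3, 1) into a (3, 3) matrix,
# which would silently give a wrong error value.
print((y_true - y_pred).shape)
```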

## Using Callbacks

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint

In [None]:
callbacks_list = [ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1, 
                                    patience=3),
                  EarlyStopping(monitor='val_loss',
                                patience=6),
                  ## The .h5 extension already selects the HDF5 format
                  ModelCheckpoint(filepath='boston_house_model.h5',
                                  monitor='val_loss',
                                  save_best_only=True)]

In [None]:
tf.keras.backend.clear_session()

model = models.Sequential()

model.add(Dense(64, input_shape=(train_X.shape[1],)))

model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(1))

In [None]:
sgd = SGD(learning_rate=0.005)
model.compile(optimizer=sgd, loss='mse', metrics=['mse'])

In [None]:
%%time

EPOCHS = 100

history = model.fit(
    train_X, 
    train_y,  # prepared data
    epochs=EPOCHS,
    callbacks = callbacks_list,
    validation_data=(test_X, test_y),
)

In [None]:
plot_mse(history.history)
plot_loss(history.history)

### Saving the model

In [None]:
model.save('boston_house_model.h5')

## Loading Model and Making Prediction

In [None]:
new_model = keras.models.load_model('boston_house_model.h5')

In [None]:
test_X[0:1]

In [None]:
## Predict with the model that was loaded from disk
house_price_pred = new_model.predict(test_X[0:1])

In [None]:
house_price_pred[0]

New data must always be normalized with the training data parameters (mean and standard deviation) before it is passed to the model.
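
A minimal sketch of that step is shown below. The helper `standardize_new` and the toy numbers are hypothetical. In the notebook, the `mean` and `std` arrays computed from `train_X` would take the place of `train_mean` and `train_std`. Unlike the in-place `-=` and `/=` used earlier, this version returns a new array and leaves the raw input untouched.

```python
import numpy as np

def standardize_new(raw_X, train_mean, train_std):
    """Scale new rows with statistics computed on the training data.

    Returns a new 2-D array of shape (n_rows, n_features).
    """
    raw_X = np.atleast_2d(np.asarray(raw_X, dtype=float))
    return (raw_X - train_mean) / train_std

# Toy training data with two features (illustrative values only)
train = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

new_row = np.array([3.0, 50.0])     # one unseen, unscaled example
scaled = standardize_new(new_row, train_mean, train_std)

print(scaled.shape)   # one row, ready to pass to a model's predict()
print(scaled)
```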