## MIS780 - Advanced Artificial Intelligence for Business

## Week 4 - Part 1: Multi-layer Perceptron for Regression

In this notebook, we will perform Ames house price prediction using Deep Learning models.


## Table of Contents

1. [Preparation](#cell_Preparation)
2. [Ames real-estate data](#cell_Ames)
3. [Deep Learning with Sequential Model](#cell_deep)


<a id = "cell_Preparation"></a>
## 1. Preparation

Load some standard Python libraries.

In [None]:
from __future__ import print_function
import os
import math
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next, load the `scikit-learn` modules for data splitting, preprocessing and metrics.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error

Set some options to control the pandas display.

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

<a id = "cell_Ames"></a>
## 2. Ames real-estate data

Upload the provided data set `ames_house_data.csv` to Google Colab and run the code below.

In [None]:
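ames_data_org = pd.read_csv("ames_house_data.csv")
ames_data_org.set_index('PID', inplace=True)
# len() counts rows (records); .size would count rows x columns
print('Number of records read: ', len(ames_data_org))
# Keep head() as the last expression so the notebook displays it
ames_data_org.head(10)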

Find the column types and the number of missing values in each column<br>
Note that we can also use: `ames_data_org.info()`

In [None]:
# Finding column types
ames_data_org.dtypes

In [None]:
# Identification of missing values
missing = ames_data_org.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(ascending=False)
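When deciding which columns to drop, it can help to view the missing values as a fraction of all rows rather than as raw counts. The following is a minimal sketch on a made-up toy frame (the values and the choice of columns are assumptions for illustration, not taken from `ames_house_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for ames_data_org
toy = pd.DataFrame({
    "Pool_QC": [np.nan, np.nan, np.nan, "Gd"],
    "Lot_Frontage": [60.0, np.nan, 80.0, 75.0],
    "Lot_Area": [8450, 9600, 11250, 9550],
})

# isnull().mean() gives the fraction of missing values per column
missing_frac = toy.isnull().mean()

# Keep only columns that have gaps, worst first
missing_frac = missing_frac[missing_frac > 0].sort_values(ascending=False)
print(missing_frac)
```

Columns where this fraction is close to 1 carry little information and are natural candidates for dropping.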

Drop columns with many missing values, then show summary statistics for each column.

In [None]:
ames_data_org.drop(['Pool_QC', 'Misc_Feature', 'Alley', 'Fence', 'Fireplace_Qu'], axis=1, inplace=True)
ames_data_org.describe(include='all')

Select the numeric columns and a few "promising" categorical variables, which are one-hot encoded.<br>
Avoid columns with a severe class imbalance, i.e. those where `freq` is approximately equal to `count` in the statistics above.

In [None]:
ames_data_num = ames_data_org.select_dtypes(include='number')
ames_data_hstyle= pd.get_dummies(ames_data_org['House_Style'], prefix='HStyle')
ames_data_area= pd.get_dummies(ames_data_org['Neighborhood'], prefix='Area')
ames_data = pd.concat([ames_data_num, ames_data_hstyle, ames_data_area], axis=1, join='inner')
label_col = 'SalePrice'
ames_data.head(10)
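The imbalance check mentioned above can also be done directly, by computing the share of rows taken by the most frequent category in each column. This is a sketch on a small made-up frame (the values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: one fairly balanced column, one dominated by a single value
toy = pd.DataFrame({
    "House_Style": ["1Story", "2Story", "1Story", "SLvl", "2Story", "1Story"],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave", "Grvl"],
})

# Share of rows taken by the most frequent category in each column
# (value_counts sorts in descending order, so the first entry is the largest)
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)

# One-hot encode the more balanced column, as done for House_Style above
dummies = pd.get_dummies(toy["House_Style"], prefix="HStyle")
print(dummies.columns.tolist())
```

A share close to 1 means nearly all rows fall into one category, so the resulting dummy columns would be almost constant.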

Split the data into training (70%) and validation (30%) sets. A third (test) fraction is defined as well, but it is set to zero and ignored.

In [None]:
train_size, valid_size, test_size = (0.7, 0.3, 0.0)
ames_train, ames_valid = train_test_split(ames_data,
                                      test_size=valid_size,
                                      random_state=2020)

Extract data for training and validation into x and y vectors.

In [None]:
ames_y_train = ames_train[[label_col]]
ames_x_train = ames_train.drop(label_col, axis=1)
ames_y_valid = ames_valid[[label_col]]
ames_x_valid = ames_valid.drop(label_col, axis=1)

print('Size of training set: ', len(ames_x_train))
print('Size of validation set: ', len(ames_x_valid))

Before the data can be fed to a deep learning model, missing values need to be dealt with and the data needs to be scaled to the `[0, 1]` range.

Create an imputation model using the training set and use it to impute both the training and validation data.

In [None]:
print('Missing training values before imputation = ', ames_x_train.isnull().sum().sum())
print('Missing validation values before imputation = ', ames_x_valid.isnull().sum().sum())

imputer = SimpleImputer(missing_values=np.nan, strategy='mean').fit(ames_x_train)
ames_x_train = pd.DataFrame(imputer.transform(ames_x_train),
                            columns = ames_x_train.columns, index = ames_x_train.index)
ames_x_valid = pd.DataFrame(imputer.transform(ames_x_valid),
                            columns = ames_x_valid.columns, index = ames_x_valid.index)

print('Missing training values after imputation = ', ames_x_train.isnull().sum().sum())
print('Missing validation values after imputation = ', ames_x_valid.isnull().sum().sum())
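To make the idea of fitting on the training set only more concrete, here is a minimal NumPy sketch of mean imputation. It is a hand-written stand-in for illustration, not the `SimpleImputer` implementation, and the arrays are made up:

```python
import numpy as np

# Hypothetical toy data with gaps in both sets
x_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
x_valid = np.array([[np.nan, 40.0], [5.0, np.nan]])

# Column means learned from the training data only (NaNs are ignored)
train_means = np.nanmean(x_train, axis=0)

def impute(x, means):
    # Replace each NaN with the training mean of its column
    return np.where(np.isnan(x), means, x)

x_train_imp = impute(x_train, train_means)
x_valid_imp = impute(x_valid, train_means)

print(train_means)
print(x_valid_imp)
```

The validation gaps are filled with the training means, so no information from the validation set leaks into the preprocessing step.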

Next, create a scaling model using the training set and use it to scale both the training and validation data.

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1), copy=True).fit(ames_x_train)
ames_x_train = pd.DataFrame(scaler.transform(ames_x_train),
                            columns = ames_x_train.columns, index = ames_x_train.index)
ames_x_valid = pd.DataFrame(scaler.transform(ames_x_valid),
                            columns = ames_x_valid.columns, index = ames_x_valid.index)

print('X train min =', round(ames_x_train.min().min(),4), '; max =', round(ames_x_train.max().max(), 4))
print('X valid min =', round(ames_x_valid.min().min(),4), '; max =', round(ames_x_valid.max().max(), 4))
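Because the scaler is fitted on the training set only, the scaled validation values are not guaranteed to stay inside `[0, 1]`. The sketch below applies the min-max formula by hand to made-up numbers (a simplified single-feature illustration, not the `MinMaxScaler` code itself):

```python
import numpy as np

# Hypothetical single-feature data
x_train = np.array([[1.0], [3.0], [5.0]])
x_valid = np.array([[0.0], [4.0], [7.0]])

# Minimum and maximum are learned from the training set only
train_min = x_train.min(axis=0)
train_max = x_train.max(axis=0)

def min_max_scale(x):
    # Map the training range [train_min, train_max] onto [0, 1]
    return (x - train_min) / (train_max - train_min)

x_train_s = min_max_scale(x_train)
x_valid_s = min_max_scale(x_valid)

print('train:', x_train_s.ravel())
print('valid:', x_valid_s.ravel())
```

Validation values below the training minimum or above the training maximum map outside `[0, 1]`, which is why the printed validation min and max above may differ slightly from 0 and 1.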

In [None]:
ames_x_valid.head(10)

<a id = "cell_deep"></a>
## 3. Deep Learning with Sequential Model

Load required libraries for Deep Learning with Sequential model.

In [None]:
import tensorflow as tf
from tensorflow.keras import metrics
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Nadam, RMSprop

Convert pandas data frames to `np` arrays.

In [None]:
arr_x_train = np.array(ames_x_train)
arr_y_train = np.array(ames_y_train)
arr_x_valid = np.array(ames_x_valid)
arr_y_valid = np.array(ames_y_valid)

print('Training shape:', arr_x_train.shape)
print('Training samples: ', arr_x_train.shape[0])
print('Validation samples: ', arr_x_valid.shape[0])

Create several **Keras models** for experimentation.

The first is very simple, consisting of two `Dense` layers and the `RMSprop` optimizer.

In [None]:
def basic_model_1(x_size, y_size):
    t_model = Sequential()
    t_model.add(Dense(100, activation="relu", input_shape=(x_size,)))
    t_model.add(Dense(y_size))
    t_model.compile(
        loss='mean_squared_error',
        optimizer=RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-07, weight_decay=0.0),
        metrics=[metrics.mae])
    return(t_model)

The second, also with the `RMSprop` optimizer, consists of four `Dense` layers, with 20% dropout applied after the first.

In [None]:
def basic_model_2(x_size, y_size):
    t_model = Sequential()
    t_model.add(Dense(100, activation="tanh", input_shape=(x_size,)))
    t_model.add(Dropout(0.2))
    t_model.add(Dense(180, activation="relu"))
    t_model.add(Dense(20, activation="relu"))
    t_model.add(Dense(y_size))
    t_model.compile(
        loss='mean_squared_error',
        optimizer=RMSprop(learning_rate=0.005, rho=0.9, momentum=0.0, epsilon=1e-07, weight_decay=0.0,),
        metrics=[metrics.mae])
    return(t_model)

Now we create the executable model using one of the above functions. Run the code below through to the end to obtain the result, then change `basic_model_1` to `basic_model_2` and run the code again. Compare the results generated by the two models.

In [None]:
model = basic_model_1(arr_x_train.shape[1], arr_y_train.shape[1])
model.summary()
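The parameter counts reported by `model.summary()` can be checked by hand: a `Dense` layer with `n_inputs` inputs and `n_units` units has `(n_inputs + 1) * n_units` trainable parameters (weights plus one bias per unit), and `Dropout` adds none. The sketch below uses a hypothetical input size of 70 features; the real value is `arr_x_train.shape[1]`.

```python
def dense_params(n_inputs, n_units):
    # One weight per input per unit, plus one bias per unit
    return (n_inputs + 1) * n_units

def mlp_params(x_size, hidden_units, y_size):
    # Chain the layer sizes: input -> hidden layers -> output
    sizes = [x_size] + list(hidden_units) + [y_size]
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

x_size = 70  # hypothetical number of input features, for illustration only

params_model_1 = mlp_params(x_size, [100], 1)           # layout of basic_model_1
params_model_2 = mlp_params(x_size, [100, 180, 20], 1)  # layout of basic_model_2

print(params_model_1)  # 7201
print(params_model_2)  # 28921
```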

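Specify Keras callbacks, which allow additional functionality while the model is being fitted. ***EarlyStopping*** watches one of the model measurements and stops fitting when it no longer improves.

Fit the model and record the history of training and validation.
As we specify `EarlyStopping` with `patience=20`, with luck the training will stop well before the 500-epoch limit.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the monitored measurement (here the validation loss) has not
# improved for 20 consecutive epochs
keras_callbacks = [EarlyStopping(monitor='val_loss', patience=20, verbose=0)]

history = model.fit(arr_x_train, arr_y_train,
    batch_size=64,
    epochs=500,
    shuffle=True,
    verbose=2,
    validation_data=(arr_x_valid, arr_y_valid),
    callbacks=keras_callbacks)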
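The role of `patience` in early stopping can be illustrated without Keras. The function below is a simplified sketch of the idea, not the Keras implementation: it scans a made-up list of validation losses and reports the epoch at which training would stop (the function name and the numbers are hypothetical).

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: best so far at epoch 2, then a plateau
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.9]

stop_short = early_stop_epoch(losses, patience=3)
stop_long = early_stop_epoch(losses, patience=5)
print(stop_short)  # 5
print(stop_long)   # None
```

With a short patience the run stops during the plateau and never sees the later improvement at epoch 6; a longer patience trades extra training time for the chance to get past such plateaus.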

Evaluate and report the performance of the trained model.

In [None]:
train_score = model.evaluate(arr_x_train, arr_y_train, verbose=0)
valid_score = model.evaluate(arr_x_valid, arr_y_valid, verbose=0)

print('Train MAE: ', round(train_score[1], 2), ', Train Loss: ', round(train_score[0], 2))
print('Val MAE: ', round(valid_score[1], 2), ', Val Loss: ', round(valid_score[0], 2))
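The loss (mean squared error) is expressed in squared price units, which makes it hard to read next to the MAE. Taking its square root gives the RMSE, which is in the same units as the MAE. A small sketch with made-up prices and predictions:

```python
import numpy as np

# Hypothetical true prices and predictions
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 260000.0, 250000.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error (the training loss)
rmse = np.sqrt(mse)             # back in price units

print('MAE: ', mae)
print('MSE: ', mse)
print('RMSE:', round(rmse, 2))
```

The RMSE is never smaller than the MAE, and the gap widens when a few predictions are far off, because squaring penalises large errors more heavily.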

Now plot the true vs. predicted values.

In [None]:
y_valid_predict = model.predict(arr_x_valid)
# Plot true values (x-axis) against predicted values (y-axis)
plt.scatter(arr_y_valid, y_valid_predict)
plt.xlabel('arr_y_valid')
plt.ylabel('y_valid_predict')
plt.show()

# ravel() flattens both arrays without hard-coding the number of validation rows
corr_result = np.corrcoef(arr_y_valid.ravel(), y_valid_predict.ravel())
print('The Correlation between true and predicted values is: ',round(corr_result[0,1],3))



Now plot the training history, i.e. the *Mean Absolute Error* and *Loss (Mean Squared Error)*, which were both defined at the time of model compilation.

Note that the plot may show the validation error as lower than the training error, which can be deceptive. The training error reported for an epoch is averaged over all batches in that epoch (and at the beginning of the epoch the model was worse than at the end), whereas the validation error is calculated once at the end of the epoch, after the model has improved. Dropout, where used, is also active only during training. See the evaluation statistics above, which score both sets with the final model, to compare the two errors on an equal footing.
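The averaging effect can be seen with a few made-up numbers. In this sketch the per-batch losses within one epoch are invented for illustration, and the last batch is used as a rough proxy for how the model performs once the epoch has finished:

```python
import numpy as np

# Hypothetical per-batch training losses within a single epoch;
# the model improves as the epoch progresses
batch_losses = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

# Training figure reported for the epoch: the mean over all batches
epoch_train_loss = batch_losses.mean()

# Rough proxy for the model's loss at the end of the epoch
end_of_epoch_loss = batch_losses[-1]

print('Reported for the epoch:', epoch_train_loss)   # 6.0
print('At the end of the epoch:', end_of_epoch_loss)  # 2.0
```

The figure reported for the epoch includes the poorer early batches, so it overstates the error of the model that is actually evaluated on the validation set.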

In [None]:
def plot_hist(h, xsize=6, ysize=5):
    # Get training and validation keys; this assumes the training metrics
    # come first, followed by their 'val_' counterparts in the same order
    ks = list(h.keys())
    n2 = math.floor(len(ks)/2)
    train_keys = ks[0:n2]
    valid_keys = ks[n2:2*n2]

    # Summarize history for each metric in its own figure
    for i in range(n2):
        # Size this figure only, rather than changing the global rcParams
        plt.figure(figsize=(xsize, ysize))
        plt.plot(h[train_keys[i]])
        plt.plot(h[valid_keys[i]])
        plt.title('Training vs Validation '+train_keys[i])
        plt.ylabel(train_keys[i])
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Validation'], loc='upper left')
        plt.show()

In [None]:
hist = pd.DataFrame(history.history)

# Plot history
plot_hist(hist, xsize=6, ysize=4)
