# Deep Learning - Artificial Neural Network Model
Overview of Implementation
1. <a href="#section1">Import Dataset</a>
2. <a href="#section2">Artificial Neural Network Model</a>

## <a id='section1'>1. Import Dataset</a>

In [92]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import seaborn as sb

In [93]:
train = pd.read_csv('train.csv')
train

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [94]:
# columns that contain missing values, with their null counts
nullData = [['LotFrontage', 259], ['MasVnrArea', 8], ['Electrical', 1], ['GarageYrBlt', 81]]
n = len(train)
threshold = 0.1  # arbitrary cut-off: drop a feature if more than 10% of its values are null
drop = []  # columns with few nulls; rows missing these values are removed instead

print('Drop feature - too many nulls:')
for name, count in nullData:
    if count/n > threshold:
        print(name)
        train.drop(columns=[name], inplace=True)
    else:
        drop.append(name)

print('Remove data point:')
print(drop)
train.dropna(subset=drop, inplace=True)

train

Drop feature - too many nulls:
LotFrontage
Remove data point:
['MasVnrArea', 'Electrical', 'GarageYrBlt']


Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125
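The null counts in the cell above are typed in by hand. As a minimal sketch, the same decision could be derived from the data itself. The helper name `split_by_null_fraction` and the toy frame are hypothetical and only illustrate the 10% rule; note that a blanket count would also flag categorical columns with blank entries (Alley shows blanks in the preview above), which the hand-picked list leaves alone.

```python
import numpy as np
import pandas as pd

def split_by_null_fraction(df, threshold=0.1):
    """Return (columns to drop, columns whose null rows should be dropped)."""
    null_fraction = df.isnull().mean()             # share of nulls per column
    with_nulls = null_fraction[null_fraction > 0]  # ignore complete columns
    drop_cols = with_nulls[with_nulls > threshold].index.tolist()
    drop_rows_on = with_nulls[with_nulls <= threshold].index.tolist()
    return drop_cols, drop_rows_on

# Toy data (not the real dataset): 25% nulls in one column, 5% in another
toy = pd.DataFrame({
    "LotFrontage": [np.nan] * 5 + [70.0] * 15,
    "MasVnrArea": [0.0] * 19 + [np.nan],
    "LotArea": list(range(20)),
})

drop_cols, drop_rows_on = split_by_null_fraction(toy, threshold=0.1)
cleaned = toy.drop(columns=drop_cols).dropna(subset=drop_rows_on)
print(drop_cols, drop_rows_on, cleaned.shape)
```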


In [95]:
#One-Hot encoding
categoricalcolumns = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond','Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
train1 = pd.get_dummies(train, columns= categoricalcolumns, prefix= categoricalcolumns)
print(train1)

        Id  LotArea  YearBuilt  YearRemodAdd  MasVnrArea  BsmtFinSF1  \
0        1     8450       2003          2003       196.0         706   
1        2     9600       1976          1976         0.0         978   
2        3    11250       2001          2002       162.0         486   
3        4     9550       1915          1970         0.0         216   
4        5    14260       2000          2000       350.0         655   
...    ...      ...        ...           ...         ...         ...   
1455  1456     7917       1999          2000         0.0           0   
1456  1457    13175       1978          1988       119.0         790   
1457  1458     9042       1941          2006         0.0         275   
1458  1459     9717       1950          1996         0.0          49   
1459  1460     9937       1965          1965         0.0         830   

      BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  1stFlrSF  ...  SaleType_ConLw  \
0              0        150          856       856  ...     
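To make the effect of the one-hot encoding step concrete, here is a small sketch on a toy frame (the values are illustrative, not taken from the full dataset). Each listed categorical column is replaced by one indicator column per category, named with the given prefix, while the remaining columns are left unchanged.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# Same call pattern as in the notebook: encode only the listed columns
encoded = pd.get_dummies(toy, columns=["Street"], prefix=["Street"])
print(encoded.columns.tolist())
```

This is also why the models below use a large input dimension: every category of every encoded column becomes its own input feature.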

## <a id='section2'>2. Artificial Neural Network Model (ANN)</a>
An Artificial Neural Network is a model loosely inspired by the brain and the way that humans learn. It consists of one input layer, one or more hidden layers, and one output layer, with each hidden layer learning its own combinations of the features in the dataset. During the training stage, the algorithm learns to detect features that are relevant to predicting the output (i.e. SalePrice). Training relies on backpropagation: the prediction error is propagated backwards through the layers to work out how each weight should be adjusted, so that the error shrinks over successive updates.
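The forward pass and backpropagation described above can be sketched in plain NumPy. This is an illustrative toy, not the notebook's model: the data is synthetic, the layer sizes and learning rate are assumptions, and the weights are updated with plain gradient descent rather than the Adam optimizer used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 64 samples, 3 features, linear target
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# One hidden layer with 8 ReLU units and a single linear output
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

learning_rate = 0.01
losses = []
for step in range(300):
    # Forward pass: input -> hidden (ReLU) -> output
    z = X @ W1 + b1
    h = np.maximum(z, 0.0)
    pred = (h @ W2 + b2).ravel()
    err = pred - y
    losses.append(np.mean(err ** 2))  # mean squared error

    # Backward pass: chain rule from the loss back to each weight
    grad_pred = (2.0 / len(y)) * err[:, None]
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_z = grad_h * (z > 0)  # ReLU passes gradient only where it was active
    grad_W1 = X.T @ grad_z
    grad_b1 = grad_z.sum(axis=0)

    # Gradient descent update
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print("first loss:", losses[0], "last loss:", losses[-1])
```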

We will first fit a baseline ANN model with a single hidden layer.

In [98]:
#train-test split
TEST_SIZE = 0.25

filteredData1 = train1.drop(['Id'], axis=1)
# shuffle=False keeps the split deterministic, so every model is compared on the same training and test sets
train_df, test_df = train_test_split(filteredData1, test_size=TEST_SIZE, shuffle=False)

train_X = train_df.drop('SalePrice', axis=1)
train_Y = train_df['SalePrice']
test_X = test_df.drop('SalePrice', axis=1)
test_Y = test_df['SalePrice']
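As a quick illustration of what `shuffle=False` does, the sketch below splits a toy frame (hypothetical values) with the same `test_size`. The rows keep their original order, so the first 75% form the training set and the last 25% form the test set on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 8 rows in a known order
toy = pd.DataFrame({"x": range(8), "SalePrice": range(100, 108)})

train_part, test_part = train_test_split(toy, test_size=0.25, shuffle=False)
print(train_part.index.tolist(), test_part.index.tolist())
```

One consequence worth keeping in mind is that an unshuffled split inherits any ordering present in the file.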

In [99]:
# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(314, input_dim=314, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [100]:
# evaluate model based on kfolds in the training set
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)
kfold = KFold(n_splits=10)
results = cross_val_score(estimator, train_X, train_Y, cv=kfold)
print("Baseline: %.2f (%.2f) MSE" % (-results.mean(), results.std()))

Baseline: 1620906880.00 (696756545.60) MSE
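An MSE of this size is hard to read because it is in squared units of SalePrice. A small sketch of the conversion to a root mean squared error, using the mean value printed above, puts it back on the scale of the target. Taking the square root of the mean MSE is only an approximation of the typical error; it is not the same as averaging the RMSE of each fold.

```python
import math

cv_mse = 1620906880.00  # mean MSE across the 10 folds, as printed above

rmse = math.sqrt(cv_mse)  # same units as SalePrice
print(f"Approximate RMSE: {rmse:,.0f}")
```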


In [101]:
#fit model onto the train set and evaluate on the test set
estimator.fit(train_X, train_Y)
prediction = estimator.predict(test_X)

In [107]:
def score(y_pred,y_true): # R2 score (coefficient of determination)
    u = ((y_pred-y_true)**2).sum() # residual sum of squares
    v = ((y_true-y_true.mean())**2).sum() # total sum of squares around the mean
    return (1-u/v)
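Two boundary cases help to interpret this score. The sketch below re-implements the same formula under the hypothetical name `r2_manual` and applies it to toy values: a perfect prediction scores 1, and always predicting the mean of the true values scores 0. scikit-learn provides the same metric as `sklearn.metrics.r2_score(y_true, y_pred)`.

```python
import numpy as np

def r2_manual(y_pred, y_true):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    u = ((y_pred - y_true) ** 2).sum()
    v = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - u / v

y_true = np.array([100.0, 200.0, 300.0])

perfect = r2_manual(y_true.copy(), y_true)               # predictions match exactly
mean_only = r2_manual(np.full(3, y_true.mean()), y_true)  # always predict the mean
print(perfect, mean_only)
```

A model that does worse than predicting the mean gets a negative score, so the value is not bounded below by 0.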

In [103]:
#calculate the R2 score of accuracy for the model
#store the result under a new name so that the score() function is not overwritten
score0 = score(prediction,test_Y)
print ("The R2 accuracy score is ", score0)

The R2 accuracy score is  0.5030117174108558


### 2.1 Evaluate a Deeper Network Topology
One way to improve the performance of a neural network is to add more layers. This might allow the model to extract and recombine higher-order features embedded in the data.

We will evaluate the effect of adding one more hidden layer to the model.

In [112]:
def larger_model():
    # create model
    model = Sequential()
    model.add(Dense(314, input_dim=314, kernel_initializer='normal', activation='relu'))
    model.add(Dense(200, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [113]:
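# evaluate the deeper model based on kfolds in the training set
# build_fn must point at larger_model; the cell as originally written passed baseline_model,
# so the output recorded below may reflect the baseline architecture
estimator = KerasRegressor(build_fn=larger_model, epochs=100, batch_size=5, verbose=0)
kfold = KFold(n_splits=10)
results = cross_val_score(estimator, train_X, train_Y, cv=kfold)
print("Larger: %.2f (%.2f) MSE" % (-results.mean(), results.std()))
# fit on the full training set and predict on the test set
estimator.fit(train_X, train_Y)
prediction = estimator.predict(test_X)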

Baseline: 1619987942.40 (649474873.38) MSE


In [121]:
#calculate the R2 score of accuracy for the model
score1 = score(prediction,test_Y)
print ("The R2 accuracy score is ", score1)

The R2 accuracy score is  0.517040252243145


In [120]:
#relative improvement in R2 over the baseline model (0.5030117174108558)
print ("R2 Accuracy improvement =", (score1-0.5030117174108558)/0.5030117174108558)

R2 Accuracy improvement = 0.027889081599327382


### 2.2. Evaluate a Wider Network Topology
Another approach to increasing the representational capacity of the model is to create a wider network.
In this section we keep the shallow architecture of the baseline model and nearly double the number of neurons in its single hidden layer, from 314 to 600.

In [109]:
# define wider model
def wider_model():
    #create model
    model = Sequential()
    model.add(Dense(600, input_dim=314, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
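One way to put the three architectures defined in this notebook on a common scale is to count their trainable parameters. The sketch below does this by hand, assuming each `Dense` layer has a weight matrix plus one bias per unit (the Keras default); the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by the units of each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

architectures = {
    "baseline_model": [314, 314, 1],
    "larger_model": [314, 314, 200, 1],
    "wider_model": [314, 600, 1],
}

for name, sizes in architectures.items():
    print(name, dense_param_count(sizes))
```

By this count the wider model has the most parameters of the three, even though it has only one hidden layer.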

In [110]:
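# evaluate the wider model based on kfolds in the training set
# build_fn must point at wider_model; the cell as originally written passed baseline_model,
# so the output recorded below may reflect the baseline architecture
estimator = KerasRegressor(build_fn=wider_model, epochs=100, batch_size=5, verbose=0)
kfold = KFold(n_splits=10)
results = cross_val_score(estimator, train_X, train_Y, cv=kfold)
print("Wider: %.2f (%.2f) MSE" % (-results.mean(), results.std()))
# fit on the full training set and predict on the test set
estimator.fit(train_X, train_Y)
prediction = estimator.predict(test_X)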

Baseline: 1640994131.20 (714831538.02) MSE


In [111]:
#calculate the R2 score of accuracy for the model
score2 = score(prediction,test_Y)
print ("The R2 accuracy score is ", score2)

The R2 accuracy score is  0.5116736941098374


In [122]:
#relative improvement in R2 over the baseline model (0.5030117174108558)
print ("R2 Accuracy improvement =", (score2-0.5030117174108558)/0.5030117174108558)

R2 Accuracy improvement = 0.01722022847413429
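The figures reported in this notebook can be collected in one place for comparison. The sketch below only reuses the numbers printed above; the dictionary layout and the helper `relative_change` are illustrative. It also expresses each change in cross-validated MSE as a fraction of the baseline's standard deviation across folds, which gives a rough sense of how large the differences are relative to fold-to-fold variation.

```python
# Values copied from the printed outputs in this notebook
reported = {
    "baseline": {"cv_mse": 1620906880.00, "cv_std": 696756545.60, "test_r2": 0.5030117174108558},
    "deeper": {"cv_mse": 1619987942.40, "cv_std": 649474873.38, "test_r2": 0.517040252243145},
    "wider": {"cv_mse": 1640994131.20, "cv_std": 714831538.02, "test_r2": 0.5116736941098374},
}

def relative_change(new, base):
    """Change from base to new, as a fraction of base."""
    return (new - base) / base

base = reported["baseline"]
summary = {}
for name in ("deeper", "wider"):
    row = reported[name]
    summary[name] = {
        "r2_gain": relative_change(row["test_r2"], base["test_r2"]),
        "mse_change": relative_change(row["cv_mse"], base["cv_mse"]),
        # MSE difference measured in units of the baseline's fold-to-fold spread
        "mse_diff_in_std": (row["cv_mse"] - base["cv_mse"]) / base["cv_std"],
    }
    print(name, summary[name])
```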
