## Case Study #4: Predicting Boston Housing Prices with a Neural Network

### 1. Upload, explore, clean, and preprocess data for neural network modeling. (This part will not be graded because all questions below were already completed in case study #1.)
a. Create a boston_df data frame by uploading the original data set into Python. Determine and present in this report the data frame dimensions, i.e., number of rows and columns.

b. Display in Python the column titles. If some of them contain two (or more) words, convert them into one-word titles, and present the modified titles in your report.

c. Display in Python column data types. If some of them are listed as "object", convert them into dummy variables, and provide in your report the modified list of column titles with dummy variables.


### Import required packages

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.neural_network import MLPClassifier, MLPRegressor 
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

from mord import LogisticIT

from dmba import classificationSummary, regressionSummary


#### a. Create a boston_df data frame by uploading the original data set into Python. Determine and present in this report the data frame dimensions, i.e., number of rows and columns.

In [2]:
# Create boston_df data frame from BostonHousing data set.
boston_df=pd.read_csv('BostonHousing.csv')
# check the data frame dimensions.
boston_df.shape
# we have 506 rows and 14 columns.

(506, 14)

#### b. Display in Python the column titles. If some of them contain two (or more) words, convert them into one-word titles, and present the modified titles in your report. 

In [3]:
# Display the column titles.
# Check whether any column title contains two or more words.
boston_df.columns

Index(['CRIME', 'ZONE', 'INDUST', 'CHAR RIV', 'NIT OXIDE', 'ROOMS', 'AGE',
       'DISTANCE', 'RADIAL', 'TAX', 'ST RATIO', 'LOW STAT', 'MVALUE',
       'C MVALUE'],
      dtype='object')

In [4]:
# Some of the column titles contain two words.
# Convert every multi-word column title into a one-word title.
boston_df.columns = [s.strip().replace(' ', '_') for s in boston_df.columns]
# Check the new one-word column titles.
boston_df.columns

Index(['CRIME', 'ZONE', 'INDUST', 'CHAR_RIV', 'NIT_OXIDE', 'ROOMS', 'AGE',
       'DISTANCE', 'RADIAL', 'TAX', 'ST_RATIO', 'LOW_STAT', 'MVALUE',
       'C_MVALUE'],
      dtype='object')

#### c. Display in Python column data types. If some of them are listed as "object", convert them into dummy variables, and provide in your report the modified list of column titles with dummy variables.

In [5]:
# Display the column data types.
# Check if any of the columns is listed as "object".
boston_df.dtypes
# CHAR_RIV, C_MVALUE are object type

CRIME        float64
ZONE         float64
INDUST       float64
CHAR_RIV      object
NIT_OXIDE    float64
ROOMS        float64
AGE          float64
DISTANCE     float64
RADIAL         int64
TAX            int64
ST_RATIO     float64
LOW_STAT     float64
MVALUE       float64
C_MVALUE      object
dtype: object

In [6]:
# Change CHAR_RIV to type "category"
boston_df.CHAR_RIV = boston_df.CHAR_RIV.astype('category')


In [7]:
# Check CHAR_RIV type
boston_df.CHAR_RIV.dtype
# OR 
print(boston_df.CHAR_RIV.dtype)

category


In [8]:
#Check the categories in CHAR_RIV 
boston_df.CHAR_RIV.cat.categories

Index(['N', 'Y'], dtype='object')

In [9]:
# Change C_MVALUE to type category 
boston_df.C_MVALUE= boston_df.C_MVALUE.astype('category')
# Check C_MVALUE type
print(boston_df.C_MVALUE.dtype)

category


In [10]:
# Check the categories in C_MVALUE
boston_df.C_MVALUE.cat.categories

Index(['No', 'Yes'], dtype='object')

In [11]:
# Convert the category variables in boston_df into dummy variables.
# Setting drop_first=True keeps only the 'Y'/'Yes' column for each
# variable; a value of 0 in that column indicates 'N'/'No'.
boston_df = pd.get_dummies(boston_df, prefix_sep='_', drop_first=True)
boston_df.columns

Index(['CRIME', 'ZONE', 'INDUST', 'NIT_OXIDE', 'ROOMS', 'AGE', 'DISTANCE',
       'RADIAL', 'TAX', 'ST_RATIO', 'LOW_STAT', 'MVALUE', 'CHAR_RIV_Y',
       'C_MVALUE_Yes'],
      dtype='object')
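
The effect of `drop_first=True` can be illustrated without pandas. The following is a minimal pure-Python sketch, not the pandas implementation: the helper `dummy_drop_first` and the four sample values are hypothetical, and only the 'N'/'Y' labels and the `CHAR_RIV` name come from the data set.

```python
# Pure-Python sketch of dummy coding with the first category dropped.
# dummy_drop_first and the sample values are hypothetical illustrations.
def dummy_drop_first(values, name, sep='_'):
    categories = sorted(set(values))   # e.g. ['N', 'Y']
    kept = categories[1:]              # dropping the first category removes 'N'
    return {f'{name}{sep}{cat}': [int(v == cat) for v in values]
            for cat in kept}

sample = ['N', 'Y', 'N', 'N']
encoded = dummy_drop_first(sample, 'CHAR_RIV')
print(encoded)
```

A single 0/1 column is enough for a two-level variable, because a 0 in `CHAR_RIV_Y` already identifies the 'N' level.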

### 2. Develop a neural network model for Boston Housing and use it for predictions.
a. Develop in Python the outcome and predictor variables, partition the data set (60% for training and 40% for validation partitions), display in Python and present in your report the first five records of the training partition. Then, using the StandardScaler() function,
develop the scaled predictors for training and validation partitions. Display in Python and provide in your report the first five records of the scaled training partition. Present a brief explanation of what the scaled values mean and how they are calculated.

b. Train a neural network model using MLPRegressor() with the scaled training data set and the following parameters: hidden_layer_sizes=9, solver='lbfgs', max_iter=10000, and random_state=1. Identify and display in Python the final intercepts and network weights of this model. Provide these intercepts and weights in your report and briefly explain what the values of intercepts in the first and second arrays mean. Also, briefly explain what the values of weights in the first and second arrays mean.

c. Using the developed neural network model, make in Python predictions for the outcome variable (MVALUE) using the scaled validation predictors. Based on these predictions, develop and display in Python a table for the first five validation records that contain actual and predicted median prices (MVALUE), and their residuals. Present this table in your report.

d. Identify and display in Python the common accuracy measures for training and validation partitions. Provide and compare these accuracy measures in your report and assess a possibility of overfitting. Would you recommend applying this neural network model for predictions? Briefly explain.

#### a. Develop in Python the outcome and predictor variables, partition the data set (60% for training and 40% for validation partitions), display in Python and present in your report the first five records of the training partition. Then, using the StandardScaler() function, develop the scaled predictors for training and validation partitions. Display in Python and provide in your report the first five records of the scaled training partition. Present a brief explanation of what the scaled values mean and how they are calculated.

In [12]:

# Create Boston outcome and predictors to run neural network
# model.
outcome = 'MVALUE'
predictors = [c for c in boston_df.columns if c != outcome]

# Create predictors and outcome variables.
X = boston_df[predictors]
y = boston_df[outcome]

# Create data partition with training set, 60% (0.6), and
# validation set, 40% (0.4).
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
                            test_size=0.4, random_state=1)

# Display the first 5 records of boston housing training 
# partition's predictors. 
print('Predictors for Training Partition')
print(train_X.head(5))

# Scale input data (predictors) for training  and validation 
# partitions using StandardScaler().
sc_X = StandardScaler()
train_X_sc = sc_X.fit_transform(train_X)
valid_X_sc = sc_X.transform(valid_X)

# Develop a data frame to display scaled predictors for
# training partition. Round scaled values to 3 decimals.
# Reuse the predictor column titles from train_X.
train_X_sc_df = np.round(pd.DataFrame(train_X_sc, columns=train_X.columns),
                         decimals=3)

# Display the first 5 scaled predictors for training partition.
print()
print('Scaled Predictors for Training Partition')
print(train_X_sc_df.head(5))

Predictors for Training Partition
       CRIME  ZONE  INDUST  NIT_OXIDE  ROOMS   AGE  DISTANCE  RADIAL  TAX  \
452  5.09017   0.0   18.10      0.713  6.297  91.8    2.3682      24  666   
346  0.06162   0.0    4.39      0.442  5.898  52.3    8.0136       3  352   
295  0.12932   0.0   13.92      0.437  6.678  31.1    5.9604       4  289   
88   0.05660   0.0    3.41      0.489  7.007  86.3    3.4217       2  270   
322  0.35114   0.0    7.38      0.493  6.041  49.9    4.7211       5  287   

     ST_RATIO  LOW_STAT  CHAR_RIV_Y  C_MVALUE_Yes  
452      20.2     17.27           0             0  
346      18.8     12.67           0             0  
295      16.0      6.27           0             0  
88       17.8      5.50           0             0  
322      19.6      7.70           0             0  

Scaled Predictors for Training Partition
   CRIME   ZONE  INDUST  NIT_OXIDE  ROOMS    AGE  DISTANCE  RADIAL    TAX  \
0  0.146 -0.482   1.006      1.306  0.083  0.803    -0.688   1.662  1.53
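
Each scaled value is a z-score: the number of standard deviations by which the original value lies above (positive) or below (negative) the mean of that predictor in the training partition. StandardScaler() computes it as (value - mean) / standard deviation, using the population standard deviation. The sketch below reproduces that formula with the standard library only. It uses just the five TAX values printed above, so its results will not match the scaled values in the report, which are based on the mean and standard deviation of the whole training partition; the helper name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # population form (divides by n), as StandardScaler does
    return [(x - mu) / sigma for x in column]

# TAX values of the first five training records shown above.
tax = [666, 352, 289, 270, 287]
z = standardize(tax)
print([round(v, 3) for v in z])
```

After scaling, a column has mean 0 and standard deviation 1, which puts predictors measured on very different scales (for example TAX and NIT_OXIDE) on a comparable footing for the network.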

#### b. Train a neural network model using MLPRegressor() with the scaled training data set and the following parameters: hidden_layer_sizes=9, solver='lbfgs', max_iter=10000, and random_state=1. Identify and display in Python the final intercepts and network weights of this model. Provide these intercepts and weights in your report and briefly explain what the values of intercepts in the first and second arrays mean. Also, briefly explain what the values of weights in the first and second arrays mean.

In [13]:
# Use MLPRegressor() function to train neural network model.
# Apply: 
# (a) default input layer with the number of nodes equal 
#     to number of predictor variables (13); 
# (b) single hidden layer with 9 nodes; 
# (c) default output layer with one node for the numerical
#     outcome variable (MVALUE);
# (d) optimization function solver = 'lbfgs', 
#     which is applied for small data sets for better 
#     performance and fast convergence. For large data sets, 
#     apply default solver = 'adam' optimization function;
# (e) model is fit with scaled predictors and unscaled outcome
#     in training partition.
boston_reg = MLPRegressor(hidden_layer_sizes=9, 
                solver='lbfgs', max_iter=10000, random_state=1)
boston_reg.fit(train_X_sc, train_y)

# Display network structure with the final values of 
# intercepts (Theta) and weights (W).
print('Final Intercepts for Boston Housing Neural Network Model')
print(boston_reg.intercepts_)

print()
print('Network Weights for Boston Housing Neural Network Model')
print(boston_reg.coefs_)

Final Intercepts for Boston Housing Neural Network Model
[array([ 2.26914474,  4.18675342, -1.50466092, -0.95080817,  0.32205673,
        0.49141424, -0.81237211,  0.89649621,  2.52709405]), array([-11.97658924])]

Network Weights for Boston Housing Neural Network Model
[array([[-0.36367279,  0.52687745, -0.00964011, -1.5129948 ,  1.31769872,
        -0.41850542, -1.53776369, -0.62480937, -1.90841567],
       [ 0.66134206, -1.23924591,  0.28794966,  2.11953871,  1.07434276,
         0.43220725, -4.11086338,  0.40334917, -0.08228966],
       [ 0.59676784,  0.58331396,  2.48943055,  0.90207757, -1.28237762,
         0.42433014, -1.95774694,  0.95234875,  1.23319412],
       [-2.20869382, -1.37160062, -0.98568213,  0.36329227, -0.6196563 ,
        -0.30782417,  1.05946611,  0.83457418,  2.71468783],
       [ 0.16998566, -0.09067205, -1.7444194 ,  0.11677319,  1.80120993,
         0.81352311, -0.01783085, -2.03768459,  0.23723986],
       [-0.27165058, -2.52251129, -1.29506367,  0.3880619 
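
In the printed output, the first intercepts array holds one bias value per hidden node (9 values) and the second holds the single bias of the output node. The first weights array has one row per predictor and one column per hidden node (13 x 9), and the second has one row per hidden node and a single column for the output (9 x 1). With the MLPRegressor defaults, the hidden nodes apply the ReLU activation and the output node applies no activation. The sketch below shows how these four arrays combine into a prediction. It uses a hypothetical network with 2 predictors and 3 hidden nodes, and the weights are made up, not taken from the model above.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, W1, b1, W2, b2):
    # hidden[j] = relu(b1[j] + sum_i x[i] * W1[i][j])
    hidden = relu([b1[j] + sum(x[i] * W1[i][j] for i in range(len(x)))
                   for j in range(len(b1))])
    # output = b2[0] + sum_j hidden[j] * W2[j][0]  (identity output activation)
    return b2[0] + sum(hidden[j] * W2[j][0] for j in range(len(hidden)))

# Hypothetical tiny network: 2 predictors, 3 hidden nodes, 1 output node.
W1 = [[0.5, -1.0, 0.0],    # weights from predictor 1 to each hidden node
      [1.0,  0.5, 2.0]]    # weights from predictor 2 to each hidden node
b1 = [0.1, 0.2, -0.3]      # hidden-node intercepts (first intercepts array)
W2 = [[2.0], [-1.0], [0.5]]  # weights from each hidden node to the output
b2 = [1.0]                 # output-node intercept (second intercepts array)

x = [1.0, -0.5]            # one scaled record
prediction = forward(x, W1, b1, W2, b2)
print(round(prediction, 4))
```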

#### c. Using the developed neural network model, make in Python predictions for the outcome variable (MVALUE) using the scaled validation predictors. Based on these predictions, develop and display in Python a table for the first five validation records that contain actual and predicted median prices (MVALUE), and their residuals. Present this table in your report.

In [14]:
# Make MVALUE predictions for validation set using Boston Housing
# neural network model.

# Use boston_reg model to predict MVALUE outcome
# for validation set.
price_pred = np.round(boston_reg.predict(valid_X_sc), decimals=2)

# Create data frame to display prediction results for
# validation set. 
price_pred_result = pd.DataFrame({'Actual': valid_y, 
                'Prediction': price_pred, 'Residual': valid_y-price_pred})

print('Predictions for Boston Price for Validation Partition')
print(price_pred_result.head(5))

Predictions for Boston Price for Validation Partition
     Actual  Prediction  Residual
307    28.2       29.63     -1.43
343    23.9       23.49      0.41
47     16.6       17.80     -1.20
67     22.0       18.73      3.27
362    20.8       25.30     -4.50


#### d. Identify and display in Python the common accuracy measures for training and validation partitions. Provide and compare these accuracy measures in your report and assess a possibility of overfitting. Would you recommend applying this neural network model for predictions? Briefly explain.

In [15]:
# Neural network model accuracy measures for training and
# validation partitions. 

# Identify and display neural network model accuracy measures 
# for training partition.
print('Accuracy Measures for Training Partition for Neural Network')
regressionSummary(train_y, boston_reg.predict(train_X_sc))

# Identify and display neural network accuracy measures 
# for validation partition.
print()
print('Accuracy Measures for Validation Partition for Neural Network')
regressionSummary(valid_y, boston_reg.predict(valid_X_sc))

Accuracy Measures for Training Partition for Neural Network

Regression statistics

                      Mean Error (ME) : -0.0034
       Root Mean Squared Error (RMSE) : 1.5617
            Mean Absolute Error (MAE) : 1.1368
          Mean Percentage Error (MPE) : -0.8274
Mean Absolute Percentage Error (MAPE) : 6.0681

Accuracy Measures for Validation Partition for Neural Network

Regression statistics

                      Mean Error (ME) : -0.0912
       Root Mean Squared Error (RMSE) : 3.1675
            Mean Absolute Error (MAE) : 2.2668
          Mean Percentage Error (MPE) : -3.0502
Mean Absolute Percentage Error (MAPE) : 11.6748
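
The five measures follow standard definitions based on the residual (actual minus predicted). The sketch below applies them to the first five validation records from the prediction table in part c; it assumes, without having checked the dmba source, that regressionSummary() uses the same formulas. Because only five records are used, the results differ from the full-partition values above. The last line computes the ratio of the reported validation RMSE to the reported training RMSE.

```python
from math import sqrt

# First five validation records from the prediction table in part c.
actual    = [28.2, 23.9, 16.6, 22.0, 20.8]
predicted = [29.63, 23.49, 17.80, 18.73, 25.30]
residuals = [a - p for a, p in zip(actual, predicted)]
n = len(residuals)

me   = sum(residuals) / n                              # mean error
rmse = sqrt(sum(r * r for r in residuals) / n)         # root mean squared error
mae  = sum(abs(r) for r in residuals) / n              # mean absolute error
mpe  = 100 * sum(r / a for r, a in zip(residuals, actual)) / n
mape = 100 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n

# Validation-to-training RMSE ratio from the reported summaries.
rmse_ratio = 3.1675 / 1.5617
print(f'ME={me:.4f} RMSE={rmse:.4f} MAE={mae:.4f} '
      f'MPE={mpe:.4f} MAPE={mape:.4f} ratio={rmse_ratio:.2f}')
```

The validation RMSE is roughly twice the training RMSE, and MAE and MAPE show a similar gap, which points to some overfitting of the training partition.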


In [16]:
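# Attempt to apply a confusion matrix to the scaled model.
# classificationSummary() expects class labels, so it raises a
# ValueError for the continuous MVALUE predictions below; the
# regression accuracy measures above are the appropriate ones.
# Identify and display confusion matrix for training partition. 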
print('Training Partition for the scaled model')
classificationSummary(train_y, boston_reg.predict(train_X_sc))

# Identify and display confusion matrix for validation partition. 
print()
print('Validation Partition for the scaled Model')
classificationSummary(valid_y, boston_reg.predict(valid_X_sc))

Training Partition for the scaled model


ValueError: continuous is not supported
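
The error arises because a confusion matrix counts matches between discrete classes, while MVALUE and its predictions are continuous. A confusion matrix becomes possible only after both are converted to classes. The sketch below does this for the first five validation records from part c, using a cutoff of 25.0 that is purely hypothetical and not part of the case study.

```python
# Hypothetical cutoff used only for illustration; it is not taken from the case study.
THRESHOLD = 25.0

# First five validation records from the prediction table in part c.
actual    = [28.2, 23.9, 16.6, 22.0, 20.8]
predicted = [29.63, 23.49, 17.80, 18.73, 25.30]

# Convert continuous values into classes: 1 if above the cutoff, else 0.
actual_cls    = [int(v > THRESHOLD) for v in actual]
predicted_cls = [int(v > THRESHOLD) for v in predicted]

# confusion[i][j] counts records with actual class i and predicted class j.
confusion = [[0, 0], [0, 0]]
for a, p in zip(actual_cls, predicted_cls):
    confusion[a][p] += 1

accuracy = (confusion[0][0] + confusion[1][1]) / len(actual)
print(confusion, accuracy)
```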

### 3. Develop an improved neural network model with grid search.
a. Use in Python GridSearchCV() function to identify the best number of nodes for the hidden layer in the Boston Housing neural network model. For that, consider the hidden_layer_sizes parameter in a range from 2 to 20. Provide in your report the best score and best parameter value.

b. Train an improved neural network model using MLPRegressor() with the scaled training data set and the best identified value of the parameter from the previous question. The rest of the parameters remain the same as in model developed in 2b. Present in your report the final intercepts and network weights of the improved neural network model.

c. Identify and display in Python the common accuracy measures for the training and validation partitions with the improved neural network model. Provide and compare these accuracy measures in your report and assess a possibility of overfitting. Would you recommend applying this neural network model for predictions? Briefly explain.

d. Present and compare the accuracy measures for the validation partition from the Exhaustive Search model for multiple linear regression in case study #1 and the validation partition for the improved neural network model in this case. Which of the models would you recommend for predictions? Briefly explain.

#### a. Use in Python GridSearchCV() function to identify the best number of nodes for the hidden layer in the Boston Housing neural network model. For that, consider the hidden_layer_sizes parameter in a range from 2 to 20. Provide in your report the best score and best parameter value.

In [17]:
# Identify grid search parameters.
# Note: range(2, 20) covers 2 through 19; range(2, 21) would
# be needed to include 20.
param_grid = {
    'hidden_layer_sizes': list(range(2, 20)), 
}

# Utilize GridSearchCV() to identify the best number 
# of nodes in the hidden layer. 
gridSearch = GridSearchCV(MLPRegressor(solver='lbfgs', max_iter=10000, random_state=1), 
                          param_grid, cv=5, n_jobs=-1, return_train_score=True)
# Note: the search is fit on the unscaled predictors (train_X),
# not on train_X_sc.
gridSearch.fit(train_X, train_y)

# Display the best score and best parameter value.
print(f'Best score:{gridSearch.best_score_:.4f}')
print('Best parameter: ', gridSearch.best_params_)

Best score:0.8084
Best parameter:  {'hidden_layer_sizes': 8}
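
Two details help when reading this result. First, Python's `range(2, 20)` stops at 19, so a search over 2 to 20 inclusive needs `range(2, 21)`. Second, when no `scoring` argument is given, the best score of a grid search over a regressor is the mean cross-validated R-squared, so higher is better and 1.0 is a perfect fit. The sketch below illustrates both points with the standard library; the helper `r_squared` is hypothetical, and the five records come from the prediction table in part 2c, so the resulting value is not comparable to the best score above.

```python
grid_as_written = list(range(2, 20))   # 2, 3, ..., 19
grid_inclusive  = list(range(2, 21))   # 2, 3, ..., 20

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# First five validation records from the prediction table in part 2c.
actual    = [28.2, 23.9, 16.6, 22.0, 20.8]
predicted = [29.63, 23.49, 17.80, 18.73, 25.30]
score = r_squared(actual, predicted)
print(grid_as_written[-1], grid_inclusive[-1], round(score, 4))
```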


#### b. Train an improved neural network model using MLPRegressor() with the scaled training data set and the best identified value of the parameter from the previous question. The rest of the parameters remain the same as in model developed in 2b. Present in your report the final intercepts and network weights of the improved neural network model.

In [18]:
# Use MLPRegressor() function to train the improved neural network model.

# Apply: 
# (a) default input layer with the number of nodes equal 
#     to number of predictor variables (13); 
# (b) single hidden layer with 10 nodes (the grid search
#     above reported 8 as the best value); 
# (c) default output layer with a single node for the
#     numerical outcome variable (MVALUE);
# (d) 'logistic' activation function (the model in 2b used
#     the default 'relu');
# (e) solver = 'lbfgs', which is applied for small data 
#     sets for better performance and fast convergence. 
#     For large data sets, apply default solver = 'adam';
# (f) model is fit with the unscaled predictors (train_X).
boston_clf_imp = MLPRegressor(hidden_layer_sizes=(10), max_iter=10000,
                activation='logistic', solver='lbfgs', random_state=1)
boston_clf_imp.fit(train_X, train_y)

# Display network structure with the final values of 
# intercepts (Theta) and weights (W).
print('Final Intercepts for Boston Housing Neural Network Model')
print(boston_clf_imp.intercepts_)

print()
print('Network Weights for Boston Housing Neural Network Model')
print(boston_clf_imp.coefs_)

Final Intercepts for Boston Housing Neural Network Model
[array([ 0.05058636,  0.27695246,  0.03615363, -0.28178208,  0.17730357,
       -0.15748249,  0.18389121, -0.06613618,  0.21517461,  0.14347833]), array([3.08969556])]

Network Weights for Boston Housing Neural Network Model
[array([[-0.04893654,  0.12993563, -0.29480303, -0.11653212, -0.2083245 ,
        -0.24041687, -0.18474998, -0.09108004, -0.06086922,  0.02285407],
       [-0.04765472,  0.10923258, -0.15999075,  0.41501449, -0.27872157,
         0.10053264, -0.02891929,  0.03461213, -0.14298707, -0.37440681],
       [ 0.17719296,  0.27615569, -0.10972874,  0.11714386,  0.2219766 ,
         0.2327196 , -0.23547895, -0.27184089, -0.19372822,  0.21809079],
       [-0.23687031, -0.04652651,  0.27010435,  0.02041521,  0.11315894,
        -0.1087987 ,  0.11130717,  0.19734436, -0.2837778 ,  0.14659305],
       [ 0.28836617,  0.14635486, -0.12838849,  0.18639026, -0.2339956 ,
        -0.0307249 ,  0.26018929, -0.12171536, -0.119234

#### c. Identify and display in Python the common accuracy measures for the training and validation partitions with the improved neural network model. Provide and compare these accuracy measures in your report and assess a possibility of overfitting. Would you recommend applying this neural network model for predictions? Briefly explain.

In [19]:
# Neural network model accuracy measures for training and
# validation partitions. 

# Identify and display neural network model accuracy measures 
# for training partition.
print('Accuracy Measures for Training Partition for Neural Network')
regressionSummary(train_y, boston_clf_imp.predict(train_X))

# Identify and display neural network accuracy measures 
# for validation partition.
print()
print('Accuracy Measures for Validation Partition for Neural Network')
regressionSummary(valid_y, boston_clf_imp.predict(valid_X))

Accuracy Measures for Training Partition for Neural Network

Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 8.9460
            Mean Absolute Error (MAE) : 6.5493
          Mean Percentage Error (MPE) : -18.8601
Mean Absolute Percentage Error (MAPE) : 37.1695

Accuracy Measures for Validation Partition for Neural Network

Regression statistics

                      Mean Error (ME) : 1.0476
       Root Mean Squared Error (RMSE) : 9.5609
            Mean Absolute Error (MAE) : 6.6363
          Mean Percentage Error (MPE) : -12.0971
Mean Absolute Percentage Error (MAPE) : 32.6626
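
Both partitions show much larger errors than the model in part 2 (training RMSE 8.9460 versus 1.5617), so this model is not overfitting; it fits poorly everywhere. One plausible contributor, offered here as an assumption rather than a verified diagnosis, is that the model was fit on unscaled predictors with a logistic activation. Large raw inputs push the logistic function into its flat region, where the node output is almost constant. The sketch below illustrates that effect with a hypothetical weight of 0.2, the raw TAX value (666) of the first training record, and its scaled value (1.53).

```python
from math import exp

def logistic(z):
    return 1.0 / (1.0 + exp(-z))

weight = 0.2         # hypothetical weight, similar in size to those printed above
tax_raw = 666        # raw TAX of the first training record
tax_scaled = 1.53    # scaled TAX of the same record

out_raw = logistic(weight * tax_raw)         # pre-activation 133.2
out_scaled = logistic(weight * tax_scaled)   # pre-activation 0.306

# Slope of the logistic function at each point; a slope near zero means
# the node barely responds to changes in its input.
slope_raw = out_raw * (1 - out_raw)
slope_scaled = out_scaled * (1 - out_scaled)
print(out_raw, round(out_scaled, 4), slope_raw, round(slope_scaled, 4))
```

Refitting with the scaled predictors and the parameters from part 2b would be the natural check of this explanation.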


#### d. Present and compare the accuracy measures for the validation partition from the Exhaustive Search model for multiple linear regression in case study #1 and the validation partition for the improved neural network model in this case. Which of the models would you recommend for predictions? Briefly explain.

In [20]:
# Develop the multiple linear regression model based on the Exhaustive Search results.

# Identify predictors and outcome of the regression model.
predictors_ex = ['CHAR_RIV_Y', 'CRIME', 'C_MVALUE_Yes', 
                 'DISTANCE', 'INDUST','LOW_STAT', 'NIT_OXIDE', 
                 'RADIAL', 'ROOMS', 'ST_RATIO', 'TAX']
outcome = 'MVALUE'

# Identify X and y variables for regression and partition data
# using 60% of records for training and 40% for validation 
# (test_size=0.4). 
X = boston_df[predictors_ex]
y = boston_df[outcome]
train_X_ex, valid_X_ex, train_y_ex, valid_y_ex = \
          train_test_split(X, y, test_size=0.4, random_state=1)

# Create multiple linear regression model using X and y.
boston_ex = LinearRegression()
boston_ex.fit(train_X_ex, train_y_ex)
# Use predict() function to score (make) predictions 
# for validation set with the regression model built on
# the Exhaustive Search predictors.
boston_ex_pred = boston_ex.predict(valid_X_ex)

# Develop and display data frame with actual values of MVALUE,
# scoring (predicted) results, and residuals.
# Use round() function to round values in data frame to 
# 2 decimals. 
result = round(pd.DataFrame({'Actual': valid_y_ex,'Predicted': boston_ex_pred, 
                       'Residual': valid_y_ex - boston_ex_pred}), 2)
print()
print('Prediction for Validation Set Using Exhaustive Search') 
print(result.head(10))

# Display common accuracy measures for validation set.
print()
print('Accuracy Measures for Validation Set Using Exhaustive Search')
regressionSummary(valid_y_ex, boston_ex_pred)


Prediction for Validation Set Using Exhaustive Search
     Actual  Predicted  Residual
307    28.2      25.24      2.96
343    23.9      22.78      1.12
47     16.6      18.17     -1.57
67     22.0      21.86      0.14
362    20.8      18.93      1.87
132    23.0      19.58      3.42
292    27.9      25.25      2.65
31     14.5      18.06     -3.56
218    21.5      22.49     -0.99
90     22.6      23.28     -0.68

Accuracy Measures for Validation Set Using Exhaustive Search

Regression statistics

                      Mean Error (ME) : 0.4505
       Root Mean Squared Error (RMSE) : 3.8674
            Mean Absolute Error (MAE) : 2.7724
          Mean Percentage Error (MPE) : -2.1963
Mean Absolute Percentage Error (MAPE) : 13.3441
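
The validation measures reported in this case study can be placed side by side to support the comparison. The sketch below copies RMSE, MAE, and MAPE from the three validation summaries above and ranks the models by RMSE; the dictionary labels are descriptive names chosen here, not identifiers from the notebook.

```python
# Validation-partition measures copied from the summaries in this report.
validation = {
    'Neural network (9 nodes, scaled predictors)':
        {'RMSE': 3.1675, 'MAE': 2.2668, 'MAPE': 11.6748},
    'Improved neural network (10 nodes, logistic)':
        {'RMSE': 9.5609, 'MAE': 6.6363, 'MAPE': 32.6626},
    'Linear regression (Exhaustive Search predictors)':
        {'RMSE': 3.8674, 'MAE': 2.7724, 'MAPE': 13.3441},
}

# Rank models from lowest to highest validation RMSE.
ranked = sorted(validation, key=lambda name: validation[name]['RMSE'])
for name in ranked:
    m = validation[name]
    print(f"{name:<50} RMSE={m['RMSE']:.4f}  "
          f"MAE={m['MAE']:.4f}  MAPE={m['MAPE']:.4f}")
```

On these numbers the linear regression model has lower validation errors than the improved neural network as trained here, while the neural network from part 2 has the lowest validation errors of the three.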
