# Neural Network Model
Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

## Preliminary Preparation

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

In [2]:
# Loading the dataset
df = pd.read_csv('train.csv')
df.shape

(1460, 81)

In [3]:
# Observing the dataset
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
# Observing types of data, unique, NaN, and sample in all features
def snapshot(data):
    '''Creates a DataFrame that gives snapshot of original dataset for preliminary cleaning and analysis.'''
    preliminary_details = pd.DataFrame()
    preliminary_details['Type'] = data.dtypes
    preliminary_details['Unique'] = data.nunique()
    preliminary_details['NaN'] = data.isnull().sum()
    preliminary_details['Sample'] = data.sample().T
    return preliminary_details
snapshot(df)

Unnamed: 0,Type,Unique,NaN,Sample
Id,int64,1460,0,1239
MSSubClass,int64,15,0,20
MSZoning,object,5,0,RL
LotFrontage,float64,110,259,63
LotArea,int64,1073,0,13072
Street,object,2,0,Pave
Alley,object,2,1369,
LotShape,object,4,0,Reg
LandContour,object,4,0,Lvl
Utilities,object,2,0,AllPub


In [5]:
# Filling NaN values with mean of 70.049 (feature has original std of 24.284)
df['LotFrontage'] = df.LotFrontage.fillna(df.LotFrontage.mean())
# Seems to infer NaN means that no alley exists: therefore, creating new categorical variable
df['Alley'] = df.Alley.fillna('None')
# Only 8 NaN values: joining them to 'None' category
df['MasVnrType'] = df.MasVnrType.fillna('None')
df['MasVnrArea'] = df.MasVnrArea.fillna(0)
# Seems to infer NaN means that no basement exists: therefore, creating new variable
df['BsmtQual'] = df.BsmtQual.fillna('None')
df['BsmtCond'] = df.BsmtCond.fillna('None')
df['BsmtExposure'] = df.BsmtExposure.fillna('None')
df['BsmtFinType1'] = df.BsmtFinType1.fillna('None')
df['BsmtFinType2'] = df.BsmtFinType2.fillna('None')
# One NaN value: adding to largest category
df['Electrical'] = df.Electrical.fillna('Sbrkr')
# Seems to infer NaN means that no fireplace exists: therefore, creating new variable
df['FireplaceQu'] = df.FireplaceQu.fillna('None')
# Seems to infer NaN means that no garage exists: therefore, creating new variable
df['GarageType'] = df.GarageType.fillna('None')
df['GarageYrBlt'] = df.GarageYrBlt.fillna(df.GarageYrBlt.min()) #filling with min because it is a numerical feature
df['GarageFinish'] = df.GarageFinish.fillna('None')
df['GarageQual'] = df.GarageQual.fillna('None')
df['GarageCond'] = df.GarageCond.fillna('None')
# Seems to infer NaN means that no pool exists: therefore, creating new variable
df['PoolQC'] = df.PoolQC.fillna('None')
# Seems to infer NaN means that no fence exists: therefore, creating new variable
df['Fence'] = df.Fence.fillna('None')
# Miscellaneous categories comprised of "Shed", "Othr", "Gar2" and "TenC": we will limit it to a Shed variable
df['Shed'] = np.where(df.MiscFeature == 'Shed', 1,0)
df = df.drop(['MiscFeature','Id'], 1)

In [6]:
# Creating our full feature set
features = df

# Creating dummy features from categorical variables
dummy_feature_names = list(df.dtypes[df.dtypes == 'object'].index)
for x in dummy_feature_names:
    features = pd.concat([features.drop(x, axis=1), pd.get_dummies(df[x], prefix=x)], axis=1)

print(features.shape)

# Confirming that there are no text variables
features.dtypes[features.dtypes == 'object']

(1460, 300)


Series([], dtype: object)

## Random Forest Boosting Model
This model was used for a previous capstone presentation. For the purposes of this assignment, we will be comparing the Random Forest Regressor model against the multi-layer perceptron Neural Network model.

In [10]:
# For Lasso and Random Forest, the highest performing model was with all features and unadjusted target variable
X = features.drop('SalePrice', 1)
y = features.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

In [12]:
# Applying best parameters to Random Forest Regressor
rfr = RandomForestRegressor(n_estimators=1111, min_samples_split=2, min_samples_leaf=2,
                           max_features='auto', max_depth=80, bootstrap=True, random_state=42)

rfr.fit(X_train, y_train)
rfr.score(X_test, y_test)

0.8518326280555957

## Neural Network Model

In [15]:
# Base performer
mlp = MLPRegressor(random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5568149098571558

### Tuning layer parameters

In [16]:
# Increasing hidden layers from default 100 to 1000
mlp = MLPRegressor(hidden_layer_sizes=(1000,), random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5902952311335012

In [22]:
# Adding another layer
mlp = MLPRegressor(hidden_layer_sizes=(100,50), random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5713359045456876

In [17]:
# Faster performer with higher accuracy
mlp = MLPRegressor(hidden_layer_sizes=(100,10), random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5995250512786906

In [23]:
# Slower performer with slightly higher accuracy
mlp = MLPRegressor(hidden_layer_sizes=(1000,10), random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.6080602153718138

The last cell was the best performer, but it was considerably slower than the penultimate cell. For this reason, we will continue with the hidden layer parameters from the penultimate cell.
### Tuning alpha parameters

In [24]:
# Decreasing alpha
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.00001, random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5949935166993032

In [25]:
# Increasing alpha
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.001, random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5926898170546826

In [29]:
# Decreasing alpha slightly
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.00009, random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5975378182310022

No alpha adjustment performed better than the default amount of '0.0001'. Therefore, we will leave alpha as is.
### Tuning activation parameters

In [33]:
# Changing activation
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.0001, activation='identity', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5730572427185259

In [32]:
# Changing activation
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.0001, activation='logistic', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



-4.862515513074266

In [34]:
# Changing activation
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.0001, activation='tanh', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



-4.86232030885895

In [35]:
# Changing activation
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.0001, activation='relu', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)



0.5995250512786906

The default value wins again.
### Tuning solver parameters

In [37]:
# Changing solver to 'lbgs', which for smaller datasets can converge faster and perform better
mlp = MLPRegressor(hidden_layer_sizes=(100,10), alpha=.0001, activation='relu', solver='lbfgs', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)

0.6651118207666947

As a result of changing our solver, many of our parameters can be readjusted for optimization.
### Running cross-validation

In [61]:
# Readjusting hidden layer sizes
mlp = MLPRegressor(hidden_layer_sizes=(100,10,10), alpha=.009, activation='relu', solver='lbfgs', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)

0.7165527927329183

In [62]:
# CV shows signs of overfitting
scores = cross_val_score(mlp, X, y, cv=5)
print(scores)
print(scores.mean())

[0.76287218 0.73979048 0.76428385 0.75110344 0.59458778]
0.7225275449828296


In [63]:
# Readjusting hidden layer sizes
mlp = MLPRegressor(hidden_layer_sizes=(100,10,), alpha=.009, activation='relu', solver='lbfgs', random_state=0)
mlp.fit(X_train, y_train)
mlp.score(X_test, y_test)

0.6734534161793031

In [64]:
# By removing one of the hidden layers we find slightly less overfitting and a slight increase in CV score
scores = cross_val_score(mlp, X, y, cv=5)
print(scores)
print(scores.mean())

[0.76934743 0.76896539 0.78337687 0.7273077  0.62296017]
0.7343915130052142


This notebook explored the differences between the Neural Network Regressor and the Random Forest Regressor. At least in this example, Random Forest was the clear winner by over 10%. It would be worthwhile to run the same test between models on a classification problem, instead of a regression problem.