# Neural Network Regression
## Boston housing data


In [1]:
### load the data
import pandas as pd

df = pd.read_csv('data/Boston.csv')
print(df.head())
print('\nDimensions of data frame:', df.shape)

      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

    black  lstat  medv  
0  396.90   4.98  24.0  
1  396.90   9.14  21.6  
2  392.83   4.03  34.7  
3  394.63   2.94  33.4  
4  396.90   5.33  36.2  

Dimensions of data frame: (506, 14)


In [2]:
# train test split
from sklearn.model_selection import train_test_split

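# predictors: columns 0-11 (crim through black); the slice stop is exclusive,
# so lstat (column 12) is not included
X = df.iloc[:, 0:12]
# target: column 13 (medv)
y = df.iloc[:, 13]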

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

print('train size:', X_train.shape)
print('test size:', X_test.shape)

train size: (404, 12)
test size: (102, 12)
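
Positional slices in `iloc` exclude the stop index, which is easy to overlook when counting columns. The following is a minimal sketch on a toy frame (the frame, its column names, and its values are made up for illustration) that shows the slicing behavior next to a name-based alternative.

```python
import pandas as pd

# Toy frame, made up for illustration; the last column plays the role of the target
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'target': [7, 8]})

# The stop index is exclusive: 0:2 keeps only columns 0 and 1
X_pos = toy.iloc[:, 0:2]
print(list(X_pos.columns))

# Selecting by name keeps every predictor without counting positions
X_named = toy.drop(columns='target')
y_named = toy['target']
print(list(X_named.columns))
```

Dropping the target by name is one way to avoid off-by-one slips; whether every remaining column should be a predictor is still a modeling decision.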


### Linear regression as a baseline

This code is copied from the 22_Linear_Regression notebook in the GitHub repository; refer to that notebook for further information.

In [3]:
## train the algorithm
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)

# make predictions
y_pred = linreg.predict(X_test)

# evaluation
# note: r2_score returns the coefficient of determination (R^2), not a correlation
from sklearn.metrics import mean_squared_error, r2_score
print('mse=', mean_squared_error(y_test, y_pred))
print('r2=', r2_score(y_test, y_pred))

mse= 27.442176730255614
r2= 0.7326596279331063
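
The `r2_score` function returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a correlation coefficient. As a rough check, the sketch below recomputes both metrics by hand on a few made-up values (the arrays are illustrative and not taken from the housing data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1 - ss_res / ss_tot      # coefficient of determination
mse_manual = ss_res / len(y_true)    # mean squared error

print('manual :', mse_manual, r2_manual)
print('sklearn:', mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```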


### Scaling the data

A neural network will converge faster and get better results with scaled data. The following two code blocks show two different ways to scale the data. Either method on its own is sufficient, but it is important to scale using means and standard deviations from the train set only, so that a firewall is maintained between the train and test sets.


The first block shows how to scale manually using built-in Python and pandas functionality. Standardizing data (often loosely called normalizing) consists of two transformations: first the column mean is subtracted from every element, then each element is divided by the column standard deviation. The primary reason for showing this code is to reinforce the concept. Standardization gives each column a mean of 0 and a standard deviation of 1 but does not change the shape of its distribution, so it does not make the data normally distributed. Range scaling, such as min-max scaling, instead maps the data onto a fixed numeric range such as 0 to 1.

The second block shows how to scale using functionality from sklearn. Note that the sklearn StandardScaler used in the second block standardizes the data unless the with_mean and with_std arguments are set to False.


In [4]:
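# scale the data using Python and pandas functionality

# statistics come from the train set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # pandas default is the sample std (ddof=1)

# apply the same transformation to train and test
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std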

In [5]:
# scale the data using sklearn functionality
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
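
The two approaches can be compared side by side. The sketch below uses synthetic, skewed data (made up for illustration; the column name `x` is arbitrary) to show that both yield a mean of about 0 and a standard deviation of about 1, that pandas and `StandardScaler` differ slightly because pandas divides by n-1 and `StandardScaler` by n, and that the skewness of the data is unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed data made up for illustration
rng = np.random.default_rng(0)
train = pd.DataFrame({'x': rng.exponential(scale=2.0, size=200)})
test = pd.DataFrame({'x': rng.exponential(scale=2.0, size=50)})

# Manual standardization with train statistics (pandas std uses ddof=1)
mean, std = train.mean(), train.std()
train_manual = (train - mean) / std
test_manual = (test - mean) / std      # test data reuses the train statistics

# StandardScaler fitted on the train data only (population std, ddof=0)
scaler = StandardScaler().fit(train)
train_sk = scaler.transform(train)     # returns a NumPy array

print(train_manual['x'].mean(), train_manual['x'].std())
print(train_sk.mean(), train_sk.std())

# Standardization shifts and rescales; it does not reshape the distribution
print(train['x'].skew(), train_manual['x'].skew())
```

For a data set of a few hundred rows the n versus n-1 difference is small, which is one reason either method works in practice.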

### Train the algorithm

There are many options available to set up the multi-layer perceptron (MLP) regressor; see [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor).

Running the regressor with default options on the first try produced a message that it failed to converge. Specifying the layer sizes and, more importantly, increasing the maximum number of iterations allowed it to converge. The default settings are listed at the link above.
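
One way to check for this situation programmatically is to capture the `ConvergenceWarning` and inspect the fitted model's `n_iter_` attribute. The sketch below is a minimal illustration on synthetic data (the data and the deliberately tiny `max_iter=5` are assumptions chosen to force the warning, not settings used in this notebook).

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Deliberately small iteration budget so the optimizer is cut off early
model = MLPRegressor(hidden_layer_sizes=(6, 3), max_iter=5, random_state=1234)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    model.fit(X_demo, y_demo)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print('iterations run:', model.n_iter_)
print('stopped at max_iter:', hit_limit)
```

If `n_iter_` equals `max_iter`, raising `max_iter` is a reasonable first thing to try.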

A random state is used so that the code gets the same results every time it is run. The randomness comes into play because weights are initially set to random values.
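
The effect of the seed can be seen by fitting the same network twice. This is a small sketch on synthetic data (the data, the helper name `fit_predict`, and the short `max_iter=50` are assumptions made for illustration).

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo.sum(axis=1)

def fit_predict(seed):
    """Fit a small MLP with the given seed and return its predictions (hypothetical helper)."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # the short run is not expected to converge
        model = MLPRegressor(hidden_layer_sizes=(6, 3), max_iter=50, random_state=seed)
        return model.fit(X_demo, y_demo).predict(X_demo)

same_seed = np.allclose(fit_predict(1234), fit_predict(1234))
other_seed = np.allclose(fit_predict(1234), fit_predict(99))

print('same seed gives identical predictions:', same_seed)
print('different seed gives identical predictions:', other_seed)
```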

In [6]:
# train the algorithm
from sklearn.neural_network import MLPRegressor

regr = MLPRegressor(hidden_layer_sizes=(6, 3), max_iter=500, random_state=1234)
regr.fit(X_train, y_train)


MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(6, 3), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=500,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=1234, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

In [7]:
# make predictions

y_pred = regr.predict(X_test)

In [8]:
# evaluation
# note: r2_score returns the coefficient of determination (R^2), not a correlation
from sklearn.metrics import mean_squared_error, r2_score
print('mse=', mean_squared_error(y_test, y_pred))
print('r2=', r2_score(y_test, y_pred))

mse= 29.309654210171164
r2= 0.7144667517187085


### Try different settings

The default solver is adam, as the output from the training above shows. The documentation notes that for small data sets lbfgs can converge faster and perform better. With 500 iterations the lbfgs solver did not converge, so the maximum number of iterations was raised to 1500. With that change it got better results than the previous network.

In [9]:
regr = MLPRegressor(hidden_layer_sizes=(6, 3), solver='lbfgs', max_iter=1500, random_state=1234)
regr.fit(X_train, y_train)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(6, 3), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=1500,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=1234, shuffle=True, solver='lbfgs',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

In [10]:
y_pred = regr.predict(X_test)

print('mse=', mean_squared_error(y_test, y_pred))
print('r2=', r2_score(y_test, y_pred))

mse= 11.558679390825924
r2= 0.8873958986810827
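
Comparing solvers is easier to repeat when the scaling and the network are wrapped in a single pipeline, so that the scaler is always fitted on the train portion only. The sketch below is one possible way to do that; it uses synthetic data from `make_regression` (an assumption for illustration, not the housing data), so its numbers say nothing about the results above.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1234)

results = {}
for solver in ('adam', 'lbfgs'):
    # The pipeline fits the scaler on X_tr only, then trains the network
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(6, 3), solver=solver,
                     max_iter=1500, random_state=1234),
    )
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # silence possible convergence warnings
        model.fit(X_tr, y_tr)
    results[solver] = mean_squared_error(y_te, model.predict(X_te))

for solver, mse in results.items():
    print(solver, 'mse=', mse)
```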


### Summary

Linear regression achieved an MSE of 27.4 (R² of 0.733 from r2_score), which is better than the first neural network's MSE of 29.3 (R² of 0.714). The second neural network outperformed them both with an MSE of 11.6 (R² of 0.887). Switching to the lbfgs solver and allowing more iterations produced the improvement.

Several other network sizes besides (6, 3) were tested, but the more complex models performed worse. Since this is a small data set, a complex network tends to overfit and learn noise from the data.
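
A comparison of network sizes can be made less dependent on a single train/test split by using cross-validation. The sketch below shows one way to set that up; the synthetic data, the candidate sizes, and the 5-fold setting are assumptions for illustration and are not the configurations tested for this notebook.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic small data set, made up for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

cv_mse = {}
for sizes in [(3,), (6, 3), (50, 50)]:
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=sizes, solver='lbfgs',
                     max_iter=1000, random_state=1234),
    )
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # silence possible convergence warnings
        scores = cross_val_score(model, X_demo, y_demo, cv=5,
                                 scoring='neg_mean_squared_error')
    cv_mse[sizes] = -scores.mean()  # flip the sign back to a positive MSE

for sizes, mse in cv_mse.items():
    print(sizes, 'mean cv mse=', mse)
```

A network whose cross-validated error is clearly worse than that of a smaller network is a candidate for overfitting.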