## Homework #1

Your name: Patrick Gravelle

### Question 1
You will be performing one iteration of the forward pass and backpropagation calculations for a small network using Python. Here we will focus on the calculations for one training example, though in reality your data sets will be much larger and require matrix computation. You will also calculate the associated loss.

Let $X_1 = 2$ and $X_2 = -1$ be the feature inputs and initialize the weights as shown in the figure below. This is a neural network with a single hidden layer consisting of three nodes. The blue numbers within each node represent the values for the bias terms and the black numbers along the edges represent the weights. The hidden layer feeds a single output node, which is used for binary classification. The label for this training example is $y = 1$.

<img src="simple_nn.png" width="500">

Implement a single forward pass of the network. You do not need to implement the network in keras and should instead use numpy operations (either scalar or matrix). Please use the variable names and print statements provided in the code chunks to display results for the TAs. 

In [1]:
# forward pass and prediction
import numpy as np
import keras
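# Inputs X_1 = 2 and X_2 = -1, plus a trailing 1 so the hidden biases fold into w_hidden
x = np.array([[2, -1, 1]])
# Rows 1-2 multiply X_1 and X_2; row 3 multiplies the constant 1 (hidden biases)
w_hidden = np.array([[1, 0.2, -0.6], [1.1, 0, -0.3], [-1.8, -0.4, 0.96]])
w_out = np.array([[0.5], [0.1], [1.3]])
b_out = 2
hidden = np.matmul(x, w_hidden)
# Matrix product (@): with plain arrays, * would multiply element-wise
output = hidden @ w_out + b_out
y_hat = 1 / (1 + np.exp(-output))  # sigmoid
prediction = np.round(y_hat)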
print('The values for the hidden layer are:', hidden)
print('The value for the output layer is:', output)
print('The predicted probability is:', y_hat)
print('The prediction is:', prediction)

Using TensorFlow backend.


The values for the hidden layer are: [[-0.9   0.    0.06]]
The value for the output layer is: [[1.628]]
The predicted probability is: [[0.83589547]]
The prediction is: [[1.]]
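The question notes that real data sets require matrix computation over many examples at once. As a minimal sketch, assuming the same weights as the cell above, the forward pass can be applied to a whole batch by stacking examples as rows. The function name `forward` and the second and third input rows are made up for illustration; the arithmetic mirrors the cell above exactly.

```python
import numpy as np

# Weights copied from the cell above. The last row of w_hidden multiplies the
# constant 1 appended to each input row, so it acts as the hidden biases.
w_hidden = np.array([[1.0, 0.2, -0.6],
                     [1.1, 0.0, -0.3],
                     [-1.8, -0.4, 0.96]])
w_out = np.array([[0.5], [0.1], [1.3]])
b_out = 2.0

def forward(x_batch):
    """Forward pass for inputs shaped (n_examples, 3); returns probabilities (n_examples, 1)."""
    hidden = x_batch @ w_hidden            # (n, 3)
    output = hidden @ w_out + b_out        # (n, 1)
    return 1.0 / (1.0 + np.exp(-output))   # sigmoid

# Row 0 is the training example from the question; rows 1-2 are hypothetical.
x_batch = np.array([[2.0, -1.0, 1.0],
                    [0.0, 0.0, 1.0],
                    [-1.0, 3.0, 1.0]])
y_hat_batch = forward(x_batch)
print(y_hat_batch.shape)
print(y_hat_batch[0, 0])
```

The first row should reproduce the predicted probability reported above, which is a quick way to confirm that the batched version matches the single-example calculation.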


Calculate the loss for the training example making sure to select the appropriate loss function.

In [2]:
# Loss
y_i = 1
loss_i = -y_i*np.log(y_hat)-(1-y_i)*np.log(1-y_hat)
print('The loss is:',loss_i)

The loss is: [[0.1792517]]
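The loss used above is binary cross-entropy. The sketch below wraps it in a small helper (the name `binary_cross_entropy` and the `eps` clipping are assumptions added here for numerical safety, not part of the assignment) and contrasts the loss for the true label $y = 1$ with the loss the same prediction would incur if the label were 0.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy for one example or an array of examples."""
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_prob = 0.83589547  # predicted probability from the forward pass above

loss_correct = binary_cross_entropy(1, y_prob)  # true label y = 1
loss_wrong = binary_cross_entropy(0, y_prob)    # hypothetical label y = 0
print(loss_correct, loss_wrong)
```

A confident prediction that agrees with the label gives a small loss, while the same prediction paired with the opposite label is penalized much more heavily.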


Implement a single backward pass of the network. Again use numpy and report the values using the print statements provided. Please interpret these values. In other words, what are the values you just calculated used for? 

In [3]:
# Backprop
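# dL/dy_hat for the binary cross-entropy loss
dl_dy = (y_hat-y_i)/(y_hat-y_hat**2)
# dy_hat/d(output) for the sigmoid
dy_dout = y_hat*(1-y_hat)
# d(output)/d(hidden-to-output weights) equals the hidden-layer values
dout_dw_h = hidden
# d(output)/d(output bias)
dout_db_h = 1
# Inputs feeding each hidden node, ordered here as (bias input 1, X_1, X_2);
# note this differs from the (X_1, X_2, 1) order used for x in the forward pass
x_hid = np.array([[1,2,-1]])
# d(output)/d(hidden values) equals the hidden-to-output weights
dout_dh = np.transpose(w_out)

# Chain rule for the hidden-to-output weights and the output bias
dl_dw_h = dl_dy*dy_dout*dout_dw_h
dl_db_h = dl_dy*dy_dout*dout_db_h
# Chain rule for the weights into each hidden node, ordered (bias, X_1, X_2)
dl_dw_1 = dl_dy*dy_dout*dout_dh[0,0]*x_hid
dl_dw_2 = dl_dy*dy_dout*dout_dh[0,1]*x_hid
dl_dw_3 = dl_dy*dy_dout*dout_dh[0,2]*x_hid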
print('The gradients of the loss wrt to the hidden weights are:', dl_dw_h)
print('The gradient of the loss wrt to the hidden bias is:', dl_db_h)
print('The gradients of the loss wrt to the input weights going to hidden node 1 are:', dl_dw_1)
print('The gradients of the loss wrt to the input weights going to hidden node 2 are:', dl_dw_2)
print('The gradients of the loss wrt to the input weights going to hidden node 3 are:', dl_dw_3)

The gradients of the loss wrt to the hidden weights are: [[ 0.14769407  0.         -0.00984627]]
The gradient of the loss wrt to the hidden bias is: [[-0.16410453]]
The gradients of the loss wrt to the input weights going to hidden node 1 are: [[-0.08205226 -0.16410453  0.08205226]]
The gradients of the loss wrt to the input weights going to hidden node 2 are: [[-0.01641045 -0.03282091  0.01641045]]
The gradients of the loss wrt to the input weights going to hidden node 3 are: [[-0.21333588 -0.42667177  0.21333588]]
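The question asks what these gradients are used for. One common use is a gradient descent step: each parameter is moved a small amount in the direction opposite its gradient so that the loss decreases. The sketch below is an illustration under stated assumptions, not part of the submitted answer: the learning rate of 0.1 and the helper `forward_loss` are hypothetical, and only the hidden-to-output weights and the output bias are updated to keep it short.

```python
import numpy as np

# Same network and training example as in the cells above.
x = np.array([[2.0, -1.0, 1.0]])
w_hidden = np.array([[1.0, 0.2, -0.6],
                     [1.1, 0.0, -0.3],
                     [-1.8, -0.4, 0.96]])
w_out = np.array([[0.5], [0.1], [1.3]])
b_out = 2.0
y = 1.0
learning_rate = 0.1  # hypothetical value chosen for illustration

def forward_loss(w_out, b_out):
    """Return hidden values, predicted probability, and cross-entropy loss."""
    hidden = x @ w_hidden
    y_hat = 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return hidden, y_hat, loss.item()

hidden, y_hat, loss_before = forward_loss(w_out, b_out)

# dL/d(output) simplifies to (y_hat - y) for sigmoid plus cross-entropy.
delta = y_hat - y                 # shape (1, 1)
grad_w_out = hidden.T * delta     # shape (3, 1), gradient for the output weights
grad_b_out = delta.item()         # gradient for the output bias

# Gradient descent step: move against the gradient.
w_out_new = w_out - learning_rate * grad_w_out
b_out_new = b_out - learning_rate * grad_b_out

_, _, loss_after = forward_loss(w_out_new, b_out_new)
print(loss_before, loss_after)
```

After the step, the loss on this training example should be slightly lower than before, which is the purpose of computing the gradients in the backward pass.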


### Question 2
In class we were considering classification problems where the goal was to predict a single discrete label of an input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead of a discrete label. For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a software project will take to complete, given its specifications.

You will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset you will be using has another interesting difference from our previous examples: it has very few data points, only 506 in total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has a different scale. For instance, some values are proportions, which take values between 0 and 1, others take values between 1 and 12, and others take values between 0 and 100.

The data consist of 13 features, which are as follows:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000(Bk - 0.63)^2 where Bk is the proportion of Black people by town.
13. Percentage of the population with lower socioeconomic status.

The targets (outcomes, y) are the median values of owner-occupied homes, in thousands of dollars. The prices are typically between 10,000 and 50,000 dollars. If that sounds cheap, remember this was the mid-1970s, and these prices are not inflation-adjusted.

In [4]:
# Import necessary packages

import keras
from keras import models
from keras import layers
import numpy as np

In [5]:
# Load the data
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

Print the dimensions of the training set, i.e. its shape

In [6]:
# Training shape
print(train_data.shape)

(404, 13)


Print the dimensions of the test set, i.e. its shape

In [7]:
# test shape
print(test_data.shape)

(102, 13)


It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a unit standard deviation.

Normalize the data. Be sure to normalize the test set with the training set mean and standard deviation.

In [8]:
# Normalize the data
train_data = train_data.astype('float32')
test_data = test_data.astype('float32')

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std
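To see what the normalization above achieves, here is a minimal sketch on a tiny made-up data set (the arrays `train` and `test` are hypothetical and unrelated to the Boston data). It checks that each training feature ends up with mean 0 and standard deviation 1, and that the test rows are scaled with the training statistics rather than their own.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
train = np.array([[0.1, 200.0],
                  [0.4, 300.0],
                  [0.7, 700.0]])
test = np.array([[0.5, 400.0]])

# Statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

The test set is generally not exactly centered after this step, which is expected: using its own statistics would leak information about the test data into preprocessing.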

Fit a fully connected neural network with 2 hidden layers and an output layer. Include 64 hidden units in each hidden layer and an appropriate number of units in the output layer. You are free to choose the activation functions. Use the `rmsprop` optimization function, and choose an appropriate loss function and model performance measure. Referring to the table shown in lectures 2 and 3 may help with these choices. Run the network for 50 epochs and use a batch_size of 10.

In [9]:
model = models.Sequential()
model.add(layers.Dense(64, activation = 'sigmoid', input_shape = (13,)))
model.add(layers.Dense(64, activation = 'sigmoid'))
model.add(layers.Dense(1))
# The string identifier selects RMSprop with its default settings
model.compile(loss='mean_squared_error',
              optimizer='rmsprop',
              metrics=['mae'])

model.fit(train_data, train_targets, batch_size = 10, epochs=50)



<keras.callbacks.History at 0xb262a3e48>

Report the test set accuracy and compare it to the training set accuracy. **Interpret what this means in words, in terms of what you are trying to do with your network**.

In [10]:
# Test Loss versus Training Loss
test_loss, test_mae = model.evaluate(test_data, test_targets)
print(test_mae)
train_loss, train_mae = model.evaluate(train_data, train_targets)
print(train_mae)

3.4146226527644137
3.1097183298356463
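The two numbers reported by `model.evaluate` above are the loss (mean squared error) and the metric (mean absolute error). As a sketch of what they measure, the snippet below computes both by hand on a few made-up values (`y_true` and `y_pred` are hypothetical, not taken from the model) and converts the MAE to dollars, since the targets are expressed in thousands of dollars.

```python
import numpy as np

# Hypothetical targets and predictions, in thousands of dollars.
y_true = np.array([15.2, 42.3, 50.0, 21.1])
y_pred = np.array([18.0, 40.0, 44.5, 22.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # mean squared error (the training loss)
mae = np.mean(np.abs(errors))   # mean absolute error (the reported metric)

print(mse)
print(mae)
print(mae * 1000)  # MAE expressed in dollars
```

MAE is often easier to interpret than MSE for this task because it is in the same units as the target, so a value of 3 corresponds to an average miss of about 3,000 dollars.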


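Answer: The mean absolute error is similar on the training set (about 3.11) and the test set (about 3.41). Because the targets are in thousands of dollars, the predictions are off by roughly 3,100 to 3,400 dollars on average, which is relatively low given that the median home prices in these 1970s Boston suburbs typically range from 10,000 to 50,000 dollars. The small gap between the training and test errors suggests that the network generalizes reasonably well to suburbs it was not trained on.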



Now fit the same network as above but with 16 hidden nodes in each hidden layer. **What is the test set accuracy and how does it compare to the first network you created? Which model do you think is better?**

In [11]:
# Model with 16 hidden nodes in each hidden layer

model = models.Sequential()
model.add(layers.Dense(16, activation = 'sigmoid', input_shape = (13,)))
model.add(layers.Dense(16, activation = 'sigmoid'))
model.add(layers.Dense(1))
# The string identifier selects RMSprop with its default settings
model.compile(loss='mean_squared_error',
              optimizer='rmsprop',
              metrics=['mae'])

model.fit(train_data, train_targets, batch_size = 10, epochs=50)

test_loss, test_mae = model.evaluate(test_data, test_targets)
print(test_mae)
train_loss, train_mae = model.evaluate(train_data, train_targets)
print(train_mae)

4.757518160576914
4.727914040631587


Answer: With 16 nodes per hidden layer, the mean absolute error rises to about 4.73 on the training set and about 4.76 on the test set, up from about 3.11 and 3.41 for the model with 64 nodes per hidden layer. By this metric, the 64-node model fits the training data better and also performs better on the test set, so it is the better model for predicting median Boston housing prices in the 1970s.
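One factor that may contribute to the difference is model capacity: the 64-node network has roughly ten times as many trainable parameters as the 16-node network. The sketch below counts the parameters of a fully connected network from its layer sizes; the helper `dense_param_count` is a hypothetical name introduced here, and the count assumes every layer has a bias term, as Keras `Dense` layers do by default.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, starting with the inputs.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

params_64 = dense_param_count([13, 64, 64, 1])  # 13 inputs, two hidden layers of 64
params_16 = dense_param_count([13, 16, 16, 1])  # 13 inputs, two hidden layers of 16
print(params_64, params_16)
```

Capacity alone does not explain the result, since both networks were trained for the same 50 epochs and a smaller network might close some of the gap with longer training.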

