# Problem Set 1
### Due Sunday, April 5th by 11:59pm

### Question 1
#### 60 points total

You will be performing one iteration of the forward pass and backpropagation calculations for a small network using Python. Here we will focus on the calculations for one training example, though in reality your data sets will be much larger and require matrix computation. You will also calculate the associated loss.

Let $X_1 = 2$ and $X_2 = -1$ be the feature inputs and initialize the weights as shown in the figure below. This is a neural network with a single hidden layer consisting of three nodes. The blue numbers within each node are the bias terms and the black numbers along the edges are the weights. The hidden layer feeds a single output node, which is used for binary classification. The label for this training example is $y = 1$.



![network](https://drive.google.com/uc?id=1lfmGA56cIu81xD0y1SPKB7VuddS11G5o)


Implement a single forward pass of the network. You do not need to implement the network in keras and should instead use numpy operations (either scalar or matrix). Please use the variable names and print statements provided in the code chunks to display results for the TAs. 

* 20 points total

* +5 if variable names provided used
* +2 if variable correctly specified (each: hidden layer, output layer, predicted probability, prediction)
* +5 for correct probability calculation
* +2 for correct final prediction

In [0]:
# Your code here (forward pass and prediction)

import numpy as np
x  = np.array([1, 2,-1])                                  # add bias/intercept as first entry
w1 = np.matrix([[-1.8,-.4,.96], [1,.2,-.6], [1.1,0,-.3]]) # 3x3 matrix of first-layer biases and weights
w2 = np.array([.5,.1,1.3])                                # second-layer weights (hidden layer -> output node), length 3
b2 = 2                                                    # bias term of the output node

hidden = np.matmul(x,w1)           # perform matrix multiplication to get hidden layer
output = np.matmul(hidden,w2) + b2 # perform second multiplication to output layer, making sure to add the bias
y_hat = 1/(1+np.exp(-output))      # activation function
prediction = np.round(y_hat)       # round for prediction

print('The values for the hidden layer are:', hidden)
print('The value for the output layer is:', output)
print('The predicted probability is:', y_hat)
print('The prediction is:', prediction)

The values for the hidden layer are: [[-0.9   0.    0.06]]
The value for the output layer is: [[1.628]]
The predicted probability is: [[0.83589547]]
The prediction is: [[1.]]
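
As a cross-check, the same forward pass can be sketched with plain `np.array` objects and the `@` operator instead of `np.matrix`, which NumPy's own documentation no longer recommends. The variable names mirror the cell above; the `sigmoid` helper is introduced here for illustration only. It should reproduce the values printed above, up to the shape of the arrays.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation (helper introduced for this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -1.0])        # leading 1 multiplies the bias row of w1
w1 = np.array([[-1.8, -0.4, 0.96],    # row 0: hidden-node biases
               [1.0, 0.2, -0.6],      # row 1: weights on X1
               [1.1, 0.0, -0.3]])     # row 2: weights on X2
w2 = np.array([0.5, 0.1, 1.3])        # hidden -> output weights
b2 = 2.0                              # output-node bias

hidden = x @ w1                       # shape (3,); no hidden activation, as in the cell above
output = hidden @ w2 + b2             # scalar pre-activation of the output node
y_hat = sigmoid(output)               # predicted probability of class 1
prediction = np.round(y_hat)          # threshold at 0.5

print('hidden:', hidden)
print('output:', output)
print('y_hat:', y_hat)
print('prediction:', prediction)
```

With 1-D arrays the results are plain vectors and scalars rather than nested `[[...]]` matrices, which makes later elementwise arithmetic less surprising.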


Calculate the loss for the training example making sure to select the appropriate loss function.

* 10 points total
* +5 for using binary cross-entropy
* +5 for correct calculation
* Note: +5 if used wrong loss function but calculated correctly

In [0]:
# Your code here (loss)

# We use the predicted probability in the loss function.
p_i = y_hat # see chunk above for forward-pass. 
y_i = 1     # positive outcome as defined in the problem
loss_i = -y_i * np.log(p_i) - (1-y_i) * np.log(1-p_i) # BCE loss
print('The loss is:',loss_i)

The loss is: [[0.1792517]]
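
The formula above takes `np.log` of a probability, which can fail numerically when the sigmoid saturates at exactly 0 or 1. A minimal sketch of an algebraically equivalent form that works directly from the output-layer value is shown below. The function name `bce_from_logit` is hypothetical, and the value `1.628` is copied from the forward-pass output printed earlier.

```python
import math

def bce_from_logit(logit, label):
    """Binary cross-entropy computed from the pre-sigmoid output.

    Equal to -y*log(p) - (1-y)*log(1-p) with p = sigmoid(logit),
    but written as softplus(logit) - label*logit so log(0) never occurs.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

output = 1.628   # output-layer value from the forward pass above
y_i = 1          # label given in the problem

loss_stable = bce_from_logit(output, y_i)

# Direct formula, as used in the cell above, for comparison.
p_i = 1.0 / (1.0 + math.exp(-output))
loss_direct = -y_i * math.log(p_i) - (1 - y_i) * math.log(1 - p_i)

print('stable form:', loss_stable)
print('direct form:', loss_direct)
```

For this training example the two forms agree; the stable form only matters for outputs far from zero.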


Implement a single backward pass of the network. Again use numpy and report the values using the print statements provided. Please interpret these values. In other words, what are the values you just calculated used for? 

* 30 points total
* +6 points for each final output
* Note: full credit if calculations were correct but used wrong loss function from above
* Note: can add the bias term to the last 3 gradient calculations or not (can have 3 numbers for each or 2)
* Note: +4 for each if attempted to calculate but calculation is incorrect

In [0]:
# Your code here (backprop)

dl_dp = (-y_i/p_i) + (1-y_i)/(1-p_i) # gradient of loss wrt predicted probability (1x1)
dp_do = p_i * (1-p_i)                # gradient of predicted probability wrt output layer (1x1)
do_dw_h = hidden                     # gradient of output layer wrt hidden layer weights (1x3)
do_db_h = 1                          # gradient of output layer wrt hidden layer bias term (1x1)
do_dh = w2                           # gradient of output layer wrt hidden layer values (length 3)
dh_dw_input = x                      # gradient of each hidden node wrt its incoming bias and input weights (1x3)

dl_dw_h = dl_dp * dp_do * do_dw_h                # gradient of loss wrt hidden weights
dl_db_h = dl_dp * dp_do * do_db_h                # gradient of loss wrt hidden bias term
dl_dw_1 = dl_dp * dp_do * do_dh[0] * dh_dw_input # gradient of loss wrt input weights going to hidden node 1
dl_dw_2 = dl_dp * dp_do * do_dh[1] * dh_dw_input # gradient of loss wrt input weights going to hidden node 2
dl_dw_3 = dl_dp * dp_do * do_dh[2] * dh_dw_input # gradient of loss wrt input weights going to hidden node 3

print('The gradients of the loss wrt the hidden weights are:', dl_dw_h)
print('The gradient of the loss wrt the hidden bias is:', dl_db_h)
print('The gradients of the loss wrt the input weights going to hidden node 1 are:', dl_dw_1)
print('The gradients of the loss wrt the input weights going to hidden node 2 are:', dl_dw_2)
print('The gradients of the loss wrt the input weights going to hidden node 3 are:', dl_dw_3)

The gradients of the loss wrt the hidden weights are: [[ 0.14769407  0.         -0.00984627]]
The gradient of the loss wrt the hidden bias is: [[-0.16410453]]
The gradients of the loss wrt the input weights going to hidden node 1 are: [[-0.08205226 -0.16410453  0.08205226]]
The gradients of the loss wrt the input weights going to hidden node 2 are: [[-0.01641045 -0.03282091  0.01641045]]
The gradients of the loss wrt the input weights going to hidden node 3 are: [[-0.21333588 -0.42667177  0.21333588]]
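
As for what these values are used for: each gradient says how the loss changes when the corresponding weight is nudged, so gradient descent moves every weight a small step in the opposite direction. The sketch below first checks the analytic gradients against central finite differences and then takes one such step. The `learning_rate` of 0.1 and the helper names are assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])          # bias input followed by X1, X2
w1 = np.array([[-1.8, -0.4, 0.96],
               [1.0, 0.2, -0.6],
               [1.1, 0.0, -0.3]])       # column j holds the bias and weights into hidden node j+1
w2 = np.array([0.5, 0.1, 1.3])          # hidden -> output weights
b2 = 2.0                                # output-node bias
y = 1.0                                 # label

def loss_fn(w1, w2, b2):
    """Forward pass followed by binary cross-entropy for the single example."""
    output = (x @ w1) @ w2 + b2
    p = 1.0 / (1.0 + np.exp(-output))
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

def numerical_grad(param, eps=1e-6):
    """Central-difference estimate of d(loss)/d(param).

    `param` must be the global w1 or w2 array: it is perturbed in place,
    one entry at a time, and restored afterwards.
    """
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_up = loss_fn(w1, w2, b2)
        param[idx] = original - eps
        loss_down = loss_fn(w1, w2, b2)
        param[idx] = original
        grad[idx] = (loss_up - loss_down) / (2.0 * eps)
    return grad

grad_w2 = numerical_grad(w2)   # compare with dl_dw_h above
grad_w1 = numerical_grad(w1)   # columns compare with dl_dw_1, dl_dw_2, dl_dw_3

# One gradient-descent step with an assumed learning rate.
learning_rate = 0.1
loss_before = loss_fn(w1, w2, b2)
loss_after = loss_fn(w1 - learning_rate * grad_w1,
                     w2 - learning_rate * grad_w2,
                     b2)

print('numerical dl_dw_h:', grad_w2)
print('numerical dl_dw_1:', grad_w1[:, 0])
print('loss before step:', loss_before)
print('loss after step: ', loss_after)
```

If the finite-difference values match the analytic ones and the loss drops after the step, the backward pass is doing its job.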


### Question 2
#### 40 points total

In class we were considering classification problems where the goal was to predict a single discrete label of an input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead of a discrete label. For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a software project will take to complete, given its specifications.

You will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset you will be using differs from our previous examples in two interesting ways. First, it has very few data points: only 506 in total, split between 404 training samples and 102 test samples. Second, each "feature" in the input data (e.g. the crime rate) has a different scale. For instance, some values are proportions, which take values between 0 and 1, others take values between 1 and 12, and others between 0 and 100.

The data consists of 13 features, as follows:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000(Bk - 0.63)^2 where Bk is the proportion of Black people by town.
13. Percentage of the population with lower socioeconomic status (SES).

The targets (outcomes, y) are the median values of owner-occupied homes, in thousands of dollars. The prices are typically between 10,000 and 50,000 dollars. If that sounds cheap, remember this was the mid-1970s, and these prices are not inflation-adjusted.

In [0]:
# Import necessary packages
from __future__ import absolute_import, division, print_function, unicode_literals
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
import tensorflow_datasets as tfds
!pip install tensorflow-hub
!pip install tfds-nightly
import tensorflow_hub as hub
import keras
from keras import models
from keras import layers
import numpy as np



Using TensorFlow backend.


In [0]:
# Load the data
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

Downloading data from https://s3.amazonaws.com/keras-datasets/boston_housing.npz


Print the dimensions of the training set, i.e. its shape
* 2 points total (all or nothing)

In [0]:
# Your code here

train_data.shape

(404, 13)

Print the dimensions of the test set, i.e. its shape
* 2 points total (all or nothing)

In [0]:
# Your code here

test_data.shape

(102, 13)

It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a unit standard deviation.

Normalize the data. Be sure to normalize the test set with the training set mean and standard deviation.

* 5 points total
* +3 if attempted but done incorrectly

In [0]:
# Your code here

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std

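
To see why the test set is scaled with the training statistics, the following sketch applies the same recipe to a tiny made-up array. The numbers are hypothetical and chosen only so that the two columns have very different ranges.

```python
import numpy as np

# Hypothetical toy data: column 0 is a proportion, column 1 ranges up to 100.
train = np.array([[0.1, 10.0],
                  [0.5, 50.0],
                  [0.9, 90.0]])
test = np.array([[0.5, 30.0]])

mean = train.mean(axis=0)   # per-feature mean, from the training set only
std = train.std(axis=0)     # per-feature standard deviation, from the training set only

train_norm = (train - mean) / std
test_norm = (test - mean) / std   # reuse the training statistics on the test set

print('train mean after scaling:', train_norm.mean(axis=0))
print('train std after scaling: ', train_norm.std(axis=0))
print('scaled test row:', test_norm)
```

After scaling, both training columns are centered at 0 with unit standard deviation, and the test row is expressed in the same units. Computing a separate mean and standard deviation on the test set would leak information from data the model is not supposed to have seen.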

Fit a fully connected neural network with 2 hidden layers and an output layer. Include 64 hidden units in each hidden layer and an appropriate number of units in the output layer. You are free to choose the activation functions. Use the `rmsprop` optimization function, and choose an appropriate loss function and model performance measure. Referring to the table shown in lectures 2 and 3 may help with these choices. Run the network for 50 epochs and use a batch_size of 10.

* 15 points total
* +10 for correct architecture
* +3 for correct `optimizer`, `loss` and `metrics` (can be `mae` or `mse`)
* +2 for correct number of `epochs` and `batch_size`

In [0]:
# Define model
model = tf.keras.models.Sequential([
  # Layer 1 (Hidden layer)
  tf.keras.layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)),
  # Layer 2 (Hidden layer)
  tf.keras.layers.Dense(64, activation='relu'),
  # Layer 3 (Output layer)
  tf.keras.layers.Dense(1)
])

# Define how to execute training
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mae'])

# Train the network
model.fit(train_data, train_targets, epochs=50, batch_size=10)


Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fc6b56721d0>

Report the test set error (the `mae` or `mse` metric you chose) and compare it to the training set error. **Interpret what this means in words, in terms of what you are trying to do with your network**.

* 6 points total
* +1 for printing test set mae (or mse)
* +5 for correct interpretation (+3 for attempt at interpretation if incorrect)


In [0]:
# Your code here

test_loss, test_mae = model.evaluate(test_data, test_targets)  # returns [loss (mse), metric (mae)]
print(test_mae)

2.632237672805786


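Answer: The test set has a larger error than the training set: an MAE of about 2.632 compared to 1.8239. Because the targets are in thousands of dollars, this means that on average our test set predictions are off by about $2,632. The gap between training and test error suggests the network fits the training data somewhat better than it generalizes to unseen suburbs.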
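
To make the units concrete, here is a small sketch of how MAE and MSE are computed by hand and how an MAE in thousands of dollars converts to dollars. The targets and predictions below are hypothetical, not taken from the model above.

```python
import numpy as np

# Hypothetical targets and predictions, in thousands of dollars.
y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([22.0, 23.6, 31.7])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # the quantity reported by metrics=['mae']
mse = np.mean(errors ** 2)      # the quantity minimized by loss='mse'

print(f"MAE: {mae:.3f} (about ${mae * 1000:,.0f} per home)")
print(f"MSE: {mse:.3f}")
```

MAE stays in the units of the target, which is why it reads directly as a dollar amount, while MSE is in squared units and penalizes large misses more heavily.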

---



Now fit the same network as above but with 16 hidden nodes in each hidden layer. **What is the test set error and how does it compare to the first network you created? Which model do you think is better?**

* 10 points total
* +3 for correct model
* +2 for printing test set mae (or mse)
* +5 for correct interpretation (Note: depending on initialized weights, model 2 *could* perform better than model 1, but it isn't likely. Full credit if interpretation is correct according to their model performance. +3 if attempted interpretation but is incorrect)

In [0]:
# Your code here
# Define model
model = tf.keras.models.Sequential([
  # Layer 1 (Hidden layer)
  tf.keras.layers.Dense(16, activation='relu', input_shape=(train_data.shape[1],)),
  # Layer 2 (Hidden layer)
  tf.keras.layers.Dense(16, activation='relu'),
  # Layer 3 (Output layer)
  tf.keras.layers.Dense(1)
])

# Define how to execute training
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mae'])

# Train the network
model.fit(train_data, train_targets, epochs=50, batch_size=10)


test_loss, test_mae = model.evaluate(test_data, test_targets)  # returns [loss (mse), metric (mae)]
print(test_mae)

Epoch 1/50
...
Epoch 50/50
3.0830862522125244


Answer: The first model has a lower test MAE than this model (about 2.63 versus 3.08), so the first model performs better. This means the more complex model (64 hidden units in each hidden layer vs 16 hidden units in each hidden layer) is better for this task.
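
One way to make "more complex" concrete is to count trainable parameters. The sketch below does this for the two fully connected architectures used above; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of a stack of Dense layers: weights plus biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_features = 13  # number of input features in the housing data

large = dense_param_count([n_features, 64, 64, 1])   # 64 units per hidden layer
small = dense_param_count([n_features, 16, 16, 1])   # 16 units per hidden layer

print('64-unit network parameters:', large)
print('16-unit network parameters:', small)
```

The 64-unit network has roughly ten times as many parameters as the 16-unit one, and more parameters than the 404 training samples, so comparing the two on the held-out test set is what shows the extra capacity helps rather than simply memorizing the training data.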