## Lab #1

TA: Matt Ploenzke
Date: 4/5/2019

Today's lab consists of practice questions to review the topics presented thus far in class. We will be focusing on:
    1. Neural network terminology and architecture
    2. Python
    3. Forward and backward propagation
    4. TensorFlow

### Question 1
Let's review the terminology introduced by thinking about how to design a model for each of the following scenarios. There is more than one correct answer in these cases, but we want to develop an intuition that saves time in parameter tuning and training and saves computational resources. We'll also briefly touch on some advanced topics to provide a foundation for later in the course. Remember that you do not need to use a deep neural network in every case.

*Case 1:* The input is the MNIST handwritten digits dataset (features are 28x28 pixel intensities and labels are digits 0-9). We want to predict which digit the image represents and there are only 10 images per category ($N=100$).

    - Random forest or k-nearest neighbors, because of the small sample size and relatively easy prediction task.

*Case 2:* The identical setup but this time there are thousands of images per category.

    - Either of the above methods is fine, as is a simple neural network. For the network, the output-layer activation function needs to be softmax and the loss function needs to be categorical cross-entropy.
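
To make the softmax and categorical cross-entropy pairing concrete, here is a minimal sketch in plain Python. The three-class scores and the helper names `softmax` and `categorical_cross_entropy` are made up for illustration; this is not the keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow and does not change the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_label, probs):
    """Negative log-probability that the model assigns to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

# Hypothetical scores for a 3-class problem (illustrative values only)
logits = [2.0, 1.0, 0.1]
label = [1, 0, 0]  # one-hot label: the true class is class 0

probs = softmax(logits)
loss = categorical_cross_entropy(label, probs)
print('probabilities:', probs)
print('loss:', loss)
```

Because the probabilities are forced to sum to 1, softmax suits the case where each image belongs to exactly one class.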

*Case 3:* The same setup as Case 2, but this time images may contain multiple digits or no digits at all.

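    - The last-layer activation should be sigmoid and the loss should be binary cross-entropy (BCE), so that each digit is predicted independently of the others.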
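
As a rough sketch of why sigmoid outputs fit this multi-label case, the plain-Python example below scores three hypothetical output nodes independently. The scores, labels, and helper names are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(labels, probs):
    """Average the binary cross-entropy over the independent outputs."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(labels, probs)]
    return sum(terms) / len(terms)

# Hypothetical scores for three of the ten digit outputs
logits = [3.0, -2.0, 2.5]
labels = [1, 0, 1]  # the image contains the first and third digit, not the second

probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(labels, probs)
print('probabilities:', probs)
print('sum of probabilities:', sum(probs))  # not constrained to equal 1
print('loss:', loss)
```

Unlike softmax, the outputs do not compete with each other, so several digits (or none) can receive a high probability at the same time.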

*Case 4:* The covariates are BMI measurements and reported smoking status, and the labels are binary, denoting cardiovascular disease. Our sample consists of 70 individuals and we want to predict an individual's health status based on their BMI and smoking status. We are interested in the effect of BMI on cardiovascular disease.

    - Logistic regression.

*Case 5:* The input consists of thousands of images of different animals and we want to classify which animal the image contains. 

    - CNN with softmax and categorical cross-entropy (CCE), or sigmoid and binary cross-entropy (BCE).

*Case 6:* The input consists of thousands of English sentences and we want to predict the next word in the sentences. 

    - RNN

*Case 7:* The input consists of biomarker status for thousands of loci across thousands of individuals (e.g. Ancestry.com). There are no associated labels and we wish to learn about population substructure. 

    - PCA, VAE, etc.

### Question 2

a) Start a Jupyter notebook and/or create a Python file.

b) Install and load numpy. What is your package manager?

c) If you don't have keras and tensorflow installed, install those now. 

d) Ask Matt any questions about Python now or forever hold your peace.

In [1]:
import numpy as np
import keras

Using TensorFlow backend.


### Question 3

Draw the architecture of a neural network satisfying the following conditions:
    1. X is a univariate covariate. We will consider the case when X=5.
    2. There are two hidden layers. The first consists of two nodes, each with a bias term (-1 and 1, respectively). The weight going to the first node takes value 0.5 and the weight going to the second node takes value -0.5.
    3. The nodes in hidden layer 1 each use a linear activation function.
    4. Hidden layer 2 consists of a single node with no bias term and the ReLU activation function. The weight from node 1 in hidden layer 1 is 0.3 and the weight from node 2 in hidden layer 1 is 0.7.
    5. Hidden layer 2 outputs to a single output node. The bias term for the output node is 0.5 and the weight from hidden layer 2 is 2. 
    6. The loss function to be optimized is squared loss.

<img src="Lab1_q3.png" width="500">

### Question 4
Implement a single forward pass of the network described in Question 3. You do not need to implement the network in keras and should instead use numpy operations (either scalar or matrix). Start by defining the weights and input matrices.

In [2]:
x = np.array([1, 5]) # add bias/intercept as first entry
w_hidden1 = np.matrix([[-1, 1], [.5, -.5]]) # 2x2 matrix: row 1 holds the first-layer biases, row 2 the first-layer weights
w_hidden2 = np.matrix([[.3], [.7]]) # 2x1 matrix of second-layer weights
w_out = 2 # scalar output-layer weight
b_out = 0.5 # scalar output-layer bias

Now perform the forward pass.

In [3]:
hidden1 = np.matmul(x,w_hidden1) # perform matrix multiplication to get hidden layer 1
hidden2 = np.matmul(hidden1,w_hidden2) # perform matrix multiplication to get hidden layer 2
hidden2_clamped = np.maximum(hidden2, 0) # relu
y_hat = hidden2_clamped*w_out + b_out # perform third multiplication to get output layer

And let's print the values.

In [4]:
print('The values for the hidden layer 1 are:', hidden1)
print('The values for the hidden layer 2 are:', hidden2)
print('The post-relu values for the hidden layer 2 are:', hidden2_clamped)
print('The value for the output layer is:', y_hat)

The values for the hidden layer 1 are: [[ 1.5 -1.5]]
The values for the hidden layer 2 are: [[-0.6]]
The post-relu values for the hidden layer 2 are: [[0.]]
The value for the output layer is: [[0.5]]
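
The printed values can be checked by hand. The symbols below ($h^{(1)}_1$, $h^{(1)}_2$ for the first hidden layer, $z^{(2)}$ and $h^{(2)}$ for the second hidden layer before and after the ReLU) are notation introduced here for illustration and are not used elsewhere in the lab.

$$
\begin{aligned}
h^{(1)}_1 &= 0.5 \cdot 5 + (-1) = 1.5 \\
h^{(1)}_2 &= -0.5 \cdot 5 + 1 = -1.5 \\
z^{(2)} &= 0.3 \cdot 1.5 + 0.7 \cdot (-1.5) = 0.45 - 1.05 = -0.6 \\
h^{(2)} &= \max(0, -0.6) = 0 \\
\hat{y} &= 2 \cdot 0 + 0.5 = 0.5
\end{aligned}
$$

Note that the ReLU zeroes out the second hidden layer for this input, so the prediction equals the output bias.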


Calculate the loss for the training example given a label of Y=.25.

In [5]:
y_i = .25 # label given in the problem
loss_i = (y_i-y_hat)**2
print('The loss is:',loss_i)

The loss is: [[0.0625]]


Implement a single backward pass of the network. Again use numpy. Start by defining the individual gradient terms.

In [6]:
# gradient for loss
dl_dy = 2*(y_hat-y_i) # gradient of the loss (y_i-y_hat)**2 wrt the predicted value y_hat (1x1)

# gradients for output layer
dy_dhidden2_clamped = w_out # gradient of y_hat wrt hidden output (1x1)
dy_dw_out = hidden2_clamped # gradient of  y_hat wrt output layer weight (1x1)
dy_db_out = 1 # gradient of  y_hat wrt output layer bias (1x1)

# gradient (gate) for relu
dhidden2_clamped_dhidden2 = (hidden2>0)*1 # gradient for relu

# gradients for second hidden layer
dhidden2_dw_2 = hidden1 # gradient of second hidden layer wrt second hidden layer weights
dhidden2_dhidden1 = w_hidden2 # gradient of second hidden layer wrt first hidden layer output

# gradients for first hidden layer
dhidden1_dw_1 = x # gradient of first hidden layer wrt its bias and weight (x holds the bias input 1 and the covariate)

In [7]:
dl_dw_out = dl_dy*dy_dw_out # gradient of loss wrt output weights
dl_db_out = dl_dy*1 # gradient of loss wrt output bias
dl_dw_2 = dl_dy*dy_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dw_2 # gradient of loss wrt second hidden layer weights
dl_dw_11 = dl_dy*dy_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dhidden1[0]*dhidden1_dw_1 # gradient of loss wrt first hidden layer weights (node 1)
dl_dw_12 = dl_dy*dy_dhidden2_clamped*dhidden2_clamped_dhidden2*dhidden2_dhidden1[1]*dhidden1_dw_1 # gradient of loss wrt first hidden layer weights (node 2)
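
One way to sanity-check a hand-written backward pass is a finite-difference gradient check. The sketch below re-implements the Question 3 network with plain scalars; the function names, the parameter names in `params`, and the step size `eps` are choices made for this illustration. For squared loss the analytic gradient with respect to the prediction is $2(\hat{y} - y)$.

```python
def forward(x, w11, b11, w12, b12, w21, w22, w_out, b_out):
    """Forward pass of the Question 3 network using scalar operations."""
    h1 = w11 * x + b11           # hidden layer 1, node 1 (linear)
    h2 = w12 * x + b12           # hidden layer 1, node 2 (linear)
    z = w21 * h1 + w22 * h2      # hidden layer 2 before the ReLU
    a = max(z, 0.0)              # ReLU
    return w_out * a + b_out     # output node

def loss(params, x=5.0, y=0.25):
    """Squared loss for the single training example."""
    return (y - forward(x, **params)) ** 2

params = dict(w11=0.5, b11=-1.0, w12=-0.5, b12=1.0,
              w21=0.3, w22=0.7, w_out=2.0, b_out=0.5)

def numerical_gradient(name, eps=1e-6):
    """Central finite difference of the loss wrt one parameter."""
    up = dict(params)
    up[name] += eps
    down = dict(params)
    down[name] -= eps
    return (loss(up) - loss(down)) / (2 * eps)

numeric = {name: numerical_gradient(name) for name in params}

# Analytic gradient wrt the output bias: dl/dy_hat * dy_hat/db_out
y_hat = forward(5.0, **params)
analytic_db_out = 2 * (y_hat - 0.25) * 1

for name, value in numeric.items():
    print(name, value)
print('analytic dl/db_out:', analytic_db_out)
```

Because the ReLU is inactive for this input, every gradient upstream of it is zero, and the output weight also receives no gradient since its input is zero. Only the output bias has a nonzero gradient.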

We see that at most two quantities are needed for each layer: 
    
    1) the partial derivative w.r.t. the input
    2) the partial derivative w.r.t. the weight/parameter
    
What is the purpose of each of these quantities?

    - The partial derivative w.r.t. the input is used in the backprop/chain-rule calculations.
    - The partial derivative w.r.t. the parameter/weights is used to update the parameter values in the training loop via gradient descent.
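
To illustrate the second point, here is a minimal sketch of one gradient descent update of the output bias for the Question 3 network. The learning rate of 0.1 is an assumed value picked for illustration; the lab does not specify one.

```python
learning_rate = 0.1    # assumed value, not given in the lab

y = 0.25               # label
w_out = 2.0            # output-layer weight
b_out = 0.5            # output-layer bias
hidden2_clamped = 0.0  # post-ReLU value of hidden layer 2 from the forward pass

# Forward pass through the output node and the loss before the update
y_hat = w_out * hidden2_clamped + b_out
loss_before = (y - y_hat) ** 2

# Gradient of the squared loss wrt the output bias (chain rule)
dl_db_out = 2 * (y_hat - y) * 1

# Gradient descent step: move the parameter against its gradient
b_out_new = b_out - learning_rate * dl_db_out

# Loss after the update
y_hat_new = w_out * hidden2_clamped + b_out_new
loss_after = (y - y_hat_new) ** 2

print('bias before and after:', b_out, b_out_new)
print('loss before and after:', loss_before, loss_after)
```

The update moves the bias toward the label, so the loss on this example decreases after one step.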

### Question 5
Load the MNIST dataset provided by keras. This contains 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images. Split the data into training and testing sets.

In [1]:
# Import necessary packages

import keras
from keras import models
from keras import layers
import numpy as np

Using TensorFlow backend.


In [2]:
# Load the data
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Print the shape of the training and testing datasets.

In [3]:
print(x_train.shape)
print(x_test.shape)

(60000, 28, 28)
(10000, 28, 28)


Let's reshape the data to fit the keras format. Don't worry too much about this chunk for now.

In [4]:
from keras import backend as K
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
    x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
    input_shape = (1, 28, 28)
else:
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)

Now print the shape again to see what changed.

In [5]:
print(x_train.shape)
print(x_test.shape)

(60000, 28, 28, 1)
(10000, 28, 28, 1)


Question 2 in Homework 1 asks you to train a neural network on the Boston housing data. This dataset contains features on very different scales (for example, there are both binary features and real-valued features). The MNIST features all share a single scale (pixel intensities between 0 and 255), so per-feature normalization is not strictly needed, but we will go through the exercise of normalizing the values before training our network.

Can you think of other algorithms in which normalization is necessary? Is it necessary in the case of neural networks? Why or why not? 

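    - Clustering (e.g. k-means), PCA, k-nearest neighbors, etc. Tree-based methods such as random forests are insensitive to feature scale. For neural networks, normalization is not strictly necessary (a network is a universal function approximator), but it makes training easier when the features have very different scales. 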

Normalize the data. Be sure to normalize the test set with the training set mean and standard deviation. Don't forget to convert the training and testing sets to float32.

In [None]:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

mean = x_train.mean()
x_train -= mean
std = x_train.std()
x_train /= std

x_test -= mean
x_test /= std

How will the code above need to be changed for the Boston housing dataset? Why?

    - Need to calculate mean and standard deviation per feature, thus need to use something like x_train.mean(axis=0).
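
A minimal sketch of per-feature normalization is shown below. It uses a tiny made-up array rather than the Boston housing data, and the names `x_train_scaled` and `x_test_scaled` are introduced here for illustration.

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features on very different scales
x_train = np.array([[0.0, 100.0],
                    [1.0, 200.0],
                    [0.0, 300.0],
                    [1.0, 400.0]], dtype='float32')
x_test = np.array([[1.0, 250.0]], dtype='float32')

# axis=0 reduces over the samples, giving one statistic per feature (column)
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# Scale both sets with the training-set statistics
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print('per-feature mean:', mean)
print('per-feature std:', std)
print('scaled test set:', x_test_scaled)
```

Each column of the scaled training set has mean 0 and standard deviation 1, regardless of its original units.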

Before we define and fit our model let's one-hot encode the labels. Don't forget to do the same for the testing labels and note you will not need to do this step in the case of regression.

In [6]:
y_train.shape

(60000,)

In [7]:
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
y_train.shape

(60000, 10)
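
To see what one-hot encoding produces, here is a small numpy sketch on a handful of made-up labels. It mimics the result of the encoding step and is not the keras implementation.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # hypothetical digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype='float32')[labels]

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(one_hot[0])
print(recovered)
```

Each row contains a single 1 in the position of its class, which matches the ten softmax outputs of the network one to one.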

Now fit a shallow convolutional neural network with a single dense layer. Include 32 convolutional filters of size 3x3 and use the relu activation function.

After the convolutional layer, flatten the tensor so it can be fed into the dense layer.

In the dense layer use enough output nodes to have one corresponding to each class label (10). What is the activation function you should use here?

Use the `Adadelta` optimizer, and choose an appropriate loss function and model performance measure. 

Run the network for 5 epochs and use a batch_size of 64.

In [8]:
model = models.Sequential()
model.add(layers.Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=64, epochs=5, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x12f844cf8>
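
The layer sizes of the network above can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming the `Conv2D` defaults of stride 1 and `'valid'` padding (no zero-padding), which the model above does not override. The variable names are introduced here for illustration.

```python
input_height, input_width, input_channels = 28, 28, 1
num_filters, kernel_size = 32, 3
num_classes = 10

# With 'valid' padding and stride 1, each spatial dimension shrinks by kernel_size - 1
conv_height = input_height - kernel_size + 1
conv_width = input_width - kernel_size + 1

# Each filter has kernel_size * kernel_size * input_channels weights plus one bias
conv_params = num_filters * (kernel_size * kernel_size * input_channels + 1)

# Flatten turns the (height, width, filters) tensor into a single vector
flattened = conv_height * conv_width * num_filters

# The dense layer has one weight per (input, output) pair plus one bias per output
dense_params = flattened * num_classes + num_classes

total_params = conv_params + dense_params

print('conv output shape:', (conv_height, conv_width, num_filters))
print('flattened length:', flattened)
print('trainable parameters:', total_params)
```

Almost all of the parameters sit in the dense layer, because the flattened convolutional output is large.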

Report the test set accuracy. 

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(test_acc)