# Homework #1

#### Due: April 9th before midnight. Upload to Canvas.


### Linear Algebra and NumPy Review
#### Question 1
Consider the following two random arrays, `a` and `b`. What are the shapes of `a` and `b`, and what will the shape of `c` be? You can work this out in your head or with code.

In [None]:
import numpy as np

a = np.random.randn(2, 3)
b = np.random.randn(2, 1)
c = a + b

#### Answer 1


#### Question 2
Consider the following two random arrays, `a` and `b`. What will be the shape of `c`? Why?

In [None]:
a = np.random.randn(4, 3)
b = np.random.randn(3, 2)
c = a * b

#### Answer 2


#### Question 3
Suppose an image $x$ is stored as a `(64, 64, 3)` array, representing a 64x64 image with 3 color channels (red, green, and blue). How do you reshape this into a column vector?

In [None]:
x = np.random.randn(64, 64, 3)

#### Answer 3


In [None]:
# YOUR CODE HERE


### Deep Feedforward Neural Networks
#### Question 4
Which of the following are true?
1. ***X*** is a matrix in which each column is one training example.
2. $a^{[2]}_4$ is the activation output by the 4th neuron of the 2nd layer.
3. $a^{[2](12)}$ denotes the activation vector of the 2nd layer for the 12th training example.
4. $a^{[2]}$ denotes the activation vector of the 2nd layer.

#### Answer 4


#### Question 5
You are building a binary classifier for recognizing cucumbers ($y=1$) vs. watermelons ($y=0$). Which one of these activation functions would you recommend using for the output layer?

1. ReLU
2. Leaky ReLU
3. sigmoid
4. tanh

#### Answer 5


#### Question 6
During forward propagation, in the forward function for a layer $l$, you need to know the activation function in a layer (sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know what the activation function is for layer $l$, since the gradient depends on it. True/False?

#### Answer 6


### Classifying newswires: a multi-class classification example
In this section, we will build a network to classify Reuters newswires into 46 different mutually exclusive topics. Since we have many classes, this problem is an instance of "multi-class classification", and since each data point should be classified into only one category, the problem is more specifically an instance of "single-label, multi-class classification". If each data point could belong to multiple categories (in our case, topics), then we would be facing a "multi-label, multi-class classification" problem.
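
As a small illustration of the single-label vs. multi-label distinction, the sketch below encodes made-up labels with NumPy. The class count of 5 and the label values are assumptions chosen only to keep the output readable; they are not taken from the Reuters data.

```python
import numpy as np

num_classes = 5  # toy value for illustration; the Reuters task has 46 topics

# Single-label: each sample carries exactly one class index.
single_labels = np.array([2, 0, 4])
one_hot = np.zeros((len(single_labels), num_classes))
one_hot[np.arange(len(single_labels)), single_labels] = 1.0

# Multi-label: each sample may carry any number of class indices.
multi_labels = [[0, 2], [1], [2, 3, 4]]
multi_hot = np.zeros((len(multi_labels), num_classes))
for i, classes in enumerate(multi_labels):
    multi_hot[i, classes] = 1.0

print(one_hot.sum(axis=1))    # every row sums to exactly 1
print(multi_hot.sum(axis=1))  # rows may sum to more than 1
```

In the single-label case every target row contains exactly one 1, which is the situation in this assignment.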

### The Reuters dataset

We will be working with the _Reuters dataset_, a set of short newswires and their topics, published by Reuters in 1986. It's a very simple, widely used toy dataset for text classification. There are 46 different topics; some topics are more represented than others, but each topic has at least 10 examples in the training set. Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras.

In [None]:
from keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

As with the IMDB dataset, the argument `num_words=10000` restricts the data to the 10,000 most frequently occurring words found in the data.

#### Question 7
How many training and testing examples are there in this dataset?


In [None]:
# YOUR CODE HERE


#### Answer 7


As with the IMDB reviews, each example is a list of integers (word indices), and the label associated with an example is an integer between 0 and 45: a topic index.

In [None]:
print(train_data[9])   # The 10th training example's feature vector
print(train_labels[9]) # The 10th training example's label

### Prepping the data
We cannot feed lists of integers into a neural network. We have to turn our lists into tensors.
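
One way to picture this conversion is multi-hot encoding, the same idea used by the `vectorize_sequences` helper provided in the answer cell of Question 8. The sketch below applies it to made-up word indices with a tiny vocabulary of 6; the function name `multi_hot` and the toy data are assumptions for illustration only.

```python
import numpy as np

def multi_hot(sequences, dimension):
    # One row per sequence, one column per possible word index.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # mark every index that occurs in the sequence
    return results

toy_sequences = [[1, 3, 3], [0, 2, 4, 5]]  # made-up word indices
encoded = multi_hot(toy_sequences, dimension=6)
print(encoded.shape)
print(encoded)
```

Note that sequences of different lengths become rows of equal length, and a repeated index (the two 3s in the first sequence) still yields a single 1: this encoding records which words occur, not how often or in what order.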

#### Question 8
As you saw in class, vectorize the training examples, training labels, test examples and test labels. Recall that there is a built-in function for vectorizing the training and test labels.

#### Answer 8

In [None]:
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# YOUR CODE HERE


### Building the network
This topic classification problem looks very similar to our previous movie review classification problem: in both cases, we are trying to classify short snippets of text. There is, however, a new constraint here: the number of output classes has gone from 2 to 46, i.e. the dimensionality of the output space is much larger.

In a stack of `Dense` layers like the ones we have been using, each layer can only access information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each layer can potentially become an "information bottleneck". In our previous example, we used 16-dimensional intermediate layers, but a 16-dimensional space may be too limited to learn to separate 46 different classes: such small layers may act as information bottlenecks, permanently dropping relevant information.
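
The bottleneck effect can be made concrete with a simplified linear sketch. It assumes purely linear layers (no activation functions) with random weights, which is not how the actual network behaves, but it shows the core point: whatever passes through a 4-unit layer can span at most 4 dimensions, no matter how wide the layers after it are. All names and layer sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear layers 64 -> 4 -> 46 (activations omitted for simplicity).
w_narrow = rng.standard_normal((64, 4))
w_out = rng.standard_normal((4, 46))

# The same stack with a 64-unit middle layer instead: 64 -> 64 -> 46.
w_wide = rng.standard_normal((64, 64))
w_out_wide = rng.standard_normal((64, 46))

# The rank of the combined map bounds how many independent output
# directions can be distinguished after the middle layer.
narrow_rank = np.linalg.matrix_rank(w_narrow @ w_out)
wide_rank = np.linalg.matrix_rank(w_wide @ w_out_wide)
print(narrow_rank, wide_rank)
```

With the 4-unit middle layer, the combined map cannot have rank above 4, so the 46 outputs cannot vary independently; with the 64-unit layer there is no such restriction.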

#### Question 9
For this reason, you will build a network with more hidden units. Build a 3-layer network using 64 hidden units for each of the first 2 layers, and an appropriate number of units for the last layer. Be sure to choose an appropriate activation function for each layer.

#### Answer 9

In [None]:
from keras import models
from keras import layers

# YOUR CODE HERE

#### Question 10
Now compile the network using the `rmsprop` optimization algorithm, an appropriate loss function, and an appropriate performance measure.

#### Answer 10

In [None]:
# YOUR CODE HERE

### Validating the approach
#### Question 11 
Set apart the first 1,500 samples in the training data to be used as validation examples.

#### Answer 11

In [None]:
# YOUR CODE HERE

#### Question 12

Train the network for 20 epochs with a batch size of 512. Save the object as `history`.

#### Answer 12

In [None]:
# YOUR CODE HERE

### Plotting accuracy

Run the following code to display the loss and accuracy curves:

In [None]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure

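# The key is 'acc' or 'accuracy', depending on the Keras version and the metric name passed to compile()
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc = history.history[acc_key]
val_acc = history.history['val_' + acc_key]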

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#### Question 13
When does the network start overfitting?

#### Answer 13


#### Question 14
Train a new network from scratch for 8 epochs, and then evaluate it on the test set. Use the same model parameters as before (number of layers, hidden units, batch size, optimization algorithm, activation functions, performance measure).

#### Answer 14

In [None]:
# YOUR CODE HERE

#### Question 15

Print the results. Briefly discuss the results and whether the network performs better than a random baseline.
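
One possible way to estimate a random baseline is to compare a set of labels against a shuffled copy of itself. The sketch below does this on synthetic, uniformly distributed labels (an assumption made only so the example is self-contained); the names `fake_labels` and `baseline_acc` are hypothetical. To use the idea here, the real test labels would take the place of `fake_labels`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in labels (hypothetical); replace with the real test labels.
fake_labels = rng.integers(0, 46, size=2000)

# A "random classifier": predict a shuffled copy of the true labels,
# which preserves the class frequencies but ignores the inputs.
shuffled = rng.permutation(fake_labels)
baseline_acc = float(np.mean(fake_labels == shuffled))
print(f"random baseline accuracy: {baseline_acc:.3f}")
```

Because the real topic distribution is imbalanced, the baseline measured on the actual labels will generally differ from what uniform synthetic labels give.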

#### Answer 15

In [None]:
# YOUR CODE HERE

#### Question 16

Build the same network as before, but now with only 4 hidden units in your second layer. Train the network for 20 epochs and describe and comment on the difference in accuracy.

#### Answer 16

In [None]:
# YOUR CODE HERE

#### Question 17

Now build the original network but with 32 hidden units in each hidden layer. Train the network for 20 epochs and describe and comment on the difference in accuracy.

#### Answer 17

In [None]:
# YOUR CODE HERE

#### Question 18

Now build a network with 3 hidden layers with 64 hidden units in each layer. Train the network for 20 epochs and describe and comment on the difference in accuracy.

#### Answer 18

In [None]:
# YOUR CODE HERE

#### Question 19

Now build a network with 3 hidden layers with 128 hidden units in each layer. Train the network for 20 epochs and describe and comment on the difference in accuracy.

#### Answer 19

In [None]:
# YOUR CODE HERE

### Fighting Overfitting

#### Question 20
Use the original network from Question 14 and the smaller network from Question 16. Make a single plot that shows their validation loss across epochs. Comment on when each network starts overfitting.

#### Answer 20

In [None]:
# YOUR CODE HERE

#### Question 21

Now add 2 `Dropout` layers to the original network. Make a plot of the validation loss across epochs for the original network and the dropout-regularized network. Comment on the difference.

#### Answer 21

In [None]:
# YOUR CODE HERE

#### Question 22

Name another method to reduce overfitting.

#### Answer 22
