D2L Textbook Solution

In [1]:
import d2l
import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, data as gdata, nn

import time

import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Chapter 4
#### 4.1.4. Exercises

1. **Compute the derivative of the tanh and the pReLU activation function.**

    a. The derivative of the Tanh function is:
    $$\frac{d}{dx} \mathrm{tanh}(x) = 1 - \mathrm{tanh}^2(x).$$
    b. The derivative of the pReLU function is:    
    $$\mathrm{pReLU}(x) = \max(0, x) + \alpha \min(0, x)$$
    $$\begin{equation}
          \frac{d}{dx} \mathrm{pReLU}(x) =
                \begin{cases}
                  1 & \text{if $x > 0$}\\
                  \text{undefined} & \text{if $x = 0$}\\
                  \alpha & \text{if $x < 0$}
                \end{cases}       
        \end{equation}$$
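Both derivatives can be sanity-checked with a central finite difference. The sketch below is only an illustration: the helper names `prelu` and `central_diff` and the slope `alpha = 0.25` are assumptions, not part of the textbook.

```python
import math

def prelu(x, alpha):
    # pReLU(x) = max(0, x) + alpha * min(0, x)
    return max(0.0, x) + alpha * min(0.0, x)

def central_diff(f, x, eps=1e-6):
    # Symmetric finite-difference approximation of f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

alpha = 0.25  # hypothetical slope for the negative part
points = [-2.0, -0.5, 0.7, 3.0]  # x = 0 is skipped: the derivative is undefined there

tanh_errors, prelu_errors = [], []
for x in points:
    # Closed form for tanh: 1 - tanh(x)^2
    tanh_errors.append(abs(central_diff(math.tanh, x) - (1 - math.tanh(x) ** 2)))
    # Closed form for pReLU: 1 for x > 0, alpha for x < 0
    expected = 1.0 if x > 0 else alpha
    prelu_errors.append(abs(central_diff(lambda t: prelu(t, alpha), x) - expected))

print(max(tanh_errors), max(prelu_errors))
```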


2. **Show that a multilayer perceptron using only ReLU (or pReLU) constructs a continuous piecewise linear function.**

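    By definition, a `continuous piecewise linear function` is a real-valued function whose graph is composed of connected straight-line sections. Both ReLU and pReLU are of this form:
    $$\begin{equation}
      \mathrm{pReLU}(x) =
            \begin{cases}
              x & \text{if $x \geq 0$}\\
              \alpha x & \text{if $x < 0$}
            \end{cases}       
    \end{equation}$$
    Each layer of the network applies an affine map followed by this activation. Affine maps are continuous and linear, and sums and compositions of continuous piecewise linear functions are again continuous and piecewise linear, so the whole network is a continuous piecewise linear function (ReLU is the special case $\alpha = 0$).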
    
3. **Show that  $tanh(𝑥)+1=2sigmoid(2𝑥)$.**

    $$ LHS = \text{tanh}(x)+1 = \frac{1 - \exp(-2x)}{1 + \exp(-2x)} + 1
                    =\frac{2}{1 + \exp(-2x)} .$$
    $$ RHS = 2 \mathrm{sigmoid}(2x) = \frac{2}{1 + \exp(-2x)}.$$
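The identity can also be checked numerically at a few points. This is a small sketch; the `sigmoid` helper and the sample points are chosen here for illustration.

```python
import math

def sigmoid(x):
    # Logistic sigmoid 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
# Largest absolute gap between tanh(x) + 1 and 2 * sigmoid(2x)
max_gap = max(abs((math.tanh(x) + 1.0) - 2.0 * sigmoid(2.0 * x)) for x in xs)
print(max_gap)
```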


4. **Assume we have a multilayer perceptron without nonlinearities between the layers. In particular, assume that we have  𝑑  input dimensions,  𝑑  output dimensions and that one of the layers had only  𝑑/2  dimensions. Show that this network is less expressive (powerful) than a single layer perceptron.**

    A multilayer perceptron without nonlinearities is equivalent to a single-layer perceptron:
    $$ {\hat{\mathbf{y}}} = \mathbf{W_2}(\mathbf{W_1}X + b_1) + b_2 = \mathbf{W_2} \mathbf{W_1} X + (\mathbf{W_2} b_1 + b_2) := \mathbf{W_3}X + b_3$$
    
    Hence, if any layer has only $d/2$ dimensions, the rank of $\mathbf{W_3}$ is at most $d/2$. The network can then only produce outputs in an affine subspace of dimension at most $d/2$, so it cannot represent every linear map to a $d$-dimensional output (for example, the identity). A single-layer perceptron with a $d \times d$ weight matrix has no such restriction, because its weight matrix can have full rank $d$, so it is strictly more expressive.
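The rank argument can be illustrated numerically. The sketch below assumes random Gaussian weights and an arbitrary size `d = 8`; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical input/output dimension

W1 = rng.normal(size=(d // 2, d))   # first layer: d -> d/2 (the bottleneck)
W2 = rng.normal(size=(d, d // 2))   # second layer: d/2 -> d
W3 = W2 @ W1                        # collapsed linear map, shape (d, d)
W_single = rng.normal(size=(d, d))  # weight matrix of a single-layer perceptron

rank_bottleneck = np.linalg.matrix_rank(W3)
rank_single = np.linalg.matrix_rank(W_single)
print(rank_bottleneck, rank_single)
```

The collapsed matrix has shape `(d, d)` but its rank cannot exceed the width of the narrowest layer.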
    

5. **Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of problems do you expect this to cause?**

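    A minibatch may not be as representative as the whole dataset. If the nonlinearity depends on the minibatch, the output for one example depends on which other examples share its batch. As a result, the parameters may receive noisy gradients and training can be harder to converge.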


#### 4.2.7. Exercises

1. **Change the value of the hyper-parameter `num_hiddens` in order to see how this hyperparameter influences your results.**
    
    A common rule of thumb (a heuristic, not a guarantee) for an upper bound on the number of hidden neurons that won't result in over-fitting is:
    $$N_h = \frac{N_s} {\alpha \cdot (N_i + N_o)}$$

    where $\alpha$ is an arbitrary scaling factor (usually 2-10), $N_i$ is the number of input neurons, $N_o$ is the number of output neurons, and $N_s$ is the number of samples in the training data set.

    Below this upper bound, a larger `num_hiddens` tends to give better results.
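As a rough worked example of this rule of thumb, the sketch below plugs in the Fashion-MNIST sizes (60000 training samples, 784 inputs, 10 outputs). The helper name `hidden_upper_bound` and the chosen values of alpha are assumptions made for illustration.

```python
def hidden_upper_bound(n_samples, n_inputs, n_outputs, alpha):
    # N_h = N_s / (alpha * (N_i + N_o))
    return n_samples / (alpha * (n_inputs + n_outputs))

# Assumed sizes: Fashion-MNIST training set with flattened 28x28 images
n_samples, n_inputs, n_outputs = 60000, 784, 10
bounds = {alpha: hidden_upper_bound(n_samples, n_inputs, n_outputs, alpha)
          for alpha in (2, 5, 10)}

for alpha, bound in bounds.items():
    print(f"alpha={alpha}: N_h <= {bound:.1f}")
```

The heuristic is quite conservative, so it is best read as a starting point for the experiment rather than a hard limit.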
    
    
2. **Try adding a new hidden layer to see how it affects the results.**

    In general, adding a new hidden layer to a shallow network may improve accuracy. Wide and shallow networks are very good at memorization but not as good at generalization, whereas multiple layers tend to generalize better because they learn the intermediate features between the raw data and the high-level classification.
    
    
3. **How does changing the learning rate change the result.**
    
    If the learning rate is too high, gradient descent may overshoot the minimum and fail to converge. If it is too low, convergence can be slow. 
    


#### 4.3.2. Exercises

1. **Try adding a few more hidden layers to see how the result changes.**

2. **Try out different activation functions. Which ones work best?**
    
    Sigmoid and tanh both suffer from the vanishing gradient problem. 
    
    ReLU mitigates the problem but can lead to "dead neurons" (units whose weights never get updated because their output is always zero). Leaky ReLU addresses the problem of dying neurons. Also, ReLU is normally used only in the hidden layers of a neural network.
    
    Softmax can be used in the output layer of a classification model.
    

3. **Try out different initializations of the weights.**

    Zero initialization. All hidden units in a layer compute the same output and receive the same gradient of the loss, so their weights stay identical throughout training.
    
    Random initialization. Weights are initialized randomly, for example from a normal distribution. Depending on the scale, this may suffer from vanishing or exploding gradients.
    
    Xavier initialization. The initializer fills the weights with random numbers in the range $[-c, c]$, where $c = \sqrt{\frac{3}{0.5 \cdot (n_{in} + n_{out})}}$. $n_{in}$ is the number of neurons feeding into the weights, and $n_{out}$ is the number of neurons the result is fed to.
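The Xavier bound can be computed and checked by sampling. The sketch below uses only the standard library; the layer sizes and the helper name `xavier_bound` are assumptions for illustration. A uniform distribution on $[-c, c]$ has variance $c^2/3 = 2/(n_{in} + n_{out})$.

```python
import math
import random

def xavier_bound(n_in, n_out):
    # c = sqrt(3 / (0.5 * (n_in + n_out)))
    return math.sqrt(3.0 / (0.5 * (n_in + n_out)))

random.seed(0)
n_in, n_out = 256, 128  # hypothetical layer sizes
c = xavier_bound(n_in, n_out)

# Draw one weight per connection from U[-c, c]
weights = [random.uniform(-c, c) for _ in range(n_in * n_out)]
mean = sum(weights) / len(weights)
sample_var = sum((w - mean) ** 2 for w in weights) / len(weights)
target_var = 2.0 / (n_in + n_out)

print(c, sample_var, target_var)
```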


#### 4.4.6. Exercises

1. **Can you solve the polynomial regression problem exactly? Hint - use linear algebra.**

    Given the polynomial regression samples $\{(X_i,Y_i)\}_{i=1}^n$, we have $Y =\beta_0+\beta_1X+\beta_2X^2+\cdots+\beta_pX^p+\varepsilon$, or in matrix form $\mathbf{y}=\mathbf{X}\boldsymbol\beta+\boldsymbol\varepsilon$,
        
    where
    $$\mathbf{X}=\begin{pmatrix}1 & X_1 & \cdots & X_1^p\\
                \vdots & \vdots &\ddots &\vdots\\
                    1 & X_n & \cdots & X_n^p
                \end{pmatrix}_{n\times(p+1)}.$$

    Hence, $\hat{\boldsymbol\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},$ where $\mathbf{y}=(Y_1,\ldots,Y_n)^T$. 
    
    Note that $\mathbf{X}^T\mathbf{X}$ has dimension $(p+1)\times(p+1)$ and $\mathrm{rank}(\mathbf{X}^T\mathbf{X})=\mathrm{rank}(\mathbf{X})\le\min(n, p+1)$, with equality when the $X_i$ are distinct. Then, if $p=n-1$, $\mathbf{X}^T\mathbf{X}$ has dimension $n\times n$ and its rank is $n$, so it is invertible. But if $p=n$, the dimension of $\mathbf{X}^T\mathbf{X}$ is $(n+1)\times(n+1)$ while the rank remains $n$, so in that case (and also if $p>n$) it is not invertible (linear dependence arises).
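A minimal NumPy sketch of the closed-form solution follows. The cubic coefficients, noise level and sample size are assumptions chosen for illustration; the normal-equation solve is compared against `np.linalg.lstsq`, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([5.0, 1.2, -3.4, 5.6])  # hypothetical coefficients beta_0..beta_3

x = rng.uniform(-1.0, 1.0, size=50)
# Design matrix with columns 1, x, x^2, x^3
X = np.vander(x, N=len(true_beta), increasing=True)
y = X @ true_beta + rng.normal(scale=0.01, size=x.shape)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Same least-squares problem solved without forming X^T X explicitly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```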
    
    
2. **Model selection for polynomials**

    a. **Plot the training error vs. model complexity (degree of the polynomial). What do you observe?**

    b. **Plot the test error in this case.**

    c. **Generate the same graph as a function of the amount of data?**

    See example in *4.4.4. Polynomial Regression*.
    
    

3. **What happens if you drop the normalization of the polynomial features  $𝑥_𝑖$  by  1/𝑖! . Can you fix this in some other way?**

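    Without the $1/i!$ normalization, the features $x^i$ can take very large values for large exponents $i$, which may lead to very large gradients and losses. Another way to keep the values in a reasonable range is to rescale the features, for example by standardizing each feature column.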


4. **What degree of polynomial do you need to reduce the training error to 0?**

    As explained in Q1, if $p=n-1$ (and the $X_i$ are distinct), $\mathbf{X}^T\mathbf{X}$ has dimension $n\times n$ and its rank is $n$. Then $\hat{\boldsymbol\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ solves $\mathbf{X}\boldsymbol\beta=\mathbf{y}$ exactly, and hence the training error is 0.
    

5. **Can you ever expect to see 0 generalization error?**

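    Yes, but only in special cases. For example, if every item of the test set (features and labels) also appears in the training set, a model that fits the training set perfectly will show 0 error on that test set. With noisy data and a genuinely held-out test set, a generalization error of exactly 0 should not be expected.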


#### 4.5.6. Exercises

1. **Experiment with the value of  𝜆  in the estimation problem in this page. Plot training and test accuracy as a function of  𝜆 . What do you observe?**


In [None]:
def fit_and_plot_lambda(wd_list, num_epochs):
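    '''
    wd_list : a list of numbers, each one a weight_decay value to try.
    Relies on lr, loss, batch_size, train_iter and the train/test features
    and labels defined elsewhere in the notebook.
    '''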
    train_ls, test_ls = [], []
    for wd in wd_list:
        net = nn.Sequential()
        net.add(nn.Dense(1))
        net.initialize(init.Normal(sigma=0.1))
        trainer_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd',
                                  {'learning_rate': lr, 'wd': wd})
        trainer_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd',
                                  {'learning_rate': lr})
        
        for _ in range(num_epochs):
            for X, y in train_iter:
                with autograd.record():
                    l = loss(net(X), y)
                l.backward()
                # Call the step function on each of the two Trainer instances to
                # update the weight and bias separately
                trainer_w.step(batch_size)
                trainer_b.step(batch_size)
        train_ls.append(loss(net(train_features),train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features),test_labels).mean().asscalar())
    d2l.semilogy(wd_list, train_ls, 'weight_decay', 'loss',
                     wd_list, test_ls, ['train', 'test'])
    
fit_and_plot_lambda(wd_list=range(10), num_epochs=100)



2. **Use a validation set to find the optimal value of  𝜆 . Is it really the optimal value? Does this matter?**

    Given different hyperparameters, network architectures and datasets, the optimal weight decay will vary. The value found on a validation set is therefore only optimal for that particular setting, and there is no global optimal value. 


3. **What would the update equations look like if instead of  $‖𝐰‖^2$  we used  $∑_𝑖|𝑤_𝑖|$  as our penalty of choice (this is called  ℓ1  regularization).**

    For L2, the loss function and the corresponding stochastic gradient descent update are:
    $$l(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
    $$\begin{aligned}
\mathbf{w} & \leftarrow \left(1- \frac{\eta\lambda}{|\mathcal{B}|} \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),
\end{aligned}$$

    For L1, the loss function and the corresponding stochastic gradient descent update are:
    $$l(\mathbf{w}, b) + \frac{\lambda}{2} \sum_i |w_i|$$
    $$\begin{aligned}
\mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta\lambda}{2 |\mathcal{B}|} \mathrm{sign}(\mathbf{w}) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),
\end{aligned}$$
    where $\mathrm{sign}$ is applied elementwise. Unlike the L2 update, which shrinks each weight in proportion to its size, the L1 update moves every nonzero weight toward zero by the same constant amount.
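The two updates can be compared side by side in a few lines of NumPy. This is a sketch under the same scaling conventions as the equations above; the function name `sgd_step`, the data and the hyperparameters are assumptions. The L1 penalty step is written equivalently with an elementwise `sign`, since $w_i / |w_i| = \mathrm{sign}(w_i)$.

```python
import numpy as np

def sgd_step(w, b, X, y, lr, lam, penalty):
    """One minibatch SGD step on w for squared loss plus an L1 or L2 penalty."""
    n = X.shape[0]                      # minibatch size |B|
    residual = X @ w + b - y            # w^T x + b - y for every example
    data_grad = X.T @ residual / n      # averaged gradient of the squared loss
    if penalty == "l2":
        reg_grad = lam * w / n          # gives the (1 - lr * lam / |B|) * w shrinkage
    elif penalty == "l1":
        reg_grad = lam * np.sign(w) / (2 * n)  # constant-size pull toward zero
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return w - lr * (data_grad + reg_grad)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))             # hypothetical minibatch of 4 examples
y = rng.normal(size=4)
w0 = np.array([1.0, -2.0, 0.0])

print(sgd_step(w0, 0.0, X, y, lr=0.1, lam=2.0, penalty="l2"))
print(sgd_step(w0, 0.0, X, y, lr=0.1, lam=2.0, penalty="l1"))
```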



4. **We know that  $‖𝐰‖^2=𝐰^⊤𝐰$ . Can you find a similar equation for matrices (mathematicians call this the Frobenius norm)?**

    ![title](image/textbook_solution_4.5.6.png)
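For reference, the standard matrix analogue is $\|\mathbf{A}\|_F^2 = \sum_{i,j} a_{ij}^2 = \mathrm{tr}(\mathbf{A}^\top \mathbf{A})$. The short NumPy check below uses an arbitrary example matrix chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [0.5, 4.0, -1.0]])  # arbitrary example matrix

fro_sq_sum = np.sum(A ** 2)                       # sum of squared entries
fro_sq_trace = np.trace(A.T @ A)                  # trace of A^T A
fro_sq_numpy = np.linalg.norm(A, ord="fro") ** 2  # NumPy's Frobenius norm, squared

print(fro_sq_sum, fro_sq_trace, fro_sq_numpy)
```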
    
    
5. **Review the relationship between training error and generalization error. In addition to weight decay, increased training, and the use of a model of suitable complexity, what other ways can you think of to deal with overfitting?**

    Add more training examples; Increase dropout; Decrease features, etc.
    

6. **In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via  $𝑝(𝑤|𝑥)∝𝑝(𝑥|𝑤)𝑝(𝑤)$ . How can you identify  𝑝(𝑤)  with regularization?**



#### 4.6.7. Exercises

1. **Try out what happens if you change the dropout probabilities for layers 1 and 2. In particular, what happens if you switch the ones for both layers?**


In [None]:
## TODO: Run on GPU

num_epochs, lr, batch_size = 10, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
drop_prob1, drop_prob2 = 0.2, 0.5

## network 1
net467_1 = nn.Sequential()
net467_1.add(nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob1),
        nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob2),
        nn.Dense(10))
net467_1.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(net467_1.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net467_1, train_iter, test_iter, loss, num_epochs, batch_size, None,
              None, trainer)

In [None]:
## TODO: Run on GPU

## network 2, switch dropout rate
net467_2 = nn.Sequential()
net467_2.add(nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob2),
        nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob1),
        nn.Dense(10))
net467_2.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(net467_2.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net467_2, train_iter, test_iter, loss, num_epochs, batch_size, None,
              None, trainer)


2. **Increase the number of epochs and compare the results obtained when using dropout with those when not using it.**

In [None]:
## TODO: Run on GPU

## network 3
num_epochs = 50
net467_3 = nn.Sequential()
net467_3.add(nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob2),
        nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob1),
        nn.Dense(10))
net467_3.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(net467_3.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net467_3, train_iter, test_iter, loss, num_epochs, batch_size, None,
              None, trainer)

3. **Compute the variance of the activation random variables after applying dropout.**

    Dropout replaces an activation $h$ with a random variable $h'$ whose expected value is $h$ and whose variance depends on the dropout probability $p$.
    
    $$\begin{split}\begin{aligned}
        h' =
        \begin{cases}
            0 & \text{ with probability } p \\
            \frac{h}{1-p} & \text{ otherwise}
        \end{cases}
        \end{aligned}\end{split}$$

    Since $E[h'^2] = (1-p) \frac{h^2}{(1-p)^2} = \frac{h^2}{1-p}$, the variance is
    $$\mathrm{Var}(h') = E[h'^2] - (E[h'])^2 = \frac{h^2}{1-p} - h^2 = \frac{p}{1-p} h^2.$$
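A quick Monte Carlo simulation can be used to check the mean and variance of $h'$ against the closed form $\frac{p}{1-p} h^2$. The sketch below uses only the standard library; the values of `h`, `p` and the sample count are assumptions.

```python
import random

def dropout(h, p, rng):
    # Zero the activation with probability p, otherwise rescale by 1 / (1 - p)
    return 0.0 if rng.random() < p else h / (1.0 - p)

rng = random.Random(0)
h, p, n = 2.0, 0.3, 200_000  # hypothetical activation, drop probability, sample count

samples = [dropout(h, p, rng) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
closed_form = p / (1.0 - p) * h ** 2

print(mean, var, closed_form)
```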



4. **Why should you typically not use dropout?**

    Dropout can help with regularization, but at the risk of losing important information. In particular, applying dropout in the first layer can lead to significant information loss.
    

5. **If changes are made to the model to make it more complex, such as adding hidden layer units, will the effect of using dropout to cope with overfitting be more obvious?**

    For an overfitted model, adding a hidden layer with dropout may not help. This is especially true when the number of effective (non-dropped) neurons in this layer is larger than the number of neurons in the later layers, since this amounts to simply adding an extra hidden layer. 
   

6. **Using the model in this section as an example, compare the effects of using dropout and weight decay. What if dropout and weight decay are used at the same time?**


In [None]:
## TODO: Run on GPU

## random simulate dataset
n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = nd.ones((num_inputs, 1)) * 0.01, 0.05

features = nd.random.normal(shape=(n_train + n_test, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]
train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)



def fit_and_plot_gluon_467_6(dropout, wd, num_epochs=50, lr=0.01, batch_size=256):
    net = nn.Sequential()
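    net.add(nn.Dense(256),
            nn.Dropout(dropout),  # use the `dropout` argument (drop_prob is not defined)
            nn.Dense(1))  # single output to match the (n, 1) regression labels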
    net.initialize(init.Normal(sigma=0.01))
    loss = gloss.L2Loss()

    trainer_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd',
                              {'learning_rate': lr, 'wd': wd})
    # The bias parameter has not decayed. Bias names generally end with "bias"
    trainer_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd',
                              {'learning_rate': lr})
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            # Call the step function on each of the two Trainer instances to
            # update the weight and bias separately
            trainer_w.step(batch_size)
            trainer_b.step(batch_size)
        train_ls.append(loss(net(train_features),
                             train_labels).mean().asscalar())
        test_ls.append(loss(net(test_features),
                            test_labels).mean().asscalar())
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
                 range(1, num_epochs + 1), test_ls, ['train', 'test'])
#     print('L2 norm of w:', net[0].weight.data().norm().asscalar())

fit_and_plot_gluon_467_6(dropout=0.5, wd=0) 
fit_and_plot_gluon_467_6(dropout=0.5, wd=3)
fit_and_plot_gluon_467_6(dropout=0, wd=3) 



7. **What happens if we apply dropout to the individual weights of the weight matrix rather than the activations?**

    The regularization effect will be similar. If we set some of the weights to zero, the corresponding connections carry no signal from the inputs, which works much like dropout on activations, except that individual connections rather than whole units are removed.


8. **Replace the dropout activation with a random variable that takes on values of  $[0,𝛾/2,𝛾]$ . Can you design something that works better than the binary dropout function? Why might you want to use it? Why not?**

    Define the following dropout activation function:

    $$\begin{split}\begin{aligned}
    h' =
    \begin{cases}
        0 & \text{ with probability } p_1 \\
        \frac{\gamma}{2} h & \text{ with probability } (1 - p_1)  p_2 \\
        \gamma h & \text{ with probability } (1 - p_1)(1 - p_2)
    \end{cases}
    \end{aligned}\end{split}$$

    To keep the expectation unchanged, we need
    $$ 0 + \frac{\gamma}{2} h (1 - p_1)  p_2 +  \gamma  h  (1 - p_1)(1 - p_2)  = h$$
    Thus,
     $$\gamma = \frac{2}{(2-p_2)(1-p_1)}$$

    For example, if we let $p_1=0.2, p_2=0.75$, then $\gamma = 2$ by the above formula,
    
     $$\begin{split}\begin{aligned}
    h' =
    \begin{cases}
        0 & \text{ with probability } 0.2 \\
        h & \text{ with probability } 0.6 \\
        2 h & \text{ with probability } 0.2
    \end{cases}
    \end{aligned}\end{split}$$
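The example can be verified with exact rational arithmetic. The sketch below recomputes $\gamma$, the expected scaling factor and its variance for $p_1 = 0.2$, $p_2 = 0.75$; the helper name `gamma_for` is an assumption.

```python
from fractions import Fraction

def gamma_for(p1, p2):
    # gamma = 2 / ((2 - p2) * (1 - p1)), chosen so that E[h'] = h
    return Fraction(2) / ((2 - p2) * (1 - p1))

p1, p2 = Fraction(1, 5), Fraction(3, 4)
gamma = gamma_for(p1, p2)

# (scaling factor applied to h, probability) for the three outcomes
outcomes = [
    (Fraction(0), p1),
    (gamma / 2, (1 - p1) * p2),
    (gamma, (1 - p1) * (1 - p2)),
]

expected_scale = sum(scale * prob for scale, prob in outcomes)
second_moment = sum(scale ** 2 * prob for scale, prob in outcomes)
variance_scale = second_moment - expected_scale ** 2  # Var(h') / h^2

print(gamma, expected_scale, variance_scale)
```

The variance factor gives one way to compare this scheme with binary dropout, whose factor is $p/(1-p)$ for drop probability $p$.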


#### 4.7.6. Exercises

1. **Assume that the inputs  X  are matrices. What is the dimensionality of the gradients?**

    Notice that the dimensionality of each layer's gradient equals the dimensionality of that layer's weights, i.e. 
    $$ \dim\left(\frac{\partial J}{\partial \mathbf{W}^{(1)}}\right) = \dim(\mathbf{W}^{(1)})$$
    
    If each input $X \in \mathbb{R}^{d \times c}$ is a matrix, the weight $\mathbf{W}^{(1)}$ (and hence its gradient) has dimension ${h \times (d \times  c)}$, where $h$ is the hidden layer dimension.
    

2. **Add a bias to the hidden layer of the model described in this chapter.**
    a. **Draw the corresponding compute graph.**
    b. **Derive the forward and backward propagation equations.**
    
    b. Given a hidden layer with a bias:
    $$\mathbf{z}= \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}\\
      \mathbf{h}= \phi (\mathbf{z}) \\
      \mathbf{o}= \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}\\
      L = l(\mathbf{o}, y) \\
      s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right) \\
      J = L + s$$
      
      hence the backward propagation will be
      $$ \frac{\partial J}{\partial \mathbf{W}^{(2)}}
        = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right)
        = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}$$
        
      $$ \frac{\partial J}{\partial \mathbf{b}^{(2)}}
        = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{b}^{(2)}}\right)  + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{b}^{(2)}}\right)
        = \frac{\partial J}{\partial \mathbf{o}} \times 1 + 0
        = \frac{\partial J}{\partial \mathbf{o}}$$
      $$ \frac{\partial J}{\partial \mathbf{W}^{(1)}}
        = \frac{\partial J}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}} + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right)
        = \left(\left({\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}\right) \odot \phi'\left(\mathbf{z}\right)\right) {\mathbf{x}}^\top + \lambda \mathbf{W}^{(1)}$$
        
      $$ \frac{\partial J}{\partial \mathbf{b}^{(1)}}
        = \frac{\partial J}{\partial \mathbf{o}} \frac{\partial \mathbf{o}}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}^{(1)}} + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{b}^{(1)}}\right)
        = \left({\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}\right) \odot \phi'\left(\mathbf{z}\right)$$
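The backward pass can be checked against finite differences. The NumPy sketch below is an illustration under assumptions not in the text: a single example, $\phi = \tanh$, the squared loss $l(\mathbf{o}, y) = \frac{1}{2}\|\mathbf{o} - y\|^2$, and hypothetical layer sizes. It computes $\frac{\partial J}{\partial \mathbf{z}}$ as $\left({\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}\right) \odot \phi'(\mathbf{z})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, q, lam = 4, 5, 3, 0.1  # hypothetical input, hidden, output sizes and lambda
x, y = rng.normal(size=d), rng.normal(size=q)
params = {
    "W1": rng.normal(size=(h, d)), "b1": rng.normal(size=h),
    "W2": rng.normal(size=(q, h)), "b2": rng.normal(size=q),
}

def objective(p):
    # Forward pass: z -> h = tanh(z) -> o, then J = L + s
    z = p["W1"] @ x + p["b1"]
    o = p["W2"] @ np.tanh(z) + p["b2"]
    loss = 0.5 * np.sum((o - y) ** 2)
    reg = 0.5 * lam * (np.sum(p["W1"] ** 2) + np.sum(p["W2"] ** 2))
    return loss + reg

def backward(p):
    z = p["W1"] @ x + p["b1"]
    hid = np.tanh(z)
    o = p["W2"] @ hid + p["b2"]
    dJ_do = o - y                                   # gradient of the squared loss
    dJ_dz = (p["W2"].T @ dJ_do) * (1.0 - hid ** 2)  # tanh'(z) = 1 - tanh(z)^2
    return {
        "W2": np.outer(dJ_do, hid) + lam * p["W2"],
        "b2": dJ_do,
        "W1": np.outer(dJ_dz, x) + lam * p["W1"],
        "b1": dJ_dz,
    }

def numerical_grad(p, name, eps=1e-6):
    # Central finite differences, one parameter entry at a time
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(grad.shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        plus = objective(p)
        p[name][idx] = old - eps
        minus = objective(p)
        p[name][idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

analytic = backward(params)
max_errors = {name: float(np.max(np.abs(analytic[name] - numerical_grad(params, name))))
              for name in params}
print(max_errors)
```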


3. **Compute the memory footprint for training and inference in model described in the current chapter.**

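    Training: we need memory for the parameters, the intermediate values kept for backpropagation and the gradients, i.e. $\frac{\partial J}{\partial \mathbf{o}}, {\mathbf{W}^{(2)}}, {\mathbf{W}^{(1)}}, \mathbf{h}, \phi'\left(\mathbf{z}\right), {\mathbf{x}}$.

    Inference: we only need memory for $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(1)}, \mathbf{b}^{(2)}$ and the activations of the current layer.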


4. **Assume that you want to compute second derivatives. What happens to the compute graph? Is this a good idea?**

    

5. **Assume that the compute graph is too large for your GPU.**

    a. **Can you partition it over more than one GPU?**

    By default, MXNet uses data parallelism to partition the workload over multiple devices. Assume there are n devices. Then each one receives a copy of the complete model and trains it on 1/n of the data. Results such as gradients and the updated model are communicated across these devices.

    b. **What are the advantages and disadvantages over training on a smaller minibatch?**

    Advantages of minibatch training:

    1. More robust convergence that can avoid local minima (its model update frequency is higher than that of batch gradient descent).
    2. More computationally efficient than stochastic gradient descent.
    3. More memory efficient than batch gradient descent.

    Disadvantages:

    1. The minibatch size is an additional hyperparameter to tune.

#### 4.8.4. Exercises

1. **Can you design other cases of symmetry breaking besides the permutation symmetry?**

2. **Can we initialize all weight parameters in linear regression or in softmax regression to the same value?**


3. **Look up analytic bounds on the eigenvalues of the product of two matrices. What does this tell you about ensuring that gradients are well conditioned?**


4. **If we know that some terms diverge, can we fix this after the fact? Look at the paper on LARS by You, Gitman and Ginsburg, 2017 for inspiration.**




#### 4.9.5. Exercises

1. **What could happen when we change the behavior of a search engine? What might the users do? What about the advertisers?**

2. **Implement a covariate shift detector. Hint - build a classifier.**


3. **Implement a covariate shift corrector.**


4. **What could go wrong if training and test set are very different? What would happen to the sample weights?**


### Chapter 5
#### 5.1.6. Exercises

1. **What kind of error message will you get when calling an `__init__` method whose parent class not in the `__init__` function of the parent class?**

    InitializationError, i.e. cannot initialize the related parameters (the weights).
    

2. **What kinds of problems will occur if you remove the asscalar function in the FancyMLP class?**

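    `asscalar` returns a Python scalar whose value is copied from the resulting array. The `while` and `if` conditions in `FancyMLP` rely on it to compare the norm against a number.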
    
    
3. **What kinds of problems will occur if you change self.net defined by the Sequential instance in the NestMLP class to self.net = [nn.Dense(64, activation='relu'), nn. Dense(32, activation='relu')]?**

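    If we change *nn.Sequential()* to the above list, we can no longer add further layers to NestMLP with `add`, since concatenating layers and blocks is what Sequential provides. A plain Python list is also not callable, and the layers inside it are not registered as children of the block. The following code raises the error.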

In [8]:
class NestMLP_exercise(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP_exercise, self).__init__(**kwargs)
        self.net = [nn.Dense(64, activation='relu'), nn. Dense(32, activation='relu')]  # nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, x):
        return self.dense(self.net(x))

net = nn.Sequential()
net.add(NestMLP_exercise(), nn.Dense(20))
net.initialize()


AttributeError: 'list' object has no attribute 'add'

4. **Implement a block that takes two blocks as an argument, say net1 and net2 and returns the concatenated output of both networks in the forward pass (this is also called a parallel block).**


In [18]:
class ParallelMLP(nn.Block):
    def __init__(self, net1, net2, **kwargs):
        super(ParallelMLP, self).__init__(**kwargs)
        # Wrap each list of layers in a Sequential block so that its
        # parameters are registered, then add a scalar output layer
        self.input_net1 = nn.Sequential()
        for item1 in net1:
            self.input_net1.add(item1)
        self.input_net1.add(nn.Dense(1))
        
        self.input_net2 = nn.Sequential()
        for item2 in net2:
            self.input_net2.add(item2)
        self.input_net2.add(nn.Dense(1))

    def forward(self, x):
        out1 = self.input_net1(x) ## shape of out1 : (batch_size, 1)
        out2 = self.input_net2(x) ## shape of out2 : (batch_size, 1)
        out = mx.nd.concat(out1, out2, dim=-1) ## shape of out : (batch_size, 2)
        return out

    
x = nd.random.uniform(shape=(2, 20))
net1 = [nn.Dense(64, activation='relu'), nn.Dense(32, activation='relu')]
net2 = [nn.Dense(64, activation='relu')]
parallel = ParallelMLP(net1, net2)  # pass the two networks as arguments
parallel.initialize()
parallel(x)


[[ 0.00052053 -0.03974626]
 [-0.00133751 -0.01168625]]
<NDArray 2x2 @cpu(0)>

5. **Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and build a larger network from it.**


In [30]:
out = mx.nd.zeros(shape=(2,3))
out


[[0. 0. 0.]
 [0. 0. 0.]]
<NDArray 2x3 @cpu(0)>

In [33]:
class LargeNetwork(nn.Block):
    def __init__(self, net, **kwargs):
        super(LargeNetwork, self).__init__(**kwargs)
        self.input_net = nn.Sequential()
        for item in net:
            self.input_net.add(item)
        self.input_net.add(nn.Dense(1))
        self.large_net = {}
        
    def __init_large_net(self, length):
        for i in range(length):
            self.large_net[i] = self.input_net
        
    def forward(self, input_list):
        '''
        input_list is a list of input instance of same shape
        '''
        out = mx.nd.zeros(shape=(len(input_list), len(input_list[0]), 1)) ## 1 is the output shape for this input_net
        
        ## initial large net if it does not exist
        if len(self.large_net.keys()) == 0:
            self.__init_large_net(len(input_list))
            
        for j, instance in enumerate(input_list):
            net = self.large_net[j]
            out[j,:] = net(instance) ## shape of out1 : (batch_size, 1)
        return mx.nd.concat(out, dim=-1)

    
x = nd.random.uniform(shape=(2, 20))
y = nd.random.uniform(shape=(2, 20))
z = nd.random.uniform(shape=(2, 20))

net = [nn.Dense(64, activation='relu'), nn.Dense(10, activation='relu')]
large_network = LargeNetwork(net)
large_network.initialize()
large_network([x,y,z])


[[[-0.00043408]
  [ 0.00071604]]

 [[-0.00023887]
  [ 0.00123192]]

 [[ 0.00160318]
  [ 0.00164159]]]
<NDArray 3x2x1 @cpu(0)>

#### 5.2.5. Exercises

1. **Use the FancyMLP definition of the previous section and access the parameters of the various layers.**


In [34]:
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with the get_constant are not
        # iterated during training (i.e. constant parameters)
        self.rand_weight = self.params.get_constant(
            'rand_weight', nd.random.uniform(shape=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created, as well as the relu and dot
        # functions of NDArray
        x = nd.relu(nd.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        x = self.dense(x)
        # Here in Control flow, we need to call asscalar to return the scalar
        # for comparison
        while x.norm().asscalar() > 1:
            x /= 2
        if x.norm().asscalar() < 0.8:
            x *= 10
        return x.sum()
    
x = nd.random.uniform(shape=(2, 20))
net = FancyMLP()
net.initialize()
net(x)
print(net.collect_params())

fancymlp0_ (
  Constant fancymlp0_rand_weight (shape=(20, 20), dtype=<class 'numpy.float32'>)
  Parameter dense76_weight (shape=(20, 20), dtype=float32)
  Parameter dense76_bias (shape=(20,), dtype=float32)
)


2. **Look at the MXNet documentation and explore different initializers.**
    
    Constant, Normal, Xavier, Orthogonal, MSRAPrelu, etc.
    http://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.initializer.Mixed
    

3. **Try accessing the model parameters after net.initialize() and before net(x) to observe the shape of the model parameters. What changes? Why?**

    The shape of the layer's weight changes. In the example below, "dense80_weight" has shape=(20, 0) before net(x) and shape=(20, 20) afterwards. The input dimension is unknown until the first forward pass, when it is inferred from the last dimension of x, which is 20.

In [40]:
print(x.shape, "\n")
net3 = FancyMLP()
net3.initialize()
print(net3.collect_params())
net3(x)
print(net3.collect_params())

(2, 20) 

fancymlp4_ (
  Constant fancymlp4_rand_weight (shape=(20, 20), dtype=<class 'numpy.float32'>)
  Parameter dense80_weight (shape=(20, 0), dtype=float32)
  Parameter dense80_bias (shape=(20,), dtype=float32)
)
fancymlp4_ (
  Constant fancymlp4_rand_weight (shape=(20, 20), dtype=<class 'numpy.float32'>)
  Parameter dense80_weight (shape=(20, 20), dtype=float32)
  Parameter dense80_bias (shape=(20,), dtype=float32)
)



4. **Construct a multilayer perceptron containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.**

In [59]:
x = nd.random.uniform(shape=(2, 20))
y = nd.random.uniform(shape=(2, 1))

net4 = nn.Sequential()
shared = nn.Dense(8, activation='relu')
shared_reuse = nn.Dense(8, activation='relu', params=shared.params)
net4.add(nn.Dense(8, activation='relu'),
        shared,
        shared_reuse,
        nn.Dense(1))
net4.initialize()
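# NOTE: with a single output unit the softmax is always 1, so this
# cross-entropy loss is identically 0 and the gradients printed below are all
# zeros; a regression loss such as gloss.L2Loss() gives non-zero gradients
loss = gloss.SoftmaxCrossEntropyLoss()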


with mx.autograd.record():
    y_hat = net4(x)
    print(net4.collect_params())
    l = loss(y_hat, y).sum()
l.backward()
print(shared.weight.grad())
print(shared_reuse.weight.grad())


sequential50_ (
  Parameter dense139_weight (shape=(8, 20), dtype=float32)
  Parameter dense139_bias (shape=(8,), dtype=float32)
  Parameter dense137_weight (shape=(8, 8), dtype=float32)
  Parameter dense137_bias (shape=(8,), dtype=float32)
  Parameter dense140_weight (shape=(1, 8), dtype=float32)
  Parameter dense140_bias (shape=(1,), dtype=float32)
)

[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 8x8 @cpu(0)>

[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 8x8 @cpu(0)>


5. **Why is sharing parameters a good idea?**

    Sharing parameters generally saves memory, and it has specific benefits in the following cases:
    
    For CNNs in image recognition, sharing parameters lets the network look for a given feature everywhere in the image, rather than in just one area. 
    
    For RNNs, parameters are shared across the time steps of the sequence, so the model can generalize well to examples of different sequence lengths.
    
    For autoencoders, the encoder and decoder can share parameters. In a single-layer autoencoder with linear activation, sharing weights forces orthogonality among the weight vectors of the different hidden units.
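
As a rough illustration of the memory saving, the sketch below counts parameters for a fully connected layer and for a convolutional layer that both map a single-channel image to an output of the same size. The 28×28 image size and the 3×3 kernel are assumptions chosen only for this example.

```python
# Illustrative sizes (assumptions): a 28x28 single-channel input and output.
h, w = 28, 28
k = 3  # assumed 3x3 convolution kernel

# Fully connected: every output pixel has its own weight for every input pixel.
dense_params = (h * w) * (h * w) + (h * w)   # weights + biases

# Convolution: one k x k kernel (plus one bias) is shared across all locations.
conv_params = k * k + 1

print("dense:", dense_params)
print("conv :", conv_params)
```

The convolutional layer reuses the same few weights at every location, which is where the saving comes from.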
    

#### 5.3.5. Exercises

1. **What happens if you specify only parts of the input dimensions. Do you still get immediate initialization?**

    No. Initialization is deferred and occurs right before the first forward pass.
    

2. **What happens if you specify mismatching dimensions?**

    An error like the following is raised:
    
    ```
    Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (126,) to.shape=(12,)
    ```
    
3. **What would you need to do if you have input of varying dimensionality? Hint - look at parameter tying.**

    See the printed results below for details. Notice that the serial number xxx in **shared :  sequentialxxx** is always the same, while the one in **unique :  sequentialxxx** differs for inputs of different dimensionality.

In [96]:
def get_net535_init():
    ## build a fresh two-layer network; input sizes are inferred at the first forward pass
    net535 = nn.Sequential()
    net535.add(nn.Dense(16, activation='relu'), nn.Dense(8))
    net535.initialize()
    return net535

def train_535(net535_unique, net535_shared, X, y, test_iter=None, loss=gloss.L2Loss(), 
              num_epochs=2, batch_size=32, params=None, lr=0.05):
    for epoch in range(num_epochs):
        print("****************** epoch {} ******************".format(epoch))
        with autograd.record():
            try:
                mid = net535_unique(X)
            except Exception:
                ## if the existing network's shape does not match X, build a new input-specific network
                net535_unique = get_net535_init()
                mid = net535_unique(X)
            print("unique : ", net535_unique.collect_params())

            ## the shared network does not need to be reinitialized because its input shape is consistent
            y_hat = net535_shared(mid)
            print("shared : ", net535_shared.collect_params())


## Generate dataset
batch_size = 10
features1 = nd.random.uniform(shape=(batch_size, 3))
print(features1.shape)
features2 = nd.random.uniform(shape=(batch_size, 6))
print(features2.shape)

y = nd.random.uniform(shape=(batch_size, 1))
net535_unique_ex = get_net535_init()
net535_shared_ex = get_net535_init()
print("\n^^^^^^^^^^^^^^^^^^^^^ features1 ^^^^^^^^^^^^^^^^^^^^^")
train_535(net535_unique_ex, net535_shared_ex, features1, y)
print("\n^^^^^^^^^^^^^^^^^^^^^ features2 ^^^^^^^^^^^^^^^^^^^^^")
train_535(net535_unique_ex, net535_shared_ex, features2, y)

# train_535(net=net535, train_iter=data_iter)


^^^^^^^^^^^^^^^^^^^^^ features1 ^^^^^^^^^^^^^^^^^^^^^
****************** epoch 0 ******************
unique :  sequential117_ (
  Parameter dense261_weight (shape=(16, 3), dtype=float32)
  Parameter dense261_bias (shape=(16,), dtype=float32)
  Parameter dense262_weight (shape=(8, 16), dtype=float32)
  Parameter dense262_bias (shape=(8,), dtype=float32)
)
shared :  sequential118_ (
  Parameter dense263_weight (shape=(16, 8), dtype=float32)
  Parameter dense263_bias (shape=(16,), dtype=float32)
  Parameter dense264_weight (shape=(8, 16), dtype=float32)
  Parameter dense264_bias (shape=(8,), dtype=float32)
)
****************** epoch 1 ******************
unique :  sequential117_ (
  Parameter dense261_weight (shape=(16, 3), dtype=float32)
  Parameter dense261_bias (shape=(16,), dtype=float32)
  Parameter dense262_weight (shape=(8, 16), dtype=float32)
  Parameter dense262_bias (shape=(8,), dtype=float32)
)
shared :  sequential118_ (
  Parameter dense263_weight (shape=(16, 8), dtype=float32)

#### 5.4.4. Exercises

1. **Design a layer that learns an affine transform of the data, i.e. it removes the mean and learns an additive parameter instead.**


In [98]:
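class CenteredLayer(nn.Block):
    def __init__(self, **kwargs):
        super(CenteredLayer, self).__init__(**kwargs)
        ## learnable additive parameter, initialized to zero
        self.bias = self.params.get('bias', shape=(1,), init=init.Zero())

    def forward(self, x):
        ## remove the mean, then add the learned offset
        return x - x.mean() + self.bias.data()

layer = CenteredLayer()
layer.initialize()
layer(nd.array([1, 2, 3, 4, 5]))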


[-2. -1.  0.  1.  2.]
<NDArray 5 @cpu(0)>

2. **Design a layer that takes an input and computes a tensor reduction, i.e. it returns  $y_k = \sum_{i,j} W_{ijk} x_i x_j$.**


In [126]:
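class TensorReductionLayer(nn.Block):
    def __init__(self, k, x_shape, **kwargs):
        super(TensorReductionLayer, self).__init__(**kwargs)
        self.k = k  ## number of outputs; stored so forward() does not rely on a global k
        self.weight = self.params.get('weight', shape=(x_shape, k, x_shape))
        
    def forward(self, x):
        ## contract the last axis of W with x: (x_shape, k, x_shape) . (x_shape, 1) -> (x_shape, k, 1)
        mid = nd.dot(self.weight.data(), x)
        print(mid.shape)
        ## contract the remaining x_shape axis with x: (1, x_shape) . (x_shape, k, 1) -> (1, k, 1)
        out = nd.dot(x.T, mid)
        return out.reshape(self.k)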

## sample random x with given length
x_length = 5
x = nd.random.uniform(shape=(x_length, 1))

k = 3  ## k can be any integer
TRlayer = TensorReductionLayer(k, x_length)
TRlayer.initialize()
TRlayer(x)

(5, 3, 1)



[-0.01547093 -0.00399414 -0.01535948]
<NDArray 3 @cpu(0)>

3. **Design a layer that returns the leading half of the Fourier coefficients of the data. Hint - look up the fft function in MXNet.**

In [None]:
## TODO: Run on GPU

class FourierLayer(nn.Block):
    def __init__(self, **kwargs):
        super(FourierLayer, self).__init__(**kwargs)

    def forward(self, x):
        ## the contrib fft is GPU-only; for an (n, d) real input it returns shape (n, 2*d),
        ## laid out as [real0, imag0, real1, imag1, ...]
        out = mx.contrib.ndarray.fft(data=x)
        ## keep the first d of the 2*d values, i.e. the leading d/2 complex coefficients
        return out[:, :x.shape[1]]

## sample random x with given shape
data = np.random.normal(0, 1, (3, 4))
layer = FourierLayer()
out = layer(mx.nd.array(data, ctx=mx.gpu(0)))
out

#### 5.5.4. Exercises

1. **Even if there is no need to deploy trained models to a different device, what are the practical benefits of storing model parameters?**

    a. Saving intermediate results (checkpointing) ensures that we don’t lose several days' worth of computation when running a long training process.
    
    b. A stored model can be loaded as a pretrained model for fine-tuning.
    

2. **Assume that we want to reuse only parts of a network to be incorporated into a network of a different architecture. How would you go about using, say the first two layers from a previous network in a new network.**


In [136]:
from mxnet.gluon import model_zoo
alexnet = model_zoo.vision.alexnet(pretrained=True)
print(alexnet.collect_params('alexnet.*_conv0.*'))
print(alexnet.collect_params('alexnet.*_conv1.*'))

alexnet6_ (
  Parameter alexnet6_conv0_weight (shape=(64, 3, 11, 11), dtype=<class 'numpy.float32'>)
  Parameter alexnet6_conv0_bias (shape=(64,), dtype=<class 'numpy.float32'>)
)
alexnet6_ (
  Parameter alexnet6_conv1_weight (shape=(192, 64, 5, 5), dtype=<class 'numpy.float32'>)
  Parameter alexnet6_conv1_bias (shape=(192,), dtype=<class 'numpy.float32'>)
)


3. **How would you go about saving network architecture and parameters? What restrictions would you impose on the architecture?**

    In order to reload a trained model, we need to generate the architecture in code and then load the parameters from disk. 

In [None]:
class MLP(nn.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)

    def forward(self, x):
        return self.output(self.hidden(x))

## train a network and save
net = MLP()
net.initialize()
x = nd.random.uniform(shape=(2, 20))
y = net(x)
net.save_parameters('mlp.params')

## define the architecture and then reload parameters
clone = MLP()
clone.load_parameters('mlp.params')

#### 5.6.5. Exercises

1. **Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of calculations?**


In [None]:
## TODO: Run on GPU

s = 4096

A = nd.random.normal(shape=(s, s))
B = nd.random.normal(shape=(s, s))
tic = time.time()
C = nd.dot(A, B)
C.wait_to_read()
print("On CPU : Matrix by matrix: " + str(time.time() - tic) + " seconds")


A1 = A.copyto(mx.gpu(1))
B1 = B.copyto(mx.gpu(1))
tic = time.time()
C = nd.dot(A1, B1)  ## multiply the GPU copies, not the CPU arrays
C.wait_to_read()
print("On GPU : Matrix by matrix: " + str(time.time() - tic) + " seconds")


2. **How should we read and write model parameters on the GPU?**

    Use `net.load_parameters(file_name, ctx=ctx)` to read model parameters, and `net.save_parameters(file_name)` to save model parameters.
    

3. **Measure the time it takes to compute 1000 matrix-matrix multiplications of  100×100 matrices and log the matrix norm  $tr(MM^⊤)$  one result at a time vs. keeping a log on the GPU and transferring only the final result.**


In [None]:
## TODO: Run on GPU
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


tic = time.time()
for j in range(1000):
    M = nd.random.normal(shape=(100, 100))
    C = nd.sum(nd.diag(nd.dot(M, M.T)))
    C.wait_to_read()
print("Read one-by-one on GPU : " + str(time.time() - tic) + " seconds")



tic = time.time()
D = nd.zeros(shape=(1000,))
for j in range(1000):
    M = nd.random.normal(shape=(100, 100))
    D[j] = nd.sum(nd.diag(nd.dot(M, M.T)))
    
D.wait_to_read()
print("Read all at once on GPU : " + str(time.time() - tic) + " seconds")


4. **Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs at the same time vs. in sequence on one GPU (hint - you should see almost linear scaling).**


In [151]:
## TODO: Run on GPU

s = 4096
tic = time.time()
for j in range(2):
    M = nd.random.normal(shape=(s, s))
    C = nd.dot(M, M.T)
    C.shape
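#     C.wait_to_read()
    ## NOTE: with wait_to_read() commented out, the timer only measures queuing the
    ## asynchronous operations, not the matrix multiplications themselves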
print("Two matrix-matrix multiplications in sequence on GPU : " + str(time.time() - tic) + " seconds")



tic = time.time()
M = nd.random.normal(shape=(2, s, s))
N = nd.random.normal(shape=(s, s))
D = nd.dot(M, N)
D.shape
print("Two matrix-matrix multiplications at the same time on GPU : " + str(time.time() - tic) + " seconds")


(4096, 4096)

(4096, 4096)

Two matrix-matrix multiplications in sequence on GPU : 0.009020805358886719 seconds


(2, 4096, 4096)

Two matrix-matrix multiplications at the same time on GPU : 0.004949331283569336 seconds


### Chapter 6
#### 6.1 Exercises

1. **Assume that the size of the convolution mask is ∆ = 0. Show that in this case the convolutional mask implements an MLP independently for each set of channels.**

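    For any channel index $k$, since $\Delta = 0$ each double sum collapses to a single term, so the output at location $(i, j)$ depends only on the input at the same location, i.e.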
    $$h[i,j,k] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a,b,k] \cdot x[i+a,j+b]
               = V[0,0,k] \cdot x[i,j]$$
               

2. **Why might translation invariance not be a good idea after all? Does it make sense for pigs to fly?**

    In translation invariance, we assume that we would recognize an object wherever it is in an image. It is only reasonable to assume that the location of the object shouldn’t matter too much to determine whether the object is there. For example, a face is still a face regardless of whether it is moved horizontally or vertically in an image.



3. **What happens at the boundary of an image?** 

    Without padding, we keep losing pixels at the boundaries: every convolution shrinks the output.
    
    
4. **Derive an analogous convolutional layer for audio.**

    A mel spectrogram transforms the raw input sequence into a 2D feature map in which one dimension represents time, the other represents frequency, and the values represent amplitude. https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

    Moving a sound event horizontally offsets its position in time, and it can be argued that a sound event means the same thing regardless of when it happens. However, moving a sound vertically might influence its meaning: moving the frequencies of a male voice upwards could change its meaning from man to child or goblin, for example.
    https://towardsdatascience.com/whats-wrong-with-spectrograms-and-cnns-for-audio-processing-311377d7ccd


5. **What goes wrong when you apply the above reasoning to text? Hint - what is the structure of language?**
        
    Language is sequential data, hence we cannot assume translation invariance: the position of each word matters. Sequence modeling is more effective. 
        
        
6. **Prove that f⊛g=g⊛f.**

    Let $y = x - z$, so that $z = x - y$ and $dz = dy$ when integrating over all of $\mathbb{R}^d$; then
    $$[f \circledast g](x) = \int_{\mathbb{R}^d} f(z) g(x-z) dz 
                            = \int_{\mathbb{R}^d} f(x-y) g(y) dy
                            = [g \circledast f](x)$$
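
The same symmetry can be checked numerically in the discrete case. The sketch below is only an illustration on two short, arbitrarily chosen sequences; `conv1d` is a helper written for this example, not a library function.

```python
def conv1d(f, g):
    """Full discrete convolution: out[n] = sum_m f[m] * g[n - m]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

f = [1.0, 2.0, 3.0]
g = [0.5, -1.0]
print(conv1d(f, g))
print(conv1d(g, f))
```

Swapping the arguments only changes the order in which the same products are accumulated, so both calls print the same list.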

#### 6.2 Exercises

1. **Construct an image X with diagonal edges.**

    • **What happens if you apply the kernel K to it?**
    
    • **What happens if you transpose X?**
    
    • **What happens if you transpose K?**


2. **When you try to automatically find the gradient for the Conv2D class we created, what kind of error message do you see?**
    ??????
    
    
3. **How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel arrays?**

    In the two-dimensional cross-correlation operation, the convolution window starts from the top-left of the input array, and slides in the input array from left to right and top to bottom. (See Section 6.2.1 for details.)
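
One common way to turn the sliding window into a matrix multiplication is to unroll every window of the input into a row and flatten the kernel into a vector. The sketch below uses plain nested lists, no padding and stride 1; the helper names `corr2d` and `corr2d_matmul` are chosen for this example only.

```python
def corr2d(X, K):
    """Direct 2D cross-correlation of nested lists X and K (no padding, stride 1)."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def corr2d_matmul(X, K):
    """Same result via a matrix-vector product: unroll every window into a row."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(X) - kh + 1, len(X[0]) - kw + 1
    # Each row holds one flattened kh*kw window of X, in sliding order.
    rows = [[X[i + a][j + b] for a in range(kh) for b in range(kw)]
            for i in range(oh) for j in range(ow)]
    k_vec = [K[a][b] for a in range(kh) for b in range(kw)]  # flattened kernel
    flat = [sum(r * k for r, k in zip(row, k_vec)) for row in rows]
    # Reshape the flat result back to (oh, ow).
    return [flat[i * ow:(i + 1) * ow] for i in range(oh)]

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]
print(corr2d(X, K))
print(corr2d_matmul(X, K))
```

The unrolled matrix has one row per output position, so it duplicates overlapping input entries in exchange for a single dense product.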


4. **Design some kernels manually.**

    • **What is the form of a kernel for the second derivative?**
    
    • **What is the kernel for the Laplace operator?**
    
    https://math.stackexchange.com/questions/483585/kernels-to-compute-second-order-derivative-of-digital-image
        
    • **What is the kernel for an integral?**
    
    http://mathworld.wolfram.com/IntegralKernel.html
        
    • **What is the minimum size of a kernel to obtain a derivative of degree d?**
    
    
    

#### 6.3 Exercises

1. **For the last example in this section, use the shape calculation formula to calculate the output shape to see if it is consistent with the experimental results.**


2. **Try other padding and stride combinations on the experiments in this section.**


3. **For audio signals, what does a stride of 2 correspond to?**

    In the time dimension, a stride of 2 aggregates every 2 time steps. 
    
    In the frequency dimension, …
    
4. **What are the computational benefits of a stride larger than 1.**
    
    It saves memory and reduces computation time, because the output shrinks by roughly the stride factor along each dimension.

#### 6.4 Exercises

1. **Assume that we have two convolutional kernels of size k1 and k2 respectively (with no nonlinearity in between).** 

    • **Prove that the result of the operation can be expressed by a single convolution.** 
    
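    This follows from the associative law of convolution: $(x \circledast k_1) \circledast k_2 = x \circledast (k_1 \circledast k_2)$, so the single kernel $k_1 \circledast k_2$ gives the same result.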
    
    • **What is the dimensionality of the equivalent single convolution?**
    
    $k_1 + k_2 - 1$ along each spatial dimension.
    
    • **Is the converse true?**
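
For the first two parts of this exercise, a one-dimensional sketch can make the kernel size concrete. It assumes true convolution (flipped kernel), no padding and stride 1; the signal and kernels are arbitrary, and `conv_full` and `conv_valid` are helpers written for this example.

```python
def conv_full(f, g):
    """Full discrete convolution of two sequences."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def conv_valid(x, k):
    """Convolution restricted to positions where the kernel fits entirely inside x."""
    full = conv_full(x, k)
    return full[len(k) - 1:len(x)]

x = [1, 4, 2, 7, 3, 5, 6]
k1 = [1, 0, -1]      # size 3
k2 = [2, 1]          # size 2

two_steps = conv_valid(conv_valid(x, k1), k2)
k_equiv = conv_full(k1, k2)            # single equivalent kernel
one_step = conv_valid(x, k_equiv)

print(len(k_equiv))
print(two_steps == one_step)
```

The equivalent kernel is the full convolution of the two kernels, which is why its length is the sum of the two sizes minus one.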
    
    
2. **Assume an input shape of ci ×h×w and a convolution kernel with the shape co ×ci ×kh ×kw, padding of (ph,pw), and stride of (sh,sw).**

    • **What is the computational cost (multiplications and additions) for the forward computation?** 
    
    $O(c_i c_o k_h k_w m_h m_w)$, where $m_h \times m_w$ is the height and width of the output.
    
    • **What is the memory footprint?** 
    
    • **What is the memory footprint for the backward computation?**
    
    • **What is the computational cost for the backward computation?**
    
	https://kasperfred.com/posts/computational-complexity-of-neural-networks
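
The expression $O(c_i c_o k_h k_w m_h m_w)$ in this answer can be evaluated for concrete sizes. The sketch below assumes that $m_h \times m_w$ is the output size and that $p_h$ and $p_w$ are the total padding along each dimension (both sides combined); the function names and the example sizes are hypothetical.

```python
def conv_output_shape(h, w, kh, kw, ph, pw, sh, sw):
    """Output height and width; ph and pw are assumed to be the TOTAL padding
    (both sides combined) along each dimension."""
    mh = (h - kh + ph + sh) // sh
    mw = (w - kw + pw + sw) // sw
    return mh, mw

def conv_forward_multiplications(ci, co, h, w, kh, kw, ph=0, pw=0, sh=1, sw=1):
    """Number of multiplications in the forward pass: one per kernel entry,
    per input channel, per output channel, per output position."""
    mh, mw = conv_output_shape(h, w, kh, kw, ph, pw, sh, sw)
    return ci * co * kh * kw * mh * mw

# Hypothetical example: 3 input channels, 16 output channels, 32x32 input, 3x3 kernel.
print(conv_output_shape(32, 32, 3, 3, 2, 2, 1, 1))
print(conv_forward_multiplications(3, 16, 32, 32, 3, 3, ph=2, pw=2))
```

Doubling both `ci` and `co` in this count multiplies the result by four, since the two factors enter the product independently.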


3. **By what factor does the number of calculations increase if we double the number of input channels ci and the number of output channels co? What happens if we double the padding?**


4. **If the height and width of the convolution kernel is $k_h = k_w = 1$, what is the complexity of the forward computation?**

    $O(c_i c_o h w)$


5. **Are the variables Y1 and Y2 in the last example of this section exactly the same? Why?**

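    Mathematically, yes: the main computation of the $1 \times 1$ convolution occurs on the channel dimension, and $k_h = k_w = 1$ in both functions, so both compute the same result. Numerically, the two may still differ by floating-point rounding error.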
    
    
6. **How would you implement convolutions using matrix multiplication when the convolution window is not 1×1 ?**



#### 6.5 Exercises

1. **Implement average pooling as a convolution.**

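    See Section 6.5.1: averaging over a window is a cross-correlation with a kernel whose entries all equal $1/(p_h p_w)$.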
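
A minimal sketch of this idea, assuming a single channel, no padding and stride 1: average pooling gives the same result as a cross-correlation whose kernel entries are all $1/(p_h p_w)$. The helper names are chosen for this example only.

```python
def corr2d(X, K):
    """2D cross-correlation of nested lists (no padding, stride 1)."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def avg_pool2d(X, ph, pw):
    """Direct average pooling with a ph x pw window and stride 1."""
    oh, ow = len(X) - ph + 1, len(X[0]) - pw + 1
    return [[sum(X[i + a][j + b] for a in range(ph) for b in range(pw)) / (ph * pw)
             for j in range(ow)] for i in range(oh)]

def avg_pool2d_as_conv(X, ph, pw):
    """Average pooling as cross-correlation with a constant kernel 1 / (ph * pw)."""
    K = [[1.0 / (ph * pw)] * pw for _ in range(ph)]
    return corr2d(X, K)

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(avg_pool2d(X, 2, 2))
print(avg_pool2d_as_conv(X, 2, 2))
```

With several channels, the same constant kernel would be applied to each channel separately rather than summed across channels.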
    
    
2. **What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size  c×h×w , the pooling window has a shape of  ph×pw  with a padding of  $(p_h,p_w)$  and a stride of  $(s_h,s_w)$.**

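    The output has $c \times \lfloor (h - p_h + \mathrm{pad}_h + s_h)/s_h \rfloor \times \lfloor (w - p_w + \mathrm{pad}_w + s_w)/s_w \rfloor$ elements, where $\mathrm{pad}_h$ and $\mathrm{pad}_w$ denote the padding. Each element is computed from one $p_h \times p_w$ window, so the total cost is this count multiplied by $p_h p_w$.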
    

3. **Why do you expect maximum pooling and average pooling to work differently?**

    • Max pooling: keeps the strongest pattern signal in a window.
    
    • Average pooling: reports the average signal strength in a window.

    ![title](image/textbook_solution_6.5.3.png)
    
    
4. **Do we need a separate minimum pooling layer? Can you replace it with another operation?**

    No. Since $\min_{i,j} X_{ij} = -\max_{i,j}(-X_{ij})$, min pooling can be implemented by negating the input, applying max pooling, and negating the result.
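
The reduction of min pooling to max pooling can be sketched in a few lines. This is an illustration on nested lists with stride 1 and no padding; `max_pool2d` and `min_pool2d` are helpers written for this example.

```python
def max_pool2d(X, ph, pw):
    """Max pooling with a ph x pw window, stride 1, no padding."""
    oh, ow = len(X) - ph + 1, len(X[0]) - pw + 1
    return [[max(X[i + a][j + b] for a in range(ph) for b in range(pw))
             for j in range(ow)] for i in range(oh)]

def min_pool2d(X, ph, pw):
    """Min pooling expressed as: negate, max-pool, negate again."""
    neg = [[-v for v in row] for row in X]
    return [[-v for v in row] for row in max_pool2d(neg, ph, pw)]

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(max_pool2d(X, 2, 2))
print(min_pool2d(X, 2, 2))
```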
    
    
5. **Is there another operation between average and maximum pooling that you could consider (hint - recall the softmax)? Why might it not be so popular?**
    
    A pooling layer alleviates the excessive sensitivity of the convolutional layer to location, i.e. it reduces the resolution.
    A softmax-weighted average of the window lies between average and maximum pooling, but the softmax is too expensive to compute.
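
One possible reading of the softmax hint is a softmax-weighted average over each window. The sketch below handles a single window; the `temperature` parameter is an assumption added for illustration, to show how the result moves between the mean and the maximum.

```python
import math

def softmax_pool(values, temperature=1.0):
    """Softmax-weighted average of the values in one pooling window."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

window = [1.0, 2.0, 3.0, 4.0]
print(sum(window) / len(window))   # average pooling
print(softmax_pool(window))        # softmax-weighted pooling
print(max(window))                 # max pooling
```

Each window needs one exponential per element, which is the extra cost compared with a plain sum or comparison.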


#### 6.6 Exercises

1. **Replace the average pooling with max pooling. What happens?**


2. **Try to construct a more complex network based on LeNet to improve its accuracy.**

    • **Adjust the convolution window size.**
    
    • **Adjust the number of output channels.**
    
    • **Adjust the activation function (ReLU?).**
    
    • **Adjust the number of convolution layers.**
    
    • **Adjust the number of fully connected layers.**
    
    • **Adjust the learning rates and other training details (initialization, epochs, etc.)**
    
    • **Try out the improved network on the original MNIST dataset.**


3. **Display the activations of the first and second layer of LeNet for different inputs (e.g. sweaters, coats).**

#### 6.7 Exercises

1. **Try increasing the number of epochs. Compared with LeNet, how are the results different? Why?**


2. **AlexNet may be too complex for the Fashion-MNIST data set.**

    a. **Try to simplify the model to make the training faster, while ensuring that the accuracy does not drop significantly.**
    
    b. **Can you design a better model that works directly on  28×28  images.**
    
    
3. **Modify the batch size, and observe the changes in accuracy and GPU memory.**


4. **Rooflines**

    ![title](image/textbook_solution_6.7.4.png)
    
    a. **What is the dominant part for the memory footprint of AlexNet?**
    
    Dense1, with about 26 million parameters.
    
    b. **What is the dominant part for computation in AlexNet?**
    
    Conv2, with about 16 million FLOPs.
    
    c. **How about memory bandwidth when computing the results?**
    
    
5. **Apply dropout and ReLU to LeNet5. Does it improve? How about preprocessing?**



#### 6.8 Exercises

1. **When printing out the dimensions of the layers we only saw 8 results rather than 11. Where did the remaining 3 layer informations go?**

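    Three of the VGG blocks each contain a pair of convolutional layers with the same output shape, and the shape is printed once per block rather than once per layer.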
    
2. **Compared with AlexNet, VGG is much slower in terms of computation, and it also needs more GPU memory. Try to analyze the reasons for this.**


3. **Try to change the height and width of the images in Fashion-MNIST from 224 to 96. What influence does this have on the experiments?**


4. **Refer to Table 1 in the original VGG Paper to construct other common models, such as VGG-16 or VGG-19.**



#### 6.9 Exercises

1. **Tune the hyper-parameters to improve the classification accuracy.**


2. **Why are there two  1×1  convolutional layers in the NiN block? Remove one of them, and then observe and analyze the experimental phenomena.**


3. **Calculate the resource usage for NiN:**
    
    ![title](image/textbook_solution_6.9.3.png)
    
    a. **What is the number of parameters?**
    
    b. **What is the amount of computation?**
    
    c. **What is the amount of memory needed during training?**
    
    d. **What is the amount of memory needed during inference?**

4. **What are possible problems with reducing the  384×5×5  representation to a  10×5×5 representation in one step?**

### Chapter 7
#### 7.1.5. Exercises


1. **Improve the above model.**

    a. **Incorporate more than the past 4 observations? How many do you really need?**
    
    b. **How many would you need if there were no noise? Hint - you can write  sin  and  cos  as a differential equation.**
    
    
    c. **Can you incorporate older features while keeping the total number of features constant? Does this improve accuracy? Why?**
    
    
    d. **Change the architecture and see what happens.**
    
    
2. **An investor wants to find a good security to buy. She looks at past returns to decide which one is likely to do well. What could possibly go wrong with this strategy?**



3. **Does causality also apply to text? To which extent?**



4. **Give an example for when a latent variable autoregressive model might be needed to capture the dynamic of the data.**





#### 7.2.5. Exercises
1. **Suppose there are 100,000 words in the training data set. How many word frequencies and multi-word adjacent frequencies does a four-gram need to store?**

    In English, the Zipf’s law in the n-gram data exhibits two regimes: one among words with frequencies above about 0.01% (Zipf’s exponent γ ≈ 1) and another (γ ≈ 1.4) among words with frequency below 0.0001%.

    By Zipf's law, the normalized frequency of the element of rank $k$, $f(k; s, N)$, is:

    $$ f(k; s, N) = \frac{1}{k^s \sum_{n=1}^{N} \frac{1}{n^s}},$$ 
    where $N = 100{,}000$ is the number of words and $s = 1.07$ for unigrams. 
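
The formula can be evaluated directly. The sketch below plugs in the values quoted in this answer ($N = 100{,}000$ and $s = 1.07$); `zipf_frequency` is a helper written for this example.

```python
def zipf_frequency(k, s, N):
    """Normalized frequency of the element of rank k under Zipf's law."""
    normalizer = sum(1.0 / n ** s for n in range(1, N + 1))
    return 1.0 / (k ** s * normalizer)

N, s = 100000, 1.07
print(zipf_frequency(1, s, N))    # most frequent word
print(zipf_frequency(10, s, N))   # word of rank 10
```

By construction the frequencies of all $N$ ranks sum to one, and the frequency at rank 10 is smaller than at rank 1 by a factor of $10^{s}$.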


2. **Review the smoothed probability estimates. Why are they not accurate? Hint - we are dealing with a contiguous sequence rather than singletons.**

    Laplace smoothing assigns non-zero probabilities to words that do not occur in the sample. Many contiguous sequences may never occur in the training text, so the smoothed probabilities are not accurate in the right tail.
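
A small sketch of the effect on bigrams, using standard add-$\alpha$ smoothing on a toy sentence. The sentence, the value of `alpha` and the function name are assumptions for illustration, and the exact smoothing formula in the textbook may differ.

```python
from collections import Counter

def laplace_bigram_prob(tokens, w_prev, w_next, alpha=1.0):
    """Add-alpha smoothed estimate of P(w_next | w_prev) from bigram counts."""
    vocab = set(tokens)
    unigram = Counter(tokens[:-1])                  # how often each word starts a bigram
    bigram = Counter(zip(tokens[:-1], tokens[1:]))
    return (bigram[(w_prev, w_next)] + alpha) / (unigram[w_prev] + alpha * len(vocab))

tokens = "the cat sat on the mat the cat ran".split()
print(laplace_bigram_prob(tokens, "the", "cat"))   # seen bigram
print(laplace_bigram_prob(tokens, "the", "ran"))   # unseen bigram, still non-zero
```

Every unseen continuation of a word receives the same probability, regardless of how plausible the resulting sequence is, which is one reason the estimates are poor for rare sequences.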


3. **How would you model a dialogue?**

    First, decide which type of dialogue (conversational model) to model: rule-based, retrieval-based, neural generative, grounded/visual, chit-chat vs. task-based, etc.

    Next, choose a modelling framework. For neural generative models, examples include the Semantically Conditioned LSTM-based model (https://arxiv.org/abs/1508.01745) and Deep Reinforcement Learning for Dialogue Generation (https://aclweb.org/anthology/D16-1127).


4. **Estimate the exponent of Zipf’s law for unigrams, bigrams and trigrams.**



#### 7.3.5. Exercises

1. **If we use an RNN to predict the next character in a text sequence, how many output dimensions do we need?**

    The output dimension should equal the size of the dictionary of unique characters in the training and test datasets.


2. **Can you design a mapping for which an RNN with hidden states is exact? Hint - what about a finite number of words?**


3. **What happens to the gradient if you backpropagate through a long sequence?**

    Backpropagating through a long sequence multiplies many matrices together, and high powers of matrices can make the gradients explode or vanish.
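
A scalar stand-in shows the effect: multiplying by the same factor once per time step either shrinks or blows up the result, depending on whether the factor is below or above one. The factors and the number of steps below are arbitrary choices for illustration.

```python
def repeated_product(w, steps):
    """Magnitude of a gradient factor after multiplying by w once per time step."""
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

for w in (0.9, 1.0, 1.1):
    print(w, repeated_product(w, 100))
```

For matrices, the eigenvalues play the role of the scalar factor here.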


4. **What are some of the problems associated with the simple sequence model described above?**

    a. They are numerically unstable.
    
    b. In latent variable models it is difficult to preserve long-term information and to skip short-term input.

#### 7.4.5. Exercises

1. **What other mini-batch data sampling methods can you think of?**

    Non-uniform mini-batch sampling, i.e. suppressing the probability of similar data points appearing in the same mini-batch, which reduces the stochastic gradient noise and leads to faster convergence.


2. **Why is it a good idea to have a random offset?**
    
    In this way, we get both coverage (from the sequential partitioning strategy) and randomness.

      a. **Does it really lead to a perfectly uniform distribution over the sequences on the document?**

        Picking just a random set of initial positions is no good either, since it does not guarantee uniform coverage of the array. For instance, if we pick $n$ elements at random out of a set of $n$ with replacement, the probability that a particular element is not picked is

        $$(1 - 1/n)^n \to e^{-1}$$

      b. **What would you have to do to make things even more uniform?**

            # Offset for the iterator over the data for uniform starts
            offset = int(random.uniform(0,num_steps))
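
The limit quoted in part a. can be checked numerically. The sketch below evaluates the probability that a given element is never picked for a few arbitrarily chosen values of $n$; `prob_never_picked` is a helper written for this example.

```python
import math

def prob_never_picked(n):
    """Probability that a given element is missed in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 100, 10000):
    print(n, prob_never_picked(n))
print("limit:", math.exp(-1))
```

So with purely random starting positions, roughly a third of the positions would be left uncovered.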
            
            
3. **If we want a sequence example to be a complete sentence, what kinds of problems does this introduce in mini-batch sampling? Why would we want to do this anyway?**

    Since a complete sentence is long, it is acceptable to discard half-empty mini-batches, because these sequences are covered as parts of other batches in mini-batch sampling.



#### 7.5.8. Exercises

1. **Show that one-hot encoding is equivalent to picking a different embedding for each object.**
    
    Elementary row and column operations on a matrix are rank-preserving.
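
Another way to see the equivalence, offered here as an illustration rather than as the argument above: multiplying a one-hot vector by a weight matrix selects exactly one row of that matrix, so each object gets its own embedding vector. The matrix values below are hypothetical.

```python
def one_hot(index, size):
    """One-hot row vector with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec_left(v, W):
    """Row vector v times matrix W (a list of rows)."""
    return [sum(v[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

# Hypothetical weight matrix: 4 objects, 3-dimensional embeddings.
W = [[0.1, 0.2, 0.3],
     [1.0, 1.1, 1.2],
     [2.0, 2.1, 2.2],
     [3.0, 3.1, 3.2]]

print(matvec_left(one_hot(2, 4), W))
print(W[2])
```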

2. **Adjust the hyperparameters to improve the perplexity.**
    
    a. **How low can you go? Adjust embeddings, hidden units, learning rate, etc.**
    
        
    
    b. **How well will it work on other books by H. G. Wells, e.g. The War of the Worlds.**
    
    
3. **Run the code in this section without clipping the gradient. What happens?**


4. **Set the pred_period variable to 1 to observe how the under-trained model (high perplexity) writes lyrics. What can you learn from this?**


5. **Change adjacent sampling so that it does not separate hidden states from the computational graph. Does the running time change? How about the accuracy?**


6. **Replace the activation function used in this section with ReLU and repeat the experiments in this section.**



7. **Prove that the perplexity is the inverse of the harmonic mean of the conditional word probabilities.**