## Questionnaire

__1. Write the Python code to implement a single neuron.__

In [None]:
neuron_output = sum([x*w for x,w in zip(inputs,weights)]) + bias

__2. Write the Python code to implement ReLU.__

In [None]:
def relu(x): return 0 if x<0 else x

__3. Write the Python code for a dense layer in terms of matrix multiplication.__

In [None]:
y = x @ w.t() + b

__4. Write the Python code for a dense layer in plain Python (that is, with list comprehensions and functionality built into Python).__

In [None]:
# xi, wi are used as loop names so the bias vector b is not shadowed
y[i,j] = sum([xi * wi for xi,wi in zip(x[i,:],w[j,:])]) + b[j]
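
The one-liner above computes a single output element. As a minimal sketch of the whole layer, assuming the inputs, weights and bias are nested Python lists (the function name `dense` and the sample values are illustrative, not from the chapter):

```python
def dense(x, w, b):
    """Dense layer on nested lists: x is (n, n_in), w is (n_out, n_in), b is (n_out,)."""
    return [[sum(xi * wi for xi, wi in zip(row, w_row)) + b_j
             for w_row, b_j in zip(w, b)]     # one output per neuron
            for row in x]                     # one output row per input row

x = [[1.0, 2.0], [3.0, 4.0]]                  # two inputs with two features each
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three neurons
b = [0.5, -0.5, 0.0]
y = dense(x, w, b)
print(y)  # [[1.5, 1.5, 3.0], [3.5, 3.5, 7.0]]
```

Each row of `w` holds the weights of one neuron, which matches the `x @ w.t() + b` layout used in the previous answer.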

__5. What is the "hidden size" of a layer?__

The number of neurons in that layer

__6. What does the `t` method do in PyTorch?__

Transposes the matrix (swaps its rows and columns).

__7. Why is matrix multiplication written in plain Python very slow?__

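Python is an interpreted, dynamically typed language, so the three nested loops pay interpreter overhead on every single multiply and add. PyTorch avoids this by running those loops in optimized C++ (and CUDA on the GPU).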

__8. In `matmul`, why is `ac==br`?__

The number of columns in matrix `a` must equal the number of rows in matrix `b`

__9. In Jupyter Notebook, how do you measure the time taken for a single cell to execute?__

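`%time` times a single statement, and `%%time` on the first line of a cell times the whole cell. `%timeit` and `%%timeit` do the same but average over repeated runs.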

__10. What is "elementwise arithmetic"?__

An operation applied independently to every element of a tensor, or to each pair of corresponding elements of two tensors.

__11. Write the PyTorch code to test whether every element of `a` is greater than the corresponding element of `b`.__

In [None]:
(a > b).all()

__12. What is a rank-0 tensor? How do you convert it to a plain Python data type?__

A rank-0 tensor is a scalar tensor: it has no dimensions (its shape is `torch.Size([])`) and holds a single value.

You can turn it into a plain Python data type by calling its `.item()` method.

__13. What does this return, and why? `tensor([1,2]) + tensor([1])`__

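It returns `tensor([2, 3])`. The shapes are `[2]` and `[1]`, and a dimension of size 1 can be broadcast, so the single element `1` is added to both elements of the first tensor.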

__14. What does this return, and why? `tensor([1,2]) + tensor([1,2,3])`__

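It raises a `RuntimeError`. The shapes are `[2]` and `[3]`: the sizes differ and neither is 1, so the tensors cannot be broadcast together.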

__15. How does elementwise arithmetic help us speed up `matmul`?__

It lets us replace the innermost loop: `(a[i,:] * b[:,j]).sum()` multiplies row `i` of `a` by column `j` of `b` elementwise and then sums the products, which gives `c[i,j]`. This removes an entire Python loop from the `matmul` function and hands that work to PyTorch's optimized code.
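
As a rough sketch of what that looks like in full, assuming 2-D float tensors and the `ar, ac, br, bc` naming for rows and columns (the sample tensors are made up for the check):

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br                      # inner dimensions must match
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # elementwise product of row i and column j, then a single sum
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c

a = torch.arange(6.).reshape(2, 3)
b = torch.arange(12.).reshape(3, 4)
print(torch.allclose(matmul(a, b), a @ b))
```

Only two Python loops remain; the loop over the shared dimension now runs inside PyTorch.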

__16. What are the broadcasting rules?__

PyTorch compares the shapes of the two tensors one dimension at a time, starting with the trailing dimensions and working its way backward. If one tensor has fewer dimensions, its missing leading dimensions are treated as size 1. Two dimensions are compatible when:

- They are equal
- One of them is 1
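
The rules above can be sketched as a small shape calculator in plain Python. This is an illustration only: `broadcast_shape` is a hypothetical helper, not a PyTorch function, and it works on shape tuples rather than tensors.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((2,), (1,)))      # (2,)
print(broadcast_shape((3, 3), (3,)))    # (3, 3)
print(broadcast_shape((4, 1), (4,)))    # (4, 4)
```

The last case is worth noticing: a `(4, 1)` tensor combined with a `(4,)` tensor silently produces a `(4, 4)` result, while `broadcast_shape((2,), (3,))` raises an error.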


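__17. What is `expand_as`? Show an example of how it can be used to match the results of broadcasting.__

`expand_as` expands a tensor to the shape of another tensor, repeating it along the added or size-1 dimensions without copying the data.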

In [None]:
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
c.expand_as(m)
# tensor([[10., 20., 30.],
#         [10., 20., 30.],
#         [10., 20., 30.]])

__18. How does `unsqueeze` help us to solve certain broadcasting problems?__

It inserts a dimension of size 1 at the given position. This lets us choose which axis a tensor is broadcast along, for example by turning a vector of shape `[3]` into a column of shape `[3, 1]`.

__19. How can we use indexing to do the same operation as `unsqueeze`?__

In [None]:
# Indexing with None inserts a unit axis at that position, just like unsqueeze

c = torch.randn(64,28,28)
c.shape
# torch.Size([64, 28, 28])

c.unsqueeze(1).shape
# torch.Size([64, 1, 28, 28])

c[:,None].shape
# torch.Size([64, 1, 28, 28])

c[:,:, None].shape
# torch.Size([64, 28, 1, 28])

c[...,None].shape
# torch.Size([64, 28, 28, 1])

__20. How do we show the actual contents of the memory used for a tensor?__

In [None]:
c.storage()

__21. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)__

The elements of the vector are added to each row of the matrix: the shape `[3]` is treated as `[1, 3]` and repeated down the rows, as follows:

In [None]:
c = torch.randn(3)
d = torch.randn(3,3)
c.shape, d.shape
# (torch.Size([3]), torch.Size([3, 3]))

c
# tensor([ 0.2224, -0.3962,  2.0097])

d
# tensor([[-0.2752, -0.3189,  0.5216],
#         [ 0.6154, -0.1518, -1.0665],
#         [ 0.0169, -0.0300,  1.0752]])

c+d
# tensor([[-0.0528, -0.7152,  2.5313],
#         [ 0.8378, -0.5480,  0.9433],
#         [ 0.2393, -0.4262,  3.0849]])

__22. Do broadcasting and `expand_as` result in increased memory use? Why or why not?__

No. PyTorch does not copy the data; it gives the expanded dimension a stride of 0, so stepping to the next row moves zero elements in storage and the same values are read again.
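
A quick way to check this, sketched with a small made-up vector and matrix:

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.ones(3, 3)
t = c.expand_as(m)

print(t.shape)     # torch.Size([3, 3])
print(t.stride())  # (0, 1)
# The expanded tensor starts at the same memory address as the original vector.
print(t.data_ptr() == c.data_ptr())  # True
```

The stride of 0 along the first axis means every "row" of `t` is the same three stored numbers.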

__23. Implement `matmul` using Einstein summation.__

In [None]:
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)

__24. What does a repeated index letter represent on the left-hand side of einsum?__

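A repeated index on the left-hand side marks the dimension along which the two tensors are multiplied together and, when it does not appear on the right-hand side, summed over. In `ik,kj->ij` the letter `k` is repeated, so the columns of `a` are paired with the rows of `b` (their sizes must be equal) and the products are summed.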
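
One way to read `'ik,kj->ij'` is as nested loops. The following is a plain-Python sketch on nested lists (the helper name `einsum_ik_kj` is made up for illustration and is not part of PyTorch):

```python
def einsum_ik_kj(a, b):
    """Plain-Python reading of 'ik,kj->ij' for nested lists."""
    n_i, n_k, n_j = len(a), len(b), len(b[0])
    assert all(len(row) == n_k for row in a)  # the shared index k must match in size
    # i and j survive to the output; k appears only on the left, so it is summed away.
    return [[sum(a[i][k] * b[k][j] for k in range(n_k))
             for j in range(n_j)]
            for i in range(n_i)]

a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
print(einsum_ik_kj(a, b))  # [[4, 5], [10, 11]]
```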

__25. What are the three rules of Einstein summation notation? Why?__

Repeated indices are implicitly summed over. Above, the index `k` is repeated, so we sum over it.

Each index can appear at most twice in any term. 

Each term must contain identical nonrepeated indices.


TODO: fill in why

__26. What are the forward pass and backward pass of a neural network?__

Forward pass: compute the output of the model (and the loss) from the inputs and the current weights.

Backward pass: compute the gradients of the loss with respect to each parameter, working from the last layer back to the first.

__27. Why do we need to store some of the activations calculated for intermediate layers in the forward pass?__

So that we can calculate the gradients in the backward pass: the gradient of each layer depends on the activations it received in the forward pass.

__28. What is the downside of having activations with a standard deviation too far away from 1?__

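The effect compounds from layer to layer. With a standard deviation above 1 the activations grow until they overflow to `nan`; with a standard deviation below 1 they shrink toward zero.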

__29. How can weight initialization help avoid this problem?__

We can scale the initial weights of each layer by $1/\sqrt{n_{in}}$, where $n_{in}$ is the number of inputs to the layer.

__30. What is the formula to initialize weights such that we get a standard deviation of 1 for a plain linear layer, and for a linear layer followed by ReLU?__

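For a plain linear layer, draw the weights from a standard normal distribution and multiply them by $1/\sqrt{n_{in}}$. For a linear layer followed by ReLU, multiply them by $\sqrt{2/n_{in}}$ instead (Kaiming initialization).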
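
A small simulation can make these scales concrete. This is a rough sketch in plain Python rather than PyTorch, assuming standard-normal inputs, no bias, a plain scale of $1/\sqrt{n_{in}}$ and a Kaiming scale of $\sqrt{2/n_{in}}$; the layer sizes, the seed and the helper `layer_output_std` are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)
n_in, n_out, n_samples = 64, 64, 64

def layer_output_std(inputs, scale):
    """Std of x @ w where w ~ N(0, 1) * scale (pure Python, no bias)."""
    w = [[random.gauss(0, 1) * scale for _ in range(n_out)] for _ in range(n_in)]
    outs = [sum(x[k] * w[k][j] for k in range(n_in))
            for x in inputs for j in range(n_out)]
    return statistics.pstdev(outs)

x = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_samples)]
x_relu = [[max(v, 0.0) for v in row] for row in x]   # inputs that already passed through ReLU

std_linear = layer_output_std(x, 1 / math.sqrt(n_in))            # plain linear layer
std_relu_naive = layer_output_std(x_relu, 1 / math.sqrt(n_in))   # ReLU inputs, plain scale
std_relu_kaiming = layer_output_std(x_relu, math.sqrt(2 / n_in)) # ReLU inputs, Kaiming scale
print(std_linear, std_relu_naive, std_relu_kaiming)
```

Under these assumptions the first and third values should land near 1, while the second should land near $\sqrt{1/2} \approx 0.71$, because ReLU zeroes out roughly half of the inputs.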

__31. Why do we sometimes have to use the `squeeze` method in loss functions?__

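To get rid of the trailing unit dimension of the model output. The output has shape `[n, 1]` while the targets have shape `[n]`; without `squeeze`, the subtraction would broadcast to shape `[n, n]` and the loss would be wrong.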

__32. What does the argument to the `squeeze` method do? Why might it be important to include this argument, even though PyTorch does not require it?__

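The argument tells PyTorch which axis to squeeze. Without it, `squeeze` removes every dimension of size 1, so a batch of size 1 would lose its batch dimension as well. Passing the axis explicitly, as in `squeeze(-1)`, removes only the dimension we intend to remove.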

__33. What is the "chain rule"? Show the equation in either of the two forms presented in this chapter.__

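The chain rule gives the derivative of a composed function as the product of the derivatives of its parts: $(g \circ f)'(x) = g'(f(x)) f'(x)$. Applied to the loss of our model:

$$\frac{\text{d} loss}{\text{d} b_{2}} = \frac{\text{d} loss}{\text{d} out} \times \frac{\text{d} out}{\text{d} b_{2}} = \frac{\text{d}}{\text{d} out} mse(out, y) \times \frac{\text{d}}{\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})$$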


__34. Show how to calculate the gradients of `mse(lin(l2, w2, b2), y)` using the chain rule.__

In [None]:
def lin(x, w, b): return x @ w + b
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
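
With those two functions defined, the gradients can be worked out by hand and compared against autograd. This is one possible sketch, not necessarily the chapter's exact code: the tensor sizes, the seed and the `*_g` variable names are assumptions made for the check.

```python
import torch

def lin(x, w, b): return x @ w + b
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

torch.manual_seed(0)
l2 = torch.randn(5, 4)                        # activations entering the last layer
w2 = torch.randn(4, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)
y = torch.randn(5)

out = lin(l2, w2, b2)
loss = mse(out, y)
loss.backward()                               # autograd reference values

with torch.no_grad():
    # d loss / d out: derivative of mean((out - y)^2) with respect to out
    out_g = 2. * (out.squeeze(-1) - y).unsqueeze(-1) / y.shape[0]
    # chain through lin, where out = l2 @ w2 + b2
    l2_g = out_g @ w2.t()                     # gradient passed to the previous layer
    w2_g = l2.t() @ out_g
    b2_g = out_g.sum(0)

print(torch.allclose(w2_g, w2.grad, atol=1e-6),
      torch.allclose(b2_g, b2.grad, atol=1e-6))
```

The gradient of the loss with respect to `out` is computed first, then multiplied through `lin`, which is the chain rule applied from the outside in.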


__35. What is the gradient of ReLU? Show it in math or code. (You shouldn't need to commit this to memory; try to figure it out using your knowledge of the shape of the function.)__

__36. In what order do we need to call the `*_grad` functions in the backward pass? Why?__

__37. What is `__call__`?__

__38. What methods must we implement when writing a `torch.autograd.Function`?__

__39. Write `nn.Linear` from scratch, and test it works.__

__40. What is the difference between `nn.Module` and fastai's `Module`?__