# Questionnaire Chapter 4: Training a Digit Classifier

1. How is a grayscale image represented on a computer? How about a color image?

> Images are represented by arrays whose values encode the content of the image. A greyscale image is a 2-dimensional array in which each pixel holds one of 256 integer values (0 to 255). Under the usual 8-bit convention 0 is black and 255 is white, with shades of grey in between; the book describes the MNIST digits the other way round, with 0 for the white background and 255 for black. For color images, three color channels (red, green, blue) are typically used, with a separate 2D array of 0-to-255 values for each channel. Here 0 means that none of that color is present and 255 means full-intensity red, green, or blue, so a pixel that is 0 in all three channels is black. The three 2D arrays are stacked into a single 3D array (a rank-3 tensor) representing the color image.
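
As a rough illustration of these layouts, the sketch below builds a tiny greyscale image and a tiny color image as NumPy arrays. The 4×4 size and the names `gray` and `color` are made up for the example, and the height × width × channel ordering is an assumption (PyTorch image tensors usually put the channel axis first).

```python
import numpy as np

# Hypothetical 4x4 greyscale image: one 8-bit value (0..255) per pixel.
gray = np.zeros((4, 4), dtype=np.uint8)
gray[1:3, 1:3] = 255          # set a 2x2 square in the middle to the maximum value

# Hypothetical 4x4 color image: three channels (red, green, blue) stacked
# along the last axis, giving a rank-3 array.
color = np.zeros((4, 4, 3), dtype=np.uint8)
color[..., 0] = 255           # maximum red, no green, no blue

print(gray.shape, gray.ndim)    # (4, 4) 2
print(color.shape, color.ndim)  # (4, 4, 3) 3
```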

2. How are the files and folders in the `MNIST_SAMPLE` dataset structured? Why?

> The `MNIST` dataset follows a common layout for ML datasets: separate folders for the training set and the validation (and/or test) set. Every practitioner could generate their own train/validation split of the data, but public datasets are usually pre-split to simplify comparing results between implementations and publications.
>
> The fastai `MNIST_SAMPLE` dataset has two subfolders, `train` and `valid`, each containing two subsubfolders, `3` and `7`, which hold the image files (`.png`) for the respective class. This is a common way of organizing datasets of pictures. The full `MNIST` dataset has 10 subsubfolders, one for the images of each digit.

3. Explain how the "pixel similarity" approach to classifying digits works.

> The main idea is to find the average pixel value for every pixel of the 3s, then do the same for the 7s. This will give two group averages, defining what we might call the "ideal" 3 and 7. Then, to classify an image as one digit or the other, we see which of these two ideal digits the image is most similar to.
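
The idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "images" are made-up 3×3 arrays of random values rather than real MNIST digits, and the helper names `distance` and `is_3` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the two classes: 3x3 "images" where the "threes"
# have low pixel values and the "sevens" have high pixel values.
threes = rng.uniform(0.0, 0.4, size=(20, 3, 3))
sevens = rng.uniform(0.6, 1.0, size=(20, 3, 3))

# The "ideal" digit is the per-pixel mean over all images of that class.
mean3 = threes.mean(axis=0)
mean7 = sevens.mean(axis=0)

def distance(a, b):
    """Mean absolute difference between two images."""
    return np.abs(a - b).mean()

def is_3(image):
    """True if the image is closer to the ideal 3 than to the ideal 7."""
    return distance(image, mean3) < distance(image, mean7)

print(is_3(threes[0]), is_3(sevens[0]))
```

With real data, `threes` and `sevens` would be the stacked training images of each class, and accuracy would be measured on the validation images.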

4. What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.

> Solution at the end of the notebook.

5. What is a "rank-3 tensor"?

> It's a tensor with 3 dimensions (axes). The rank of a tensor is the number of dimensions it has. In particular, the rank does not depend on the size of each axis, e.g., a tensor of shape 2x2x2 and a tensor of shape 3x5x7 both have rank 3. Note that the term “rank” has different meanings in the context of tensors and matrices (where it refers to the number of linearly independent column vectors).

6. What is the difference between tensor rank and shape? How do you get the rank from the shape?

> __*Rank*__ is the number of axes or dimensions in a tensor; __*shape*__ is the size of each axis of a tensor. The rank is the length of the shape: a tensor of shape `(3, 5, 7)` has rank 3.
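
A quick check of the relationship, using NumPy arrays as a stand-in for tensors (PyTorch tensors expose `.shape` and `.ndim` as well). The array names `a` and `b` are just for illustration.

```python
import numpy as np

a = np.zeros((2, 2, 2))
b = np.zeros((3, 5, 7))

# The shape lists the size of each axis; the rank is how many axes there are.
print(a.shape, len(a.shape), a.ndim)  # (2, 2, 2) 3 3
print(b.shape, len(b.shape), b.ndim)  # (3, 5, 7) 3 3
```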

7. What are RMSE and L1 norm?

> Mean absolute difference (MAE), also called the L1 norm, and root mean squared error (RMSE), also called the L2 norm, are two commonly used methods of measuring “distance”. Simply averaging the differences does not work because some differences are positive and others are negative, so they cancel each other out. Therefore, a function that focuses on the magnitudes of the differences is needed. The simplest option is to take the mean of the absolute values of the differences, which is what MAE is. RMSE takes the mean of the squared differences (squaring makes everything positive) and then takes the square root (which undoes the squaring). Because of the squaring, RMSE penalizes large errors more heavily than MAE.
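
A small worked example may help here. The numbers in `preds` and `targets` are invented; the point is only to show the signed differences cancelling while MAE and RMSE do not.

```python
import numpy as np

preds   = np.array([2.0, 4.0, 6.0, 8.0])
targets = np.array([1.0, 5.0, 6.0, 12.0])

diff = preds - targets               # [ 1., -1.,  0., -4.]

plain_mean = diff.mean()             # signs partly cancel: -1.0
mae  = np.abs(diff).mean()           # (1 + 1 + 0 + 4) / 4 = 1.5
rmse = np.sqrt((diff ** 2).mean())   # sqrt((1 + 1 + 0 + 16) / 4) = sqrt(4.5)

print(plain_mean, mae, rmse)
```

The single large error of 4 pulls RMSE (about 2.12) well above MAE (1.5).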

8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

> As loops are very slow in Python, it is best to express the calculation as array operations rather than looping over individual elements. When this is possible, NumPy or PyTorch can be thousands of times faster, as they run optimized, compiled C code underneath rather than pure Python. Even better, PyTorch can run the operations on a GPU, which gives a further significant speedup when the work can be done in parallel.
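
The sketch below contrasts the two styles on the same calculation. The array size and the formula `x * 2 + 1` are arbitrary choices for the example; the actual speedup depends on the machine and can be measured with `timeit`.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Element-by-element Python loop: one interpreter step per number.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2 + 1

# The same calculation as a single array expression, executed in compiled code.
vectorized = values * 2 + 1

print(np.array_equal(looped, vectorized))  # True
```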

9. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.

> Solution at the end of the notebook.

10. What is broadcasting?

> Scientific/numerical Python packages like NumPy and PyTorch implement broadcasting, which often makes code easier to write. When an operation involves tensors of different rank, the tensor with the smaller rank is automatically expanded (conceptually, without actually copying the data) to match the size of the larger-rank tensor. In this way, elementwise operations can be performed between tensors of different rank, for example adding a vector to every row of a matrix.
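
A minimal example with NumPy (PyTorch follows the same broadcasting rules). The names `matrix` and `row` and their values are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3), rank 2
row    = np.array([10, 20, 30])       # shape (3,),   rank 1

# `row` is treated as if it were repeated for each row of `matrix`.
result = matrix + row
print(result)
# [[11 22 33]
#  [14 25 36]]

# A scalar (rank 0) broadcasts against every element.
doubled = matrix * 2
print(doubled)
# [[ 2  4  6]
#  [ 8 10 12]]
```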

11. Are metrics generally calculated using the training set, or the validation set? Why?

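> We use the __*validation set*__ to calculate metrics. A metric computed on the training set would look good even for a model that has overfit, that is, one that works well only on our training data. Measuring on held-out data shows how well the model generalizes.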

12. What is SGD?

> Stochastic Gradient Descent (SGD) is an optimization algorithm that updates the parameters of a model to minimize a given loss function that was evaluated on the predictions and target. The main idea behind SGD is that the gradient of the loss function determines the best way to update the parameters to minimize the loss function.
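
To make the update rule concrete, here is a minimal sketch of mini-batch SGD fitting a straight line with NumPy. Everything specific in it is an assumption for illustration: the synthetic data (`y = 3x + 2` plus noise), the mean squared error loss, the learning rate, and the batch size. The gradients are written out by hand here; in PyTorch they would come from `backward`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 3x + 2 plus a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 2 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0        # parameters, initialised arbitrarily
lr = 0.1               # learning rate (assumed value)
batch_size = 20

for epoch in range(50):
    order = rng.permutation(len(x))            # shuffle each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]

        preds = w * xb + b                     # forward pass
        err = preds - yb
        loss = (err ** 2).mean()               # mean squared error on the batch

        grad_w = 2 * (err * xb).mean()         # d(loss)/dw
        grad_b = 2 * err.mean()                # d(loss)/db

        w -= lr * grad_w                       # step against the gradient
        b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

After training, `w` and `b` should end up close to the values 3 and 2 used to generate the data.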

13. Why does SGD use mini-batches?

> 

14. What are the seven steps in SGD for machine learning?
15. How do we initialize the weights in a model?
16. What is "loss"?
17. Why can't we always use a high learning rate?
18. What is a "gradient"?
19. Do you need to know how to calculate gradients yourself?
20. Why can't we use accuracy as a loss function?
21. Draw the sigmoid function. What is special about its shape?
22. What is the difference between a loss function and a metric?
23. What is the function to calculate new weights using a learning rate?
24. What does the `DataLoader` class do?
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
26. Create a function that, if passed two arguments `[1,2,3,4]` and `'abcd'`, returns `[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]`. What is special about that output data structure?
27. What does `view` do in PyTorch?
28. What are the "bias" parameters in a neural network? Why do we need them?
29. What does the `@` operator do in Python?
30. What does the `backward` method do?
31. Why do we have to zero the gradients?
32. What information do we have to pass to `Learner`?
33. Show Python or pseudocode for the basic steps of a training loop.
34. What is "ReLU"? Draw a plot of it for values from `-2` to `+2`.
35. What is an "activation function"?
36. What's the difference between `F.relu` and `nn.ReLU`?
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

```python
# 4. Create a list comprehension that selects odd numbers from a list and doubles them.
# List of numbers
nums = range(10)

# Double odd numbers
double_odd = [2*n for n in nums if n % 2 != 0]
double_odd
```

```
[2, 6, 10, 14, 18]
```

```python
# 9. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
import numpy as np

# Numbers 1..9 arranged as a 3x3 array
array = np.arange(1, 10).reshape(3, 3)
print(array, '\n')

# Double every element
array = array * 2
print(array, '\n')

# Bottom-right 2x2 block: rows 1 onward, columns 1 onward
array[1:, 1:]
```

```
[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[[ 2  4  6]
 [ 8 10 12]
 [14 16 18]] 

array([[10, 12],
       [16, 18]])
```

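```python
# 9 (continued). The same exercise with a PyTorch tensor.
import torch

# Numbers 1..9 as floats, arranged as a 3x3 tensor
tensor = torch.arange(1, 10, dtype=torch.float32).view(3, 3)
print(tensor, '\n')

# Double every element
tensor = 2 * tensor
print(tensor, '\n')

# Bottom-right 2x2 block: rows 1 onward, columns 1 onward
tensor[1:, 1:]
```

```
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]]) 

tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.],
        [14., 16., 18.]]) 

tensor([[10., 12.],
        [16., 18.]])
```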

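```python
# 26. Create a function...
def f(nums, letters):
    # Pair the i-th number with the i-th letter.
    return list(zip(nums, letters))

nums = range(1, 5)
letters = 'abcd'
# The result is a list of tuples, the same (input, label) structure
# that a PyTorch Dataset returns when indexed.
f(nums, letters)
```

```
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```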

### Deep learning terminology 

| Term | Meaning |
| ---- | ------- |
|ReLU | Function that returns 0 for negative numbers and doesn't change positive numbers. |
|Mini-batch | A small group of inputs and labels gathered together in two arrays. A gradient descent step is performed on this batch (rather than on a whole epoch). |
|Forward pass | Applying the model to some input and computing the predictions. |
|Loss | A value that represents how well (or badly) our model is doing. |
|Gradient | The derivative of the loss with respect to some parameter of the model. |
|Backward pass | Computing the gradients of the loss with respect to all model parameters. |
|Gradient descent | Taking a step in the direction opposite to the gradient to make the model parameters a little bit better. |
|Learning rate | The size of the step we take when applying SGD to update the parameters of the model. |