## Questionnaire

__1. What is the "head" of a neural net?__

The part of the net that is specialised for a particular task. For a CNN, this is usually the part after the adaptive average pooling layer.

__2. What is the "body" of a neural net?__

Everything other than the head, including the stem.

__3. What is "cutting" a neural net? Why do we need to do this for transfer learning?__

Cutting is the process of slicing a pretrained neural net at a chosen layer: we remove the layers after the cut (the head) and replace them with customised layers. This is necessary for transfer learning because the original head was specialised for the pretraining task, whereas the new head is specialised for ours.

__4. What is `model_meta`? Try printing it to see what's inside.__

A dict, keyed by architecture function, that stores the information needed to determine where the body ends and the head starts. We need to replace the head with our customised version for transfer learning.

In [None]:
[i for i in model_meta.keys()]

```
[<function fastai.vision.models.xresnet.xresnet18>,
 <function fastai.vision.models.xresnet.xresnet34>,
 <function fastai.vision.models.xresnet.xresnet50>,
 <function fastai.vision.models.xresnet.xresnet101>,
 <function fastai.vision.models.xresnet.xresnet152>,
 <function torchvision.models.resnet.resnet18>,
 <function torchvision.models.resnet.resnet34>,
 <function torchvision.models.resnet.resnet50>,
 <function torchvision.models.resnet.resnet101>,
 <function torchvision.models.resnet.resnet152>,
 <function torchvision.models.squeezenet.squeezenet1_0>,
 <function torchvision.models.squeezenet.squeezenet1_1>,
 <function torchvision.models.densenet.densenet121>,
 <function torchvision.models.densenet.densenet169>,
 <function torchvision.models.densenet.densenet201>,
 <function torchvision.models.densenet.densenet161>,
 <function torchvision.models.vgg.vgg11_bn>,
 <function torchvision.models.vgg.vgg13_bn>,
 <function torchvision.models.vgg.vgg16_bn>,
 <function torchvision.models.vgg.vgg19_bn>,
 <function torchvision.models.alexnet.alexnet>]
```

__5. Read the source code for `create_head` and make sure you understand what each line does.__

Code as of 11/04/2021:

```python
if concat_pool: nf *= 2
lin_ftrs = [nf, 512, n_out] if lin_ftrs is None else [nf] + lin_ftrs + [n_out]
bns = [first_bn] + [True]*len(lin_ftrs[1:])
ps = L(ps)
if len(ps) == 1: ps = [ps[0]/2] * (len(lin_ftrs)-2) + ps
actns = [nn.ReLU(inplace=True)] * (len(lin_ftrs)-2) + [None]
pool = AdaptiveConcatPool2d() if concat_pool else nn.AdaptiveAvgPool2d(1)
layers = [pool, Flatten()]
if lin_first: layers.append(nn.Dropout(ps.pop(0)))
for ni,no,bn,p,actn in zip(lin_ftrs[:-1], lin_ftrs[1:], bns, ps, actns):
    layers += LinBnDrop(ni, no, bn=bn, p=p, act=actn, lin_first=lin_first)
if lin_first: layers.append(nn.Linear(lin_ftrs[-2], n_out))
if bn_final: layers.append(nn.BatchNorm1d(lin_ftrs[-1], momentum=0.01))
if y_range is not None: layers.append(SigmoidRange(*y_range))
return nn.Sequential(*layers)
```

Line by line:

```python
if concat_pool: nf *= 2
```
If `concat_pool` is activated then we need to use the concat-pool trick. This requires us to multiply the number of filters by 2 since we concat max pool and average pool.

```python
lin_ftrs = [nf, 512, n_out] if lin_ftrs is None else [nf] + lin_ftrs + [n_out]
```
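Set the sizes of the linear layers. By default there is one hidden layer of 512 units between the `nf` inputs and the `n_out` outputs; if `lin_ftrs` is supplied, its sizes are placed between `nf` and `n_out` instead.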

```python
bns = [first_bn] + [True]*len(lin_ftrs[1:])
```

List of Batch Norms. True means that batch norm is applied.

```python
ps = L(ps)
if len(ps) == 1: ps = [ps[0]/2] * (len(lin_ftrs)-2) + ps
```
List of dropout probabilities. If only one value is given, it is used as-is for the last layer and halved for every earlier layer.


```python
actns = [nn.ReLU(inplace=True)] * (len(lin_ftrs)-2) + [None]
```

List of activations. ReLU is provided for every layer except the last one.
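To make the list-building lines concrete, here is a plain-Python sketch that reproduces them for some assumed inputs (`nf=512`, `n_out=37`, `ps=0.5`; these values are illustrative and not from the source). The helper name `head_config` is hypothetical, strings stand in for the `nn.ReLU` modules, and a plain list stands in for fastai's `L`.

```python
def head_config(nf, n_out, lin_ftrs=None, ps=0.5, first_bn=True, concat_pool=True):
    """Plain-Python re-creation of the list-building lines of create_head."""
    if concat_pool: nf *= 2  # max pool and avg pool outputs are concatenated
    lin_ftrs = [nf, 512, n_out] if lin_ftrs is None else [nf] + lin_ftrs + [n_out]
    bns = [first_bn] + [True] * len(lin_ftrs[1:])
    ps = list(ps) if isinstance(ps, (list, tuple)) else [ps]  # stand-in for L(ps)
    if len(ps) == 1: ps = [ps[0] / 2] * (len(lin_ftrs) - 2) + ps
    actns = ["relu"] * (len(lin_ftrs) - 2) + [None]  # strings stand in for nn.ReLU
    return lin_ftrs, bns, ps, actns

lin_ftrs, bns, ps, actns = head_config(nf=512, n_out=37)
print(lin_ftrs)  # [1024, 512, 37]
print(bns)       # [True, True, True]
print(ps)        # [0.25, 0.5]
print(actns)     # ['relu', None]
```

With the defaults this gives two linear layers: 1024 to 512 with dropout 0.25 and ReLU, then 512 to 37 with dropout 0.5 and no activation.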

```python
pool = AdaptiveConcatPool2d() if concat_pool else nn.AdaptiveAvgPool2d(1)
```

Create the pooling layer. If `concat_pool` is selected we use `AdaptiveConcatPool2d`. Otherwise we use `AdaptiveAvgPool2d`, which averages the activations into a map of dimension `1x1`.

```python
layers = [pool, Flatten()]
```

Add the pooling layer and the Flatten layer. Flatten removes the unit axes. 

```python
if lin_first: layers.append(nn.Dropout(ps.pop(0)))
```

If `lin_first` is True, add a dropout layer straight after the flatten, using (and removing) the first probability in `ps`.


```python
for ni,no,bn,p,actn in zip(lin_ftrs[:-1], lin_ftrs[1:], bns, ps, actns):
    layers += LinBnDrop(ni, no, bn=bn, p=p, act=actn, lin_first=lin_first)
```

Create the rest of the layers. Batch norm (`bn`), activation (`actn`) and dropout (`p`) for each layer are determined by the lists made earlier. `LinBnDrop` is a specialised class that combines `BatchNorm1d`, `Dropout` and `Linear` layers.

```python
if lin_first: layers.append(nn.Linear(lin_ftrs[-2], n_out))
```

If `lin_first` is True, add a final linear layer that maps the last hidden size (`lin_ftrs[-2]`) to the required number of outputs (`n_out`).

```python
if bn_final: layers.append(nn.BatchNorm1d(lin_ftrs[-1], momentum=0.01))
```

Add final batch norm layer with appropriate size.

```python
if y_range is not None: layers.append(SigmoidRange(*y_range))
```

Add a final sigmoid layer. This ensures that the outputs fall within the desired range.
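As a rough illustration of what a sigmoid range does, the sketch below scales and shifts a standard sigmoid so its output lies between `lo` and `hi`. It is written in plain Python for a single number rather than a tensor, and the range `(0, 5.5)` is just an example value.

```python
import math

def sigmoid_range(x, lo, hi):
    """Squash x into the open interval (lo, hi) with a scaled, shifted sigmoid."""
    return 1 / (1 + math.exp(-x)) * (hi - lo) + lo

for x in (-10.0, 0.0, 10.0):
    print(x, round(sigmoid_range(x, 0.0, 5.5), 3))
```

Large negative inputs land close to `lo`, large positive inputs close to `hi`, and an input of 0 lands at the midpoint.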

__6. Figure out how to change the dropout, layer size, and number of layers created by `create_cnn`, and see if you can find values that result in better accuracy from the pet recognizer.__



__7. What does `AdaptiveConcatPool2d` do?__

It applies both `AdaptiveAvgPool2d` and `AdaptiveMaxPool2d` to the input and concatenates the two results, which doubles the number of features.

```
(0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
```
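The idea can be sketched in plain Python on a single tiny feature map. This is only an illustration with nested lists, not the fastai module; the function name is made up, and putting the max values before the averages is an assumption of the sketch.

```python
def adaptive_concat_pool(fmap):
    """fmap: list of C channels, each an H x W nested list. Returns 2*C values."""
    flat = [[v for row in ch for v in row] for ch in fmap]
    max_pooled = [max(ch) for ch in flat]             # adaptive max pool to 1x1
    avg_pooled = [sum(ch) / len(ch) for ch in flat]   # adaptive avg pool to 1x1
    return max_pooled + avg_pooled                    # concatenate along channels

fmap = [[[1.0, 2.0], [3.0, 6.0]],   # channel 0
        [[0.0, 0.0], [4.0, 0.0]]]   # channel 1
print(adaptive_concat_pool(fmap))   # [6.0, 4.0, 3.0, 1.0]
```

Two input channels become four output features, which is why `create_head` doubles `nf` when `concat_pool` is set.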

__8. What is "nearest neighbor interpolation"? How can it be used to upsample convolutional activations?__

It is a way of increasing the grid size of the activation map. Every pixel in the grid is replaced with four pixels in a 2x2 square, each with the same value as the original pixel.
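A minimal plain-Python sketch of nearest neighbor upsampling on a 2D grid follows. The function name and the `scale` parameter are illustrative choices; with `scale=2` each value is copied into a 2x2 square.

```python
def upsample_nearest(grid, scale=2):
    """Repeat every value `scale` times along both rows and columns."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]    # repeat along width
        out.extend(list(wide) for _ in range(scale))     # repeat along height
    return out

print(upsample_nearest([[1, 2], [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```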

__9. What is a "transposed convolution"? What is another name for it?__

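A transposed convolution inserts zero padding between all the pixels in the input and then applies a regular convolution, which increases the output size. It is also known as a stride half convolution. This explains it pretty well: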

![img](https://raw.githubusercontent.com/fastai/fastbook/master/images/att_00051.png)

__10. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.__


In [None]:
!wget https://storage.googleapis.com/pr-newsroom-wp/1/2020/05/IMG_1874-copy-1.jpg # joe rogan

In [None]:
import torchvision, PIL
x = torchvision.transforms.functional.to_tensor(PIL.Image.open('IMG_1874-copy-1.jpg'))
x = x[None,:] # add unit axes (represents batch)
x.shape
# [81]: torch.Size([1, 3, 733, 1920])

In [None]:
def run_transposed_conv(stride = 2):
  # pass the argument through; the stride was previously hard-coded to 2
  transposed_conv = ConvLayer(3, 3, stride=stride, transpose=True) # n out channels is arbitrarily set to 3
  return transposed_conv(x)

In [None]:
y = run_transposed_conv(stride = 2)
y.shape
# [82]: torch.Size([1, 3, 1467, 3841])

In [None]:
y = run_transposed_conv(stride = 4)
y.shape
# [82]: torch.Size([1, 3, 2931, 7679])

The spatial dimensions of the activation map have increased by approximately a factor of the stride.

i.e. (1467, 3841) is approximately equal to (733*2, 1920*2)

i.e. (2931, 7679) is approximately equal to (733*4, 1920*4)
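The "approximately" can be made exact with the output-size formula from the PyTorch `ConvTranspose2d` documentation. The sketch below assumes a kernel size of 3 and no padding for the transposed `ConvLayer`; those defaults are an assumption here, although they do reproduce the shapes printed above.

```python
def conv_transpose_out(size, stride, ks=3, padding=0, output_padding=0, dilation=1):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + dilation * (ks - 1) + output_padding + 1

for stride in (2, 4):
    print(stride, conv_transpose_out(733, stride), conv_transpose_out(1920, stride))
# 2 1467 3841
# 4 2931 7679
```

So the output is `stride * (size - 1) + ks` under these assumptions, which is slightly less than `stride * size` whenever the stride is larger than `ks / (stride - 1)`... in short, close to but not exactly a multiple of the input size.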

__11. Draw the U-Net architecture.__

TODO

__12. What is "BPTT for Text Classification" (BPT3C)?__

It's a classifier that contains a for loop over the batches of a text sequence. State is maintained across batches and the activations of each batch are stored. At the end, concat pooling (average and max) is applied over the RNN sequence, rather than over a CNN grid, to get an output of fixed size.
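A toy sketch of that loop in plain Python is shown below. The "RNN" is just a scalar update function, the names are hypothetical, and including the last activation next to the max and the mean is an assumption of this sketch; it is meant to show the control flow, not the fastai implementation.

```python
def classify_features(tokens, bptt, step, state=0.0):
    """Run a toy RNN over `tokens` in chunks of `bptt`, then concat-pool."""
    stored = []
    for start in range(0, len(tokens), bptt):    # loop over chunks of the sequence
        for tok in tokens[start:start + bptt]:
            state = step(state, tok)             # state carries across chunks
            stored.append(state)                 # keep every activation
    # concat pooling over the sequence: last, max and mean activation
    return [stored[-1], max(stored), sum(stored) / len(stored)]

def toy_step(h, x):
    return 0.5 * h + x

print(classify_features([1.0, 2.0, 3.0, 4.0], bptt=2, step=toy_step))
# [6.125, 6.125, 3.46875]
```

Because the state is carried from one chunk to the next, the result does not depend on the chunk size; chunking only limits how much of the sequence is processed at once.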

__13. How do we handle different length sequences in BPT3C?__

Padding.

We find the longest text in the batch, then use the special token `xxpad` to fill up every text that is shorter than it.
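A minimal sketch of padding a batch of tokenised texts to a common length is given below. The function name is made up, and padding on the right is simply the choice made for this illustration; the side on which the padding goes is not specified here.

```python
def pad_batch(seqs, pad_tok="xxpad"):
    """Pad every token list to the length of the longest one in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_tok] * (longest - len(s)) for s in seqs]

batch = [["xxbos", "i", "liked", "it"], ["xxbos", "no"]]
for row in pad_batch(batch):
    print(row)
# ['xxbos', 'i', 'liked', 'it']
# ['xxbos', 'no', 'xxpad', 'xxpad']
```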

__14. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.__

Forward function

```python
def forward(self, x_cat, x_cont=None):
    if self.n_emb != 0:
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
    if self.n_cont != 0:
        if self.bn_cont is not None: x_cont = self.bn_cont(x_cont)
        x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
    return self.layers(x)

```

In [None]:
from fastai.tabular.model import TabularModel

In [None]:
model = TabularModel(emb_szs = [[x.shape[0],x.shape[-1]]], n_cont = 0, out_sz = 10, layers = [500])
model

```
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(3, 1920)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(0, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(1920, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=1920, out_features=500, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): Linear(in_features=500, out_features=10, bias=True)
    )
  )
)
```

In [None]:
model(x)

```
tensor([[[ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         ...,
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01]],

        [[ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         ...,
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01]],

        [[ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         ...,
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.9432e-01],
         [ 2.1888e-01, -2.8956e-02,  1.6438e-01,  ...,  1.9798e-04,  2.4943e-01, -2.
```

In [None]:
# if self.n_emb != 0:
model.n_emb!=0
# [208]: True

In [None]:
y = [e(x[:,i]) for i,e in enumerate(model.embeds)]
y[0].shape
# [208]: torch.Size([3, 1920, 1920])

In [None]:
y = torch.cat(y, 1)
y.shape
# [209]: torch.Size([3, 1920, 1920])

In [None]:
y = model.emb_drop(y)
y.shape
# [210]: torch.Size([3, 1920, 1920])

In [None]:
# if self.n_cont != 0:
model.n_cont
# [211]: 0

In [None]:
# if self.bn_cont is not None: 
model.bn_cont
# [212]: BatchNorm1d(0, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

In [None]:
# x_cont = self.bn_cont(x_cont)

In [None]:
# x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont

In [None]:
# return self.layers(x)
model.layers(y).shape
# [215]: torch.Size([3, 1920, 10])

__15. How is `self.layers` defined in `TabularModel`?__

Relevant lines of code:

```python
_layers = [LinBnDrop(sizes[i], sizes[i+1], bn=use_bn and (i!=len(actns)-1 or bn_final), p=p, act=a)
                       for i,(p,a) in enumerate(zip(ps+[0.],actns))]
self.layers = nn.Sequential(*_layers)

```

This creates the layers using the `LinBnDrop` class. The activation for each layer comes from the list `actns` and the dropout probability from `ps`; `ps+[0.]` appends a probability of 0 so that the final layer gets no dropout.

__Breakdown__

The following determines if the layer should use Batch Norm:

```python
use_bn and (i!=len(actns)-1 or bn_final)
```

Cases where Batch Norm is applied:
1. `use_bn` is True and `bn_final` is True => Batch Norm applied at that layer
2. `use_bn` is True and `i!=len(actns)-1` (ie. we are not at the last layer) => Batch Norm applied

In all other cases Batch Norm is not applied.
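The cases above can be checked with a small plain-Python sketch that evaluates the same boolean expression for every layer index. The helper name `bn_flags` and the choice of three layers are illustrative only.

```python
def bn_flags(n_layers, use_bn, bn_final):
    """Evaluate `use_bn and (i != n_layers-1 or bn_final)` for each layer index."""
    return [use_bn and (i != n_layers - 1 or bn_final) for i in range(n_layers)]

print(bn_flags(3, use_bn=True,  bn_final=False))  # [True, True, False]
print(bn_flags(3, use_bn=True,  bn_final=True))   # [True, True, True]
print(bn_flags(3, use_bn=False, bn_final=True))   # [False, False, False]
```

`bn_final` only matters for the last layer, and `use_bn=False` switches batch norm off everywhere regardless of `bn_final`.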


The rest of the code is fairly straightforward.

__16. What are the five steps for preventing over-fitting?__

1. More data
2. Augmentation
3. Generalisable architectures (eg. BatchNorm)
4. Regularization (eg. Dropout)
5. Reduce architecture complexity

__17. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?__

More parameters allow your model to learn more subtle relationships in the data, and a smaller model would miss these relationships. This is why a larger model is recommended, with reduced complexity tried only after the other approaches.