- **KNN**
    - how it works
    - *k* is a hyper-parameter: it is not something we learn from the data, we set it manually before training starts.
    - *model capacity*: by changing *k*, we change the model capacity, namely a measurement of **how complex/flexible** the functions that can be learned are.

- **Linear classifiers**
    - $f(x; \theta) = Wx + b = scores$
    - *as template matching*: we can think of each row of the W matrix as a template applied to the input image for a particular class; matching the input against this template gives the score for that class.
    - *loss*: the loss is a proxy measure that is correlated with "how good our model is"
        - 0-1 loss: this loss counts the number of errors that our classifier makes. However, it is difficult to optimize because it is piecewise constant: if we move our decision boundary a bit, the loss will probably keep the exact same value (so the gradient gives no information), and it jumps suddenly only when an example crosses the boundary.
    - *softmax*: is an activation function that transforms scores computed by the model into probabilities (generating a probability distribution over the predictions)
        - $softmax(s)_j = \frac{\exp(s_j)}{\sum_{k} \exp(s_k)}$
    - *cross-entropy loss*: is $-\log$ of the softmax probability assigned to the true class, i.e. $-\log(softmax(f(x;\theta))_y)$ where $y$ is the true class. <br> If the true class is "bird" and the predicted probability for "bird" is 0.9, then we have $-\log(0.9) \approx 0.1$, a low value: this example contributes just 0.1 to the total loss. If instead the true label is "car" and the predicted probability for "car" is 0.09, we need to penalize this because we want a high probability for the true class; in fact $-\log(0.09) \approx 2.4$ gives us a high value.
    - *gradient descent*:
        - for each epoch:
            - forward pass: classify all training data and compute the loss
            - backward pass: compute the gradients wrt the parameters
            - step: update the parameters by subtracting the gradients multiplied by a learning rate
        - *problems*: for just one update, we perform $D_n$ (number of training examples) forward and backward passes. So we take a further approximation of the gradient, computed on a batch of images (SGD).
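
As a rough sketch of the gradient descent loop above, the following NumPy code trains a linear softmax classifier with full-batch updates. The toy data, the zero initialization, `lr=0.1`, and all function names are assumptions made for illustration; they do not come from the notes.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max is a standard stability trick."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, y):
    """Mean cross-entropy of the linear classifier scores = X @ W.T + b."""
    probs = softmax(X @ W.T + b)
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    return loss, probs

def gradient_descent_step(W, b, X, y, lr):
    """One full-batch step: forward pass, backward pass, parameter update."""
    loss, probs = cross_entropy(W, b, X, y)      # forward pass on all data
    dscores = probs.copy()
    dscores[np.arange(len(y)), y] -= 1.0         # d loss / d scores = probs - one_hot
    dscores /= len(y)
    grad_W = dscores.T @ X                       # shape (classes, features)
    grad_b = dscores.sum(axis=0)
    return W - lr * grad_W, b - lr * grad_b, loss

# Toy data (made up for illustration): 6 points, 2 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([0, 1, 2, 0, 1, 2])
W, b = np.zeros((3, 2)), np.zeros(3)

losses = []
for epoch in range(50):
    W, b, loss = gradient_descent_step(W, b, X, y, lr=0.1)
    losses.append(loss)
```

Each call to `gradient_descent_step` touches every training example once, which is exactly the cost that motivates switching to mini-batches.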

- **Optimizers**
- SGD
    - SGD with minibatches: we compute the gradient on one batch and update the parameters right after that batch is processed.
    - online learning: SGD with minibatches where *batch_size* = 1
    - *batch size* becomes a hyper-parameter:
        - larger batches provide smoother estimation of the gradient and exploit hardware parallelism
        - smaller batch size may have a regularization effect
    - advantages: SGD with minibatches is faster because we do more updates even though the gradient is more approximate wrt standard GD.
    - problems of SGD:
        - a "spherical" loss landscape would enjoy faster convergence by means of *higher learning rates*, but real loss landscapes are usually not spherical, so we usually have to use small learning rates
        - gradients estimated by mini-batches are noisy
        - *critical points* where the gradient is 0 other than global optima:
            - saddle points
            - local optima
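
A minimal sketch of SGD with mini-batches as described in the SGD item above: the data is shuffled once per epoch and the parameters are updated after every batch. The least-squares problem, the helper names, and the values `lr=0.05` and `batch_size=32` are illustrative assumptions.

```python
import numpy as np

def minibatches(n_examples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

def sgd_epoch(theta, X, y, grad_fn, lr, batch_size, rng):
    """One epoch of mini-batch SGD: one parameter update per batch."""
    for idx in minibatches(len(X), batch_size, rng):
        theta = theta - lr * grad_fn(theta, X[idx], y[idx])
    return theta

# Illustrative loss (an assumption): L = mean((X @ theta - y) ** 2)
def least_squares_grad(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
for epoch in range(30):
    theta = sgd_epoch(theta, X, y, least_squares_grad,
                      lr=0.05, batch_size=32, rng=rng)
```

Setting `batch_size=1` turns the same loop into online learning, and `batch_size=len(X)` recovers standard gradient descent.
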
- Momentum
    - momentum interprets our parameters as a point moving on the landscape of our loss, with the detail that **when it moves, it gains velocity**. "A ball rolling down the surface of the loss should keep the velocity it gains", and this should help to navigate the loss landscape better and faster.
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - $\beta$ is a value strictly less than 1
    - $v^{(t + 1)}$ contains a running (exponentially decaying) accumulation of the previous update steps (if before we were moving at a certain velocity $x$, our next velocity will depend on that previous velocity)
    - *advantages*: 
        - momentum reduces the effect of noise
        - faster convergence
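
The two momentum update equations above can be sketched as follows. The elongated quadratic loss and the values `lr=0.02`, `beta=0.9` are assumptions chosen only to have something to run.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr, beta):
    """v <- beta * v - lr * grad(theta);  theta <- theta + v."""
    v = beta * v - lr * grad_fn(theta)
    return theta + v, v

# Illustrative loss (an assumption): L = 0.5 * (x**2 + 25 * y**2)
def grad_fn(theta):
    return np.array([1.0, 25.0]) * theta

theta, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad_fn, lr=0.02, beta=0.9)
```

With `beta=0` the velocity carries no history and the update reduces to plain gradient descent.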

- Nesterov momentum
    - very similar to *momentum*, but the gradient is computed after having "partially" updated $\theta^{(t)}$ with $\beta v^{(t)}$:
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)} + \beta v^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - Nesterov momentum shows a faster convergence
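
A small sketch of the Nesterov update above; the only difference from plain momentum is that the gradient is evaluated at the look-ahead point $\theta^{(t)} + \beta v^{(t)}$. The quadratic loss and the hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr, beta):
    """Evaluate the gradient after the partial update theta + beta * v."""
    lookahead = theta + beta * v
    v = beta * v - lr * grad_fn(lookahead)
    return theta + v, v

# Illustrative loss (an assumption): L = 0.5 * (x**2 + 25 * y**2)
def grad_fn(theta):
    return np.array([1.0, 25.0]) * theta

theta, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn, lr=0.02, beta=0.9)
```
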
- AdaGrad
    - Adaptive Gradient (AdaGrad) rescales each entry of the gradient by the inverse of the square root of the history of its squared gradients
    - $s^{(t + 1)} = s^{(t)} + \nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $s^{(t)}$ is monotonically increasing: it may shrink the effective learning rate too early, even when we are still far from a good minimum
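
The AdaGrad equations above can be sketched like this, recording $s$ at every step to make its monotonic growth visible. The quadratic loss and `lr=0.5` are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, s, grad_fn, lr=0.5, eps=1e-8):
    """s accumulates squared gradients; each coordinate gets its own step size."""
    grad = grad_fn(theta)
    s = s + grad * grad
    theta = theta - lr / (np.sqrt(s) + eps) * grad
    return theta, s

# Illustrative loss (an assumption): L = 0.5 * (x**2 + 25 * y**2)
def grad_fn(theta):
    return np.array([1.0, 25.0]) * theta

theta, s = np.array([10.0, 1.0]), np.zeros(2)
s_history = []
for _ in range(100):
    theta, s = adagrad_step(theta, s, grad_fn)
    s_history.append(s)
s_history = np.array(s_history)   # never decreases along the step axis
```

On the very first step the update is `lr` in every coordinate regardless of the gradient scale, since $\sqrt{s}$ equals the absolute gradient.
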
- RMSProp
    - in practice we do not use AdaGrad, but a modification of it, RMSProp. The idea is: since $s^{(t)}$ grows a lot, let's down-weight its history a bit. So, instead of just accumulating squared gradients into $s$, we keep an exponential moving average of the squared gradients. In practice we keep most of the past history (using a $\beta$ parameter like 0.9) and add a *small contribution from the present squared gradients*, weighted by $(1 - \beta)$, *in order to prevent the history from growing indefinitely.*
    - this turned out to work better because the optimizer stays active: it reacts to changes in the loss. However, it is a bit nervous; we will see that ADAM handles this.
    - $s^{(t + 1)} = \beta s^{(t)} + (1 - \beta)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $\beta$ typically = $0.9$ or higher
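
A sketch of the RMSProp equations above. A constant gradient is used here (an assumption, purely for illustration) to show that the moving average $s$ settles at the squared gradient instead of growing without bound.

```python
import numpy as np

def rmsprop_step(theta, s, grad_fn, lr=0.01, beta=0.9, eps=1e-8):
    """s is an exponential moving average of the squared gradients."""
    grad = grad_fn(theta)
    s = beta * s + (1.0 - beta) * grad * grad
    theta = theta - lr / (np.sqrt(s) + eps) * grad
    return theta, s

# Constant gradient (illustrative assumption) to show that s stays bounded
def constant_grad(theta):
    return np.array([3.0, -4.0])

theta, s = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, s = rmsprop_step(theta, s, constant_grad)
```
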
    
- ADAM
    - ADAM follows the idea of RMSProp: we keep an exponential moving average of the squared gradients, which prevents the history $s$ from growing indefinitely, and use this history to adapt the learning rate. In addition, we keep an exponential moving average of the gradients themselves. This leads ADAM along a smoother, less nervous path because the gradients themselves are smoothed.
    - **bias correction**: since $g^{(0)} = s^{(0)} = 0$, the first steps rely mostly on the history (which is zero) rather than on the current gradient. The first values of $g$ and $s$ will therefore be very small: with $\beta_1 = 0.9$, $g^{(1)} = 0.9 \cdot 0 + (1 - 0.9) \cdot gradient$, so only 0.1 of the gradient contributes to the first step, which results in a slow start of the optimizer. To counter this, a bias correction is applied to both the gradient average and the history of the squared gradients.
    - $g^{(t + 1)} = \beta_1 g^{(t)} + (1 - \beta_1)\nabla L (\theta ^{(t)})$
    - $s^{(t + 1)} = \beta_2 s^{(t)} + (1 - \beta_2)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$
    - $g^{debiased} = \frac{g^{(t+1)}}{1-\beta_1^{t+1}}$, $s^{debiased} = \frac{s^{(t+1)}}{1-\beta_2^{t+1}}$
    - $\theta^{(t+1)} = \theta^{(t)} - \frac{lr}{\sqrt{s^{debiased}}+ \epsilon} * g^{debiased}$
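
The four ADAM equations above can be sketched as follows, with `t` counted from 0 so that the debiasing exponent is `t + 1`, as in the formulas. The quadratic loss and `lr=0.1` are illustrative assumptions; `beta1=0.9`, `beta2=0.999` are commonly used defaults, not values stated in the notes.

```python
import numpy as np

def adam_step(theta, g_avg, s, t, grad_fn,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step; t is the 0-based index of the current step."""
    grad = grad_fn(theta)
    g_avg = beta1 * g_avg + (1.0 - beta1) * grad       # EMA of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad        # EMA of squared gradients
    g_hat = g_avg / (1.0 - beta1 ** (t + 1))           # bias correction
    s_hat = s / (1.0 - beta2 ** (t + 1))
    theta = theta - lr / (np.sqrt(s_hat) + eps) * g_hat
    return theta, g_avg, s

# Illustrative loss (an assumption): L = 0.5 * (x**2 + 25 * y**2)
def grad_fn(theta):
    return np.array([1.0, 25.0]) * theta

theta, g_avg, s = np.array([10.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(500):
    theta, g_avg, s = adam_step(theta, g_avg, s, t, grad_fn, lr=0.1)
```

Thanks to the bias correction, the very first step already has size `lr` in each coordinate: the debiased averages equal the raw gradient and its square.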

- **Activation function**
    - what does a ReLU add? It improves the chances that the new representation is linearly separable.

- convolution
    - properties
    - formula of learnable parameters
    - formula of MB to store in memory
    - formula of flops
- input image -- conv2D -- output shapes relationships (also for multiple layers)
- formula for H_out and W_out
    - after a conv2D
    - after a conv2D with padding
    - after a conv2D with padding and stride
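
For the output-shape items above, a small helper using the standard relation $H_{out} = \lfloor (H_{in} + 2P - K) / S \rfloor + 1$ (and the same for $W_{out}$). It assumes a square kernel, symmetric padding and no dilation; the function name and the example sizes are illustrative.

```python
def conv2d_output_size(h_in, w_in, kernel, padding=0, stride=1):
    """Spatial output size of a conv2D (square kernel, no dilation)."""
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    return h_out, w_out

plain = conv2d_output_size(32, 32, kernel=5)                          # (28, 28)
padded = conv2d_output_size(32, 32, kernel=3, padding=1)              # (32, 32)
strided = conv2d_output_size(224, 224, kernel=7, padding=3, stride=2) # (112, 112)
```

For multiple layers, the same function can be applied repeatedly, feeding each layer's output size into the next.
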
- pooling layers
    - pros and cons
    - formula of learnable parameters
    - formula of MB to store in memory
    - formula of flops
- Batch Normalization
    - internal covariate shift
    - training time
    - test time
    - pros and cons

- AlexNet
    - trends
    - general performance
- ZFNet / Clarify
    - (visualization of kernels and activations)
    - general performance
- VGG
    - stages
    - its three main choices
    - (no stem)
    - general performance
- Inception v1
    - (stem - inception modules - GlobalAVGPooling)
    - naive inception module
    - 1x1 conv
    - real inception module
    - GlobalAVGPooling
    - Inception v3
    - general performance
- Residual Networks
    - residual block
        - how it is formed
            - skip connections
            - two 3x3 convs
        - halves spatial resolution, doubles channels
        - uses stem and GlobalAVGPooling
    - skip connection dimension problem
        - problem: the 1x1 conv with s=2 discards 3/4 of the pixels
            - solved by adding a 2x2 AvgPool layer before the 1x1 conv with s=2
    - bottleneck residual block
    - effects of residual learning
- ResNeXt
    - idea
    - argue the growing complexity of 3x3 convs
        - compute flops and solve for *d*
    - why is the ResNeXt idea a good one?
    - grouped convs
- SENet
    - capture global context
    - squeeze part: GlobalAVGPooling
    - excitation part: outputs weights to reweight the channels
        - *r* reduction factor
        - relative importance of channels
- Depthwise Separable conv
    - extreme grouped conv with #groups==C
- Inverted residual block
    - why bottleneck residual blocks are not ok
    - expansion - process - compression
    - *t* expansion rate
    - MobileNet-v2
        - stack of inverted residual blocks
- Wide ResNet
    - ResNet with channels multiplied by a factor *k*
- EfficientNet
    - "what is the optimal way to scale up a model?"
    - single dimension scaling
        - all three saturate at 80%
    - compound scaling: scaling W, D and R in an optimal way to improve the most we can
        - compound scaling $\phi$
        - formulation
    - NAS (Neural Architecture Search)

- model capacity
    - factors that influence it
- regularization
    - increases bias, paying with a higher training error
- parameter norm penalties
    - optimize an additional, conflicting term of the loss that says:
        - "we want our params to have small values"
        - Lambda hyperparameter
    - weight decay
- early stopping
- label smoothing
    - problem of one hot encoding of labels
        - making model overly confident: overfitting
    - better alternative: smooth the labels
        - this accounts for mislabeled examples
    - how to apply label smoothing
    - KLDiv loss
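
One common formulation of label smoothing, sketched under the assumption that a total mass `epsilon` is spread uniformly over all classes; the notes do not state which variant is meant, so the function and `epsilon=0.1` are illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, epsilon=0.1):
    """Replace one-hot targets by (1 - epsilon) * one_hot + epsilon / num_classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

# Two examples with true classes 2 and 0, out of 4 classes
targets = smooth_labels(np.array([2, 0]), num_classes=4, epsilon=0.1)
# targets[0] is [0.025, 0.025, 0.925, 0.025]
```

Each row still sums to 1, so the smoothed targets remain a valid probability distribution to compare against the softmax output.
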
- dropout
    - in forward pass we use a subset of the network
        - hyperparameter *p* zeroing activation
    - why is this a good idea?
        - prevents feature detectors from co-adapting
            - face detector example
    - test time preds are stochastic
        - value at test time
        - expected value at training time with p=0.5
            - example
            - inverted dropout
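
A minimal sketch of inverted dropout for the items above, assuming *p* is the probability of zeroing an activation. Surviving activations are rescaled by `1 / (1 - p)` during training so that the expected value matches the untouched activations used at test time. The function name and the all-ones input are illustrative.

```python
import numpy as np

def inverted_dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training:
        return activations                       # test time: deterministic, unchanged
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
train_out = inverted_dropout(x, p=0.5, rng=rng, training=True)
test_out = inverted_dropout(x, p=0.5, rng=rng, training=False)
```

With `p=0.5` each training output is either 0 or 2, so its average stays close to the test-time value of 1.
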
- data augmentation
    - multi-scale training
    - multi-scale testing
        - domain shift problem
        - second alternative to multi-scale testing
- color augmentation (jittering)
- cutout
- Mixup
    - linear combination of two images according to a weight lambda
        - lambda picked from a Beta distribution
    - why is it a good idea?
        - constrains what the network does between classes
    - testing
        - unmodified input
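
A short sketch of Mixup as outlined above: two inputs are blended with a weight `lam` drawn from a Beta distribution, and their one-hot labels are blended with the same weight (as in the original Mixup formulation). The random "images", the two-class labels and `alpha=0.2` are assumptions for illustration.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha, rng):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
img_a, img_b = rng.random((8, 8, 3)), rng.random((8, 8, 3))   # stand-in images
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

x_mix, y_mix, lam = mixup(img_a, label_a, img_b, label_b, alpha=0.2, rng=rng)
```

At test time no mixing is applied; the network simply receives the unmodified input.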

- learning rate schedule
    - step decay 
    - cosine decay
    - linear
    - warm-up
        - to use when our training loss stays flat for a long time
    - one cycle
        - update the learning rate after each iteration, not each epoch
        - vary momentum
- random hyper-parameter search
- recipe to train a NN
    - test time: ensemble
    - snapshot ensembling
        - uses cyclic cosine decay
        - majority voting at test time over M models
    - Polyak average
        - exponential moving average of parameters
    - Stochastic Weight Averaging
        - uses cyclic learning rates
        - real running average only when the learning rate is decreased

- Transfer Learning
    - First way
        - freeze backbone and train just the last layer
    - Second way
        - train everything
            - discrepancy between last layer and backbone
            - keep the backbone frozen for a few epochs until the last layer reaches a good region of the loss landscape
            - unfreeze the backbone and train with a lower lr (e.g. 1e-4 if it was 1e-3)
            - Progressive LRs: first layers are ok so we freeze them
                - a growing lr as we go deeper into the net, where features are more task specific

- Detecting multiple objects
    - problem 1: background
    - problem 2: too many possible windows
    - solution: region proposal
        - apply Selective Search to come up with regions that are likely to contain objects
- R-CNN
    - run Selective Search to come up with, for example, 2000 proposals
    - for each of these proposals:
        - warp it adding 16 pixels of context
        - pass through the Net
        - get class and BB correction
    - problem: really slow
- Fast R-CNN
    - still run Selective Search to come up with for example 2000 proposals
    - run full image up to a certain conv layer (like conv5) only once
    - project the proposal into the resulting activation
    - use a RoIPool layer to crop the projections and resize them to the right shape
    - advantage: the 2000 proposals pass only through a small, inexpensive part of the net
        - which are the FC layers at the end

QUESTIONS:
- dilated convs
    - why they are useful, their advantages
- what algo do we use to train a NN
    - what are the hyperparameters that influence the training (learning rate, batch size)
    - effect of learning rate
    - effect of smaller and bigger batch size
- regularization
    - approaches we use to improve it
        - label smoothing
            - why is useful, how it works
            - softmax and CE formula
- metric learning
    - why we need triplet loss, what it is and what it improves
    - contrastive loss vs triplet loss
    - triplet loss formula
    - do we take all possible triplets or just a subset?
        - semi hard negative mining
            - how we define an example being semi hard negative?