- **KNN**
    - how it works
    - *k* is a hyper-parameter: it is not learned from the data, we set it manually before training starts.
    - *model capacity*: by changing *k*, we change the model capacity, namely a measurement of **how complex/flexible** the functions that can be learned are.
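
A minimal sketch of how KNN classifies a point, assuming Euclidean distance and majority voting; the function name `knn_predict` and the toy 2D data are made up for illustration. A small *k* gives a more flexible decision boundary, a large *k* a smoother one.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2D data (hypothetical, only for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=1))
```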

- **Linear classifiers**
    - $f(x; \theta) = Wx + b = scores$
    - *as template matching*: we can think of each row of the W matrix as a template on the input image for a particular class, and out of this template matching we get a score for that class.
    - *loss*: the loss is a proxy measure that is correlated with "how good our model is"
        - 0-1 loss: this loss counts the number of errors made by our classifier. However, it is difficult to optimize because it is piecewise constant: if we move our decision boundary a bit, the loss will probably keep the exact same value (zero gradient), and it only changes through jumps/sudden changes.
    - *softmax*: is an activation function that transforms scores computed by the model into probabilities (generating a probability distribution over the predictions)
        - $softmax(s)_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}$
    - *cross-entropy loss*: is $-\log(softmax(f(x;\theta))_y)$, i.e. the negative log of the probability assigned to the true class $y$. <br> If the true class is "bird" and the predicted probability for "bird" is 0.9, then we get $-\log(0.9) \approx 0.1$, which is a low value: this example contributes just 0.1 to the total loss. If instead the true label is "car" and the predicted probability for "car" is 0.09, we need to penalize this because we want a high probability for the true class; in fact $-\log(0.09) \approx 2.4$ gives us a high value.
    - *gradient descent*:
        - for each epoch:
            - forward pass: classify all training data and compute the loss
            - backward pass: compute the gradients wrt the parameters
            - step: update the parameters by subtracting the gradients multiplied by a learning rate (i.e. move in the direction opposite to the gradient)
        - *problems*: for just one update, we perform $D_n$ (number of training examples) forward and backward passes. So we take a further approximation of the gradient, computed on a batch of images (SGD).
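
The softmax and cross-entropy formulas above can be checked numerically with a small sketch. The scores below are hypothetical, and subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability; the output is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, true_class):
    # Negative log of the probability assigned to the correct class.
    return -math.log(softmax(scores)[true_class])

scores = [2.0, 1.0, 0.1]  # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs, sum(probs))           # probabilities summing to 1
print(cross_entropy(scores, 0))    # low loss: class 0 has the highest score
print(cross_entropy(scores, 2))    # higher loss: class 2 has the lowest score
print(-math.log(0.9), -math.log(0.09))  # the two values used in the example above
```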

- **Optimizers**
- SGD
    - SGD with minibatches: we compute the gradient and then update the parameters only after one batch is processed.
    - online learning: is SGD with minibatches with *batch_size* = 1
    - *batch size* becomes a hyper-parameter:
        - larger batches provide smoother estimation of the gradient and exploit hardware parallelism
        - smaller batch size may have a regularization effect
    - advantages: SGD with minibatches is faster because we do more updates even though the gradient is more approximate wrt standard GD.
    - problems of SGD:
        - loss landscapes are usually not "spherical": a spherical landscape would enjoy faster convergence by means of *higher learning rates*, but in practice we usually have to use small learning rates
        - gradients estimated by mini-batches are noisy
        - *critical points* where the gradient is 0 other than global optima:
            - saddle points
            - local optima
- Momentum
    - momentum interprets our parameters as a ball moving on the landscape of our loss, with the detail that **when it moves, it gains velocity**. "A ball rolling down the surface of the loss should keep the velocity it gains", and this should help to navigate the loss landscape better and faster.
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - $\beta$ is a value strictly less than 1
    - $v^{(t + 1)}$ contains a running average of the previous update steps (if before we were moving at a certain velocity $x$, our next velocity will depend on it)
    - *advantages*: 
        - momentum reduces the effect of noise
        - faster convergence
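
A minimal sketch of the two momentum update rules above on a toy quadratic loss. The loss, the starting point and the values of `lr` and `beta` are assumptions chosen only for illustration; setting `beta=0` recovers plain gradient descent.

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return [theta[0], 10.0 * theta[1]]

def sgd_momentum(theta, lr=0.05, beta=0.9, steps=200):
    v = [0.0 for _ in theta]  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        # v^(t+1) = beta * v^(t) - lr * grad L(theta^(t))
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        # theta^(t+1) = theta^(t) + v^(t+1)
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

print(sgd_momentum([1.0, 1.0]))  # both coordinates end up close to the minimum at 0
```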

- Nesterov momentum
    - very similar to *momentum*, but the gradient is computed after having "partially" updated $\theta^{(t)}$ with $\beta v^{(t)}$:
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)} + \beta v^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - Nesterov momentum shows a faster convergence
- AdaGrad
    - Adaptive Gradient (AdaGrad) rescales each entry of the gradient with the inverse of the square root of the history of the squared gradients
    - $s^{(t + 1)} = s^{(t)} + \nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $s^{(t)}$ is monotonically increasing: it  may reduce the learning rate too early even when we are far from a good minimum
- RMSProp
    - in practice we do not use AdaGrad, but a modification of it called RMSProp. The idea is: since $s^{(t)}$ keeps growing, let's down-weight its history a bit. So, instead of just accumulating squared gradients into $s$, we keep an exponential moving average of the squared gradients. In practice we keep most of the past history (using a $\beta$ parameter like 0.9) and add *a small contribution from the current squared gradients, weighted by $(1 - \beta)$, in order to prevent the history from growing indefinitely.*
    - this turned out to work better because the optimizer stays active: it reacts to changes in the loss. However it is a bit nervous; we will see that ADAM handles this.
    - $s^{(t + 1)} = \beta s^{(t)} + (1 - \beta)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $\beta$ typically = $0.9$ or higher
    
- ADAM
    - ADAM follows the idea of RMSProp: we keep an exponential moving average of the squared gradients, in order to prevent the history $s$ from growing indefinitely, and we use this history to adapt the learning rate. In addition, we do the same for the gradients, so we also keep an exponential moving average of the gradients themselves. This leads ADAM onto a smoother and less nervous path, because we are smoothing the gradients themselves.
    - **bias correction**: since $g^{(0)} = s^{(0)} = 0$ and each update weights the history (which is initially zero) more than the current gradient, the first values of $g$ and $s$ are biased towards zero: for example the first $g$ is $0.9 \cdot 0 + (1 - 0.9) \cdot gradient$, so only 0.1 of the gradient contributes to the first step, which can result in a slow start of the optimizer. To counter this, a bias correction is applied to both the average of the gradients and the history of the squared gradients.
    - $g^{(t + 1)} = \beta_1 g^{(t)} + (1 - \beta_1)\nabla L (\theta ^{(t)})$
    - $s^{(t + 1)} = \beta_2 s^{(t)} + (1 - \beta_2)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$
    - $g^{debiased} = \frac{g^{(t+1)}}{1-\beta_1^{t+1}}$, $s^{debiased} = \frac{s^{(t+1)}}{1-\beta_2^{t+1}}$
    - $\theta^{(t+1)} = \theta^{(t)} - \frac{lr}{\sqrt{s^{debiased}}+ \epsilon} * g^{debiased}$
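
The four ADAM equations above can be sketched as a single update function on plain Python lists. The function name `adam_step` and the default hyper-parameters are assumptions for illustration, and `t` is taken to be the 0-based step index, matching the $t+1$ exponents in the formulas.

```python
import math

def adam_step(theta, grad, g, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; returns the new parameters and the two running averages."""
    # Exponential moving averages of the gradients and of the squared gradients.
    g = [beta1 * gi + (1 - beta1) * d for gi, d in zip(g, grad)]
    s = [beta2 * si + (1 - beta2) * d * d for si, d in zip(s, grad)]
    # Bias correction: undo the shrinkage caused by starting from zero.
    g_hat = [gi / (1 - beta1 ** (t + 1)) for gi in g]
    s_hat = [si / (1 - beta2 ** (t + 1)) for si in s]
    # Per-parameter step, rescaled by the square root of the corrected history.
    theta = [
        th - lr / (math.sqrt(sh) + eps) * gh
        for th, sh, gh in zip(theta, s_hat, g_hat)
    ]
    return theta, g, s

# First step from zero-initialized averages, with a hypothetical gradient.
theta, g, s = adam_step([1.0, 1.0], [4.0, -0.5], [0.0, 0.0], [0.0, 0.0], t=0)
print(theta)  # each parameter moves by about lr, whatever the gradient magnitude
print(g)      # the raw average holds only (1 - beta1) of the gradient
```

With the correction, the very first step has magnitude close to `lr` in every coordinate, because the debiased averages equal the gradient and its square.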

- **Activation function**
    - what does a ReLU add? A non-linearity: it improves the chances that the new representation will be linearly separable.

- **Convolution**
    - *limit of FC layers*: FC layers reason globally, not spatially. They are able to learn a specific weight for every input feature (for each pixel): they are too precise, and this smells of overfitting. We know that images are spatially smooth. We want something that reasons locally, as images do, not globally. With this prior we move to convolutions.
    - *relationship between spatial dim*:
        - $H_{out} = H_{in} - H_k + 1$
        - $W_{out} = W_{in} - W_k + 1$
    - *padding*: enlarge the input by padding its borders with some values like zeros ($P$ values on each side), e.g. so that the output keeps the input size
        - $H_{out} = H_{in} - H_k + 1 + 2P$
        - $W_{out} = W_{in} - W_k + 1 + 2P$
    - *stride*: with stride we downsample. It means that we do not apply the conv at all possible input positions of the kernel but just at a subset.
        - $H_{out} = \left\lfloor \frac{H_{in} - H_k + 2P}{S} \right\rfloor + 1$
        - $W_{out} = \left\lfloor \frac{W_{in} - W_k + 2P}{S} \right\rfloor + 1$
    - *formula of learnable parameters*: they are all the weights of all the kernels. For example, if we apply a conv with a $16 \times 8 \times 5 \times 5$ kernel tensor, where 16 is the number of kernels applied (output channels) and 8 is the number of input channels, then the formula for the learnable parameters is:
        - $16 \times (8 \times 5 \times 5 + 1)$ (+1 for the bias)
    - *formula of memory to store*: coincides with the size of the output activation tensor. After computing $H_{out}$ and $W_{out}$ with the previous general formula, we obtain a 3D tensor like $16 \times H_{out} \times W_{out}$, and the memory to store is the product of the three dimensions. With a batch of images the tensor is 4D and the formula must also be multiplied by the batch size.
    - formula of flops:
        - number of elements of the output feature map $\times$ 3D kernel size $\times 2$
        - the final $\times 2$ is because each output value requires $n$ multiplications and $n$ summations, with $n$ the 3D kernel size
- *pooling layers*
    - aggregates several values into one output value with *pre-specified and not learned* kernel
    - difference wrt conv: each input channel is aggregated independently
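
The convolution formulas above (output size, learnable parameters, memory, flops) can be put together in a small calculator. The helper names are made up, and the $32 \times 32$ input with padding 2 is an assumed example; the $16 \times 8 \times 5 \times 5$ kernel is the one used in the notes.

```python
def conv_output_size(size_in, kernel, padding=0, stride=1):
    # floor((size_in - kernel + 2P) / S) + 1
    return (size_in - kernel + 2 * padding) // stride + 1

def conv_stats(c_in, h_in, w_in, c_out, kernel, padding=0, stride=1):
    h_out = conv_output_size(h_in, kernel, padding, stride)
    w_out = conv_output_size(w_in, kernel, padding, stride)
    kernel_volume = c_in * kernel * kernel       # size of one 3D kernel
    params = c_out * (kernel_volume + 1)         # +1 for the bias of each kernel
    activations = c_out * h_out * w_out          # values to store per image
    flops = activations * kernel_volume * 2      # one multiply and one add per weight
    return {
        "out_shape": (c_out, h_out, w_out),
        "params": params,
        "activations": activations,
        "flops": flops,
    }

# 16 kernels of size 8 x 5 x 5 on an assumed 8 x 32 x 32 input with padding 2.
stats = conv_stats(c_in=8, h_in=32, w_in=32, c_out=16, kernel=5, padding=2)
print(stats)
```

For a batch, the activation count is simply multiplied by the batch size.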
    
- Batch Normalization
    - *internal covariate shift*:
        - Say that we have a 2-layer NN and we want to learn representations. When we train, imagine looking at the second layer: it learns from an input representation $r$ **which is not fixed/completely learnt**, since $r$ keeps changing during training. This means that the layers are training together, whereas what we would ideally like is to come up with the best representation for the first layer and then train the second one on this best representation, not both together.
        - **Example**: suppose a layer learns to capture edges; we can then give this representation to the next layer, which extracts corners from the edges. This is what we are hoping for, but in practice it does not happen. What happens is that the layers are learned together, so the second layer receives edges that are not stable because the first layer has not finished learning them. This is a serious problem: when we train something we want the training set to be fixed, we do not want its distribution to change. Here instead each layer sees an input distribution that keeps changing.
    - **batch norm idea**: 
        - the idea to counter the previous problem is: since the distribution of $r$ is changing, we normalize the output of the first layer so that the distribution of $r$ does not change too much. We are still learning, but we are constraining what the distribution of $r$ may look like. In particular we standardize $r$ over the mini-batch so that it has *zero mean* and *unit variance*.
    - test time:
        - at test time BN is different from training time. At test time, **we do not want to stochastically depend on the other items in the mini-batch**, namely we want our output to depend only on the input **deterministically**.
            - So **to counter this problem** at test time, we somewhat arbitrarily treat the *mean* and *variance* as constants, whose values are running averages of the values seen during training. So during training we keep a running average of the mean and variance, and when training ends we freeze the values obtained and use them to compute the test predictions.
    - pros:
        - allows the use of higher learning rates
        - careful initialization is less important
        - training is not deterministic, **acts as regularization**
    - cons:
        - not clear why it is so beneficial
        - more complex implementation since we need to distinguish between training and test time
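
A minimal sketch of the training/test distinction described above, for a single feature and without the learnable scale and shift that real batch norm layers also have. The class name and the running-average `momentum` convention are assumptions for illustration.

```python
import math

class BatchNorm1D:
    """Batch norm over one scalar feature (no learnable scale/shift in this sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, batch, training):
        if training:
            # Statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Running averages, frozen and reused at test time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Test time: constants, so the output depends only on the input itself.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = BatchNorm1D()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)
print(train_out)                               # zero mean, (almost) unit variance
print(bn.forward([2.0], training=False))       # same value for 2.0 ...
print(bn.forward([2.0, 100.0], training=False))  # ... whatever else is in the batch
```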

- **Architectures** 
- AlexNet
    - trends:
        - heavy stemming at the beginning of the net (from 224 to 55)
        - nearly all the parameters are in the FC layers
        - first two convs are responsible for the largest memory consumption because they produce activations with a large spatial resolution
        - conv layers require the largest number of flops
- ZFNet / Clarifai
    - ZFNet is basically an AlexNet. Its authors analyzed AlexNet and discovered that the heavy stemming at the beginning of the net results in **dead filters** and **aliasing artifacts** in the first layers.
    - **to counter this** they proposed to use a less aggressive stemming at the beginning of the net.
- VGG
    - the **idea** of VGG is the following: given AlexNet, *can we simplify the design space and come up with a simpler and regular design that we can repeat over and over?* So VGG commits to exploring the effectiveness of simple design choices by allowing only:
        - 3x3 convs
        - 2x2 max-pool
        - doubling the number of channels after each pool
    - **stages**:
        - so here we do not have only layers but we also have this kind of **regular composition of layers that we repeat throughout the net**. This is called a *stage* or *module*.
        - in VGG stages are either:
            - conv-conv-pool
            - conv-conv-conv-pool
            - conv-conv-conv-conv-pool
    - no stemming layer
    - one stage **has the same receptive field as a larger conv** but requires **fewer parameters and less computation, and introduces more non-linearities**. However, the **drawback** is that the memory for activations **doubles**.
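
A quick numeric check of the receptive field and parameter claims above, comparing two stacked 3x3 convs with a single 5x5 conv. The channel count `C = 64` (same for input and output) is an assumption for illustration.

```python
def conv_params(c_in, c_out, kernel):
    # Weights of all kernels plus one bias per kernel.
    return c_out * (c_in * kernel * kernel + 1)

def stacked_receptive_field(num_layers, kernel=3):
    # Each extra stride-1 conv enlarges the receptive field by (kernel - 1).
    return 1 + num_layers * (kernel - 1)

C = 64  # assumed number of channels, kept constant through the stage
two_3x3 = 2 * conv_params(C, C, 3)
one_5x5 = conv_params(C, C, 5)

print(stacked_receptive_field(2))  # 5: same receptive field as one 5x5 conv
print(two_3x3, one_5x5)            # the stack has fewer parameters
```

The stack stores two activation tensors instead of one, which is where the doubled activation memory comes from.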
    
- Inception v1
    - Inception v1 **wanted higher depth and width wrt VGG but without paying the price for it**.
    - stem layers - stack of inception modules - GlobalAVGPooling
    - naive inception module:
        - stack along the channel dimension the outputs of several convs with different kernel sizes (5x5 and 3x3) applied to the same input activation
        - **problems**: 
            - the number of channels grows very fast
            - expensive in terms of flops
    - 1x1 conv are used to overcome the above problems
        - 1x1 conv allows us to **change (usually shrink) the depth** of the activations while preserving the spatial size.
        - we can interpret them as **applying a linear FC layer at each spatial location**.
    - real inception module:
        - in the naive inception module the problem was that the convolutions were too heavy because of the high number of channels. So, **to overcome the problem**, before applying the convs we use a 1x1 conv to shrink the number of channels, making the 5x5 and 3x3 convs less expensive because they process fewer channels. In the end the output activation has fewer channels and less computation is required to produce it.
    - GlobalAVGPooling:
        - the difference wrt FC layers is that we do not flatten everything, but we aggregate every channel (every panel) spatially into one single value. Doing so, we have a much smaller FC at the end. Also, since we reduced the dimensionality a lot (and thus the parameters), **we can use just one FC layer** and not more, which also reduces the computational cost.
        - **intuition:** one way to think about this is that when we arrive at the last activation of our net, its panels contain very specific high-level features. By averaging them spatially, we make things more **robust**, because the spurious variations within a panel are not very important (this counters overfitting). Example: it is not important whether the flower petal is in the top right or bottom left corner; what matters is whether it is present or not.
    - Inception v3:
        - in Inception v3, they dropped the uniformity of Inception v1 (the same inception module everywhere) and **specialized the module for different resolutions** (convolution factorization). So for each spatial resolution we have a specific inception module. The **advantage** is that we have fewer parameters, and thus the net is computationally easier to train.
    
- Residual Networks
    - **idea**: growing depth improves performance, right? Actually no: *stacking more layers does not automatically improve performance*. It has been discovered that stacking a lot of "non-residual" blocks makes training very hard: the deeper net also has a higher *training* error, so this is **not an overfitting problem but a training (optimization) problem**. To solve this, we add a residual connection around a block of two convs. Doing so, when we increase the number of layers we get what we expect, namely lower training and test error: **deeper nets perform better than shallower ones**.
    - **residual block**:
        - stem layer - residual block stages - GlobalAvgPool
            - residual block stages:
                 - residual block stages are a stack of two residual blocks
                 - each residual block is a stack of two 3x3 convs + BN
                 - the first block of each stage **halves** the input resolution (with stride-2 convs) and **doubles** the number of channels
        
        <img src="resnet.png" width="40%" height="40%">

- skip connection dimension problem:
    - the residual block described so far cannot be used as is in the first block of a stage, because the number of channels and the spatial dimensions **do not match** along the skip connection.
    - **solution**: on the skip connection, use a 1x1 conv with stride S=2 that doubles the number of channels.
- bottleneck residual block:
    - bottleneck residual block is used when we want very deep ResNets. With the normal residual block, the concern is that by stacking many blocks the number of parameters and flops will increase, and this may make training difficult. So a different version of the residual block was created, called **bottleneck residual block**. This type of block lets us increase the number of layers (so we are increasing the depth of the net) **without paying the price of a higher number of parameters**. *In particular we will have roughly the same complexity in terms of flops but fewer parameters*. So now we can stack more of these blocks and the training will be easier.
        - how is made:
            - instead of having two 3x3 convs that process C channels, we start with 4C channels, then a 1x1 conv reduces them to C channels (**compression**), then we have an inner 3x3 conv, and then again a 1x1 conv to get back to 4C (**decompression**). Of course compression and decompression also perform representation learning, since there are non-linearities in between (they are not just a compression and a decompression), so we are improving the quality and the capacity of the model while also reconciling the number of dimensions in order to perform the summation.
    <img src="bottleneck.png" width="80%" height="80%">

- effects of residual learning:
    - skip connections make the loss landscape smoother
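
A rough weight count for the bottleneck residual block described above, compared with the basic block. Biases, BN parameters and the skip connection are ignored, and `c = 64` is an assumed value; the comparison is between a basic block working on C channels and a bottleneck block working on 4C channels.

```python
def conv_weights(c_in, c_out, kernel):
    # Number of weights of a conv layer, biases ignored.
    return c_in * c_out * kernel * kernel

def basic_block_weights(c):
    # Two 3x3 convs on C channels.
    return 2 * conv_weights(c, c, 3)

def bottleneck_block_weights(c):
    # 1x1 compression (4C -> C), inner 3x3 conv (C -> C), 1x1 decompression (C -> 4C).
    return conv_weights(4 * c, c, 1) + conv_weights(c, c, 3) + conv_weights(c, 4 * c, 1)

c = 64  # assumed value of C
print(basic_block_weights(c))       # 18 * C^2
print(bottleneck_block_weights(c))  # 17 * C^2, with three layers instead of two
```

At a fixed spatial resolution the flops scale with the same counts, so the bottleneck adds one layer of depth at slightly lower cost.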


- ResNeXt
    - idea:
        - the multi-branch architecture of Inception proved effective. However, its design is heuristic and handcrafted. ResNeXt decides to use a multi-branch architecture while **being regular** in the design, like VGG and ResNet.
        - so, what they did is to decompose the bottleneck residual block of ResNet into **G** parallel branches, where G is a new hyperparameter called the **cardinality** of the convolution (also, the name *ResNeXt* comes from this idea of having a *next dimension* in the conv space). Instead of starting from 4C channels and processing C channels in the inner conv, each branch starts from 4C and processes **d** channels, where *d* is another new hyperparameter, the same for each branch. So, in the end, it is like expanding to a multi-branch architecture, but with some kind of **rules** instead of handcrafted design decisions.
    - argue the growing complexity of 3x3 convs:
        - one could argue that having a lot of 3x3 convs in the multi-branches **could make the computation explode in complexity**. Actually yes. To overcome this, what is done is that:
            - *we choose G*
            - then we compute the total flops of the standard ResNet bottleneck block and the total flops of the ResNeXt block 
            - and then by equating these two things, **we can solve for d** since G and C are fixed, and we get a number that approximately preserves the number of flops.
    - why ResNeXt idea is a good one? 
        - ResNet captures a lot of interactions between *uncorrelated* features, like green blobs with a horizontal edge, so it is capturing noise: combinations of features that are not relevant, and thus the model does not generalize well. **By limiting the expressive power** using ResNeXt with *d*, we can probably force the net to capture only relevant interactions *between channels*.
        <img src="resnext.png" width="60%" height="60%">
        
    - grouped convs:
        <img src="gc.png" width="60%" height="60%">
        
       - take the input tensor and divide the channels into G groups, say we have a 6 channel tensor with G=2.
       - the first conv, which produces the first panel, is then performed only on the first 3 channels (6/2) using just 3 kernel channels, not all 6. The other 3 channels of the first kernel are therefore dropped: **this is why we have a reduction in the number of parameters and flops**.
       - the second conv, which produces the second panel, is performed on the other 3 channels of the input tensor in the same way. Grouped convs are more constrained.
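       - GC has **G times fewer params and G times fewer flops** than a standard conv with the same number of input and output channels, since each kernel sees only $C_{in}/G$ channels.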
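
A small count for the grouped convolution described above, using the 6-channel input with G=2 from the example. The helper name, the 4 output channels, the 3x3 kernel and the 8x8 output size are assumptions for illustration; biases are ignored.

```python
def grouped_conv_cost(c_in, c_out, kernel, h_out, w_out, groups=1):
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by the number of groups")
    # Each kernel only sees c_in / groups input channels.
    weights = c_out * (c_in // groups) * kernel * kernel
    # One multiply and one add per weight, at every output position.
    flops = 2 * weights * h_out * w_out
    return weights, flops

standard = grouped_conv_cost(6, 4, 3, 8, 8, groups=1)
grouped = grouped_conv_cost(6, 4, 3, 8, 8, groups=2)
print(standard)  # (weights, flops) of the standard conv
print(grouped)   # both are divided by G = 2
```

Under this counting, weights and flops both shrink by the same factor G.
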
- SENet
    - idea: maybe some channels are more important than others. So what we would like to have is a number saying, for each channel, **how important that channel is**.
    - to do that we have a **squeeze** part and an **excitation** one after the residual block:
        - *squeeze part*: we apply GlobalAVGPooling, so for each channel we now have a single number
        - *excitation part*: outputs a weight between 0 and 1 for each channel, used to reweigh the channels, saying "this is more important than this other one".
        - the branch above computes the weights; then we take the output of the residual block and reweigh each channel
        
        <img src="se.png" width="50%" height="50%">
        
- Depthwise Separable conv
    - Depthwise Separable convolution starts with a depthwise conv, which is basically a grouped convolution where each channel is processed alone with its own kernel. **So it is the extreme case of Grouped Conv with groups = C**. Semantically this means that each feature map created will capture just, say, vertical edges and that’s it. Another feature map will capture just horizontal edges, and so on. But to combine these features together we need to look across channels at each pixel. This is achieved by **the subsequent 1x1 conv**, so that the features are combined and we can detect corners, for example.
    - Depthwise Separable convolution **is way cheaper than normal convs in terms of flops**. It basically does the same job as a normal conv, but the conv does it in one shot while here we do it in two steps. Of course we pay a price in the “final result”: we are a bit less accurate. However, this could be a good thing because it could generalize better.
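
To get a feeling for how much cheaper the two-step version is, here is a weight count for a standard conv versus a depthwise conv followed by a 1x1 conv. The channel numbers are assumed values and biases are ignored.

```python
def standard_conv_weights(c_in, c_out, kernel):
    return c_in * c_out * kernel * kernel

def depthwise_separable_weights(c_in, c_out, kernel):
    depthwise = c_in * kernel * kernel  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes the channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3  # assumed sizes
full = standard_conv_weights(c_in, c_out, k)
separable = depthwise_separable_weights(c_in, c_out, k)
print(full, separable, full / separable)
```

At the same output resolution the flops scale with these counts, so the saving in computation is of the same order.
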
- Inverted residual block
    - why Bottleneck residual block are not ok
    - expansion - process - compression
    - *t* expansion rate
    - MobileNet-v2
        - stack of inverted residual block
- Wide ResNet
    - ResNet with channels multiplied by a factor *k*
- EfficientNet
    - "what is the optimal way to scale up a model?"
    - single dimension scaling
        - all three saturate at 80%
    - compound scaling: scaling W, D and R in an optimal way to improve the most we can
        - compound scaling $\phi$
        - formulation
    - NAS (Neural Architecture Search)

- model capacity
    - factors that influence it
- regularization
    - increase bias, paying in terms of training error
- parameter norm penalties
    - optimize another term of the loss which is conflicting that say:
        - "we want our params to have small values"
        - Lambda hyperparameter
    - weight decay
- early stopping
- label smoothing
    - problem of one hot encoding of labels
        - making model overly confident: overfitting
    - better alternative: smooth the labels
        - this accounts for mislabeled examples
    - how to apply labels smoothing
    - KLDiv loss
- dropout
    - in forward pass we use a subset of the network
        - hyperparameter *p* zeroing activation
    - why is this a good idea?
        - prevents feature detectors to co-adapt
            - face detector example
    - test time preds are stochastic
        - value at test time
        - expected value at training time with p=0.5
            - example
            - inverted drop out
- data augmentation
    - multi-scale training
    - multi-scale testing
        - domain shift problem
        - second alternative to multi-scale testing
- color augmentation (jittering)
- cutout
- Mixup
    - linear combination of two images according to a weight lambda
        - lambda picked from a Beta distribution
    - why is a good idea?
        - constrains what the network does between classes
    - testing
        - unmodified input
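
Minimal sketches of three of the regularizers listed above: label smoothing, inverted dropout and Mixup. All function names and default values (`eps`, `p`, `alpha`) are assumptions for illustration; label smoothing is shown in the variant that spreads `eps` uniformly over all classes, and `p` is taken to be the probability of zeroing an activation.

```python
import random

def smooth_labels(true_class, num_classes, eps=0.1):
    # (1 - eps) * one-hot + eps / K on every class.
    return [
        (1 - eps) * (1.0 if k == true_class else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]

def inverted_dropout(activations, p=0.5, training=True, rng=random):
    # Test time: the layer is the identity, so predictions are deterministic.
    if not training:
        return list(activations)
    keep = 1.0 - p
    # Training: zero with probability p and rescale the survivors by 1 / keep,
    # so the expected value matches the test-time value.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    # lambda is drawn from a Beta(alpha, alpha) distribution.
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

print(smooth_labels(true_class=1, num_classes=4))
print(inverted_dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=random.Random(0)))
print(mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], rng=random.Random(0)))
```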

- learning rate schedule
    - step decay 
    - cosine decay
    - linear
    - warm-up
        - to use when our training loss stays flat for a long time
    - one cycle
        - update the learning rate after each iteration, not epoch
        - vary momentum
- random hyper-parameter search
- recipe to train a NN
    - test time: ensemble
    - snapshot ensembling
        - uses cyclic cosine decay
        - majority voting at test time of M models
    - Polyak average
        - exponential moving average of parameters
    - Stochastic Weight Averaging
        - uses cyclic learning rates
        - real running average only when the learning rate is decreased

- Transfer Learning
    - First way
        - freeze backbone and train just the last layer
    - Second way
        - train everything
            - discrepancy between last layer and backbone
            - keep the backbone frozen for a few epochs until the last layer reaches a good region of the loss landscape
            - unfreeze the backbone and train with a lower lr, e.g. 1e-4 if it was 1e-3
            - Progressive LRs: first layers are ok so we freeze them
                - a growing lr when we go deep into the net to be more task specific

- Detecting multiple objects
    - problem 1: background
    - problem 2: too many possible windows
    - solution: region proposal
        - apply Selective Search to come up with regions that are likely to contain objects
- R-CNN
    - run Selective Search to come up with for example 2000 proposals
    - for each of these proposals:
        - warp it adding 16 pixels of context
        - pass through the Net
        - get class and BB correction
    - problem: really slow
- Fast R-CNN
    - still run Selective Search to come up with for example 2000 proposals
    - run full image up to a certain conv layer (like conv5) only once
    - project the proposal into the resulting activation
    - use a RoIPool layer to crop the projections and resize them to the right shape
    - advantage: the 2000 proposals pass only through a small, inexpensive part of the net
        - namely the FC layers at the end

QUESTIONS:
- dilated convs
    - why they are useful, their advantages
- what algo do we use to train NN
    - what are the hyperparameters that influence the training (learning rate, batch size)
    - effect of learning rate
    - effect of smaller and bigger batch size
- regularization
    - approaches we use to improve it
        - labels smoothing
            - why is useful, how it works
            - softmax and CE formula
- metric learning
    - why we need triplet loss, what it is and what it improves
    - contrastive loss vs triplet loss
    - triplet loss formula
    - do we take all possible triplets or just a subset?
        - semi hard negative mining
            - how we define an example being semi hard negative?