- **KNN**
    - training:
        - store all the training examples and labels (we basically do nothing, just load the dataset)
    - testing:
        - 1) compute the distance in feature space between test sample and **all the training examples**.
        - 2) assign to the test sample the label chosen by majority vote among the _k_ nearest training examples
    - *k* is a hyper-parameter: it is not something we learn from the data, we set it manually before training starts.
    - *model capacity*: by changing *k*, we change the model capacity, namely a measurement of **how complex/flexible** the functions that can be learned are.
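
The two test-time steps above can be sketched in a few lines of NumPy. This is only a minimal illustration: the Euclidean distance, the toy data and the function name `knn_predict` are assumptions, not something fixed by these notes.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Label one test sample by majority vote among its k nearest training examples."""
    # 1) distance in feature space between the test sample and ALL training examples
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # 2) indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # 3) majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# "training" is just storing the data (toy 2D dataset, two classes)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=1)
pred_b = knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3)
print(pred_a, pred_b)
```

With a small *k* the decision follows individual training points closely, while a larger *k* averages over more neighbours and gives smoother decision regions.
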

- **Linear classifiers**
    - $f(x; \theta) = Wx + b = scores$
    - *as template matching*: we can think of each row of the W matrix as a template over the input image for a particular class; matching the input against this template gives the score for that class.
    - *loss*: the loss is a proxy measure that is correlated with "how good our model is"
        - 0-1 loss: this loss counts the number of errors that our classifier makes. However, it is difficult to optimize because it is piecewise constant: if we move the decision boundary a bit, the loss will probably keep the exact same value (so the gradient is zero), and when it does change it jumps suddenly.
    - *softmax*: is an activation function that transforms the scores computed by the model into probabilities (generating a probability distribution over the classes).
        - $softmax(s)_j = \frac{\exp(s_j)}{\sum_{k} \exp(s_k)}$
    - *cross-entropy loss*: is $-\log$ of the softmax probability assigned to the true class $y$, i.e. $-\log(softmax(f(x;\theta))_y)$. <br> If the true class is "bird" and we have a prediction of 0.9 for "bird", then we will have $-\log(0.9) \approx 0.1$, a low value, which means that this sample will contribute just 0.1 to the total loss. If instead the true label is "car" and we have a prediction of 0.09 for "car", then we need to penalize this because we want a high prediction for the car; in fact $-\log(0.09) \approx 2.4$ gives us a high value.
    - *gradient descent*:
        - for each epoch:
            - forward pass: classify all training data and compute the loss
            - backward pass: compute the gradients wrt parameters
            - step: update the parameters by subtracting the gradient multiplied by a learning rate, $\theta \leftarrow \theta - lr \nabla L(\theta)$
        - *problems*: for just one update, we perform $D_n$ (number of training examples) forward and backward passes. To counter this problem we approximate the gradient further by computing it on a batch of images (SGD).
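
As a rough sketch of how softmax, cross-entropy and the gradient descent loop fit together, here is a NumPy version on toy data. The dataset, the learning rate and the number of epochs are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

def softmax(scores):
    # subtracting the max is only for numerical stability, the result is unchanged
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, y):
    # mean of -log(probability assigned to the true class)
    return -np.log(probs[np.arange(len(y)), y]).mean()

# the two cases discussed in the text (natural logarithm)
print(-np.log(0.9))   # about 0.105, low loss
print(-np.log(0.09))  # about 2.41, high loss

# full-batch gradient descent on a linear classifier f(x) = Wx + b
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # 100 toy samples with 4 features
y = (X[:, 0] > 0).astype(int)      # 2 classes, decided by the first feature
W = np.zeros((2, 4))
b = np.zeros(2)
lr = 0.5

losses = []
for epoch in range(50):
    probs = softmax(X @ W.T + b)           # forward pass on ALL training data
    losses.append(cross_entropy(probs, y))
    d_scores = probs.copy()                # backward pass: dL/dscores = probs - one_hot
    d_scores[np.arange(len(y)), y] -= 1
    d_scores /= len(y)
    W -= lr * (d_scores.T @ X)             # step: theta <- theta - lr * gradient
    b -= lr * d_scores.sum(axis=0)

print(losses[0], losses[-1])
```

Every iteration of the loop touches the whole dataset once, which is exactly the cost that mini-batches try to avoid.
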

- **Optimizers**
- SGD
    - SGD with minibatches: we compute the gradient and then update the parameters only after one batch is processed.
    - online learning: is SGD with minibatches with *batch_size* = 1
    - *batch size* becomes a hyper-parameter:
        - larger batches provide smoother estimation of the gradient and exploit hardware parallelism
        - smaller batch size may have a regularization effect
    - advantages: SGD with minibatches is faster because we do more updates even though the gradient is more approximate wrt standard GD.
    - problems of SGD:
        - a "spherical" loss landscape (similar curvature in every direction) enjoys faster convergence by means of *higher learning rates*, but real landscapes are rarely like that, so we usually have to use small learning rates
        - gradients estimated by mini-batches are noisy
        - *critical points* where the gradient is 0 other than global optima:
            - saddle points
            - local optima
- Momentum
    - momentum is like the interpretation of our parameters moving on the landscape of our loss with the detail that **when it moves, it gains velocity**. "A ball rolling down the surface of the loss, should keep the velocity it gains" and this should help to navigate the loss landscape better and faster.
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - $\beta$ is a value strictly less than 1
    - $v^{(t + 1)}$ contains a running average of the previous update steps (if before we were moving at a certain velocity, our next velocity will depend on it)
    - *advantages*: 
        - momentum reduces the effect of noise
        - faster convergence

- Nesterov momentum
    - very similar to *momentum*, but the gradient is computed after having "partially" updated $\theta^{(t)}$ with $\beta v^{(t)}$:
    - $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)} + \beta v^{(t)})$
    - $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
    - Nesterov momentum shows a faster convergence
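
The momentum and Nesterov updates differ only in where the gradient is evaluated, which a small sketch makes visible. The toy quadratic loss, the values of `lr` and `beta` and the helper `run` are assumptions made for illustration.

```python
import numpy as np

def grad(theta):
    # gradient of the toy loss L(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

def momentum_step(theta, v, lr=0.05, beta=0.9):
    v_new = beta * v - lr * grad(theta)
    return theta + v_new, v_new

def nesterov_step(theta, v, lr=0.05, beta=0.9):
    # same update, but the gradient is taken at the look-ahead point theta + beta * v
    v_new = beta * v - lr * grad(theta + beta * v)
    return theta + v_new, v_new

def run(step, n_steps=200):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_steps):
        theta, v = step(theta, v)
    return theta

print(run(momentum_step), run(nesterov_step))
```

Both runs should end up close to the minimum at the origin; on this toy problem the Nesterov variant oscillates less along the steep direction.
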
- AdaGrad
    - Adaptive Gradient proposes to rescale each entry of the gradient by the inverse of the square root of the history of the squared gradients
    - $s^{(t + 1)} = s^{(t)} + \nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $s^{(t)}$ is monotonically increasing: it may reduce the learning rate too early, even when we are far from a good minimum
- RMSProp
    - in practice we do not use AdaGrad, but a modification of it called RMSProp. The idea is: since $s^{(t)}$ grows a lot, let's down-weight its history a bit. So, instead of just accumulating squared gradients into $s$, we keep an exponential moving average of the squared gradients. In practice we take a lot of the past history (using a $\beta$ parameter like 0.9) and a *tiny contribution from the present values of the squared gradients (1 - $\beta$) in order to prevent the history from growing indefinitely.*
    - this turned out to work better because the optimizer stays active: it reacts to changes in the loss. However it is a bit nervous; we will see that ADAM handles this.
    - $s^{(t + 1)} = \beta s^{(t)} + (1 - \beta)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
    - $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
    - $\beta$ typically = $0.9$ or higher
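
To see why the growing history of AdaGrad is a problem, we can feed the same constant gradient to both update rules and compare the size of the last step. This is a toy sketch: the constant gradient, `lr = 0.1` and the 100 iterations are assumptions.

```python
import numpy as np

def adagrad_step(theta, s, g, lr=0.1, eps=1e-8):
    s = s + g * g                              # the history can only grow
    return theta - lr / (np.sqrt(s) + eps) * g, s

def rmsprop_step(theta, s, g, lr=0.1, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g          # exponential moving average
    return theta - lr / (np.sqrt(s) + eps) * g, s

g = np.array([1.0])                            # constant toy gradient
theta_a, s_a = np.array([0.0]), np.zeros(1)
theta_r, s_r = np.array([0.0]), np.zeros(1)
for t in range(100):
    prev_a, prev_r = theta_a, theta_r
    theta_a, s_a = adagrad_step(theta_a, s_a, g)
    theta_r, s_r = rmsprop_step(theta_r, s_r, g)

adagrad_last_step = abs(theta_a - prev_a)[0]   # lr / sqrt(100) = 0.01
rmsprop_last_step = abs(theta_r - prev_r)[0]   # stays close to lr = 0.1
print(adagrad_last_step, rmsprop_last_step)
```

After 100 identical gradients the AdaGrad step has shrunk by a factor of 10, while the RMSProp step is still about as large as the learning rate.
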
    
- ADAM
    - ADAM follows the idea of RMSProp, where we keep an exponential moving average of the squared gradients in order to prevent the history $s$ from growing indefinitely and then use this history to adapt the learning rate. Plus, we do the same for the gradients, so we keep an exponential moving average also for the gradients themselves. This leads ADAM along a smoother and less nervous path, because we are smoothing the gradients themselves.
    - **bias correction**: since $g^{(0)} = s^{(0)} = 0$ (recalling that ADAM weighs the history, which starts at zero, more than the current gradient while performing a step), the first values of $g$ and $s$ will be very small because they will be $0 * 0.9 + (1 - 0.9) * gradient$, so only 0.1 of the gradient contributes to the first step, which results in a slow start of the optimizer. To counter this, a bias correction is added to both the gradient average and the history of the squared gradients.
    - $g^{(t + 1)} = \beta_1 g^{(t)} + (1 - \beta_1)\nabla L (\theta ^{(t)})$
    - $s^{(t + 1)} = \beta_2 s^{(t)} + (1 - \beta_2)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$
    - $g^{debiased} = \frac{g^{(t+1)}}{1-\beta_1^{t+1}}$, $s^{debiased} = \frac{s^{(t+1)}}{1-\beta_2^{t+1}}$
    - $\theta^{(t+1)} = \theta^{(t)} - \frac{lr}{\sqrt{s^{debiased}}+ \epsilon} * g^{debiased}$
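
The four formulas above translate almost line by line into code. In this sketch the default values of `lr`, `beta1`, `beta2` and `eps` are the commonly used ones and are an assumption here, as is the function name `adam_step`.

```python
import numpy as np

def adam_step(theta, g_avg, s_avg, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the index of the current step, starting from 0."""
    g_avg = beta1 * g_avg + (1 - beta1) * grad           # moving average of the gradients
    s_avg = beta2 * s_avg + (1 - beta2) * grad * grad    # moving average of the squared gradients
    g_hat = g_avg / (1 - beta1 ** (t + 1))               # bias correction
    s_hat = s_avg / (1 - beta2 ** (t + 1))
    theta = theta - lr / (np.sqrt(s_hat) + eps) * g_hat
    return theta, g_avg, s_avg

theta, g_avg, s_avg = np.array([0.0]), np.zeros(1), np.zeros(1)
theta, g_avg, s_avg = adam_step(theta, g_avg, s_avg, grad=np.array([5.0]), t=0)
print(theta, g_avg, s_avg)
```

On the very first step the uncorrected average `g_avg` holds only 0.1 of the gradient, but after the correction the update has magnitude of about `lr`, so the optimizer does not start slowly.
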

- **Activation function**
    - what does a ReLU add? It improves the chances that the new representation will be linearly separable.

- **Convolution**
    - *limit of FC layers*: FC layers reason globally, not spatially. They are able to learn a specific weight for every input feature (for each pixel): they are too precise and this smells like overfitting. We know that images are spatially smooth, so we want something that reasons locally, as images do, not globally. With this prior we move to convolutions.
    - *relationship between spatial dim*:
        - $H_{out} = H_{in} - H_k + 1$
        - $W_{out} = W_{in} - W_k + 1$
    - *padding*: enlarge the input by padding its border with some values like zeros, e.g. so that the output keeps the input size
        - $H_{out} = H_{in} - H_k + 1 + 2P$
        - $W_{out} = W_{in} - W_k + 1 + 2P$
    - *stride*: with stride we downsample. It means that we do not apply conv on all possible input positions of the kernel but just on a subset.
        - $H_{out} = \lfloor \frac{H_{in} - H_k + 2P}{S} \rfloor + 1$
        - $W_{out} = \lfloor \frac{W_{in} - W_k + 2P}{S} \rfloor + 1$
    - *formula of learnable parameters*: they are all the weights of all the kernels. So in general, if we apply a conv with a kernel $16 \times 8 \times 5 \times 5$, with 16 being "how many kernels we apply", the overall formula for the learnable parameters is:
        - $16 \times (8 \times 5 \times 5 + 1)$ (+1 for the bias)
    - *formula of memory to store*: coincides with the size of the output activation tensor. So after having computed $H_{out}$ and $W_{out}$ with the previous general formula, we obtain a 3D tensor like $16 \times H_{out} \times W_{out}$ and the memory to store is the product of the three elements. If we have a batch of images, the tensor is 4D and the formula needs to be multiplied by the batch size too.
    - formula of flops:
        - number of elements in the output feature map $\times$ 3D kernel size $\times 2$
        - the final $\times 2$ is because, for each output element, we perform $n$ multiplications and $n$ summations, with $n$ the 3D kernel size
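
The formulas above can be collected in a small helper. The $8 \times 32 \times 32$ input used in the example is a made-up assumption; only the $16 \times 8 \times 5 \times 5$ kernel comes from the text.

```python
def conv_output_size(size_in, kernel, padding=0, stride=1):
    # floor((size_in - kernel + 2P) / S) + 1
    return (size_in - kernel + 2 * padding) // stride + 1

def conv_stats(c_in, h_in, w_in, c_out, kernel, padding=0, stride=1):
    h_out = conv_output_size(h_in, kernel, padding, stride)
    w_out = conv_output_size(w_in, kernel, padding, stride)
    params = c_out * (c_in * kernel * kernel + 1)        # +1 for the bias of each kernel
    activations = c_out * h_out * w_out                  # values to store for one image
    flops = activations * (c_in * kernel * kernel) * 2   # one multiplication and one sum per weight
    return {"out_shape": (c_out, h_out, w_out), "params": params,
            "activations": activations, "flops": flops}

stats = conv_stats(c_in=8, h_in=32, w_in=32, c_out=16, kernel=5)
print(stats)
```

For a batch of images, `activations` and `flops` are simply multiplied by the batch size, while `params` stays the same.
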
- *pooling layers*
    - aggregate several values into one output value with *pre-specified and not learned* kernel
    - difference wrt conv: each input channel is aggregated independently
    
- Batch Normalization
    - *internal covariate shift*:
        - Say that we have a 2-layer NN and we want to learn representations. When we train, imagine we are looking at the second layer: this layer is learning from an input representation $r$ **which is not fixed/completely learnt**, since $r$ keeps changing during training. The layers are trained together, but what we would really like is to first come up with the best representation in the first layer and then train the second layer on top of it.
        - **Example**: think of a layer that learns to capture edges; we can give this representation to the next layer, which extracts corners from the edges. This is what we are hoping for, but in practice it does not happen. What happens is that the layers are learned together, so the second layer receives edges that are not stable, because the first layer has not finished learning them. This is a real problem: when we train something we want the training set to be fixed, we do not want its distribution to change. Here, instead, each layer sees the distribution of its own inputs change during training.
    - **batch norm idea**: 
        - the idea to counter the previous problem is: since the distribution of $r$ is changing, we normalize the output of the first layer so that the distribution of $r$ does not change too much. We are still learning, but we are constraining what the distribution of $r$ may look like. In particular we standardize $r$ over the mini-batch so that it has *zero mean* and *unit variance*.
    - test time:
        - at test time BN is different from training time. At test time, **we do not want to stochastically depend on the other items in the mini-batch**, namely we want our output to depend only on the input **deterministically**.
            - So **to counter this problem** at test time, we somewhat arbitrarily fix the *mean* and *variance* to constants, whose values are a running average of the values seen during training. So during training we keep a running average of the mean and variance, and when training ends we freeze them and use them to compute the test predictions.
    - pros:
        - allows the use of higher learning rates
        - careful initialization is less important
        - training is not deterministic, **acts as regularization**
    - cons:
        - it is not clear why it is so beneficial
        - more complex implementation, since we need to distinguish between training and testing time
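
The different behaviour at training and test time can be sketched as follows for a batch of feature vectors. The learnable scale `gamma` and shift `beta` are part of standard batch norm but are not discussed in these notes; the class name, the running-average `momentum` and `eps` are assumptions.

```python
import numpy as np

class BatchNorm:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)     # learnable scale (kept fixed in this sketch)
        self.beta = np.zeros(num_features)     # learnable shift (kept fixed in this sketch)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # statistics of the current mini-batch: the output depends on the other items
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # frozen constants: the output depends only on the input
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
bn = BatchNorm(4)
out_train = bn(x, training=True)          # zero mean, unit variance over the batch
out_single = bn(x[:1], training=False)    # one sample alone
out_batch = bn(x, training=False)         # same sample inside a batch
print(out_train.mean(axis=0), out_train.std(axis=0))
```

At test time the first row of `out_batch` equals `out_single`, which is the deterministic behaviour we asked for.
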

- **Architectures** 
- AlexNet
    - trends:
        - heavy stemming at the beginning of the net (from 224 to 55)
        - nearly all the parameters are in the FC layers
        - the first two convs are responsible for the largest memory consumption because they produce activations with a large spatial resolution
        - conv layers require the largest number of flops
- ZFNet / Clarifai
    - ZFNet is basically an AlexNet. They worked on AlexNet and discovered that the heavy stemming at the beginning of the net resulted in **dead filters** and **aliasing artifacts** in the first layers.
    - **to counter this** they proposed to use a less aggressive stemming at the beginning of the net.
- VGG
    - the **idea** of VGG is the following: ok we have AlexNet, *can we simplify the design space and come up with a simpler and regular design that we can repeat over and over?* So VGG commits to exploring the effectiveness of simple design choices by allowing only:
        - 3x3 convs
        - 2x2 max-pool
        - doubling the number of channels after each pool
    - **stages**:
        - so here we do not have only layers but we also have this kind of **regular composition of layers that we repeat throughout the net**. And this is called *stage* or *module*.
        - in VGG stages are either:
            - conv-conv-pool
            - conv-conv-conv-pool
            - conv-conv-conv-conv-pool
    - no stemming layer
    - one stage **has the same receptive field as a larger conv** but requires **fewer parameters and less computation, and introduces more non-linearity** (example: a stack of two 3x3 convs is like applying a 5x5 conv). However the **drawback** is that the memory for activations **doubles**.
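
The comparison between two stacked 3x3 convs and a single 5x5 conv can be checked with a bit of arithmetic. The channel count `C = 64` is an arbitrary assumption and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    # weights of a k x k conv, biases ignored
    return c_out * c_in * k * k

def receptive_field(kernel_sizes):
    # receptive field of a stack of stride-1 convs
    return 1 + sum(k - 1 for k in kernel_sizes)

C = 64
one_5x5 = conv_params(C, C, 5)        # 25 * C^2
two_3x3 = 2 * conv_params(C, C, 3)    # 18 * C^2

print(receptive_field([5]), receptive_field([3, 3]))
print(one_5x5, two_3x3)
```

The two options see the same 5x5 input window, but the stack uses 18 C² weights instead of 25 C² and has one extra non-linearity in between; the price is that two activation tensors are produced instead of one.
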
    
- Inception v1
    - Inception v1 **wanted to grow depth and width wrt VGG but without paying the price for it**.
    - stem layers - stack of inception modules - GlobalAVGPooling
    - naive inception module:
        - the output of each module is obtained by stacking along the channel dimension the outputs of several parallel convs (e.g. 5x5 and 3x3)
        - **problems**: 
            - the number of channels grows very fast
            - expensive in terms of flops
    - 1x1 convs are used to overcome the above problems
        - 1x1 conv allows to **change (shrink) the depth** of the activations while preserving the spatial size.
        - we can interpret them as **applying a linear FC layer at each spatial location** (note that we are doing representation learning while we apply 1x1 conv! We introduce more non-linearity! We are not just shrinking / enlarging the activation along channel dimension).
    - real inception module:
        - in the naive inception module the problem was that the convolutions were too heavy because of the high number of channels. So, **to overcome the problem**, before applying the convs we use a 1x1 conv to shrink the number of channels, making the 5x5 and 3x3 convs less expensive because they process fewer channels. So at the end the output activation has fewer channels and less computation has been required to produce it.
    - GlobalAVGPooling:
        - the difference wrt FC layers is that we are not flattening everything, but we aggregate every channel (every panel) spatially into one single value. Doing so, we will have a much smaller FC at the end. Also, since we reduced the dimensionality and the information a lot (and thus the parameters), **we can use just one FC layer** and not more, and this also reduces the computational cost.
        - **intuition:** one way to think about this is that when we arrive at the last activation of our net, its panels contain very specific high-level features. So by averaging them spatially, we are actually making things more **robust**, because all the spurious variations we have in that panel are not very important (thus countering overfitting). Example: it is not important whether the flower petal is in the top right corner or in the bottom left, what is important is whether it is present or not.
    - Inception v3:
        - in Inception v3, they kind of dropped the uniformity of Inception v1, which has the same inception module everywhere, and **they kind of specialized the module for different resolutions** (convolution factorization). So for each specific spatial resolution we will have a specific inception module. The **advantage** is that we have fewer parameters and thus the net is computationally easier to train.
    
- Residual Networks
    - **idea**: growing depth improves performance, right? Actually no. In fact, *stacking more layers does not automatically improve performance*. It has been discovered that stacking a lot of "non-residual" blocks makes training very hard: we fall into an **underfitting problem, so a training problem**. To solve this, we add a residual connection around a block of two convs. Doing so, when we increase the number of layers we get what we expect, namely lower training and testing error: **deeper nets perform better than shallower ones**.
    - **ResNet architecture**:
        - stem layer - residual block stages - GlobalAvgPool
            - residual block stages:
                 - residual block stages are stacks of two residual blocks
                 - each residual block is a stack of two 3x3 convs + BN
                 - the first block of each stage **halves** the input resolution (with stride-2 convs) and **doubles** the number of channels
        
        <img src="resnet.png" width="40%" height="40%">

- skip connection dimension problem:
    - the residual block described so far cannot be used directly because the number of channels and the spatial dimension **do not match** along the skip connection.
    - **solution**: on the skip connection, use a 1x1 conv with stride S=2 that doubles the number of channels.
- **bottleneck residual block**:
    - the bottleneck residual block is used when we want very deep ResNets. With the normal residual block, the concern is that by stacking many blocks the number of parameters and flops increases, and this may make training difficult. So they created a different version of the residual block called **bottleneck residual block**. This type of block lets us increase the number of layers (so we are increasing the depth of the net) **without paying the price of a higher number of parameters**. *In particular we will have roughly the same complexity in terms of flops but fewer parameters*. So now we can stack more of these blocks and training will be easier.
        - how it is made:
            - instead of having two 3x3 convs that process C channels, we start with 4C channels, then we have a 1x1 conv to reduce to C channels (**compression**), then an inner 3x3 conv, and then again a 1x1 conv to get back to 4C (**decompression**). Of course the compression and decompression also perform representation learning, since there are non-linearities in between (they are not just a compression and a decompression), so we are improving the quality and the capacity of the model while also reconciling the number of channels in order to perform the summation.
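
A quick weight count for the two kinds of block, under the assumption that biases and BN parameters are ignored and that `c = 64`; the function names are made up for this sketch.

```python
def basic_block_params(c):
    # two 3x3 convs working on c channels
    return 2 * (c * c * 3 * 3)

def bottleneck_block_params(c):
    reduce = (4 * c) * c * 1 * 1     # 1x1 conv: 4c -> c (compression)
    inner = c * c * 3 * 3            # 3x3 conv: c -> c
    expand = c * (4 * c) * 1 * 1     # 1x1 conv: c -> 4c (decompression)
    return reduce + inner + expand

c = 64
print(basic_block_params(c))         # 18 * c^2
print(bottleneck_block_params(c))    # 17 * c^2
```

So a bottleneck block has three conv layers and works on 4C input channels, yet it needs slightly fewer weights than a basic block with two convs on C channels.
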
    <img src="bottleneck.png" width="80%" height="80%">

- effects of residual learning:
    - skip connections make the loss landscape smoother


- ResNeXt
    - idea:
        - the multi-branch architecture of Inception proved effective. However, its design is heuristic and handcrafted. ResNeXt decides to use the multi-branch architecture while **being regular** in the design, like VGG and ResNet.
        - so, what they did is to decompose the bottleneck residual block of ResNet into **G** parallel branches, where G is a new hyperparameter called **cardinality** of the convolution (also, the name *ResNeXt* comes from this idea of having a *next dimension* in the conv space). Instead of starting from 4C and processing C channels in the inner convs, they start with 4C and process **d** channels, a new hyperparameter, and this *d* is the same for each branch. So, at the end, it is like expanding to a multi-branch architecture but with some kind of **rules** instead of handcrafted design decisions.
    - argue the growing complexity of 3x3 convs:
        - one could argue that having a lot of 3x3 convs in the multi-branches **could make the computation explode in complexity**. Actually yes. To overcome this, what is done is that:
            - *we choose G*
            - then we compute the total flops of the standard ResNet bottleneck block and the total flops of the ResNeXt block 
            - and then by equating these two things, **we can solve for d** since G and C are fixed, and we get a number that approximately preserves the number of flops.
    - why ResNeXt idea is a good one? 
        - ResNet is capturing a lot of features that are *uncorrelated*, like green blobs with a horizontal edge. So it is capturing noise: combinations of features that are not relevant, and thus the model does not generalize well. **By limiting the expressive power** using ResNeXt with *d*, we can probably force the model to capture only relevant interactions *between channels*.
        <img src="resnext.png" width="60%" height="60%">
        
    - grouped convs:
        <img src="gc.png" width="60%" height="60%">
        
       - take the input tensor and divide the channels into G groups, say we have a 6 channel tensor with G=2.
       - the first conv, which produces the first panel, is then performed on only the first 3 channels (6/2) using just 3 kernel channels, not all the 6 channels. Therefore 3 channels of the first kernel are dropped, **this is why we have a reduction in the number of parameters and flops**.
       - the second conv that produces the second panel is done on the other 3 channels of the input tensor in the same way. Grouped convs are more constrained.
       - GC has **G times fewer params and G times fewer flops** than a standard conv with the same input and output channels.
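
The saving of a grouped conv can be checked by counting weights and operations. The layer sizes in the example are arbitrary assumptions, biases are ignored and the helper name is made up.

```python
def grouped_conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    assert c_in % groups == 0 and c_out % groups == 0
    # each kernel only sees c_in / groups input channels
    params = c_out * (c_in // groups) * k * k
    # one multiplication and one sum per weight, at every output position
    flops = 2 * h_out * w_out * params
    return params, flops

std_params, std_flops = grouped_conv_cost(64, 64, 3, 56, 56, groups=1)
gc_params, gc_flops = grouped_conv_cost(64, 64, 3, 56, 56, groups=4)
print(std_params / gc_params, std_flops / gc_flops)
```

With this way of counting, both the number of parameters and the number of flops shrink by the same factor G, because each output value is computed from G times fewer input channels.
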
- SENet
    - idea: maybe some channels are more important than others. So what we would like to do is having a number saying for each channel **how important is that channel**.
    - to do that we have a **squeeze** part and an **excitation** one after the residual block:
        - *squeeze part*: we apply GlobalAVGPooling, so for each channel we now have a single number.
        - *excitation part*: this part is composed of some FC layers with a _sigmoid_ at the end, which output weights between 0 and 1 that are used to reweight the channels, saying "this is more important than this other one".
        - so this branch computes one weight per channel; then we take the output of the residual block and reweight each channel with it.
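
A minimal NumPy sketch of the squeeze and excitation steps on a single activation tensor. The two FC layers with a ReLU in between and the reduction from `C` to `C // r` hidden units are assumptions about the shape of the excitation branch, and the random weights are placeholders for learned ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, W1, b1, W2, b2):
    """x: (C, H, W) output of the residual block. Returns the reweighted tensor and the weights."""
    z = x.mean(axis=(1, 2))              # squeeze: global average pooling, one number per channel
    h = np.maximum(0.0, W1 @ z + b1)     # excitation: first FC layer + ReLU
    w = sigmoid(W2 @ h + b2)             # second FC layer + sigmoid: one weight in (0, 1) per channel
    return x * w[:, None, None], w       # reweight each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 2
x = rng.normal(size=(C, H, W))
W1, b1 = rng.normal(size=(C // r, C)), np.zeros(C // r)
W2, b2 = rng.normal(size=(C, C // r)), np.zeros(C)

out, weights = se_block(x, W1, b1, W2, b2)
print(weights)
```

The output has the same shape as the input: the block only rescales whole channels, it does not mix spatial positions.
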
        
        <img src="se.png" width="50%" height="50%">
        
- Depthwise Separable conv
    - Depthwise Separable convolution is basically a grouped convolution where each channel is processed alone with one kernel. **So it is the extreme case of Grouped Conv with groups = C**. Semantically this means that each feature map created will capture just, say, vertical edges and that’s it. Another feature map will capture just horizontal edges, and so on. But to combine these features together we need to look at pixel level. This is achieved by **the subsequent 1x1 conv** such that the features are combined and we can detect corners for example. 
    - Depthwise Separable convolution **is way cheaper than normal convs in terms of flops**: it basically does the same job as a conv, but a conv does it in one shot while here we do it in two steps. Of course we pay a price in the “final result”: we are a bit less accurate. However this could be a good thing because it could generalize better.
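
To get a feeling for how much cheaper the two-step version is, we can count the weights of both options. The channel counts are arbitrary assumptions and biases are ignored; since both versions produce an output of the same spatial size, the flops scale in the same proportion as the weights.

```python
def standard_conv_params(c_in, c_out, k):
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k      # one k x k kernel per input channel (groups = c_in)
    pointwise = c_in * c_out      # 1x1 conv that mixes the channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, std / dws)
```

With these sizes the separable version needs roughly 8 times fewer weights than the standard conv.
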
- Inverted residual block
    - why Bottleneck residual block are not ok:
        - in the bottleneck residual block, the inner 3x3 conv is processing things in a **compressed domain**. But actually this is not a good idea, because we are doing the processing on the minimum amount of information, which may result in **information loss going through ReLUs**; we would like to do the processing when all the details are present, not in a *compressed* domain. Also, it is easier to learn and do something meaningful on the fully detailed representation rather than on the compressed one. So maybe a better idea is to use **inverted residual blocks**.
    - inverted residual blocks:
        - so the idea here is to _expand, process and then compress back_.
        - this block may cost a lot if we keep C the same as in the bottleneck block, so **to limit the computation** we set C to a very small number and the first 1x1 conv **expands** the channels to **t** *C*, with *t* the expansion rate.
        - to limit the computation further, the inner 3x3 conv is realized as a **depth-wise separable convolution**.
        - then the last 1x1 conv **compresses** the channels back from tC to C.
        - the last difference wrt the bottleneck residual block is the removal of the last ReLU after the summation: this is motivated by theoretical studies on how well ReLUs preserve information in a low-dimensional space.
        
        <img src="inv.png" width="50%" height="50%">
    
    - MobileNet-v2
        - stack of inverted residual blocks
        
        
        
- Wide ResNet
    - it is no more than a ResNet where the number of output channels is multiplied by a factor *k*
    - Wide ResNets have fewer layers but wider activations, and they are **faster** than ResNets with more or less the same performance.
- EfficientNet
    - **"what is the optimal way to scale up a model?"**, this question is answered by EfficientNet.
    - single dimension scaling
        - they tried to scale the models along one dimension at a time, first with higher W (width), then higher D (depth) and then higher R (input resolution), **but all three saturate at around 80% accuracy**.
    - **compound scaling** is this idea of scaling W, D and R in an optimal way to improve the most we can.
        - EfficientNet uses a compound scaling $\phi$ to scale all the dimensions in a principled way according to an optimization problem that uses some **compound scaling rules**.
        - $\phi$ represents **how much more we are willing to pay in terms of computation to obtain higher performance**.
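    - the baseline network that gets scaled is found by a NAS (Neural Architecture Search) using a RL (Reinforcement Learning) technique, while the scaling coefficients of the previous optimization problem are found with a small grid search.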

- **Regularization**

- model capacity:
    - the factors that influence the model capacity: W, D, R, admissible range for the values of the parameters, training time, ... in general all the hyperparameters.
- regularization:
    - is any modification we make to the learning algorithm that is intended to **reduce the generalization error without caring much about the training error**.
- parameter norm penalties
    - we are asking the model to minimize both the *task loss* and another, conflicting loss that says **"we want our params to have small values"**. Why this? Because this constrains the capacity of our model and so it should reduce the variance at the expense of increasing the bias.
        - a new hyperparameter is introduced which is **$\lambda$** that decides which loss contributes more between the task loss and the regularization loss.
- early stopping:
    - training time (i.e. number of updates) is a **hyperparameter** controlling the capacity of our model.
    - we train and periodically evaluate on the validation set. This hyperparameter is set to the point where we had the highest accuracy on the validation set.

- label smoothing
    - problem of one hot encoding of labels
        - we are making the model overly confident that "bird" is the right label, and the model specializes a lot in getting super high scores for that specific image. So we are pushing it to **overfit** the training set.
        - also, a perfect one-hot encoding **is a target that we will never be able to reach**.
    - a better alternative is to **smooth the labels**: this prevents the model from becoming overly confident.
    - moreover, **labels may be affected by some noise**, so in this case, with label smoothing we are saying that sometimes, with a small probability, the "bird" class is not the correct one. The one-hot encoding does not leave room for this doubt, being 1 on the correct class (with label smoothing we have something less instead, like 0.91).
    - how to apply label smoothing:
        - a new **hyperparameter $\epsilon$** is defined to be a small number (like 0.01 or 0.1), a kind of noise that is divided equally among the $K$ classes, so every class gets $\epsilon / K$; the true label gets $1 - \epsilon + \epsilon / K$ so that the target still sums up to 1 (e.g. $\epsilon = 0.1$ and $K = 10$ give 0.91 for the true class).
    - when we use label smoothing, the target is a full probability distribution with the same shape as the model output, so we need a loss that accepts such targets, like the **Kullback-Leibler divergence loss**. A CE implementation that only accepts a class index as target cannot be used with label smoothing.
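
A small sketch of how a smoothed target can be built and compared with the model output through the KL divergence. The helper names and the values of `eps` and of the number of classes are assumptions; the true class receives $1 - \epsilon + \epsilon / K$ and every other class $\epsilon / K$.

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    # every class gets eps / K, the true class gets 1 - eps + eps / K
    target = np.full((len(y), num_classes), eps / num_classes)
    target[np.arange(len(y)), y] += 1.0 - eps
    return target

def kl_div(target, log_probs):
    # sum over classes of target * (log target - log prediction), averaged over the batch
    return (target * (np.log(target) - log_probs)).sum(axis=1).mean()

y = np.array([3])                                 # true class index of one sample
target = smooth_labels(y, num_classes=10, eps=0.1)
print(target)                                     # 0.91 for class 3, 0.01 elsewhere

uniform_log_probs = np.log(np.full((1, 10), 0.1)) # a model that predicts 0.1 for every class
print(kl_div(target, uniform_log_probs))          # positive
print(kl_div(target, np.log(target)))             # 0: the prediction matches the smoothed target
```

Unlike with a one-hot target, the minimum of the loss is reached by a finite prediction (0.91 on the true class), so the model is not pushed towards infinitely large scores.
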
    
    
- dropout
    - idea:
        - the idea is that whenever we have a mini-batch and we forward it to the network, **we do not use the full network but we use a random subset of it**.
        - according to a random probability *p* (hyperparameter), some activations are set to zero.
        - pay attention that *we do not drop weights but we drop activations*.
        - so in the end we end up with a very sparse subset of the initial network.
        - be careful not to apply dropout to the last layer, where we compute the scores for the loss!
    - why is this a good idea since we are dropping away a lot of informations?
        - dropout **prevents feature detectors to co-adapt**
            - face detector example:
                 - say we want to detect a face: in general faces do not always show all the face attributes like nose, eyes, mouth, etc., something could be missing. For example we would like the network to work also on faces with an occluded eye. So it is a good idea **not to have** all the features co-adapt, but to randomly learn from subsets of them **to be able to generalize better on unseen data**, forcing the face detector to work also when **only some of the attributes are present**.
    - if we kept sampling a random dropout mask at test time, the predictions would be stochastic
        - the solution is to switch dropout off at test time and to compensate for it with **inverted dropout**: during training the surviving activations are divided by the keep probability, so that at test time the full network can be used as it is
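
Inverted dropout fits in a few lines. In this sketch `p` is taken to be the probability of dropping an activation, which is an assumption about the convention used in these notes.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout; p is the probability of dropping an activation."""
    if not training:
        return x                               # test time: deterministic, nothing to do
    mask = rng.random(x.shape) >= p            # keep each activation with probability 1 - p
    return x * mask / (1.0 - p)                # rescale so that the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(100_000)
train_out = dropout(x, p=0.5, training=True, rng=rng)
test_out = dropout(x, p=0.5, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

During training about half of the activations are zero and the others are doubled, so the average stays close to 1; at test time the input passes through unchanged.
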
        
- data augmentation
    - multi-scale training:
        - 1) we take an image, we pick a random size **S** within a range (say [256, 480])
        - 2) then, supposing we have an image 600x800 and S=300, we **isotropically** resize the image so that its short side equals S, retaining the **aspect ratio** of the image (so the image will become 300x400).
        - 3) then we sample random 224x224 **patches**
        - 4) then we apply **color augmentation**
        - **this creates new data for free with a lot of variability**. Of course it will be harder to fit these data, so we expect the bias to increase a lot, resulting in a *higher training error*.
        
    - multi-scale testing:
        - 1) we choose some scales **Q** = [224, 256, 384, 480, 640]
        - 2) then we **isotropically** resize the test image so that the short side = Q, for each Q, so we will have the same image at different sizes
        - 3) for each image we compute the prediction
        - 4) **average the prediction**
        - **domain shift problem**:
            - at training time we see 224x224 images, here instead we see larger images, so we see larger and smaller "birds". To counter this, another technique is used:
                - 1) isotropically resize to one scale, eg. 256
                - 2) predict on 224x224 **center crop**

- color augmentation (jittering):
    - at training: random color change of all pixels
    - at testing: unmodified input

- cutout:
    - the idea is to replace a region of the image with either random noise or a constant gray pattern (like a gray square).
    
- Mixup
    - we create a new image as a linear combination of two images, according to a weight lambda that we randomly sample from a Beta distribution.
    - why is this a good idea even though the resulting image is unrealistic?
        - it constrains how the network behaves between classes
    - testing
        - unmodified input
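
A small NumPy sketch of Mixup, under two assumptions that the notes do not state: the labels are one-hot vectors that get mixed with the same lambda as the images, and lambda is drawn from a symmetric Beta(alpha, alpha) with an illustrative `alpha=0.2`.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two images (and their one-hot labels) with a Beta-distributed weight."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing weight in [0, 1]
    x_mix = lam * x1 + (1.0 - lam) * x2     # linear combination of the images
    y_mix = lam * y1 + (1.0 - lam) * y2     # assumed: labels mixed the same way
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
img_a, img_b = rng.random((3, 32, 32)), rng.random((3, 32, 32))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam = mixup(img_a, label_a, img_b, label_b, rng=rng)
```

The mixed label stays a valid distribution, which is one way to read "constraining what the network does between classes".
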

- **Learning rate**: it is hard to find a single perfect learning rate, so the **solution** is to use a mixture of high and low learning rates over the course of training.
    - step decay: start with high learning rate (e.g. 0.1) and divide it by 10 when error plateaus. 
    - cosine decay: we start from a learning rate value and at every epoch we will have a new lower value of the lr that changes according to a cosine function.
    - linear: the learning rate is decayed according to a linear function.
    - warm-up: 
        - to use when the training loss stays flat and high for a long time at the beginning.
        - for very deep nets (like ResNet-101), a high lr can slow down convergence at the beginning of training (e.g. accuracy remains at chance level for several epochs). **This is usually a symptom of poor initialization**: a way to counteract this is to use **a lower lr for a few epochs until accuracy increases**.
    - one cycle:
        - update the learning rate **after each iteration** (namely after each mini-batch), not after each epoch
        - we start from a low lr _$lr_{start}$_ , then we grow it at each iteration according to a hyperparameter *p* (e.g. 0.3) until a pre-specified _$lr_{max}$_ , and then we decrease it by *p* at each iteration until a pre-specified _$lr_{min}$_ (note that usually _$lr_{start}$_ > _$lr_{min}$_ ).
        - **varying momentum**: whenever we want a careful lr, we also vary the momentum, as a way of saying **we want to go towards that direction**.
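
As an illustration of the warm-up and cosine decay bullets above, here is a sketch of both schedules as plain functions of the step index. The function names, the linear shape of the warm-up, and the example values (5 warm-up steps, 50 total steps) are assumptions chosen for the example.

```python
import math

def linear_warmup(step, warmup_steps, lr_target):
    """Grow the lr linearly from lr_target / warmup_steps up to lr_target."""
    return lr_target * min(1.0, (step + 1) / warmup_steps)

def cosine_decay(step, total_steps, lr_max, lr_min=0.0):
    """Decay the lr from lr_max (step 0) to lr_min (step total_steps)."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine

def warmup_then_cosine(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Low lr for the first steps, then cosine decay over the remaining ones."""
    if step < warmup_steps:
        return linear_warmup(step, warmup_steps, lr_max)
    return cosine_decay(step - warmup_steps, total_steps - warmup_steps, lr_max, lr_min)

schedule = [warmup_then_cosine(s, 5, 50, lr_max=0.1, lr_min=0.001) for s in range(51)]
```

The same functions can be evaluated per epoch or per mini-batch, depending on what `step` counts.
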
        
- random hyper-parameter search: random search leads to a more efficient exploration of the space, but **beware of the curse of dimensionality**.

- recipe to train a NN:
    - ensembles:
        - training: train multiple randomly initialized models on the same dataset
        - testing: run each model over the same test image and average the results
        - this usually increases the overall performance by 1-2%. Even if networks have similar error rates, they tend to make different mistakes.
        - **downsides**:
            - we have to train a lot of networks from scratch
            - we have to run a lot of networks at test-time
            
    - snapshot ensembling
        - uses **cyclic cosine decay**, namely a lr that starts from a high value, is decreased at every epoch following a cosine function until a certain min lr value, then is suddenly raised again and decreased again to the same min lr value.
        - ensembles work well. However, we have to train a lot of networks and this could take a lot of time. Snapshot ensembling counteracts this issue. By using a cyclic learning rate schedule, **we can simulate  $M$  trainings in the time span of one by taking snapshots of the parameters reached at the end of each cycle**. At the end we have  $M$  models. By ensembling those models at test time we usually get better performance than with a single cycle of training, without paying the price of training $M$ models.
        
    - Polyak average
        - if we do not want to pay the price of the ensemble at test time (because at runtime we would need to run  $M$  models), we would like to get the benefit of ensembles without running them. How? One thing we can do is the Polyak average: we keep updating the parameters in the same way as before, but we also keep another copy of them ($\theta^{(test)}$) in which we accumulate the parameters **as an exponential moving average** at each step.
        - $$\theta^{(test)} = (1-\rho)\theta^{(i+1)} + \rho\theta^{(test)}$$
        - $\rho$ is a hyperparameter that controls how much weight goes to the past average versus the newest parameters. Since $\rho$ is usually very high (close to 1), we give more weight to the past.
        - **Why is this a good idea?** Because the loss is very noisy: it jumps a lot between different mini-batches. The Polyak average smooths out the resulting erratic movements and on average gives us better performance.
        
    - Stochastic Weight Averaging
        - to avoid building and running a costly ensemble, **we can average snapshots in weight space**. Instead of keeping an exponential moving average as in Polyak, we keep **a true running average of the best models**. And instead of averaging the parameters of every step as in Polyak, we average only the good ones, namely those reached when the lr has already been decreased and the net could therefore be in a good minimum (so a cyclic lr schedule must be used to achieve this).
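
The two weight-averaging rules above differ only in how the running copy is updated. The sketch below shows both on plain NumPy arrays; the function names and the toy parameter vectors are assumptions for illustration.

```python
import numpy as np

def polyak_update(theta_test, theta_new, rho=0.999):
    """Exponential moving average: theta_test = (1 - rho) * theta_new + rho * theta_test."""
    return (1.0 - rho) * theta_new + rho * theta_test

def swa_update(theta_swa, theta_new, n_averaged):
    """Equal-weight running mean over the snapshots averaged so far."""
    return (theta_swa * n_averaged + theta_new) / (n_averaged + 1)

# Polyak: one EMA step starting from zeros, with a deliberately low rho.
theta_test = polyak_update(np.zeros(2), np.ones(2), rho=0.9)

# SWA: average three snapshots, e.g. taken at the end of each lr cycle.
snapshots = [np.array([1.0, 4.0]), np.array([2.0, 5.0]), np.array([3.0, 6.0])]
theta_swa = snapshots[0]
for n, snap in enumerate(snapshots[1:], start=1):
    theta_swa = swa_update(theta_swa, snap, n)
```

In both cases only one extra copy of the parameters is stored, so test time costs a single forward pass.
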

- **Transfer Learning**
    - First way
        - we take a pre-trained backbone with frozen weights, throw away the last FC layer and add our new randomly initialized FC layer, which is the only part we need to train. So in this scenario we train only our new last FC layer. 
    
    - Second way: **Fine-tuning**
        - the second way is to change the last layer as before, but in this case **we also train the feature extractor**, which starts from weights learned on ImageNet for instance, so that we start from good weights and train our network to become specific to our dataset.
            - discrepancy between last layer and backbone: the last layer is randomly initialized while the backbone already has good weights, and we do not want the untrained layer to mess up the already good weights. To counteract this problem, at the beginning we keep the backbone **frozen for a while**, a few epochs for instance. Doing so, we give the last FC layer time to reach a decent place in the loss landscape (**namely, we check that the loss goes down**). Once the loss has decreased, we can unfreeze the backbone too, **but we do not do it blindly**: if our backbone was pre-trained with a lr of 1e-3, to fine-tune it we now use a lr of 1e-4 (this is just a rule of thumb, we would need to tune this lr value).
            
            - Progressive LRs: the idea is that we should not have just one lr for the backbone and one for the new layer **but a set of different learning rates**. So for instance the new FC layer has a certain lr, the previous layer has a somewhat smaller lr, and so on down to the first layer, which may have 0 as lr, so we basically freeze it. **What is the intuition behind this?** Features that are closer to the input image should be more generic: for example the first layer extracts edges and blobs that are good for all the tasks we have in mind, so maybe we do not want to fine-tune it. On the other hand, the closer I am to the output, **the more problem specific I am**, so I want my network to change to fit my problem better.
            - note that by fine-tuning in this way we also save computation (even though we do not do this to save computation but for the aforementioned reasons).
            - Transfer learning should be used when we have small datasets and when we want to save computation time.

**Object Detection**
- Viola Jones
    - boosting: a way to train and build an ensemble of _M_ weak learners (WL) to obtain a strong learner (SL). How good should a WL be to make this work? It just needs to be better than a random guess (> 50% accuracy).
    - AdaBoost: AdaBoost uses decision stumps as weak learners, i.e. linear classifiers that **are restricted to only one feature** (remember though that the error rate of the chosen feature must be < 50%).
        - how do we build a SL with AdaBoost and these decision stumps?
            - we have a weight applied to each sample of the dataset.
            - then we search for the best decision stump (the one with the lowest weighted error rate, which must be < 50%).
            - **now there is the boosting magic**: we reweight the examples through a factor $\beta$ so that the wrongly classified ones weigh more and at the next iteration the algorithm is more prone to classify them correctly.
            - normalize the weights so that they sum to 1.
            - we iterate again.
        - so what does our SL actually learn?
           - it learns a decision boundary.
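
One round of the reweighting step can be sketched as follows. This assumes the variant used in the Viola-Jones formulation, where $\beta = \epsilon / (1 - \epsilon)$ (with $\epsilon$ the weighted error of the chosen stump) multiplies the weights of the *correctly* classified samples, so that after normalization the misclassified ones weigh more. The function name and the toy data are illustrative.

```python
import numpy as np

def adaboost_reweight(weights, misclassified):
    """One reweighting round; returns the normalized weights and beta."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # make sure weights sum to 1
    misclassified = np.asarray(misclassified, dtype=bool)
    eps = weights[misclassified].sum()             # weighted error of the stump
    if not 0.0 < eps < 0.5:
        raise ValueError("the weak learner must have a weighted error in (0, 0.5)")
    beta = eps / (1.0 - eps)                       # beta < 1 because eps < 0.5
    # Shrink the weights of correctly classified samples, keep the others.
    new_weights = np.where(misclassified, weights, weights * beta)
    return new_weights / new_weights.sum(), beta

w, beta = adaboost_reweight([0.25, 0.25, 0.25, 0.25], [True, False, False, False])
```

In this toy case the single misclassified sample ends up carrying half of the total weight, so the next stump is pushed to get it right.
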

   
   - so how did Viola and Jones use boosting and AdaBoost to perform face detection?
       - they first needed the decision stumps. They used **Haar-like features**, which are simple _rectangular filters_ composed of 2, 3 or 4 inner rectangles applied at a **fixed position** within a 24x24 patch. Each inner rectangle has a sign of +1 or -1, which means that the image pixels inside that region are added or subtracted.
       - given these Haar-like features, there are over 160k possible filters in a 24x24 patch that we can use to detect faces. **AdaBoost is used to select a small subset of the most effective ones**. The two most effective filters/WL chosen by AdaBoost are also interpretable!
      
       
   - integral images:
       - to speed up the computation of rectangular features (i.e. decision stumps) they used **integral images**.
       - **problem**: a practical problem of integral images is the _overflow_.
       - <img src="integral.png" width="20%" height="20%">
       - integral image can be computed with only one scan of the input image by using the **recursive rule**:
       - $$II(i, j) = II(i, j-1) + II(i-1, j) - II(i-1, j-1) + I(i, j)$$
       - with integral images we can compute the rectangular filters in **constant time**
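
A NumPy sketch of the integral image and of the constant-time rectangle sum. It uses two cumulative sums, which produce the same table as the recursive rule above, and an `int64` accumulator as one simple way to limit the overflow problem. The helper names and the inclusive-corner convention of `rect_sum` are assumptions.

```python
import numpy as np

def integral_image(img):
    """II(i, j) = sum of img[0..i, 0..j], accumulated in int64 to avoid overflow."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top..bottom, left..right] (inclusive) with at most four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(24, 24), dtype=np.uint8)
ii = integral_image(img)
patch_sum = rect_sum(ii, 3, 5, 10, 20)
```

A two-rectangle Haar-like feature is then just the difference of two `rect_sum` calls, independent of the rectangle size.
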
   
   - multi-scale sliding window detector
       - at test time, the SL is applied to all spatial locations in the image.
       - faces do not have a fixed size in the test images: hence, **multi-scale detection** is necessary.
       - to achieve good performance, about 200 features learned by AdaBoost are used to classify each patch (the sliding window locations). However, even if each feature can be computed very fast thanks to integral images, **there are still too many windows in an image to achieve real-time performance, so the cascade was invented**.
       
   - cascade:
       - **observation**: faces are far less frequent than background. Most of the time is "wasted" computing a lot of features for background patches.
       - **idea**: reject most of the "easy" background patches with a simple classifier that runs very fast and **whose threshold is tuned on a validation set so that it does not discard any face (i.e. it has 100% recall but very low precision)**. Only the surviving windows are passed on to the next, more expensive stages. With this cascade mechanism, a classifier that uses just the top two features can discard about 60% of non-face windows without discarding any face, with very limited computation.



- Box overlap
    - at test time we will have several BB for each face, but we want just one.
    - $IoU (BB_i, BB_j) = \frac{area \space of \space intersection}{area \space of \space union}$ ($ area \space of \space union$ is the orange part, $area \space of \space intersection$ is the purple part)
    - <img src="iou.png" width="20%" height="20%">
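    - Non Maxima Suppression (NMS): to come up with a single detection per face we perform NMS: we keep the highest-scoring BB, discard the other BBs whose IoU with it is above a threshold, and repeat with the remaining ones.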
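
IoU and a greedy NMS can be sketched in a few lines of plain Python. The `(x1, y1, x2, y2)` box format, the function names and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by decreasing score, keep those not overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first one too much and is suppressed, while the third box is a separate detection and survives.
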

        
- R-CNN
    - run Selective Search to come up with for example 2000 proposals.
    - for each of these proposals:
        - warp it to 224x224 as AlexNet expects, adding 16 pixels of context
        - pass through the Net
        - get class and BB correction
    - **problem**: really slow because we need to do 2000 forward passes per image.
    
- Fast R-CNN
    - we still run Selective Search to come up with for example 2000 proposals.
    - then we run the full image through the net up to a certain conv layer (like conv5) only once. We ignore the regions here.
    - then we project the proposals onto the resulting downsampled activation.
    - **use the RoIPool layer to crop the projections and resize them to the shape** that the following FC layers of AlexNet expect.
    - **advantage**: the 2000 proposals that previously were run through the expensive convs of AlexNet are now passed only through a small and inexpensive part of the network, namely the last FC layers.
    - <img src="fastrcnn.png" width="75%" height="75%">
    
    - **RoIPool layer**
        - it converts the Region of Interest projected onto the activation into the fixed spatial dimension required by the remaining layers of the network.
        - 1) snap the float projections to the grid (we round the projected coordinates from float numbers to integers).
        - 2) perform a max-pooling over each subregion.
            - if the size of the projection is not divisible by the output size, some pooling windows get a different shape (like 3x2) in order to make things match.
    - **Fast R-CNN loss**:
        - the loss for the BB correction is the smooth L1 loss (a Huber-style robust loss): it is quadratic for small errors and linear for large ones, so it is less sensitive to outliers than L2 and easier to optimize for large values.
    - **problem**: Selective search is the bottleneck, thus Faster R-CNN has been invented.
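
The robust box-regression loss mentioned under the Fast R-CNN loss is commonly written as a smooth L1 function. The sketch below uses a threshold `delta=1.0` as an assumed default and compares it with a plain squared error on a large residual.

```python
import numpy as np

def smooth_l1(pred, target, delta=1.0):
    """Quadratic for |error| < delta, linear beyond it."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * diff ** 2 / delta
    linear = diff - 0.5 * delta
    return np.where(diff < delta, quadratic, linear)

small = float(smooth_l1(0.5, 0.0))    # inside the quadratic zone
large = float(smooth_l1(3.0, 0.0))    # inside the linear zone
l2_large = 0.5 * 3.0 ** 2             # squared error on the same residual
```

The linear zone is what keeps a single badly localized box from dominating the gradient.
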
    
- Faster R-CNN
    - the idea is to add a network called **RPN (Region Proposal Network) whose task is to generate proposals**. It takes as input the very same feature map that AlexNet up to conv5 produces for an image, and its task is to say " _I think that in this region of this activation there is an object_ "; then Faster R-CNN follows the very same Fast R-CNN pipeline. So RPN produces a score and a BB proposal. The score is called **objectness score** and indicates the probability that the box contains an object rather than background. RPN is a small network composed of a 3x3 conv followed by two parallel 1x1 convs that produce the number of output channels we need.
    - at test time, the proposals are sorted by objectness score and only the top 300 are processed.
    - <img src="fasterrcnn.png" width="80%" height="80%">
    - **Feature Pyramid Network (FPN)**
        - we want to detect large and small objects.
        - the solution is to use FPN, which merges the high-level (semantically rich but coarse) information computed in the deep layers of the net with the more localized, high-resolution information available in the early layers. So for each stage we have an activation, and we compute **hallucinated activations** by summing up these activations, obtaining one hallucinated activation per stage on which we can perform multi-scale detection.
        - to make the dimensions match we use _1x1 conv and Nearest Neighbor upsampling_.
        - so out of our feature extractor with FPN we do not have just one activation but we have **a pyramid of activations at different spatial resolution with the same number of channels** where we can perform multi-scale detection.
        
- Class imbalance
    - OD is an **imbalanced problem**: the number of negative boxes far outweighs the number of objects in a typical image. This imbalance can overwhelm the training, leading to suboptimal models and **training inefficiency**, since most of the samples selected into the mini-batch will be *easy negatives* that provide no useful learning signal.
    - Two-stage detectors partially and naturally solve these problems because they empirically sample _hard negatives_ (the proposal stage already discards most of the easy background), **however one-stage detectors do not**.
        - **problem**: so, if negative samples are randomly selected, _the mini-batch will contain mostly easy negatives, if not only easy negatives_.
        - **hard negative mining:** SSD proposed hard negative mining: the negative anchors in an image are sorted by classification loss and then NMS is performed. Then, only the top ones are used in the mini-batch, up to an affordable ratio (like 3:1) wrt positive examples. _Yet, the net still sees only a subset of the negative examples during training._ **RetinaNet** comes here to solve this problem. Its idea is to work on the loss so that it can naturally handle all the negatives without being overwhelmed by them.
        - **Focal Loss**:
        - <img src="focalloss.png" width="50%" height="50%">
        - **what is the problem of BCE when we have class imbalance?** The problem is that, for example, when _pt = 0.7_ we are still paying a non-negligible loss even though we are doing a good job, since 0.7 is a good value. Considering that we have a huge number of easy negatives, each of them contributes a small loss even though it is classified very well, and these small loss values, since there are so many of them, can **overwhelm the rare positive class**. Moreover, the loss values just above and just below the threshold ( _pt = 0.5_ ) are comparable in order of magnitude, and we do not want that. We want the ratio between the loss of the hard examples (score < 0.5) and the loss of the easy examples (score > 0.5) to be orders of magnitude higher than with BCE: with Focal Loss we want a ratio in the hundreds, not just a factor of about two as with BCE.
        - **how to achieve that?** It is done by adding a weight that can be seen in the below formula:
        - $$BFL(p_t) = -(1-p_t)^{\gamma} \ln p_t$$
        - <img src="pt.png" width="70%" height="70%">
        - where $\gamma$ is a tunable **focusing** hyperparameter (usually 2)
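
To see the effect of the focusing exponent numerically, the sketch below evaluates the formula above for an easy example ($p_t = 0.9$) and a hard one ($p_t = 0.1$). The function name and the argument name `gamma` are illustrative; with `gamma=0` the expression reduces to plain cross-entropy.

```python
import numpy as np

def binary_focal_loss(p_t, gamma=2.0):
    """-(1 - p_t)^gamma * ln(p_t); gamma = 0 gives plain cross-entropy."""
    p_t = np.asarray(p_t, dtype=float)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy, hard = 0.9, 0.1
ce_ratio = binary_focal_loss(hard, gamma=0.0) / binary_focal_loss(easy, gamma=0.0)
fl_ratio = binary_focal_loss(hard, gamma=2.0) / binary_focal_loss(easy, gamma=2.0)
```

With the focusing exponent set to 2, the hard-to-easy loss ratio is multiplied by $(0.9 / 0.1)^2 = 81$ compared with cross-entropy, which is what lets the many easy negatives fade into the background.
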
        
- CenterNet
    - limits of anchors:
        - they are very inefficient because we are enumerating all possible boxes.
        - we obtain a lot of duplicate entries for each object.
        - assignment of anchors to ground-truth is hand-crafted.
    - **CenterNet overcomes these issues** proposing to represent _objects as points_ and regress their size.
    - CenterNet is a net whose aim is to produce a _heatmap_ (an image with scores per each pixel which are in the range [0, 1]) that says whether a pixel is the center of the object or not. 
         - _a technical detail_ is that our net produces an output that is smaller than the original image, in particular reduced by a stride R. Usually we do not want a very large stride, since we want our heatmap to have high resolution (the higher the resolution, the more precisely we detect the center of the object). 
         - after we have found the center of our object, we apply a **correction for the discretization error introduced by the stride** to recover the exact center.
         - finally, once we have the center of the object, another tensor predicted by the net gives us the _H_ and _W_ of the object at that location, and from center and size we obtain the BB.
         - **test time**: at test time we simply take the heatmap and for each class (each channel corresponds to a class) we find the local maxima by comparing each pixel with its 3x3 neighborhood, and then we draw the bounding boxes (we already have the BB info!).
         - with this approach **it is way easier to create the ground truth!** We just need a heatmap where the center of the object is 1 and the rest is 0. Actually, since nets like smoothness, we do not put just zeros and ones but place a Gaussian on the center. In this way, if the net does not predict exactly the center, we do not penalize it too much: if the predicted peak is right next to the ground-truth one, we are still doing a good job.
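
Both the Gaussian ground truth and the 3x3 peak extraction fit in a short NumPy sketch for a single class channel. The fixed `sigma`, the 0.5 score threshold and the use of an element-wise maximum to combine two objects are assumptions made for the example.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma=1.0):
    """Ground-truth heatmap with a Gaussian bump (peak value 1) at `center` = (row, col)."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def local_maxima(heat, threshold=0.5):
    """Pixels that equal the maximum of their 3x3 neighborhood and exceed the threshold."""
    h, w = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    neighborhood_max = np.max(shifted, axis=0)
    peaks = (heat == neighborhood_max) & (heat > threshold)
    return [tuple(int(v) for v in p) for p in np.argwhere(peaks)]

# Two objects of the same class: combine their Gaussians with an element-wise max.
heat = np.maximum(gaussian_heatmap(16, 20, (5, 7)), gaussian_heatmap(16, 20, (12, 15)))
centers = local_maxima(heat)
```

Each detected center would then be paired with the size predicted at that location to draw the box.
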
         
- EfficientDet
    - EfficientDet explores how we can scale up the different dimensions of the problem coherently. In particular how to scale up:
        - _backbone: deeper or wider_
        - _image resolution_
        - _FPN architecture_
        - _classification and regression_ heads are not usually scaled up when changing the backbone
    - EfficientDet changed the FPN architecture: **path aggregation networks (PANet)** had explored the advantages of _adding a bottom-up path which merges hallucinated features at higher resolution into coarser ones_.
    - EfficientDet introduces the **weighted bi-directional feature pyramid network (BiFPN)** which:
        - repeatedly applies a modified bidirectional PANet multi-path feature fusion.
        - adds learnable weights to learn the importance of input features at different resolutions.
    - in the end EfficientDet is composed of an EfficientNet backbone, followed by a BiFPN with as many layers as we need to scale up the model, and some usual convs that perform class prediction and BB correction.

**Semantic Segmentation**
- Random forests
    - are an ensemble of simple classifiers (that usually have low bias and high variance), in this case decision/regression trees. Each tree is trained **on a random view of the dataset** and then the output is the average of their predictions.
    - the trees should be very different from each other, because different trees make different errors and by averaging their predictions we obtain a lower error. How do we make trees different from each other? We should train them on different datasets, but usually we do not have many datasets, so **we simulate them!**
    - **how do we simulate different datasets starting from just one? Bagging!**
        - starting from the dataset, we randomly sample **with replacement** a new version of it and use this new dataset to build a tree, then we proceed in the same way for the second tree and so on.
    - at each node, the trees define the split on a random subset of features; this reduces the correlation between trees.
    - **when to use Random forests?** When we have classifiers with low bias and we aim to reduce the overall variance.
    - **RF advantages:**
        - fast to train and test
        - model interpretability: feature importance, feature contribution
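
The two sources of randomness described above (bagging and per-node feature subsets) can be sketched with NumPy's random generator. The helper names, the toy data and the choice of 3 candidate features per split are assumptions; tree fitting itself is left out.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement: one simulated dataset for one tree."""
    n = len(X)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx], idx

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.integers(0, 2, size=50)
views = [bootstrap_sample(X, y, rng) for _ in range(3)]   # one view per tree
split_features = random_feature_subset(X.shape[1], k=3, rng=rng)
```

Because sampling is done with replacement, each view repeats some rows and misses others, which is what makes the trees differ.
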
        
- Fully Convolutional Network (FCN)
    - the idea of FCN is to start from an image and pass it through a backbone, which results in a tensor. This tensor however must have a number of channels equal to the number of labels we have (this is a requirement for Semantic Segmentation tasks). So for instance, if we use Pascal we have 21 labels, and if the tensor produced by the backbone has 512 channels, we need to go from 512 channels to 21 channels to create a mask. How? **1x1 conv!** (this conv is usually referred to as the _scoring_ layer in the next images). Then we have for instance a tensor of 21x6x8, so a very coarse spatial resolution, while we would like a mask of e.g. 21x256x320. How do we go from that coarse spatial resolution to the bigger one? **Upsampling!** **So, in FCN we score, then we upsample and sum**.
    - FCN-X
        - FCN-32 uses a backbone with a total stride of 32.
        - the problem is that at the end we need to do a x32 **bilinear upsampling**.
        - **problem:** this x32 is too aggressive, it results in a very coarse mask.
        - **solution:** upsample multiple activations at different resolutions. So we take the tensor out of the _L-1_ stage, we score it (1x1 conv to match the number of channels) and then we sum it with the scored and upsampled activation of the _L_ stage (the activations _L-1_ and _L_ are used in FCN-16). This results in less aggressive bilinear upsampling steps, e.g. only x2 and x8 as in FCN-8, instead of a single x32 as in FCN-32.
    
        <img src="fcn.png" width="80%" height="80%">
        
- **mean IoU** is the main measure to rank semantic segmentation algorithms:
    - the intersection is defined as the number of pixels where the predicted and ground-truth labels are the same; the IoU is computed per class and then averaged over the classes.
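
A sketch of mean IoU on two small label maps. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this example; benchmarks differ on how they treat absent classes.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average over classes of |pred == c AND target == c| / |pred == c OR target == c|."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue                      # class absent everywhere: skipped (assumption)
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so a single wrong pixel already costs a lot on such a tiny mask.
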
    
- Transposed convolutions:
    - transposed convolutions are used to perform upsampling
    - <img src="tc.png" width="80%" height="80%">
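
To make the upsampling behaviour concrete, here is a naive single-channel transposed convolution in NumPy, without padding. Each input value "stamps" a scaled copy of the kernel onto the output and overlapping stamps are summed; the function name and the loop-based implementation are illustrative, not how a framework would do it.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution; output side = (in - 1) * stride + k."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # Place a copy of the kernel, scaled by x[i, j], at the strided location.
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
up = transposed_conv2d(x, np.ones((2, 2)), stride=2)           # 2x2 -> 4x4
up_overlap = transposed_conv2d(x, np.ones((3, 3)), stride=2)   # 2x2 -> 5x5
```

With a 2x2 kernel of ones and stride 2 the result is a block-wise copy of the input; in a network the kernel weights are learned instead.
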
    
- U-net:
    - U-net is basically an FCN with some variations; it has an **encoder-decoder structure**. In FCN we score, then we upsample and sum. With U-net instead we have a skip connection for every stage _L_. In particular we take the tensor of the last stage and upsample it with **transposed convolutions**, then we apply some convolutions, upsample again, and continue with these operations until the segmentation map is big enough. Only at the end is the **scoring layer** applied to match the channels.
    - **the key innovation of U-net** is to concatenate (instead of summing as in FCN) the tensors obtained through the upsampling operation with the tensor of the skip connection of the stage _L_ at the same spatial resolution (cropping the latter a bit because it is slightly larger).
    - **what is the effect of using a decoder? Precision at boundaries!** The layers of the backbone give us very detailed information that contributes to high precision at the boundaries.
    <img src="unet.png" width="80%" height="80%">
    
- Dilated/Atrous convolutions
    - in SS we have to assign a class to each pixel, but to do so we also need to understand _where_ the pixel is, so we need more **context**.
    - dilated convolutions give us at **constant cost**:
        - rich features
        - large spatial resolution (size of the features maps)
        - large receptive field
    - actually we could get the advantages that dilated convs give us simply by **not downsampling!** If we removed max pooling and striding, we would get a fully conv architecture that keeps the input size at every stage. **Why can we not do that?** 
        - **Problem 1**: because it would run forever, since processing large activation maps is very expensive; **this is why we use max pooling and strided convs**.
        - **Problem 2**: if we do not downsample we do not get a large receptive field, because it grows only **linearly** with depth, while we want an exponential growth of the receptive field.
    - **dilated convolutions** in practice modify the normal convolution by adding another parameter, the **dilation rate r**. Its effect is in principle to _pad the kernel_, i.e. it is equivalent to inserting holes between the kernel weights. With $r = 1$ we have a dense kernel, so it is a normal dense convolution.
    - **if we stack dilated convolutions with an exponentially increasing dilation rate, the receptive field will grow exponentially with the number of layers while the number of parameters will grow linearly and resolution is not reduced**.
    - **since DC come with so many advantages, why not use them everywhere?** 
        - the price we pay is that the net becomes slower and slower the deeper we go, because the activations are never downsampled and keep a large spatial size.
        - to understand where we need them, we have to think about whether we actually need high resolution in our problem.
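
The linear versus exponential growth of the receptive field can be checked with a one-line recurrence. For a stack of stride-1 convolutions with kernel size $k$, each layer with dilation $d$ adds $(k-1) \cdot d$ to the receptive field; the function name is an assumption.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convs, one dilation rate per layer."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

dense = receptive_field(3, [1, 1, 1, 1])      # four ordinary 3x3 convs
dilated = receptive_field(3, [1, 2, 4, 8])    # dilation doubled at each layer
```

Both stacks have the same number of parameters (four 3x3 kernels), but the dilated one covers a 31x31 window instead of 9x9.
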
        
- DeepLab
    - DeepLab is basically a FCN with dilated backbone + **SPP**.
    - SPP (Spatial Pyramid Pooling layer) **solves the problem of effectively recognizing objects at different scales with fixed-resolution networks**.
    
- PSPNet
    - is basically a net with dilated backbone with SPP but modified in a way to be convolutional instead of having FC layers.
    
    

- **Instance Segmentation**: it segments different instances of the same object class separately (overlapping cars example)

- Mask R-CNN
    - it is basically a Faster R-CNN with two modifications:
        - adding another head on the second stage, responsible for segmenting the object inside each RoI. This is realized through a **very small Fully Convolutional Network responsible for the mask**.
        - **switch from the RoIPool to the RoIAlign layer**.
    - **RoI Align**
        - recall what RoI pool does: it takes the RoI proposed by RPN or Selective Search on the original image and projects it onto the activation. Then the purple square (which has continuous values for its corners, so it is not pixel aligned) is moved to the grid by quantizing the floating point numbers. And then we have this rather odd way of creating windows of varying sizes within the RoI itself to come up with a constant-size output, e.g. a 3x3 grid. **These quantizations are of course a problem, so RoIAlign has been introduced to remove this source of error**.
        <img src="rol.png" width="60%" height="60%">
        
       - **How does it work?**
           - 1) first of all we **do not snap**: once we have projected the input RoI onto the activation, we keep it as it is, with floating point corners. Then we evenly divide the projection into as many windows as we need, e.g. 3x3.
           - 2) then we sample e.g. 4 feature values at a regular grid of points within each RoI cell with **bilinear interpolation**.
           - 3) then over those sampled values we **compute the maximum/average**, and this maximum/average is the value that I put in my output activation for that spatial position.
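
Steps 2 and 3 can be sketched for one output cell on a single-channel feature map. The 2x2 sampling grid placed at 1/4 and 3/4 of the cell, and the function names, are assumptions for illustration; coordinates are assumed to lie inside the feature map.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Value of `feat` at the continuous location (y, x) via bilinear interpolation."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_align_cell(feat, y_start, x_start, y_end, x_end, reduce=np.max):
    """Sample a regular 2x2 grid inside one RoI cell, then reduce (max or mean)."""
    ys = [y_start + (y_end - y_start) * f for f in (0.25, 0.75)]
    xs = [x_start + (x_end - x_start) * f for f in (0.25, 0.75)]
    samples = [bilinear_sample(feat, y, x) for y in ys for x in xs]
    return float(reduce(samples))

feat = np.arange(16, dtype=float).reshape(4, 4)
cell_max = roi_align_cell(feat, 0.3, 0.3, 2.3, 2.3, reduce=np.max)
cell_avg = roi_align_cell(feat, 0.3, 0.3, 2.3, 2.3, reduce=np.mean)
```

Note that the cell corners (0.3 and 2.3) are never rounded, which is exactly the difference from RoI pooling.
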

**Metric Learning**

- **Face recognition (FR)**: given a query image, tell **who that person is**, or say that the person is unknown
- Embedding classification problem:
    - the embeddings learnt by a classifier are not suited for FR because they have been trained with the purpose of "learning classification". In fact the learnt representation will look linearly separable in a "flower" way, which is not what we want: we want **clusters**.

- **Face verification (FV)**: given two images, confirm that they depict the same person. To do that we need to learn a similarity function, this is the so called **Metric Learning (ML)**. The idea of ML is to use a specific loss at training time to guide the feature extractor to favor a **clustered structure of embedding** where:
    - 1) the distance between faces of the same person is **minimized**
    - 2) the distance between faces of different persons is **maximized**
    
- **Siamese network training**
    - the purpose of the Siamese network is to compute embeddings of two images and then train on them. "Train on them" means improving the embeddings so that they fulfill our requirement: "if two images are of the same person, make the embeddings similar, otherwise make them different".
    - **there is a technical problem**: we are used to passing one image to the net and then backpropagating. Here instead we take one image, pass it through the net and get an embedding, then we take another image, pass it through the net and get another embedding. If the two images are of the same person we want the embeddings to be similar, otherwise different. So only at this point can we compute the value of the loss, and **only at this point can we backpropagate!** _So how do we backpropagate when we are using two examples?_ **The solution is the so-called Siamese network training**. Instead of one net, we conceptually have two nets with shared parameters. In practice we do not have two nets but a single one, applied to both images with the same parameters.
- **FR as k-NN classifier** 
    - we can take all the faces in our dataset and project them into the embedding space; if our model has learnt to discriminate our faces, it will project different faces into separate parts of the embedding space. So now, when we receive a new test image, we just compute its embedding and perform a Nearest Neighbor search in the embedding space. Differently from before, since the embedding space is organized in a way that suits our problem, there is a higher chance that the embedding of an Einstein face has only other Einsteins as Nearest Neighbors.
    - adding a new identity is easy: we just add the embeddings of the **new user** to the dataset. We can also tell whether a test face belongs to a new person, because in that case its embedding is very far away from all the other embeddings.
    
- **DeepFace**
    - as a first step they **frontalized** the faces
    - they learnt embeddings by solving a **classification** problem, using therefore a classification loss (naïve approach)
    - at the end, the embedding representation is **normalized** by its L2 norm and by its maximum for that dimension.
    - DeepFace has introduced the so-called **Locally connected layer**
        - in this case we have aligned faces, thanks to frontalization. So it does not really make sense to have the same filters sliding all over the image, because some filters probably make sense only on some regions of the face. This is a **prior we are injecting** that we can exploit. **So this is where the Locally connected layer comes to our aid**.
        - the Locally connected layer drops three properties of convolution:
            - it does not share parameters across different locations
            - it does not adapt to varying input sizes
            - it is not equivariant to translations of the input
        - the **number of parameters**, however, is larger than in a normal convolution
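
To get a feeling for how much larger the parameter count is, here is a small back-of-the-envelope sketch. The layer sizes are made up for illustration, and the locally connected layer is assumed to keep one bias per output location and channel.

```python
def conv_params(k, c_in, c_out):
    """Standard convolution: one k x k filter bank shared by every location."""
    return k * k * c_in * c_out + c_out

def locally_connected_params(k, c_in, c_out, h_out, w_out):
    """Locally connected layer: a separate filter bank (and bias) per output location."""
    return h_out * w_out * (k * k * c_in * c_out + c_out)

# Hypothetical layer: 3x3 kernel, 16 -> 16 channels, 10x10 output map.
n_conv = conv_params(3, 16, 16)
n_local = locally_connected_params(3, 16, 16, 10, 10)
ratio = n_local // n_conv  # grows with the number of output locations
```

With these sizes the locally connected layer has as many copies of the filter bank as there are output positions, so the ratio equals `h_out * w_out`.
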
        
- **Contrastive loss**
    - we have seen that embeddings learned with a classification loss are not optimized for open-world problems (problems where, if a new identity appears, we should in principle retrain the net from scratch for it to learn the new identity).
    - a better loss should make:
        - $d(x^{(i)}, x^{(j)})$ **small** if they depict the same person
        - $d(x^{(i)}, x^{(j)})$ **large** if they depict two different persons
    - a simple loss that captures both requirements is the following (where Euclidean norm is used):
    
        - <img src="con.png" width="70%" height="70%">
    - **problem**: when two clusters are far enough apart, I am happy; there is no need to push their distance to infinity. **How do we tell this to the loss? --> Hinge loss**.
    
    - **Hinge loss**:
        - Hinge loss introduces a hyperparameter _**m**_ , and its formulation says _"I want two images of the same person to be as close as possible and two images of different persons to be far apart **by at least m**"_ . Once they are _m_ apart, I do not optimize anymore.
        - <img src="hinge.png" width="80%" height="80%">
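
A minimal sketch of a contrastive loss with a hinge on the negative pairs. The exact formula lives in the images above; the squared form used here is one common formulation and is an assumption, as are the example embeddings and the default `margin`.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(e1, e2, same_person, margin=1.0):
    """Contrastive loss for one pair (assumed squared form)."""
    d = euclidean(e1, e2)
    if same_person:
        return d ** 2                 # pull embeddings of the same person together
    return max(0.0, margin - d) ** 2  # push different persons apart, only up to the margin

anchor = [0.0, 0.0]
loss_pos = contrastive_loss(anchor, [0.5, 0.0], same_person=True)
loss_neg_close = contrastive_loss(anchor, [0.5, 0.0], same_person=False)
loss_neg_far = contrastive_loss(anchor, [2.0, 0.0], same_person=False)
```

The last pair is already farther apart than the margin, so its loss is zero and it no longer drives the optimization.
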
        
- **Triplet loss**
    - **Contrastive loss problem**: in the meantime, people thought more about the contrastive loss and came up with a better idea. Recall that we need two properties:
        - two images of the same person to have similar embeddings
        - two images of different persons to have far-away embeddings
    - now, the contrastive loss captures this but optimizes it indirectly. **To have good embeddings we need both conditions to hold at once**. The contrastive loss instead takes one pair at a time and enforces either one condition or the other, never both together. So people came up with the Triplet loss, which tries to fulfill the two properties at once.
    - the formulation of the Triplet loss is shown in the image below
    - <img src="triplet.png" width="70%" height="70%">
    - **there is a problem with this formulation though**: as soon as I have pushed the orange point farther away than the green point from the purple reference point, **the optimization stops**; it does not push any farther. **To solve this problem a margin _m_ is added**. This margin ensures that the negative (the Fermi picture) is farther from the reference than the positive (the other Einstein picture) by at least _m_.
    - <img src="triplet2.png" width="70%" height="70%">
    - Why is Triplet loss better than contrastive loss?
        - contrastive loss requires the two pictures of Einstein to collapse into a single point, which is in general impossible. Triplet loss instead **learns to rank**, namely to sort the pictures by distance from the reference. This is an easier task, easier to optimize, and tends to work better.
     
    - **Semi-hard negative mining**:
        - **problem**: creating effective triplets is not easy, though. In particular **the choice of negative examples is the key**.
        - most of the possible triplets, which are on the order of $N^3$, already satisfy the constraint imposed by the triplet loss above (the one with the margin). We want to compute the loss only on the so-called _active triplets_, those that actually contribute to the learning.
        - so we proceed as follows:
            - we create a _large_ minibatch _B_ where we pick a fixed number of images for _D_ identities to ensure that a significant representation of the anchor-positive distance can be formed, and then **we randomly sample negatives that will be added to complete the batch**.
            - then the triplets are formed **on-line**:
                - creating all the possible (anchor, positive) pairs for each identity
                - creating a triplet by adding a semi-hard negative _N_, i.e. a negative that **lies within the margin _m_**: farther from the anchor than the positive, but by less than _m_ (so a negative that is beyond the dotted line, not within)
        - notice that we do not select the hardest negatives, those that lie between the anchor and the positive, since this was found to lead to poor training.
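
The triplet loss with margin and the on-line semi-hard selection can be sketched as follows. This is a simplified illustration, not the FaceNet implementation: squared Euclidean distances are assumed, the batch is a tiny hand-made list, and the labels `id_far` and `id_hard` are hypothetical identities added only to show the easy and hard cases.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def mine_semi_hard_triplets(batch, margin):
    """batch: list of (identity, embedding). Returns index triplets (a, p, n)."""
    triplets = []
    for a, p in permutations(range(len(batch)), 2):
        if batch[a][0] != batch[p][0]:
            continue  # anchor and positive must share the identity
        d_ap = sq_dist(batch[a][1], batch[p][1])
        for n, (label_n, emb_n) in enumerate(batch):
            if label_n == batch[a][0]:
                continue  # a negative must be a different identity
            d_an = sq_dist(batch[a][1], emb_n)
            # semi-hard: farther than the positive, but still inside the margin
            if d_ap < d_an < d_ap + margin:
                triplets.append((a, p, n))
    return triplets

margin = 0.5
batch = [
    ("einstein", [0.0, 0.0]),
    ("einstein", [0.2, 0.0]),
    ("fermi",    [0.5, 0.0]),   # semi-hard negative
    ("id_far",   [3.0, 0.0]),   # easy negative: constraint already satisfied
    ("id_hard",  [0.1, 0.0]),   # hardest negative: closer than the positive
]
triplets = mine_semi_hard_triplets(batch, margin)
active_loss = triplet_loss(batch[0][1], batch[1][1], batch[2][1], margin)
easy_loss = triplet_loss(batch[0][1], batch[1][1], batch[3][1], margin)
```

Only the `fermi` embedding ends up in a triplet. The far negative gives zero loss, and the very close one is skipped on purpose.
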
        
- **FaceNet**
    - FaceNet uses triplet loss with stacks of InceptionV1 modules with some modifications:
        - not all the pooling layers are max pooling; some are L2 pooling, so the output is the L2 norm of the values in the kernel window (why this worked better remains a mystery, though).
        - as usual the output is L2 normalized.
        - the embeddings are 1x1x128, so very small embeddings.
        - there is no test-time augmentation or preprocessing as in DeepID2, only a simple center crop and aligned faces.
        - stunning performance of 99.63% on the FV task on the Labeled Faces in the Wild dataset.
        - **drawback**: it requires a very large training dataset with 8M identities.
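
As a small illustration of the two L2 operations mentioned above, here is a sketch of L2 pooling over one kernel window and of L2 normalization of an embedding. The input values are arbitrary and the `eps` guard is an assumption added to avoid dividing by zero.

```python
import math

def l2_pool(window):
    """L2 pooling: the output is the L2 norm of the values inside the kernel window."""
    return math.sqrt(sum(v * v for v in window))

def l2_normalize(embedding, eps=1e-12):
    """Rescale an embedding to unit L2 norm (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in embedding))
    return [v / max(norm, eps) for v in embedding]

pooled = l2_pool([3.0, 4.0])
unit = l2_normalize([3.0, 4.0])
unit_norm = math.sqrt(sum(v * v for v in unit))
```
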
        
- **ArcFace**
    - the idea of ArcFace is to go back to a classification loss. Why?
        - the cross-entropy (CE) loss is easier to optimize and converges more reliably
        - we do not need triplets and so we do not have to mess with semi-hard negative mining
        - it is faster because we do not need to form the triplets online when we have a batch
        - we can use a reasonable batch size because each example is independent
    - so classification loss has a lot of pros, but, as a drawback, we know that it **does not shape the embeddings to form clusters**. **To solve this problem, ArcFace modifies the softmax to enforce _high intra-class compactness_**.
    - How did they proceed?
        - we know that the last layer computes **angular distances** between the _templates_ and the embedding.
        - they took the embedding vectors and **normalized them to unit norm, i.e. they are points on a hypersphere**, each one corresponding to a class. The hypersphere can be seen below.
        - <img src="arc.png" width="15%" height="15%">
        - of course **there is noise at the boundaries** because with just the softmax we cannot guarantee the compactness of each cluster. The idea is to have something like this:
        - <img src="arc2.png" width="20%" height="20%">
        - how to do that?
            - basically **we always penalize** the correct predictions by a constant, so that **the loss is never satisfied with how close the face is to its correct template**. For instance, when my net is right and says that the angle between the template and the face is 15°, I tell it that the angle is actually 15° + **15°** (a constant), so the net tries to reduce it further, making everything more compact.
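
The angular penalty can be sketched as follows: both the embedding and the class templates are normalized, the angle to the correct template is increased by a constant margin, and the usual softmax cross-entropy is applied to the resulting cosines. The `scale` factor is part of the published ArcFace formulation but not of these notes, and the 2D vectors and the 15° margin are illustrative assumptions. The corner case where the angle plus the margin exceeds 180° is ignored here.

```python
import math

def normalize(v):
    """Rescale a vector to unit norm so that dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(embedding, templates, target, margin, scale=1.0):
    """Cosine logits where the angle to the target template is increased by `margin`."""
    e = normalize(embedding)
    logits = []
    for j, w in enumerate(templates):
        cos_theta = sum(a * b for a, b in zip(e, normalize(w)))
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numerical safety
        if j == target:
            theta += margin  # pretend the correct class is farther away than it is
        logits.append(scale * math.cos(theta))
    return logits

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

templates = [[1.0, 0.0], [0.0, 1.0]]
face = [math.cos(math.radians(15)), math.sin(math.radians(15))]  # 15 degrees from template 0

plain = arcface_logits(face, templates, target=0, margin=0.0)
penalized = arcface_logits(face, templates, target=0, margin=math.radians(15))
loss_plain = cross_entropy(plain, 0)
loss_penalized = cross_entropy(penalized, 0)
```

The penalized loss is larger even though the prediction is correct, which is what keeps pushing the embedding closer to its template.
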

**Depth estimation**

- FlowNet vs DispNet
    - the difference between the two is just the search space: in FlowNet the matching pixel can be anywhere in a 2D neighborhood, while in DispNet the matching pixel lies on a line, _but the architecture is the same_.
    - FlowNet/DispNet architecture is made up of an _encoder and decoder_. In the **encoder**:
        - at the beginning it has a **Siamese part** which extracts features in a consistent way from both images (the images taken by the stereo vision system).
        - then the tensors obtained are passed to a **correlation layer** which works as follows:
            - given the feature maps $f^1$ and $f^2$ obtained from the images $x^1$ and $x^2$ respectively, the matching cost at location $(u, v)$ in $f^1$ is computed as a dot product (an unnormalized NCC) with the features of $f^2$ at each displacement $(du, dv)$ inside a 2D window around the pixel at location $(u, v)$.
            - this layer **has no learnable parameters**. It has only two **hyperparameters**:
                - how big the window is
                - stride
            - for a _flow problem_ , we will search on a 2D window, so in this case, using $d = 3, s = 2$, we will have to search on the following positions:
            - <img src="corr1.png" width="50%" height="50%">
            - for a _stereo problem_ , we have only a 1D direction to search on, so in this case using $d = 4, s = 1$, we will have to search on the following positions:
            - <img src="corr2.png" width="35%" height="35%">
    - while the **decoder**:
        - it is a U-net decoder with up-convs (upsampling convolutions) and skip connections. The only difference wrt U-net is that the loss is applied also to intermediate resolutions that up-convs produce.
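
The correlation layer of the encoder can be sketched in plain Python as a dot product between feature vectors at a list of candidate displacements. Several things here are assumptions: the nested-list layout `[H][W][C]`, the way offsets are sampled from `d` and `s` (the exact positions in the figures above may differ), the leftward search direction for stereo, and the toy one-hot features.

```python
def correlation(f1, f2, displacements):
    """f1, f2: feature maps as nested lists [H][W][C].

    displacements: list of (dv, du) offsets (rows, cols) to test in f2.
    Returns cost[H][W][len(displacements)]; matches outside the image score 0.
    """
    H, W = len(f1), len(f1[0])
    cost = [[[0.0] * len(displacements) for _ in range(W)] for _ in range(H)]
    for v in range(H):
        for u in range(W):
            for k, (dv, du) in enumerate(displacements):
                v2, u2 = v + dv, u + du
                if 0 <= v2 < H and 0 <= u2 < W:
                    cost[v][u][k] = sum(a * b for a, b in zip(f1[v][u], f2[v2][u2]))
    return cost  # no learnable parameters anywhere

def flow_displacements(d, s):
    """2D search grid: offsets in [-d, d] on both axes, sampled every s pixels."""
    steps = range(-d, d + 1, s)
    return [(dv, du) for dv in steps for du in steps]

def stereo_displacements(d, s):
    """1D search line: same row, shifts of 0..d pixels to the left (assumed convention)."""
    return [(0, -k) for k in range(0, d + 1, s)]

# Toy stereo pair: one row, one-hot features, second image shifted by one pixel.
a, b, c, e, zero = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]
f1 = [[a, b, c, e]]
f2 = [[b, c, e, zero]]
cost = correlation(f1, f2, stereo_displacements(2, 1))
best_shift = cost[0][2].index(max(cost[0][2]))  # best-matching shift for pixel u = 2
n_flow = len(flow_displacements(2, 1))
n_stereo = len(stereo_displacements(4, 1))
```

The output has one channel per tested displacement, which is why the 2D flow search is much more expensive than the 1D stereo search.
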
        
- GCNet (Geometry and Context Network)
    - GCNet tries to use deep learning to explicitly recreate all the steps of traditional stereo, because with FlowNet and DispNet we threw everything away and replaced 30 years of knowledge with deep learning.
    - GCNet is structured as follows:
        - its **feature extractor** is a Siamese part. In particular they used a ResNet backbone where they changed the stride to 2 to _not lose much spatial resolution_.
        - this gives two feature tensors, and a **4D cost volume** is built from them by stacking, along a fourth dimension (the disparity), the left features together with the right features shifted by each candidate disparity.
        - then **they used a U-net encoder-decoder architecture** to take this huge 4D volume and produce a 3D volume at the end using 3D convs.
        - at the end a **soft argmin** is used to produce the final disparity map.
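
A minimal sketch of the two non-learned pieces of this pipeline, under stated assumptions: the cost volume is built by pairing each left feature with the right feature shifted by every candidate disparity (zeros where the shift leaves the image), and the soft argmin is the expected disparity under a softmax over the negated costs. Feature maps are nested lists `[H][W][C]` and all numbers are made up; the 3D convolutions in between are not modelled.

```python
import math

def build_cost_volume(left, right, max_disp):
    """Return a volume indexed [disparity][y][x][2C] from two [H][W][C] feature maps."""
    H, W, C = len(left), len(left[0]), len(left[0][0])
    volume = []
    for d in range(max_disp + 1):
        plane = []
        for y in range(H):
            row = []
            for x in range(W):
                # assumed convention: left pixel x matches right pixel x - d
                shifted = right[y][x - d] if x - d >= 0 else [0.0] * C
                row.append(list(left[y][x]) + list(shifted))
            plane.append(row)
        volume.append(plane)
    return volume

def soft_argmin(costs):
    """Differentiable argmin: sum over d of d * softmax(-cost)[d]."""
    m = min(costs)
    weights = [math.exp(-(c - m)) for c in costs]  # shift by the min for stability
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))

left = [[[1.0], [2.0], [3.0]]]
right = [[[4.0], [5.0], [6.0]]]
volume = build_cost_volume(left, right, max_disp=1)

sharp = soft_argmin([10.0, 10.0, 0.0, 10.0, 10.0])  # clear minimum at disparity 2
subpixel = soft_argmin([10.0, 0.0, 0.0, 10.0])      # tie between disparities 1 and 2
```

Unlike a hard argmin, the result can fall between two integer disparities, and it is differentiable, so the loss on the final disparity map can be backpropagated through it.
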