**Softmax**
- $\mathrm{softmax}(s)_j = \frac{\exp(s_j)}{\sum_{k} \exp(s_k)}$
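
The formula can be sketched in NumPy as follows. Subtracting the maximum score before exponentiating is a common stability trick that leaves the result unchanged; the function name `softmax` is only illustrative.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis of `scores`."""
    scores = np.asarray(scores, dtype=float)
    # Shifting by the max does not change the ratio but avoids overflow in exp.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())
```

The outputs are positive and sum to 1, so they can be read as class probabilities.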

**Cross-entropy loss**
- $-\log(\mathrm{softmax}(f(x;\theta))_y)$, where $y$ is the index of the true class
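
The loss is the negative log-probability that the softmax assigns to the correct class. A minimal sketch for a single example, assuming `logits` plays the role of $f(x;\theta)$ and `target` is the index of the true class (both names are illustrative):

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """Return -log(softmax(logits)[target]) for one example."""
    logits = np.asarray(logits, dtype=float)
    # Log-softmax computed with the log-sum-exp trick for numerical stability.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

print(cross_entropy_from_logits([2.0, 1.0, 0.1], target=0))
```

With uniform scores over $C$ classes the loss equals $\log C$, which is a handy sanity check at the start of training.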

**Momentum**
- $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)})$
- $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
- $v^{(t + 1)}$ is an exponentially decaying accumulation of the previous update steps, with decay factor $\beta$
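
A minimal sketch of one momentum step, tried on a toy quadratic loss. The helper `momentum_step`, the callable `grad_fn` and the default hyperparameters are assumptions made for illustration.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """One momentum update; returns the new (theta, v)."""
    v_new = beta * v - lr * grad_fn(theta)
    return theta + v_new, v_new

# Toy objective L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
grad_fn = lambda theta: theta

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, v = momentum_step(theta, v, grad_fn)
print(theta)  # ends up very close to the minimum at the origin
```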

**Nesterov momentum**
- very similar to *momentum*, but the gradient is computed after having "partially" updated $\theta^{(t)}$ with $\beta v^{(t)}$:
- $v^{(t + 1)}  = \beta v^{(t)} - lr \nabla L(\theta^{(t)} + \beta v^{(t)})$
- $\theta^{(t+1)} = \theta^{(t)} + v^{(t + 1)}$
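- Nesterov momentum typically converges faster than classical momentum, because the gradient is evaluated closer to where the parameters are about to land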
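
The only change with respect to classical momentum is the point at which the gradient is evaluated. A minimal sketch under the same assumptions as before (hypothetical helper `nesterov_step`, a callable `grad_fn`, toy quadratic loss):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov momentum update; returns the new (theta, v)."""
    lookahead = theta + beta * v  # parameters after the "partial" update
    v_new = beta * v - lr * grad_fn(lookahead)
    return theta + v_new, v_new

# Toy objective L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
grad_fn = lambda theta: theta

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, v = nesterov_step(theta, v, grad_fn)
print(theta)  # ends up very close to the minimum at the origin
```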

**AdaGrad**
- $s^{(t + 1)} = s^{(t)} + \nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is the history of the squared gradients
- $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
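
A minimal sketch of one AdaGrad step, with `*` and the square root applied element-wise. The helper name and the default values of `lr` and `eps` are assumptions.

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update; returns the new (theta, s)."""
    s_new = s + grad * grad  # accumulated squared gradients
    theta_new = theta - lr / (np.sqrt(s_new) + eps) * grad
    return theta_new, s_new

theta, s = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([2.0, 0.5])  # very different gradient scales per coordinate
theta, s = adagrad_step(theta, s, grad)
print(theta)  # both coordinates move by roughly lr on the first step
```

Because each coordinate is divided by its own gradient history, the first step has about the same size in every direction. Since $s$ only grows, the effective learning rate keeps shrinking over time.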

**RMSProp**
- $s^{(t + 1)} = \beta s^{(t)} + (1 - \beta)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$, this $s$ is an exponentially decaying average of the squared gradients, so old gradients are gradually forgotten
- $\theta ^{(t + 1)} =  \theta ^{(t)} - \frac{lr}{\sqrt{s^{(t+1)}}+ \epsilon} * \nabla L (\theta ^{(t)})$
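
A minimal sketch of one RMSProp step. It differs from AdaGrad only in how $s$ is updated; the helper name and default hyperparameters are assumptions.

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new (theta, s)."""
    # Decaying average instead of AdaGrad's ever-growing sum.
    s_new = beta * s + (1 - beta) * grad * grad
    theta_new = theta - lr / (np.sqrt(s_new) + eps) * grad
    return theta_new, s_new

theta, s = np.array([1.0]), np.zeros(1)
theta, s = rmsprop_step(theta, s, grad=np.array([2.0]))
print(theta, s)
```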

**ADAM**
- $g^{(t + 1)} = \beta_1 g^{(t)} + (1 - \beta_1)\nabla L (\theta ^{(t)})$
- $s^{(t + 1)} = \beta_2 s^{(t)} + (1 - \beta_2)\nabla L (\theta ^{(t)}) * \nabla L (\theta ^{(t)})$
- $g^{debiased} = \frac{g^{(t+1)}}{1-\beta_1^{t+1}}$, $s^{debiased} = \frac{s^{(t+1)}}{1-\beta_2^{t+1}}$
- $\theta^{(t+1)} = \theta^{(t)} - \frac{lr}{\sqrt{s^{debiased}}+ \epsilon} * g^{debiased}$
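
A minimal sketch of one Adam step following the four formulas above. The helper name, the default hyperparameters and the convention that `t` is the 0-based step index (so the first debiasing divides by $1-\beta_1$) are assumptions.

```python
import numpy as np

def adam_step(theta, g, s, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 0-based index of the current step."""
    g_new = beta1 * g + (1 - beta1) * grad          # first moment (momentum)
    s_new = beta2 * s + (1 - beta2) * grad * grad   # second moment (RMSProp-like)
    g_hat = g_new / (1 - beta1 ** (t + 1))          # debiased first moment
    s_hat = s_new / (1 - beta2 ** (t + 1))          # debiased second moment
    theta_new = theta - lr / (np.sqrt(s_hat) + eps) * g_hat
    return theta_new, g_new, s_new

theta = np.array([1.0, 1.0])
g, s = np.zeros(2), np.zeros(2)
grad = np.array([2.0, -0.5])
theta, g, s = adam_step(theta, g, s, grad, t=0)
print(theta)  # each coordinate moves by about lr, in the direction opposite to its gradient
```

The debiasing matters mostly in the first steps: with $g$ and $s$ initialised at zero, the raw averages would otherwise be strongly biased towards zero.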

**Relationship between spatial dimensions** (no padding, stride 1)
- $H_{out} = H_{in} - H_k + 1$
- $W_{out} = W_{in} - W_k + 1$

**Padding**
- $H_{out} = H_{in} - H_k + 1 + 2P$
- $W_{out} = W_{in} - W_k + 1 + 2P$

**Stride**
- $H_{out} = \left\lfloor \frac{H_{in} - H_k + 2P}{S} \right\rfloor + 1$
- $W_{out} = \left\lfloor \frac{W_{in} - W_k + 2P}{S} \right\rfloor + 1$
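
The stride formula contains the previous two as special cases ($S = 1$, and $S = 1$ with $P = 0$), with the bracket read as rounding down. A small sketch for one spatial axis; the function name and the 32-pixel example are illustrative.

```python
def conv_output_size(size_in, kernel, padding=0, stride=1):
    """Output length along one spatial axis of a convolution."""
    # Integer division implements the rounding down in the formula.
    return (size_in - kernel + 2 * padding) // stride + 1

print(conv_output_size(32, 5))                       # 28
print(conv_output_size(32, 5, padding=2))            # 32
print(conv_output_size(32, 5, padding=2, stride=2))  # 16
```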

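**Formula of learnable parameters**: the learnable parameters are the weights of all the kernels, plus one bias per kernel. For example, a convolution with a kernel tensor of shape $16 \times 8 \times 5 \times 5$, where 16 is the number of kernels applied (output channels), 8 the number of input channels and $5 \times 5$ the spatial size, has this many learnable parameters: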
- $16 \times (8 \times 5 \times 5 + 1)$ (+1 for the bias)
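
The same count as a small function, generalised to any kernel shape. The argument names are illustrative; `bias=True` adds one bias per kernel as in the example above.

```python
def conv_param_count(c_out, c_in, k_h, k_w, bias=True):
    """Number of learnable parameters of a 2D convolution layer."""
    per_kernel = c_in * k_h * k_w + (1 if bias else 0)
    return c_out * per_kernel

print(conv_param_count(16, 8, 5, 5))              # 3216
print(conv_param_count(16, 8, 5, 5, bias=False))  # 3200
```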

**Formula of FLOPs**
- size of the output feature map $\times$ 3D kernel size $\times 2$
- the factor $\times 2$ is there because each output value takes $n$ multiplications and about $n$ additions, with $n$ the 3D kernel size
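
A small sketch of this count, assuming that "output feature map" means all output activations ($C_{out} \times H_{out} \times W_{out}$), that the 3D kernel size is $C_{in} \times H_k \times W_k$, and that the bias is ignored. The function name and the example sizes are illustrative.

```python
def conv_flops(c_out, h_out, w_out, c_in, k_h, k_w):
    """Approximate FLOPs of one 2D convolution layer (bias ignored)."""
    output_elements = c_out * h_out * w_out
    kernel_volume = c_in * k_h * k_w
    # Each output element needs one multiply and one add per kernel weight.
    return 2 * output_elements * kernel_volume

# 16 kernels of size 8x5x5 applied to a 32x32 input without padding (28x28 output).
print(conv_flops(16, 28, 28, 8, 5, 5))  # 5017600
```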

**Polyak average**
- $$\theta^{(test)} = (1-\rho)\theta^{(i+1)} + \rho\theta^{(test)}$$
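
A minimal sketch of the update, assuming $\theta^{(i+1)}$ are the parameters after the latest training step and $\rho$ is a decay close to 1. The helper name and the toy values are illustrative.

```python
import numpy as np

def polyak_update(theta_test, theta_new, rho=0.999):
    """Exponential moving average of the parameters, used at test time."""
    return (1 - rho) * theta_new + rho * theta_test

theta_test = np.array([0.0])
for theta_new in [np.array([1.0]), np.array([1.0]), np.array([1.0])]:
    theta_test = polyak_update(theta_test, theta_new, rho=0.9)
print(theta_test)  # drifts slowly towards the training parameters
```

The training parameters themselves are left untouched; only the copy used for evaluation is smoothed.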

**Integral image**
- $$II(i, j) = II(i, j-1) + II(i-1, j) - II(i-1, j-1) + I(i, j)$$
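
The recurrence can be sketched directly in NumPy. Values outside the image are taken to be 0, which the code handles with an extra zero row and column; the function name is illustrative.

```python
import numpy as np

def integral_image(img):
    """Integral image: entry (i, j) is the sum of img[:i+1, :j+1]."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    # Extra zero row/column so that II(-1, .) and II(., -1) read as 0.
    ii = np.zeros((h + 1, w + 1), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            ii[i + 1, j + 1] = ii[i + 1, j] + ii[i, j + 1] - ii[i, j] + img[i, j]
    return ii[1:, 1:]

img = np.arange(12).reshape(3, 4)
ii = integral_image(img)
print(ii)
```

The result matches two cumulative sums, `img.cumsum(axis=0).cumsum(axis=1)`, which is the faster option in practice.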

**Focal loss**
- $$BFL(p_t) = -(1-p_t)^{\gamma} \ln p_t$$
- <img src="pt.png" width="70%" height="70%">
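
A minimal sketch of the binary focal loss, assuming the usual definition $p_t = p$ when the label is 1 and $p_t = 1 - p$ otherwise, and calling the exponent (the focusing parameter) `gamma`. The function name and the default `gamma=2.0` are illustrative choices.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0):
    """Binary focal loss for predicted probability p of class 1 and label y in {0, 1}."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted much more than a hard one (p_t = 0.1).
print(binary_focal_loss([0.9, 0.1], [1, 1]))
```

With `gamma=0` the expression reduces to the ordinary binary cross-entropy.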

**Contrastive loss**
- <img src="con.png" width="70%" height="70%">

**Hinge loss**
- <img src="hinge.png" width="80%" height="80%">

**Triplet loss + margin**
- <img src="triplet2.png" width="70%" height="70%">

**EfficientNet**
- <img src="eff.png" width="70%" height="70%">
