# 🧠 Machine Learning for Classification: Key Equations

## **Linear Classifier**
The model computes class scores using a weight matrix.

$$
s = Wx
$$

---

## **Softmax Probability**
Softmax converts raw scores into normalized probabilities.

$$
p_j = \frac{e^{s_j}}{\sum_{k} e^{s_k}}
$$

---

## **Cross Entropy Loss**
Measures how well predicted probabilities match the true class.

$$
L_i = -\log(p_{y_i})
$$

Expanded:

$$
L_i = -\log\left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)
$$


## **6 – ML for Classification**

## k-NN Distance (Euclidean)

$$
d(\mathbf{x}, \mathbf{x}_i)
= \left\lVert \mathbf{x} - \mathbf{x}_i \right\rVert_2
= \sqrt{\sum_{j=1}^D (x_j - x_{i,j})^2}
$$

**Explanation:**  
Euclidean distance between the query sample $\mathbf{x}$ and a training sample $\mathbf{x}_i$, summed over the $D$ feature dimensions. A smaller distance means the samples are more similar.
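
A minimal sketch of this distance and of a majority-vote prediction built on it, in plain Python. The helper names `euclidean_distance` and `knn_predict` and the toy data are illustrative assumptions, not part of the notes.

```python
import math
from collections import Counter

def euclidean_distance(x, xi):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))

def knn_predict(query, train_x, train_y, k=3):
    """Predict by majority vote among the k nearest training samples."""
    # Sort training indices by distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean_distance(query, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["a", "a", "b", "b"]
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))      # 5.0
print(knn_predict([0.5, 0.5], train_x, train_y, k=3))  # a
```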


## Linear Classifier: Score Function

$$
f(\mathbf{x}, W, \mathbf{b}) = W \mathbf{x} + \mathbf{b}
$$

**Explanation:**  
Computes raw class scores using a linear transformation of the input vector.


## Bias Trick (Augmented Input)

$$
\tilde{\mathbf{x}} =
\begin{bmatrix}
\mathbf{x} \\
1
\end{bmatrix},
\qquad
f(\mathbf{x}, W) = W \tilde{\mathbf{x}}
$$

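**Explanation:**  
Appends a constant 1 to the input so that the bias $\mathbf{b}$ is absorbed into the weight matrix as an extra column; $W$ on the right-hand side denotes this augmented matrix $[\,W \;\; \mathbf{b}\,]$.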
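
The equivalence can be checked numerically. The sketch below uses made-up values for `W`, `b`, and `x`, and a hypothetical helper `matvec`; it only illustrates that both forms give the same scores.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 classes, 2 features
b = [0.5, -1.0, 2.0]
x = [1.0, -1.0]

# Original form: W x + b
scores = [s + bi for s, bi in zip(matvec(W, x), b)]

# Bias trick: b becomes an extra column of W, and 1 an extra input entry.
W_aug = [row + [bi] for row, bi in zip(W, b)]
x_aug = x + [1.0]
scores_aug = matvec(W_aug, x_aug)

print(scores)      # [-0.5, -2.0, 1.0]
print(scores_aug)  # [-0.5, -2.0, 1.0]
```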


## Multiclass SVM Loss (Per Sample)

$$
L_i
= \sum_{j \ne y_i}
\max(0,\, s_j - s_{y_i} + \Delta)
$$

**Explanation:**  
Penalizes every incorrect class whose score is not at least a margin $\Delta$ below the correct-class score.
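
A small sketch of the per-sample loss, assuming $\Delta = 1$ as the default; the function name `svm_loss` and the example scores are illustrative.

```python
def svm_loss(scores, y, delta=1.0):
    """Multiclass hinge loss for one sample whose correct class index is y."""
    return sum(max(0.0, s - scores[y] + delta)
               for j, s in enumerate(scores) if j != y)

# Class 1 outscores the correct class 0, so it contributes 5 - 3 + 1 = 3.
print(svm_loss([3.0, 5.0, 1.0], y=0))  # 3.0
# Both wrong classes are more than the margin below the correct score.
print(svm_loss([3.0, 1.0, 0.5], y=0))  # 0.0
```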


## Full SVM Loss (Regularized)

$$
L(W)
= \frac{1}{N} \sum_{i=1}^N L_i
+ \lambda \lVert W \rVert_2^2
$$

**Explanation:**  
Adds L2 penalty to control overfitting.


## Softmax Probability

$$
p_j
= \frac{e^{s_j}}{\sum_{k=1}^C e^{s_k}}
$$

**Explanation:**  
Converts raw class scores into normalized probabilities.
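
In code, the maximum score is usually subtracted before exponentiating; this leaves the result unchanged but avoids overflow. A minimal sketch (the function name `softmax` is an illustrative choice):

```python
import math

def softmax(scores):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # approximately [0.090, 0.245, 0.665]
print(sum(probs))  # approximately 1.0
```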


## Cross-Entropy Loss (Softmax)

$$
L_i = -\log p_{y_i}
= -\log
\frac{e^{s_{y_i}}}{\sum_k e^{s_k}}
$$

Full loss with L2 regularization:
$$
L(W)
= \frac{1}{N} \sum_{i=1}^N
\left(
-\log
\frac{e^{s_{y_i}}}{\sum_{k=1}^C e^{s_k}}
\right)
\;+\; \lambda \,\lVert W \rVert_2^2
$$

**Explanation:**  
Minimizes the negative log-probability of the correct class. This is the standard loss for multiclass classification.
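
The per-sample loss can be computed directly from the scores as $\log\sum_k e^{s_k} - s_{y_i}$. The sketch below does this with the usual max-shift for stability; `cross_entropy` is an illustrative name and regularization is left out.

```python
import math

def cross_entropy(scores, y):
    """-log softmax(scores)[y], computed as logsumexp(scores) - scores[y]."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[y]

# Uniform scores over 3 classes give p = 1/3, so the loss is log(3).
print(cross_entropy([0.0, 0.0, 0.0], y=1))  # about 1.0986
```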


## Gradient Descent Update

$$
W_{t+1} = W_t - \eta \frac{\partial L}{\partial W}
$$

**Explanation:**  
Moves the weights opposite to the gradient of the loss to reduce it; $\eta$ is the learning rate.

## Mini-Batch SGD Update

$$
W_{t+1}
= W_t - \eta\, \frac{1}{|B|}
\sum_{i\in B} \frac{\partial L_i}{\partial W}
$$

**Explanation:**  
Uses gradients averaged over a mini-batch for stable training.

Single-sample SGD uses the gradient of one sample $i$:

$$
W_{t+1}
= W_t - \eta \,\frac{\partial L_i}{\partial W}
$$

A single-sample gradient is very noisy; a mini-batch gradient is more stable.
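
As a toy illustration of the mini-batch update, the sketch below fits a single weight in $y = wx$ with the assumed per-sample loss $L_i = \tfrac{1}{2}(w x_i - y_i)^2$. The model, data, and hyperparameters are assumptions for the example, and batches are taken in fixed order (no shuffling) to keep it deterministic.

```python
def minibatch_sgd(xs, ys, w=0.0, lr=0.1, batch_size=2, epochs=50):
    """Fit y = w * x with L_i = 0.5 * (w * x_i - y_i)**2 using mini-batch SGD."""
    n = len(xs)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            batch = range(start, min(start + batch_size, n))
            # dL_i/dw = (w * x_i - y_i) * x_i, averaged over the batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by w = 2
print(minibatch_sgd(xs, ys))  # converges close to 2.0
```

Setting `batch_size=1` gives the single-sample update.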

## Inverse-Sqrt Learning-Rate Schedule (as in slides)

$$
\eta_t = \frac{\eta_0}{\sqrt{t / T}}
$$

**Explanation:**  
Example schedule: $\eta_0$ = initial rate, $t$ = epoch index (starting at 1, since the expression is undefined at $t = 0$), $T$ = total epochs. The exact constants may vary in the slides, but this is the usual inverse-sqrt form.

---------------------------------------
# **📘 8 – Image Classification with CNN**
---------------------------------------

## 2D Convolution (Single Output Channel)

$$
y(i,j)
= \sum_{u=0}^{k_h-1}
\sum_{v=0}^{k_w-1}
w(u,v)\, x(i+u,\, j+v)
$$

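**Explanation:**  
Computes a weighted sum of a local image patch using the kernel $w$ of size $k_h \times k_w$. As written (no kernel flip) this is cross-correlation, which is the operation CNN libraries call convolution.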
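
A direct translation of the double sum into plain Python, assuming no padding and stride 1 (neither is specified in the formula); `conv2d_valid` is an illustrative name.

```python
def conv2d_valid(x, w):
    """Slide kernel w over image x (no padding, stride 1)."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [[sum(w[u][v] * x[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(image, kernel))  # [[-4, -4], [-4, -4]]
```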


## 2D Convolution (Multi-Channel)

$$
y_k(i,j)
= \sum_{c=1}^{C_{in}}
\sum_{u=0}^{k_h-1}
\sum_{v=0}^{k_w-1}
w_{k,c}(u,v)\, x_c(i+u,j+v)
+ b_k
$$

**Explanation:**  
Each filter aggregates across all input channels. $k$ = output channel index, $c$ = input channel index, $b_k$ = bias of output channel $k$.

## ReLU Activation

$$
\text{ReLU}(z) = \max(0, z)
$$

**Explanation:**  
Adds non-linearity by clipping negative values to zero.


## Max Pooling

$$
y_c(i,j)
=
\max_{(u,v)\in \mathcal{W}}
x_c(i+u, j+v)
$$

**Explanation:**  
Reduces spatial size by taking local maxima.
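
A sketch for one channel. The formula above does not show a stride, so the `stride` parameter (and the common 2x2, stride-2 default) is an assumption added here to make the size reduction visible.

```python
def max_pool2d(x, size=2, stride=2):
    """Max over size x size windows of one channel, moved by `stride`."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [[max(x[i * stride + u][j * stride + v]
                 for u in range(size) for v in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 1, 8]]
print(max_pool2d(feature_map))  # [[4, 5], [3, 8]]
```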

## Number of Parameters in a Conv Layer

$$
\#\text{params}
= k_h k_w C_{in} C_{out}
+ C_{out}
$$

**Explanation:**  
Kernel weights + one bias per output channel.
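
For a quick check of the count, here is a one-line helper with an assumed example layer (3x3 kernel, 3 input channels, 64 output channels); the name `conv_params` is illustrative.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Kernel weights plus, optionally, one bias per output channel."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

# 3 * 3 * 3 * 64 = 1728 weights, plus 64 biases
print(conv_params(3, 3, 3, 64))  # 1792
```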


## CNN Classifier Head

The last layer is often linear followed by softmax:

$$
\mathbf{s} = W_\text{fc} \,\mathbf{h} + \mathbf{b}_\text{fc},
\qquad
p_j = \frac{e^{s_j}}{\sum_k e^{s_k}}
$$

**Explanation:**  
$\mathbf{h}$ is the flattened feature vector from the conv layers; cross-entropy is then applied to $p$ as before.


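## Backpropagation (Conceptual)

Chain-rule example for a network with one hidden layer:

$$
\frac{\partial J}{\partial w_2}
= \frac{\partial J}{\partial \hat{y}}\,
  \frac{\partial \hat{y}}{\partial z}\,
  \frac{\partial z}{\partial w_2}
$$

**Explanation:**  
$J$ is the loss, $\hat{y}$ the network output, $z$ the output-layer pre-activation (so $\hat{y}$ is a function of $z$, and $z$ of $w_2$), and $w_2$ the second-layer weight. The gradient is the product of the local derivatives along the path from $J$ back to $w_2$.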




---------------------------------------
# *📘 9 – Object Detection*
---------------------------------------

## Bounding Box Regression (Anchor-Based)

$$
t_x = \frac{x - x_a}{w_a},\quad
t_y = \frac{y - y_a}{h_a},\quad
t_w = \log\frac{w}{w_a},\quad
t_h = \log\frac{h}{h_a}
$$

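**Explanation:**  
Converts the ground-truth (GT) box $(x, y, w, h)$, typically center coordinates plus width and height, into normalized offsets relative to an anchor box $(x_a, y_a, w_a, h_a)$. Position offsets are scaled by the anchor size, and size ratios are taken in log space.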
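
A sketch of the encoding together with its inverse, which a detector would use to turn predicted offsets back into a box. The tuple layout `(x, y, w, h)` and the names `encode_box` and `decode_box` are assumptions for illustration.

```python
import math

def encode_box(box, anchor):
    """Box (x, y, w, h) -> offsets (t_x, t_y, t_w, t_h) relative to the anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode_box(t, anchor):
    """Inverse of encode_box: offsets -> box (x, y, w, h)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
gt = (55.0, 40.0, 40.0, 40.0)
t = encode_box(gt, anchor)
print(t)                      # (0.25, -0.25, log(2) ~ 0.693, 0.0)
print(decode_box(t, anchor))  # recovers gt up to rounding
```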

## IoU (Intersection over Union)

$$
IoU
= \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}
$$

**Explanation:**  
Measures the overlap between the predicted box $B_p$ and the ground-truth box $B_{gt}$; ranges from 0 (disjoint) to 1 (identical).
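
For axis-aligned boxes the intersection is itself a rectangle, so IoU reduces to a few min/max operations. The sketch assumes boxes in corner format `(x1, y1, x2, y2)`, which the notes do not specify.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
```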

## Smooth L1 (Huber) Loss

$$
\text{SmoothL1}(d)=
\begin{cases}
0.5 d^2, & |d|<1 \\
|d|-0.5, & \text{otherwise}
\end{cases}
$$

**Explanation:**  
Less sensitive to outliers than pure L2 loss.
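
A direct transcription of the two cases (the function name `smooth_l1` is illustrative). Note how the large error grows linearly rather than quadratically.

```python
def smooth_l1(d):
    """Quadratic for |d| < 1, linear otherwise."""
    return 0.5 * d * d if abs(d) < 1 else abs(d) - 0.5

print([smooth_l1(d) for d in (0.5, 1.0, -3.0)])  # [0.125, 0.5, 2.5]
```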

## Total Detection Loss (Multi-task)

$$
L = L_{cls} + \lambda L_{box}
$$

**Explanation:**  
Combines the classification loss (e.g., cross-entropy) with the bounding-box regression loss, weighted by $\lambda$.

## Precision & Recall

$$
Precision = \frac{TP}{TP+FP},
\qquad
Recall = \frac{TP}{TP+FN}
$$

**Explanation:**  
Used for computing AP/mAP detection metrics.

## Average Precision (AP)

$$
AP = \int_0^1 p(r)\,dr
$$

**Explanation:**  
Area under the precision–recall curve.
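
In practice the integral is approximated from a ranked list of detections. The sketch below uses one simple variant, a non-interpolated sum of precision times recall step; benchmarks often use interpolated versions instead, so treat this as an assumption. The inputs `is_tp` (whether each detection matched a ground-truth box) and `num_gt` are hypothetical names.

```python
def average_precision(scores, is_tp, num_gt):
    """Approximate AP = integral of p(r) dr by summing precision * recall step."""
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], num_gt=4)
print(ap)  # about 0.604
```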


## mAP (Mean Average Precision)

$$
mAP = \frac{1}{C} \sum_{c=1}^C AP_c
$$

**Explanation:**  
Averaged AP over all classes.



---------------------------------------
# *📘 10 – Object Tracking*
---------------------------------------

## Constant-Velocity State Model

$$
\mathbf{x}_t =
\begin{bmatrix}
x_t \\ y_t \\ v_{x,t} \\ v_{y,t}
\end{bmatrix}
$$

**Explanation:**  
Tracks object position + velocity over time.

## State Transition (Kalman Form)

$$
\mathbf{x}_t = F\mathbf{x}_{t-1} + w_t,\quad
F=
\begin{bmatrix}
1&0&\Delta t&0\\
0&1&0&\Delta t\\
0&0&1&0\\
0&0&0&1
\end{bmatrix}
$$

**Explanation:**  
Predicts next state assuming constant velocity.

## Measurement Model

$$
\mathbf{z}_t = H\mathbf{x}_t + v_t,\quad
H=
\begin{bmatrix}
1&0&0&0\\
0&1&0&0
\end{bmatrix}
$$

**Explanation:**  
We only observe the (x,y) position.
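
The two models can be exercised without noise to see what they do to a state. This sketch covers only the mean prediction and the expected measurement, not a full Kalman filter (no covariances or update step); the helper names are illustrative.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def predict(state, dt):
    """Noise-free transition x_t = F x_{t-1} for state [x, y, vx, vy]."""
    F = [[1.0, 0.0, dt, 0.0],
         [0.0, 1.0, 0.0, dt],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    return matvec(F, state)

def measure(state):
    """Noise-free measurement z_t = H x_t: keep only the position."""
    H = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
    return matvec(H, state)

state = predict([0.0, 0.0, 2.0, -1.0], dt=0.5)
print(state)           # [1.0, -0.5, 2.0, -1.0]
print(measure(state))  # [1.0, -0.5]
```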

## SSD Template Matching

$$
SSD(u,v)=
\sum_{x,y}
\left[
I(x+u,y+v) - T(x,y)
\right]^2
$$

**Explanation:**  
Lower SSD = better match.
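
A brute-force sketch that evaluates SSD at every valid offset and returns the best one. Images are nested lists indexed `image[row][col]`, so $I(x+u, y+v)$ becomes `image[y + v][x + u]`; the names `ssd` and `best_match` are illustrative.

```python
def ssd(image, template, u, v):
    """Sum of squared differences with the template placed at offset (u, v)."""
    return sum((image[y + v][x + u] - template[y][x]) ** 2
               for y in range(len(template))
               for x in range(len(template[0])))

def best_match(image, template):
    """Offset (u, v) with the smallest SSD over all valid placements."""
    th, tw = len(template), len(template[0])
    offsets = [(u, v)
               for v in range(len(image) - th + 1)
               for u in range(len(image[0]) - tw + 1)]
    return min(offsets, key=lambda o: ssd(image, template, *o))

image = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 6, 0]]
template = [[9, 8],
            [7, 6]]
print(best_match(image, template))  # (1, 1)
```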

## NCC Template Matching

$$
NCC(u,v)=
\frac{
\sum (I-\mu_I)(T-\mu_T)
}{
\sqrt{
\sum(I-\mu_I)^2 \sum(T-\mu_T)^2
}}
$$

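**Explanation:**  
Normalized score in $[-1, 1]$ that is robust to linear brightness and contrast changes; higher is better. The sums run over template pixels $(x, y)$ with $I$ evaluated at $(x+u,\, y+v)$, and $\mu_I$, $\mu_T$ are the means of the image window and the template.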


---------------------------------------
# *📘 11 – Image Segmentation*
---------------------------------------

## Pixel-wise Softmax

$$
p_{i,c}
= \frac{e^{s_{i,c}}}{\sum_k e^{s_{i,k}}}
$$

**Explanation:**  
Probability a pixel belongs to class c.

## Pixel-wise Cross-Entropy Loss

$$
L =
-\frac{1}{N}
\sum_{i=1}^N \sum_{c=1}^C
y_{i,c}\log p_{i,c}
$$

**Explanation:**  
Common loss for semantic segmentation.

## Dice Coefficient

$$
Dice(P,G)
= \frac{2|P\cap G|}{|P|+|G|}
$$

**Explanation:**  
Measures segmentation mask overlap.

## Dice Loss

$$
L_{Dice} = 1 - Dice(P,G)
$$

**Explanation:**  
A higher Dice score means a better segmentation, so training minimizes $1 - \text{Dice}$.
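
For binary masks, $|P \cap G|$ is the number of pixels set in both. A small sketch on flattened 0/1 masks; the name `dice` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice(pred, gt):
    """Dice coefficient of two binary masks given as flat 0/1 lists."""
    inter = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total > 0 else 1.0

pred = [1, 1, 0, 0]
gt = [1, 0, 1, 0]
print(dice(pred, gt))      # 0.5
print(1 - dice(pred, gt))  # Dice loss: 0.5
```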


---------------------------------------
# *📘 12 – Generative Models (AE, VAE, GAN)*
---------------------------------------

## Autoencoder: Encoding & Decoding

$$
\mathbf{z}=f_\theta(\mathbf{x}),\quad
\hat{\mathbf{x}}=g_\phi(\mathbf{z})
$$

**Explanation:**  
AE compresses input into latent code and reconstructs it.

## AE Reconstruction Loss

$$
L_{AE} = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2
$$

**Explanation:**  
Minimizes reconstruction error.

## VAE Generative Story

Prior:
$$
\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)
$$

The encoder outputs a mean and a standard deviation (in practice often predicted as the log-variance $\log \boldsymbol{\sigma}^2(\mathbf{x})$):
$$
\boldsymbol{\mu}(\mathbf{x}),\;
\boldsymbol{\sigma}(\mathbf{x})
$$

## VAE Reparameterization Trick

$$
\mathbf{z}
= \mu(\mathbf{x})
+ \sigma(\mathbf{x}) \odot \epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I)
$$

**Explanation:**  
Allows backpropagation through sampling.
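
A sketch of the sampling step alone, assuming the encoder provides a log-variance so that $\sigma = e^{\frac{1}{2}\log\sigma^2}$. All randomness sits in `eps`, which is why gradients can pass through `mu` and `log_var` in an autodiff framework; plain Python has no autodiff, so only the forward computation is shown. The names are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)  # seeded generator for repeatability
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)
print(z)
```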

## VAE Loss (Negative ELBO)

$$
L_{VAE}
=
-\mathbb{E}_{q(z|x)}[\log p(x|z)]
+
KL\big(q(z|x)\;||\;p(z)\big)
$$

**Explanation:**  
Reconstruction term + regularization on latent space.

## KL Divergence (Gaussian)

$$
KL=
\frac{1}{2}\sum_j
(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2)
$$

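**Explanation:**  
Closed-form KL divergence between the diagonal Gaussian $q(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and the standard normal prior $p(z) = \mathcal{N}(0, I)$; $j$ indexes the latent dimensions.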
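
A sketch of the closed form, taking the log-variance as input so that $\sigma_j^2 = e^{\log\sigma_j^2}$; the function name is illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), with log_var = log(sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0 (q equals the prior)
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```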

## GAN Minimax Objective

$$
\min_G \max_D\;
\Big[
\mathbb{E}_{x\sim p_{data}}[\log D(x)]
+ \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]
\Big]
$$

**Explanation:**  
Discriminator tries to classify real vs fake; generator tries to fool D.


---------------------------------------
# *📘 13 – 3D Vision*
---------------------------------------

## Pinhole Camera Projection

$$
s\begin{bmatrix}
u\\v\\1
\end{bmatrix}
=
K [R|t]
\begin{bmatrix}
X\\Y\\Z\\1
\end{bmatrix}
$$

**Explanation:**  
Maps 3D world coordinates to 2D image coordinates.

## Intrinsic Matrix

$$
K =
\begin{bmatrix}
f_x & 0 & c_x\\
0 & f_y & c_y\\
0 & 0 & 1
\end{bmatrix}
$$

**Explanation:**  
Camera internal parameters: focal lengths + principal point.
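
Putting $K$ and $[R|t]$ together: transform the point into camera coordinates, multiply by $K$, then divide by the third component (the scale $s$, equal to the depth). The intrinsics and the test point below are made-up values, and `project` is an illustrative name.

```python
def project(point, K, R, t):
    """Project a 3D world point to pixel coordinates (u, v)."""
    # Camera coordinates: X_c = R X + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: s [u, v, 1]^T = K X_c
    hom = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return hom[0] / hom[2], hom[1] / hom[2]

K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [0.0, 0.0, 0.0]
print(project([0.5, -0.25, 2.0], K, R, t))  # (445.0, 177.5)
```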

## Rigid Transformation (3D)

$$
\mathbf{X}' = R\mathbf{X} + t
$$

**Explanation:**  
Rotates + translates a 3D point.

## Euclidean Distance (3D Points)

$$
d=\sqrt{(X_1-X_2)^2 + (Y_1-Y_2)^2 + (Z_1-Z_2)^2}
$$

**Explanation:**  
Used for point-cloud matching and clustering.

## PointNet Set Function

$$
f(\{x_i\}) = \gamma\left(\max_i h(x_i)\right)
$$

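**Explanation:**  
Permutation-invariant network for unordered 3D points: a shared function $h$ is applied to every point, the elementwise max pools over points, and $\gamma$ maps the pooled feature to the output. Because max is symmetric, the point order does not matter.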
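
The invariance is easy to check with stand-in functions. Below, `h` and `gamma` are hand-written placeholders for what would be learned networks; they are purely hypothetical and only demonstrate the structure.

```python
def pointnet_features(points, h, gamma):
    """f({x_i}) = gamma(max_i h(x_i)), with the max taken per feature."""
    per_point = [h(p) for p in points]
    pooled = [max(col) for col in zip(*per_point)]
    return gamma(pooled)

def h(p):
    """Hypothetical per-point feature map (stands in for a shared MLP)."""
    return [p[0] + p[1], p[2] * p[2], -p[0]]

def gamma(feats):
    """Hypothetical head applied to the pooled feature."""
    return sum(feats)

cloud = [(1.0, 2.0, 3.0), (0.0, -1.0, 4.0), (2.0, 0.0, 1.0)]
print(pointnet_features(cloud, h, gamma))        # 19.0
print(pointnet_features(cloud[::-1], h, gamma))  # 19.0 (order does not matter)
```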
