In [None]:
import numpy as np

**The Loss Function: Maximum Likelihood**

Okay, we set some parameters, train the model, and it works... **Are you sure?**

To judge objectively how well our model predicts the output, we need a function that fairly compares our predictions with the actual outputs. This is the **Loss Function**.

We will start with the fundamental principle from which most other loss functions are derived: **Maximum Likelihood Estimation (MLE)**.

$$
\hat{\phi} = \underset{\phi}{\arg\max} \prod_{i=1}^I P(y_i \mid f(x_i, \phi))
$$



It is quite intuitive:

* $f(x_i, \phi)$: This is our model (function) with learnable parameters $\phi$ acting on input $x_i$.
* $P(y_i \mid f(x_i, \phi))$: This is the **likelihood** of observing the true target $y_i$. The model outputs parameters for a distribution (like the mean $\mu$ for a Gaussian or the probability $\lambda$ for a Bernoulli), and we check how probable the true data point $y_i$ is under that distribution.
* **The Goal:** We want to **maximize the product** of these probabilities. If the product is high, it means our model parameters $\phi$ make the observed data "very likely" to happen.

> **Technical Note:**
> While the formula above multiplies the probabilities ($\prod$), in practice, multiplying many small numbers causes floating-point underflow: the product rounds to zero. Therefore, we usually take the **Negative Logarithm** of this formula.
>
> 1.  **Logarithm:** Turns the **Product** into a **Sum**, which is numerically safer and easier to differentiate. Because the logarithm is monotonically increasing, the maximizing $\phi$ does not change.
> 2.  **Negative Sign:** In optimization, we prefer minimizing "Loss" rather than maximizing "Likelihood."
>
> This gives us the **Negative Log-Likelihood (NLL)**:
> $$\hat{\phi} = \underset{\phi}{\arg\min} \left[ -\sum_{i=1}^I \log P(y_i \mid f(x_i, \phi)) \right]$$
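
The underflow problem is easy to reproduce. The sketch below uses made-up per-sample likelihoods (the values and the name `likelihoods` are illustrative assumptions, not outputs of a real model) and compares the direct product with the negative log-likelihood:

```python
import numpy as np

# Hypothetical per-sample likelihoods P(y_i | f(x_i, phi)); illustrative values only.
rng = np.random.default_rng(0)
likelihoods = rng.uniform(0.01, 0.1, size=1000)

# Direct product: 1000 factors below 0.1 fall far under the smallest float64.
product = np.prod(likelihoods)

# Negative log-likelihood: the same information as a well-scaled sum.
nll = -np.sum(np.log(likelihoods))

print(product)  # 0.0
print(nll)      # a finite positive number
```

The product is useless for optimization because every parameter setting gives the same value, while the sum of logs still distinguishes better from worse parameters.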

We won't go through the full mathematical derivation for every function, but each of the losses below follows from the principle above.

The main loss functions we will cover are:
* **Least Squares Loss** (Mean Squared Error) – Derived from Gaussian distribution.
* **Cross-Entropy Loss** – Derived from Bernoulli distribution (Binary classification).
* **Multiclass Cross-Entropy Loss** – Derived from Categorical distribution.
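
As one worked instance of the principle, here is a short sketch of how the first item arises. It assumes a Gaussian likelihood whose mean is the model output and whose variance $\sigma^2$ is a fixed constant (the fixed variance is an assumption made here for illustration):

$$
\begin{aligned}
P(y_i \mid f(x_i, \phi)) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - f(x_i, \phi))^2}{2\sigma^2} \right) \\
-\sum_{i=1}^I \log P(y_i \mid f(x_i, \phi)) &= \sum_{i=1}^I \left[ \frac{(y_i - f(x_i, \phi))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \right] \\
\hat{\phi} &= \underset{\phi}{\arg\min} \sum_{i=1}^I (y_i - f(x_i, \phi))^2
\end{aligned}
$$

The last step drops the additive constant and the positive factor $\frac{1}{2\sigma^2}$, neither of which changes the minimizing $\phi$.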

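1. **Least squares loss**
$$L = \sum_{i=1}^N (y_i - \hat{y}_i)^2$$

Dividing by $N$ gives the **Mean Squared Error (MSE)**. The factor $\frac{1}{N}$ only rescales the loss and does not move the minimum; the code uses this averaged form.

In [2]:
def mse(y_hat: np.ndarray, y_true: np.ndarray) -> float:
  # Mean of the squared differences between predictions and targets
  return np.mean((y_hat - y_true)**2)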

**Derivative of MSE**

The backward pass (gradient computation) determines how the parameters $\phi$ influence the total loss $L[\phi]$, so it starts at the end of the network.

We first compute the derivative of the loss with respect to our prediction function $\hat{f}(x, \phi)$ and then move backward using the **Chain Rule**.



Let's define the Mean Squared Error (MSE) for a batch of $I$ samples. Let $\hat{y}_i = \hat{f}(x_i, \phi)$ be our prediction and $y_i$ be the true value.

$$L = \frac{1}{I} \sum_{i=1}^I (\hat{y}_i - y_i)^2$$

We want to find $\frac{\partial L}{\partial \hat{y}_i}$ (how the loss changes if the prediction for sample $i$ changes). We apply the **Power Rule** together with the Chain Rule ($\frac{d}{dx}u^2 = 2u \cdot u'$):

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{I} \sum_{j=1}^I (\hat{y}_j - y_j)^2 \right]
$$

The terms with $j \neq i$ are constants with respect to $\hat{y}_i$, so their derivatives vanish, leaving only the term for the current sample:

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{1}{I} \cdot 2 \cdot (\hat{y}_i - y_i) \cdot \frac{\partial}{\partial \hat{y}_i}(\hat{y}_i - y_i)
$$

Since the derivative of $(\hat{y}_i - y_i)$ is just $1$:

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{I} (\hat{y}_i - y_i)
$$


In [3]:
def mse_derivative(y_hat, y_true):
  # Gradient of the mean squared error: (2 / I) * (y_hat - y_true)
  return 2 * (y_hat - y_true) / np.size(y_hat)
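
The result $\frac{2}{I}(\hat{y}_i - y_i)$ can be sanity-checked against a numerical gradient. The following is a minimal sketch using central finite differences. The helper names (`mse_mean`, `mse_grad`, `numerical_grad`) and the sample values are hypothetical and chosen only for this check:

```python
import numpy as np

def mse_mean(y_hat, y_true):
    # Mean squared error over the batch, as in the derivation above.
    return np.mean((y_hat - y_true) ** 2)

def mse_grad(y_hat, y_true):
    # Analytic gradient: (2 / I) * (y_hat - y_true).
    return 2.0 * (y_hat - y_true) / y_hat.size

def numerical_grad(loss_fn, y_hat, y_true, eps=1e-6):
    # Central finite differences, perturbing one prediction at a time.
    grad = np.zeros_like(y_hat)
    for i in range(y_hat.size):
        step = np.zeros_like(y_hat)
        step[i] = eps
        grad[i] = (loss_fn(y_hat + step, y_true)
                   - loss_fn(y_hat - step, y_true)) / (2 * eps)
    return grad

y_hat = np.array([0.5, 2.0, -1.0, 3.5])
y_true = np.array([1.0, 2.0, 0.0, 3.0])

analytic = mse_grad(y_hat, y_true)
numeric = numerical_grad(mse_mean, y_hat, y_true)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

A check like this is cheap and catches a missing $\frac{1}{I}$ factor or a flipped sign before the gradient is used in training.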

2. **Binary cross-entropy**
$$L[\phi] = - \sum_i \left[ y_i \log(\text{sig}(f[x_i, \phi])) + (1 - y_i) \log(1 - \text{sig}(f[x_i, \phi])) \right]$$

Here we use $\text{sig}(z)$ (the sigmoid) because we assume a Bernoulli distribution, whose parameter $\lambda$ is a probability. The raw model output is not restricted to the range from 0 to 1, so we pass it through the sigmoid.

In [5]:
def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))

In [None]:
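def binary_cross_entropy(y_true, y_hat):
  # y_hat holds raw model outputs (logits); convert them to probabilities
  p = sigmoid(y_hat)

  # Clip so that log(0) never occurs
  epsilon = 1e-15
  p = np.clip(p, epsilon, 1 - epsilon)

  # Average over the batch (the formula above sums; the mean only rescales it)
  return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))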

**Derivative of binary cross-entropy**

Let’s define our variables:
* **$z$**: The raw input (logit) from the linear layer ($z = wx + b$).
* **$\hat{y}$**: The predicted probability, calculated using the Sigmoid function:
    $$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
* **$y$**: The actual ground truth label ($0$ or $1$).
* **$L$**: The Binary Cross-Entropy Loss function:
    $$L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$$

We want to find the gradient of the Loss with respect to the input $z$.
$$\frac{\partial L}{\partial z} = \text{?}$$

Since $L$ depends on $\hat{y}$, and $\hat{y}$ depends on $z$, we apply the Chain Rule:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$$

We need to calculate these two parts separately.

**Part A: $\frac{\partial L}{\partial \hat{y}}$.** We take the derivative of the BCE formula:

$$L = -y \ln(\hat{y}) - (1 - y) \ln(1 - \hat{y})$$

Using the rule $\frac{d}{dx}\ln(x) = \frac{1}{x}$:

1.  Derivative of first term: $\frac{-y}{\hat{y}}$
2.  Derivative of second term: $-(1-y) \cdot \frac{1}{1-\hat{y}} \cdot (-1)$ (Chain rule applied to $1-\hat{y}$) $\rightarrow \frac{1-y}{1-\hat{y}}$

Combine them:
$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

**Simplify by finding a common denominator:**
$$= \frac{-y(1 - \hat{y}) + \hat{y}(1 - y)}{\hat{y}(1 - \hat{y})}$$
$$= \frac{-y + y\hat{y} + \hat{y} - y\hat{y}}{\hat{y}(1 - \hat{y})}$$
$$= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$

**Part B: $\frac{\partial \hat{y}}{\partial z}$.** This is just the derivative of the sigmoid function, which we derived earlier:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$$

Now we multiply Part A and Part B. Watch the magic happen:

$$
\frac{\partial L}{\partial z} = \underbrace{\left( \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \right)}_{\text{Part A}} \cdot \underbrace{\left( \hat{y}(1 - \hat{y}) \right)}_{\text{Part B}}
$$

The denominator of the loss derivative (Part A) **cancels exactly** against the derivative of the sigmoid function (Part B):

$$
\frac{\partial L}{\partial z} = \hat{y} - y
$$

In [None]:
def binary_cross_entropy_derivative(y_true, y_hat):
  # y_hat holds logits z; per-sample gradient dL/dz = sigmoid(z) - y
  return sigmoid(y_hat) - y_true
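
The cancellation $\frac{\partial L}{\partial z} = \hat{y} - y$ can be confirmed numerically. This is a small sketch with hypothetical helper names (`sigmoid_fn`, `bce_single`) and made-up logits. It differentiates the per-sample loss by central finite differences and compares the result with $\sigma(z) - y$:

```python
import numpy as np

def sigmoid_fn(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_single(z, y):
    # Per-sample loss L = -[y log(y_hat) + (1 - y) log(1 - y_hat)], y_hat = sigmoid(z).
    y_hat = sigmoid_fn(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # illustrative logits
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])     # ground-truth labels

eps = 1e-6
# Each loss depends only on its own logit, so an elementwise shift is enough.
numeric = (bce_single(z + eps, y) - bce_single(z - eps, y)) / (2 * eps)
analytic = sigmoid_fn(z) - y

print(np.allclose(numeric, analytic, atol=1e-6))  # True
```

Note the sign pattern: the gradient is positive when $y = 0$ (push the logit down) and negative when $y = 1$ (push the logit up).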

3. **Categorical cross-entropy**

The **Multiclass Cross-Entropy Loss** (also known as Categorical Cross-Entropy) is the standard loss function used in machine learning for classification tasks with more than two classes.

It quantifies the difference between the **true probability distribution** (typically represented as one-hot encoded labels) and the **predicted probability distribution** output by the model. Conceptually, it encourages the model to assign a high probability to the correct class and suppresses the probabilities of incorrect classes.

For a batch of $N$ samples and $K$ classes, the loss $L$ is defined as:
$$L = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log(p_{i,k})$$
**Where:**
- $N$: The batch size (number of samples).
- $K$: The number of classes.
- $y_{i,k}$: The true label for sample $i$ and class $k$.
    - $y_{i,k} = 1$ if $k$ is the correct class.
    - $y_{i,k} = 0$ otherwise (One-Hot Encoding).
- $p_{i,k}$: The predicted probability that sample $i$ belongs to class $k$.


**The Softmax Connection**

The predicted probability $p_{i,k}$ is usually computed via the **Softmax function** applied to the model's raw logits (scores) $f$:
$$p_{i,k} = \text{softmax}(f_k) = \frac{e^{f_k[x_i, \phi]}}{\sum_{k'=1}^K e^{f_{k'}[x_i, \phi]}}$$

Because the true labels $y_{i,k}$ are **one-hot encoded** (a single $1$, the rest $0$), the inner summation collapses. Writing $y_i$ for the index of the correct class of sample $i$, only the probability of that class remains:
$$L = -\frac{1}{N} \sum_{i=1}^N \log(p_{i, y_i})$$
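
The collapse of the inner sum can be seen directly in code. The sketch below uses a hypothetical probability matrix `p` and hypothetical integer `labels` (all values are made up) and evaluates the loss both ways:

```python
import numpy as np

# Hypothetical predicted probabilities for N = 3 samples and K = 4 classes.
p = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
labels = np.array([0, 1, 3])           # index of the correct class per sample
one_hot = np.eye(p.shape[1])[labels]   # one-hot rows, shape (3, 4)

# Full double sum over samples and classes.
loss_full = -np.mean(np.sum(one_hot * np.log(p), axis=1))

# Collapsed form: keep only the correct-class probability of each sample.
loss_collapsed = -np.mean(np.log(p[np.arange(len(labels)), labels]))

print(np.isclose(loss_full, loss_collapsed))  # True
```

The collapsed form avoids building the one-hot matrix at all, which matters when $K$ is large.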


In practice (e.g., PyTorch's `nn.CrossEntropyLoss`, which takes raw logits), we substitute the Softmax equation directly into the Loss equation. This gives the **Log-Sum-Exp** form, which can be evaluated stably by subtracting the largest logit before exponentiating:

$$L = -\frac{1}{N} \sum_{i=1}^N \left( \underbrace{f_{y_i}[x_i, \phi]}_{\text{Score of Correct Class}} - \underbrace{\log \sum_{k'=1}^K e^{f_{k'}[x_i, \phi]}}_{\text{LogSumExp (Total Energy)}} \right)$$
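
Evaluating the log-sum-exp term literally can still overflow when the logits are large. A common remedy, sketched here with made-up logits, is to subtract the maximum logit before exponentiating and add it back afterwards. This relies on the identity $\log \sum_k e^{f_k} = m + \log \sum_k e^{f_k - m}$ for any constant $m$:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])  # illustrative large scores

# Naive evaluation: exp(1000) overflows float64, so the result is inf.
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(logits)))

# Shifted evaluation: the largest exponent becomes exp(0) = 1.
m = np.max(logits)
stable = m + np.log(np.sum(np.exp(logits - m)))

print(naive)   # inf
print(stable)  # finite, slightly above 1002
```

Both expressions are mathematically identical; only the shifted one stays inside the floating-point range.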

In [None]:
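def categorical_cross_entropy(y_true, y_hat):
  # y_true: integer class indices, shape (n,); y_hat: raw logits, shape (n, K)
  loss = 0.0
  n = len(y_true)

  for true_class_idx, logits in zip(y_true, y_hat):
    # Shift by the max logit so np.exp cannot overflow (log-sum-exp trick)
    max_logit = np.max(logits)
    log_sum_exp = max_logit + np.log(np.sum(np.exp(logits - max_logit)))

    # Log-probability of the correct class
    loss += logits[true_class_idx] - log_sum_exp

  return -(loss / n)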

In [None]:
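def categorical_cross_entropy_derivative(y_true, y_hat):
  # Per-sample gradient w.r.t. the logits: softmax(logits) - one_hot(y_true)
  e = np.exp(y_hat - np.max(y_hat, axis=1, keepdims=True))
  return e / e.sum(axis=1, keepdims=True) - np.eye(np.shape(y_hat)[1])[y_true]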
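
For softmax followed by cross-entropy, the per-sample gradient with respect to the logits is $p_{i,k} - y_{i,k}$, that is, softmax minus one-hot. This is a standard result that is not derived in the text above. The sketch below checks it against finite differences, with hypothetical helpers (`softmax_rows`, `ce_per_sample`) and made-up logits:

```python
import numpy as np

def softmax_rows(logits):
    # Row-wise softmax with the max-shift for numerical stability.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

def ce_per_sample(logits, labels):
    # -log softmax(logits)[label] for each row.
    p = softmax_rows(logits)
    return -np.log(p[np.arange(len(labels)), labels])

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, -1.0]])  # illustrative values
labels = np.array([0, 2])                               # correct class indices

# Analytic gradient: softmax minus one-hot.
analytic = softmax_rows(logits) - np.eye(logits.shape[1])[labels]

# Numerical gradient by central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for k in range(logits.shape[1]):
        step = np.zeros_like(logits)
        step[i, k] = eps
        numeric[i, k] = (ce_per_sample(logits + step, labels)[i]
                         - ce_per_sample(logits - step, labels)[i]) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Each row of the gradient sums to zero, since raising one class probability necessarily lowers the others. To obtain the gradient of the batch-averaged loss, divide by the number of samples.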