## Exercise A.1

### Understanding One-Hot Encoding and Cross Entropy in PyTorch

#### What are Logits?

**Logits** are the raw, unnormalized output scores produced by a neural network before applying an activation function like softmax. They are the direct result of the final linear layer in a classification model.

Key characteristics of logits:
- **Unnormalized**: Can be any real number (positive, negative, or zero)
- **Not probabilities**: Don't sum to 1 and aren't bounded between 0 and 1
- **Relative values matter**: Higher logits indicate stronger preference for that class
- **Input to softmax**: Converting logits to probabilities requires the softmax function

**Example:**

In [11]:
import torch

In [25]:
# Logits from a neural network's final layer
logits = torch.tensor([-0.3, -0.5, -0.5])  # Raw scores for 3 classes

# Convert to probabilities using softmax
probabilities = torch.softmax(logits, dim=0)  # [0.38, 0.31, 0.31]

print(f"logits: {logits}")
print(f"probabilities: {probabilities}")

logits: tensor([-0.3000, -0.5000, -0.5000])
probabilities: tensor([0.3792, 0.3104, 0.3104])
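
The point that only relative logit values matter can be checked directly. The sketch below is illustrative and not part of the original exercise; the shift of `10.0` is an arbitrary choice. It adds the same constant to every logit and confirms that softmax returns the same distribution.

```python
import torch

logits = torch.tensor([-0.3, -0.5, -0.5])
shifted = logits + 10.0  # add the same (arbitrary) constant to every class score

probs = torch.softmax(logits, dim=0)
probs_shifted = torch.softmax(shifted, dim=0)

# Softmax depends only on differences between logits, so both results agree
# up to floating-point error.
same = torch.allclose(probs, probs_shifted, atol=1e-5)
print("same distribution:", same)
```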


#### One-Hot Encoding

One-hot encoding transforms a class label $y \in \{0, 1, \dots, k-1\}$ into a vector $\mathbf{y}_{\text{one-hot}} \in \{0, 1\}^k$, where:
$$
\mathbf{y}_{\text{one-hot}}[i] =
\begin{cases}
1 & \text{if } i = y \\
0 & \text{otherwise}
\end{cases}
$$

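With the label in this form, the cross-entropy loss for a single example can be written as:

$$\text{Loss} = - \sum_{i=0}^{k-1} y_i \cdot \log(\hat{y}_i)$$

where $y_i$ is the $i$th entry of the one-hot vector and $\hat{y}_i$ is the predicted probability for class $i$.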

In [30]:
from torch import Tensor


def to_onehot(t: Tensor, num_classes: int):
    """
    Converts a tensor of class indices to a one-hot encoded tensor.
    
    One-hot encoding is a representation where each class index is converted to a binary vector
    with a length equal to the number of classes. The vector has a value of 1.0 at the index
    corresponding to the class and 0.0 at all other positions.
    
    How it works:
    1. Creates a zero tensor of shape (N, num_classes) where N is the number of samples
    2. Uses scatter_ to write 1.0 in place at the column index corresponding to each class
    3. Returns the result as a float tensor (torch.zeros produces float32 by default)
    
    Args:
        t: Tensor of shape (N,) containing class indices (integers from 0 to num_classes-1)
        num_classes: The total number of classes to encode
    
    Returns:
        A one-hot encoded tensor of shape (N, num_classes) where each row is a binary vector
    
    Example:
        >>> y = torch.tensor([0, 1, 2, 2])
        >>> y_onehot = to_onehot(y, 3)
        >>> print(y_onehot)
        tensor([[1., 0., 0.],  # class 0: one-hot encoded as [1, 0, 0]
                [0., 1., 0.],  # class 1: one-hot encoded as [0, 1, 0]
                [0., 0., 1.],  # class 2: one-hot encoded as [0, 0, 1]
                [0., 0., 1.]]) # class 2: one-hot encoded as [0, 0, 1]
        
        In this example:
        - Input has 4 samples with class indices [0, 1, 2, 2]
        - Output is a 4x3 matrix (4 samples, 3 classes)
        - Each row represents one sample's class as a binary vector
        - The position of the 1.0 indicates which class that sample belongs to
    """
    y_onehot = torch.zeros(t.size(0), num_classes)
    # scatter_ modifies y_onehot in place; the tensor is already float32, so no cast is needed
    y_onehot.scatter_(1, t.view(-1, 1).long(), 1.0)
    return y_onehot


y: Tensor = torch.tensor([0, 1, 2, 2])

y_enc = to_onehot(y, 3)

print('one-hot encoding:\n', y_enc)


one-hot encoding:
 tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        [0., 0., 1.]])
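
For comparison, PyTorch ships a built-in `torch.nn.functional.one_hot`. The short sketch below is an addition to the exercise; the variable name `y_builtin` is chosen only for illustration. It produces the same matrix as `to_onehot` above.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([0, 1, 2, 2])

# F.one_hot expects an integer (int64) index tensor and returns an int64 tensor,
# so cast to float if the result is multiplied with log-probabilities later.
y_builtin = F.one_hot(y, num_classes=3).float()
print(y_builtin)
```

Writing the function by hand with `scatter_` remains useful for seeing what the encoding does; the built-in is simply the shorter route in practice.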


#### Softmax

Suppose we have a matrix of net inputs (logits) `Z`, where each row is one training example:

In [28]:
# Z is a tensor containing the raw output scores (logits) from a neural network layer
# before applying the softmax activation function.
#
# Shape: (4, 3) - representing 4 training examples with 3 possible classes each.
# Each row represents one training example's raw scores across all classes
# Each column represents a particular class (class 0, class 1, class 2)
#
# Example interpretation:
# - First sample:  [-0.3, -0.5, -0.5]  → slightly prefers class 0 (least negative)
# - Second sample: [-0.4, -0.1, -0.5]  → prefers class 1 (least negative)
# - Third sample:  [-0.3, -0.94, -0.5] → prefers class 0
# - Fourth sample: [-0.99, -0.88, -0.5] → prefers class 2
#
# These logits will be converted to probabilities using the softmax function below,
# which transforms them into a probability distribution where all values are between 
# 0 and 1 and sum to 1 for each row.
Z: Tensor = torch.tensor([[-0.3, -0.5, -0.5],
                          [-0.4, -0.1, -0.5],
                          [-0.3, -0.94, -0.5],
                          [-0.99, -0.88, -0.5]])

Z


tensor([[-0.3000, -0.5000, -0.5000],
        [-0.4000, -0.1000, -0.5000],
        [-0.3000, -0.9400, -0.5000],
        [-0.9900, -0.8800, -0.5000]])

Next, we convert them to probabilities via softmax:

$$P(y=j \mid z^{(i)}) = \sigma_{\text{softmax}}(z^{(i)})_j = \frac{e^{z_j^{(i)}}}{\sum_{l=0}^{k-1} e^{z_l^{(i)}}},$$

where: 
* $z^{(i)}$ is the $i$th row of the input tensor $Z$
* $z_j^{(i)}$ is the logit for class $j$ in that row
* $j$ is the class label index
* $k$ is the number of classes

In [14]:
def softmax(z: Tensor) -> Tensor:
    """
    Applies the softmax function to convert logits (raw scores) into probabilities.
    
    The softmax function transforms each row of input tensor z into a probability distribution
    where all values are between 0 and 1 and sum to 1. This is commonly used in multi-class
    classification to convert model outputs into class probabilities.
    
    Formula: softmax(z_i) = exp(z_i) / sum(exp(z_j)) for all j
    
    Args:
        z: Input tensor of shape (N, C) where N is batch size and C is the number of classes
    
    Returns:
        Tensor of the same shape as input with probability distributions across each row.
        Each row sums to 1.0 and all values are in range [0, 1].
    
    Example:
        Input: [[-0.3, -0.5, -0.5]]
        Output: [[0.3792, 0.3104, 0.3104]] # probabilities summing to 1.0
    """
    exp_z = torch.exp(z)
    # keepdim=True keeps the row sums as shape (N, 1) so they broadcast across the class columns
    return exp_z / exp_z.sum(dim=1, keepdim=True)


smax = softmax(Z)
print('softmax:\n', smax)


softmax:
 tensor([[0.3792, 0.3104, 0.3104],
        [0.3072, 0.4147, 0.2780],
        [0.4263, 0.2248, 0.3490],
        [0.2668, 0.2978, 0.4354]])
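
The direct formula works for the small logits used here, but `exp` overflows for large inputs. A common remedy is to subtract each row's maximum before exponentiating. The sketch below is an addition to the exercise: the function name `stable_softmax` and the test values `1000.0, 1001.0, 1002.0` are chosen only for illustration.

```python
import torch
from torch import Tensor


def stable_softmax(z: Tensor) -> Tensor:
    # Softmax is unchanged by a per-row shift, so subtract each row's maximum.
    # The largest exponent is then exp(0) = 1 and nothing overflows.
    shifted = z - z.max(dim=1, keepdim=True).values
    exp_z = torch.exp(shifted)
    return exp_z / exp_z.sum(dim=1, keepdim=True)


Z = torch.tensor([[-0.3, -0.5, -0.5],
                  [-0.4, -0.1, -0.5],
                  [-0.3, -0.94, -0.5],
                  [-0.99, -0.88, -0.5]])

probs = stable_softmax(Z)
matches_builtin = torch.allclose(probs, torch.softmax(Z, dim=1))
print("matches torch.softmax:", matches_builtin)

# Large logits: exp(1000) is inf in float32, so the direct formula yields nan.
big = torch.tensor([[1000.0, 1001.0, 1002.0]])
naive = torch.exp(big) / torch.exp(big).sum(dim=1, keepdim=True)
stable = stable_softmax(big)
print("naive:", naive)
print("stable:", stable)
```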


The probabilities can then be converted back to class labels based on the largest probability in each row:

In [15]:
def to_classlabel(z: Tensor) -> Tensor:
    """
    Converts probability distributions or one-hot encoded tensors to class labels.
    
    This function takes a tensor where each row represents either a probability distribution
    (from softmax) or a one-hot encoded vector and returns the index of the maximum value
    in each row, which corresponds to the predicted or actual class label.
    
    Args:
        z: Input tensor of shape (N, C) where N is the number of samples and C is the number
           of classes. Each row should contain either probabilities or one-hot encoded values.
    
    Returns:
        Tensor of shape (N,) containing the class label indices (0 to C-1) for each sample.
    
    Example:
        Input (probabilities from softmax):
        tensor([[0.1, 0.7, 0.2], # the highest probability at index 1
                [0.6, 0.3, 0.1]]) # the highest probability at index 0
        
        Output:
        tensor([1, 0])  # class labels
        
        Input (one-hot encoded):
        tensor([[1., 0., 0.],    # one-hot encoded class 0
                [0., 0., 1.]])   # one-hot encoded class 2
        
        Output:
        tensor([0, 2]) # class labels
    """
    return torch.argmax(z, dim=1)


print('predicted class labels: ', to_classlabel(smax))
print('true class labels: ', to_classlabel(y_enc))


predicted class labels:  tensor([0, 1, 0, 2])
true class labels:  tensor([0, 1, 2, 2])
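
Comparing the two label vectors gives the classification accuracy. A minimal sketch, with the label tensors copied from the output above so the cell runs on its own:

```python
import torch

predicted = torch.tensor([0, 1, 0, 2])  # argmax of each softmax row
true = torch.tensor([0, 1, 2, 2])       # original class labels

correct = predicted == true             # element-wise comparison
accuracy = correct.float().mean().item()
print(f"accuracy: {accuracy:.2f}")      # 3 of the 4 samples are classified correctly
```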


## Cross Entropy

Cross-entropy is a loss function that measures the difference between two probability distributions: the true distribution (actual class labels) and the predicted distribution (model's output probabilities). It is widely used in classification tasks because it effectively quantifies how well the model's predictions match the true labels.

### Mathematical Formulation

Averaged over the training set, the cross-entropy loss is:

$$\mathcal{L}(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} H(T_i, O_i),$$

where: 
* $T_i$ is the true class label (one-hot encoded) for the $i$th training example
* $O_i$ is the predicted probability distribution (from softmax) for the $i$th training example
* $n$ is the number of training examples 
* $H$ is the cross-entropy function

The cross-entropy function for a single example is:

$$H(T_i, O_i) = -\sum_m T_{i,m} \cdot \log(O_{i,m}),$$

where $m$ is the class label index ranging over all classes.
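
Because $T_i$ is one-hot, every term of the sum vanishes except the one for the true class, so $H(T_i, O_i)$ reduces to the negative log-probability of the true class. The sketch below is an illustration, not part of the original exercise; the names `full_sum` and `picked` are chosen for it. It checks that the two forms agree on the example data.

```python
import torch

Z = torch.tensor([[-0.3, -0.5, -0.5],
                  [-0.4, -0.1, -0.5],
                  [-0.3, -0.94, -0.5],
                  [-0.99, -0.88, -0.5]])
y = torch.tensor([0, 1, 2, 2])

O = torch.softmax(Z, dim=1)          # predicted distributions, one per row
T = torch.zeros_like(O)
T[torch.arange(len(y)), y] = 1.0     # one-hot targets

# Full sum over classes, as in the formula for H
full_sum = -(T * torch.log(O)).sum(dim=1)

# Shortcut: pick out the probability of the true class and take -log
picked = -torch.log(O.gather(1, y.view(-1, 1)).squeeze(1))

print(full_sum)
print(picked)
```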

### Why Cross-Entropy is Used

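1. **Probabilistic Interpretation**: Cross-entropy measures how far the predicted distribution is from the true one. It is not a symmetric distance, but it is smallest exactly when the two distributions match, which makes it a natural fit for classification where we predict class probabilities.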

2. **Penalizes Confident Wrong Predictions**: The logarithm heavily penalizes predictions that are confidently wrong. For example, if the true class has a predicted probability of 0.01, the loss is much higher than if it were 0.5.

3. **Smooth Gradients**: Unlike accuracy (which is discrete), cross-entropy provides smooth gradients that enable effective gradient-based optimization during training.

4. **Works Well with Softmax**: When combined with softmax activation, the gradient of the loss with respect to the logits simplifies to the predicted probabilities minus the one-hot target, which keeps backpropagation simple.

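5. **Encourages Calibrated Probabilities**: The loss depends on the probability assigned to the true class, not only on which class has the highest score, so the model is pushed toward meaningful probability estimates rather than just correct class predictions.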
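
Two of these points can be checked numerically. The sketch below is an illustration under assumptions: the probabilities `0.5, 0.1, 0.01` are arbitrary, and `reduction='sum'` is used so the gradient carries no `1/n` factor. It first shows how the per-example loss $-\log p$ grows as the true-class probability $p$ shrinks, then uses autograd to compare the gradient with respect to the logits against softmax minus the one-hot target.

```python
import torch
import torch.nn.functional as F

# Point 2: loss for one example is -log(p), with p the true-class probability.
p = torch.tensor([0.5, 0.1, 0.01])
penalty = -torch.log(p)
print("penalty:", penalty)  # increases sharply as p approaches 0

# Points 3 and 4: gradient of the summed loss with respect to the logits.
Z = torch.tensor([[-0.3, -0.5, -0.5],
                  [-0.4, -0.1, -0.5],
                  [-0.3, -0.94, -0.5],
                  [-0.99, -0.88, -0.5]], requires_grad=True)
y = torch.tensor([0, 1, 2, 2])

loss = F.cross_entropy(Z, y, reduction='sum')
loss.backward()

expected_grad = torch.softmax(Z.detach(), dim=1) - F.one_hot(y, num_classes=3).float()
grad_matches = torch.allclose(Z.grad, expected_grad, atol=1e-6)
print("gradient equals softmax - one_hot:", grad_matches)
```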


In [None]:
def cross_entropy(softmax: Tensor, y_target: Tensor) -> torch.Tensor:
    """
    Computes the cross-entropy loss between predicted probabilities and target labels.
    
    Cross-entropy is a loss function commonly used in multi-class classification tasks.
    It measures the difference between the predicted probability distribution (from softmax)
    and the true distribution (one-hot encoded target labels).
    
    Formula: H(T, O) = -sum(T * log(O))
    where:
    - T is the true class label (one-hot encoded)
    - O is the predicted probability distribution (from softmax)
    
    The function computes the negative sum of the element-wise product of the true labels
    and the log of the predicted probabilities. Lower cross-entropy values indicate better predictions.
    
    Args:
        softmax: Tensor of shape (N, C) containing predicted probabilities for each class,
                 where N is the number of samples and C is the number of classes.
                 Values should be in range [0, 1] and each row should sum to 1.
        y_target: Tensor of shape (N, C) containing one-hot encoded true class labels.
                  Each row has a 1.0 at the true class index and 0.0 elsewhere.
    
    Returns:
        Tensor of shape (N,) containing the cross-entropy loss for each sample.
    
    Example:
        >>> # Predicted probabilities (after softmax)
        >>> softmax = torch.tensor([[0.3792, 0.3104, 0.3104],  # predicts class 0
        ...                         [0.3072, 0.4147, 0.2780],  # predicts class 1
        ...                         [0.4263, 0.2248, 0.3490],  # predicts class 0
        ...                         [0.2668, 0.2978, 0.4354]]) # predicts class 2
        >>> 
        >>> # True labels (one-hot encoded)
        >>> y_target = torch.tensor([[1., 0., 0.],  # true class 0
        ...                          [0., 1., 0.],  # true class 1
        ...                          [0., 0., 1.],  # true class 2
        ...                          [0., 0., 1.]]) # true class 2
        >>> 
        >>> loss = cross_entropy(softmax, y_target)
        >>> print(loss)  # approximate values, since the probabilities above are rounded
        tensor([0.9698, 0.8801, 1.0527, 0.8314])
        
        In this example:
        - First sample: correct prediction (class 0), loss 0.9698
        - Second sample: correct prediction (class 1), loss 0.8801
        - Third sample: incorrect prediction (predicted 0, true 2), highest loss (1.0527)
        - Fourth sample: correct prediction (class 2), lowest loss (0.8314)
    """
    return - torch.sum(torch.log(softmax) * y_target, dim=1)


xent = cross_entropy(smax, y_enc)
print('Cross Entropy:', xent)


## In PyTorch

In [17]:
import torch.nn.functional as F

Note that `nll_loss` takes log-probabilities, i.e. log(softmax), together with integer class labels (not one-hot vectors) as input:

In [18]:
F.nll_loss(torch.log(smax), y, reduction='none')

tensor([0.9698, 0.8801, 1.0527, 0.8314])

Note that `cross_entropy` takes logits as input and applies log-softmax internally:

In [19]:
F.cross_entropy(Z, y, reduction='none')

tensor([0.9698, 0.8801, 1.0527, 0.8314])
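
When feeding `nll_loss` by hand, `F.log_softmax` is usually preferred over `torch.log(torch.softmax(...))` because it computes the log-probabilities in one numerically stable step. The sketch below is an addition to the exercise; the extreme logits `[0.0, 200.0]` are an assumed example chosen to make the true-class probability underflow to zero in float32.

```python
import torch
import torch.nn.functional as F

Z = torch.tensor([[-0.3, -0.5, -0.5],
                  [-0.4, -0.1, -0.5],
                  [-0.3, -0.94, -0.5],
                  [-0.99, -0.88, -0.5]])
y = torch.tensor([0, 1, 2, 2])

# On well-behaved logits both routes give the same per-sample losses.
via_nll = F.nll_loss(F.log_softmax(Z, dim=1), y, reduction='none')
via_ce = F.cross_entropy(Z, y, reduction='none')
equal = torch.allclose(via_nll, via_ce)
print("log_softmax + nll_loss == cross_entropy:", equal)

# Extreme logits: softmax gives exactly 0 for class 0 in float32, so log(0) = -inf.
big = torch.tensor([[0.0, 200.0]])
target = torch.tensor([0])
naive = F.nll_loss(torch.log(torch.softmax(big, dim=1)), target)
stable = F.nll_loss(F.log_softmax(big, dim=1), target)
print("naive:", naive)    # infinite loss
print("stable:", stable)  # finite loss of about 200
```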

### Defaults
By default (`reduction='mean'`), `nll_loss` and `cross_entropy` return the average over the training examples in the batch, which keeps the scale of the loss independent of the batch size.

In [21]:
F.cross_entropy(Z, y)

tensor(0.9335)

In [22]:
torch.mean(cross_entropy(smax, y_enc))

tensor(0.9335)
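
The `reduction` argument selects how the per-sample losses are combined. A short sketch comparing the three modes on the same data (variable names are chosen for illustration):

```python
import torch
import torch.nn.functional as F

Z = torch.tensor([[-0.3, -0.5, -0.5],
                  [-0.4, -0.1, -0.5],
                  [-0.3, -0.94, -0.5],
                  [-0.99, -0.88, -0.5]])
y = torch.tensor([0, 1, 2, 2])

per_sample = F.cross_entropy(Z, y, reduction='none')  # shape (4,), one loss per example
mean_loss = F.cross_entropy(Z, y, reduction='mean')   # default: scalar average
sum_loss = F.cross_entropy(Z, y, reduction='sum')     # scalar total

print(per_sample)
print(mean_loss, sum_loss)
```

With `'sum'`, the loss (and therefore the gradient) scales with the batch size, so the learning rate may need adjusting when the batch size changes; `'mean'` avoids that coupling.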