
# 📌 Summary: Softmax Regression

## 🔷 Motivation

Linear regression answers “how much?” (continuous outputs), while **classification** answers “which category?” (discrete outputs).

Examples:
- Is this email spam?
- Is the animal a cat, chicken, or dog?

Some problems allow soft assignments (probabilities for each class), others require hard decisions (one category). In multi-label classification, more than one class can apply.

---

## 🔷 Problem Setup

A sample problem:
- Inputs: $ 2 \times 2 $ grayscale image → 4 features: $ x_1, x_2, x_3, x_4 $
- Outputs: 3 classes (e.g., cat, chicken, dog)

**Label encoding options:**
- **Integer class** (e.g., 1 = cat, 2 = chicken, 3 = dog) – imposes an ordering on the classes, so it is appropriate only when the categories have a natural order.
- **One-hot encoding** – vector of 0s and one 1, e.g.,  
  - cat → $ (1, 0, 0) $  
  - chicken → $ (0, 1, 0) $  
  - dog → $ (0, 0, 1) $
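
As a minimal sketch of the one-hot encoding above, the helper below maps a class name to its vector. The function name `one_hot` and the class order in `classes` are illustrative choices, not something the notes prescribe.

```python
def one_hot(label, classes):
    """Return a one-hot list for `label`, given the ordered list `classes`."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]

# Assumed class order, matching the example in the text.
classes = ["cat", "chicken", "dog"]
encoded = {name: one_hot(name, classes) for name in classes}
print(encoded)
```

Unlike the integer encoding, no class ends up numerically "closer" to another: every pair of one-hot vectors is equally far apart.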

---

## 🔷 Linear Model

For input vector $ \mathbf{x} \in \mathbb{R}^4 $, output scores (logits) $ \mathbf{o} \in \mathbb{R}^3 $ are:

$$
\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}
$$

Where:
- $ \mathbf{W} \in \mathbb{R}^{3 \times 4} $: weights
- $ \mathbf{b} \in \mathbb{R}^{3} $: biases

Each output $ o_i $ is an affine function of inputs. This layer is fully connected.

$W_{ij}$ is the weight from input feature $ j $ to output class $ i $. The bias $ b_i $ shifts the output.
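
The affine map can be sketched in plain Python as follows. The weights, biases, and pixel values are made up purely for illustration, and `affine` is a hypothetical helper name.

```python
def affine(W, x, b):
    """Compute o = W x + b, where W is a list of rows (one row per class)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# Illustrative values only: 3 classes, 4 pixel features.
W = [
    [0.1, 0.2, 0.3, 0.4],     # weights feeding o_1
    [0.0, -0.1, 0.0, 0.1],    # weights feeding o_2
    [0.5, 0.5, -0.5, -0.5],   # weights feeding o_3
]
b = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]

o = affine(W, x, b)
print(o)  # one logit per class
```

Each logit depends on all four inputs, which is what "fully connected" means here.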

---

## 🔷 Softmax Function

To convert output scores $ \mathbf{o} $ to valid probabilities $ \hat{\mathbf{y}} $, use:

$$
\hat{y}_i = \frac{e^{o_i}}{\sum_j e^{o_j}} \quad \text{for each class } i
$$

This ensures:
- Probabilities are positive
- Sum to 1

Softmax preserves the ordering of the logits: the largest $ o_i $ corresponds to the highest $ \hat{y}_i $. So:

$$
\arg\max_j \hat{y}_j = \arg\max_j o_j
$$

The form is inspired by the Boltzmann distribution in physics, where the probability of a state with energy $ E $ is proportional to $ e^{-E/kT} $.
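
A direct transcription of the softmax formula, as a sketch with made-up logits, lets us check the properties listed above (positive outputs, sum of 1, unchanged argmax).

```python
import math

def softmax(o):
    """Map a list of logits to probabilities: exp(o_i) / sum_j exp(o_j)."""
    exps = [math.exp(o_i) for o_i in o]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.2, -3.0]  # illustrative values
y_hat = softmax(logits)

print(y_hat)
print(sum(y_hat))  # 1 up to floating-point rounding
# The index of the largest logit and of the largest probability coincide.
print(logits.index(max(logits)) == y_hat.index(max(y_hat)))
```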

---

## 🔷 Vectorized Computation

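For a minibatch input $ \mathbf{X} \in \mathbb{R}^{n \times d} $ ($n$ examples, $d$ features) and $q$ classes, the weights are stored as $ \mathbf{W} \in \mathbb{R}^{d \times q} $ (the transpose of the single-example layout above) and the bias $ \mathbf{b} \in \mathbb{R}^{1 \times q} $ is broadcast across rows: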

$$
\begin{aligned}
\mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b} \\
\hat{\mathbf{Y}} &= \text{softmax}(\mathbf{O}) \quad \text{(row-wise)}
\end{aligned}
$$

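This is efficient and avoids explicit loops over examples. Computing $ e^{o_j} $ directly can overflow for large logits, so library implementations typically compute softmax in a numerically stable way.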
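
One common way to obtain numerical stability is to subtract the row maximum before exponentiating, which leaves the result unchanged because softmax is invariant to adding a constant to all logits. The sketch below uses plain Python lists to stay dependency-free; it is an illustration of the idea, not a claim about how any particular library implements it, and `stable_softmax` and `softmax_rows` are hypothetical names.

```python
import math

def stable_softmax(o):
    """Softmax with the max subtracted first, so every exponent is <= 0."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(O):
    """Apply softmax independently to each row of a minibatch of logits."""
    return [stable_softmax(row) for row in O]

# Two rows that differ only by a constant shift of 999.
# A naive math.exp(1000.0) would raise OverflowError.
O = [
    [1.0, 2.0, 3.0],
    [1000.0, 1001.0, 1002.0],
]
Y_hat = softmax_rows(O)
for row in Y_hat:
    print(row)
```

Both rows give the same probabilities, since only the differences between logits matter.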

---

## 🔷 Loss Function: Cross-Entropy

Training maximizes the probability $P(\mathbf{Y} \mid \mathbf{X})$ of the observed labels given the inputs. Assuming the examples are independent, this likelihood factorizes as

$$
P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
$$

Taking the logarithm turns the product into a sum, and maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which serves as the loss function.

Given predictions $ \hat{\mathbf{y}} $ and one-hot labels $ \mathbf{y} $ over $ q $ classes, the loss for a single example is:

$$
l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j
$$

Substituting the softmax definition of $ \hat{y}_j $ and using $ \sum_j y_j = 1 $, this becomes:

$$
l(\mathbf{y}, \hat{\mathbf{y}}) = \log \sum_k e^{o_k} - \sum_j y_j o_j
$$
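
Written out step by step, and assuming only that the label entries sum to 1 (true for one-hot labels), the logit form follows from the softmax definition:

$$
\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j=1}^{q} y_j \log \frac{e^{o_j}}{\sum_{k=1}^{q} e^{o_k}} \\
&= \sum_{j=1}^{q} y_j \log \sum_{k=1}^{q} e^{o_k} - \sum_{j=1}^{q} y_j o_j \\
&= \log \sum_{k=1}^{q} e^{o_k} - \sum_{j=1}^{q} y_j o_j
\end{aligned}
$$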

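The loss is 0 only if the true class is predicted with probability 1. Finite logits can approach this but never reach it exactly, so the loss stays positive.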
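
The two forms of the loss can be compared numerically. This is a sketch with made-up logits; the function names are illustrative, and the logit form uses the max-subtraction trick for the log-sum-exp term, which is an added assumption rather than part of the formulas above.

```python
import math

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """Probability form: -sum_j y_j * log(y_hat_j), skipping terms with y_j = 0."""
    return -sum(y_j * math.log(p) for y_j, p in zip(y, y_hat) if y_j > 0)

def cross_entropy_from_logits(y, o):
    """Logit form: log(sum_k exp(o_k)) - sum_j y_j * o_j."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

y = [1, 0, 0]          # one-hot label: the first class is correct
o = [3.0, 1.2, -3.0]   # illustrative logits

loss_probs = cross_entropy(y, softmax(o))
loss_logits = cross_entropy_from_logits(y, o)
print(loss_probs, loss_logits)  # the two values agree up to rounding

# A very confident, correct prediction: the loss is tiny but still positive.
confident_loss = cross_entropy_from_logits(y, [20.0, 0.0, 0.0])
print(confident_loss)
```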

---

## 🔷 Gradient of the Loss

Derivative of loss with respect to logit $ o_j $:

$$
\frac{\partial l}{\partial o_j} = \hat{y}_j - y_j
$$

Same idea as in regression: prediction minus ground truth. Works even when labels are probabilities (not just one-hot vectors).

This makes gradient descent easy to implement.
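
As a sanity check on the gradient formula, the sketch below compares $ \hat{y}_j - y_j $ with a central finite-difference estimate. The logits and the soft label are made up; the soft label is used to illustrate that the formula does not need one-hot targets, only label entries that sum to 1.

```python
import math

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(y, o):
    """Cross-entropy in logit form: log(sum_k exp(o_k)) - sum_j y_j * o_j."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

def analytic_grad(y, o):
    """Gradient with respect to the logits: softmax(o) - y."""
    return [p - t for p, t in zip(softmax(o), y)]

def numeric_grad(y, o, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for j in range(len(o)):
        up = o[:j] + [o[j] + eps] + o[j + 1:]
        down = o[:j] + [o[j] - eps] + o[j + 1:]
        grad.append((loss(y, up) - loss(y, down)) / (2 * eps))
    return grad

y = [0.7, 0.2, 0.1]    # soft label (entries sum to 1), illustrative
o = [3.0, 1.2, -3.0]   # illustrative logits

g_analytic = analytic_grad(y, o)
g_numeric = numeric_grad(y, o)
max_diff = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(g_analytic)
print(max_diff)  # should be very small
```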

---

## 🔷 Why It’s Called Cross-Entropy

Cross-entropy quantifies the expected number of nats (or bits, with base-2 logarithms) needed to encode data drawn from the true distribution $ \mathbf{y} $ using a code optimized for the predicted distribution $ \hat{\mathbf{y}} $.

### Entropy

The central idea in information theory is to quantify the
amount of information contained in data.
This places a limit on our ability to compress data.
For a distribution $P$ its *entropy*, $H[P]$, is defined as:

$$H[P] = \sum_j - P(j) \log P(j).$$

One of the fundamental theorems of information theory states
that in order to encode data drawn randomly from the distribution $P$,
we need at least $H[P]$ "nats" to encode it (Shannon, 1948).
If you wonder what a "nat" is, it is the equivalent of a bit
but when using a code with base $e$ rather than one with base 2.
Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.
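
A small sketch of the entropy formula and the nat-to-bit conversion, using a fair coin as an illustrative distribution. Terms with zero probability are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math

def entropy_nats(p):
    """H[P] = sum_j -P(j) * log P(j), using the natural logarithm."""
    return -sum(p_j * math.log(p_j) for p_j in p if p_j > 0)

def nats_to_bits(nats):
    """Convert nats to bits: one nat equals 1 / log(2) bits."""
    return nats / math.log(2)

fair_coin = [0.5, 0.5]
h = entropy_nats(fair_coin)
print(h)                 # log(2), about 0.693 nats
print(nats_to_bits(h))   # about 1 bit

# A stream that always emits the same token carries no information.
constant = [1.0, 0.0]
print(entropy_nats(constant))
```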


You might be wondering what compression has to do with prediction. Imagine that we have a stream of data that we want to compress. If it is always easy for us to predict the next token, then this data is easy to compress. Take the extreme example where every token in the stream always takes the same value. That is a very boring data stream! And not only is it boring, it is also easy to predict. Because the tokens are always the same, we do not have to transmit any information to communicate the contents of the stream. Easy to predict, easy to compress.

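However, if we cannot perfectly predict every event, then we might sometimes be surprised. The **surprisal** of an event $j$ is defined as $-\log P(j)$, where $P(j)$ is the probability of the event: the less likely the event, the greater the surprise. Entropy is the expected surprisal under $P$.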
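
To make surprisal concrete, the sketch below evaluates $-\log P(j)$ for a few made-up probabilities and checks that averaging surprisal under the distribution recovers the entropy. The function names are illustrative.

```python
import math

def surprisal(p_event):
    """Surprisal -log P(j) in nats for an event with probability p_event > 0."""
    return -math.log(p_event)

def expected_surprisal(p):
    """Average surprisal under the distribution p, i.e. its entropy."""
    return sum(p_j * surprisal(p_j) for p_j in p if p_j > 0)

certain = surprisal(1.0)    # a sure event is not surprising at all
common = surprisal(0.5)
rare = surprisal(0.01)      # rarer events are more surprising
print(certain, common, rare)

p = [0.5, 0.25, 0.25]       # illustrative distribution
print(expected_surprisal(p))
```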


The cross-entropy $H(p, q)$ measures how well a probability distribution $q$ approximates a true distribution $p$. For discrete distributions:

$$H(p, q) = -\sum_j p(j) \log q(j).$$

If $p$ is one-hot (a single true class), this reduces to $-\log q(\text{true class})$. In softmax regression, $p$ is the true label distribution $ \mathbf{y} $ and $q$ is the predicted probabilities $ \hat{\mathbf{y}} $.
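
The cross-entropy definition above can be sketched in a few lines. The distributions `p`, `q`, and `p_soft` are made-up examples; the last comparison illustrates the standard fact that $H(p, q) \geq H(p, p)$, with equality when $q = p$.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_j p(j) * log q(j), skipping terms where p(j) = 0."""
    return -sum(p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)

# One-hot true distribution: the second class is correct.
p = [0.0, 1.0, 0.0]
q = [0.2, 0.7, 0.1]          # illustrative predicted probabilities
h_pq = cross_entropy(p, q)
print(h_pq, -math.log(0.7))  # same value: only the true class contributes

# With a soft true distribution, predicting q instead of p costs extra nats.
p_soft = [0.5, 0.3, 0.2]
h_mismatch = cross_entropy(p_soft, q)
h_match = cross_entropy(p_soft, p_soft)  # equals the entropy of p_soft
print(h_mismatch, h_match)
```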