# 4.1 Softmax Regression

## Discussions & Exercises

* Regression is the hammer we reach for when we want to answer *how much?* or *how many?* questions.
* Classification problems: focus on *which category?* questions
* The word *classification* is used to describe two different problems:
  * those where we are interested only in hard assignments of examples to categories (classes)
  * those where we wish to make soft assignments, i.e., to assess the probability that each category applies.
* There are cases where more than one label might be true -> Multi-label classification
* In general, classification problems do not come with natural orderings among the classes.
* Statisticians long ago invented a simple way to represent categorical data: the one-hot encoding. A one-hot encoding is a vector with as many components as we have categories. The component corresponding to a particular instance’s category is set to 1 and all other components are set to 0.
* Normalization: transform the values so that they add up to 1 by dividing each by their sum.
* Softmax function:
$$
\hat{\mathbf{y}} = \text{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}
$$
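
The one-hot encoding and the softmax function above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `one_hot` and `softmax` and the example logits are assumptions made here, and subtracting the maximum logit before exponentiating is a common numerical-stability trick that the notes do not mention (it leaves the result unchanged).

```python
import math

def one_hot(index, num_classes):
    """Return a one-hot list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def softmax(logits):
    """Map raw scores (logits) to non-negative values that sum to 1."""
    # Assumption: shift by the max logit to avoid overflow in exp;
    # the shift cancels in the ratio, so the output is unchanged.
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

label = one_hot(1, 3)             # category 1 out of 3 categories
probs = softmax([1.0, 2.0, 3.0])  # hypothetical logits

print(label)                         # [0.0, 1.0, 0.0]
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
```

Larger logits receive larger probabilities, and the ordering of the logits is preserved.
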

### 4.1.2 Loss Function

* Cross-entropy loss is one of the most commonly used losses for classification problems. It measures the number of bits needed to encode what we see, $\mathbf{y}$, relative to what we predict should happen, $\hat{\mathbf{y}}$.
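
As a rough sketch, the cross-entropy loss for a single example can be computed as $-\sum_j y_j \log \hat{y}_j$, where $\mathbf{y}$ is a one-hot label and $\hat{\mathbf{y}}$ is a predicted probability vector. The function name `cross_entropy` and both prediction vectors below are made up for illustration.

```python
import math

def cross_entropy(y, y_hat):
    """Cross-entropy loss -sum_j y_j * log(y_hat_j), measured in nats."""
    # Terms with y_j == 0 contribute nothing, so skip them to avoid log(0).
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)

y = [0.0, 1.0, 0.0]             # one-hot label: the true class is index 1
confident = [0.05, 0.90, 0.05]  # hypothetical prediction, mostly right
unsure = [0.40, 0.30, 0.30]     # hypothetical prediction, mostly wrong

loss_confident = cross_entropy(y, confident)
loss_unsure = cross_entropy(y, unsure)

print(round(loss_confident, 4))  # 0.1054
print(round(loss_unsure, 4))     # 1.204
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the true class, so the prediction that puts more mass on the correct class incurs the smaller loss.
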

### 4.1.3 Information Theory Basics

* The central idea in information theory is to quantify the amount of information contained in data.
* Entropy
$$
H[P] = \sum_j - P(j) \log P(j)
$$
  * To encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats".
  * 1 nat $= \frac{1}{\log 2} \approx$ 1.44 bits
* Easy to predict, easy to compress
* If we cannot perfectly predict every event, then we might be surprised.
  * Our surprise is greater when an event is assigned a lower probability.
  * $\log \frac{1}{P(j)} = -\log P(j)$ quantifies one's *surprisal* at observing an event $j$ after having assigned it a (subjective) probability $P(j)$.
* Entropy is the expected surprisal of someone who knows the true probabilities.
* The cross-entropy from P to Q, denoted H(P,Q), is the expected surprisal of an observer with subjective probabilities Q upon seeing data that was actually generated according to probabilities P
  * The lowest possible cross-entropy is achieved when P = Q.
* We can think of the cross-entropy classification objective in two ways:
  * As maximizing the likelihood of the observed data
  * As minimizing our surprisal (and thus the number of bits) required to communicate the labels.
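
The quantities in this section can be checked numerically with a small sketch. The distributions `p` and `q` are hypothetical, and the helper names (`entropy`, `cross_entropy`, `nats_to_bits`) are chosen here for illustration only. Natural logarithms give results in nats, which are then converted to bits.

```python
import math

def entropy(p):
    """H[P] = sum_j -P(j) log P(j), in nats."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """H(P, Q) = sum_j -P(j) log Q(j): expected surprisal under Q when data follow P."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

def nats_to_bits(nats):
    """Convert nats to bits: 1 nat = 1 / ln(2) bits, about 1.44."""
    return nats / math.log(2)

p = [0.5, 0.25, 0.25]  # hypothetical true distribution
q = [0.25, 0.25, 0.5]  # hypothetical subjective distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)

print(round(nats_to_bits(h_p), 6))   # 1.5
print(round(nats_to_bits(h_pq), 6))  # 1.75
```

The observer with the mismatched beliefs `q` needs 1.75 bits per event on average instead of 1.5, and `cross_entropy(p, p)` equals `entropy(p)`, consistent with the minimum being attained at $P = Q$.
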


### Exercises

Compute the second derivative of the cross-entropy loss $l(y,\hat{y})$ for softmax.

$$l(y,\hat{y}) = \log \sum_{k = 1}^{q}\exp(o_{k}) - \sum_{j = 1}^{q}y_{j}o_{j}$$

$$\partial_{o_{j}}l(y,\hat{y}) = \frac{\exp(o_{j})}{\sum_{k = 1}^{q}\exp(o_{k})} - y_{j} = \hat{y}_{j} - y_{j}$$

Differentiating the gradient once more with respect to $o_j$ (the label $y_j$ is constant, so only the softmax term contributes) and applying the quotient rule:

$$\frac{\partial^{2} l(y,\hat{y})}{\partial o_{j}^{2}} = \frac{\exp(o_{j})\left(\sum_{k = 1}^{q}\exp(o_{k})\right) - \exp(o_{j})\exp(o_{j})}{\left(\sum_{k = 1}^{q}\exp(o_{k})\right)^{2}}$$

$$\frac{\partial^{2} l(y,\hat{y})}{\partial o_{j}^{2}} = \frac{\exp(o_{j})\left[\sum_{k = 1}^{q}\exp(o_{k}) - \exp(o_{j})\right]}{\left(\sum_{k = 1}^{q}\exp(o_{k})\right)^{2}}$$

$$\frac{\partial^{2} l(y,\hat{y})}{\partial o_{j}^{2}} = \frac{\exp(o_{j})}{\sum_{k = 1}^{q}\exp(o_{k})} \left(\frac{\sum_{k = 1}^{q}\exp(o_{k})}{\sum_{k = 1}^{q}\exp(o_{k})} - \frac{\exp(o_{j})}{\sum_{k = 1}^{q}\exp(o_{k})}\right)$$

$$\frac{\partial^{2} l(y,\hat{y})}{\partial o_{j}^{2}} = \hat{y}_{j}(1 - \hat{y}_{j})$$

$$\frac{\partial^{2} l(y,\hat{y})}{\partial o_{j}^{2}} = \text{softmax}(\mathbf{o})_{j}\left(1 - \text{softmax}(\mathbf{o})_{j}\right)$$
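
One way to sanity-check the result is to compare $\hat{y}_j(1 - \hat{y}_j)$ against a central finite-difference estimate of the second derivative of the loss. The sketch below does this in plain Python. The logits, the label, the step size `h`, and the helper names are all assumptions made for this check, and the max-shift inside `softmax` and `loss` is only there for numerical stability.

```python
import math

def softmax(o):
    """Softmax of a list of logits, shifted by the max for stability."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

def loss(o, y):
    """l(y, y_hat) = log sum_k exp(o_k) - sum_j y_j o_j."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(yj * oj for yj, oj in zip(y, o))

def second_derivative_fd(o, y, j, h=1e-4):
    """Central finite-difference estimate of d^2 l / d o_j^2."""
    up = list(o)
    up[j] += h
    down = list(o)
    down[j] -= h
    return (loss(up, y) - 2.0 * loss(o, y) + loss(down, y)) / (h * h)

o = [0.5, -1.0, 2.0]  # hypothetical logits
y = [0.0, 0.0, 1.0]   # one-hot label
y_hat = softmax(o)

for j in range(len(o)):
    analytic = y_hat[j] * (1.0 - y_hat[j])
    numeric = second_derivative_fd(o, y, j)
    print(j, analytic, numeric)
```

The two columns should agree to several decimal places. The label `y` does not affect the second derivative, because it enters the loss only linearly through $\sum_j y_j o_j$.
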
