# 3.4. Softmax Regression
Regression is the hammer we reach for when we want to answer “how much?” or “how many?” questions.

Classification is the hammer we reach for when we want to answer “which one” questions.

## 3.4.1. Classification Problem
Let us start off with a simple image classification problem. Here, each input consists of a $2\times 2$ grayscale image. We can represent each pixel value with a single scalar, giving us four features $x_1, x_2, x_3, x_4$. Further, let us assume that each image belongs to one among the categories “cat”, “chicken”, and “dog”.

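Next, we have to choose how to represent the labels.
We have two obvious choices.
Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$,
where the integers represent $\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively.
However, integer labels imply an ordering among the classes, and general classification problems do not come with a natural ordering.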

Fortunately, statisticians long ago invented a simple way to represent categorical data: the one-hot encoding. A one-hot encoding is a vector with as many components as we have categories. 

The component corresponding to a particular instance's category is set to 1
and all other components are set to 0.
In our case, a label $y$ would be a three-dimensional vector,
with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken",
and $(0, 0, 1)$ to "dog":

$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
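The encoding above can be sketched in a few lines of NumPy. The helper name `one_hot` and the `CLASSES` list are hypothetical, introduced only for illustration; the class order follows the one-hot convention in the text (cat, chicken, dog).

```python
import numpy as np

# Assumed class order, matching the one-hot vectors in the text.
CLASSES = ["cat", "chicken", "dog"]

def one_hot(labels, classes=CLASSES):
    """Return an (n, q) array with a single 1 per row marking each label's class."""
    index = {name: i for i, name in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for row, name in enumerate(labels):
        encoded[row, index[name]] = 1.0
    return encoded

# Three example labels (made up for illustration).
Y = one_hot(["cat", "dog", "chicken"])
print(Y)
```

Each row contains exactly one nonzero entry, so no class is treated as numerically "larger" than another.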


## 3.4.2. Network Architecture

In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class.

In our case, since we have 4 features and 3 possible output categories,
we will need 12 scalars to represent the weights ($w$ with subscripts),
and 3 scalars to represent the biases ($b$ with subscripts).
We compute these three *logits*, $o_1, o_2$, and $o_3$, for each input:

$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}
$$

We can depict this calculation with the neural network diagram:

![Softmax regression is a single-layer neural network.](http://d2l.ai/_images/softmaxreg.svg)

Since the calculation of each output, $o_1, o_2$, and $o_3$, depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer of softmax regression can also be described as a fully-connected layer.
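As a rough sketch, the three logits can be computed with a single matrix-vector product. The pixel values and the random weight initialization below are made up for illustration; the weight matrix is stored so that `W[i, j]` plays the role of $w_{ij}$ in the equations above (row $i$ produces logit $o_i$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Four pixel values of one 2x2 image (illustrative numbers, not real data).
x = np.array([0.1, 0.5, 0.2, 0.9])

# 12 weights and 3 biases: W[i, j] corresponds to w_{ij}, b[i] to b_i.
W = rng.normal(scale=0.01, size=(3, 4))
b = np.zeros(3)

# All three logits at once: o_i = sum_j x_j * w_{ij} + b_i.
o = W @ x + b
print(o)
```

Every entry of `o` depends on every entry of `x`, which is what makes the layer fully connected.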


## 3.4.3. Parameterization Cost of Fully-Connected Layers


As the name suggests,
fully-connected layers are *fully* connected
with potentially many learnable parameters.
Specifically,
for any fully-connected layer
with $d$ inputs and $q$ outputs,
the parameterization cost is $\mathcal{O}(dq)$,
which can be prohibitively high in practice.
Fortunately,
this cost 
of transforming $d$ inputs into $q$ outputs
can be reduced to $\mathcal{O}(\frac{dq}{n})$,
where the hyperparameter $n$ can be flexibly specified
by us to balance between parameter saving and model effectiveness in real-world applications `Zhang.Tay.Zhang.ea.2021`.

## 3.4.4. Softmax Operation
The main approach that we are going to take here is to interpret the outputs of our model as probabilities. We will optimize our parameters to produce probabilities that maximize the likelihood of the observed data. Then, to generate predictions, we will apply a decision rule, for example, choosing the label with the maximum predicted probability.

For example, if $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$
are 0.1, 0.8, and 0.1, respectively,
then we predict category 2, which (in our example) represents "chicken".

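These probabilities come from applying the softmax operation to the logits, $\hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, which exponentiates each logit and normalizes by the sum.
It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$
with $0 \leq \hat{y}_j \leq 1$ for all $j$.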
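The softmax operation can be sketched in NumPy as follows. This is a minimal illustration rather than a library implementation: the function name `softmax`, the example logits, and the max-subtraction step (a common safeguard against overflow that leaves the result unchanged) are assumptions added here.

```python
import numpy as np

def softmax(o):
    """Exponentiate the logits and normalize so that the outputs sum to 1."""
    o = np.asarray(o, dtype=np.float64)
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = o - o.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Illustrative logits; the second class has the largest logit.
logits = np.array([1.0, 3.0, 1.0])
y_hat = softmax(logits)

print(y_hat, y_hat.sum())
print("predicted class index:", int(y_hat.argmax()))
```

Because the exponential is monotonic, the class with the largest logit also receives the largest probability, so the predicted class can be read off the logits directly.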

Although softmax is a nonlinear function,
the outputs of softmax regression are still *determined* by
an affine transformation of input features;
thus, softmax regression is a linear model.

## 3.4.5. Vectorization for Minibatches

To improve computational efficiency and take advantage of GPUs, we typically carry out vector calculations for minibatches of data. Assume that we are given a minibatch $\mathbf{X}$ of examples
with feature dimensionality (number of inputs) $d$ and batch size $n$.
Moreover, assume that we have $q$ categories in the output.
Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$,
weights $\mathbf{W} \in \mathbb{R}^{d \times q}$,
and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$.
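With these shapes, the whole minibatch can be processed as $\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}$ followed by a row-wise softmax, $\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O})$. The following sketch checks the shapes with NumPy; the sizes `n`, `d`, `q` and the random data are arbitrary choices for illustration.

```python
import numpy as np

n, d, q = 5, 4, 3  # batch size, number of features, number of classes (illustrative)
rng = np.random.default_rng(0)

X = rng.random((n, d))                   # minibatch features, shape (n, d)
W = rng.normal(scale=0.01, size=(d, q))  # weights, shape (d, q)
b = np.zeros((1, q))                     # bias, shape (1, q)

# One matrix-matrix product computes all logits; b is broadcast across the n rows.
O = X @ W + b

# Row-wise softmax: each row of Y_hat is a probability distribution over q classes.
O_shifted = O - O.max(axis=1, keepdims=True)  # for numerical stability
exp_O = np.exp(O_shifted)
Y_hat = exp_O / exp_O.sum(axis=1, keepdims=True)

print(O.shape, Y_hat.shape)
```

The single product `X @ W` replaces a Python loop over examples, which is where the efficiency gain on vectorized hardware comes from.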


## 3.4.6. Loss Function
We need a loss function to measure the quality of our predicted probabilities.
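One common choice, and the one the later cross-entropy discussion points to, is the cross-entropy loss $l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j$ with one-hot labels $\mathbf{y}$. The sketch below assumes that form; the function name `cross_entropy_loss`, the clipping constant, and the example numbers are illustrative additions.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Per-example loss -sum_j y_j * log(y_hat_j) for one-hot labels y."""
    eps = 1e-12  # assumed small constant to avoid log(0)
    return -(y * np.log(np.clip(y_hat, eps, 1.0))).sum(axis=1)

# Two examples with one-hot labels (class 1 and class 0).
y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])

# Predicted probabilities: confident and correct, then hesitant and wrong.
y_hat = np.array([[0.1, 0.8, 0.1],
                  [0.3, 0.2, 0.5]])

losses = cross_entropy_loss(y_hat, y)
print(losses)
```

With one-hot labels only the probability assigned to the true class contributes, so the loss shrinks toward 0 as that probability approaches 1.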

## 3.4.7. Information Theory Basics
Information theory deals with the problem of encoding, decoding, transmitting, and manipulating information (also known as data) in as concise a form as possible.

### 3.4.7.1. Entropy
The central idea in information theory is to quantify the information content in data. 

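### 3.4.7.2. Surprisal
If we cannot perfectly predict every event, then we might sometimes be surprised. We quantify the surprisal at observing an event $j$ to which we assigned probability $P(j)$ as $-\log P(j)$: the less likely the event, the greater the surprise. The entropy is then the expected surprisal when one has assigned the correct probabilities that truly match the data-generating process.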
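To make this concrete, here is a small sketch that measures surprisal and entropy in bits (base-2 logarithms), taking the entropy to be $H[P] = \sum_j -P(j) \log P(j)$. The function names and the two coin distributions are hypothetical examples.

```python
import numpy as np

def surprisal(p):
    """Surprisal, in bits, of an event that was assigned probability p."""
    return -np.log2(p)

def entropy(P):
    """Expected surprisal, in bits, under the distribution P itself."""
    P = np.asarray(P, dtype=float)
    nonzero = P[P > 0]  # events with zero probability contribute nothing
    return float(-(nonzero * np.log2(nonzero)).sum())

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(surprisal(0.5), surprisal(0.1))
print(entropy(fair_coin), entropy(biased_coin))
```

The fair coin has an entropy of exactly one bit, while the biased coin is more predictable and therefore has lower entropy, even though its rare outcome is individually more surprising.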

### 3.4.7.3. Cross-Entropy Revisited
So if entropy is the level of surprise experienced by someone who knows the true probability, then the cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$,
is the expected surprisal of an observer with subjective probabilities $Q$
upon seeing data that were actually generated according to probabilities $P$.
The lowest possible cross-entropy is achieved when $P=Q$.
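The claim can be checked numerically, assuming the usual definition $H(P, Q) = \sum_j -P(j) \log Q(j)$. The function name `cross_entropy_bits` and the distributions below are made up for illustration.

```python
import numpy as np

def cross_entropy_bits(P, Q):
    """Expected surprisal, in bits, of an observer believing Q when data follow P."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0  # events that never occur under P contribute nothing
    return float(-(P[mask] * np.log2(Q[mask])).sum())

P = [0.7, 0.2, 0.1]          # data-generating distribution (illustrative)
Q_uniform = [1/3, 1/3, 1/3]  # observer with no preference among classes
Q_wrong = [0.1, 0.2, 0.7]    # observer with badly mismatched beliefs

h_matched = cross_entropy_bits(P, P)  # equals the entropy of P
h_uniform = cross_entropy_bits(P, Q_uniform)
h_wrong = cross_entropy_bits(P, Q_wrong)

print(h_matched, h_uniform, h_wrong)
```

Both mismatched observers incur a larger expected surprisal than the observer whose beliefs match $P$, and the matched value coincides with the entropy of $P$.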

In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.

## 3.4.8. Model Prediction and Evaluation
After training the softmax regression model, given any example features, we can predict the probability of each output class. Normally, we use the class with the highest predicted probability as the output class. The prediction is correct if it is consistent with the actual class (label). In the next part of the experiment, we will use accuracy to evaluate the model’s performance. This is equal to the ratio between the number of correct predictions and the total number of predictions.
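Accuracy as described here can be sketched as follows. The function name `accuracy`, the predicted probabilities, and the labels are illustrative assumptions; labels are given as integer class indices rather than one-hot vectors.

```python
import numpy as np

def accuracy(y_hat, y):
    """Fraction of examples whose highest-probability class matches the label."""
    predictions = y_hat.argmax(axis=1)
    return float((predictions == y).mean())

# Predicted probabilities for four examples over three classes (made up).
y_hat = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])

# True class indices; the last example is misclassified.
y = np.array([1, 0, 2, 1])

acc = accuracy(y_hat, y)
print(acc)  # 3 correct out of 4 -> 0.75
```

Note that accuracy only looks at the top-ranked class, so it ignores how confident the model was in each prediction.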