# Softmax Regression - Logistic Regression for many classes

In this notebook we generalize logistic regression to handle $K$
classes instead of 2. This is often known as Multinomial Regression or Softmax Regression, although the name Logistic Regression is sometimes used for this more general case as well.

**We will use the exact same approach as for Logistic Regression**, but it becomes slightly more technical due to the extra
classes.

## The Setup
Assume the problem at hand is a multiclass classification problem with $K$ classes.
For this problem we choose to encode the target labels in a very particular way.

A target value $y$ is naturally represented as a number in $\{1, \dots, K\}$. However, for this exposition we choose to represent it as a vector of
length $K$ that is zero everywhere except for a single one at the position of the class. This is called a *one-in-K encoding*.

If a data point $x$ is labelled with class 3 and there are five classes then
$y = [0,0,1,0,0]^\intercal = e_3$.

To store the labels of all data points we create a matrix $Y$ of size $n \times K$.
The data matrix $X$ is unchanged.


$$
X=\begin{pmatrix} 
1&- & x_1^T & - \\
\vdots & & \vdots & \\
1&- & x_n^T & - \\
\end{pmatrix}\in \mathbb{R}^{n \times d}\quad\quad 
Y=\begin{pmatrix}
- & y_1^T & -\\
 & \vdots & \\
- & y_n^T & -\end{pmatrix}\in\{0,1\}^{n\times K}
$$
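
The one-in-K encoding of the label matrix $Y$ can be sketched in NumPy as follows. The helper name `one_in_k_encode` is hypothetical, and the sketch assumes the labels are given as integers in $\{1, \dots, K\}$.

```python
import numpy as np

def one_in_k_encode(labels, K):
    """Return an (n, K) 0/1 matrix with a single one per row.

    labels: integers in {1, ..., K}, one per data point (assumed format).
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    Y = np.zeros((n, K), dtype=int)
    # Classes are numbered from 1 in the text, columns from 0 in NumPy.
    Y[np.arange(n), labels - 1] = 1
    return Y

# Three data points with classes 3, 1 and 5 out of K = 5 classes.
Y = one_in_k_encode([3, 1, 5], K=5)
print(Y)
```

The first row is $e_3 = [0,0,1,0,0]$, matching the example above.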


To generalize to $K$ classes we will use $K$ weight vectors $w_1,\dots,w_K$ each of length $d$, one for each class.
Note this fits with the original wine example from the lectures.
Also, to use such a list of weight vectors for classification we do as described in the wine example:

Given a data point $x$, compute $w_i^\intercal x$ for $i=1,\dots, K$ and return the index of the largest value.

We pack this list of weight vectors into a matrix $W$ of size $d \times K$ by putting $w_1$ in column one and so on.
$$
W=\begin{pmatrix} 
(w_1)_1  & \dots & (w_K)_1 \\
\vdots & \vdots & \vdots \\
(w_1)_d  & \dots & (w_K)_d \\
\end{pmatrix}\in \mathbb{R}^{d \times K}
$$

This way we can compute the weighted sum for each class by the vector-matrix product $x^\intercal W$ and then pick the argmax of that to do the classification. Pretty neat!


In [1]:
import numpy as np
# example with 3 classes and d = 10
W = np.random.rand(10, 3)  # one column of weights per class
print('Shape W:', W.shape)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.0]).reshape(10, 1)
print('Shape x:', x.shape)
model_predictions = x.T @ W  # shape (1, 3): one score per class
print('model (unnormalized log) predictions: - pick the largest one\n', model_predictions)
# argmax is zero-based, so add 1 to get a class in {1, 2, 3}
print('class output: ', 1 + np.argmax(model_predictions))

Shape W: (10, 3)
Shape x: (10, 1)
model (unnormalized log) predictions: - pick the largest one
 [[33.27812034 33.97847283 26.81526456]]
class output:  2
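
The same idea extends to a whole data matrix at once: $XW$ holds the scores of all data points, one row per point. Here is a minimal sketch, assuming random data and a fixed seed chosen only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
n, d, K = 4, 10, 3
X = rng.random((n, d))
X[:, 0] = 1.0                   # first column of ones, as in the data matrix X
W = rng.random((d, K))

scores = X @ W                              # shape (n, K); row i is x_i^T W
predicted = 1 + np.argmax(scores, axis=1)   # one class in {1, ..., K} per row
print('scores shape:', scores.shape)
print('predicted classes:', predicted)
```

Taking the argmax along `axis=1` picks the best class for each row independently.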


## Probabilistic Outputs
However, as for Logistic Regression we are really interested in computing a probability of each class given a data point.
Said differently, given a set of model parameters $W$ and a data point $x$ we want $P(y=i\mid x, W)$ for $i=1,\dots, K$.
This is a list of length $K$ giving the probability of each class as estimated by the model.

So how do we define this probabilistic model?  
We need to map the scores $(w_1^\intercal x, \dots, w_K^\intercal x) = x^\intercal W$, a vector of $K$ real numbers, to a vector of $K$ non-negative numbers that sum to one, generalizing the sigmoid/logistic function.

This generalization is the **softmax** function.
Softmax takes as input a vector of length $K$
and outputs another vector of the same length $K$,
that is, a mapping from the $K$ input numbers into $K$
*probabilities*, i.e. non-negative numbers that sum to one.  Softmax is defined as

$$
\textrm{softmax}(x)_j =
\frac{e^{x_j}}
{\sum_{i=1}^K e^{x_i}}\quad
\textrm{ for }\quad j = 1, \dots, K.
$$
where $\textrm{softmax}(x)_j$ denotes the $j$'th entry in the vector.

Notice that the denominator acts as a normalization term that ensures
that the probabilities sum to one, and that the exponentiation ensures all numbers are positive. As for the sigmoid function, we get nice derivatives (exercise later):

$$
\frac
{\partial \;\textrm{softmax}(x)_i}
{\partial x_j} =
(\delta_{ij} - \textrm{softmax}(x)_j)
\textrm{softmax}(x)_i\quad\quad\text{where}\quad\quad
\delta_{ij}=\begin{cases}1 &\text{if }i=j\\
0 & \text{else}
\end{cases}
$$
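
The derivative formula can be checked numerically. The sketch below builds the full matrix of partial derivatives from the formula and compares it with central finite differences. The function names, the test vector and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / np.sum(e)

def softmax_jacobian(x):
    """J[i, j] = d softmax(x)_i / d x_j = (delta_ij - s_j) * s_i."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([0.5, -1.0, 2.0])  # arbitrary test point
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    # Central difference approximates column j of the Jacobian.
    J_numeric[:, j] = (softmax(x + step) - softmax(x - step)) / (2 * eps)

max_error = np.max(np.abs(J_numeric - softmax_jacobian(x)))
print('largest deviation from the formula:', max_error)
```

Each column of the Jacobian sums to zero, which reflects that the outputs always sum to one.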

With this softmax defined we can now define our probabilistic model
$$
p(y \mid x, W) =
\textrm{softmax}(x^\intercal W) =
 \left \{
\begin{array}{l l}
 \textrm{softmax}(x^\intercal W)_1 & \text{ if } y = e_1,  \\
 \vdots & \\
 \textrm{softmax}(x^\intercal W)_K & \text { if } y = e_K.
\end{array}
\right.
$$

Think of the probability distribution over $y$ as throwing a $K$-sided die
where the probabilities of the $K$ sides are stored in the
vector $\textrm{softmax}(x^\intercal W)$ (which is a vector of length $K$), so the
probability of landing on side $i$ is $\textrm{softmax}(x^\intercal W)_i$. 



In [2]:
import numpy as np
x = np.array([1, 1, 1])
softmax = np.exp(x)/np.sum(np.exp(x))
print('softmax of the ones vector: ', softmax)
print('That seems reasonable, right?')

softmax of the ones vector:  [0.33333333 0.33333333 0.33333333]
That seems reasonable, right?
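
The die analogy can also be simulated. In this sketch the values of `x` and `W` are made up for illustration. We compute the model's class probabilities, throw the die many times, and compare the observed frequencies with the probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, an arbitrary choice
x = np.array([1.0, 0.5, -2.0])          # d = 3 features (hypothetical values)
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 2.0,  0.0],
              [0.5, 0.0,  1.0]])        # d x K weights with K = 3 (hypothetical)

z = x @ W                                # the K scores x^T W
probs = np.exp(z) / np.sum(np.exp(z))    # softmax(x^T W)

# Throw the K-sided die 10000 times; sides are numbered 0..K-1 here.
throws = rng.choice(3, size=10000, p=probs)
frequencies = np.bincount(throws, minlength=3) / throws.size
print('model probabilities: ', probs)
print('observed frequencies:', frequencies)
```

With this many throws the frequencies should lie close to the probabilities.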


As for logistic regression we compute the likelihood of the data given a fixed matrix of parameters.
We use the notation $[z]$ for the indicator function, i.e. $[z]$ is one if $z$ is true and zero otherwise.

$$
P(D \mid W) =
\prod_{(x,y)\in D}
\prod_{j=1}^K
\textrm{softmax}(x^\intercal W)_j^{[y_j=1]}
=
\prod_{(x,y)\in D}
y^\intercal
\textrm{softmax}(x^\intercal W)
.
$$

This way of expressing the likelihood is the same as the one we used for logistic regression.
In the product over the $K$ classes only one factor can differ
from one, namely the $j$'th factor where $y_j = 1$ ($y$ is a
vector of $K-1$ zeros and a single one). The remaining probabilities are
raised to the power zero and therefore have the value one.

## The Negative Log Likelihood
For convenience we minimize the negative log likelihood of the data
instead of maximizing the likelihood of the data. Taking the logarithm turns the product over data points into a pointwise sum.

$$
\begin{align}\textrm{NLL}(D\mid W) &=
-\sum_{(x,y)\in D}
\sum_{j=1}^K
[y_j=1]
\ln (\textrm{softmax}(x^\intercal W)_j)
\\
&=-\sum_{(x,y)\in D}
y^\intercal
\ln (\textrm{softmax}(x^\intercal W))
.
\end{align}
$$
Notice again that inside the last summation only one value will be nonzero. For a particular data point $(x, y)$ where $y=e_j$, let $z = W^\intercal x$ be the input to softmax. Then the cost for that point is just
$$ - \ln \mathrm{softmax}(z)_j = - \ln \left( \frac{e^{z_j}}{\sum_{i=1}^K e^{z_i}}\right) = - \left(z_j - \ln \sum_{i=1}^K e^{z_i}\right) $$
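
Both ways of writing the cost can be compared in code. The sketch below computes the NLL once with the one-in-K matrix $Y$ and once point by point with the formula for a single data point. The data is random, the class indices are zero-based (a NumPy convenience that differs from the text's $1,\dots,K$), and the function names are chosen for this illustration.

```python
import numpy as np

def softmax_rows(Z):
    """Apply softmax to each row of Z (no overflow protection)."""
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def nll(X, Y, W):
    """Negative log likelihood: -sum over points of y^T ln softmax(x^T W)."""
    return -np.sum(Y * np.log(softmax_rows(X @ W)))

rng = np.random.default_rng(1)          # fixed seed, an arbitrary choice
X = rng.normal(size=(5, 4))             # n = 5 points, d = 4
W = rng.normal(size=(4, 3))             # K = 3 classes
labels = np.array([0, 2, 1, 1, 0])      # zero-based class of each point
Y = np.eye(3)[labels]                   # one-in-K encoding, shape (5, 3)

Z = X @ W                               # row i holds z for data point i
# Cost per point: -(z_j - ln sum_i exp(z_i)) where j is the true class.
per_point = -(Z[np.arange(5), labels] - np.log(np.sum(np.exp(Z), axis=1)))
print('NLL via Y:         ', nll(X, Y, W))
print('sum of point costs:', per_point.sum())
```

The two numbers agree up to floating point rounding.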


Again we define $E_\textrm{in} = \frac{1}{|D|} \textrm{NLL}$, which we need to minimize.  
To apply stochastic mini-batch gradient descent as for Logistic Regression all you really need is the
gradient of the negative log likelihood function.  This gradient is a
*simple* generalization of the one used in logistic
regression. There is a set of parameters for each of the $K$ classes, $w_j$ for
$j=1,\ldots,K$ (the $j$'th column in the parameter matrix $W$), that must be learned.
Luckily some nice people tell you what the gradient is: 
$$
\nabla \textrm{NLL}(W) =
-X^\intercal
(Y - \textrm{softmax}(XW)),
$$

where softmax is taken on each row of the matrix (that is, $X W$ is an
$n \times K$ matrix and softmax is computed for each training case over
the $K$ classes).

We will skip the derivation of the gradient; it follows from applying the chain rule.
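
Even without the derivation, the gradient formula can be checked against finite differences and then used for descent. The following is a minimal sketch on random data. The function names, the seed and the learning rate are assumptions made for this illustration, and it uses full-batch steps rather than mini-batches to keep it short.

```python
import numpy as np

def softmax_rows(Z):
    """Apply softmax to each row of Z (no overflow protection)."""
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def nll(X, Y, W):
    return -np.sum(Y * np.log(softmax_rows(X @ W)))

def nll_gradient(X, Y, W):
    """Gradient from the text: -X^T (Y - softmax(XW)), shape (d, K)."""
    return -X.T @ (Y - softmax_rows(X @ W))

rng = np.random.default_rng(2)          # fixed seed, an arbitrary choice
n, d, K = 6, 4, 3
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]   # random one-in-K labels
W = rng.normal(size=(d, K))

# Central finite differences, one entry of W at a time.
eps = 1e-6
G_numeric = np.zeros_like(W)
for a in range(d):
    for b in range(K):
        step = np.zeros_like(W)
        step[a, b] = eps
        G_numeric[a, b] = (nll(X, Y, W + step) - nll(X, Y, W - step)) / (2 * eps)
max_error = np.max(np.abs(nll_gradient(X, Y, W) - G_numeric))
print('largest gradient deviation:', max_error)

# A few full-batch gradient descent steps on E_in = NLL / n.
learning_rate = 0.1                     # hypothetical step size
E_before = nll(X, Y, W) / n
for _ in range(20):
    W = W - learning_rate * nll_gradient(X, Y, W) / n
E_after = nll(X, Y, W) / n
print('E_in before:', E_before, 'after:', E_after)
```

A mini-batch version would apply the same update to a random subset of the rows of $X$ and $Y$ in each step.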


## Numerical Issues with Softmax
There are some numerical issues with the softmax function

$$
\textrm{softmax}(x)_j = \frac{e^{x_j}}{\sum_{i=1}^K e^{x_i}} \textrm{ for } j=1,\ldots,K.
$$

The issue is that softmax involves a sum of exponentials (before we take logarithms again in the NLL),
and exponentiating even moderately large numbers produces values too large to represent in floating point.
Let's look at the function for a fixed $j$,
$\frac{e^{x_j}}{\sum_{i=1}^K e^{x_i}}$.
Since the logarithm and the exponential function are each other's inverse,
we may write it as
$$
e^{\textstyle x_j - \ln(\sum_{i=1}^K e^{x_i})}
$$

The problematic part is the logarithm of the sum of exponentials.
However, we can move $e^c$ for any constant $c$ outside the sum, that is,
$$
\ln\left(\sum_i e^{x_i}\right) =
\ln\left(e^c \sum_i e^{x_i-c}\right) =
c + \ln\left(\sum_i e^{x_i -c}\right).
$$

We need to find a good $c$, and we choose $c = \max_i x_i$ since
$e^{\max_i x_i}$ is the dominant term in the sum. With this choice every
exponent $x_i - c$ is at most zero, so no term can overflow. We are less concerned with
values being inadvertently rounded to zero since that does not
change the value of the sum significantly.
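
The trick can be sketched as follows. The function names `log_sum_exp` and `stable_softmax` are chosen for this illustration, and the input values are picked so that the naive formula overflows.

```python
import numpy as np

def log_sum_exp(x):
    """ln(sum_i exp(x_i)) computed as c + ln(sum_i exp(x_i - c)), c = max_i x_i."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    """softmax(x)_j = exp(x_j - ln(sum_i exp(x_i)))."""
    return np.exp(x - log_sum_exp(x))

x = np.array([1000.0, 1001.0, 1002.0])

# The naive formula overflows: exp(1000) is inf in float64, and inf / inf is nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(x) / np.sum(np.exp(x))
stable = stable_softmax(x)

print('naive: ', naive)
print('stable:', stable)
```

Since subtracting a constant from every input does not change softmax, the stable result equals the softmax of $[0, 1, 2]$.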

