# Softmax regression

## Unvectorized

$\textbf{x} \in \mathbb{R}^{p}$

$\textbf{W} \in \mathbb{R}^{n_c \times p}$

$\textbf{b} \in \mathbb{R}^{n_c}$

$\textbf{y} \in \{0, 1\}^{n_c}$

$L = - \sum_{k=1}^{n_k} y_k \log \left[ (\mathrm{softmax}(\textbf{W} \textbf{x} + \textbf{b})_k \right]$

$\textbf{z} = \textbf{W} \textbf{x} + \textbf{b} \in \mathbb{R}^{n_c}$

$\textbf{a} = \mathrm{softmax}(\textbf{z}) \in \mathbb{R}^{n_c}$

$\frac{\partial L_k}{\partial W_{i,j}} = \frac{\partial L_k}{\partial a_k}\frac{\partial a_k}{\partial z_i}\frac{\partial z_i}{\partial W_{i,j}}$

$\frac{\partial L}{\partial a_k} = \frac{y_k}{a_k}$

$\frac{\partial a_k}{\partial z_i} = \begin{cases} a_i (1 - a_i) & \text{if i == k} \\ -a_i a_j & \text{otherwise} \end{cases}$

$\frac{\partial z_i}{\partial W_{i,j}} = x_j$

$\begin{eqnarray} \frac{\partial L}{\partial W_{i,j}} &=& \frac{-y_i}{a_i}a_i(1-a_i)x_j + \sum_{k \neq i} \frac{y_k}{a_k}a_i a_k x_j \\ &=& (-y_i + y_i a_i) x_j + \sum_{k \neq i} y_k a_i x_j \\ &=& -y_i x_j + y_i a_i x_j + \sum_{k \neq i} y_k a_i x_j \\ &=& -y_i x_j + \sum_{k=1}^{n_c} y_k a_i x_j \\ &=& -y_i x_j + a_i x_j \sum_{k=1}^{n_c} y_k \\ &=& (a_i - y_i) x_j  \end{eqnarray}$

$\frac{\partial J}{\partial W_{i,j}} = \frac{1}{n} \sum_{t=1}^n L^{(t)} = \frac{1}{n} \sum_{t=1}^n (a_t - y_t) x_j^{(t)}$

## Vectorized

$\textbf{X} \in \mathbb{R}^{n \times p}$

$\textbf{W} \in \mathbb{R}^{p \times n_c}$

$\textbf{b} \in \mathbb{R}^{n_c}$

$Y \in [0, 1]^{n \times n_c}$

$\sum_{j=1}^{n_c} Y_{i,j} = 1$

$Z = XW + b \in \mathbb{R}^{n \times n_c}$

$A = \mathrm{softmax}(Z) \in \mathbb{R}^{n \times n_c}$

## Implementation

* Using log_softmax instead of softmax would make it more stable

In [21]:
import scipy.special
import numpy as np

import warnings
def warn(*args, **kwargs):
    pass
warnings.warn = warn
import sklearn.linear_model

In [22]:
def _softmax(x):
    m = np.max(x, axis=-1, keepdims=True)
    z = x - m
    e_z = np.exp(z)
    return e_z / np.sum(e_z, axis=-1, keepdims=True)


def train_sgd(X, Y):
    n, p = X.shape
    n, k = Y.shape

    np.random.seed(0)
    W = np.random.randn(p, k) * 0.01
    b = np.zeros(k)

    lr = 0.01
    batch_size = 64
    num_epochs = 100
    for epoch in range(num_epochs):
        start = 0
        end = batch_size
        while end <= n:
            X_batch = X[start:end, :]
            Y_batch = Y[start:end, :]
            Z = np.dot(X_batch, W) + b
            A = _softmax(Z)
            dW = (1./n) * (np.dot(X_batch.T, A - Y_batch))
            db = np.mean(A - Y_batch)
            W = W - lr * dW
            b = b - lr * db
            start = end
            end = start + batch_size
    return W, b


def predict(X, W, b):
    Z = np.dot(X, W) + b
    A = _softmax(Z)
    return A

In [23]:
np.random.seed(0)
n = 4096
p = 4
k = 3
X = np.random.randn(n, p)
W_true = np.random.randn(p, k)
b_true = np.random.randn()
Z = np.dot(X, W_true) + b_true
A = scipy.special.softmax(Z, axis=1)
Y = np.zeros((n, k))
for i in range(n):
    Y[i, :] = np.random.multinomial(n=1, pvals=A[i, :])
X_train = X[:2048]
Y_train = Y[:2048]
X_test = X[2048:]
Y_test = Y[2048:]

model = sklearn.linear_model.LogisticRegression(solver="lbfgs")
y_train = np.argmax(Y_train, axis=1)
y_test = np.argmax(Y_test, axis=1)
model.fit(X_train, y_train)
y_hat_test_sklearn = model.predict(X_test)
acc_sklearn = np.mean(y_hat_test_sklearn == y_test)

W, b = train_sgd(X_train, Y_train)
A_hat = predict(X_test, W, b)
y_hat_test = np.argmax(A_hat, axis=1)
acc = np.mean(y_hat_test == y_test)
assert acc >= acc_sklearn

print(acc_sklearn)
print(acc)

0.6787109375
0.68359375


## Sources

* [L8.8 Softmax Regression Derivatives for Gradient Descent](https://www.youtube.com/watch?v=aeM-fmcdkXU)
* https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function
* http://web.archive.org/save/http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/
* Vectorizing softmax
    * https://stackoverflow.com/questions/57741998/vectorizing-softmax-cross-entropy-gradient
    * https://stackoverflow.com/questions/59286911/vectorized-softmax-gradient
    * https://mattpetersen.github.io/softmax-with-cross-entropy