### Softmax Classification (with L2 Regularization)

For the multiclass classification problem, let $s_j$ be the score that a classifier assigns to class $j$, and let $y$ be the true class label. Interpreting the scores as unnormalized log probabilities gives the softmax loss, which is the KL divergence between the distribution that puts all probability mass on the correct class label and the distribution reported by the classifier. See http://cs231n.github.io/linear-classify/ for an excellent explanation. For a score vector $s$ over $n$ classes, the softmax loss is:

\begin{equation}
\ell(s) = - \log \frac{e^{s_y}}{\sum_{j = 1}^n e^{s_j}} = -s_y + \log \sum_{j = 1}^n e^{s_j}
\end{equation}
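
As a minimal sketch of the formula above, the loss for a single score vector can be evaluated in NumPy. The function name `softmax_loss` and the example scores are illustrative assumptions, not part of the original notebook; subtracting the maximum score before exponentiating is a standard trick to avoid overflow and does not change the value.

```python
import numpy as np

def softmax_loss(s, y):
    """Softmax loss -s_y + log(sum_j exp(s_j)) for score vector s and true label y."""
    s = np.asarray(s, dtype=float)
    shift = s.max()  # subtracting the max keeps exp() from overflowing
    return -s[y] + shift + np.log(np.sum(np.exp(s - shift)))

# Hypothetical scores for three classes
scores = np.array([2.0, 1.0, -1.0])
loss_correct = softmax_loss(scores, 0)  # true class has the highest score
loss_wrong = softmax_loss(scores, 2)    # true class has the lowest score
```

The loss is small when the true class receives the highest score and grows as the true class's score falls behind the others.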

We will use the softmax loss to train a linear classifier as follows. Given labelled data $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, we wish to find a matrix $\Theta$ and an offset vector $\beta$ such that the scores $s^{(i)} = \Theta^T x_i + \beta$ minimize the average loss, where $X$ is the matrix whose $i$th row is $x_i^T$:

\begin{equation}
\ell(X \Theta + \mathbb{1} \beta^T) = \frac{1}{m}\sum_{i = 1}^m \left[ - s^{(i)}_{y_i}  + \log \sum_{j = 1}^n \exp(s^{(i)}_j) \right]
\end{equation}

In this particular problem instance, we minimize the sum of the largest $p$ softmax losses, plus an L2 regularization term on the weight matrix $\Theta$. As before, let $s^{(i)} = \Theta^T x_i + \beta$ be the vector of scores for the data entry $(x_i, y_i)$. Then let $z$ be the vector such that:

\begin{equation}
z_i = -s^{(i)}_{y_i} + \log \sum_{j = 1}^n \exp(s^{(i)}_j)
\end{equation}
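
The vector $z$ can be computed for all data points at once. The following is a sketch under stated assumptions: the function name `per_example_losses`, the `_demo` arrays, and their sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def per_example_losses(X, Theta, beta, y):
    """Return z with z[i] = -s_{y_i} + log(sum_j exp(s_j)) for s = Theta^T x_i + beta."""
    S = X @ Theta + beta                    # (m, n_classes); row i holds the scores s^(i)
    shift = S.max(axis=1, keepdims=True)    # per-row shift for numerical stability
    lse = shift[:, 0] + np.log(np.exp(S - shift).sum(axis=1))
    return -S[np.arange(len(y)), y] + lse

# Hypothetical data: 6 points in 4 dimensions, 3 classes, all-zero parameters
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((6, 4))
Theta_demo = np.zeros((4, 3))
beta_demo = np.zeros(3)
y_demo = rng.integers(0, 3, size=6)
z_demo = per_example_losses(X_demo, Theta_demo, beta_demo, y_demo)
```

With all-zero parameters every class gets the same score, so each entry of `z_demo` equals $\log 3$, the loss of a uniform guess over three classes.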

Our final problem is

\begin{equation*}
  \begin{aligned}
    &\text{minimize} && \sum_{l = 1}^p z_{[l]} + \|\Theta\|_F^2 \\
  \end{aligned}
\end{equation*}

with variables $\Theta$ and $\beta$ and data $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$. Here, $z_{[l]}$ denotes the $l$th largest entry of $z$, and $\|\Theta\|_F^2$ is the sum of the squared entries of $\Theta$ (computed by `cp.sum_squares` in the code).
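
To make the first term of the objective concrete, here is a small NumPy sketch of the sum of the $p$ largest entries of a vector. The helper name `sum_largest` mirrors the CVXPY atom of the same name, but this standalone version and its sample values are assumptions for illustration only.

```python
import numpy as np

def sum_largest(z, p):
    """Sum of the p largest entries of z."""
    z = np.asarray(z, dtype=float)
    return np.sort(z)[-p:].sum()  # sort ascending, keep the last p entries

# Hypothetical per-example losses
z_vals = np.array([0.3, 2.0, 1.1, 0.7, 1.5])
top2 = sum_largest(z_vals, 2)  # 2.0 + 1.5
```

Setting $p = 1$ recovers the worst-case loss $\max_i z_i$, and $p = m$ recovers the sum of all losses, so $p$ interpolates between the two.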


In [2]:
import cvxpy as cp
import numpy as np
import scipy.sparse as sps

# Variable declarations

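def normalized_data_matrix(m, n, mu):
    """Return a random m-by-n matrix with columns scaled to unit Euclidean norm.

    mu is the density of nonzeros: mu == 1 gives a dense NumPy array,
    any other value gives a scipy.sparse matrix with that density.
    """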
    if mu == 1:
        # dense
        A = np.random.randn(m, n)
        A /= np.sqrt(np.sum(A**2, 0))
    else:
        # sparse
        A = sps.rand(m, n, mu)
        A.data = np.random.randn(A.nnz)
        N = A.copy()
        N.data = N.data**2
        A = A*sps.diags([1 / np.sqrt(np.ravel(N.sum(axis=0)))], [0])

    return A

np.random.seed(0)
k = 20  # Number of classes (written n in the formulas above)
m = 100 # Number of instances
n = 50  # Dimension of each instance
p = 5   # p-largest
X = normalized_data_matrix(m, n, 1) # Randomly generated data
Y = np.random.randint(0, k, m) # Randomly generated class labels



# Problem construction
problems = []
opt_vals = []

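# Written for the CVXPY 1.x API (tuple shapes, @ for matrix products)

# Problem 1 (Unconstrained)
Theta = cp.Variable((n, k))
beta = cp.Variable((1, k))
S = X @ Theta + np.ones((m, 1)) @ beta  # m-by-k score matrix; row i is s^(i)
# Per-example softmax loss: minus the true-class score plus log-sum-exp of all scores
obs = cp.hstack([-S[i, int(Y[i])] + cp.log_sum_exp(S[i, :]) for i in range(m)])
prob1 = cp.Problem(cp.Minimize(cp.sum_largest(obs, p) + cp.sum_squares(Theta)))

# Problem 2 (Epigraph form)
def one_hot(y, k):
    """Return an m-by-k 0/1 array with a single 1 in row i, at column y[i]."""
    m = len(y)
    return sps.coo_matrix((np.ones(m), (np.arange(m), y)), shape=(m, k)).toarray()

Theta2 = cp.Variable((n, k))
beta2 = cp.Variable((1, k))
t = cp.Variable(m)     # t[i] equals minus the true-class score of example i
texp = cp.Variable(m)  # texp[i] upper-bounds the log-sum-exp of example i's scores
S2 = X @ Theta2 + np.ones((m, 1)) @ beta2
f = cp.sum_largest(t + texp, p) + cp.sum_squares(Theta2)
C = []
C.append(cp.log_sum_exp(S2, axis=1) <= texp)
Yi = one_hot(Y, k)
# The one-hot mask picks out the true-class score in each row
C.append(t == -cp.sum(cp.multiply(Yi, S2), axis=1))
prob2 = cp.Problem(cp.Minimize(f), C)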

# Problem collection

# Single problem collection
problem1Dict = {
    "problemID" : "max_softmax_0",
    "problem"   : prob1,
    "opt_val"   : None
}
problem2Dict = {
    "problemID" : "max_softmax_epigraph_0",
    "problem"   : prob2,
    "opt_val"   : None
}
problems = [problem1Dict, problem2Dict]



# For debugging individual problems:
if __name__ == "__main__":
    def printResults(problemID = "", problem = None, opt_val = None):
        print(problemID)
        problem.solve()
        print("\tstatus: {}".format(problem.status))
        print("\toptimal value: {}".format(problem.value))
        print("\ttrue optimal value: {}".format(opt_val))
    printResults(**problems[0])
    printResults(**problems[1])


max_softmax_0
	status: optimal
	optimal value: 14.9433288172
	true optimal value: None
max_softmax_epigraph_0
	status: optimal
	optimal value: 14.9433288162
	true optimal value: None
