### Logistic Regression with L1 Regularization

In a logistic regression, we assume that the probability $p$ of a Bernoulli random variable is given by the logistic function:

\begin{equation}
p = \frac{\exp(a^Tx)}{1 + \exp(a^T x)}
\end{equation}

where $x$ is the parameter vector to be estimated and $a$ is the data vector. In general, we would want to consider $a^Tx + v$ instead of just $a^T x$, but here we assume our data set has zero mean (i.e. $E[a] = 0$), in which case we leave out the bias term $v$ for simplicity (this fixes the probability at $a = 0$ to be $1/2$).
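As a quick illustration of the model above, here is a minimal sketch that evaluates $p$ for a given data vector and parameter vector using only the standard library. The function name `logistic_probability` and the example vectors are hypothetical and not part of the notebook.

```python
import math

def logistic_probability(a, x):
    """Return p = exp(a^T x) / (1 + exp(a^T x)) for plain Python lists a and x."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    # Pick the algebraically equivalent form that never calls exp() on a
    # large positive number, so the evaluation cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

x_example = [0.5, -1.0, 2.0]  # hypothetical parameter vector

# With no bias term, a = 0 always gives probability 1/2.
p_zero = logistic_probability([0.0, 0.0, 0.0], x_example)

# Here a^T x = 0.5 + 2.0 = 2.5 > 0, so the probability is above 1/2.
p_pos = logistic_probability([1.0, 0.0, 1.0], x_example)
```

The first call shows the effect of dropping the bias term: whatever $x$ is, the model assigns probability $1/2$ at $a = 0$.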

Given data $a_1,\dots, a_m$ and the corresponding class labels $b_1,\dots, b_m \in \{-1, 1\}$, the maximum likelihood estimate of $x$ is obtained by maximizing the log-likelihood of the data:

\begin{equation*}
  \begin{aligned}
    &\text{maximize} && \sum_{i | b_i = 1} \log \frac{1}{1 + \exp(-a_i^T x)} 
                    +\sum_{i | b_i = -1} \log \frac{1}{1 + \exp(a_i^T x)} 
  \end{aligned}
\end{equation*}

Noting that the only difference between the first and second sums is the sign of the coefficient of $a_i^Tx$, we can express this compactly as

\begin{equation*}
  \begin{aligned}
    &\text{minimize} && \sum_{i=1}^m \log(1 +\exp(-b_i a_i^T x)) 
  \end{aligned}
\end{equation*}

Note that we have flipped the leading sign and changed the maximize to a minimize. Each term $\log(1 + \exp(-b_i a_i^T x))$ is a log-sum-exp of affine functions of $x$, and log-sum-exp is convex, so this is a convex optimization problem.
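The equivalence between the two-sum log-likelihood and the compact loss can be checked numerically. The following is a minimal sketch on made-up toy data; the helper names (`log1p_exp`, `dot`, `log_likelihood`, `logistic_loss_value`) and the toy values are assumptions for illustration only.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def log_likelihood(A, b, x):
    """Two-sum form: one sum over b_i = 1 and one over b_i = -1."""
    total = 0.0
    for a_i, b_i in zip(A, b):
        z = dot(a_i, x)
        if b_i == 1:
            total -= log1p_exp(-z)  # log(1 / (1 + exp(-a_i^T x)))
        else:
            total -= log1p_exp(z)   # log(1 / (1 + exp(a_i^T x)))
    return total

def logistic_loss_value(A, b, x):
    """Compact form: sum_i log(1 + exp(-b_i a_i^T x))."""
    return sum(log1p_exp(-b_i * dot(a_i, x)) for a_i, b_i in zip(A, b))

# Hypothetical toy problem with m = 3 samples and n = 2 features.
A_toy = [[1.0, 2.0], [-0.5, 0.3], [0.0, -1.5]]
b_toy = [1, -1, 1]
x_toy = [0.4, -0.2]

ll = log_likelihood(A_toy, b_toy, x_toy)
loss = logistic_loss_value(A_toy, b_toy, x_toy)
# Maximizing ll is the same as minimizing loss, because loss == -ll.
gap = abs(ll + loss)
```

Up to floating-point rounding, `gap` is zero, which is the sign flip described above.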

If we have reason to believe that $x$ is sparse (i.e. only a few components of the data vector $a$ are relevant to the prediction), then we can add an $\ell_1$ regularization term to the objective to obtain

\begin{equation*}
  \begin{aligned}
    &\text{minimize} && \sum_{i=1}^m \log(1 +\exp(-b_i a_i^T x)) + \lambda \|x\|_1
  \end{aligned}
\end{equation*}

where $\lambda$ is a hyperparameter that controls the strength of our regularization.
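A useful sanity check on the regularized objective: at $x = 0$ every loss term equals $\log 2$ and the penalty vanishes, so the objective is $m \log 2$ for any $\lambda$, and the optimal value can never exceed that. The sketch below evaluates the objective directly on hypothetical toy data; the helper names are assumptions and the code does not solve the problem.

```python
import math

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def l1_logistic_objective(A, b, x, lam):
    """sum_i log(1 + exp(-b_i a_i^T x)) + lam * ||x||_1 for plain lists."""
    loss = 0.0
    for a_i, b_i in zip(A, b):
        z = sum(aij * xj for aij, xj in zip(a_i, x))
        loss += log1p_exp(-b_i * z)
    return loss + lam * sum(abs(xj) for xj in x)

# Hypothetical toy problem with m = 3 samples and n = 2 features.
A_toy = [[1.0, 2.0], [-0.5, 0.3], [0.0, -1.5]]
b_toy = [1, -1, 1]

# At x = 0 the objective is m * log(2), independent of lam.
at_zero_small = l1_logistic_objective(A_toy, b_toy, [0.0, 0.0], lam=0.1)
at_zero_large = l1_logistic_objective(A_toy, b_toy, [0.0, 0.0], lam=10.0)

# For a nonzero x, a larger lam adds a larger penalty.
x_toy = [0.4, -0.2]
obj_small = l1_logistic_objective(A_toy, b_toy, x_toy, lam=0.1)
obj_large = l1_logistic_objective(A_toy, b_toy, x_toy, lam=10.0)

# Upper bound on the optimal value for a problem with m = 1500 samples.
upper_bound_m1500 = 1500 * math.log(2)
```

For $m = 1500$ this bound is roughly $1039.72$, so any reported optimal value for a problem of that size should fall below it.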


In [2]:
import cvxpy as cp
import numpy as np
import scipy.sparse as sps

# Variable declarations

def normalized_data_matrix(m, n, mu):
    if mu == 1:
        # dense
        A = np.random.randn(m, n)
        A /= np.sqrt(np.sum(A**2, 0))
    else:
        # sparse
        A = sps.rand(m, n, mu)
        A.data = np.random.randn(A.nnz)
        N = A.copy()
        N.data = N.data**2
        A = A*sps.diags([1 / np.sqrt(np.ravel(N.sum(axis=0)))], [0])

    return A

def create_classification(m, n, rho=1, mu=1, sigma=0.05):
    """Create a random classification problem."""
    A = normalized_data_matrix(m, n, mu)
    x0 = sps.rand(n, 1, rho)
    x0.data = np.random.randn(x0.nnz)
    x0 = x0.toarray().ravel()

    b = np.sign(A.dot(x0) + sigma*np.random.randn(m))
    return A, b

np.random.seed(0)
m = 1500
n = 50000
rho = 0.01
mu = 0.1

A, b = create_classification(m, n, rho = rho, mu = mu)

ratio = float(np.sum(b==1)) / len(b)
lambda_max = np.abs((1-ratio)*A[b==1,:].sum(axis=0) +
                    ratio*A[b==-1,:].sum(axis=0)).max()
lam = 0.5*lambda_max

x = cp.Variable(A.shape[1])


# Problem construction
prob = None
opt_val = None

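def logistic_loss(theta, X, y):
    """Sum over samples of log(1 + exp(-y_i * (X theta)_i))."""
    if not np.all(np.isin(y, [-1, 1])):
        raise ValueError("y must have binary labels in {-1,1}")
    # cp.sum and @ are the CVXPY >= 1.0 spellings of sum_entries and matrix *.
    return cp.sum(cp.logistic(-sps.diags([y], [0]) @ X @ theta))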

f = logistic_loss(x, A, b) + lam*cp.norm1(x)
prob = cp.Problem(cp.Minimize(f))


problemDict = {
    "problemID": "logreg_l1_0",
    "problem": prob,
    "opt_val": None
}

problems = [problemDict]

# For debugging individual problems:
if __name__ == "__main__":
    def printResults(problemID = "", problem = None, opt_val = None):
        print(problemID)
        problem.solve()
        print("\tstatus: {}".format(problem.status))
        print("\toptimal value: {}".format(problem.value))
        print("\ttrue optimal value: {}".format(opt_val))
    printResults(**problems[0])

('status:', 'optimal')
('optimal value:', 961.009369017684)
('true optimal value:', None)
