# Logistic Regression/Softmax Regression

## Model Specification

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled in terms of $K-1$ **log-odds** or **logit transformations**.

\begin{align}
\log\frac{P(G=k|X=x)}{P(G=K|X=x)} = \beta_{k0} + \beta_{k}^{\top}x,
\end{align}

where class $K$ serves as the base class (there are $K$ classes in total) and $k=1,\dots, K-1$. The above leads to the following expressions for the probabilities

\begin{align}
P(G=k|X=x) &= \frac{\exp(\beta_{k0} + \beta_{k}^{\top}x)}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_{l}^{\top}x)},\;k=1,\dots,K-1\\
P(G=K|X=x) &= \frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_{l}^{\top}x)}.
\end{align}
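
The two probability expressions above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `class_probabilities` and the parameter values are hypothetical, chosen only to show that treating the base class $K$ as having log-odds $0$ reproduces both formulas.

```python
import numpy as np

def class_probabilities(x, intercepts, coefs):
    """Return P(G=k | X=x) for k = 1..K, with class K as the base class.

    intercepts: shape (K-1,), the beta_{k0}
    coefs:      shape (K-1, p), row k holds beta_k
    """
    logits = intercepts + coefs @ x      # K-1 log-odds against class K
    logits = np.append(logits, 0.0)      # the base class has log-odds 0
    logits = logits - logits.max()       # stabilise exp; ratios are unchanged
    expd = np.exp(logits)
    return expd / expd.sum()

# Hypothetical parameters for K = 3 classes and p = 2 predictors.
intercepts = np.array([0.5, -1.0])
coefs = np.array([[1.0, -2.0],
                  [0.3, 0.8]])
x = np.array([0.2, -0.1])

probs = class_probabilities(x, intercepts, coefs)
log_odds_vs_base = np.log(probs[:-1] / probs[-1])   # recovers beta_{k0} + beta_k^T x
```

Taking the log of each probability relative to the last one recovers the linear log-odds of the model specification.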

The log likelihood is given by $l(\beta)=\sum_{i=1}^N\log p_{g_i}(x_i; \beta)$, where $p_k(x_i;\beta)=P(G=k|X=x_i;\beta)$. Parameters are learned via maximum likelihood, where the optimization is done via (variants of) **gradient descent or Newton's method**. A numerical method is needed since there is no closed-form solution to the MLE. Since the log-likelihood is concave (equivalently, the negative log-likelihood is convex), gradient descent with a suitable step size converges to a global optimum whenever the maximum is attained.

We discuss the case $K=2$ in more detail here. Code the response as $y_i=1$ when observation $i$ belongs to class 1 and $y_i=0$ otherwise, and let $\beta$ include the intercept (with a constant $1$ added to $x_i$). Although it does not look like it at first glance, the log likelihood defined above reduces to the usual log loss that one may be more familiar with in the two-class case:
\begin{align}
l(\beta) &= \sum_{i=1}^N\left(y_i\log p_1(x_i; \beta) + (1 - y_i)\log (1 - p_1(x_i; \beta))\right)\\
&= \sum_{i=1}^N \left(y_i\beta^{\top}x_i - \log(1 + e ^{\beta^{\top}x_i})\right)
\end{align}
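
As a quick numerical sanity check of the two-class reduction, the sketch below evaluates both forms of the log likelihood on synthetic data. The data, the random seed and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Design matrix with a leading column of ones so that beta includes the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)          # arbitrary parameter vector
y = rng.integers(0, 2, size=N)         # arbitrary 0/1 labels

eta = X @ beta                         # linear predictor beta^T x_i
p1 = 1.0 / (1.0 + np.exp(-eta))        # P(G=1 | X=x_i)

# Log-loss form: y log p + (1 - y) log(1 - p)
loglik_logloss = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Compact form: y * eta - log(1 + exp(eta))
loglik_compact = np.sum(y * eta - np.log1p(np.exp(eta)))
```

The two values agree up to floating-point error for any choice of `beta`, which is what the algebraic identity predicts.
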
In vector form, the **first- and second-order derivatives** are given by
\begin{align}
\frac{\partial l(\beta)}{\partial\beta} &= \mathbf{X}^{\top}(\mathbf{y}-\mathbf{p})\\
\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^{\top}} &= -\mathbf{X}^{\top}\mathbf{W}\mathbf{X}
\end{align}
where $\mathbf{X}\in \mathbb{R}^{N\times(p+1)}$ is the predictor matrix with a column of ones for the intercept, $\mathbf{y}$ is the vector of $y_i$, $\mathbf{p}$ is the vector of fitted probabilities $p(x_i;\beta)$, and $\mathbf{W}\in \mathbb{R}^{N\times N}$ is the diagonal matrix of weights whose $i$-th diagonal element is $p(x_i;\beta)(1-p(x_i;\beta))$. In particular, **given $X$, the Fisher information** matrix, defined to be $-E_{\beta}\left(\frac{\partial^2 l(\beta)}{\partial\beta\partial\beta^{\top}}\right)$, is just $\mathbf{X}^{\top}\mathbf{W}\mathbf{X}$, because the Hessian does not depend on $\mathbf{y}$. See this [lecture note](https://www.evernote.com/shard/s191/nl/21353936/27f42ac1-2573-b4e3-8466-fff06dce4eac?title=Untitled) for details of the formula.

The Newton step built from the two derivatives above is equivalent to fitting a weighted least squares regression on an adjusted response (iteratively reweighted least squares), but the details are probably not worth spelling out here.
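
For readers who want to see the Newton step in action, here is a minimal NumPy sketch for the two-class case. It assumes 0/1 labels and a design matrix that already contains a column of ones; the function name `fit_logistic_newton`, the synthetic data and the "true" coefficients are all hypothetical. Each update $\beta \leftarrow \beta + (\mathbf{X}^{\top}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\top}(\mathbf{y}-\mathbf{p})$ is algebraically the weighted least squares fit of the adjusted response $\mathbf{z}=\mathbf{X}\beta+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$ with weights $\mathbf{W}$.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Maximise the binary log-likelihood with Newton's method.

    X must already contain a column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)                       # diagonal of W
        gradient = X.T @ (y - p)                # X^T (y - p)
        neg_hessian = X.T @ (X * w[:, None])    # X^T W X
        step = np.linalg.solve(neg_hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic, non-separable data generated from hypothetical coefficients.
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logistic_newton(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
score = X.T @ (y - p_hat)                       # gradient at the solution
```

At convergence the gradient `score` is numerically zero, which is the first-order condition for the MLE. The sketch has no safeguards (step halving, handling of separable data), so it should not be read as production code.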

### Variants and Generalizations

- L1 penalization: $l(\beta_0, \beta)=\sum_{i=1}^N\log p_{g_i}(x_i; \beta_0, \beta)-\lambda |\beta|^{\top}1$.
- L2 penalization: $l(\beta_0, \beta)=\sum_{i=1}^N\log p_{g_i}(x_i; \beta_0, \beta)-\frac{1}{2}\lambda \beta^{\top}\beta$.

Note that in the above, we typically do not penalize the intercept term $\beta_0$. There is also an equivalent specification in terms of a parameter $C=\frac{1}{\lambda}$, as in `sklearn`.
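
The two penalized objectives can be written out directly for the two-class case. The sketch below is illustrative only: `penalized_loglik` and the toy numbers are hypothetical, and it simply evaluates the formulas above, leaving the intercept $\beta_0$ out of the penalty.

```python
import numpy as np

def penalized_loglik(beta0, beta, X, y, lam, penalty="l2"):
    """Binary log-likelihood minus an L1 or L2 penalty on beta (not on beta0)."""
    eta = beta0 + X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # log(1 + e^eta), computed stably
    if penalty == "l1":
        return loglik - lam * np.sum(np.abs(beta))
    if penalty == "l2":
        return loglik - 0.5 * lam * (beta @ beta)
    raise ValueError("penalty must be 'l1' or 'l2'")

# Toy data and parameters, chosen only to make the arithmetic easy to follow.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta0, beta = 0.3, np.array([1.0, -2.0])

unpenalized = penalized_loglik(beta0, beta, X, y, lam=0.0)
with_l1 = penalized_loglik(beta0, beta, X, y, lam=2.0, penalty="l1")   # subtracts 2 * (1 + 2)
with_l2 = penalized_loglik(beta0, beta, X, y, lam=2.0, penalty="l2")   # subtracts 0.5 * 2 * (1 + 4)
```

Dividing either objective by $\lambda$ does not change its maximizer, which is why the same problem can be stated with a weight $C=\frac{1}{\lambda}$ on the log-likelihood term instead.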

There are nonlinear optimization methods that solve the penalized versions of logistic regression; see Further Reading for references. Path algorithms similar to LAR are more difficult, however, because the coefficient profiles are piecewise smooth rather than piecewise linear; see Section 4.4.4 of ESL for more details.

## Theoretical Properties

### Advantages

- As with linear regression, logistic regression models are used mostly as a data analysis and inference tool, when **the goal is to understand the role of input variables in explaining the outcome**. More specifically, as in linear regression, one can compute the Z score of each coefficient as its estimated value divided by its standard error, and use the Z score to judge the significance of the predictor. Although the standard errors do not seem to be exposed anywhere in `sklearn`'s attributes or methods, they can probably be approximated from the second derivative of the log likelihood, i.e. from the inverse of the Fisher information.

- The MLE $\hat{\beta}$ is **consistent** and **asymptotically normal** under standard regularity conditions, which provides a theoretical foundation for inference.

- There are **theoretically sound procedures to perform variable selection**. For instance, one can do the likelihood ratio test (or analysis of deviance, or Rao's score test), and the deviance can be approximated efficiently by the Pearson chi-square statistic
\begin{align}
\sum_{i=1}^N\frac{(y_i-\hat{p}_i)^2}{\hat{p}_i(1-\hat{p}_i)}
\end{align}
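
The Z scores mentioned in the first bullet can be approximated by hand from the Fisher information $\mathbf{X}^{\top}\mathbf{W}\mathbf{X}$. The following is a sketch under stated assumptions: an unpenalized two-class fit obtained with a few plain Newton steps, synthetic data, and hypothetical "true" coefficients in which the last predictor is irrelevant. It also evaluates the deviance and the Pearson statistic from the last bullet, without claiming how close they should be on ungrouped binary data.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # intercept + 2 predictors
true_beta = np.array([0.3, 1.5, 0.0])                        # hypothetical coefficients
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

# Unpenalized MLE via plain Newton steps (no safeguards; illustration only).
beta_hat = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    info = X.T @ (X * (p * (1.0 - p))[:, None])              # X^T W X
    beta_hat = beta_hat + np.linalg.solve(info, X.T @ (y - p))

# Asymptotic standard errors: square roots of the diagonal of (X^T W X)^{-1}.
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
info = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])
std_err = np.sqrt(np.diag(np.linalg.inv(info)))
z_scores = beta_hat / std_err

# Deviance and the Pearson chi-square statistic for the fitted model.
deviance = -2.0 * np.sum(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))
pearson = np.sum((y - p_hat) ** 2 / (p_hat * (1.0 - p_hat)))
```

A large absolute Z score (as for the second coefficient here) suggests the predictor matters. Note that this calculation only applies to an unpenalized fit, whereas `sklearn`'s `LogisticRegression` regularizes by default.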

### Disadvantages

- Again similar to linear regression, logistic regression suffers from predictor collinearity.

### Relation to Other Models

**Logistic Regression vs. LDA**

Both logistic regression and LDA produce linear decision boundaries, but they differ in how the parameters are estimated.

- For LDA, the parameters are estimated by maximizing the full likelihood. It has stronger model assumptions: by relying on the additional model assumptions, we have more information about the parameters and hence can estimate them more efficiently (lower variance). In fact, if the true class-conditional densities $f_k(x)$ are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of **about $30\%$ asymptotically in error rate**. The marginal likelihood can be thought of as a regularizer. Given its stronger assumptions, LDA can even be used to do unsupervised learning, while logistic regression cannot. On the other hand, LDA is not robust to gross outliers.

- For logistic regression, the parameters are estimated by maximizing the conditional likelihood without taking into account the distribution of $X$. As such, it is less sensitive to outliers and more robust. Naturally, when the data is indeed non-Gaussian, asymptotically logistic regression does better than LDA.

Empirically, the performance of these two models can be very similar, even when LDA is used inappropriately, such as with qualitative predictors.

**Generalized Linear Model (GLM)**

Logistic regression and softmax regression are special cases of the GLM. More specifically, the response is modeled by a distribution in the exponential family
\begin{align}
p(y|x, \beta) = p(y|\eta) = b(y)\exp(\eta^{\top}T(y)-a(\eta)),
\end{align}
where $T(y)$ is called the sufficient statistic. More often than not, $T(y)=y$.

We then try to predict the expected sufficient statistic
\begin{align}
h_{\beta}(x) = E(T(y)|x,\beta)
\end{align}

Finally, we specify a linear form for the natural parameter:
\begin{align}
\eta = x^{\top}\beta
\end{align}

Note that the gradient descent update rules for GLMs all look similar.
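
As a worked instance of this recipe, take a Bernoulli response. The symbol $\phi$ for the Bernoulli mean is introduced here for illustration and is not used elsewhere in these notes. Rewriting the probability mass function in exponential-family form identifies each ingredient, and the three steps above recover the logistic model:

\begin{align}
p(y;\phi) &= \phi^{y}(1-\phi)^{1-y} = \exp\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right),\\
\eta &= \log\frac{\phi}{1-\phi},\quad T(y)=y,\quad a(\eta)=-\log(1-\phi)=\log(1+e^{\eta}),\quad b(y)=1,\\
h_{\beta}(x) &= E(y|x,\beta) = \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-x^{\top}\beta}}.
\end{align}

So the natural parameter of the Bernoulli distribution is exactly the log-odds, and the linear specification $\eta=x^{\top}\beta$ gives the sigmoid form of the fitted probability.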

## Empirical Performance

### Advantages and Disadvantages

## Implementation Details and Practical Tricks

**`LogisticRegression` in `sklearn`**

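```python
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(penalty='l2',          # norm used in the penalty term
                        dual=False,            # solve the primal formulation
                        C=1.0,                 # inverse of regularization strength
                        fit_intercept=True,    # add a constant term to the decision function
                        intercept_scaling=1,   # only relevant for the 'liblinear' solver
                        solver='liblinear',    # coordinate descent via LIBLINEAR
                        multi_class='ovr')     # one-versus-rest for multiclass problems
```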

**Selected Parameters**

- `penalty`: `'l1'` or `'l2'`,

    Used to specify the norm used in the penalization. The `'newton-cg'`, `'sag'` and `'lbfgs'` solvers support only `'l2'` penalties. New in version 0.19: `'l1'` penalty with the `'saga'` solver (allowing `'multinomial'` + L1).

- `dual`:

    Dual or primal formulation. Dual formulation is only implemented for `'l2'` penalty with the `'liblinear'` solver. Prefer `dual=False` when `n_samples > n_features`.

- `C`:

    Inverse of regularization strength: $C=\frac{1}{\lambda}$; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

- `fit_intercept`:

    Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

- `intercept_scaling`:

    Useful only when the solver `'liblinear'` is used and `self.fit_intercept` is set to `True`. In this case, `x` becomes `[x, self.intercept_scaling]`, i.e. a synthetic feature with constant value equal to `intercept_scaling` is appended to the instance vector. The intercept becomes `intercept_scaling * synthetic_feature_weight`. Note that the synthetic feature weight **is subject to l1/l2 regularization as all other features**. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), `intercept_scaling` has to be increased.

- `solver` : `{'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}`

    - The solver `'liblinear'` uses a coordinate descent algorithm, relying on the [C++ LIBLINEAR library](https://www.csie.ntu.edu.tw/~cjlin/liblinear/).
    - For small datasets, `'liblinear'` is a good choice, whereas `'sag'` and `'saga'` are faster for large ones.
    - For multiclass problems, only `'newton-cg'`, `'sag'`, `'saga'` and `'lbfgs'` handle multinomial loss. That is, setting `multi_class` to `'multinomial'` with these solvers learns a true multinomial logistic regression, which means that its probability estimates should be better calibrated than the default `ovr` setting. `'liblinear'` is limited to one-versus-rest schemes.
    - `'newton-cg'`, `'lbfgs'` and `'sag'` only handle `l2` penalty, whereas `'liblinear'` and `'saga'` handle `l1` penalty.
    - `'newton-cg'`, `'lbfgs'` and `'sag'` are found to converge faster for some high dimensional data. Note that the fast convergence of `'sag'` and `'saga'` is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from `sklearn.preprocessing`.
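
The notes on `C` and on feature scaling for `'saga'` can be illustrated with a small experiment. This is a sketch on synthetic data, not a benchmark: the data, the feature scales, the values of `C` and the "true" coefficients are all assumptions, and the default penalty is assumed to be L2. Features are standardized in a pipeline before the solver sees them, and the norm of the fitted coefficients is recorded for each `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 4))
true_coef = np.array([2.0, -1.0, 0.0, 0.5])        # hypothetical coefficients
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-(Z @ true_coef)))).astype(int)

# Same signal, but the raw features live on very different scales.
X = Z * np.array([1.0, 10.0, 100.0, 0.1])

coef_norms = {}
for C in (0.001, 0.1, 10.0):
    model = make_pipeline(
        StandardScaler(),                           # put features on a common scale
        LogisticRegression(C=C, solver="saga", max_iter=5000, random_state=0),
    )
    model.fit(X, y)
    fitted = model.named_steps["logisticregression"]
    coef_norms[C] = float(np.linalg.norm(fitted.coef_))
```

Smaller `C` means stronger regularization, so the coefficient norm grows as `C` increases. Keeping the scaler inside the pipeline ensures the same transformation is applied at prediction time.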


**Selected Attributes**

- `coef_` : shape `(1, n_features)` or `(n_classes, n_features)`

    Coefficient of the features in the decision function. `coef_` is of shape `(1, n_features)` when the given problem is binary.
    

- `intercept_` : shape `(1,)` or `(n_classes,)`

    Intercept (a.k.a. bias) added to the decision function. If `fit_intercept` is set to `False`, the intercept is set to zero. `intercept_` is of shape `(1,)` when the problem is binary.

**Selected Methods**

- `decision_function(X)`: returns `shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)`

    Predict confidence scores for samples. The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.
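
To connect the attributes and `decision_function` for a binary problem, the sketch below checks on synthetic data that the score is the linear predictor $\beta_0+\beta^{\top}x$ built from `intercept_` and `coef_`, and that the predicted probability of the positive class is its sigmoid. The data and variable names are assumptions for illustration, and the estimator is used with its default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Noisy linear rule so that the two classes overlap.
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

scores = clf.decision_function(X)                     # shape (n_samples,) for two classes
manual = X @ clf.coef_.ravel() + clf.intercept_[0]    # beta^T x + beta_0
proba = clf.predict_proba(X)[:, 1]                    # P(class 1 | x)
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))

# Dividing by the coefficient norm turns the score into a geometric signed distance.
signed_distance = scores / np.linalg.norm(clf.coef_)
```

Here `scores` matches `manual` and `proba` matches `sigmoid_of_scores`, so the decision boundary is the hyperplane where the score is zero and the predicted probability is one half.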

## Use Cases

## Results Interpretation, Metrics and Visualization

## References

- [ESL](https://www.evernote.com/shard/s191/nl/21353936/c2a0e9ac-da49-4fee-8701-3cd70fc42134?title=The%20Elements%20of%20Statistical%20Learning_print12.pdf) Chapters 4, 18.
- [`sklearn` document 1.1](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

### Further Reading

- LIBLINEAR – [A Library for Large Linear Classification](http://www.csie.ntu.edu.tw/~cjlin/liblinear/)
- SAG – Mark Schmidt, Nicolas Le Roux, and Francis Bach
    [Minimizing Finite Sums with the Stochastic Average Gradient](https://hal.inria.fr/hal-00860051/document)
- SAGA – Defazio, A., Bach F. & Lacoste-Julien S. (2014).
    [SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives](https://arxiv.org/abs/1407.0202)
- Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). [Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75.](http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf)

## Misc.