# Logistic Regression

Logistic regression models the log odds of a binary outcome as a linear function of one or more predictor variables.

Some definitions:
1. Given a probability $p$ of an outcome, the odds of that outcome are $\frac{p}{1-p}$. Note that with an observed set of outcomes, because of the shared denominator, this is equivalent to $\frac{f^+}{f^-}$, where $f^+$ is the number of cases with a positive outcome and $f^-$ the number of cases with a negative outcome.
1. The relative likelihood of an event for two different individuals can be summarized with the odds ratio: $\frac{\text{odds}_a}{\text{odds}_b} = \frac{p_a}{1-p_a} \left(\frac{p_b}{1-p_b}\right)^{-1} = \frac{(1-p_b)p_a}{(1-p_a)p_b} = \frac{f_b^- f_a^+}{f_a^- f_b^+}$.
1. For a $2 \times 2$ frequency table, this is easy to calculate:

In [5]:
import pandas as pd

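# Rows are the groups (x+, x-); columns are outcome counts (y+, y-).
two_way_table = pd.DataFrame(
    {'y+': [6, 162], 'y-': [13, 2343]}, index=['x+', 'x-']
)
# Odds within each row: positive cases divided by negative cases.
two_way_table['odds'] = two_way_table['y+']/two_way_table['y-']
# The odds ratio compares the x+ row to the x- row.
print("Odds ratio:", two_way_table.loc['x+', 'odds']/two_way_table.loc['x-', 'odds'])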

Odds ratio: 6.6752136752136755


Logistic regression allows us to estimate the effects of a vector of variables $X$ on the log odds, with the coefficients fitted by maximum likelihood estimation (MLE).
$$ \text{logit}(p) = \log \frac{p}{1-p} = \beta_0 + \beta X $$

Once the coefficients are estimated, the predicted log odds for a new observation $X'$ are $\hat y = \hat \beta_0 + \hat \beta X'$, and $e^{\hat y}$ is the estimated odds of the outcome.
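As a minimal sketch of that conversion, the snippet below turns a linear predictor into odds and then into a probability. The coefficient values and the new observation are hypothetical, chosen only for illustration; they do not come from any fitted model in these notes.

```python
import numpy as np

# Hypothetical fitted coefficients (illustrative values only).
beta0_hat = -2.0
beta_hat = np.array([0.5, 1.2])

# A hypothetical new observation X' with two features.
x_new = np.array([1.0, 0.0])

log_odds = beta0_hat + beta_hat @ x_new   # linear predictor, y-hat
odds = np.exp(log_odds)                   # estimated odds p / (1 - p)
prob = odds / (1.0 + odds)                # invert the odds to recover p

print(f"log odds = {log_odds:.3f}, odds = {odds:.3f}, p = {prob:.3f}")
```

The last step inverts the odds definition: if the odds are $o = \frac{p}{1-p}$, then $p = \frac{o}{1+o}$.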

For an independent variable of interest $X_j$, the coefficient $\beta_j$ represents the change in the log odds associated with changing it (by 1 unit if continuous, from negative to positive if binary), all else equal. Let $\hat y=\log \frac{p}{1-p}$ be the base predicted log odds and $\hat y'=\log \frac{p'}{1-p'}$ the prediction after the change; then $\hat \beta_j = \hat y' - \hat y$. By logarithm arithmetic, this difference is the log of the odds ratio associated with the change: $\hat y' - \hat y = \log \left( \frac{p'}{1-p'} \big/ \frac{p}{1-p} \right)$. Thus, to report the odds ratio associated with a change in one of the variables, all else equal, we need only calculate $e^{\hat \beta_j}$.
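To check this against the $2 \times 2$ table from earlier, the sketch below expands the counts into one row per case, fits a logistic regression with a single binary predictor, and exponentiates the slope. The fitting routine is a plain Newton-Raphson loop written for illustration, assuming only NumPy is available; it is not the method of any particular library.

```python
import numpy as np

# Counts from the two-way table, keyed by (x, y): x = 1 for 'x+', y = 1 for 'y+'.
counts = {(1, 1): 6, (1, 0): 13, (0, 1): 162, (0, 0): 2343}

# Expand to one row per observed case.
x = np.concatenate([np.full(n, xv) for (xv, yv), n in counts.items()])
y = np.concatenate([np.full(n, yv) for (xv, yv), n in counts.items()])

# Design matrix: an intercept column plus the binary predictor.
X = np.column_stack([np.ones_like(x, dtype=float), x])
beta = np.zeros(2)

# Newton-Raphson iterations that maximize the Bernoulli log-likelihood.
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # current fitted probabilities
    gradient = X.T @ (y - p)                       # score vector
    hessian = X.T @ (X * (p * (1.0 - p))[:, None]) # observed information
    step = np.linalg.solve(hessian, gradient)
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

odds_ratio = np.exp(beta[1])
print("Odds ratio from the fitted coefficient:", odds_ratio)
```

With a single binary predictor the model is saturated, so the fitted probabilities reproduce the observed proportions in each row, and $e^{\hat \beta_1}$ agrees with the odds ratio computed directly from the table.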

# Regularization

A problem faced by "big data" is that the available data tends to be "wide" rather than "tall": there are many measured variables relative to the number of observations. This is the reverse of the setting classical statistics assumes, where observations far outnumber parameters, and it makes unpenalized regression prone to overfitting. Regression performs best when the data are clustered around a meaningful "middle." Various methods of regularization (or "penalized regression") seek to counterbalance this. In these methods, what is minimized is not the error alone (the sum of squared residuals in linear regression, or the negative log-likelihood in logistic regression) but a cost function combining that error with a penalty on the size of the coefficients, weighted by a tuning parameter $\lambda$.
1. Ridge regression ($l_2$ penalty): the penalty term is the sum of the squared coefficients, $\lambda \sum \beta_j^2$. This is $\lambda$ times the squared Euclidean length of the parameter vector.
1. Lasso ($l_1$ penalty): the penalty term is the sum of the absolute values of the coefficients, $\lambda \sum |\beta_j|$. This is $\lambda$ times the Manhattan length of the parameter vector. The lasso pushes parameter vectors onto the "axes", setting some coefficients exactly to zero, whereas the penalty rate of ridge regression increases with the magnitude of the coefficients, so ridge reduces parameters *proportionately* without zeroing them. The lasso therefore yields a "sparser" parameter vector, which is useful insofar as it reduces the number of relevant features.
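The contrast between the two penalties can be sketched in one dimension. Assume a single coefficient $b$ with unpenalized estimate $z$ and a squared-error loss $\frac{1}{2}(b - z)^2$ (a simplification chosen for illustration, not the full regression problem). Adding the ridge penalty $\lambda b^2$ gives the minimizer $z / (1 + 2\lambda)$, while adding the lasso penalty $\lambda |b|$ gives the soft-thresholding rule. The function names and the example values below are hypothetical.

```python
import numpy as np

def l2_penalty(beta, lam):
    """Ridge penalty: lam times the sum of squared coefficients."""
    return lam * np.sum(beta ** 2)

def l1_penalty(beta, lam):
    """Lasso penalty: lam times the sum of absolute coefficients."""
    return lam * np.sum(np.abs(beta))

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*b**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

def lasso_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b|: soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalized estimates and penalty weight.
z = np.array([3.0, -1.5, 0.4, -0.2])
lam = 0.5

ridge = ridge_shrink(z, lam)
lasso = lasso_shrink(z, lam)

print("ridge:", ridge)   # every coefficient scaled by the same factor
print("lasso:", lasso)   # small coefficients are set exactly to zero
print("nonzero (ridge, lasso):", np.count_nonzero(ridge), np.count_nonzero(lasso))
```

Ridge divides every estimate by the same factor, so none becomes exactly zero. The lasso subtracts a fixed amount $\lambda$ from each magnitude, so any estimate with $|z| \le \lambda$ is zeroed out.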

The mathematical logic for this can be linked back to Stein's paradox, which shows that even when estimating the means of a set of three or more independent distributions, it is better, in terms of total expected squared error, to "shrink" the estimates towards the grand mean of the data rather than using only the observations for each one.
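A small simulation can illustrate the paradox. The sketch below assumes normally distributed observations with known variance and uses the positive-part James-Stein estimator that shrinks towards the grand mean (this variant needs at least four means). The true means, noise level, and number of trials are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                   # number of independent means
true_means = np.linspace(-1.0, 1.0, k)   # hypothetical true parameters
sigma = 1.0                              # known noise standard deviation
n_trials = 5000

# One noisy observation of each mean per trial (rows are trials).
obs = true_means + sigma * rng.standard_normal((n_trials, k))

# James-Stein: pull each observation towards the trial's grand mean.
grand_mean = obs.mean(axis=1, keepdims=True)
spread = np.sum((obs - grand_mean) ** 2, axis=1, keepdims=True)
shrink = np.maximum(1.0 - (k - 3) * sigma**2 / spread, 0.0)  # positive part
js = grand_mean + shrink * (obs - grand_mean)

# Average total squared error per trial for each estimator.
raw_error = np.mean(np.sum((obs - true_means) ** 2, axis=1))
js_error = np.mean(np.sum((js - true_means) ** 2, axis=1))

print(f"unshrunk estimates: {raw_error:.2f}")
print(f"James-Stein:        {js_error:.2f}")
```

The unshrunk error should come out near $k \sigma^2 = 10$, with the shrunken estimates lower on average, even though the ten means were generated independently of one another.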