# Logistic Regression

Logistic regression models the log odds of a binary outcome as a linear function of other factors.

Some definitions:
1. Given a probability $p$ of an outcome, the odds of that outcome are $\frac{p}{1-p}$. Note that with an observed set of outcomes, because of the shared denominator, this is equivalent to $\frac{f^+}{f^-}$, where $f^+$ is the number of cases with a positive outcome and $f^-$ the number of cases with a negative outcome.
1. The relative likelihood of an event for two different individuals can be summarized with the odds ratio: $\frac{\text{odds}_a}{\text{odds}_b} = \frac{p_a}{1-p_a} \left(\frac{p_b}{1-p_b}\right)^{-1} = \frac{(1-p_b)p_a}{(1-p_a)p_b} = \frac{f_b^- f_a^+}{f_a^- f_b^+}$.
1. For a $2 \times 2$ frequency table, this is easy to calculate:

In [5]:
import pandas as pd

two_way_table = pd.DataFrame(
    {'y+': [6, 162], 'y-': [13, 2343]}, index=['x+', 'x-']
)
two_way_table['odds'] = two_way_table['y+']/two_way_table['y-']
print("Odds ratio:", two_way_table.loc['x+', 'odds']/two_way_table.loc['x-', 'odds'])

Odds ratio: 6.6752136752136755


Logistic regression allows us to estimate the impacts of a vector of variables $X$ using maximum likelihood estimation (MLE).
$$ \text{logit}(p) = \log \frac{p}{1-p} = \beta_0 + \beta X $$

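Once the coefficients are estimated, the prediction for a new $X'$ is $\hat y = \hat \beta_0 + \hat \beta X'$; $e^{\hat y}$ is then the estimated odds of the outcome, and $\frac{e^{\hat y}}{1+e^{\hat y}}$ the estimated probability.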

For an independent variable of interest $X_j$, the coefficient $\beta_j$ represents the change in the log odds associated with changing it (by 1 unit if continuous, from negative to positive if binary). Let $\hat y=\log \frac{p}{1-p}$ be the base predicted log odds and $\hat y'=\log \frac{p'}{1-p'}$ the new prediction; then $\hat \beta_j = \hat y' - \hat y$. By logarithm arithmetic, $\hat y' - \hat y = \log \left( \frac{p'}{1-p'} \left(\frac{p}{1-p}\right)^{-1} \right)$, which is the log of the odds ratio associated with the change in the variable. Thus, to report the odds ratio associated with a change in one of the variables, all else equal, we need only calculate $e^{\hat \beta_j}$.
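
As a check on this interpretation, the sketch below fits a logistic regression with a single binary predictor to the $2 \times 2$ table above. It uses a hand-rolled Newton-Raphson loop in NumPy (an assumption made to keep the example self-contained; any maximum likelihood routine should agree), and the variable names are illustrative.

```python
import numpy as np

# Cell counts from the 2x2 table: (x, y, number of cases).
cells = [(1, 1, 6), (1, 0, 13), (0, 1, 162), (0, 0, 2343)]

# Expand the table into one row per observation.
x = np.concatenate([np.full(n, xv, dtype=float) for xv, _, n in cells])
y = np.concatenate([np.full(n, yv, dtype=float) for _, yv, n in cells])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus x

# Newton-Raphson iterations on the logistic log-likelihood.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
    gradient = X.T @ (y - p)                       # score vector
    hessian = X.T @ (X * (p * (1 - p))[:, None])   # information matrix
    beta = beta + np.linalg.solve(hessian, gradient)

odds_ratio = np.exp(beta[1])
print("Estimated odds ratio:", odds_ratio)
```

With a single binary predictor the model is saturated, so the estimate should match the odds ratio computed directly from the table (about 6.675).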

# Regularization

A problem faced by "big data" is that the available data tends to be "wide" rather than "tall": there are lots of measures relative to the number of observations. This is the inverse of the classical assumption of statistics, where observations far outnumber parameters. Regression performs best when data is clustered around a meaningful "middle"; with many coefficients and few observations it can instead fit noise. Various methods of regularization (or "penalized regression") seek to counterbalance this. In these methods, what is minimized is not the sum of squared residuals alone but a cost function combining the error with some penalty for overfitting.
1. Ridge regression ($l_2$ penalty): the penalty term is the sum of the squared coefficients, $\lambda \sum \beta_j^2$. This is equivalent to the squared Euclidean length of the parameter vector, scaled by $\lambda$.
1. Lasso ($l_1$ penalty): the penalty term is the sum of the absolute values of the coefficients, $\lambda \sum |\beta_j|$. This is equivalent to the Manhattan length of the parameter vector, scaled by $\lambda$. The lasso has the effect of pushing parameter vectors onto the "axes," setting some coefficients exactly to zero, whereas the penalty rate of ridge regression increases with the magnitude of coefficients and reduces parameters *proportionately*. The lasso therefore results in a "sparser" parameter vector, which is useful insofar as it allows for reducing the number of relevant features.
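
The different shrinkage behaviour of the two penalties can be seen in closed form in a simplified case. The sketch below assumes an orthonormal design ($X^TX = I$), which is not assumed in the text above. Under that assumption, minimizing the squared error plus the penalty has a per-coefficient solution: ridge divides each least-squares coefficient by $1+\lambda$, while lasso soft-thresholds it at $\lambda/2$. The coefficient values and $\lambda$ are made up for illustration.

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    """Ridge solution when X^T X = I: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    """Lasso solution when X^T X = I: soft-thresholding at lam / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])  # illustrative values
lam = 1.0

beta_ridge = ridge_orthonormal(beta_ols, lam)
beta_lasso = lasso_orthonormal(beta_ols, lam)

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)   # every coefficient halved, none exactly zero
print("Lasso:", beta_lasso)   # small coefficients set exactly to zero
```

In this example ridge keeps all five coefficients at half their size, while lasso keeps only the two largest.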

The mathematical logic for shrinking estimates in this way can be linked back to Stein's paradox, which shows that even when estimating the means of a set of three or more independent distributions, it is better (in terms of total expected squared error) to "shrink" the estimates towards the grand mean of the data rather than using only the observations for each type.
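
A small simulation can illustrate the paradox. The sketch below assumes each mean is observed once, as $x_i \sim N(\theta_i, 1)$ with known unit variance, and uses the James-Stein estimator that shrinks towards the grand mean. The number of means, the seed, and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10                              # number of independent means
theta = rng.normal(0.0, 1.0, k)     # true means, held fixed across trials
n_trials = 5000

sq_err_raw = 0.0
sq_err_js = 0.0
for _ in range(n_trials):
    x = theta + rng.normal(0.0, 1.0, k)      # one noisy observation per mean
    grand_mean = x.mean()
    spread = np.sum((x - grand_mean) ** 2)
    shrink = 1.0 - (k - 3) / spread          # James-Stein factor (unit variance)
    js = grand_mean + shrink * (x - grand_mean)
    sq_err_raw += np.sum((x - theta) ** 2)
    sq_err_js += np.sum((js - theta) ** 2)

mse_raw = sq_err_raw / n_trials
mse_js = sq_err_js / n_trials
print("Total squared error, raw observations:", mse_raw)
print("Total squared error, James-Stein:     ", mse_js)
```

The shrunken estimates should show a lower total squared error than the raw observations, even though the means were generated independently.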

# Decision Trees

A decision tree classifies observations through a series of binary decision points (e.g. $X_1\le0$). A decision tree is fit by recursively considering, for a group of points, which split on which single feature would result in two groups with the highest weighted average "purity" in terms of labels. Purity (or, in practice, its absence) can be measured in different ways, including entropy ($-\sum_j p_j \log p_j$) and Gini impurity ($1-\sum_j p_j^2$), where $p_j$ is the share of a label $j$ in a cell.
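
The two impurity measures and the split search can be sketched as follows. This is a minimal illustration for a single numeric feature, not an optimized implementation; the function names and toy data are hypothetical.

```python
import numpy as np

def label_shares(labels):
    """Share p_j of each label j in a cell."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = label_shares(labels)
    return -np.sum(p * np.log(p))

def gini(labels):
    p = label_shares(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels, impurity=gini):
    """Threshold t minimizing the weighted average impurity of
    the two groups feature <= t and feature > t."""
    best_t, best_score = None, np.inf
    for t in np.unique(feature)[:-1]:   # the largest value would leave one side empty
        left, right = labels[feature <= t], labels[feature > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x1 = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y), entropy(y))   # impurity of the unsplit group
print(best_split(x1, y))     # threshold and weighted impurity after the split
```

A full tree would apply `best_split` to every feature, keep the best one, and then recurse into each of the two resulting groups.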

When compared to linear methods like Perceptron or Logistic Regression, the main advantages of Decision Trees are:
1. Fast fitting and prediction
1. Indifference to scale of features and to numerical vs. categorical features
1. Interpretability of decision points (as opposed to $\theta$ vectors)

The main disadvantages are:
1. All decision boundaries are "orthogonal" (axis-aligned), based on the value of one feature at a time, as opposed to the diagonal or even interacted relationships captured by linear models.
1. Strong tendency towards over-fitting.

That is to say, with unlimited depth, a decision tree can always perfectly classify training data, though this can result in a very high-variance classifier (i.e. with lots of very small cells). To improve generalizability, it can help to:
1. Limit the depth of the tree.
1. Limit the minimal size of a leaf.
1. Fit multiple trees and select the majority result (an example of ensemble learning known as a **random forest** in its full form). In general, this is a useful technique for high-variance models.

With mixed leaves or multiple trees, the model can either output a maximum likelihood label or probabilities. Decision trees or random forests can also be adapted to regression problems by outputting the average of values placed in a leaf based on minimizing some loss function.

A random forest needs to introduce some randomness (since the naive fitting algorithm is deterministic), including:
1. Bootstrap resampling (with replacement) the data for each tree (a technique known as **bagging**)
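1. Considering only a random subset of features (as a rule of thumb, $\sqrt K$ of them) for each decision point. Descriptions vary on whether the subset is drawn per decision point or per tree; in the standard random forest algorithm, and in scikit-learn's `max_features`, it is redrawn at each split.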
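
The two sources of randomness can be sketched with NumPy alone. The data shape, seed, and variable names below are illustrative assumptions, and the tree fitting itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 16

# Bagging: each tree sees a bootstrap sample, drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_share = len(np.unique(bootstrap_idx)) / n_samples
print("Share of distinct observations in the sample:", unique_share)

# Feature subsampling: only sqrt(K) randomly chosen features are candidates.
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print("Candidate features:", candidate_features)
```

On average a bootstrap sample contains about 63% of the distinct observations ($1 - 1/e$), so each tree is fit on a somewhat different dataset.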

Random Forests can achieve classification performance in the ballpark of much more complicated neural networks, but are much easier to set up and even parallelize because of the independence of the models.

**Bagging** as a method for ensemble learning can be used for other kinds of models as well. It can also be augmented by assigning weights to each model.

# Boosting

Boosting is an iterative ensemble method. It does not just average different models but also learns weights for the observations and models. Generally speaking, *misclassified* cases in a given modeling iteration are given greater weight in the next model fit, while models are weighted inversely to their error. Two key hyperparameters are the complexity of the trees (often a single-split "stump" works surprisingly well) and the "learning rate," which controls how strongly the weights respond.

For example, **Adaptive Boosting** (AdaBoost) for binary classification begins with equal weights for each observation. At each iteration $m$, it fits a classifier and calculates a weighted average error rate $\text{error}_m$. It then sets the weight for the model to $a_m = \log \frac {1-\text{error}_m} {\text {error}_m}$ and re-weights the observations as $w_i' = w_i e^{a_m (1-z_i)}$, where $z_i$ is 1 for a correct prediction, 0 otherwise. The output of the ensemble is then just the sign of the weighted average of model predictions (with labels coded as $\pm 1$).
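
These update rules can be sketched directly in NumPy. The example below is a minimal illustration that uses exhaustively searched decision stumps as the weak classifier and labels coded as $\pm 1$; the helper names, toy data, and number of rounds are assumptions for illustration, and the weights are renormalized after each round.

```python
import numpy as np

def stump_predict(X, j, t, s):
    """Predict s where feature j is <= t, and -s otherwise."""
    return np.where(X[:, j] <= t, s, -s)

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                err = np.sum(w * (stump_predict(X, j, t, s) != y)) / np.sum(w)
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, n_rounds=30):
    w = np.full(len(y), 1.0 / len(y))          # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0)
        a = np.log((1 - err) / err)            # model weight a_m
        z = (stump_predict(X, j, t, s) == y).astype(float)  # 1 if correct
        w = w * np.exp(a * (1 - z))            # up-weight the mistakes
        w = w / w.sum()
        ensemble.append((a, j, t, s))
    return ensemble

def adaboost_predict(X, ensemble):
    total = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.sign(total)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # diagonal boundary, labels +/-1
ensemble = adaboost(X, y)
train_accuracy = np.mean(adaboost_predict(X, ensemble) == y)
print("Training accuracy:", train_accuracy)
```

The toy boundary is diagonal, so no single stump can fit it well; the weighted combination of stumps approximates it with a staircase.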

**Gradient Boosting** similarly fits models sequentially, but instead of weighting the observations, each model after the first fits the *residuals* ($y- \hat y$) left by the models fit so far. The final prediction is then the sum of the predictions of all of the models, often with each added model's contribution shrunk by a learning rate.
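
A minimal sketch of this residual-fitting loop for a regression problem follows. It assumes squared-error loss, a constant as the first model, single-split regression stumps, and a fixed learning rate; the function names and toy data are illustrative.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Single-split stump minimizing squared error on the targets r."""
    best = (np.inf, None, 0.0, 0.0)
    for t in np.unique(x)[:-1]:               # keep both sides non-empty
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def stump_predict(x, t, left_value, right_value):
    return np.where(x <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    baseline = y.mean()                       # first model: a constant
    prediction = np.full(len(y), baseline)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction            # what the ensemble still misses
        stump = fit_regression_stump(x, residuals)
        prediction = prediction + learning_rate * stump_predict(x, *stump)
        stumps.append(stump)
    return baseline, stumps

def boosted_predict(x, baseline, stumps, learning_rate=0.1):
    return baseline + learning_rate * sum(stump_predict(x, *s) for s in stumps)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)

baseline, stumps = gradient_boost(x, y)
mse_start = np.mean((y - baseline) ** 2)
mse_end = np.mean((y - boosted_predict(x, baseline, stumps)) ** 2)
print("MSE of constant model:", mse_start)
print("MSE after boosting:   ", mse_end)
```

For squared-error loss the residuals are the negative gradient of the loss with respect to the predictions, which is where the method gets its name.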

Both techniques are implemented in `sklearn.ensemble` as the model classes `AdaBoostClassifier`, `AdaBoostRegressor`, `GradientBoostingClassifier`, and `GradientBoostingRegressor`. Note that AdaBoost by default is based on decision trees, but any model that supports sample weights can be passed as `estimator` (named `base_estimator` before scikit-learn 1.2), while gradient boosting is implemented only with decision trees.
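
A short usage sketch of the two classifier classes follows. The synthetic dataset and the hyperparameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, split into train and test sets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, max_depth=1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # held-out accuracy
    print(name, scores[name])
```

Setting `max_depth=1` makes the gradient boosting trees single-split stumps, which matches the default weak learner used by AdaBoost.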

Boosting methods are powerful, but they can overfit as well because they "chase" hard-to-fit cases. Compared to random forest, boosting is slower because it must be fit iteratively rather than in parallel, and it is more sensitive to its hyperparameters. On the other hand, boosting methods aid interpretability by outputting feature importances.



## Vectorization

Textual data needs to be transformed into data-points, i.e. feature vectors, to be used in machine learning. There are various ways to do this:
1. The "bag of words": indicating or counting the appearances of words in each text, so that there is one feature for each word in the dictionary (perhaps excluding certain "stop words" that do not convey much information).

Note that vectorization, like any feature transformation, represents some kind of information input to the model, so it should be fit (e.g. the dictionary of words chosen) using only training data.
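
A bag-of-words vectorizer can be sketched in plain Python. The stop-word list, the whitespace tokenizer, and the example texts below are simplified assumptions for illustration; the dictionary is built from the training texts only, so unseen words in new texts are ignored.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}   # tiny illustrative stop-word list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def fit_vocabulary(train_texts):
    """Map each word seen in the training texts to a feature index."""
    words = sorted({w for text in train_texts for w in tokenize(text)})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocabulary):
    """Count vector; words outside the training vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(w, 0) for w in vocabulary]

train_texts = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = fit_vocabulary(train_texts)
print(vocabulary)
print(vectorize("a cat is a cat", vocabulary))
print(vectorize("the bird sat", vocabulary))   # "bird" is unseen, so dropped
```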

The `NLTK` package implements many tools for this kind of text processing, such as tokenizers and stop-word lists.