# Logistic Regression

Logistic regression predicts the log odds of a binary outcome dependent on other factors.

Some definitions:
1. Given a probability $p$ of an outcome, the odds of that outcome are $\frac{p}{1-p}$. Note that with an observed set of outcomes, because of the shared denominator, this is equivalent to $\frac{f^+}{f^-}$, where $f^+$ is the number of cases with a positive outcome and $f^-$ the number of cases with a negative outcome.
1. The relative likelihood of an event for two different individuals can be summarized with the odds ratio: $\frac{\text{odds}_a}{\text{odds}_b} = \frac{p_a}{1-p_a} \left(\frac{p_b}{1-p_b}\right)^{-1} = \frac{(1-p_b)p_a}{(1-p_a)p_b} = \frac{f_b^- f_a^+}{f_a^- f_b^+}$.
1. For a $2 \times 2$ frequency table, this is easy to calculate:

```python
import pandas as pd

two_way_table = pd.DataFrame(
    {'y+': [6, 162], 'y-': [13, 2343]}, index=['x+', 'x-']
)
two_way_table['odds'] = two_way_table['y+']/two_way_table['y-']
print("Odds ratio:", two_way_table.loc['x+', 'odds']/two_way_table.loc['x-', 'odds'])
```

```
Odds ratio: 6.6752136752136755
```


Logistic regression allows us to estimate the impacts of a vector of variables $X$ using MLE.
$$ \text{logit}(p) = \log \frac{p}{1-p} = \beta_0 + \beta X $$

Once estimated as $\hat y = \hat \beta_0 + \hat \beta X$, $e^{\hat y}$ for a new $X'$ is the estimated odds of the outcome.

For an independent variable of interest $X_j$, the coefficient $\beta_j$ represents the change in the log odds associated with changing it (by 1 unit if continuous, from negative to positive if binary). Let $\hat y=\log \frac{p}{1-p}$ be the base predicted log odds, and $\hat y'=\log \frac{p'}{1-p'}$ the new prediction, then: $\beta_j = \hat y' - \hat y$. With logarithm arithmetic, this is equivalent to the log odds ratio associated with the change in the variable: $\hat y' - \hat y = \log \frac{p'/(1-p')}{p/(1-p)}$. Thus, to report the odds ratio associated with a change in one of the variables, all else equal, we need only calculate $e^{\hat \beta_j}$.
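As a rough check of this last point, the sketch below fits a logistic regression with a single binary regressor to the $2 \times 2$ table from the earlier example and compares $e^{\hat \beta_1}$ with the odds ratio computed there. The hand-rolled Newton-Raphson loop is an assumption made to keep the example dependency-light; it is not how a library would necessarily fit the model.

```python
import numpy as np

# The 2x2 table as weighted rows: (x, y, number of cases)
x = np.array([1.0, 1.0, 0.0, 0.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
n = np.array([6.0, 13.0, 162.0, 2343.0])

X = np.column_stack([np.ones_like(x), x])  # intercept + binary regressor
beta = np.zeros(2)

# Newton-Raphson on the frequency-weighted log-likelihood
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (n * (y - p))
    hessian = (X * (n * p * (1.0 - p))[:, None]).T @ X
    step = np.linalg.solve(hessian, gradient)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

odds_ratio = np.exp(beta[1])
print("exp(beta_1):", odds_ratio)
```

Because the model has one parameter per row of the table, the fitted odds match the observed odds in each row, so $e^{\hat \beta_1}$ agrees with the ratio computed directly from the frequencies.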

# Regularization

A problem faced by "big data" is that the available data tends to be "wide" rather than "tall": there are lots of measures relative to the number of observations. This is the inverse of the classical assumption of statistics. Regression performs best when data is clustered around a meaningful "middle." Various methods of regularization (or "penalized regression") seek to counterbalance this. In these methods, what is minimized is not the sum of squared residuals alone but a cost function combining the error with some penalty for overfitting. 
1. Ridge regression ($l_2$ penalty): the penalty term is the sum of the squared coefficients, $\lambda \sum \beta_j^2$. This is equivalent to the squared Euclidean length of the parameter vector, scaled by $\lambda$.
1. Lasso ($l_1$ penalty): the penalty term is the sum of the absolute values of the coefficients, $\lambda \sum |\beta_j|$. This is equivalent to the Manhattan length of the parameter vector, scaled by $\lambda$. The ridge penalty grows with the magnitude of each coefficient, so it reduces parameters *proportionately*, whereas the lasso penalty pushes parameter vectors onto the "axes" by setting some coefficients exactly to zero. The lasso therefore yields a "sparser" parameter vector, which is useful insofar as it allows for reducing the number of relevant features.
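The difference between the two penalties is easiest to see in the special case of an orthonormal design ($X^\top X = I$), where both problems have closed-form solutions. The sketch below assumes the objectives $\frac12 \lVert y - X\beta \rVert^2 + \frac{\lambda}{2} \sum \beta_j^2$ (ridge) and $\frac12 \lVert y - X\beta \rVert^2 + \lambda \sum |\beta_j|$ (lasso); the scaling conventions and the example coefficients are illustrative choices, not taken from the notes above.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: scale every coefficient by 1 / (1 + lam)."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-threshold every coefficient by lam."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # hypothetical unpenalized estimates
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)

print("ridge:", ridge)  # all coefficients shrunk by the same factor, none zero
print("lasso:", lasso)  # small coefficients are set exactly to zero
```

Ridge keeps every feature with a smaller coefficient, while the lasso drops the two features whose unpenalized coefficients are smaller than $\lambda$ in absolute value.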

The mathematical logic for this can be linked back to Stein's paradox, which shows that even when estimating parameters for a set of independent distributions, it is better to "shrink" the estimates towards the grand mean of the data rather than using only the observations for each type.

# Decision Trees

A decision tree classifies observations through a series of binary decision points (e.g. $X_1\le0$). A decision tree is fit by recursively considering, for a group of points, which split on which single feature would result in two groups with the highest weighted average "purity" in terms of labels. Purity (or, in practice, its absence) can be measured in different ways, including entropy ($-\sum_j p_j \log p_j$) and Gini impurity ($1-\sum_j p_j^2$), where $p_j$ is the share of a label $j$ in a cell.
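The two impurity measures, and the search for the best split on a single feature, can be sketched as follows. The function names and the toy data are hypothetical, and the natural logarithm is assumed for entropy (the base only rescales the measure).

```python
import numpy as np

def label_shares(labels):
    """Share p_j of each label in a cell."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = label_shares(labels)
    return float(-np.sum(p * np.log(p)))

def gini(labels):
    p = label_shares(labels)
    return float(1.0 - np.sum(p ** 2))

def best_threshold(x, y, impurity=gini):
    """Find the split x <= t with the lowest size-weighted average impurity."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:  # the largest value would leave the right cell empty
        left, right = y[x <= t], y[x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, score = best_threshold(x, y)
print(threshold, score)
```

A full tree repeats this search over every feature and then recurses into the two resulting cells.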

When compared to linear methods like Perceptron or Logistic Regression, the main advantages of Decision Trees are:
1. Fast fitting and prediction
1. Indifference to scale of features and to numerical vs. categorical features
1. Interpretability of decision points (as opposed to $\theta$ vectors)

The main disadvantages are:
1. All decision boundaries are "orthogonal" (axis-aligned), based on the value of one feature at a time, as opposed to the diagonal or even interacted relationships captured by linear models.
1. Strong tendency towards over-fitting.

That is to say, with unlimited depth, a decision tree can always perfectly classify training data, though this can result in a very high-variance classifier (i.e. with lots of very small cells). To improve generalizability, it can help to:
1. Limit the depth of the tree.
1. Limit the minimal size of a leaf.
1. Fit multiple trees and select the majority result (an example of ensemble learning known as a **random forest** in its full form). In general, this is a useful technique for high-variance models.

With mixed leaves or multiple trees, the model can either output a maximum likelihood label or probabilities. Decision trees or random forests can also be adapted to regression problems by outputting the average of values placed in a leaf based on minimizing some loss function.

A random forest needs to introduce some randomness (since the naive fitting algorithm is deterministic), including:
1. Bootstrap resampling (with replacement) the data for each tree (a technique known as **bagging**)
1. Considering only a random subset of features (as a rule of thumb, $\sqrt K$ of them) for each decision point. Descriptions vary, but Breiman's random forest and `sklearn` draw a fresh subset at every split; drawing a single subset per tree is the related random subspace method.
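A minimal sketch of these choices using `sklearn.ensemble.RandomForestClassifier` is shown below. The synthetic data set and all parameter values are illustrative assumptions; `max_features="sqrt"` asks for roughly $\sqrt K$ candidate features at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # bagging: resample the rows for each tree
    max_features="sqrt",   # random subset of features at each split
    min_samples_leaf=1,    # raise (or set max_depth) to limit over-fitting
    random_state=0,
)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
proba = forest.predict_proba(X_test[:3])  # class probabilities averaged over trees
print(accuracy)
print(proba)
```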

Random Forests can achieve classification performance in the ballpark of much more complicated neural networks, but are much easier to set up and even parallelize because of the independence of the models.

**Bagging** as a method for ensemble learning can be used for other kinds of models as well. It can be augmented also by assigning weights to each model.

# Boosting

Boosting is an iterative ensemble method. It does not just average different models but also learns weights for the observations and models. Generally speaking, *misclassified* cases in a given modeling iteration are given greater weight in the next model fitting, while models are weighted inversely to their error. Two key hyperparameters are the complexity of the trees (often a single-split "stump" works surprisingly well) and the "learning rate" response for the weights.

For example, **Adaptive Boosting** (AdaBoost) for binary classification begins with equal weights for each observation. It fits a classifier and calculates a weighted average error rate. It then sets the weight for the model to $a_m = \log \frac {1-\text{error}_m} {\text {error}_m}$ and re-weights the observations as $w_i' = w_i e^{a_m (1-z_i)}$, where $z_i$ is 1 for a correct prediction, 0 otherwise. The output of the ensemble is then just the sign of the weighted average of model predictions.
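One round of this update can be worked through numerically. The labels and predictions below are made up for illustration, and the final renormalization of the weights is an added assumption (it does not change their relative sizes).

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # hypothetical output of model m: one mistake

w = np.full(len(y_true), 1.0 / len(y_true))  # equal starting weights

z = (y_pred == y_true).astype(float)      # 1 if correct, 0 otherwise
error = np.sum(w * (1 - z)) / np.sum(w)   # weighted average error rate
a_m = np.log((1 - error) / error)         # weight of model m

w_new = w * np.exp(a_m * (1 - z))         # only misclassified cases are up-weighted
w_new = w_new / w_new.sum()               # renormalize so the weights sum to 1

print(error, a_m)
print(w_new)
```

After the update the single misclassified observation carries half of the total weight, so the next model is pushed to get it right.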

**Gradient Boosting** similarly fits models sequentially, but instead of weighting the observations, each model after the first fits the *residuals* ($y- \hat y$) of the ensemble built so far. The final prediction is then the sum of the predictions of all of the models, often with a decaying rate of influence.
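A minimal sketch of residual fitting for squared-error regression is given below, using single-split stumps on one feature. The helper names, the synthetic data, and the constant `learning_rate` (rather than a decaying one) are all assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on a 1-D feature: (threshold, left mean, right mean)."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())         # first "model": the overall mean
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                 # what the ensemble still gets wrong
    stump = fit_stump(x, residuals)            # next model is fit to the residuals
    prediction = prediction + learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

The training error cannot increase from one round to the next here, which also illustrates why boosting can over-fit if run for too many rounds.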

Both techniques are implemented in `sklearn.ensemble` as the model classes `AdaBoostClassifier`, `AdaBoostRegressor`, `GradientBoostingClassifier`, and `GradientBoostingRegressor`. Note that AdaBoost by default is based on decision trees, but another model can be passed as `estimator` (named `base_estimator` in older releases), while gradient boosting is implemented only with decision trees.

Boosting methods are powerful, but they can overfit as well because they "chase" hard-to-fit cases. Compared to random forest, boosting is slower because it must be fit iteratively rather than in parallel, and it is more sensitive to its hyperparameters. On the other hand, tree-based boosting models, like random forests, output feature importances, which helps with interpretability.



# Vectorization

Textual data needs to be transformed into data-points, i.e. feature vectors, to be used in machine learning. There are various ways to do this:
1. The "bag of words": indicating or counting the appearances of words in each text, so that there is one feature for each word in the dictionary (perhaps excluding certain "stop words" that do not convey much information).

Note that vectorization, like any feature transformation, represents some kind of information input to the model, so it should be based only on training data.
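A minimal bag-of-words sketch in plain Python follows. The stop-word list, the whitespace tokenizer, and the function names are illustrative assumptions; the point is that the vocabulary is built from the training texts only, so words first seen at test time are ignored.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}  # tiny illustrative list

def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def fit_vocabulary(train_texts):
    """Map each word seen in the training texts to a feature index."""
    words = sorted({w for text in train_texts for w in tokenize(text)})
    return {word: i for i, word in enumerate(words)}

def transform(texts, vocabulary):
    """Count vocabulary words in each text; unknown words are skipped."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_texts = ["the cat sat on the mat", "the dog sat"]
test_texts = ["a cat chased the dog dog"]

vocabulary = fit_vocabulary(train_texts)       # built from training data only
X_train = transform(train_texts, vocabulary)
X_test = transform(test_texts, vocabulary)     # "chased" is not in the vocabulary

print(list(vocabulary))
print(X_train)
print(X_test)
```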

The `NLTK` package implements many tools for text processing, such as tokenizers, stop-word lists, and stemmers.

# Time Series Analysis

## Properties of time series

The property of a time series that requires a special approach is **memory**. That is to say that the value at $t$ in some way depends on the value at $t-1$. This is also known as **auto-correlation**.

Time series analysis is structured around a mathematical decomposition of an observation into **signal** and **noise**. These are said to be **stationary** (or not) if they have constant mean, standard deviation, and autocorrelation (or not). They have **seasonality** if these vary at regular periods.

Pre-processing (or filtering) can seek to:
1. De-trend the data (if non-stationary or seasonal)
1. Account for autocorrelation
1. Smooth out noise

For example, if we are interested in the seasonality of some time series, we can fit a linear model over time (or log-linear for exponential growth) and subtract the prediction, i.e. look at the residuals.
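This kind of de-trending can be sketched with `numpy.polyfit`. The series below is synthetic (a linear trend plus a 12-period cycle plus noise), and all of its parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)

# Synthetic series: linear trend + seasonal cycle of period 12 + noise
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=t.size)

# Fit a straight line over time and subtract the fitted values
slope, intercept = np.polyfit(t, series, deg=1)
residuals = series - (slope * t + intercept)

print(slope, intercept)
print(residuals[:12])  # the seasonal pattern remains in the residuals
```

For exponential growth the same steps would be applied to `np.log(series)` instead.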

A time series is **trend stationary** if the data is stationary once the trend is removed. It is **difference stationary** if, for example, the first differences ($y_t - y_{t-1}$) show a stable distribution. For example, stock values often do not have a stable trend but are difference stationary.

If the first difference is entirely random with no auto-correlation, then the series can be described as a **random walk model** $y_t = y_{t-1} + w_t$, where $w_t$ is an i.i.d. noise term. A random walk with drift is a model in which the differences have a non-zero mean: $y_t = \mu + y_{t-1} + w_t$. In a random walk, the coefficient on the previous-period value is 1, or equivalently, the coefficient from regressing the difference on the previous-period value is 0: the Dickey-Fuller test uses this as a null hypothesis. The Augmented Dickey-Fuller test extends this to multiple lags and represents a test of whether the process has a unit root.

Sometimes, although two time series are each separately random walks, a linear combination of them is stationary; the two series are then said to be **cointegrated**. This entails modeling the spread between two variables, for example prices of substitutes. Thus,
$$P_t = \mu+cQ_t+\epsilon_t$$
For cointegrated variables, the Dickey-Fuller test of $P_t - cQ_t$ should reject the null hypothesis.

Seasonality can be removed by fitting periodic averages and subtracting these. In general, modeling time series entails experimenting with differences and seasonal differences in order to identify a transformation that yields stationarity. This is important, for example, because any two time series variables that have trends or similar seasonality will be correlated, even if totally unrelated in reality.

Auto-correlation is defined using the same formula as correlation, but instead of two variables, it is between one variable and that same variable lagged by some constant (here, $k$):
$$ r_k = \frac {\sum_{i=1}^{N-k} (Y_i - \bar Y)(Y_{i+k} - \bar Y)}{\sum_{i=1}^{N} (Y_i - \bar Y)^2} $$
Auto-correlation can capture both periodicity and memory. The former looks like a high auto-correlation at some specific period, while the latter looks like a gradually decreasing auto-correlation over greater periods. The **autocorrelation function** shows the autocorrelation at all lags. Differences often need to be taken for periods with high correlation.
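The sample autocorrelation at lag $k$ can be computed directly. The sketch below assumes the denominator is the total sum of squared deviations, so that $r_0 = 1$, and uses a purely seasonal toy series with period 12.

```python
import numpy as np

def autocorrelation(y, k):
    """Sample autocorrelation of y at lag k, with r_0 = 1."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    n = len(y)
    return float(np.sum(dev[: n - k] * dev[k:]) / np.sum(dev ** 2))

t = np.arange(48)
seasonal = np.sin(2 * np.pi * t / 12)  # toy series with period 12

acf_values = [autocorrelation(seasonal, k) for k in range(13)]
for k, r in enumerate(acf_values):
    print(k, round(r, 3))
```

The values dip to a strong negative correlation at lag 6 (half a period) and return to a strong positive correlation at lag 12, which is the signature of periodicity described above.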

Smoothing strategies include low-pass filtering or statistical filters such as weighted averages of nearby values. One issue with smoothing is that it loses information at the ends of the series. Generally speaking, the point of smoothing is to preserve larger, less frequent events while removing smaller, more frequent changes.
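A simple equally weighted moving average illustrates both the smoothing and the loss at the ends. The window length and the synthetic series are arbitrary choices made for illustration.

```python
import numpy as np

def moving_average(y, window):
    """Equally weighted moving average; only windows that fit entirely are kept."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="valid")

rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 100)                   # slow, large movement
noisy = signal + rng.normal(scale=0.5, size=t.size)    # fast, small changes on top

smoothed = moving_average(noisy, window=9)

print(len(noisy), len(smoothed))  # the smoothed series is window - 1 points shorter
```

Unequal weights (for example, larger weights near the center of the window) can be used by changing `kernel`, as long as the weights sum to 1.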

## Time series models

### Auto-regressive models

An auto-regressive model of order $p$ ($\text{AR}(p)$) represents an observation as a function of $p$ earlier observations (plus a drift term and noise). So, an $\text{AR}(1)$ model is written as:
$$ X_t = \mu + \phi X_{t-1} + \epsilon_t $$
For the process to be stationary, $-1<\phi<1$, which implies that the autocorrelation function decays as the lag grows. $\phi=1$ represents a random walk, $\phi=0$ represents random noise. Positive and negative values of $\phi$ reflect *momentum* and *mean reversion*, respectively.
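To see what the sign of $\phi$ does, the sketch below simulates two $\text{AR}(1)$ series and checks that the lag-1 sample autocorrelation is close to $\phi$. The helper names, the burn-in length, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def simulate_ar1(mu, phi, n, rng, burn_in=200):
    """Simulate X_t = mu + phi * X_{t-1} + eps_t and discard the first burn_in values."""
    eps = rng.normal(size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        x[t] = mu + phi * x[t - 1] + eps[t]
    return x[burn_in:]

def lag1_autocorrelation(x):
    dev = x - x.mean()
    return float(np.sum(dev[:-1] * dev[1:]) / np.sum(dev ** 2))

rng = np.random.default_rng(0)
momentum = simulate_ar1(mu=0.0, phi=0.7, n=20_000, rng=rng)
reverting = simulate_ar1(mu=0.0, phi=-0.7, n=20_000, rng=rng)

r_momentum = lag1_autocorrelation(momentum)
r_reverting = lag1_autocorrelation(reverting)
print(r_momentum, r_reverting)
```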

The Partial Autocorrelation Function shows, for each lag $k$, the last coefficient $\phi_k$ of a fitted $\text{AR}(k)$ model. Generally, the order of the AR model fitted should be the last for which the coefficient is significantly different from 0. Alternatively, we can look at the BIC of various orders of models to choose the one with the lowest value.

### Moving-average models

A moving average model of order $q$ ($\text{MA}(q)$) represents an observation as a function of $q$ previous error terms (plus drift and noise). So, an $\text{MA}(1)$ model is written as:
$$ X_t = \mu + \epsilon_t + \theta \epsilon_{t-1} $$
An MA model is always stationary. The 1-period autocorrelation for this model would be $\frac {\theta}{1+\theta^2}$ and 0 for longer lags. A consequence of this is that the (MLE) forecast for an MA model beyond $q$ steps is always $\mu$.
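The autocorrelation claim can be checked by simulation. The values of $\theta$ and the sample size below are arbitrary, and the helper function is written out only to keep the example self-contained.

```python
import numpy as np

def autocorrelation(x, k):
    dev = x - x.mean()
    n = len(x)
    return float(np.sum(dev[: n - k] * dev[k:]) / np.sum(dev ** 2))

theta, mu, n = 0.5, 0.0, 50_000
rng = np.random.default_rng(0)

eps = rng.normal(size=n + 1)
x = mu + eps[1:] + theta * eps[:-1]   # X_t = mu + eps_t + theta * eps_{t-1}

theoretical_r1 = theta / (1 + theta ** 2)
r1 = autocorrelation(x, 1)
r2 = autocorrelation(x, 2)
print(theoretical_r1, r1, r2)
```

The lag-1 estimate should be close to $\frac{\theta}{1+\theta^2} = 0.4$ and the lag-2 estimate close to 0, up to sampling noise.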

### ARIMA models

An autoregressive integrated moving average model of order $(p,d,q)$ combines the two, where $p$ and $q$ are defined as above and $d$ is the number of differences to take. For example, an $\text{ARIMA}(0,1,0)$ model:
$$ X_t = \mu + X_{t-1} + \epsilon_t $$
is a random walk with drift. The value of $d$ should be sufficient to achieve stationarity of the data set.

Note that it is possible to represent a stationary AR(1) model as an MA model of infinite order.
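A short derivation of this representation, assuming $|\phi|<1$: substituting the $\text{AR}(1)$ equation into itself $k$ times gives

$$ X_t = \mu \sum_{j=0}^{k-1} \phi^j + \phi^k X_{t-k} + \sum_{j=0}^{k-1} \phi^j \epsilon_{t-j} $$

As $k \to \infty$, the term $\phi^k X_{t-k}$ vanishes and the geometric sum converges, leaving

$$ X_t = \frac{\mu}{1-\phi} + \sum_{j=0}^{\infty} \phi^j \epsilon_{t-j} $$

which is an MA model of infinite order with coefficients $\theta_j = \phi^j$ and mean $\frac{\mu}{1-\phi}$.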

## Tools

Tools for time series analysis are included in the `statsmodels` package's `tsa` module.
- `statsmodels.tsa.stattools.adfuller(x, maxlag=None, regression='c', autolag='AIC', store=False, regresults=False) -> (float, float, int, int, dict, float)`: Augmented Dickey-Fuller test, based on the null hypothesis that there is a unit root, i.e. non-stationary behavior (paradigmatically, a random walk). The second return value is the p-value for the test. If the null cannot be rejected, the series will need to be differenced.
- `statsmodels.tsa.stattools.coint(y0, y1, trend='c', method='aeg', maxlag=None, autolag='aic', return_results=None) -> float, float, dict` tests for cointegration between the two given variables. As with `adfuller`, this returns a test statistic and p-value for the null hypothesis that there is no cointegration.
- `statsmodels.tsa.seasonal.seasonal_decompose(x, model='additive', filt=None, period=None, two_sided=True, extrapolate_trend=0) -> DecomposeResult`: decomposes a time series into `trend`, `seasonal` variation, and `resid`uals. These are accessible as attributes of the returned object, along with the original data as `observed`. The included `plot()` method will put all of the elements on line graphs.
- `statsmodels.tsa.stattools.acf(x, adjusted=False, nlags=None, qstat=False, fft=True, alpha=None, bartlett_confint=True, missing='none')`: calculates the autocorrelation function for the given time series data. Returns a 1d array where the index is the number of lags (NB ACF(0) is always 1). Also, with the `alpha` parameter, returns $1-\alpha$ confidence intervals. With `qstat=True`, returns Q statistics and p-values for correlations. The `pacf()` function estimates the partial autocorrelation function with similar syntax.
- `statsmodels.tsa.arima.model.ARIMA` and `statsmodels.tsa.statespace.sarimax.SARIMAX` implement ARIMA models with or without seasonality or exogenous regressors. Both versions take the parameter `order=(p,d,q)` to control the non-seasonal parameters.
- `statsmodels.tsa.arima_process.ArmaProcess(ar, ma)` creates an object for simulating an ARMA process. `ar` should be an array of autoregressive lag-polynomial coefficients, starting with 1 for the zero lag and then giving the **negative** of the $\phi$ values. `ma` should be an array of moving average lag-polynomial coefficients, starting with 1 and then giving the $\theta$ values (not $-\theta$); use `[1]` for a pure AR model. The object's `.generate_sample(nsample)` method simulates data.


# Unsupervised Learning

## Hierarchical clustering

A flexible technique for identifying commonalities among observations.
1. Calculate distances for all pairs and store `weight` of 1 for each
1. Initialize running `height` to 0
1. Repeat until no more grouping to be done:
    1. Identify two observations with shortest distance between them and combine them:
        1. Set its position as the weighted average of the two
        1. Set its `weight` to be sum of the two
        1. Set its `height` to be the running `height` total plus the `weight` of the new group
    1. Update the running `height` total

Hierarchical clustering can be visualized as a dendrogram, and groups made by "cutting" it at some `height` value. Note that this algorithm can be done with different measures of distance and centrality. There is also a version that proceeds backwards through splits.
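The merging loop can be sketched in plain NumPy as below. This is a simplified illustration rather than a full implementation: it uses Euclidean distance between weighted centroids, stops at a requested number of groups instead of cutting a dendrogram, and records the merge distance as the height of each merge (an assumption that differs from the running `height` bookkeeping in the steps above). All names are hypothetical.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [
        {"position": np.asarray(p, dtype=float), "weight": 1, "members": [i]}
        for i, p in enumerate(points)
    ]
    merge_heights = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the shortest distance between them
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(clusters[a]["position"] - clusters[b]["position"])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        ca, cb = clusters[a], clusters[b]
        weight = ca["weight"] + cb["weight"]
        merged = {
            # New position: weighted average of the two merged clusters
            "position": (ca["weight"] * ca["position"] + cb["weight"] * cb["position"]) / weight,
            "weight": weight,
            "members": ca["members"] + cb["members"],
        }
        merge_heights.append(float(d))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merge_heights

points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.3],
])
clusters, heights = agglomerate(points, n_clusters=2)
groups = sorted(sorted(c["members"]) for c in clusters)
print(groups)
print(heights)
```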

## Graph community detection

One way of thinking of networks is to say that a natural grouping maximizes the **modularity** of each group:
$$ M_m = \frac{1}{2L} \sum_{i,j \in C_m,i\ne j} (A_{ij}-\frac{k_ik_j}{2L}) $$
where $A_{ij}$ is the adjacency of two nodes (1 if connected, 0 if not), $k_i$ is the number of edges connected to node $i$ (i.e. the sum of row $i$ in the adjacency matrix $A$), and $L$ is the total number of edges in the network.
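The formula can be evaluated directly on a small graph. The example below is made up: two triangles joined by a single edge, with $L$ taken to be the total number of edges so that $2L$ is the sum of all entries of $A$.

```python
import numpy as np

def community_modularity(A, members):
    """Modularity M_m of one community, following the formula above."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)   # degree of each node
    two_L = A.sum()     # every undirected edge is counted twice in A
    total = 0.0
    for i in members:
        for j in members:
            if i != j:
                total += A[i, j] - k[i] * k[j] / two_L
    return total / two_L

# Two triangles (0-1-2 and 3-4-5) connected by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

m_left = community_modularity(A, [0, 1, 2])
m_right = community_modularity(A, [3, 4, 5])
m_mixed = community_modularity(A, [0, 1, 3])  # a less natural grouping

print(m_left, m_right, m_mixed)
```

The two triangles score higher than the mixed group, which matches the intuition that they are the natural communities in this graph.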

## Distance measures

Distance measures can be generalized using the Minkowski metric:
$$ d_p(x, x') = (\sum_{k=1}^d |x_k - x_k'|^p)^{\frac{1}{p}}$$
Thus, with $p=1$, this yields Manhattan distance, with $p=2$, it is Euclidean distance.
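A direct translation of the metric into code, with a hypothetical function name and a toy pair of points:

```python
import numpy as np

def minkowski(x, x_prime, p):
    """Minkowski distance of order p between two points."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, p=1)   # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```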