### (Binary and Multinomial) Logistic Regression

#### Generalized linear model
- So far, all of our models have been "general linear models": they are linear, with normally distributed errors and homogeneous variance.
- We now extend to the "generalized linear model," which relaxes the constraint of normality.

The scientists who introduced generalized linear models have since apologized for the confusing name.

#### Specifics
- Generalized linear models are set up the same way as general linear models, but use what is called "maximum likelihood" to estimate regression coefficients, rather than least squares.
- One example of a GLM is logistic regression—a model that is typically used for binary values or variables with three or more categories
- Logistic regression interprets categorical values as probabilities of choosing those values, and then converts those probabilities into the odds of choosing one value or another (on a logarithmic scale)
    - Conversion is known as a "link function"
    - Depending on the data and the number of categories, logistic regression may assume a Bernoulli, binomial, categorical, or (ordered) multinomial distribution
    
#### Probit vs. logit models
- Sometimes, when we talk about logistic regression, probit regression also comes up, since both regressions can handle binary DVs
- While logit models convert probabilities of being in a category into the log odds of being in that category, probit models instead convert the probabilities into Z scores
- The choice of link is situation dependent, but logistic regression tends to be more popular due to its easier interpretation.
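
To make the two links concrete, here is a small sketch using base R's `qlogis` (the logit) and `qnorm` (the probit). The probabilities are made-up values chosen only for illustration.

```r
# Illustrative probabilities (made-up values, not from any dataset)
p <- c(0.10, 0.50, 0.90)

# Logit link: probability -> log odds, i.e. log(p / (1 - p))
log_odds <- qlogis(p)

# Probit link: probability -> Z score (standard normal quantile)
z_scores <- qnorm(p)

data.frame(p, log_odds, z_scores)
```

Both links map a probability of 0.5 to 0 and are symmetric around it; they differ mainly in how far the tails are stretched.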
 
#### Binary logistic regression
- Most basic logistic regression - two categories
- Link function: `logit`
- Our goal is to calculate the odds of being in one category or another.
- To perform this in R, we can use the `glm` function.

e.g., predict whether sleep patterns have changed based on whether mood has worsened.

$$ \hat{Y}_{SleepPatternChanged} = b_0 + b_1 X_{HasMoodWorsened} $$

(Assume dummy coding for this example)

R code:
`glm(SleepPatternChanged ~ HasMoodWorsened, family = binomial(), data = SHHWData)`
- `family` - distribution we expect the **DV** to follow

In a normal regression (dummy coding):
- `intercept` = mean of control
- slope of predictor = mean of treatment minus mean of control

In a logistic regression (dummy coding):
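- log odds of sleep changing in control = intercept
- difference in log odds of sleep changing between treatment and control = predictor slope
- log odds of sleep changing in treatment = intercept + predictor slope

Log odds of YES in DV (treatment) = `LogOddsIntercept` + `LogOddsVariable`; back-transform this sum to get the probability of YES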
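
A minimal sketch of this bookkeeping, using a made-up data frame `toy` as a stand-in for `SHHWData` (the counts are hypothetical, and the variables are assumed to be dummy coded 0/1):

```r
# Hypothetical toy data standing in for SHHWData (values are made up)
toy <- data.frame(
  HasMoodWorsened     = rep(c(0, 1), each = 10),
  SleepPatternChanged = c(rep(1, 3), rep(0, 7),   # control: 3 of 10 changed
                          rep(1, 8), rep(0, 2))   # treatment: 8 of 10 changed
)

fit <- glm(SleepPatternChanged ~ HasMoodWorsened,
           family = binomial(), data = toy)
b <- coef(fit)

# Intercept: log odds of sleep changing when HasMoodWorsened = 0
log_odds_control <- unname(b["(Intercept)"])

# Intercept + slope: log odds of sleep changing when HasMoodWorsened = 1
log_odds_treatment <- unname(b["(Intercept)"] + b["HasMoodWorsened"])

# Add on the log-odds scale first, then back-transform to probabilities
p_control   <- plogis(log_odds_control)
p_treatment <- plogis(log_odds_treatment)
c(control = p_control, treatment = p_treatment)
```

With a single dummy-coded predictor, the back-transformed values should match the observed proportions in each group (here about 0.3 and 0.8).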

#### Odds, logs, probabilities
- Log odds to odds:
$$ \text{Odds} = e^{\text{LogOdds}} $$
- Odds to probabilities:
$$ P = \frac{O}{1+O} $$

$$ logit^{-1}(x) = logistic(x) = \frac{e^x}{1+e^x} $$
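
The conversions above can be sketched in base R as follows. The starting log-odds value is arbitrary; `plogis` and `qlogis` are R's built-in inverse logit and logit.

```r
# Log odds -> odds -> probability, step by step
log_odds <- 1.5                 # arbitrary illustrative value
odds     <- exp(log_odds)       # Odds = e^LogOdds
p        <- odds / (1 + odds)   # P = O / (1 + O)

# plogis() is the inverse logit (logistic function) in one step;
# qlogis() is the logit, mapping the probability back to log odds
p_direct      <- plogis(log_odds)
log_odds_back <- qlogis(p)

c(odds = odds, p = p, p_direct = p_direct, log_odds_back = log_odds_back)
```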

#### Consideration
- Don't backtransform until you've added the log-odds
- In general, $logit^{-1}(L_1 + L_2) \ne logit^{-1}(L_1) + logit^{-1}(L_2)$
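
A quick worked case shows why the order matters. The values $L_1 = -1$ (say, an intercept) and $L_2 = 2$ (say, a slope) are chosen purely for illustration:

$$ logit^{-1}(-1 + 2) = \frac{e^{1}}{1+e^{1}} \approx 0.731 $$

$$ logit^{-1}(-1) + logit^{-1}(2) \approx 0.269 + 0.881 = 1.150 $$

Back-transforming each term separately gives a "probability" above 1, which is impossible; adding on the log-odds scale first gives a valid probability.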

#### Logistic regression with continuous predictors
- `intercept` = log odds of DV Yes at predictor = 0
- predictor slope = change in log odds of DV Yes for each one-unit increase in the predictor
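
As a sketch on simulated data (the variable names `hours` and `yes` and the true coefficients are assumptions made up for this example), the intercept and slope combine on the log-odds scale before being converted to a probability:

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical continuous predictor and binary outcome
n     <- 200
hours <- runif(n, 0, 10)
yes   <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * hours))
sim   <- data.frame(hours, yes)

fit <- glm(yes ~ hours, family = binomial(), data = sim)
b0  <- unname(coef(fit)["(Intercept)"])  # log odds of Yes at hours = 0
b1  <- unname(coef(fit)["hours"])        # change in log odds per extra hour

# Predicted probability at hours = 4: add log odds, then back-transform
p_manual  <- plogis(b0 + b1 * 4)
p_predict <- unname(predict(fit, newdata = data.frame(hours = 4),
                            type = "response"))

# exp(slope) is the odds ratio for a one-unit increase in the predictor
odds_ratio <- exp(b1)
c(p_manual = p_manual, p_predict = p_predict, odds_ratio = odds_ratio)
```

The manual calculation and `predict(..., type = "response")` should agree, since `predict` performs the same back-transformation.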

#### Multiple predictors
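- We can have multiple predictors; each slope is then the change in log odds for a one-unit increase in that predictor, holding the other predictors constant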

#### Proportions
- Note that the `glm` function will accept binary values (0 or 1) or proportions (0.2, 0.5, etc.)
- If you specify proportions, you'll need to specify the `weights` argument (the number of trials behind each proportion) so that R can calculate the corresponding counts of binary values
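
A minimal sketch with made-up grouped data (the data frame `grouped` and its columns are hypothetical). It also fits the same model from success/failure counts, which is another response format `glm` accepts for the binomial family:

```r
# Hypothetical grouped data: proportion of "yes" out of n_trials per dose
grouped <- data.frame(
  dose     = c(1, 2, 3, 4),
  n_trials = c(20, 20, 25, 10),
  prop_yes = c(0.20, 0.50, 0.60, 0.90)
)

# Proportions as the response, with the number of trials as weights
fit_prop <- glm(prop_yes ~ dose, family = binomial(),
                weights = n_trials, data = grouped)

# Equivalent specification: two-column matrix of successes and failures
grouped$n_yes <- round(grouped$prop_yes * grouped$n_trials)
grouped$n_no  <- grouped$n_trials - grouped$n_yes
fit_counts <- glm(cbind(n_yes, n_no) ~ dose, family = binomial(),
                  data = grouped)

rbind(proportions = coef(fit_prop), counts = coef(fit_counts))
```

The two rows of coefficients should be the same, because the weights let R reconstruct the counts behind each proportion.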

#### Ordered multinomial logistic regression
- We need to use a link that will accommodate this ordering
- Link: cumulative logit
- Rather than the odds of being in a category, we calculate the odds of being at or below a category versus moving into the next category up
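
One way to fit such a model in R is `polr` from the `MASS` package, which fits a proportional-odds (cumulative logit) model. The sketch below uses simulated data; the variable names, cut points, and true slope are assumptions made up for the example.

```r
library(MASS)  # polr() fits proportional-odds (cumulative logit) models

set.seed(2)  # simulated data, for illustration only
n <- 300
x <- rnorm(n)
latent <- 1.5 * x + rlogis(n)  # hypothetical latent scale behind the ratings
rating <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
              labels = c("low", "medium", "high"), ordered_result = TRUE)
sim <- data.frame(x, rating)

fit <- polr(rating ~ x, data = sim, method = "logistic", Hess = TRUE)

coef(fit)   # one slope, shared across all category thresholds
fit$zeta    # K - 1 = 2 thresholds (cut points) on the log-odds scale

# Cumulative probabilities P(rating <= category) at x = 0; in polr's
# parameterization, logit P(Y <= k) = zeta_k - slope * x
plogis(fit$zeta - coef(fit)["x"] * 0)
```

Note the sign convention: in `polr`, a positive slope means higher values of the predictor push responses toward higher categories.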

### Multiple regression
Purposes of conducting regression:
- Linear regression can be used in a confirmatory setting where a researcher wants to estimate the regression coefficients of a set of predictors.
- Linear regression can also be used when it is not known *a priori* which predictors have an impact on the criterion.

### Model selection
- Best subset selection: out of fashion, but historically important
- Can be used if you have $\le 40$ predictors and want to know which and how many give the best predictions.

Steps:
1. Calculate residual SS for all possible combinations of predictors for all possible *k* (branch and bound algorithm)
2. Figure out which *k* is the best
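
The two steps can be sketched with a brute-force search on R's built-in `mtcars` data. This enumerates every subset directly rather than using branch and bound, so it is only practical for a handful of predictors; the choice of `mpg` as criterion, the four candidate predictors, and BIC for step 2 are illustrative assumptions.

```r
# Exhaustive best-subset search on the built-in mtcars data
predictors <- c("wt", "hp", "disp", "qsec")

best_by_k <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  # Residual sum of squares for every subset of size k
  rss <- sapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
    sum(residuals(fit)^2)
  })
  best <- which.min(rss)
  best_fit <- lm(reformulate(subsets[[best]], response = "mpg"),
                 data = mtcars)
  data.frame(k = k,
             predictors = paste(subsets[[best]], collapse = " + "),
             rss = rss[best],
             bic = BIC(best_fit))
}))

# Step 1: best subset for each k (the best RSS never increases with k)
best_by_k

# Step 2: pick k with a criterion that penalizes model size, here BIC
best_by_k[which.min(best_by_k$bic), ]
```

Because the best RSS can only improve as predictors are added, RSS alone cannot choose *k*; a penalized criterion or cross-validation is needed for step 2.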

#### Stepwise regression
Start with zero predictors and iteratively add the next best predictor

(Alternatively: start with all predictors, and delete the least important ones)

At each step, the model needs to be re-estimated. Subsequent models are always nested.

Potential drawback: the decision to include or exclude a predictor is final.

Stepwise selection can also use AIC/BIC as the criterion for deciding which predictor to add or drop.

#### Example
- Fit an initial model with a large number of predictors.
    - Iteratively, drop one predictor and find smallest AIC/BIC
    - When we stop getting gains from dropping, we are done.
    
#### Example 2
- Fit initial model to a single predictor.
    - Iteratively, add each predictor and see if it decreases the AIC/BIC.
    - When we stop getting gains from adding, we're done.
  
 
**Hybrid model:** consider adding/dropping at each step
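
Base R's `step` function covers all three variants. The sketch below uses the built-in `mtcars` data with an arbitrary set of candidate predictors; unlike Example 2, the forward search here starts from an intercept-only model rather than a single predictor.

```r
# Built-in mtcars data; mpg is the criterion (illustrative choice)
full  <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
empty <- lm(mpg ~ 1, data = mtcars)

# Backward elimination by AIC: start from the full model, drop predictors
backward <- step(full, direction = "backward", trace = 0)

# Forward selection by BIC: start from the empty model; k = log(n)
# turns step()'s AIC-style penalty into BIC
forward <- step(empty, scope = formula(full), direction = "forward",
                k = log(nrow(mtcars)), trace = 0)

# Hybrid: at each step, consider both adding and dropping a predictor
hybrid <- step(empty,
               scope = list(lower = formula(empty), upper = formula(full)),
               direction = "both", trace = 0)

formula(backward)
formula(forward)
formula(hybrid)
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints the AIC/BIC of each candidate move, which is useful for seeing where the gains stop.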

#### Regularized (Penalized) Regression
##### Shrinkage
- In model selection approaches we've discussed so far, predictors are either in or out.
    - The resulting models are easy to interpret, but sometimes the discreteness of inclusion/exclusion leads to inflated prediction error.
- Shrinkage methods regulate the effects of predictors in a more continuous way. By placing a constraint on the sum of (a function of) the regression coefficients, they shrink the coefficients toward zero and toward each other.

- Ridge and lasso regression differ by the type of constraint they impose

##### Ridge regression

Imposes constraint $\sum^{p}_{j=1}{b^2_j} \le t$, where t is a constant.

When fitting a ridge regression, predictors are assumed to be normalized to Z, and Y is assumed to be centered. This is because ridge regression coefficients are not equivariant under scaling of predictors.

- Ridge regression is a soft thresholding method.

(In hard thresholding, we either keep or drop a predictor. In soft thresholding, we simply shrink the coefficients of the predictors.)
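
A minimal sketch of ridge shrinkage in base R, using the closed-form solution of the equivalent penalized problem. The penalty weight `lambda` is introduced here for illustration (a larger `lambda` corresponds to a smaller constant $t$), and the `mtcars` predictors are an arbitrary choice.

```r
# Ridge coefficients in closed form: b = (X'X + lambda * I)^(-1) X'y
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))  # Z-scored
y <- mtcars$mpg - mean(mtcars$mpg)                              # centered

ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)
round(coefs, 3)

# The sum of squared coefficients shrinks as lambda grows
colSums(coefs^2)
```

At `lambda = 0` this reduces to ordinary least squares; as `lambda` grows the coefficients shrink toward zero, but none of them is set exactly to zero.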

##### Lasso
- The lasso has a different constraint: $\sum^{p}_{j=1}{|b_j|} \le t$
- Ridge regression coefficients approach zero only as the shrinkage parameter $\lambda$ approaches infinity. Lasso shrinks coefficients to exactly zero at finite values of $\lambda$, so lasso can be used for variable selection.

- Hard thresholding — does drop variables

- Uses `glmnet` package in R
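
The shrinkage parameter $\lambda$ comes from rewriting each constrained problem in its equivalent penalized form. As a sketch, with $Y$ centered and the predictors standardized (the symbols $y_i$, $x_{ij}$, and $n$ are introduced here for illustration):

$$ \hat{b}^{ridge} = \arg\min_{b} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} b_j \Big)^2 + \lambda \sum_{j=1}^{p} b_j^2 $$

$$ \hat{b}^{lasso} = \arg\min_{b} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} b_j \Big)^2 + \lambda \sum_{j=1}^{p} |b_j| $$

Each value of the constant $t$ corresponds to a value of $\lambda$: a tighter constraint (smaller $t$) means a larger penalty (larger $\lambda$).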

#### Quantile regression
- Quantile regression can be viewed as an extension of ordinary regression, but based on quantiles rather than means
- Motivation: typical regression only lets us view how some measure of central tendency varies with predictors
    - We want to see how the distribution varies as a whole
    
Answer: fit a regression line to several percentiles of the dataset.
- We can fit to any given percentile, including the median, which (barring extreme skew) is similar to the mean regression line

(Cynthia shows a cool graph plotting the intercept against each quantile, and some linear interpolation between the points)

#### Median/quantile regression advantages
- More robust against outliers and nonnormal data.
- Quantile regression provides a richer characterization of the data—we can see a set of conditional distributions, rather than a single conditional mean.

#### Differences
In median regression, we minimize the sum of absolute residuals rather than the sum of squared residuals.
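
The simplest illustration is an intercept-only "regression," where the best constant under each loss can be found directly. The sketch below uses a made-up sample with one outlier; `check_loss` is a hypothetical helper implementing the asymmetric loss commonly used for quantiles other than the median.

```r
# Hypothetical sample with one large outlier
y <- c(2, 3, 4, 5, 6, 7, 100)

# Check (pinball) loss for quantile tau; at tau = 0.5 it is half the
# absolute residual
check_loss <- function(u, tau) u * (tau - (u < 0))

# Find the constant m that minimizes a loss. For these piecewise-linear
# losses a minimum is always attained at a data point, so we search over y.
best_constant <- function(loss) y[which.min(sapply(y, loss))]

m_abs <- best_constant(function(m) sum(abs(y - m)))
m_q75 <- best_constant(function(m) sum(check_loss(y - m, 0.75)))

# Least squares has a closed-form minimizer: the mean
m_sq <- mean(y)

c(least_squares = m_sq, least_absolute = m_abs,
  median = median(y), q75 = m_q75)
```

The least-squares constant is pulled toward the outlier, while the least-absolute-residual constant coincides with the median, which is the robustness advantage noted above.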

