# Classification

Though linear regression can be applied in the case of binary qualitative responses, difficulties arise beyond two levels. For example, choosing a coding scheme is problematic, and different coding schemes can yield wildly different predictions.

## Logistic Regression
Logistic regression models the probability that $y$ belongs to a particular category rather than modeling the response itself.

Logistic regression uses the logistic function to ensure a prediction between 0 and 1. The logistic function takes the form

$$p(\mathbf{X}) = \frac{e^{\beta_0 + \beta_1\mathbf{X}}}{1 + e^{\beta_0 + \beta_1\mathbf{X}}}.$$
This yields a probability greater than 0 and less than 1.
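
The bounded behavior of the logistic function can be checked with a minimal Python sketch. The function name `logistic` and the coefficient values below are illustrative assumptions, not values from any fitted model.

```python
import math

def logistic(x, beta_0, beta_1):
    """Return p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))."""
    z = math.exp(beta_0 + beta_1 * x)
    return z / (1.0 + z)

# Hypothetical coefficients chosen only for illustration.
beta_0, beta_1 = -1.0, 0.5

probabilities = [logistic(x, beta_0, beta_1) for x in range(-10, 11)]

# For these moderate inputs every value lies strictly between 0 and 1.
print(min(probabilities), max(probabilities))
```

For very large values of the exponent, floating-point arithmetic can round the result to exactly 1 or overflow, so the strict bounds hold mathematically but not always numerically.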

The logistic function can be rearranged to yield

$$
\frac{p(\mathbf{X})}{1-p(\mathbf{X})} = e^{\beta_0 + \beta_1\mathbf{X}}
$$
The quantity $\frac{p(\mathbf{X})}{1-p(\mathbf{X})}$ is known as the __odds__ and takes on a value between 0 and infinity.

As an example, a probability of 1 in 5 yields odds of $\frac{1}{4}$ since $\frac{0.2}{1-0.2} = \frac{1}{4}$.

Taking a logarithm of both sides of the logistic odds equation yields an equation for the __log-odds__ or __logit__,

$$\log\left(\frac{p(\mathbf{X})}{1-p(\mathbf{X})}\right) = \beta_0 + \beta_1\mathbf{X}$$
Logistic regression has a logit that is linear in terms of $\mathbf{X}$.
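
The relationships among probability, odds, and log-odds can be sketched in a few lines of Python. The helper names `odds`, `logit`, and `inverse_logit` are assumptions introduced for illustration.

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 < p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: the natural logarithm of the odds."""
    return math.log(odds(p))

def inverse_logit(log_odds):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

p = 0.2  # a probability of 1 in 5
print(odds(p))                  # approximately 0.25, i.e. odds of 1/4
print(logit(p))                 # log(1/4), a negative number
print(inverse_logit(logit(p)))  # recovers approximately 0.2
```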

Unlike linear regression, where $\beta_1$ represents the average change in $\mathbf{Y}$ associated with a one-unit increase in $\mathbf{X}$, in logistic regression a one-unit increase in $\mathbf{X}$ changes the log-odds by $\beta_1$, which is equivalent to multiplying the odds by $e^{\beta_1}$.
The relationship between $p(\mathbf{X})$ and $\mathbf{X}$ is not linear, so $\beta_1$ does not correspond to the change in $p(\mathbf{X})$ associated with a one-unit increase in $\mathbf{X}$. However, if $\beta_1$ is positive, increasing $\mathbf{X}$ will be associated with an increase in $p(\mathbf{X})$ and, similarly, if $\beta_1$ is negative, an increase in $\mathbf{X}$ will be associated with a decrease in $p(\mathbf{X})$. The size of the change in $p(\mathbf{X})$ depends on the current value of $\mathbf{X}$.
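
A small numerical sketch can make this concrete: a one-unit increase in the predictor multiplies the odds by the same factor everywhere, while the change in probability varies with the starting point. The coefficient values and the starting points are hypothetical.

```python
import math

def logistic(x, beta_0, beta_1):
    """Probability from the logistic model with a single predictor."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients chosen only for illustration.
beta_0, beta_1 = -3.0, 0.8

prob_changes = []
odds_ratios = []
for x in [0, 2, 4, 6]:
    p_before = logistic(x, beta_0, beta_1)
    p_after = logistic(x + 1, beta_0, beta_1)
    prob_changes.append(p_after - p_before)
    odds_ratios.append(odds(p_after) / odds(p_before))

print(prob_changes)        # differs from one starting point to the next
print(odds_ratios)         # each entry is approximately e^{beta_1}
print(math.exp(beta_1))
```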

## Estimating Regression Coefficients

Logistic regression uses a strategy called __maximum likelihood__ to estimate regression coefficients.

Maximum likelihood works as follows: choose estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(\mathbf{X}_i)$ for each observation corresponds as closely as possible to that observation's observed class. Formally, this yields an equation called a __likelihood function__:

$$
\mathcal{L}(\beta_0, \beta_1) = \prod_{i:y_i=1}p(\mathbf{X}_i) \times \prod_{j:y_j=0}(1-p(\mathbf{X}_j)).
$$

Estimates for $\beta_0$ and $\beta_1$ are chosen so as to maximize this likelihood function.
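
As a rough sketch of how this maximization might be carried out, the following Python code applies Newton's method to the log of the likelihood function, which has the same maximizer as the likelihood itself. The text does not specify an optimizer, so Newton's method is an assumption here, as are the function names and the small made-up data set.

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum of log p(x) over y = 1 and log(1 - p(x)) over y = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(x, b0, b1)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def score(xs, ys, b0, b1):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    residuals = [y - logistic(x, b0, b1) for x, y in zip(xs, ys)]
    g0 = sum(residuals)
    g1 = sum(r * x for r, x in zip(residuals, xs))
    return g0, g1

def fit_newton(xs, ys, iterations=25):
    """Estimate (b0, b1) with a fixed number of Newton steps from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = score(xs, ys, b0, b1)
        # Entries of the negative Hessian, using weights p * (1 - p).
        ps = [logistic(x, b0, b1) for x in xs]
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system for the Newton step and update.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up, non-separable toy data (not from any real data set).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_newton(xs, ys)
print(b0_hat, b1_hat)
print(log_likelihood(xs, ys, b0_hat, b1_hat))
```

This sketch assumes the classes are not perfectly separable; with perfectly separated data the likelihood has no finite maximizer and the iteration would not settle.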

Linear regression's least squares approach is actually a special case of maximum likelihood: it yields the maximum likelihood estimates when the errors are assumed to be normally distributed.

Logistic regression measures the accuracy of coefficient estimates using a quantity called the __z-statistic__. The z-statistic is similar to the t-statistic. The z-statistic for $\beta_1$ is represented by

$$\text{z-statistic }(\beta_1) = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

A z-statistic that is large in absolute value offers evidence against the null hypothesis.
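
The calculation can be sketched in Python as follows. The estimate and standard error are hypothetical numbers, and converting the z-statistic to a two-sided p-value with a standard normal approximation is an assumption added for illustration.

```python
import math

def z_statistic(beta_hat, standard_error):
    """Ratio of a coefficient estimate to its standard error."""
    return beta_hat / standard_error

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal random variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical estimate and standard error, for illustration only.
beta_1_hat = 1.2
se_beta_1 = 0.4

z = z_statistic(beta_1_hat, se_beta_1)
p_value = two_sided_p_value(z)
print(z, p_value)
```

A small p-value here corresponds to a z-statistic far from zero, which is the sense in which it counts as evidence against the null hypothesis.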

In logistic regression, the null hypothesis

$$H_0: \beta_1 = 0$$
implies that

$$p(\mathbf{X}) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$$

and, ergo, $p(\mathbf{X})$ does not depend on $\mathbf{X}$.

## Making Predictions
Once coefficients have been estimated, predictions can be made by plugging the coefficients into the model equation

$$\hat{p}(\mathbf{X}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1\mathbf{X}}}{1+e^{\hat{\beta}_0 + \hat{\beta}_1\mathbf{X}}}.$$

In general, the estimated intercept, $\hat{\beta}_0$, is of limited interest since it mainly captures the ratio of positive and negative classifications in the given data set.

Similar to linear regression, __dummy variables__ can be used to accommodate qualitative predictors.
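
A minimal sketch of prediction with a dummy variable is shown below. It assumes a single qualitative predictor with two levels, encoded as 1 or 0; the level names, the helper `encode_dummy`, and the coefficient values are all hypothetical.

```python
import math

def logistic(x, beta_0, beta_1):
    """Predicted probability from the single-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))

def encode_dummy(level, positive_level):
    """Return 1 if the observation has the chosen level, otherwise 0."""
    return 1 if level == positive_level else 0

# Hypothetical estimated coefficients, for illustration only.
beta_0_hat, beta_1_hat = -2.0, 0.7

statuses = ["student", "non-student", "student"]
dummies = [encode_dummy(s, "student") for s in statuses]
predictions = [logistic(d, beta_0_hat, beta_1_hat) for d in dummies]

for status, p_hat in zip(statuses, predictions):
    print(status, round(p_hat, 4))
```

With this encoding, the level coded 0 receives the probability implied by the intercept alone, and the level coded 1 receives the probability implied by the intercept plus the coefficient.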