# Chapter 4: Classification
Predicting qualitative responses.

Linear regression is not suited to qualitative responses: any numeric encoding of the categories implies an ordering among them and a distance between them, which in most cases is not sensible. In the binary case, one could encode the response as 0/1 and fit a linear regression. This gives a usable classifier, but some predictions can fall below 0 or above 1, so they are hard to interpret as the probability that the input belongs to class 1.
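A minimal sketch of the binary case, assuming made-up 0/1 data and ordinary least squares written in plain Python (the data and the helper name `fit_line` are hypothetical). It illustrates how fitted values can leave the $[0, 1]$ range at the extremes of the predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: the response is coded 0 for one class and 1 for the other.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

intercept, slope = fit_line(xs, ys)
for x in (0, 4.5, 9):
    print(f"x={x}: fitted value {intercept + slope * x:.3f}")
```

With this data the fitted value is negative at $x = 0$ and greater than 1 at $x = 9$, so neither can be read as a probability.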

## Logistic Regression

Logistic regression models the probability that $Y$ belongs to a particular category. For example, $\mathbb{P}(\text{default} = \text{yes} \mid \text{balance}) = p(\text{balance})$ ranges between 0 and 1, so a prediction for default can be made for any value of balance.

Logistic function: $p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}$, with outputs between 0 and 1. From the definition it follows that:

$$
\dfrac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X}.
$$
The quantity on the left is called `odds`, and ranges from 0 to $\infty$.
Taking the logarithm,
$$
\log\left({\dfrac{p(X)}{1-p(X)}}\right) = \beta_0 + \beta_1X.
$$
The quantity on the left is called `logit` or `log-odds`, and it is linear in $X$. Increasing $X$ by one unit changes the log-odds by $\beta_1$, or equivalently multiplies the odds by $e^{\beta_1}$. This is not the change in $p(X)$: how much $p(X)$ changes also depends on the current value of $X$.
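The relationship between $p(X)$, odds and log-odds can be checked numerically. The sketch below assumes hypothetical coefficients `BETA0` and `BETA1`, chosen only for illustration: a one-unit increase in $X$ always shifts the log-odds by $\beta_1$, while the shift in $p(X)$ varies with the starting point.

```python
import math


def logistic(x, beta0, beta1):
    """p(X) = e^(beta0 + beta1 * x) / (1 + e^(beta0 + beta1 * x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))  # algebraically the same expression


def odds(p):
    return p / (1.0 - p)


def log_odds(p):
    return math.log(odds(p))


# Hypothetical coefficients, not estimated from any data.
BETA0, BETA1 = -3.0, 0.5

for x in (0.0, 6.0, 12.0):
    p_now = logistic(x, BETA0, BETA1)
    p_next = logistic(x + 1, BETA0, BETA1)
    print(f"x={x:4.1f}  p={p_now:.3f}  "
          f"change in p={p_next - p_now:.3f}  "
          f"change in log-odds={log_odds(p_next) - log_odds(p_now):.3f}")
```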

### Estimating the coefficients

Maximum likelihood: we look for $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for $p(X)$ yields a number close to one for all instances of one class, and a number close to zero for all instances of the other class. Formally, $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function:

$$
l(\beta_0, \beta_1) = \prod_{i: y_i=1}p(x_i)\prod_{j: y_j=0}(1-p(x_j)).
$$
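A minimal sketch of maximizing this function, assuming Newton's method applied to the log-likelihood (one common optimizer; the notes do not prescribe one). The toy data and the function names `p`, `log_likelihood` and `fit_logistic` are hypothetical.

```python
import math


def p(x, b0, b1):
    """Logistic model p(x) for a single predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(xs, ys, b0, b1):
    # Logarithm of l(b0, b1): log p(x_i) when y_i = 1, log(1 - p(x_j)) when y_j = 0.
    return sum(math.log(p(x, b0, b1)) if y == 1 else math.log(1.0 - p(x, b0, b1))
               for x, y in zip(xs, ys))


def fit_logistic(xs, ys, iterations=25):
    """Newton's method on the log-likelihood, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [p(x, b0, b1) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - q for y, q in zip(ys, ps))
        g1 = sum((y - q) * x for x, y, q in zip(xs, ys, ps))
        # Negative Hessian (2x2), built from the weights p(1 - p).
        ws = [q * (1.0 - q) for q in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical, non-separable toy data (the classes overlap around x = 2).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(b0_hat, b1_hat, log_likelihood(xs, ys, b0_hat, b1_hat))
```

At the optimum the derivative with respect to $\beta_0$ vanishes, so the fitted probabilities average to the proportion of ones in the training data.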

Many of the considerations made for linear regression remain valid for logistic regression, with the $z$-statistic taking the place of the $t$-statistic. The estimated intercept is usually not of interest: its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data.

Prediction then simply plugs the estimated coefficients into the logistic function:
$$
\hat{p}(X) = \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1X}}{1+e^{\hat{\beta}_0 + \hat{\beta}_1X}}.
$$
For categorical input variables, one can use the same approach seen for linear regression, using dummy variables.

### Multiple logistic regression

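Extends analogously to the linear regression case: with predictors $X_1, \dots, X_p$, the log-odds become $\beta_0 + \beta_1X_1 + \dots + \beta_pX_p$, and the coefficients are again estimated by maximum likelihood.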

### Logistic regression for more than 2 response classes

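Possible (the two-class model extends to $K > 2$ classes), but not frequently used; discriminant analysis, covered next, is a common choice for multi-class problems.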

## Linear Discriminant Analysis (LDA)

Logistic regression involves directly modeling $\mathbb{P}(Y = k|X = x)$ using the logistic function. We now consider an alternative and less direct approach to estimating these probabilities. In this alternative approach, we model the distribution of the predictors $X$ separately in each of the response classes (i.e. given $Y$), and then use Bayes’ theorem to flip these around into estimates $\mathbb{P}(Y = k|X = x)$.

We define:
- $\pi_k$: prior probability that an observation comes from the $k$-th class (usually this is the proportion of responses of class $k$ in the training set);
- $f_k(x) = \mathbb{P}(X = x \mid Y=k)$: density function of $X$ for an observation from the $k$-th class.

Then Bayes’ theorem states that
$$
p_k(x) = \mathbb{P}(Y=k | X=x) = \dfrac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}.
$$

$p_k(x)$ is called the posterior probability that an observation with $X = x$ belongs to class $k$. We want to estimate it so as to get as close as possible to the Bayes classifier, which has the lowest possible error rate of all classifiers.
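A minimal sketch of the posterior formula for two classes, assuming Gaussian class densities purely for illustration. The priors, the means and the helper names `gaussian_pdf` and `posteriors` are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def posteriors(x, priors, densities):
    """p_k(x) = pi_k * f_k(x) / sum_l pi_l * f_l(x), for every class k."""
    joint = [pi_k * f_k(x) for pi_k, f_k in zip(priors, densities)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical setup: class 0 is more frequent a priori, class 1 sits at larger x.
priors = [0.7, 0.3]
densities = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),
    lambda x: gaussian_pdf(x, 2.0, 1.0),
]

for x in (0.0, 1.0, 3.0):
    print(x, posteriors(x, priors, densities))
```

At $x = 1$ the two densities are equal, so the posterior reduces to the prior; further to the right the evidence from $f_1$ outweighs the smaller prior of class 1.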

See a clear example at https://en.wikipedia.org/wiki/Naive_Bayes_classifier

### Case $p=1$

Let's assume that $f_k(x)$ is normal or Gaussian, i.e.
$$
f_k(x) = \dfrac{1}{\sqrt{2\pi}\sigma_k}\exp{\left(-\dfrac{(x-\mu_k)^2}{2\sigma^2_k}\right)}.
$$

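We also assume that all classes share the same variance, $\sigma_1^2 = \dots = \sigma_K^2 = \sigma^2$, so we drop the $k$ subscript.
Taking the logarithm of $p_k(x)$ and discarding the terms that do not depend on $k$, one can show that assigning an observation to the class for which $p_k(x)$ is largest is equivalent to assigning it to the class for which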
$$
\delta_k(x) = x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu^2_k}{2\sigma^2} + \log{\pi_k}
$$
is largest.

We have to estimate $\mu_1, \dots, \mu_K$, $\pi_1, \dots, \pi_K$ and $\sigma^2$. LDA does it in the following way:

- $\hat{\mu}_k = \dfrac{1}{n_k}\sum_{i:y_i=k}{x_i}$ (average of all training observations from class $k$),

- $\hat{\sigma}^2 = \dfrac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$ (weighted average of the sample variances of each class),

- $\hat{\pi}_k = \dfrac{n_k}{n}$ (proportion of training observations of class $k$),

where $n$ is the size of the training set and $n_k$ the number of training observations having $k$ as response class.

So LDA assigns $X=x$ to the class for which
$$
\hat{\delta}_k(x) = x \cdot \dfrac{\hat{\mu}_k}{\hat{\sigma}^2} - \dfrac{\hat{\mu}^2_k}{2\hat{\sigma}^2} + \log{\hat{\pi}_k}
$$
is largest. The discriminant function $\hat{\delta}_k(x)$ is linear in $x$, whence the name.
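The estimates and the discriminant rule for one predictor can be sketched as follows. The toy data and the function names `fit_lda_1d`, `discriminant` and `predict` are hypothetical; the formulas are the ones given above.

```python
import math
from collections import defaultdict


def fit_lda_1d(xs, ys):
    """Return class means, the pooled variance and the class priors."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n, K = len(xs), len(groups)
    mu = {k: sum(v) / len(v) for k, v in groups.items()}
    # Pooled variance: squared deviations from each class mean, divided by n - K.
    sigma2 = sum((x - mu[k]) ** 2 for k, v in groups.items() for x in v) / (n - K)
    prior = {k: len(v) / n for k, v in groups.items()}
    return mu, sigma2, prior


def discriminant(x, mu_k, sigma2, pi_k):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x * mu_k / sigma2 - mu_k ** 2 / (2 * sigma2) + math.log(pi_k)


def predict(x, mu, sigma2, prior):
    """Assign x to the class with the largest discriminant value."""
    return max(mu, key=lambda k: discriminant(x, mu[k], sigma2, prior[k]))


# Hypothetical training set with two well-separated classes.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

mu, sigma2, prior = fit_lda_1d(xs, ys)
print(mu, sigma2, prior)
print(predict(4.0, mu, sigma2, prior), predict(5.0, mu, sigma2, prior))
```

With equal priors and a shared variance, the decision boundary sits halfway between the two class means, here at $x = 4.5$.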

### Case $p>1$

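The same construction applies with a multivariate Gaussian distribution: each class has its own mean vector $\mu_k$, and all classes share a common covariance matrix $\Sigma$. The discriminant function becomes
$$
\delta_k(x) = x^T\Sigma^{-1}\mu_k - \dfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log{\pi_k},
$$
which is again linear in $x$.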
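A minimal sketch for two predictors, assuming a covariance matrix pooled over the classes in place of $\sigma^2$ and a hand-written $2 \times 2$ inverse to stay within the standard library. The data, the class labels and all function names are hypothetical.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_covariance(groups, means):
    """Within-class scatter summed over all classes, divided by n - K."""
    n = sum(len(rows) for rows in groups.values())
    K = len(groups)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for k, rows in groups.items():
        for r in rows:
            d = [r[0] - means[k][0], r[1] - means[k][1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    return [[v / (n - K) for v in row] for row in s]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def discriminant(x, mu, sigma_inv, prior):
    # delta_k(x) = x^T S^-1 mu_k - 0.5 * mu_k^T S^-1 mu_k + log(pi_k)
    s_mu = [sigma_inv[0][0] * mu[0] + sigma_inv[0][1] * mu[1],
            sigma_inv[1][0] * mu[0] + sigma_inv[1][1] * mu[1]]
    return (x[0] * s_mu[0] + x[1] * s_mu[1]
            - 0.5 * (mu[0] * s_mu[0] + mu[1] * s_mu[1])
            + math.log(prior))


# Hypothetical training data: two classes, two predictors each.
groups = {
    "a": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "b": [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0]],
}
n = sum(len(rows) for rows in groups.values())
means = {k: mean_vector(rows) for k, rows in groups.items()}
priors = {k: len(rows) / n for k, rows in groups.items()}
sigma = pooled_covariance(groups, means)
sigma_inv = inverse_2x2(sigma)


def predict(x):
    return max(groups, key=lambda k: discriminant(x, means[k], sigma_inv, priors[k]))


print(predict([1.0, 1.0]), predict([3.0, 2.5]))
```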

### Remark

In a binary setting, the posterior-probability threshold used by LDA (0.5 by default) can be manually adjusted to favor precision over recall or vice versa.
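A minimal sketch of the effect of moving the threshold. The posterior probabilities below are made-up numbers standing in for the output of a fitted binary classifier, and the helper names `classify` and `precision_recall` are hypothetical.

```python
def classify(posteriors, threshold=0.5):
    """Predict class 1 when the posterior probability of class 1 reaches the threshold."""
    return [1 if q >= threshold else 0 for q in posteriors]


def precision_recall(y_true, y_pred):
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical true labels and posterior probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
posterior = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.65):
    precision, recall = precision_recall(y_true, classify(posterior, threshold))
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this data, lowering the threshold to 0.3 recovers every positive at the cost of more false positives, while raising it to 0.65 removes the false positives but misses one positive.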

## Quadratic Discriminant Analysis (QDA)

Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix, so that the $\delta$ function becomes quadratic in $x$.
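A minimal sketch of the quadratic rule, assuming a single predictor so that each class has its own variance $\sigma_k^2$ instead of a covariance matrix. The class parameters and the function names are hypothetical; the discriminant is the logarithm of $\pi_k f_k(x)$ with the constant shared by all classes dropped.

```python
import math


def qda_discriminant(x, mu_k, sigma_k, pi_k):
    # log(pi_k * f_k(x)) without the term -0.5 * log(2 * pi), common to all classes.
    return -((x - mu_k) ** 2) / (2 * sigma_k ** 2) - math.log(sigma_k) + math.log(pi_k)


# Hypothetical classes with the same mean but different spread: (mu, sigma, prior).
params = {0: (0.0, 1.0, 0.5), 1: (0.0, 3.0, 0.5)}


def predict(x):
    return max(params, key=lambda k: qda_discriminant(x, *params[k]))


for x in (-2.0, 0.0, 1.0, 2.0):
    print(x, predict(x))
```

Because the means and priors coincide, a rule that is linear in $x$ cannot tell these classes apart, whereas the quadratic rule assigns the central region to the low-variance class and both tails to the high-variance class.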

- LDA: less flexible, so higher bias but lower variance
- QDA: more flexible, so lower bias but higher variance

Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the $K$ classes is clearly untenable.