\section{Classification}
The response variable is qualitative!

Three widely used classifiers are logistic regression, linear discriminant analysis, and $K$-nearest neighbors.

\subsection{Why not linear regression?}
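Coding a qualitative response with more than two categories as numbers ($1, 2, 3, \dots$) implies an ordering on the outcomes and equal spacing between them, putting some categories above others even when no such ordering exists.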

\subsection{Logistic regression}
Logistic regression models the probability that $Y$ belongs to a particular category.

\subsubsection{The logistic model}
How should we model the relationship between $p(X) = Pr(Y = 1 | X)$ and $X$?

In logistic regression, we use the logistic function:
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\]
The model is fitted by maximum likelihood.

Rearranging the logistic function, we find that
\[
\frac{p(X)}{1-p(X)}=e^{\beta_0 + \beta_1 X} 
\]
The quantity $p(X)/(1-p(X))$ is called the odds and can take any value between $0$ and $\infty$.

By taking the logarithm of both sides we arrive at
\[
\log \Bigg(\frac{p(X)}{1-p(X)} \Bigg) = \beta_0 + \beta_1 X
\]
The left-hand side is called the log-odds or logit, so the logistic regression model has a logit that is linear in $X$.
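
The step from the logistic function to the odds can be spelled out as follows (a short derivation using only the definitions above). Since
\[
1 - p(X) = \frac{1}{1 + e^{\beta_0 + \beta_1 X}},
\]
dividing $p(X)$ by $1 - p(X)$ cancels the common denominator and leaves $e^{\beta_0 + \beta_1 X}$. As a consequence, increasing $X$ by one unit multiplies the odds by $e^{\beta_1}$:
\[
\frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}
\]
For illustration, a probability of $p(X) = 0.2$ corresponds to odds of $0.2/0.8 = 0.25$, and $p(X) = 0.9$ corresponds to odds of $0.9/0.1 = 9$.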

\subsubsection{Estimating the regression coefficients}
Likelihood function:
\[
l(\beta_0, \beta_1) = \prod_{i:y_i = 1}p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))
\]
The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function.
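
In practice it is usually easier to work with the logarithm of the likelihood, which has the same maximizer. A sketch, where $n$ denotes the number of training observations and we use $\log p(x_i) - \log(1 - p(x_i)) = \beta_0 + \beta_1 x_i$ together with $\log(1 - p(x_i)) = -\log(1 + e^{\beta_0 + \beta_1 x_i})$:
\[
\log l(\beta_0, \beta_1) = \sum_{i=1}^n \Big( y_i (\beta_0 + \beta_1 x_i) - \log \big(1 + e^{\beta_0 + \beta_1 x_i}\big) \Big)
\]
Setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero gives
\[
\sum_{i=1}^n \big(y_i - p(x_i)\big) = 0, \qquad \sum_{i=1}^n x_i \big(y_i - p(x_i)\big) = 0
\]
These equations have no closed-form solution in general, so the estimates are found numerically, typically with an iterative method such as Newton's method.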

\subsubsection{Making predictions}
Once the coefficients have been estimated, plug them into the logistic function:
\[
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1X}}{1 + e^{\hat{\beta}_0+\hat{\beta}_1X}}
\]
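
\begin{example}
A small numerical illustration with made-up estimates (the values $\hat{\beta}_0 = -2$ and $\hat{\beta}_1 = 0.5$ are hypothetical and not fitted to any data set). For $X = 6$ the linear predictor is $-2 + 0.5 \cdot 6 = 1$, so
\[
\hat{p}(6) = \frac{e^{1}}{1 + e^{1}} \approx 0.731
\]
For $X = 4$ the linear predictor is $0$ and $\hat{p}(4) = 0.5$, while for $X = 0$ we get
\[
\hat{p}(0) = \frac{e^{-2}}{1 + e^{-2}} \approx 0.119
\]
\end{example}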

\subsubsection{Multiple logistic regression}
To predict a binary response using multiple predictors $X = (X_1, \dots, X_p)$, we generalize the log-odds to
\[
\log \Bigg(\frac{p(X)}{1 - p(X)} \Bigg) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p 
\]
which can be rewritten as
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}
\]

\subsubsection{Logistic regression for $>2$ response classes}
Logistic regression can be extended to more than two response classes, but linear discriminant analysis is the more popular choice for the multiple-class setting.

\subsection{Linear discriminant analysis}
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

\subsubsection{Using Bayes' theorem for classification}
Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class. Let $f_k(x) = Pr(X = x | Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. Then Bayes' theorem states that
\[
Pr(Y = k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}
\]

We know that the Bayes classifier, which assigns an observation to the class with the largest posterior probability $Pr(Y = k|X=x)$, has the lowest possible error rate out of all classifiers. If we can find a way to estimate $f_k(x)$, then we can develop a classifier that approximates the Bayes classifier.
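
\begin{example}
A numerical illustration of Bayes' theorem with hypothetical values (chosen only to make the arithmetic easy). Let $K = 2$, $\pi_1 = 0.3$, $\pi_2 = 0.7$, and suppose that at some point $x$ the densities are $f_1(x) = 0.5$ and $f_2(x) = 0.1$. Then
\[
Pr(Y = 1|X = x) = \frac{0.3 \cdot 0.5}{0.3 \cdot 0.5 + 0.7 \cdot 0.1} = \frac{0.15}{0.22} \approx 0.682
\]
and $Pr(Y = 2|X = x) \approx 0.318$. The Bayes classifier assigns this observation to class 1, even though class 1 has the smaller prior probability.
\end{example}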

\subsubsection{Linear discriminant analysis for $p=1$}
Suppose we assume that $f_k(x)$ is normal or Gaussian. In the one-dimensional setting, the normal density takes the form
\[
f_k(x) = \frac{1}{\sqrt{2 \pi} \sigma_k} \exp \Bigg(- \frac{1}{2 \sigma_k^2} (x - \mu_k)^2 \Bigg)
\]

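where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class. Assume further that all $K$ classes share a common variance, $\sigma_1^2 = \dots = \sigma_K^2 = \sigma^2$. Plugging $f_k$ into Bayes' theorem and writing $p_k(x) = Pr(Y = k|X = x)$, we find that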
\[
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi}\sigma}\exp \Bigg(- \frac{1}{2\sigma^2}(x - \mu_k)^2 \Bigg)}{\sum_{l=1}^K\pi_l \frac{1}{\sqrt{2 \pi}\sigma}\exp \Bigg(- \frac{1}{2\sigma^2}(x - \mu_l)^2 \Bigg)}
\]

The Bayes classifier assigns an observation $X = x$ to the class for which $p_k(x)$ is largest. Taking the log and rearranging the terms, we can show that this is equivalent to assigning the observation to the class for which 
\[
\delta_k (x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
\]
is the largest. 
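
The intermediate step can be sketched as follows. The denominator of $p_k(x)$ is the same for every class, so maximizing $p_k(x)$ over $k$ is the same as maximizing the logarithm of its numerator:
\[
\log \Bigg( \pi_k \frac{1}{\sqrt{2 \pi}\sigma}\exp \Bigg(- \frac{1}{2\sigma^2}(x - \mu_k)^2 \Bigg) \Bigg) = \log(\pi_k) - \log(\sqrt{2\pi}\sigma) - \frac{x^2}{2\sigma^2} + x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}
\]
The terms $-\log(\sqrt{2\pi}\sigma)$ and $-x^2/(2\sigma^2)$ do not depend on $k$ and can be dropped, which leaves exactly $\delta_k(x)$.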

\begin{example}
If $K=2$ and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$ and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where
\[
x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}
\]
\end{example}

The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging in estimates for $\pi_k$, $\mu_k$ and $\sigma^2$. In particular, the following estimates are used:
\[
\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i = k}x_i
\]
\[
\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^K \sum_{i:y_i = k}(x_i - \hat{\mu}_k)^2
\]
where $n$ is the total number of training observations, and $n_k$ is the number of training observations in the $k$th class. LDA estimates $\pi_k$ as 
\[
\hat{\pi}_k = n_k/n
\]
LDA assigns an observation $X = x$ to the class for which
\[
\hat{\delta}_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} -
\frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k)
\]
is largest. The word linear in the classifier's name stems from the fact that the discriminant functions $\hat{\delta}_k(x)$ are linear functions of $x$.
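
\begin{example}
A toy calculation, using a made-up training set that is not taken from any real data: class 1 has observations $x = 1, 2, 3$ and class 2 has observations $x = 5, 6, 7$, so $n = 6$, $K = 2$ and $n_1 = n_2 = 3$. The estimates are
\[
\hat{\mu}_1 = 2, \qquad \hat{\mu}_2 = 6, \qquad \hat{\pi}_1 = \hat{\pi}_2 = \frac{3}{6} = 0.5
\]
\[
\hat{\sigma}^2 = \frac{1}{6-2}\Big( (1 + 0 + 1) + (1 + 0 + 1) \Big) = 1
\]
which gives the discriminant functions
\[
\hat{\delta}_1(x) = 2x - 2 + \log(0.5), \qquad \hat{\delta}_2(x) = 6x - 18 + \log(0.5)
\]
We have $\hat{\delta}_1(x) > \hat{\delta}_2(x)$ exactly when $x < 4$, so the estimated decision boundary is $x = 4 = (\hat{\mu}_1 + \hat{\mu}_2)/2$. An observation at $x = 3.5$ would be assigned to class 1.
\end{example}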

The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean $\mu_k$ and a common variance $\sigma^2$, and plugging estimates of these parameters into the Bayes classifier.

\subsubsection{Linear discriminant analysis for $p > 1$}
We now extend the LDA classifier to the case of multiple predictors. We will assume that $X = (X_1, X_2, \dots, X_p)$ is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.  

The multivariate Gaussian density is defined as
\[
f(x) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp \bigg( -\frac{1}{2}(x-\mu)^T \mathbf{\Sigma}^{-1}(x - \mu) \bigg)
\]
Using this we can see that the Bayes classifier assigns an observation $X=x$ to the class for which 
\[
\delta_k(x) = x^T \mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + \log(\pi_k)
\]
is the largest. 
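
A sketch of where this expression comes from. Taking the log of $\pi_k f_k(x)$, with $f_k$ the multivariate Gaussian density with mean $\mu_k$ and common covariance $\mathbf{\Sigma}$, gives
\[
\log(\pi_k f_k(x)) = \log(\pi_k) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{\Sigma}| - \frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1}(x - \mu_k)
\]
Because $\mathbf{\Sigma}^{-1}$ is symmetric, the quadratic form expands as
\[
-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1}(x - \mu_k) = -\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x + x^T \mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k
\]
Dropping every term that does not depend on $k$, namely $-\frac{p}{2}\log(2\pi)$, $-\frac{1}{2}\log|\mathbf{\Sigma}|$ and $-\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x$, leaves $\delta_k(x)$.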

The Bayes decision boundaries are the set of values $x$ for which $\delta_k(x) = \delta_l(x)$, i.e.
\[
x^T\mathbf{\Sigma}^{-1}\mu_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k = x^T \mathbf{\Sigma}^{-1}\mu_l - \frac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l
\]
for $k \neq l$ (the $\log(\pi_k)$ terms have cancelled, which holds when the classes have equal prior probabilities, e.g. when all three classes in an example have the same number of training observations). 
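
Collecting the terms in $x$ shows that each such boundary is a hyperplane. A short rearrangement of the displayed equation, again using the symmetry of $\mathbf{\Sigma}^{-1}$:
\[
x^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l) = \frac{1}{2}\big(\mu_k^T \mathbf{\Sigma}^{-1}\mu_k - \mu_l^T \mathbf{\Sigma}^{-1}\mu_l\big) = \frac{1}{2}(\mu_k + \mu_l)^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l)
\]
so with equal priors the boundary passes through the midpoint $(\mu_k + \mu_l)/2$. With unequal priors, the condition $\delta_k(x) = \delta_l(x)$ keeps the prior terms, which shifts the hyperplane without changing its orientation:
\[
x^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l) = \frac{1}{2}(\mu_k + \mu_l)^T \mathbf{\Sigma}^{-1}(\mu_k - \mu_l) - \log \bigg( \frac{\pi_k}{\pi_l} \bigg)
\]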

Class-specific performance is also important in medicine and biology, where it is described in terms of sensitivity and specificity. In a credit-default setting, sensitivity is the percentage of true defaulters that are correctly identified, and specificity is the percentage of non-defaulters that are correctly identified. 

The Bayes classifier, and by extension LDA, uses a threshold of 50\% for the posterior probability of default in order to assign an observation to the default class. This threshold can be changed!

\begin{enumerate}
    \item False positive rate: Type I error, 1 $-$ specificity
    \item True positive rate: 1 $-$ Type II error, power, sensitivity
\end{enumerate}
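
\begin{example}
An illustration with hypothetical counts (invented for the arithmetic, not taken from any data set). Suppose a test set contains $100$ defaulters and $900$ non-defaulters, and a classifier predicts default for $40$ of the defaulters and for $18$ of the non-defaulters. Then
\[
\mathrm{sensitivity} = \frac{40}{100} = 40\%, \qquad \mathrm{specificity} = \frac{900 - 18}{900} = 98\%
\]
The false positive rate is $18/900 = 2\%$, and the overall error rate is $(60 + 18)/1000 = 7.8\%$. The overall error rate looks small even though $60\%$ of the defaulters are missed. Lowering the threshold on the posterior probability would typically raise the sensitivity at the cost of a lower specificity.
\end{example}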

\subsubsection{Quadratic discriminant analysis}
LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all $K$ classes. Quadratic discriminant analysis (QDA) assumes that each class has its own covariance matrix $\mathbf{\Sigma}_k$. Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which
\[
\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}_k^{-1}(x - \mu_k) - \frac{1}{2}\log |\mathbf{\Sigma}_k| + \log \pi_k
\]
\[
= -\frac{1}{2}x^T \mathbf{\Sigma}_k^{-1}x + x^T \mathbf{\Sigma}_k^{-1}\mu_k - 
\frac{1}{2}\mu_k^T \mathbf{\Sigma}_k^{-1}\mu_k - \frac{1}{2}\log|\mathbf{\Sigma}_k| + \log \pi_k
\]
is the largest. 
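
One way to see the relation to LDA: if all $\mathbf{\Sigma}_k$ are equal to a common $\mathbf{\Sigma}$, the terms $-\frac{1}{2}x^T \mathbf{\Sigma}^{-1}x$ and $-\frac{1}{2}\log|\mathbf{\Sigma}|$ are identical for every class and can be dropped, which recovers the linear discriminant function. With class-specific covariance matrices the term in $x^T \mathbf{\Sigma}_k^{-1}x$ remains, so $\delta_k(x)$ is a quadratic function of $x$, hence the name. As a sketch for the special case $p = 1$, where $\mathbf{\Sigma}_k = \sigma_k^2$:
\[
\delta_k(x) = -\frac{x^2}{2\sigma_k^2} + x\frac{\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} - \log(\sigma_k) + \log(\pi_k)
\]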

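Why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. LDA is a much less flexible classifier than QDA, and so has substantially lower variance. However, if the assumption of a common covariance matrix is badly off, LDA can suffer from high bias. LDA tends to be a better bet than QDA if there are relatively few training observations, so that reducing variance is crucial; QDA is preferable if the training set is very large or if a common covariance matrix is clearly untenable.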
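
The difference in flexibility can be made concrete by counting covariance parameters. A symmetric $p \times p$ covariance matrix has $p(p+1)/2$ free entries, so the two methods estimate
\[
\text{LDA: } \frac{p(p+1)}{2}, \qquad \text{QDA: } K\frac{p(p+1)}{2}
\]
covariance parameters. As an illustration with arbitrarily chosen sizes, $p = 50$ and $K = 3$ give $1275$ parameters for LDA and $3825$ for QDA.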

\subsection{A comparison of classification methods}
We have discussed the K-nearest neighbors method (KNN), logistic regression, LDA and QDA. 

Both logistic regression and LDA produce linear decision boundaries. KNN is expected to dominate LDA and logistic regression when the decision boundary is highly non-linear.
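
The similarity between logistic regression and LDA can be made explicit in the simplest case. A sketch for $K = 2$ and $p = 1$, using the one-dimensional discriminant function $\delta_k(x)$ from the LDA section and the fact that $\log(\pi_k f_k(x))$ equals $\delta_k(x)$ plus terms common to both classes:
\[
\log \Bigg(\frac{p_1(x)}{1 - p_1(x)} \Bigg) = \delta_1(x) - \delta_2(x) = \underbrace{\log \bigg( \frac{\pi_1}{\pi_2} \bigg) - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2}}_{c_0} + \underbrace{\frac{\mu_1 - \mu_2}{\sigma^2}}_{c_1} x
\]
Here $c_0$ and $c_1$ are labels introduced only for this comparison. The log-odds are linear in $x$ under both models; the difference lies in the fitting procedure. Logistic regression estimates $\beta_0$ and $\beta_1$ by maximum likelihood, whereas LDA computes $c_0$ and $c_1$ from the estimated means, variance and priors under the normality assumption.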


