# Theory

In this chapter, we are interested in predicting a qualitative response $G$, given a vector of inputs $x=(x_1,x_2,\ldots,x_p)$. The predictor $G(x)$ takes values in a discrete set $C$ of $K$ classes, labelled $1, 2, \ldots, K$, so that $G(x)\in C$. We would like to estimate the probability that an input belongs to each class. We can then assign each observation to the class with the largest probability, following the Bayes classifier, which achieves the lowest possible error rate when the true class probabilities are used.


## Linear regression

One could use linear regression to model the conditional probability 

$$
f_k(x)=Pr(G=k|X=x)=\beta_0+\beta^Tx,
$$

where $f_k(x)$ is a linear model and $\beta=(\beta_1,\ldots,\beta_p)$. One could then assign each observation to the class with the highest fitted value. For a binary response, linear regression gives the same discriminant direction as linear discriminant analysis (LDA), which assumes Gaussian class-conditional densities. The problem with this approach is that the output of the regression can be negative or larger than one, so it cannot be interpreted as a probability.
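The range problem is easy to reproduce. The following is a minimal sketch, assuming a single input and a made-up 0/1 coding of the response (the data and the helper name `fit_simple_linear_regression` are illustrative, not part of the chapter):

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares with one input: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: y = 1 if the observation belongs to class k, else 0.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

beta0, beta1 = fit_simple_linear_regression(xs, ys)

# Fitted "probabilities" at the two ends of the training range.
fitted_at_0 = beta0 + beta1 * 0.0
fitted_at_5 = beta0 + beta1 * 5.0
print(fitted_at_0, fitted_at_5)  # one value is below 0, the other above 1
```

Even inside the range of the training inputs, the fitted values fall outside $[0,1]$ for this toy data set.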

## Logistic regression

Instead of modelling the response directly, logistic regression models the probability that the input belongs to each category. For a binary problem, with classes $G\in\{1,2\}$, the posterior probabilities are

\begin{align}
p(x)&=Pr(G=1|X=x)=\frac{\exp{(\beta_0+\beta^Tx)}}{1+\exp{(\beta_0+\beta^Tx)}},\\
1-p(x)&=Pr(G=2|X=x)=\frac{1}{1+\exp{(\beta_0+\beta^Tx)}}.
\end{align}

Applying the monotone *logit* transformation, which is obtained by dividing the two expressions and taking the log, gives a model that is linear in $x$:

$$
\log\frac{Pr(G=1|X=x)}{Pr(G=2|X=x)}=\log\frac{p(x)}{1-p(x)}=\beta_0+\beta^Tx.
$$

For the case of one input, increasing $x$ by one unit changes the log-odds by $\beta_1$, or equivalently multiplies the odds by $\exp(\beta_1)$. The odds are $p(x)/(1-p(x))$: for instance, if $9$ out of $10$ observations default, then the probability of default is $p(x)=0.9$ and the odds are $0.9/0.1=9$.
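The relation between probability, odds and $\beta_1$ can be checked numerically. This is a small sketch with hypothetical coefficient values (`beta0 = -1.0` and `beta1 = 0.5` are assumptions chosen only for illustration):

```python
import math


def logistic_probability(x, beta0, beta1):
    """p(x) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    t = beta0 + beta1 * x
    return math.exp(t) / (1.0 + math.exp(t))


def odds(p):
    """Odds p / (1 - p) associated with a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients, not estimated from any data.
beta0, beta1 = -1.0, 0.5
x = 2.0

odds_before = odds(logistic_probability(x, beta0, beta1))
odds_after = odds(logistic_probability(x + 1.0, beta0, beta1))

# Increasing x by one unit multiplies the odds by exp(beta1).
odds_ratio = odds_after / odds_before

# The default example from the text: p = 0.9 corresponds to odds of 9.
default_odds = odds(0.9)
```

Here `odds_ratio` agrees with `math.exp(beta1)` up to rounding, whatever the starting value of `x`.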

The decision boundary is the set of points for which the log-odds are zero, i.e. the hyperplane defined by $\{x\mid\beta_0+\beta^Tx=0\}$.

The logistic regression model is usually fit by maximum likelihood, using the conditional likelihood of $G$ given $x$, $Pr(G|x)$. The likelihood for $N$ observations $(x_i,y_i)$, where $y_i$ is the class of observation $i$, is given by

$$
l(\beta)=\prod_{i:y_i=1}p(x_i)\prod_{i:y_i=2}(1-p(x_i)).
$$

This expression can be maximized using Newton-type optimization methods. 
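As an illustration of a Newton-type fit, the sketch below maximizes the log of the likelihood for a single input. It assumes a 0/1 coding (`y = 1` for class 1, `y = 0` for class 2), a made-up non-separable data set, and a fixed number of iterations; the function name `fit_logistic_newton` is hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_newton(xs, ys, n_iter=25):
    """Newton's method for logistic regression with one input.

    Returns (beta0, beta1). ys must be coded 0/1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the negative Hessian, sum_i p_i (1 - p_i) [1, x_i]^T [1, x_i].
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * d = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
    return b0, b1


# Hypothetical, non-separable toy data (classes overlap around x = 2.5).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

beta0, beta1 = fit_logistic_newton(xs, ys)

# At the maximum, the gradient of the log-likelihood is (numerically) zero.
ps = [sigmoid(beta0 + beta1 * x) for x in xs]
grad0 = sum(y - p for y, p in zip(ys, ps))
grad1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
```

For perfectly separated classes the estimates diverge, so a practical implementation would add a convergence check and some safeguard such as regularization.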

Given the estimated parameters, we can compute the z-statistic as $\hat{\beta}_i/SE(\hat{\beta}_i)$, where a large $|z|$ is evidence against the null hypothesis $H_0:\beta_i=0$.

Nonlinear boundaries can be obtained with logistic regression by adding polynomial terms of the inputs as extra features.

## Discriminant analysis

Instead of modelling the posterior distribution directly, discriminant analysis first models the probability density of the inputs in each class and then applies Bayes' theorem to compute the posterior. One then classifies to the class with the highest posterior probability. Let $f_k(x)=Pr(X=x|G=k)$ be the class-conditional density and $\pi_k=Pr(G=k)$ the prior probability of class $k$, with $\sum_k\pi_k=1$. The posterior is then given by

$$
Pr(G=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K\pi_l f_l(x)}.
$$
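The formula can be evaluated directly once densities and priors are chosen. The sketch below assumes univariate Gaussian class densities with made-up priors, means and standard deviations; all numbers are illustrative.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, means, stds):
    """Pr(G=k | X=x) for every class k, via Bayes' theorem."""
    joint = [pi * gaussian_pdf(x, m, s) for pi, m, s in zip(priors, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical three-class problem.
priors = [0.5, 0.3, 0.2]
means = [0.0, 2.0, 4.0]
stds = [1.0, 1.0, 1.0]

post = posteriors(1.5, priors, means, stds)

# Classes are labelled 1, ..., K, so shift the zero-based index by one.
predicted_class = max(range(len(post)), key=post.__getitem__) + 1
```

At `x = 1.5` the second class has the largest posterior even though the first class has the largest prior, because the density of the second class is much higher there.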

### Linear discriminant analysis

Linear discriminant analysis (LDA) models each class as a multivariate normal distribution, with all classes sharing the same covariance, i.e. $X|G=k\sim\mathcal{N}(\mu_k,\Sigma_k)$ with $\Sigma_k=\Sigma \quad\forall k$, and 

\begin{align}
f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp{(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k))}.
\end{align}

Comparing two classes:

\begin{align}
\log\frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}&=\log\frac{\pi_k}{\pi_l}+\log\frac{f_k(x)}{f_l(x)}\\
&=\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l),
\end{align}

which is linear in $x$, because the quadratic terms cancel thanks to the shared covariance. The decision boundaries $Pr(G=k|X=x)=Pr(G=l|X=x)$ are therefore hyperplanes in $\mathbb{R}^p$. When the prior probabilities are equal, $\pi_k=\pi_l$, the decision boundary between two classes passes through the midpoint of their means, $(\mu_k+\mu_l)/2$ (see figure).
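As a quick check of the midpoint statement, one can substitute $x=(\mu_k+\mu_l)/2$ into the log-ratio with $\pi_k=\pi_l$, so that $\log(\pi_k/\pi_l)=0$ (a worked step using only the symbols already defined):

$$
\log\frac{Pr(G=k|X=x)}{Pr(G=l|X=x)}\Big|_{x=\frac{\mu_k+\mu_l}{2}}=0-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)=0,
$$

so both classes have the same posterior probability at the midpoint, which therefore lies on the decision boundary.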


The linear discriminant functions are obtained from the previous equation

$$
\delta_k(x)=\log\pi_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+x^T\Sigma^{-1}\mu_k,
$$

so $G(x)=\arg\max_k\delta_k(x)$. The parameters of the Gaussian distribution are obtained as

$$
\hat{\pi}_k=\frac{N_k}{N},\quad
\hat{\mu}_k=\sum_{g_i=k}x_i/N_k,\quad \hat{\Sigma}=\sum_k\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K). 
$$

We can see that the estimated prior probability is the fraction of samples in class $k$, $N_k/N$, and that the covariance is based on the deviations from the class centroids and is normalized by the number of degrees of freedom, $N-K$.
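The estimates and the discriminant functions can be sketched for the univariate case $p=1$, where $\Sigma$ reduces to a scalar pooled variance. The data and the function names below are assumptions made for illustration only.

```python
import math


def fit_lda_1d(xs, gs):
    """Estimate priors, class means and the pooled variance for p = 1."""
    classes = sorted(set(gs))
    n = len(xs)
    priors, means = {}, {}
    pooled_ss = 0.0
    for k in classes:
        xk = [x for x, g in zip(xs, gs) if g == k]
        priors[k] = len(xk) / n              # N_k / N
        means[k] = sum(xk) / len(xk)         # class centroid
        pooled_ss += sum((x - means[k]) ** 2 for x in xk)
    var = pooled_ss / (n - len(classes))     # normalized by N - K
    return priors, means, var


def lda_discriminant(x, prior, mean, var):
    """delta_k(x) = log(pi_k) - mu_k^2 / (2 var) + x mu_k / var."""
    return math.log(prior) - 0.5 * mean * mean / var + x * mean / var


def lda_predict(x, priors, means, var):
    """Class with the largest linear discriminant function."""
    return max(priors, key=lambda k: lda_discriminant(x, priors[k], means[k], var))


# Hypothetical training data with two classes of equal size.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
gs = [1, 1, 1, 2, 2, 2]

priors, means, var = fit_lda_1d(xs, gs)
label_left = lda_predict(4.0, priors, means, var)
label_right = lda_predict(5.0, priors, means, var)
```

With equal priors the boundary sits at the midpoint of the two class means, here $4.5$, so `4.0` and `5.0` fall on opposite sides.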

### Quadratic discriminant analysis

Quadratic discriminant analysis (QDA) models a different covariance $\Sigma_k$ for each class. This leads to quadratic discriminant functions

$$
\delta_k(x)=\log\pi_k-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}\log|\Sigma_k|,
$$

which are obtained directly from the log of $\pi_kf_k(x)$ and in which the quadratic terms do not cancel, unlike in LDA. 
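A univariate sketch shows the effect of class-specific variances. It assumes $p=1$, so that $\Sigma_k$ is a scalar variance, and uses two made-up classes with the same mean but different spreads, a case that no linear boundary can separate.

```python
import math


def qda_discriminant(x, prior, mean, var):
    """delta_k(x) = log(pi_k) - (x - mu_k)^2 / (2 var_k) - log(var_k) / 2."""
    return math.log(prior) - 0.5 * (x - mean) ** 2 / var - 0.5 * math.log(var)


def qda_predict(x, params):
    """params maps a class label to (prior, mean, variance)."""
    return max(params, key=lambda k: qda_discriminant(x, *params[k]))


# Hypothetical parameters: same mean, different variances.
params = {
    1: (0.5, 0.0, 1.0),   # narrow class
    2: (0.5, 0.0, 9.0),   # wide class
}

labels = {x: qda_predict(x, params) for x in (-3.0, 0.0, 3.0)}
```

The narrow class wins near the common mean and the wide class wins on both sides, so the decision region of class 2 consists of two disjoint intervals.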

QDA estimates $Kp(p+1)/2$ covariance parameters, compared with $p(p+1)/2$ for LDA, which may lead to higher variance than LDA. If the assumption of class-specific covariances is correct or the separation boundary is slightly nonlinear, then QDA can lead to lower error rates than LDA by reducing the bias.
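To get a feeling for the difference in model size, the counts can be tabulated for a given problem. The values `p = 50` and `K = 3` below are arbitrary, and only covariance parameters are counted (means and priors are left out).

```python
def lda_covariance_params(p):
    """Free parameters of one symmetric p x p covariance matrix."""
    return p * (p + 1) // 2


def qda_covariance_params(p, n_classes):
    """One symmetric p x p covariance matrix per class."""
    return n_classes * p * (p + 1) // 2


# Hypothetical problem size.
p, n_classes = 50, 3
n_lda = lda_covariance_params(p)
n_qda = qda_covariance_params(p, n_classes)
```

For this size LDA needs 1275 covariance parameters and QDA needs 3825, which illustrates why QDA is mostly attractive when $p$ is small or $N$ is large.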

### Regularized discriminant analysis

Regularized discriminant analysis is a trade-off between LDA and QDA that shrinks the separate covariances of QDA towards a common covariance as

$$
\Sigma_k(\alpha)=\alpha\Sigma_k+(1-\alpha)\Sigma, 
$$

where $\alpha\in[0,1]$ is a regularization parameter, which can be selected using validation data or cross-validation. 
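The shrinkage is a plain convex combination of two matrices. The sketch below uses nested lists and made-up $2\times 2$ covariances; `regularized_covariance` is a hypothetical helper name.

```python
def regularized_covariance(sigma_k, sigma_pooled, alpha):
    """Return alpha * sigma_k + (1 - alpha) * sigma_pooled, entry by entry."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_k, row_pooled)]
        for row_k, row_pooled in zip(sigma_k, sigma_pooled)
    ]


# Hypothetical class covariance and pooled covariance.
sigma_k = [[4.0, 1.0], [1.0, 2.0]]
sigma_pooled = [[2.0, 0.0], [0.0, 2.0]]

halfway = regularized_covariance(sigma_k, sigma_pooled, 0.5)
as_qda = regularized_covariance(sigma_k, sigma_pooled, 1.0)
as_lda = regularized_covariance(sigma_k, sigma_pooled, 0.0)
```

Setting $\alpha=1$ returns the class covariance used by QDA and $\alpha=0$ returns the common covariance used by LDA; intermediate values interpolate between the two.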

### Naive Bayes

Naive Bayes assumes that the features are independent within each class, which for Gaussian class densities translates into diagonal covariance matrices. In the simplest case, $\Sigma_k=\sigma_k^2\mathbb{I}$, the discriminant function becomes, up to an additive constant that is the same for all classes,

$$
\delta_k(x)=\log(\pi_kf_k(x))=-\frac{1}{2}\left(\frac{\|x-\mu_k\|^2}{\sigma_k^2} + p\log\sigma_k^2\right)+\log\pi_k.
$$
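A Gaussian naive Bayes score can be sketched feature by feature. The version below assumes a diagonal covariance with one variance per feature and class, which is slightly more general than a single variance per class, and it drops the additive constant shared by all classes. All parameter values are made up.

```python
import math


def gaussian_nb_discriminant(x, prior, means, variances):
    """log(pi_k) minus half the sum over features of
    (x_j - mu_kj)^2 / var_kj + log(var_kj)."""
    score = math.log(prior)
    for xj, mj, vj in zip(x, means, variances):
        score -= 0.5 * ((xj - mj) ** 2 / vj + math.log(vj))
    return score


def gaussian_nb_predict(x, params):
    """params maps a class label to (prior, means, variances)."""
    return max(params, key=lambda k: gaussian_nb_discriminant(x, *params[k]))


# Hypothetical two-class problem with p = 2 features.
params = {
    1: (0.5, [0.0, 0.0], [1.0, 1.0]),
    2: (0.5, [3.0, 3.0], [1.0, 1.0]),
}

label_a = gaussian_nb_predict([0.5, 1.0], params)
label_b = gaussian_nb_predict([2.5, 2.0], params)
```

Because the score is a sum over features, the number of parameters grows linearly with $p$ instead of quadratically.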

Naive Bayes can be useful when $p$ is large, in which case other methods can break down. Fewer parameters need to be estimated, which may reduce the variance at the cost of a small increase in bias. 

## Comparison of methods

No single method dominates in all cases. For linear problems, LDA and logistic regression both work well: they have linear decision boundaries and generally give very similar results. LDA assumes multivariate Gaussian class densities with a common covariance, so it can give better results when that assumption holds. Logistic regression is very popular for $K=2$, but it can be unstable for small $N$ or when the classes are well separated. LDA is more popular for $K>2$, and it is more stable for small $N$ or well-separated classes when the Gaussian assumption holds. 

K-NN can lead to better results than the other methods for highly nonlinear decision boundaries. QDA may lead to better results than K-NN for slightly nonlinear problems with small $N$, as QDA makes some assumptions about the decision boundary, whereas K-NN is a non-parametric method that makes no assumptions about the boundary. QDA is attractive for a low number of features. 

Naive Bayes can be an option when $p$ is very large. 

## Alternative methods

Alternative methods when data are not separable, such as SVM, K-NN, Decision Trees, Random Forest, Bagging and AdaBoost, will be described in separate chapters. 
 

# Examples

Links to examples in R and Python Scikit-Learn are given below. 