# Discriminant Analysis

In [logistic regression](https://github.com/johnnychiuchiu/Machine-Learning/blob/master/LogisticRegression/logisticRegression.pdf), we model the probability that the response y belongs to a particular category by fitting a linear model of x to the log-odds of y. In discriminant analysis, instead of modeling this probability directly, we describe how the predictors are distributed within each category and use that to calculate how likely each observation is to belong to each category. It is similar to a [Gaussian Mixture Model](https://github.com/johnnychiuchiu/Machine-Learning/blob/master/Clustering/GaussianMixtureModel/gmm.ipynb); the difference is that in a GMM the category labels are not observed, whereas in discriminant analysis they are. This makes the problem more straightforward, since we do not need EM steps to estimate the mean and variance of each group. Instead, we can estimate the mean and variance of each group directly from the labeled data and use them to calculate the probability that a given observation belongs to each group k.

Keep in mind that discriminant analysis does not necessarily use a Gaussian distribution to measure how likely it is that an observation belongs to each category. In the simplest form, we can measure this likelihood (*not to be confused with maximum likelihood; the likelihood here is not related to MLE*) by calculating the Euclidean distance between the observation and the mean of each group. Also, the predictors in discriminant analysis are required to be numerical, since we cannot calculate distances between categorical values.

In the following sections, we discuss different methods for classification using discriminant analysis.

### Linear Discriminant Analysis (LDA) 

##### Linear Discriminant Function from Mahalanobis Distance

As previously discussed, we can simply use the Euclidean distance (L2 norm) from a new observation to each group mean to decide which group the observation should be assigned to. However, this has three problems.
1. It combines variables with different units and scales of measurement, i.e. we are adding apples and oranges.
2. It does not take into account the different variances of the variables. For example, variables with large variances should be weighted less.
3. It does not take into account the correlations among variables. If two variables are highly correlated, they should not both be highly weighted.

We can address the first two problems by dividing each variable by its standard deviation before calculating the Euclidean distance. However, the third problem remains.
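The effect of scaling by the standard deviation can be sketched as follows. The data, the variable names (`height_cm`, `weight_g`) and the new observation are hypothetical and chosen only to put two predictors on very different scales.

```r
# Hypothetical group with two predictors measured on very different scales
set.seed(1)
X <- cbind(height_cm = rnorm(50, mean = 170, sd = 10),
           weight_g  = rnorm(50, mean = 70000, sd = 12000))
group_mean <- colMeans(X)
x_new <- c(height_cm = 180, weight_g = 71000)  # hypothetical new observation

# Plain Euclidean distance: dominated by the variable with the largest scale
d_euclid <- sqrt(sum((x_new - group_mean)^2))

# Standardized Euclidean distance: each difference is divided by that variable's SD
sds <- apply(X, 2, sd)
d_scaled <- sqrt(sum(((x_new - group_mean) / sds)^2))

print(c(euclidean = d_euclid, standardized = d_scaled))
```

The standardized version still treats the two variables as uncorrelated, which is exactly the third problem in the list.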

Therefore, we need a method that handles the correlation problem when calculating distance. The distance that addresses this problem is called the **Mahalanobis distance**, which is defined as follows according to Wikipedia:

The Mahalanobis distance of an observation $\displaystyle {\vec {x}}=(x_{1},x_{2},x_{3},\dots ,x_{N})^{T}$ from a set of observations with mean $\displaystyle {\vec {\mu }}=(\mu _{1},\mu _{2},\mu _{3},\dots ,\mu _{N})^{T}$ and covariance matrix S is defined as:

$$\displaystyle d_{M}^2({\vec {x}})={({\vec {x}}-{\vec {\mu }})^{T}S^{-1}({\vec {x}}-{\vec {\mu }})}$$


The Mahalanobis distance takes into account the variances and covariances of the variables. Therefore, when the covariance between two variables is high, both variables receive less weight in this distance.
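As a minimal sketch of the definition, the squared distance can be computed by hand and compared with `mahalanobis()` from base R's *stats* package, which also returns the squared distance. The simulated data and the new observation below are hypothetical.

```r
# Hypothetical group of observations with two correlated predictors
set.seed(42)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
X  <- cbind(x1, x2)

mu <- colMeans(X)   # group mean vector
S  <- cov(X)        # sample covariance matrix

x_new <- c(1, -1)   # hypothetical new observation

# Squared Mahalanobis distance written out from the definition
diff <- x_new - mu
d2_manual <- drop(t(diff) %*% solve(S) %*% diff)

# stats::mahalanobis() computes the same squared distance
d2_builtin <- mahalanobis(x_new, center = mu, cov = S)

print(c(manual = d2_manual, builtin = d2_builtin))
```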

If we denote $\vec{x}$ as $x$ and the mean vector of group $i$ as $\bar{x_i}$, the Linear Discriminant Function derived from the Mahalanobis distance is 

$$L_i = \bar{x_i}'S^{-1}x - \frac{1}{2}\bar{x_i}'S^{-1}\bar{x_i}$$

We classify a new observation into the group $i$ for which $L_i$ is largest.
If we have two predictors ($x_1, x_2$) and two response categories ($1, 2$), then the LDFs derived from the Mahalanobis distance have 6 coefficients in total (an intercept and two slopes for each group). For example, the LDFs can be
$$L_1 = \frac{7}{4}x_1+\frac{2}{3}x_2-\frac{65}{8}$$
$$L_2 = x_1+\frac{1}{3}x_2-\frac{5}{2}$$
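To make the rule concrete, the two example functions can be evaluated at a point and compared. This is only a sketch: the coefficients are the illustrative ones above, and the helper `classify_ldf` and the test points are hypothetical.

```r
# Example discriminant functions from the text (illustrative coefficients)
L1 <- function(x1, x2) 7/4 * x1 + 2/3 * x2 - 65/8
L2 <- function(x1, x2) x1 + 1/3 * x2 - 5/2

# Hypothetical helper: assign the group whose discriminant score is largest
classify_ldf <- function(x1, x2) {
  scores <- c(group1 = L1(x1, x2), group2 = L2(x1, x2))
  names(which.max(scores))
}

classify_ldf(8, 3)  # L1 = 7.875, L2 = 6.5, so group1
classify_ldf(4, 3)  # L1 = 0.875, L2 = 2.5, so group2
```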

To sum up, we first use the Mahalanobis distance, which takes variances and covariances into account, to measure the distance from the new observation to each group, and then we choose the group with the smaller distance; that is, if $d_1 < d_2$, we classify $x$ into group 1. The inequality $d_1 < d_2$ is equivalent to $L_1 > L_2$, so we can also use $L_i$ to get the result. $L_i$ is a linear function of $x$ because we assume the covariance matrix of each group is the same: when deriving $L_i$ from $d_1 < d_2$, the quadratic term $x'S^{-1}x$ appears on both sides and cancels.
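The cancellation can be written out explicitly. Assuming a common, symmetric covariance matrix $S$, expanding the squared distance to group $i$ gives

$$d_i^2 = (x-\bar{x_i})'S^{-1}(x-\bar{x_i}) = x'S^{-1}x - 2\bar{x_i}'S^{-1}x + \bar{x_i}'S^{-1}\bar{x_i} = x'S^{-1}x - 2L_i$$

The term $x'S^{-1}x$ does not depend on the group, so

$$d_1^2 < d_2^2 \iff -2L_1 < -2L_2 \iff L_1 > L_2$$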

##### Fisher's Linear Discriminant Function (LDF)

Since the classification rule for the two-group case depends only on the single difference $L_1-L_2$, it can be reformulated in terms of a single linear function of $x$ given by

$$\text{Fisher's LDF} = LD(x) = (\bar{x_1}-\bar{x_2})'S^{-1}x$$

It can be thought of as a projection: the function projects $x$ onto the straight line defined by this function. We decide whether a new observation $x$ should be classified into group 1 or 2 by comparing the distance from $LD(x)$ to $LD(\bar{x_1})$ with the distance from $LD(x)$ to $LD(\bar{x_2})$. If the former is smaller, the new observation is classified into group 1. Note that the distance between $LD(x)$ and $LD(\bar{x_i})$ is the Euclidean distance, because it is the distance between two points on this straight line.

For example, a Fisher's LDF can be
$$LD= \frac{2}{3}x_1+\frac{1}{9}x_2$$

We then apply the same function to the new observation and to the mean vectors $\bar{x_1}$ and $\bar{x_2}$ of the two groups. Comparing the distance between $LD(\text{new observation})$ and $LD(\bar{x_1})$ with the distance between $LD(\text{new observation})$ and $LD(\bar{x_2})$, we classify the new observation into the group whose distance is smaller.
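The projection rule can be sketched in base R on two classes of the built-in `iris` data. This is a from-scratch illustration under the assumption that $S$ is the pooled within-group covariance matrix; the names `LD` and `classify_fisher` are hypothetical, and the resulting coefficients are not expected to equal those printed by any particular library, only to define a comparable direction.

```r
# Two-class subset of iris, base R only
two <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
X1 <- as.matrix(two[two$Species == "setosa", 1:4])
X2 <- as.matrix(two[two$Species == "versicolor", 1:4])

xbar1 <- colMeans(X1)
xbar2 <- colMeans(X2)

# Pooled within-group covariance matrix S (assumed common to both groups)
n1 <- nrow(X1)
n2 <- nrow(X2)
S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

# Fisher's direction a = S^{-1} (xbar1 - xbar2), so that LD(x) = a'x
a <- solve(S, xbar1 - xbar2)
LD <- function(x) sum(a * x)

# Classify by whichever projected group mean is closer on the line
classify_fisher <- function(x) {
  d1 <- abs(LD(x) - LD(xbar1))
  d2 <- abs(LD(x) - LD(xbar2))
  if (d1 < d2) "setosa" else "versicolor"
}

classify_fisher(c(5, 3, 2, 0.5))
```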

##### Using Bayes' Theorem for Classification

From *Bayes' formula*, we know that 

$$Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{j=1}^{K}\pi_j f_j(x)}$$

where $\pi_k$ is the prior probability that the response belongs to class $k$, $f_k(x)$ is the density of $X$ within class $k$, and $K$ is the number of classes.
If we assume $f_k(x)$ is a multivariate normal density and that all $K$ groups have the same covariance matrix, then after some derivation we get:

$$Pr(Y=k|X=x) = \frac{\pi_k e^{L_k}}{\sum_{j=1}^{K}\pi_j e^{L_j}}$$

The `lda` function in the *MASS* library in R has a parameter called `prior`, which specifies the prior probability of each group in the response variable. If unspecified, the class proportions of the training set are used. If present, the probabilities should be specified in the order of the factor levels.

If the prior probabilities are equal, then the Bayes classification rule reduces to the same rule as minimum Mahalanobis distance or maximum discriminant score. 

We can also think of the LDF as a classifier that results from assuming that the observations from each class are drawn from a Gaussian distribution with a common covariance matrix, and plugging estimates of the parameters into Bayes' theorem in order to perform prediction.
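The role of the priors can be sketched by turning discriminant scores into posterior probabilities. The helper `posterior_from_scores` is hypothetical, and the scores are the illustrative $L_1$ and $L_2$ from the earlier example evaluated at a hypothetical point $(8, 3)$.

```r
# Posterior probabilities from linear discriminant scores L_k and priors pi_k
posterior_from_scores <- function(scores, prior) {
  # Subtract the largest score before exponentiating for numerical stability;
  # the shift cancels in the ratio
  w <- prior * exp(scores - max(scores))
  w / sum(w)
}

# Illustrative scores from the example LDFs at the hypothetical point (8, 3)
scores <- c(group1 = 7/4 * 8 + 2/3 * 3 - 65/8,
            group2 = 8 + 1/3 * 3 - 5/2)

post_equal   <- posterior_from_scores(scores, prior = c(0.5, 0.5))
post_unequal <- posterior_from_scores(scores, prior = c(0.1, 0.9))

print(post_equal)
print(post_unequal)
```

With equal priors the predicted group is the one with the larger score, which matches the maximum discriminant score rule; with a strong enough prior toward the other group the prediction can change.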

---

### • An LDA example in R

**Load Libraries**

In [26]:
library(dplyr)
library(MASS)
data(iris)
iris$Species = as.character(iris$Species)

**Make the dataframe into only two classes**

In [23]:
iris_two_class = iris %>% filter(Species %in% c('setosa','versicolor'))

**Fit a linear discriminant analysis model**

In [24]:
fit_two_class = lda(Species ~ . , data=iris_two_class, prior=c(0.5,0.5))

**Check the fitted result**

In [25]:
fit_two_class

Call:
lda(Species ~ ., data = iris_two_class, prior = c(0.5, 0.5))

Prior probabilities of groups:
    setosa versicolor 
       0.5        0.5 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326

Coefficients of linear discriminants:
                   LD1
Sepal.Length -0.300458
Sepal.Width  -1.773845
Petal.Length  2.142260
Petal.Width   3.035726

##### Predict new observation

• ***Calculate manually***

The *MASS* package calculates Fisher's LDF, and the coefficients shown above are the coefficients of Fisher's LDF. If we have a new observation (5, 3, 2, 0.5), we can classify it into one of the two groups by calculating

In [54]:
new_obs = data.frame(Sepal.Length=5, Sepal.Width=3, Petal.Length=2, Petal.Width=0.5)
LD_new_observation = sum(fit_two_class$scaling * new_obs)
print(LD_new_observation)

[1] -1.021443


In [55]:
LD_setosa = sum(fit_two_class$scaling*fit_two_class$means[1,])
LD_versicolor = sum(fit_two_class$scaling*fit_two_class$means[2,])
print(LD_setosa)
print(LD_versicolor)

[1] -3.706061
[1] 6.45433


In [52]:
distance_setosa = (LD_new_observation-LD_setosa)^2
distance_versicolor = (LD_new_observation-LD_versicolor)^2
print(distance_setosa)
print(distance_versicolor)

[1] 7.207177
[1] 55.88717


The distance from LD(x) to setosa is smaller than the distance to versicolor, so the new observation should be classified as setosa. We can get the same result using the library.

• ***Calculate using library***

In [53]:
predict(fit_two_class, newdata=new_obs)

$posterior
  setosa   versicolor
1      1 2.687037e-11

$x
        LD1
1 -2.395577


We can see that the posterior probability for *setosa* is much higher than for *versicolor*.
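The LD1 score returned by `predict` (-2.395577) differs from the manually computed score (-1.021443). As far as the MASS implementation goes, `predict` centers the observation at the prior-weighted average of the group means before projecting it, which shifts every score by the same constant and so does not change the classification. A minimal sketch of that relationship, assuming base R's `subset` and `droplevels` in place of the dplyr filter used above, with hypothetical variable names:

```r
library(MASS)

# Same two-class fit as above, built with base R only
iris_two_class <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
fit <- lda(Species ~ ., data = iris_two_class, prior = c(0.5, 0.5))

new_obs <- data.frame(Sepal.Length = 5, Sepal.Width = 3,
                      Petal.Length = 2, Petal.Width = 0.5)

# Prior-weighted average of the group means (the centering point)
center <- colSums(fit$prior * fit$means)

# Uncentered score, as in the manual calculation
raw_score <- sum(fit$scaling[, 1] * unlist(new_obs))

# Centered score: subtract the projection of the centering point
centered_score <- raw_score - sum(fit$scaling[, 1] * center)

# Score reported by predict()
lib_score <- predict(fit, newdata = new_obs)$x[1, "LD1"]

print(c(raw = raw_score, centered = centered_score, predict = unname(lib_score)))
```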

### Quadratic Discriminant Analysis (QDA) 

As we have discussed, LDA assumes that the covariance matrix is common to all $K$ classes. If this homoscedasticity assumption is dropped, the quadratic term no longer cancels when deriving the discriminant function. Suppose we have two classes, and denote the distances from a data point to group 1 and group 2 by $d_1$ and $d_2$; then the inequality $d_1^2 < d_2^2$ becomes $Q_1 > Q_2$, where 
$$Q_i = -\frac{1}{2}x'S_i^{-1}x+\bar{x_i}'S_i^{-1}x-\frac{1}{2}\bar{x_i}'S_i^{-1}\bar{x_i}$$

It is a quadratic function of $x$, and we can use this function to form the decision boundary.
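The score $Q_i$ can be sketched directly from the formula, using a separate covariance matrix for each group. The helper `quad_score` is hypothetical. Note that this distance-based score is only an illustration of the formula above: a Gaussian Bayes rule with unequal covariances also includes a log-determinant term for each $S_i$, so a library QDA fit may not agree with this rule in every case.

```r
# Quadratic discriminant score Q_i for one group, written out from the formula
quad_score <- function(x, xbar, S) {
  S_inv <- solve(S)
  drop(-0.5 * t(x) %*% S_inv %*% x +
         t(xbar) %*% S_inv %*% x -
         0.5 * t(xbar) %*% S_inv %*% xbar)
}

# Two-class subset of iris, split into one data frame of predictors per group
two <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
groups <- split(two[, 1:4], two$Species)

x_new <- c(5, 3, 2, 0.5)  # same new observation as in the LDA example

# One score per group, each using that group's own mean and covariance
Q <- sapply(groups, function(g) quad_score(x_new, colMeans(g), cov(g)))
print(Q)
names(which.max(Q))
```

Each $Q_i$ equals $-\frac{1}{2}d_i^2$, so choosing the largest $Q_i$ is the same as choosing the smallest group-specific Mahalanobis distance.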

In [58]:
fit_two_class_qda = qda(Species ~ . , data=iris_two_class, prior=c(0.5,0.5))

In [59]:
fit_two_class_qda

Call:
qda(Species ~ ., data = iris_two_class, prior = c(0.5, 0.5))

Prior probabilities of groups:
    setosa versicolor 
       0.5        0.5 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326

The output of `qda` contains the group means, but it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.

### A Comparison of Classification Methods

* When the classes are well separated, LDA or QDA tends to perform better, since the parameter estimates for logistic regression, which are obtained by MLE, are unstable in this case.
* LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

### Reference
* Predictive Analytics: Parametric Models for Regression and Classification by Ajit C. Tamhane and Edward C. Malthouse
* [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/)
* [Mahalanobis Distance Wikipedia](https://en.wikipedia.org/wiki/Mahalanobis_distance)