# Discriminant Analysis

### Table of Contents
1. [Introduction](#Introduction)
2. [Decision Rule Formulation](#Decision-Rule-Formulation)
3. [Linear Discriminant Analysis](#Linear-Discriminant-Analysis)
4. [Quadratic Discriminant Analysis](#Quadratic-Discriminant-Analysis)
5. [Comparison of LDA and QDA](#Comparison-of-LDA-and-QDA)
6. [Regularised Discriminant Analysis](#Regularised-Discriminant-Analysis)
7. [Evaluation of Discriminant Analysis](#Evaluation-of-Discriminant-Analysis)
8. [Conclusion](#Conclusion)
9. [References](#References)

## Introduction
Discriminant analysis is a family of techniques widely used in classification, with applications in medical research, finance and ecology. The methods in this family differ in the assumptions they make about the data and its parameters. We will focus on linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and regularised discriminant analysis (RDA).

## Decision Rule Formulation
First, we note that every discriminant classification method requires a decision rule:
$$\hat{G}^*(x) \in \arg \max_{k} \mathbb{P}(G = k | X = x)$$

Here, $\hat{G}^*(x)$ is the optimal class label assigned to the observed data $X=x$, and $k$ ranges over the possible class labels.

We will now define some notation and reformulate the decision rule for our analysis.

First, we would like to know the class posteriors $\mathbb{P}(G|X)$.

Let $f_k(x)$ be the class-conditional density of $X$ given class $G=k$.

Let $\pi_k$ be the prior probability of belonging to class $k$.

Using Bayes' theorem, we obtain:
$$\mathbb{P}(G = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_{j=1}^{K} f_j(x) \pi_j}$$
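As a small illustration of this formula, the sketch below computes the posteriors for a single point $x$ from given density values and priors. The function name `class_posteriors` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def class_posteriors(densities, priors):
    """Return P(G = k | X = x) for every class k via Bayes' theorem.

    densities[k] is f_k(x) and priors[k] is pi_k for the same point x.
    """
    joint = [f * p for f, p in zip(densities, priors)]
    evidence = sum(joint)  # denominator: sum_j f_j(x) * pi_j
    return [j / evidence for j in joint]


# Hypothetical values for a two-class problem at one point x.
densities = [0.30, 0.05]
priors = [0.4, 0.6]

posteriors = class_posteriors(densities, priors)

# The decision rule picks the class with the largest posterior.
predicted_class = max(range(len(posteriors)), key=posteriors.__getitem__)
print(posteriors, predicted_class)
```

Note that the class with the smaller prior is still chosen here, because its density at $x$ is much larger.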

For LDA and QDA, we make the assumption that the class-conditional densities are Gaussian.

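Hence, with $p$ the number of covariates, $\mu_k$ the mean and $\Sigma_k$ the covariance matrix of class $k$, we have:

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp \left( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)$$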
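The Gaussian density can be evaluated directly with NumPy. The following is a minimal sketch rather than a production implementation; the function name `gaussian_density` is our own, and the mean and covariance are passed in explicitly for whichever class is being evaluated.

```python
import numpy as np


def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density with mean mu and covariance sigma."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    dim = mu.shape[0]
    diff = x - mu
    # Solve sigma @ z = diff rather than forming the explicit inverse.
    mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2.0 * np.pi) ** (dim / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * mahalanobis_sq) / norm_const


# Sanity check: a standard bivariate normal evaluated at its mean
# has density 1 / (2 * pi).
value_at_mean = gaussian_density([0.0, 0.0], [0.0, 0.0], np.eye(2))
print(value_at_mean, 1.0 / (2.0 * np.pi))
```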


## Linear Discriminant Analysis
Next, for LDA we further assume that the covariance matrices for each class are equal.

$$\Sigma_k = \Sigma, \; \forall k$$

This shared-covariance assumption is essential to LDA and is what differentiates it from QDA.

To classify a data point, we compare two classes $k$ and $\ell$ using the log-odds ratio of their posterior probabilities. We will see that in LDA the log-odds is a linear expression in $x$. This is the reason for the name 'Linear Discriminant Analysis'.

We have that:
$$\mathbb{P}(G = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_{j=1}^{K} f_j(x) \pi_j}$$
Then, the log-odds is:
$$
\begin{align*}
\log \frac{\mathbb{P}(G = k \mid X = x)}{\mathbb{P}(G = \ell \mid X = x)} 
& = \log \frac{f_k(x)}{f_\ell(x)} + \log \frac{\pi_k}{\pi_\ell} \\
& = \log \frac{\pi_k}{\pi_\ell} - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) \\
& \quad + x^\top \Sigma^{-1} (\mu_k - \mu_\ell)
\end{align*}
$$

We can see that the shared covariance assumption allows the normalising constants and the terms that are quadratic in $x$ to cancel, giving a final expression which is linear in $x$. The decision boundary between two classes is the set of points where their posterior probabilities are equal, that is, where the log-odds is zero. Since the log-odds is linear in $x$, the decision boundary in LDA is a hyperplane.
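The cancellation can be made explicit by expanding the two quadratic forms. The following short derivation uses only the Gaussian density and the common covariance matrix $\Sigma$; the normalising constants are identical for both classes and so drop out of the ratio, as do the two $x^\top \Sigma^{-1} x$ terms:

$$
\begin{align*}
\log \frac{f_k(x)}{f_\ell(x)}
& = -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) + \frac{1}{2} (x - \mu_\ell)^\top \Sigma^{-1} (x - \mu_\ell) \\
& = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k - x^\top \Sigma^{-1} \mu_\ell + \frac{1}{2} \mu_\ell^\top \Sigma^{-1} \mu_\ell \\
& = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell)
\end{align*}
$$

The last step uses the symmetry of $\Sigma^{-1}$, which gives $\mu_k^\top \Sigma^{-1} \mu_\ell = \mu_\ell^\top \Sigma^{-1} \mu_k$.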

Furthermore, we define the discriminant function in LDA:
$$
\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
$$
with $\Sigma$ the common covariance matrix, $\mu_k$ the mean and $\pi_k$ the prior probability for class $k$.

We can see that the log-odds is the difference $\delta_k(x) - \delta_\ell(x)$.
To classify our data optimally, we must choose the class $k$ that maximises the posterior probability in the decision rule. Since the log-odds between any two classes equals the difference of their discriminant functions, this is equivalent to choosing the class with the largest discriminant function. Hence, we have reformulated the decision rule as maximising the discriminant function $\delta_k(x)$ over the classes $k$.
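This equivalence can be checked numerically. The sketch below, with hypothetical parameter values and our own function names, compares the log-odds computed from the Gaussian densities and priors with the difference of the two LDA discriminant functions.

```python
import numpy as np


def lda_discriminant(x, mu, sigma_inv, prior):
    """delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k)."""
    return x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(prior)


def log_joint(x, mu, sigma_inv, prior):
    """log(f_k(x) * pi_k), omitting the constant shared by all classes."""
    diff = x - mu
    return -0.5 * diff @ sigma_inv @ diff + np.log(prior)


# Hypothetical parameters for two classes with a common covariance matrix.
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.7, 0.3]
x = np.array([1.5, 0.2])

log_odds = (log_joint(x, mus[0], sigma_inv, priors[0])
            - log_joint(x, mus[1], sigma_inv, priors[1]))
delta_diff = (lda_discriminant(x, mus[0], sigma_inv, priors[0])
              - lda_discriminant(x, mus[1], sigma_inv, priors[1]))

# The two quantities agree up to floating-point error.
print(log_odds, delta_diff)
```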

Clearly, we have a number of parameters here. In practice, these values will be unknown and we need to estimate them from the training data to define our Gaussian densities. Hence, we estimate the prior probabilities, the class means and the common covariance matrix as follows:

$$
\hat{\pi}_k = \frac{N_k}{N} 
$$

$$
\hat{\mu}_k = \frac{\sum_{g_i = k} x_i}{N_k}
$$

$$
\hat{\Sigma} = \frac{\sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top}{N - K}
$$

The 'hat' notation denotes the parameter estimates, $N$ is the total number of observations in the training data, $N_k$ is the number of training observations which belong to the $k$-th class, $g_i$ is the class label of observation $x_i$ and $K$ is the number of different classes.

To classify datapoints, we label them with the class $k$ which maximises their discriminant function.
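The estimation and classification steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under the LDA assumptions, not a replacement for a library implementation; the function names `fit_lda` and `predict_lda` and the toy data are hypothetical.

```python
import numpy as np


def fit_lda(X, g):
    """Estimate priors, class means and the pooled covariance matrix."""
    X = np.asarray(X, dtype=float)
    g = np.asarray(g)
    classes = np.unique(g)
    N, p = X.shape
    K = len(classes)
    priors = np.array([np.mean(g == k) for k in classes])        # N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in classes])  # shape (K, p)
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centred = X[g == k] - mu
        pooled += centred.T @ centred
    pooled /= (N - K)
    return classes, priors, means, pooled


def predict_lda(X, classes, priors, means, pooled):
    """Label each row of X with the class maximising delta_k(x)."""
    X = np.asarray(X, dtype=float)
    sigma_inv_mu = np.linalg.solve(pooled, means.T)  # columns: S^-1 mu_k
    constant = -0.5 * np.sum(means.T * sigma_inv_mu, axis=0) + np.log(priors)
    scores = X @ sigma_inv_mu + constant             # shape (n, K)
    return classes[np.argmax(scores, axis=1)]


# Hypothetical toy training data: two small clusters in two dimensions.
X_train = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
g_train = [0, 0, 0, 1, 1, 1]

classes, priors, means, pooled = fit_lda(X_train, g_train)
train_pred = predict_lda(X_train, classes, priors, means, pooled)
new_pred = predict_lda([[2, 2], [3, 3]], classes, priors, means, pooled)
print(train_pred, new_pred)
```

Precomputing $\Sigma^{-1} \mu_k$ once for all classes means each new point only costs a matrix-vector product.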

## Quadratic Discriminant Analysis
Suppose now that we omit the assumption that the covariance matrices are equal. We can then show by a similar derivation that the discriminant function is now quadratic in $x$.

The discriminant function in QDA:

$$
\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k.
$$

with $\mu_k$ the mean, $\Sigma_k$ the covariance matrix and $\pi_k$ the prior probability for class $k$.

Now, the decision boundary is defined by a quadratic equation due to this change in discriminant function.

We must still estimate the parameters of the Gaussian densities, but we must now estimate the covariance matrix for each class separately.

We again aim to maximise the discriminant function in QDA to classify our datapoints.
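To illustrate the QDA discriminant function, the sketch below uses two hypothetical classes that share the same mean but have different covariance matrices. A linear boundary cannot separate such classes, whereas the quadratic discriminant produces a circular boundary. The parameter values and the helper `classify` are assumptions made for this example.

```python
import numpy as np


def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x - mu_k)' S_k^-1 (x - mu_k) + log(pi_k)."""
    diff = x - mu
    _, log_det = np.linalg.slogdet(sigma)
    mahalanobis_sq = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * log_det - 0.5 * mahalanobis_sq + np.log(prior)


# Hypothetical parameters (mean, covariance, prior): same mean, different spread.
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),        # class 0: tight cluster
    (np.array([0.0, 0.0]), 4.0 * np.eye(2), 0.5),  # class 1: wide cluster
]


def classify(x):
    """Return the index of the class with the largest discriminant."""
    x = np.asarray(x, dtype=float)
    scores = [qda_discriminant(x, mu, sigma, prior) for mu, sigma, prior in params]
    return int(np.argmax(scores))


# Points near the shared mean go to the tight class, points far away
# (in any direction) go to the wide class.
print(classify([0.0, 0.0]), classify([3.0, 0.0]), classify([-3.0, 0.0]))
```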

## Comparison of LDA and QDA
One drawback of QDA is that the number of parameters, and hence the computational cost, grows quickly with the number of covariates $p$. If there are $K$ classes, then we only need the differences between the discriminant function of a chosen reference class (say class $K$) and those of the other $K-1$ classes. In LDA each difference requires $p+1$ parameters, hence LDA requires $(K-1)(p+1)$ parameters. In QDA each difference requires $p(p+3)/2+1$ parameters, hence QDA requires $(K-1)(p(p+3)/2+1)$ parameters.
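A quick calculation shows how differently the two counts grow. The function names below are our own, and the values of $K$ and $p$ are arbitrary examples.

```python
def lda_parameter_count(K, p):
    """(K - 1) differences, each with p + 1 parameters."""
    return (K - 1) * (p + 1)


def qda_parameter_count(K, p):
    """(K - 1) differences, each with p(p + 3)/2 + 1 parameters."""
    # p * (p + 3) is always even, so integer division is exact.
    return (K - 1) * (p * (p + 3) // 2 + 1)


for K, p in [(2, 10), (3, 100)]:
    print(K, p, lda_parameter_count(K, p), qda_parameter_count(K, p))
```

For $K=2$ and $p=10$ this gives 11 parameters for LDA against 66 for QDA; for $K=3$ and $p=100$ it gives 202 against 10302.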

The common covariance assumption in LDA can harm model performance when the class covariances genuinely differ, whereas QDA does not make this assumption and so is not affected by this source of error. Furthermore, LDA assumes that the log-odds is linear in $x$, which limits its usefulness in situations where this relationship is more complex.

LDA is more useful when the data supports a linear decision boundary, and QDA is more useful when the data supports a quadratic decision boundary. Both kinds of boundary are simple, which is one reason why both techniques are widely used. This reflects the bias-variance tradeoff: we accept some bias from a simpler decision boundary in order to reduce the variance, since more complex decision boundaries typically have much higher variance.

## Regularised Discriminant Analysis

In 1989, Friedman developed a method which aimed to give a combination of the benefits of both LDA and QDA within one model. As we discussed above, LDA has a common covariance matrix for all classes and QDA has different covariance matrices for different classes. This technique - regularised discriminant analysis - allows the shrinkage of the distinct covariance matrices in QDA towards the common matrix in LDA.

We control the shrinkage with a parameter $\alpha \in [0, 1]$ as follows:
$$
\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}
$$
with $\hat{\Sigma}$ the pooled covariance matrix used in LDA and $\hat{\Sigma}_k$ the covariance matrix used in QDA for class $k$. Setting $\alpha = 0$ recovers LDA and setting $\alpha = 1$ recovers QDA.

Then, using these new covariance matrices we carry out discriminant analysis as we did before, to obtain decision boundaries and classify our data.
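The shrinkage step itself is a single weighted average. The sketch below shows it for one class, using hypothetical covariance estimates; the function name `regularised_covariance` is an assumption made for this example.

```python
import numpy as np


def regularised_covariance(sigma_k, sigma_pooled, alpha):
    """Shrink a class covariance towards the pooled covariance.

    alpha = 1 keeps the class covariance, alpha = 0 uses the pooled one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    sigma_k = np.asarray(sigma_k, dtype=float)
    sigma_pooled = np.asarray(sigma_pooled, dtype=float)
    return alpha * sigma_k + (1.0 - alpha) * sigma_pooled


# Hypothetical estimates for one class and for the pooled data.
sigma_k = np.array([[4.0, 0.0], [0.0, 1.0]])
sigma_pooled = np.array([[2.0, 0.5], [0.5, 2.0]])

shrunk = regularised_covariance(sigma_k, sigma_pooled, alpha=0.25)
print(shrunk)
```

In practice $\alpha$ would typically be chosen by validation performance rather than fixed in advance.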

## Evaluation of Discriminant Analysis
We now discuss the advantages and drawbacks of discriminant analysis methods in general.
### Advantages
- Can handle classification problems involving more than two classes directly, unlike methods such as binary logistic regression, which must be extended to do so.
- Helps to understand group differences by showing which variables are important to classify between groups.
- Produces interpretable models.
- Permits the use of prior information through prior probabilities.
- Handles datasets with large $p$ (the number of covariates) efficiently, particularly LDA, whose parameter count grows only linearly in $p$.

### Drawbacks
- Normality assumptions may be invalid and violating these can impact the overall performance of the model.
- Discriminant analysis is not very robust in the presence of outliers.
- Highly-correlated covariates will lead to unstable estimates and poor model performance.
- Overfitting is a possibility.

## Conclusion
We discussed the mathematical theory behind discriminant analysis techniques, including linear discriminant analysis, quadratic discriminant analysis and regularised discriminant analysis. Next, we evaluated the use of discriminant analysis as a technique in general. Now we briefly discuss the suitability of discriminant analysis for our classification problem, before implementing the discussed techniques with our income dataset.

We are aiming to classify datapoints from our income dataset as either '<=50K' or '>50K'. Reflecting upon the above discussion, discriminant analysis is suitable for our problem. However, its success will need to be evaluated, and we must do this for several discriminant classification methods to find the one which performs best. We will use LDA, QDA and RDA since we have discussed these.

LDA and QDA can be implemented using the scikit-learn machine learning library. RDA is not part of this library so we will create our own 'RegularisedDiscriminantAnalysis' class to classify our datapoints in a similar form to the scikit-learn LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis classes.
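As a rough sketch of what the scikit-learn usage looks like, the example below fits both classifiers to synthetic two-cluster data. The synthetic data is only a stand-in and is an assumption of this example; the income dataset and its preprocessing are not reproduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic stand-in data: two Gaussian clusters in two dimensions.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Accuracy on the training data, and class posteriors for a few points.
lda_accuracy = lda.score(X, y)
qda_accuracy = qda.score(X, y)
lda_probabilities = lda.predict_proba(X[:3])
print(lda_accuracy, qda_accuracy)
```

Training accuracy is shown only to keep the example short; a held-out test set would be needed for an honest comparison.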

We will evaluate our models using metrics which may include accuracy, confusion matrices, precision, recall and F1-score. To conclude, we will compare their overall performance with receiver operating characteristic (ROC) curves. These metrics and curves will also be implemented using scikit-learn. We will use matplotlib and seaborn to create engaging visualisations.
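To make clear what these metrics measure, here is a plain-Python sketch which computes them for a binary problem. It is independent of the scikit-learn functions; the function name `binary_metrics` and the labels below are hypothetical, with '>50K' treated as the positive class.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute the confusion matrix, accuracy, precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # Rows are true classes, columns are predicted classes,
        # with the negative class first.
        "confusion": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true and predicted labels for eight datapoints.
y_true = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"]
y_pred = [">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"]

metrics = binary_metrics(y_true, y_pred, positive=">50K")
print(metrics)
```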

## References
[1] Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. New York: Springer, 2009.

[2] Research Method Article: https://researchmethod.net/discriminant-analysis/

[3] scikit-learn Documentation - 1.2 Linear and Quadratic Discriminant Analysis: https://scikit-learn.org/stable/modules/lda_qda.html

[4] scikit-learn Documentation - 6.3 Preprocessing data: https://scikit-learn.org/stable/modules/preprocessing.html

[5] scikit-learn Documentation - 3.4 Metrics and scoring: https://scikit-learn.org/stable/modules/model_evaluation.html#

[6] Sebastian Raschka’s PCA vs LDA article with Python Examples: https://sebastianraschka.com/Articles/2014_python_lda.html#lda-via-scikit-learn

[7] Data Science Toolbox Lecture Notes on ROC Curves: https://dsbristol.github.io/dst/assets/slides/05.1-Classification.pdf